In [1]:
# Import the required libraries
import pandas as pd
import numpy as np

#### Case study

Data: HappyCustomerBank
Problem and data description:

About Company

Happy Customer Bank is a mid-sized private bank which deals in all kinds of loans. They have presence across all major cities in India and focus on lending products. They have a digital arm which sources customers from the internet.

Problem

Digital arms of banks today face challenges with lead conversion, they source leads through mediums like search, display, email campaigns and via affiliate partners. Here Happy Customer Bank faces same challenge of low conversion ratio. They have given a problem to identify the customers segments having higher conversion ratio for a specific loan product so that they can specifically target these customers, here they have provided a partial data set for salaried customers only from the last 3 months. They also capture basic details about customers like gender, DOB, existing EMI, employer Name, Loan Amount Required, Monthly Income, City, Interaction data and many others. Let’s look at the process at Happy Customer Bank.

In above process, customer applications can drop majorly at two stages, at login and approval/ rejection by bank. Here we need to identify the segment of customers having higher disbursal rate in next 30 days.

Data Set

We have train and test data set, train data set has both input and output variable(s). Need to predict probability of disbursal for test data set.

Input variables:

    ID - Unique ID (can not be used for predictions)
    Gender- Sex
    City - Current City
    Monthly_Income - Monthly Income in rupees
    DOB - Date of Birth
    Lead_Creation_Date - Lead Created on date
    Loan_Amount_Applied - Loan Amount Requested (INR)
    Loan_Tenure_Applied - Loan Tenure Requested (in years)
    Existing_EMI - EMI of Existing Loans (INR)
    Employer_Name - Employer Name
    Salary_Account- Salary account with Bank
    Mobile_Verified - Mobile Verified (Y/N)
    Var5- Continuous classified variable
    Var1- Categorical variable with multiple levels
    Loan_Amount_Submitted- Loan Amount Revised and Selected after seeing Eligibility
    Loan_Tenure_Submitted- Loan Tenure Revised and Selected after seeing Eligibility (Years)
    Interest_Rate- Interest Rate of Submitted Loan Amount
    Processing_Fee- Processing Fee of Submitted Loan Amount (INR)
    EMI_Loan_Submitted- EMI of Submitted Loan Amount (INR)
    Filled_Form- Filled Application form post quote
    Device_Type- Device from which application was made (Browser/ Mobile)
    Var2- Categorical Variable with multiple Levels
    Source- Categorical Variable with multiple Levels
    Var4- Categorical Variable with multiple Levels

Outcomes:

    LoggedIn- Application Logged (Variable for understanding the problem – cannot be used in prediction)
    Disbursed- Loan Disbursed (Target Variable)

Source:

https://discuss.analyticsvidhya.com/t/hackathon-3-x-predict-customer-worth-for-happy-customer-bank/3802

We care about two quality measures for the solution:

    AUC

    Profit, computed as follows:
        classifying an observation as 1 costs us 100 zł (that is, every observation to which our model assigns class 1),
        correctly predicting class 1 brings in 1000 zł of revenue.

Goal: achieve the highest possible profit.
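
The profit rule above can be written compactly. The symbols TP (observations labelled 1 that really are 1), FP (observations labelled 1 that are really 0) and p (the model's estimated probability of class 1) are notation introduced here for illustration; they do not come from the task description.

$$
\text{profit} = 1000\,\mathrm{TP} - 100\,(\mathrm{TP} + \mathrm{FP}) = 900\,\mathrm{TP} - 100\,\mathrm{FP}
$$

Assuming p is reasonably well calibrated, the expected profit from labelling a single observation as 1 is positive when

$$
1000\,p - 100 > 0 \iff p > 0.1
$$

so a useful decision threshold is likely to sit much closer to 0.1 than to the default 0.5.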


## Proposed work plan:
1. Loading the file.
2. Looking at the data:
    - checking what types of objects the columns hold
    - counting the unique values in each column
    - checking how many missing values there are and how they are represented
3. Choosing which column is y (Target Variable):
    - checking what type of object it is
    - checking its distribution
    - checking for the imbalanced-class problem
4. Choosing the columns I don't need and dropping them.
5. Deciding what to do with the missing data and how to fill it in.
6. One-hot encoding what remains, to get a DataFrame ready for modelling.
7. Importing all the functions needed for modelling.
8. Splitting the data into training and test sets.
9. Choosing classifiers and modelling (models are compared on earnings = profit - expense and/or AUC).
10. Hyperparameter optimization:
    - RandomizedSearchCV
    - GridSearchCV
11. Choosing the best model.
12. Modelling again with the class-imbalance problem addressed and with the City column regrouped.
13. Choosing the best model.

### 1. Loading the file
When loading the data, the arguments of pd.read_csv have to be chosen carefully. Here the important one is encoding: the file is encoded as latin1, which had to be checked in the file itself.
pd.read_csv reads a CSV file into a DataFrame.
Afterwards I look at the first 5 rows to see what the data frame looks like.
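
When the encoding of a file is not known up front, one option is to try a short list of candidates in order. The sketch below is only an illustration: the helper `read_csv_with_fallback`, the in-memory sample bytes and the candidate list are all assumptions, not part of the original notebook.

```python
import io
import pandas as pd

def read_csv_with_fallback(raw_bytes, encodings=('utf-8', 'latin1')):
    """Hypothetical helper: return (DataFrame, encoding) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError('none of the candidate encodings could decode the data')

# made-up sample: the accented letter is not valid UTF-8 once encoded as latin1
sample = 'City,Employer_Name\nDelhi,CAFÉ LTD\n'.encode('latin1')
df, used_encoding = read_csv_with_fallback(sample)
print(used_encoding)
print(df)
```

Note that latin1 can decode any byte sequence, so it only makes sense as the last candidate; a successful decode does not prove the guess was right.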

In [2]:
# raw strings keep the backslashes in the Windows paths literal
train_data = pd.read_csv(r'C:\Users\piotr\Data Science bootcamp ML\dane_3_4\Dataset\Train_nyOWmfK.csv', encoding='latin1')
test_data = pd.read_csv(r'C:\Users\piotr\Data Science bootcamp ML\dane_3_4\Dataset\Test_bCtAN1w.csv', encoding='latin1')
# train_data - data with the Target Variable, used for both training and testing the models
# test_data - data without the Target Variable, on which the final model would be checked
# look at the first 5 rows of the data
train_data.head()

Unnamed: 0,ID,Gender,City,Monthly_Income,DOB,Lead_Creation_Date,Loan_Amount_Applied,Loan_Tenure_Applied,Existing_EMI,Employer_Name,...,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,Filled_Form,Device_Type,Var2,Source,Var4,LoggedIn,Disbursed
0,ID000002C20,Female,Delhi,20000,23-May-78,15-May-15,300000.0,5.0,0.0,CYBOSOL,...,,,,N,Web-browser,G,S122,1,0,0
1,ID000004E40,Male,Mumbai,35000,07-Oct-85,04-May-15,200000.0,2.0,0.0,TATA CONSULTANCY SERVICES LTD (TCS),...,13.25,,6762.9,N,Web-browser,G,S122,3,0,0
2,ID000007H20,Male,Panchkula,22500,10-Oct-81,19-May-15,600000.0,4.0,0.0,ALCHEMIST HOSPITALS LTD,...,,,,N,Web-browser,B,S143,1,0,0
3,ID000008I30,Male,Saharsa,35000,30-Nov-87,09-May-15,1000000.0,5.0,0.0,BIHAR GOVERNMENT,...,,,,N,Web-browser,B,S143,3,0,0
4,ID000009J40,Male,Bengaluru,100000,17-Feb-84,20-May-15,500000.0,2.0,25000.0,GLOBAL EDGE SOFTWARE,...,,,,N,Web-browser,B,S134,3,1,0


In [3]:
test_data.head()

Unnamed: 0,ID,Gender,City,Monthly_Income,DOB,Lead_Creation_Date,Loan_Amount_Applied,Loan_Tenure_Applied,Existing_EMI,Employer_Name,...,Loan_Amount_Submitted,Loan_Tenure_Submitted,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,Filled_Form,Device_Type,Var2,Source,Var4
0,ID000026A10,Male,Dehradun,21500,03-Apr-87,05-May-15,100000.0,3.0,0.0,APTARA INC,...,100000.0,3.0,20.0,1000.0,2649.39,N,Web-browser,B,S122,3
1,ID000054C40,Male,Mumbai,42000,12-May-80,01-May-15,0.0,0.0,0.0,ATUL LTD,...,690000.0,5.0,24.0,13800.0,19849.9,Y,Mobile,C,S133,5
2,ID000066O10,Female,Jaipur,10000,19-Sep-89,01-May-15,300000.0,2.0,0.0,SHAREKHAN PVT LTD,...,,,,,,N,Web-browser,B,S133,1
3,ID000110G00,Female,Chennai,14650,15-Aug-91,01-May-15,0.0,0.0,0.0,MAERSK GLOBAL SERVICE CENTRES,...,,,,,,N,Mobile,C,S133,1
4,ID000113J30,Male,Chennai,23400,22-Jul-87,01-May-15,100000.0,1.0,5000.0,SCHAWK,...,100000.0,2.0,,,,N,Web-browser,B,S143,1


### Conclusions
- train_data contains the Disbursed column (Target Variable); it will be y.
- test_data does not contain Disbursed, so it is of no use here: there is no way to check whether predictions made on it are correct.
- Columns to drop first:
    - LoggedIn (we know this from the task description)
    - ID (the customer's ID at the bank)
- From the DOB and Lead_Creation_Date columns I will derive each customer's age.

### 2. Checking and describing the data
- train_data.describe(include='all'): builds a table describing every column (feature), so we can look at the count and the number of unique values in each column and see where data is missing (provided the gaps are stored as NaN or None).
- train_data.dtypes: shows the type of data held in each column (that is, what has to be converted so that every column ends up as float or integer).
- train_data.isnull().sum(): shows how many missing values (NaN) there are in each column.

A key part of preprocessing is understanding the data and the business behind it, because that determines (among other things) how missing values should be handled. In this dataset I will mostly replace missing values with 0 in numeric columns and with placeholder strings in text columns.
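
The three checks can also be combined into one summary table per column. This is a minimal sketch on a toy frame: the values are made up, only the column names are borrowed from the dataset, and the `summary` frame is a hypothetical name.

```python
import numpy as np
import pandas as pd

# toy frame with made-up values; column names borrowed from the dataset
toy = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', None, 'Delhi'],
    'Monthly_Income': [20000, 35000, 22500, 35000],
    'Interest_Rate': [np.nan, 13.25, np.nan, np.nan],
})

summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),   # storage type of each column
    'n_unique': toy.nunique(),         # NaN/None are not counted as a value
    'n_missing': toy.isnull().sum(),   # both NaN and None count as missing
})
print(summary)
```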

In [4]:
# describes dataset 
train_data.describe(include = 'all')

Unnamed: 0,ID,Gender,City,Monthly_Income,DOB,Lead_Creation_Date,Loan_Amount_Applied,Loan_Tenure_Applied,Existing_EMI,Employer_Name,...,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,Filled_Form,Device_Type,Var2,Source,Var4,LoggedIn,Disbursed
count,87020,87020,86017,87020.0,87020,87020,86949.0,86949.0,86949.0,86949.0,...,27726.0,27420.0,27726.0,87020,87020,87020,87020,87020.0,87020.0,87020.0
unique,87020,2,697,,11345,92,,,,43567.0,...,,,,2,2,7,30,,,
top,ID115326Q10,Male,Delhi,,11-Nov-80,03-Jul-15,,,,0.0,...,,,,N,Web-browser,B,S122,,,
freq,1,49848,12527,,306,2315,,,,4914.0,...,,,,67530,64316,37280,38567,,,
mean,,,,58849.97,,,230250.7,2.131399,3696.228,,...,19.197474,5131.150839,10999.528377,,,,,2.949805,0.02935,0.014629
std,,,,2177511.0,,,354206.8,2.014193,39810.21,,...,5.834213,4725.837644,7512.32305,,,,,1.69772,0.168785,0.120062
min,,,,0.0,,,0.0,0.0,0.0,,...,11.99,200.0,1176.41,,,,,0.0,0.0,0.0
25%,,,,16500.0,,,0.0,0.0,0.0,,...,15.25,2000.0,6491.6,,,,,1.0,0.0,0.0
50%,,,,25000.0,,,100000.0,2.0,0.0,,...,18.0,4000.0,9392.97,,,,,3.0,0.0,0.0
75%,,,,40000.0,,,300000.0,4.0,3500.0,,...,20.0,6250.0,12919.04,,,,,5.0,0.0,0.0


### Conclusions
The Employer_Name column has 43567 unique values among 86949 non-missing entries, so I am dropping it.
It could be grouped, for example by whether the employer is state-owned or private, but that would be a lot of work with doubtful benefit.
There are 697 cities (City). Grouping them into regions may make sense, so I will try it at the end of this work.

In [5]:
# dtypes attribute: the type of each column
train_data.dtypes

ID                        object
Gender                    object
City                      object
Monthly_Income             int64
DOB                       object
Lead_Creation_Date        object
Loan_Amount_Applied      float64
Loan_Tenure_Applied      float64
Existing_EMI             float64
Employer_Name             object
Salary_Account            object
Mobile_Verified           object
Var5                       int64
Var1                      object
Loan_Amount_Submitted    float64
Loan_Tenure_Submitted    float64
Interest_Rate            float64
Processing_Fee           float64
EMI_Loan_Submitted       float64
Filled_Form               object
Device_Type               object
Var2                      object
Source                    object
Var4                       int64
LoggedIn                   int64
Disbursed                  int64
dtype: object

### Conclusion
Many columns have the object dtype; I will need to convert them to integer or float.

In [6]:
# check which columns have missing values (NaN) and how many
train_data.isnull().sum()

ID                           0
Gender                       0
City                      1003
Monthly_Income               0
DOB                          0
Lead_Creation_Date           0
Loan_Amount_Applied         71
Loan_Tenure_Applied         71
Existing_EMI                71
Employer_Name               71
Salary_Account           11764
Mobile_Verified              0
Var5                         0
Var1                         0
Loan_Amount_Submitted    34613
Loan_Tenure_Submitted    34613
Interest_Rate            59294
Processing_Fee           59600
EMI_Loan_Submitted       59294
Filled_Form                  0
Device_Type                  0
Var2                         0
Source                       0
Var4                         0
LoggedIn                     0
Disbursed                    0
dtype: int64

In [7]:
# first 10 rows of the Salary_Account column
train_data['Salary_Account'].head(10)

0              HDFC Bank
1             ICICI Bank
2    State Bank of India
3    State Bank of India
4              HDFC Bank
5                   HSBC
6               Yes Bank
7                    NaN
8    State Bank of India
9             Kotak Bank
Name: Salary_Account, dtype: object

### Conclusions
In the Salary_Account column, which holds bank names, I will replace missing values with the string 'no bank'.
In the City column I will replace missing values with 'no city', and later regroup the column into 'no city', 'big city' and 'small city'.
In the remaining columns with missing data I will replace the missing values with 0 (except Employer_Name, which I will drop).
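
One possible way to regroup City into the three labels is by how often each city appears in the data. This is only a sketch: the notebook does not define what counts as a big city, so the frequency rule, the `min_count` cut-off and the helper name `regroup_city` are all assumptions.

```python
import numpy as np
import pandas as pd

def regroup_city(city, min_count):
    """Hypothetical helper: map each city to 'no city', 'big city' or 'small city'."""
    filled = city.fillna('no city')
    counts = filled.value_counts()
    # a city is treated as 'big' when it occurs at least min_count times (assumed rule)
    is_big = filled.map(counts) >= min_count
    labels = np.where(filled == 'no city', 'no city',
                      np.where(is_big, 'big city', 'small city'))
    return pd.Series(labels, index=city.index, name=city.name)

# made-up sample of the City column
city = pd.Series(['Delhi', 'Delhi', 'Delhi', 'Mumbai', 'Mumbai', 'Saharsa', None], name='City')
grouped = regroup_city(city, min_count=2)
print(grouped.value_counts())
```

On the full data, a population-based or region-based mapping might be a better fit than raw frequency; the frequency rule is simply the cheapest one to compute from the dataset itself.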

### 3. Choosing the Target Variable (y)
The task description states that Disbursed is the target variable, y.
I count the 0s and 1s in this column to find out whether the classes are imbalanced.

In [46]:
licznosc_1 = np.sum(train_data['Disbursed'] == 1)
licznosc_0 = np.sum(train_data['Disbursed'] == 0)
# share of class 1 in the data
procent_1 = (licznosc_1/(licznosc_0 + licznosc_1))
print(('Percent of ones in Disbursed(target variable): {:.2%} ').format(procent_1))
print(('Percent of zeros in Disbursed(target variable): {:.2%} ').format(1 -procent_1))

Percent of ones in Disbursed(target variable): 1.46% 
Percent of zeros in Disbursed(target variable): 98.54% 


In [9]:

y = train_data.Disbursed
X = train_data.drop(['Disbursed'], axis = 1)

In [10]:
X

Unnamed: 0,ID,Gender,City,Monthly_Income,DOB,Lead_Creation_Date,Loan_Amount_Applied,Loan_Tenure_Applied,Existing_EMI,Employer_Name,...,Loan_Tenure_Submitted,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,Filled_Form,Device_Type,Var2,Source,Var4,LoggedIn
0,ID000002C20,Female,Delhi,20000,23-May-78,15-May-15,300000.0,5.0,0.0,CYBOSOL,...,,,,,N,Web-browser,G,S122,1,0
1,ID000004E40,Male,Mumbai,35000,07-Oct-85,04-May-15,200000.0,2.0,0.0,TATA CONSULTANCY SERVICES LTD (TCS),...,2.0,13.25,,6762.90,N,Web-browser,G,S122,3,0
2,ID000007H20,Male,Panchkula,22500,10-Oct-81,19-May-15,600000.0,4.0,0.0,ALCHEMIST HOSPITALS LTD,...,4.0,,,,N,Web-browser,B,S143,1,0
3,ID000008I30,Male,Saharsa,35000,30-Nov-87,09-May-15,1000000.0,5.0,0.0,BIHAR GOVERNMENT,...,5.0,,,,N,Web-browser,B,S143,3,0
4,ID000009J40,Male,Bengaluru,100000,17-Feb-84,20-May-15,500000.0,2.0,25000.0,GLOBAL EDGE SOFTWARE,...,2.0,,,,N,Web-browser,B,S134,3,1
5,ID000010K00,Male,Bengaluru,45000,21-Apr-82,20-May-15,300000.0,5.0,15000.0,COGNIZANT TECHNOLOGY SOLUTIONS INDIA PVT LTD,...,5.0,13.99,1500.0,6978.92,N,Web-browser,B,S143,3,1
6,ID000011L10,Female,Sindhudurg,70000,23-Oct-87,01-May-15,6.0,5.0,0.0,CARNIVAL CRUISE LINE,...,,,,,N,Web-browser,B,S133,1,0
7,ID000012M20,Male,Bengaluru,20000,25-Jul-75,20-May-15,200000.0,5.0,2597.0,GOLDEN TULIP FLORITECH PVT. LTD,...,5.0,,,,N,Web-browser,B,S159,3,0
8,ID000013N30,Male,Kochi,75000,26-Jan-72,02-May-15,0.0,0.0,0.0,SIIS PVT LTD,...,5.0,14.85,26000.0,30824.65,Y,Mobile,C,S122,5,0
9,ID000014O40,Female,Mumbai,30000,12-Sep-89,03-May-15,300000.0,3.0,0.0,SOUNDCLOUD.COM,...,3.0,18.25,1500.0,10883.38,N,Web-browser,B,S133,1,0


### Conclusion
Class 1 makes up about 1.5% of the data and class 0 about 98.5%, so the classes are strongly imbalanced. I will address this problem later in the notebook.

### 4. Removing columns
I am removing the columns:
    - Employer_Name
    - LoggedIn (we know this from the task description)
    - ID (the customer's ID at the bank)
    - City (dropped in this first pass; regrouping it is left for the end of the work)
    - DOB and Lead_Creation_Date, once age has been extracted from them

In [11]:
X = X.drop(['Employer_Name', 'LoggedIn', 'ID', 'City'],axis = 1) 


Creating the age column: 115 (the leads are from 2015) minus the last two characters of DOB (the two-digit birth year). This assumes every customer was born in the 1900s.

In [12]:
X['age'] = [115-int(s[-2:]) for s in X.DOB]
X = X.drop(['DOB', 'Lead_Creation_Date'],axis = 1)
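
An alternative to the string slicing above is to parse both date columns and subtract the years. The sketch below is an illustration on made-up rows, not the notebook's method. It assumes the `DD-Mon-YY` format seen in the data, English month abbreviations, and that any birth date parsed as later than the lead date really belongs to the previous century.

```python
import pandas as pd

# made-up rows in the same DD-Mon-YY format as the dataset
toy = pd.DataFrame({
    'DOB': ['23-May-78', '07-Oct-85', '01-Jan-60'],
    'Lead_Creation_Date': ['15-May-15', '04-May-15', '20-May-15'],
})

dob = pd.to_datetime(toy['DOB'], format='%d-%b-%y')
lead = pd.to_datetime(toy['Lead_Creation_Date'], format='%d-%b-%y')

# two-digit years 00-68 are parsed as 20xx, so '60' becomes 2060;
# shift any birth date that lands after the lead date back by 100 years
dob = dob.where(dob <= lead, dob - pd.DateOffset(years=100))

# same definition as in the notebook: difference of calendar years
age = lead.dt.year - dob.dt.year
print(age.tolist())
```

For the first two rows this gives the same ages as the `115 - yy` rule, and it makes the century assumption explicit instead of implicit.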

In [13]:
X

Unnamed: 0,Gender,Monthly_Income,Loan_Amount_Applied,Loan_Tenure_Applied,Existing_EMI,Salary_Account,Mobile_Verified,Var5,Var1,Loan_Amount_Submitted,Loan_Tenure_Submitted,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,Filled_Form,Device_Type,Var2,Source,Var4,age
0,Female,20000,300000.0,5.0,0.0,HDFC Bank,N,0,HBXX,,,,,,N,Web-browser,G,S122,1,37
1,Male,35000,200000.0,2.0,0.0,ICICI Bank,Y,13,HBXA,200000.0,2.0,13.25,,6762.90,N,Web-browser,G,S122,3,30
2,Male,22500,600000.0,4.0,0.0,State Bank of India,Y,0,HBXX,450000.0,4.0,,,,N,Web-browser,B,S143,1,34
3,Male,35000,1000000.0,5.0,0.0,State Bank of India,Y,10,HBXX,920000.0,5.0,,,,N,Web-browser,B,S143,3,28
4,Male,100000,500000.0,2.0,25000.0,HDFC Bank,Y,17,HBXX,500000.0,2.0,,,,N,Web-browser,B,S134,3,31
5,Male,45000,300000.0,5.0,15000.0,HSBC,Y,17,HAXM,300000.0,5.0,13.99,1500.0,6978.92,N,Web-browser,B,S143,3,33
6,Female,70000,6.0,5.0,0.0,Yes Bank,N,0,HBXX,,,,,,N,Web-browser,B,S133,1,28
7,Male,20000,200000.0,5.0,2597.0,,Y,3,HBXX,200000.0,5.0,,,,N,Web-browser,B,S159,3,40
8,Male,75000,0.0,0.0,0.0,State Bank of India,Y,13,HAXB,1300000.0,5.0,14.85,26000.0,30824.65,Y,Mobile,C,S122,5,43
9,Female,30000,300000.0,3.0,0.0,Kotak Bank,Y,0,HBXC,300000.0,3.0,18.25,1500.0,10883.38,N,Web-browser,B,S133,1,26


### 5. Filling in missing data
In the following columns I replace missing values (NaN) with 0: Loan_Amount_Submitted, Loan_Tenure_Submitted, Interest_Rate, Processing_Fee, EMI_Loan_Submitted, Existing_EMI, Loan_Tenure_Applied, Loan_Amount_Applied.
In Salary_Account I replace missing values with the string 'no bank'. City was meant to get 'no city', but it was already dropped in step 4.
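
The same fills can be expressed in a single call by passing a column-to-value mapping to fillna. A minimal sketch on a toy frame with made-up values; the names `numeric_cols`, `fill_values` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# toy frame with made-up values; column names borrowed from the dataset
toy = pd.DataFrame({
    'Loan_Amount_Submitted': [np.nan, 200000.0],
    'Interest_Rate': [np.nan, 13.25],
    'Salary_Account': ['HDFC Bank', None],
})

numeric_cols = ['Loan_Amount_Submitted', 'Interest_Rate']
fill_values = {col: 0 for col in numeric_cols}   # numeric gaps become 0
fill_values['Salary_Account'] = 'no bank'        # text gap becomes a placeholder string

filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the mapping in one place makes it easier to see, and later change, which replacement each column gets.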

In [14]:
X['Loan_Amount_Submitted'] = X['Loan_Amount_Submitted'].fillna(0)
X['Loan_Tenure_Submitted'] = X['Loan_Tenure_Submitted'].fillna(0)
X['Interest_Rate'] = X['Interest_Rate'].fillna(0)
X['Processing_Fee'] = X['Processing_Fee'].fillna(0)
X['EMI_Loan_Submitted'] = X['EMI_Loan_Submitted'].fillna(0)
X['Loan_Tenure_Applied'] = X['Loan_Tenure_Applied'].fillna(0)
X['Loan_Amount_Applied'] = X['Loan_Amount_Applied'].fillna(0)
X['Existing_EMI'] = X['Existing_EMI'].fillna(0)

In [15]:
# check which columns still have missing values (NaN) and how many
X.isnull().sum()

Gender                       0
Monthly_Income               0
Loan_Amount_Applied          0
Loan_Tenure_Applied          0
Existing_EMI                 0
Salary_Account           11764
Mobile_Verified              0
Var5                         0
Var1                         0
Loan_Amount_Submitted        0
Loan_Tenure_Submitted        0
Interest_Rate                0
Processing_Fee               0
EMI_Loan_Submitted           0
Filled_Form                  0
Device_Type                  0
Var2                         0
Source                       0
Var4                         0
age                          0
dtype: int64

In [16]:
X['Salary_Account'] = X['Salary_Account'].fillna('no bank')


In [17]:
X.isnull().sum()

Gender                   0
Monthly_Income           0
Loan_Amount_Applied      0
Loan_Tenure_Applied      0
Existing_EMI             0
Salary_Account           0
Mobile_Verified          0
Var5                     0
Var1                     0
Loan_Amount_Submitted    0
Loan_Tenure_Submitted    0
Interest_Rate            0
Processing_Fee           0
EMI_Loan_Submitted       0
Filled_Form              0
Device_Type              0
Var2                     0
Source                   0
Var4                     0
age                      0
dtype: int64

### Conclusion
I have removed the columns I consider unnecessary, and the remaining columns no longer contain missing values.


In [18]:
import matplotlib.pyplot as plt
%matplotlib inline


In [19]:
X1 = X.copy()  # work on a copy so X itself stays untouched

### 6. One-hot encoding the object-type columns
To do this I use the pd.get_dummies function. With drop_first=True, the first level of each categorical column is dropped, so a column with k levels becomes k-1 indicator columns.
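
To make the effect of drop_first concrete, here is a small sketch on a toy frame (made-up values, column names borrowed from the dataset). Numeric columns pass through unchanged, and each two-level text column turns into a single indicator column.

```python
import pandas as pd

# toy frame with made-up values
toy = pd.DataFrame({
    'Monthly_Income': [20000, 35000, 22500],
    'Gender': ['Female', 'Male', 'Male'],
    'Device_Type': ['Web-browser', 'Mobile', 'Web-browser'],
})

encoded = pd.get_dummies(toy, drop_first=True)
# 'Gender_Female' and 'Device_Type_Mobile' are the dropped first levels
print(list(encoded.columns))
```

The dropped level becomes the implicit baseline: a row with every indicator of a feature equal to 0 belongs to that level.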

In [20]:
X1 = pd.get_dummies(X1, drop_first = True)

In [21]:
# data prepared for modelling
X1

Unnamed: 0,Monthly_Income,Loan_Amount_Applied,Loan_Tenure_Applied,Existing_EMI,Var5,Loan_Amount_Submitted,Loan_Tenure_Submitted,Interest_Rate,Processing_Fee,EMI_Loan_Submitted,...,Source_S153,Source_S154,Source_S155,Source_S156,Source_S157,Source_S158,Source_S159,Source_S160,Source_S161,Source_S162
0,20000,300000.0,5.0,0.0,0,0.0,0.0,0.00,0.0,0.00,...,0,0,0,0,0,0,0,0,0,0
1,35000,200000.0,2.0,0.0,13,200000.0,2.0,13.25,0.0,6762.90,...,0,0,0,0,0,0,0,0,0,0
2,22500,600000.0,4.0,0.0,0,450000.0,4.0,0.00,0.0,0.00,...,0,0,0,0,0,0,0,0,0,0
3,35000,1000000.0,5.0,0.0,10,920000.0,5.0,0.00,0.0,0.00,...,0,0,0,0,0,0,0,0,0,0
4,100000,500000.0,2.0,25000.0,17,500000.0,2.0,0.00,0.0,0.00,...,0,0,0,0,0,0,0,0,0,0
5,45000,300000.0,5.0,15000.0,17,300000.0,5.0,13.99,1500.0,6978.92,...,0,0,0,0,0,0,0,0,0,0
6,70000,6.0,5.0,0.0,0,0.0,0.0,0.00,0.0,0.00,...,0,0,0,0,0,0,0,0,0,0
7,20000,200000.0,5.0,2597.0,3,200000.0,5.0,0.00,0.0,0.00,...,0,0,0,0,0,0,1,0,0,0
8,75000,0.0,0.0,0.0,13,1300000.0,5.0,14.85,26000.0,30824.65,...,0,0,0,0,0,0,0,0,0,0
9,30000,300000.0,3.0,0.0,0,300000.0,3.0,18.25,1500.0,10883.38,...,0,0,0,0,0,0,0,0,0,0


### 7. Importing all the functions needed for modelling

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV)
from scipy.stats import randint as sp_randint
from sklearn.model_selection import StratifiedKFold
from imblearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from imblearn.under_sampling import NearMiss
from imblearn.combine import SMOTEENN
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

### 8. Splitting the data into training and test sets
The test set will have 20000 rows; random_state fixes the random seed so the split is reproducible.
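
With only about 1.5% of ones, a purely random split can leave the test set with noticeably more or fewer positives than the training set. One option, shown here as a sketch on synthetic data and not used in the notebook itself, is the stratify argument of train_test_split, which keeps the class proportions the same on both sides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic target with 2% ones (made-up data, not the bank dataset)
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=200, stratify=y_toy, random_state=23
)
print(y_tr.mean(), y_te.mean())  # share of ones in each part
```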

In [23]:

X1_train, X1_test, y_train, y_test = train_test_split(X1, y, test_size = 20000 ,random_state = 23)

### 9. Modelling (choosing the best models based on earnings = profit - expense and/or AUC)
I fit the models on the training data. Because the data is imbalanced, I compare models by earnings = profit - expense and by AUC rather than by accuracy. In the next steps I search for good hyperparameters and compare different models so that the earnings come out as high as possible.


#### Function that scores a model from its hard class predictions (no predicted probabilities needed)

In [84]:
# function that computes the earnings
def earnings_function(y_test, y_hat):
    expense = 100 * sum(y_hat)
    profit = 1000 * sum(np.array(y_hat ==1)  & np.array(y_test == 1))
    earnings = profit - expense
    return earnings

#### Function that scores a model from its predicted probabilities at a given threshold, used to find the best threshold for each model

In [85]:
# function for make_scorer, used to work out which threshold to adopt

def earnings_function_proba(y_test, y_hat, threshold):
    '''arguments: 
       y_test - true values of the target variable
       y_hat - predicted class probabilities, as returned by predict_proba (column 1 holds the probability of class 1)
       threshold - observations whose class-1 probability is greater than threshold are classified as 1
       output: earnings  
    '''

    
    expense = 100 * sum(np.array(y_hat[:,1] > threshold))
    profit = 1000 * sum(np.array(y_hat[:,1] > threshold)  & np.array(y_test == 1))
    earnings = profit - expense
    return earnings
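
To illustrate how a threshold could be picked with this kind of function, the sketch below sweeps a few candidate thresholds over made-up probabilities and keeps the one with the highest earnings. The helper `earnings_at_threshold` is a hypothetical, self-contained variant that takes the class-1 probabilities directly; the candidate thresholds and the data are assumptions.

```python
import numpy as np

def earnings_at_threshold(y_true, proba_1, threshold, cost=100, gain=1000):
    """Earnings when every observation with class-1 probability above threshold is labelled 1."""
    predicted_1 = proba_1 > threshold
    hits = predicted_1 & (y_true == 1)
    return gain * int(hits.sum()) - cost * int(predicted_1.sum())

# made-up true labels and class-1 probabilities
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0])
proba_1 = np.array([0.02, 0.05, 0.30, 0.12, 0.08, 0.01, 0.40, 0.03])

thresholds = [0.05, 0.10, 0.25, 0.35]
results = {t: earnings_at_threshold(y_true, proba_1, t) for t in thresholds}
best_threshold = max(results, key=results.get)
print(results)
print(best_threshold)
```

In practice the threshold should be chosen on training or validation data and only then applied to the test set, otherwise the reported earnings are optimistic.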

Function that builds a DataFrame with the earnings and the number of observations classified as 1 for each (model, threshold) pair

In [131]:
def df_results(X, y_test, models_fun, threshold):
    earnings = []
    num_of_ones = []
    
    for i in range(len(threshold)):
        y_hat = models_fun[i].predict_proba(X) 
        earnings.append(earnings_function_proba(y_test, y_hat, threshold[i]))
        num_of_ones.append(sum(y_hat[:,1] > threshold[i]))

    df = pd.DataFrame([earnings, num_of_ones])
    df = df.rename(
    columns = {i : threshold[i] for i in range(len(threshold))}
    )
        
    df = df.rename(index = {0: 'profit', 1:'number of 1 classified'})
    return df
        

### Modelling
I will compare classifiers with the earnings function in two ways:
1. Using the earnings scorer on hard predictions, with resampling and with the class_weight hyperparameter of the random forest classifier.
2. Thresholding the predicted probabilities, with or without resampling, but without class_weight.

First, though, I use RandomizedSearchCV to find promising hyperparameters.
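
Before running a full search, it can help to see how a custom earnings scorer behaves under cross-validation. The sketch below uses synthetic data from make_classification and a small forest; the function name `earnings`, the data and every hyperparameter value are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

def earnings(y_true, y_pred):
    """1000 per correctly predicted 1, minus 100 per observation predicted as 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    hits = np.sum((y_pred == 1) & (y_true == 1))
    return 1000 * int(hits) - 100 * int(np.sum(y_pred == 1))

# synthetic imbalanced data (about 5% ones), not the bank dataset
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.95, 0.05], random_state=0
)

scorer = make_scorer(earnings)  # higher is better, which is the default
clf = RandomForestClassifier(n_estimators=20, class_weight={0: 1, 1: 20}, random_state=0)
fold_scores = cross_val_score(clf, X_toy, y_toy, scoring=scorer, cv=StratifiedKFold(3))
print(fold_scores)
```

Because the scorer returns a total and not a per-row average, each fold score scales with the size of its validation fold. Scores from a 3-fold search on the training data are therefore not directly comparable with earnings computed on the 20000-row test set.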

In [None]:
# RandomizedSearchCV narrows down good hyperparameter values to pass on to GridSearchCV
zysk_scorer = make_scorer(earnings_function)

clf = RandomForestClassifier(n_jobs=-1)

param_dist = {'class_weight':[{0:1, 1:18}, {0:1, 1:20}, {0:1, 1:25}],
              'n_estimators': [int(x) for x in np.linspace(start = 50, stop = 200, num = 10)],
              "max_depth": [120,200,220],
              "min_samples_split": sp_randint(2, 10),
              "min_samples_leaf": sp_randint(1, 10),
              "bootstrap": [True],
              "criterion": ["entropy"]}

# randomized search; n_iter is the number of parameter settings sampled
rs = RandomizedSearchCV(estimator = clf, param_distributions=param_dist, n_iter = 10, scoring=zysk_scorer, cv=StratifiedKFold(3) ,verbose= 2)
rs.fit(X1_train, y_train)
y_hat = rs.predict(X1_test)
print(earnings_function(y_test, y_hat))
rs.best_params_

In [157]:
rs.best_params_

{'bootstrap': True,
 'class_weight': {0: 1, 1: 20},
 'criterion': 'entropy',
 'max_depth': 200,
 'min_samples_leaf': 8,
 'min_samples_split': 7,
 'n_estimators': 50}

In [158]:
# prepare a scorer that scores a model with earnings_function
earnings_scorer = make_scorer(earnings_function)
# pipelines: 1. random oversampling + ExtraTreesClassifier
#            2. SMOTEENN (SMOTE oversampling followed by Edited Nearest Neighbours cleaning) + RandomForestClassifier
#            3. RandomForestClassifier on its own

pipeline = [Pipeline([('oversampling', RandomOverSampler(random_state=42)), ('Extra', ExtraTreesClassifier(n_jobs=-1, criterion='entropy'))]), 
            Pipeline([('Smoteenn', SMOTEENN(random_state=42 )) , ('Forest', RandomForestClassifier(n_jobs=-1, criterion='entropy'))]),
            Pipeline([('Forest',  RandomForestClassifier(n_jobs=-1))])
            ]
# hyperparameter grids for GridSearchCV, one per pipeline
# the class_weight hyperparameter matters in imbalanced-class problems
param_grid = [{'Extra__n_estimators' : [50, 100], 'Extra__max_depth' : [ 150, 200], 'Extra__min_samples_split': [7] },
              {'Forest__n_estimators': [50,100], 'Forest__max_depth' : [ 150, 200], 'Forest__min_samples_leaf': [8]},
              {'Forest__class_weight': [{0:1, 1:20}, {0:1, 1:30},{0:1, 1:25}], 'Forest__criterion': ['entropy'],'Forest__min_samples_split':[7],  'Forest__n_estimators': [50,75], 'Forest__max_depth' : [200, 220], 'Forest__min_samples_leaf': [7,8]}]

best_model = []
best_score = []
for pipe, grid   in zip(pipeline, param_grid):
    # GridSearchCV finds the best hyperparameters from param_grid based on earnings_scorer,
    # StratifiedKFold cross-validates with the same proportion of zeros and ones from the Target Variable in each fold
    gs = GridSearchCV(estimator = pipe , param_grid= grid, scoring =earnings_scorer, verbose = 3, cv = StratifiedKFold(3) )
    # fit the grid search on the training data
    gs.fit(X1_train, y_train)
    # append the best model and its score
    best_model.append(gs.best_estimator_)
    best_score.append(gs.best_score_)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=50 
[CV]  Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=50, score=-483800, total=   3.2s
[CV] Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.5s remaining:    0.0s


[CV]  Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=50, score=-483900, total=   3.1s
[CV] Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.9s remaining:    0.0s


[CV]  Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=50, score=-568500, total=   3.0s
[CV] Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=100 
[CV]  Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=100, score=-488500, total=   5.7s
[CV] Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=100 
[CV]  Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=100, score=-493500, total=   5.4s
[CV] Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=100 
[CV]  Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=100, score=-473100, total=   6.0s
[CV] Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=150 
[CV]  Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=150, score=-567300, total=   8.2s
[CV] Extra__max_depth=None, Extra__min_samples_split=0.4, Extra__n_estimators=150 
[CV]  Extra__max_dept

[CV]  Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=50, score=-300, total=   9.6s
[CV] Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=50 
[CV]  Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=50, score=-200, total=   9.4s
[CV] Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=50 
[CV]  Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=50, score=-700, total=   9.4s
[CV] Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=100, score=-1400, total=  18.7s
[CV] Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=100, score=1300, total=  18.2s
[CV] Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=2, Extra__n_esti

[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 13.6min finished


Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=3300, total=  30.7s
[CV] Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   31.1s remaining:    0.0s


[CV]  Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-5000, total=  29.4s
[CV] Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.0min remaining:    0.0s


[CV]  Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-400, total=  29.8s
[CV] Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=100 
[CV]  Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=6100, total=  34.2s
[CV] Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=100 
[CV]  Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=-600, total=  34.7s
[CV] Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=100 
[CV]  Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=1200, total=  33.7s
[CV] Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=150 
[CV]  Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=150, score=6000, total=  38.5s
[CV] Forest__max_depth=None, Forest__min_samples_leaf=5, Forest__n_estimators=150 
[CV]  Forest__max_depth=None, Forest

[CV]  Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=50, score=-3800, total=  30.0s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=50, score=-12500, total=  30.0s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=50, score=-8800, total=  29.1s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=100, score=-5700, total=  34.3s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=100, score=-12400, total=  34.0s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=25, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Fore

[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 46.9min finished


Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50 
[CV]  Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50, score=3800, total=   1.7s
[CV] Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.1s remaining:    0.0s


[CV]  Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50, score=-100, total=   1.6s
[CV] Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.1s remaining:    0.0s


[CV]  Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50, score=8800, total=   1.5s
[CV] Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=75 
[CV]  Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=75, score=-1100, total=   2.4s
[CV] Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=75 
[CV]  Forest__class_weight={0: 1, 1: 20}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=75, score=-1700, total=   2.3s
[CV] Forest__class_weight={0: 1, 1: 20}, Forest__cri

[CV]  Forest__class_weight={0: 1, 1: 30}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50, score=-7300, total=   1.7s
[CV] Forest__class_weight={0: 1, 1: 30}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50 
[CV]  Forest__class_weight={0: 1, 1: 30}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50, score=-7300, total=   1.6s
[CV] Forest__class_weight={0: 1, 1: 30}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50 
[CV]  Forest__class_weight={0: 1, 1: 30}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50, score=-8100, total=   1.6s
[CV] Forest__class_weight={0: 1, 1: 30}, Forest__cr

[CV]  Forest__class_weight={0: 1, 1: 30}, Forest__criterion=entropy, Forest__max_depth=220, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=75, score=-13200, total=   2.3s
[CV] Forest__class_weight={0: 1, 1: 30}, Forest__criterion=entropy, Forest__max_depth=220, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=75 
[CV]  Forest__class_weight={0: 1, 1: 30}, Forest__criterion=entropy, Forest__max_depth=220, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=75, score=-2400, total=   2.3s
[CV] Forest__class_weight={0: 1, 1: 25}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50 
[CV]  Forest__class_weight={0: 1, 1: 25}, Forest__criterion=entropy, Forest__max_depth=200, Forest__min_samples_leaf=7, Forest__min_samples_split=7, Forest__n_estimators=50, score=4000, total=   1.6s
[CV] Forest__class_weight={0: 1, 1: 25}, Forest__cr

[CV]  Forest__class_weight={0: 1, 1: 25}, Forest__criterion=entropy, Forest__max_depth=220, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=5600, total=   1.9s
[CV] Forest__class_weight={0: 1, 1: 25}, Forest__criterion=entropy, Forest__max_depth=220, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=75 
[CV]  Forest__class_weight={0: 1, 1: 25}, Forest__criterion=entropy, Forest__max_depth=220, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=75, score=900, total=   2.5s
[CV] Forest__class_weight={0: 1, 1: 25}, Forest__criterion=entropy, Forest__max_depth=220, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=75 
[CV]  Forest__class_weight={0: 1, 1: 25}, Forest__criterion=entropy, Forest__max_depth=220, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=75, score=-5400, total=   2.3s
[CV] Forest__class_weight={0: 1, 1: 25}, Forest__crite

[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:  3.0min finished
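
The grid searches above rank candidates by `earnings_scorer`, which wraps the notebook's `earnings_function` defined earlier. As a rough illustration of how a profit-style metric of this kind behaves, here is a minimal pure-Python sketch. The name `profit_score` and the payoffs `gain_tp` and `cost_fp` are assumptions made for this example, not the notebook's actual definition.

```python
def profit_score(y_true, y_pred, gain_tp=1000, cost_fp=100):
    """Hypothetical profit metric.

    Each correctly targeted customer earns gain_tp and each wrongly
    targeted customer costs cost_fp. Both payoffs are illustrative
    assumptions, not the values used in the notebook.
    """
    profit = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            profit += gain_tp   # true positive: targeted and disbursed
        elif pred == 1 and truth == 0:
            profit -= cost_fp   # false positive: targeted, not disbursed
        # rows predicted as 0 neither earn nor cost anything in this sketch
    return profit


y_true = [0, 0, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1]
example_profit = profit_score(y_true, y_pred)  # one true positive, two false positives
print(example_profit)
```

A function with this `(y_true, y_pred)` signature can be wrapped with scikit-learn's `make_scorer` and passed as the `scoring` argument of `GridSearchCV`, so the search optimises profit rather than accuracy.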


In [161]:
# Best model according to the cross-validated score on the training data
best_model_from_gs = best_model[np.argmax(best_score)]

In [160]:
results = []
num_of_one = []
# Evaluate each tuned model on the held-out test set
for model in best_model:
    y_hat = model.predict(X1_test)
    num_of_one.append(sum(y_hat))
    results.append(earnings_function(y_test, y_hat))
tabela = pd.DataFrame([results, num_of_one])

tabela = tabela.rename(
    columns={0: 'extra with resampling', 1: 'Random Forest with smoteenn', 2: 'Random forest with class_weight'})

tabela = tabela.rename(index={0: 'profit', 1: 'number of 1 classified'})
tabela

                        extra with resampling  Random Forest with smoteenn  Random forest with class_weight
profit                                    200                          300                            14700
number of 1 classified                     38                           77                              183


In [163]:
y_hat = best_model_from_gs.predict(X1_test)
num_of_one = sum(y_hat)
sc = earnings_function(y_test, y_hat)
print(('With this model {} i get {} earning and {} rows classified as 1').format(best_model_from_gs, sc,num_of_one ))

With this model Pipeline(memory=None,
     steps=[('Forest', RandomForestClassifier(bootstrap=True, class_weight={0: 1, 1: 20},
            criterion='entropy', max_depth=220, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=8,
            min_samples_split=7, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))]) i get 14700 earning and 183 rows classified as 1


### Conclusion
Using the class_weight hyperparameter in the Random Forest Classifier gives a profit of 14700 on the test set, which is better than using SMOTEENN (300) or oversampling (200). Those two methods appear to overfit.
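
To see why `class_weight` changes which rows get classified as 1, consider a single tree leaf. In scikit-learn's tree ensembles, class weights act as per-sample weights, so a leaf's class probabilities come from weighted counts rather than raw counts. The sketch below works through that arithmetic in plain Python; the leaf counts are made up for illustration and `weighted_leaf_proba` is a hypothetical helper.

```python
def weighted_leaf_proba(n_neg, n_pos, class_weight):
    """Fraction of the weighted samples in a leaf that belong to class 1."""
    w_neg = n_neg * class_weight[0]
    w_pos = n_pos * class_weight[1]
    return w_pos / (w_neg + w_pos)


# Illustrative leaf: 95 non-disbursed rows and 5 disbursed rows (made-up counts).
unweighted = weighted_leaf_proba(95, 5, {0: 1, 1: 1})
weighted = weighted_leaf_proba(95, 5, {0: 1, 1: 20})

print(f"unweighted P(class 1) = {unweighted:.3f}")
print(f"weighted   P(class 1) = {weighted:.3f}")
```

With the weights `{0: 1, 1: 20}` this leaf's share of class 1 rises from 5/100 to 100/195, which is above 0.5, so the leaf would vote for class 1 without any synthetic rows being added to the training data.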


In [107]:
# Sanity check of the approach: score predicted probabilities at several decision thresholds.
threshold = [0.11, 0.115, 0.1145, 0.12, 0.125]
models = []
scores = []
param_grid = {'n_estimators': [50, 75, 100],
              'min_samples_leaf': [5, 7, 10]}
for i in threshold:
    # The scorer receives predicted probabilities and applies threshold i
    earnings_scorer_proba = make_scorer(earnings_function_proba, needs_proba=True, threshold=i)

    gs = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1, max_depth=200, criterion='entropy'), param_grid=param_grid, scoring=earnings_scorer_proba, verbose=1, cv=StratifiedKFold(3))
    gs.fit(X1_train, y_train)

    # Keep the best model and its score for this threshold
    models.append(gs.best_estimator_)
    scores.append(gs.best_score_)


Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  1.8min finished


Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  1.8min finished


Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  1.8min finished


Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  1.9min finished


Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  1.9min finished


In [166]:
best_model_thd = models[np.argmax(scores)]
y_hat = best_model_thd.predict_proba(X1_test)

thd = threshold[np.argmax(scores)]
num_of_1 = sum(y_hat[:,1] > thd)
sc1 = earnings_function_proba(y_test, y_hat, thd)
print(('for this model {} i got {} score and this {} number of ones').format(best_model_thd, sc1, num_of_1))

for this model RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=200, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=15, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False) i got 5600 score and this 34 number of ones
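
The threshold-based check above labels a row as 1 when its predicted probability of class 1 exceeds a cutoff, then scores the resulting labels. A minimal sketch of that step follows. The probabilities, labels, and payoffs are made up, and `profit_at_threshold` is a hypothetical helper rather than the notebook's `earnings_function_proba`.

```python
def profit_at_threshold(y_true, proba_pos, threshold, gain_tp=1000, cost_fp=100):
    """Flag rows with proba_pos > threshold as class 1 and sum assumed payoffs.

    Returns (profit, number of rows flagged as 1).
    gain_tp and cost_fp are illustrative assumptions.
    """
    profit = 0
    n_flagged = 0
    for truth, p in zip(y_true, proba_pos):
        if p > threshold:
            n_flagged += 1
            profit += gain_tp if truth == 1 else -cost_fp
    return profit, n_flagged


# Made-up labels and positive-class probabilities for eight rows.
y_true = [0, 1, 0, 0, 1, 0, 0, 0]
proba_pos = [0.02, 0.30, 0.12, 0.05, 0.11, 0.20, 0.01, 0.09]

# Evaluate a few candidate cutoffs and keep the most profitable one.
results = {t: profit_at_threshold(y_true, proba_pos, t) for t in (0.05, 0.10, 0.15, 0.25)}
best_threshold = max(results, key=lambda t: results[t][0])

for t, (profit, n_flagged) in results.items():
    print(f"threshold={t:.2f} profit={profit} flagged={n_flagged}")
```

Lower cutoffs flag more rows and pick up more false positives, while higher cutoffs miss true positives, which is why the notebook compares several thresholds instead of fixing one.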


Second approach: run the grid search for every pipeline at each probability threshold.


In [173]:
# Best models, their scores, and the thresholds they were tuned at
najlepsze_modele = []
najlepsze_wyniki = []
najlepsze_progi = []
thresholds = [0.11, 0.09, 0.1, 0.012]

pipeline = [Pipeline([('Smoteenn', SMOTEENN(random_state=42)), ('Extra', ExtraTreesClassifier(n_jobs = -1, criterion='entropy'))]), 
            Pipeline([('Forest', RandomForestClassifier(n_jobs=-1, criterion='entropy'))]),
            Pipeline([('Forest',  RandomForestClassifier(n_jobs=-1, criterion='entropy'))])]


param_grid = [{'Extra__n_estimators' : [50, 100], 'Extra__max_depth' : [100, 200], 'Extra__min_samples_split': [6,7] },
              {'Forest__n_estimators': [50,100], 'Forest__max_depth' : [100, 200], 'Forest__min_samples_leaf': [5,8]},
              {'Forest__n_estimators': [50,100], 'Forest__max_depth' : [200, 100], 'Forest__min_samples_leaf': [8], 'Forest__min_samples_split':[6,7]
              }]



for i in thresholds: 
    
    earnings_scorer_proba = make_scorer(earnings_function_proba , needs_proba = True, threshold = i)
    
    for model, grid in zip(pipeline, param_grid):
        
        gs = GridSearchCV(estimator = model , param_grid= grid, scoring =earnings_scorer_proba, verbose = 10, cv = StratifiedKFold(3) )
        gs.fit(X1_train, y_train)      
        najlepsze_modele.append(gs.best_estimator_)
        najlepsze_wyniki.append(gs.best_score_)
        najlepsze_progi.append(i)

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-93800, total=  41.3s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   42.0s remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-90300, total=  44.5s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.5min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-82300, total=  34.2s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.0min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-93100, total=  48.8s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  2.9min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-92900, total=  45.4s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.7min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-77000, total=  45.8s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  4.4min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-105500, total=  39.1s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  5.1min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-100200, total=  43.1s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  5.8min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-93000, total=  41.5s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  6.5min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-103700, total=  45.5s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-99400, total=  42.3s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-92800, total=  40.6s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50, score=-93200, total=  34.5s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50, score=-93200, total=  33.4s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, Ext

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed: 16.6min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=7900, total=   2.7s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.5s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=3800, total=   2.7s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    7.0s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=9100, total=   2.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   10.6s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=3600, total=   4.9s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   16.5s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=1700, total=   4.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   22.0s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=7900, total=   4.9s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   27.9s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=4500, total=   2.8s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   31.4s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=-1700, total=   2.9s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   36.1s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=5700, total=   2.7s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   39.4s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=5900, total=   4.8s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=4400, total=   4.5s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=10300, total=   4.7s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=6200, total=   2.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-2400, total=   2.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.8min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=6700, total=   2.8s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.4s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=4000, total=   2.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.6s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=5500, total=   2.6s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   10.1s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=2200, total=   4.7s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   15.7s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=700, total=   4.6s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.2s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=5500, total=   4.8s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   26.9s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=4800, total=   2.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   30.1s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=4000, total=   2.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   33.3s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=6600, total=   2.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   36.4s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=6300, total=   4.6s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=2800, total=   4.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=6200, total=   5.0s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=3700, total=   3.3s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  F

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.8min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-133000, total=  51.2s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   52.2s remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-130600, total=  49.3s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.7min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-113900, total=  49.6s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.6min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-122800, total= 1.0min
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.6min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-124100, total=  59.4s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  4.6min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-125000, total=  59.8s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  5.6min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-134000, total=  51.1s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  6.5min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-130800, total=  45.3s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  7.3min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-127000, total=  42.6s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  8.0min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-129900, total= 1.0min
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-132000, total=  55.4s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-124500, total=  47.0s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50, score=-125600, total=  35.5s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50, score=-120000, total=  34.0s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6,

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed: 18.6min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=10800, total=   1.7s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.2s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-1100, total=   1.8s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.5s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=11700, total=   1.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    6.7s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=11900, total=   3.1s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   10.6s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=-900, total=   3.1s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   14.4s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=7700, total=   3.2s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   18.3s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=4200, total=   1.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   20.5s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=5400, total=   1.9s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   23.2s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=10200, total=   1.8s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   25.6s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=11400, total=   3.3s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=2000, total=   3.1s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=10500, total=   3.1s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=10800, total=   1.7s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-900, total=   1.6s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_sample

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.2min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=7200, total=   3.1s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.9s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=-900, total=   2.2s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.8s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=5200, total=   1.8s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    9.2s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=15000, total=17.5min
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 17.7min remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=1200, total=   4.9s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 17.8min remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=8800, total=   5.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 17.9min remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=13900, total=   2.4s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 18.0min remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=-5200, total=   2.2s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 18.0min remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=11500, total=   2.2s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 18.1min remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=13800, total=   3.6s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=1700, total=   3.6s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=9900, total=   3.5s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=10900, total=   1.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV] 

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed: 18.9min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-113800, total=  42.9s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   44.1s remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-103300, total=  38.2s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.4min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-100700, total=  36.3s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.0min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-117200, total=  49.6s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  2.8min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-107300, total=  47.5s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.7min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-88700, total=  49.9s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  4.5min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-112800, total=  40.9s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  5.2min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-113600, total=  39.5s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  5.9min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-105700, total=  42.5s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  6.6min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-118900, total=  51.7s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-110500, total=  48.4s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-109000, total=  48.8s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50, score=-97700, total=  40.1s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50, score=-106300, total=  33.3s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, 

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed: 17.9min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=7300, total=   2.0s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.5s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=900, total=   1.9s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.2s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=6200, total=   2.1s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    7.9s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=12500, total=   3.8s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   12.6s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=7500, total=   3.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   17.0s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=8400, total=   3.5s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   21.1s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=4400, total=   1.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   23.3s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=-5900, total=   1.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   25.4s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=4700, total=   1.7s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   27.7s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=6100, total=   3.3s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=2400, total=   3.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=8900, total=   3.2s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=3600, total=   1.7s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=4700, total=   2.0s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_l

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.3min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=5600, total=   2.8s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.5s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=1400, total=   2.7s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.9s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=4900, total=   2.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   10.4s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=8600, total=   4.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   15.7s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=1000, total=   5.0s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   21.5s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=8700, total=   3.9s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   26.3s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=9100, total=   1.9s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   29.1s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=0, total=   2.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   32.3s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=6100, total=   2.6s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   35.7s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=5200, total=   4.4s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=-500, total=   4.3s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=6800, total=   5.4s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=11900, total=   2.3s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.7min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-571400, total=  42.9s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   43.5s remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-544600, total=  41.4s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.4min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=50, score=-563400, total=  41.3s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.1min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-562200, total=  50.2s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.0min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-547700, total=  42.2s
[CV] Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.7min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=6, Extra__n_estimators=100, score=-579700, total=  42.4s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  4.4min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-593200, total=  39.0s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  5.1min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-574100, total=  37.9s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  5.7min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=50, score=-575800, total=  41.8s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  6.5min remaining:    0.0s


[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-605100, total=  51.7s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-570300, total=  51.2s
[CV] Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100 
[CV]  Extra__max_depth=100, Extra__min_samples_split=7, Extra__n_estimators=100, score=-589900, total=  53.8s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50, score=-564400, total=  42.7s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50, score=-551400, total=  40.8s
[CV] Extra__max_depth=200, Extra__min_samples_split=6, Extra__n_estimators=50 
[CV]  Extra__max_depth=200, Extra__min_samples_split=6,

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed: 18.4min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-501200, total=   2.7s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.4s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-498600, total=   2.5s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.8s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-517200, total=   2.7s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   10.5s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=-517300, total=   4.8s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   16.2s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=-518900, total=   5.4s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   22.8s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=5, Forest__n_estimators=100, score=-518400, total=   3.7s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   27.3s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=-533300, total=   1.8s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   29.7s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=-525900, total=   1.8s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   32.1s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=50, score=-530000, total=   1.8s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   34.4s remaining:    0.0s


[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=-537300, total=   3.5s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=-533200, total=   3.6s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__n_estimators=100, score=-533900, total=   4.0s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-501800, total=   2.1s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50, score=-500000, total=   2.1s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=5, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.5min finished


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=-543000, total=   1.8s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.3s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=-536200, total=   1.8s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.7s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=-520600, total=   1.8s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    7.1s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=-540700, total=   3.4s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   11.3s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=-527100, total=   4.0s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   16.1s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=100, score=-533400, total=   3.5s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   20.3s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=-544900, total=   2.2s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   23.0s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=-527200, total=   1.9s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   25.5s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=50, score=-524300, total=   2.0s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   28.0s remaining:    0.0s


[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=-547000, total=   3.4s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=-535000, total=   3.4s
[CV] Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100 
[CV]  Forest__max_depth=200, Forest__min_samples_leaf=8, Forest__min_samples_split=7, Forest__n_estimators=100, score=-539300, total=   3.4s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50 
[CV]  Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators=50, score=-525000, total=   1.7s
[CV] Forest__max_depth=100, Forest__min_samples_leaf=8, Forest__min_samples_split=6, Forest__n_estimators

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.4min finished


In [174]:
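# Pick the pipeline with the highest score and predict class probabilities on the test set
best_model = najlepsze_modele[np.argmax(najlepsze_wyniki)]
y_hat = best_model.predict_proba(X1_test)
# Threshold taken from najlepsze_progi at the index of the best entry in `scores`
thd = najlepsze_progi[np.argmax(scores)]
# Number of test samples classified as 1 at this threshold
num_of_1 = sum(y_hat[:,1] > thd)
# Earnings on the test set at this threshold
sc1 = earnings_function_proba(y_test, y_hat, thd)
print(('for this model {} i got {} score and this {} number of ones').format(best_model, sc1, num_of_1))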

for this model Pipeline(memory=None,
     steps=[('Forest', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=100, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]) i got 7200 score and this 118 number of ones
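
The cell above scores predicted probabilities at a chosen threshold. A minimal sketch of that idea follows, in plain Python so it runs on its own. The payoff constants and the helpers `earnings_at_threshold` and `best_threshold` are hypothetical and are not the notebook's `earnings_function_proba`; the labels and probabilities are made-up toy data.

```python
# Hypothetical payoffs: assumptions for illustration only, not the values
# used by earnings_function_proba in this notebook.
GAIN_TRUE_POSITIVE = 1000   # assumed profit from a targeted customer who converts
COST_FALSE_POSITIVE = -100  # assumed cost of targeting a customer who does not


def earnings_at_threshold(y_true, proba_of_1, threshold):
    """Return (earnings, number of predicted ones) for a given threshold."""
    earnings = 0
    predicted_ones = 0
    for label, p in zip(y_true, proba_of_1):
        if p > threshold:  # same strict comparison as y_hat[:, 1] > thd
            predicted_ones += 1
            earnings += GAIN_TRUE_POSITIVE if label == 1 else COST_FALSE_POSITIVE
    return earnings, predicted_ones


def best_threshold(y_true, proba_of_1, candidates):
    """Pick the candidate threshold with the highest earnings."""
    return max(candidates,
               key=lambda t: earnings_at_threshold(y_true, proba_of_1, t)[0])


# Toy data (made up): true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
proba_of_1 = [0.05, 0.20, 0.35, 0.40, 0.60, 0.10, 0.55, 0.80]
candidates = [0.1, 0.3, 0.5, 0.7]

for t in candidates:
    print(t, earnings_at_threshold(y_true, proba_of_1, t))
print("best threshold:", best_threshold(y_true, proba_of_1, candidates))
```

With real data the threshold should be chosen on a validation split rather than on the test set, so that the reported test earnings are not optimistic.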


The best earnings, 14700, came from the Extra Trees classifier, using the class_weight hyperparameter to handle the class imbalance and checking manually set thresholds.
I also tried Logistic Regression and SVC, but they took a very long time to fit and the results were unsatisfactory.
I used three approaches to the imbalanced class problem:
1. Setting the class_weight hyperparameter: this approach gave the best results.
2. Using the imblearn library to resample the data and artificially create equally sized classes: with the default parameters this led to an overfitted model that did not achieve a good score on the test data.
3. Predicting the class probabilities and choosing a custom threshold above which a sample is classified as 1: this approach gave good results.
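
To illustrate the first approach, the sketch below computes the weights that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_of_class)`. The notebook does not show which `class_weight` value was finally selected, so this is only an example of the heuristic; the helper `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights n_samples / (n_classes * count(class)), as in class_weight='balanced'."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels (made up): 95 leads without disbursal and 5 with disbursal.
y = [0] * 95 + [1] * 5
weights = balanced_class_weights(y)
print(weights)
```

The rare class receives the larger weight, so errors on it cost more during training. A dictionary of this shape can be passed as `class_weight` to classifiers such as `RandomForestClassifier` or `ExtraTreesClassifier`.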