![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Handling Data Imbalance in Classification Models

For this lab and the next lessons we will use the 'Healthcare For All' dataset to build a model that predicts who will donate (`TARGET_B`) and how much they will give (`TARGET_D`); the latter will be used in Friday's lab. You will be using the `files_for_lab/learningSet.csv` file, which you have already downloaded in class.

### Scenario

You are revisiting the Healthcare for All case study. You are provided with historical data about donors and how much they donated. Your task is to build a machine learning model that helps the company identify people who are more likely to donate, and then to predict the donation amount.

### Instructions

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class. You should continue in the same notebook from Monday.

Here is the list of steps to be followed (building a simple model without balancing the data):


**These steps should have been completed in Monday's labs:**
- Import the required libraries and modules that you would need.
- Read that data into Python and call the dataframe `donors`.
- Check the datatypes of all the columns in the data. 
- Split the data into numerical and categorical.
- Check for null values in the dataframe. Replace the null values using the methods learned in class.
- Treat the data using techniques learned in class.
 

**Begin the Modeling here**
- Look critically at the dtypes of numerical and categorical columns and make changes where appropriate.
- Concatenate numerical and categorical back together again for your X dataframe. Designate `TARGET_B` as y.
  - Split the data into a training set and a test set.
  - Split further into train_num and train_cat, and likewise into test_num and test_cat.
  - Scale the numerical features (train_num, test_num) using either a MinMax Scaler or a Standard Scaler.
  - Encode the categorical features (train_cat, test_cat) using One-Hot Encoding or Ordinal Encoding.
      - **Fit** only on the train data, then transform both train and test.
      - Re-concatenate train_num and train_cat as X_train, and test_num and test_cat as X_test.
  - Fit a logistic regression model on the training data.
  - Check the accuracy on the test data.

**Note**: So far we have not balanced the data.

**Managing imbalance in the dataset**

- Check for the imbalance.
- Use the resampling strategies covered in class (upsampling and downsampling) to balance the two classes.
- After each resampling, refit the model and see how its accuracy has changed.
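
One possible way to check for imbalance is to look at the class counts and shares of the target. The sketch below uses a made-up target (`y_toy`) rather than the lab data; the 95/5 split is an assumption chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical toy target: 95 non-donors (0) and 5 donors (1).
y_toy = pd.Series([0] * 95 + [1] * 5, name="TARGET_B")

counts = y_toy.value_counts()                # absolute class counts
shares = y_toy.value_counts(normalize=True)  # class proportions

# How many majority rows exist per minority row.
imbalance_ratio = counts.max() / counts.min()

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = shares.max()

print(counts)
print(f"majority:minority ratio = {imbalance_ratio:.1f}")
print(f"majority-class baseline accuracy = {baseline_accuracy:.2f}")
```

The baseline figure is useful as a reference point: a model whose accuracy is no higher than the majority share has not necessarily learned anything about the minority class.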

In [279]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [280]:
# import data
numerical = pd.read_csv('C:/Ironhack/Week7/7.1, 7.2 Feature Selection (Review)/lab-revisiting-machine-learning/numerical7_02.csv')
categorical = pd.read_csv('C:/Ironhack/Week7/7.1, 7.2 Feature Selection (Review)/lab-revisiting-machine-learning/categorical7_02.csv')
target = pd.read_csv('C:/Ironhack/Week7/7.1, 7.2 Feature Selection (Review)/lab-revisiting-machine-learning/target7_02.csv')

In [281]:
# checking data types
numerical.dtypes

Unnamed: 0      int64
ODATEDW         int64
TCODE           int64
DOB             int64
AGE           float64
               ...   
AVGGIFT       float64
CONTROLN        int64
HPHONE_D        int64
RFA_2F          int64
CLUSTER2      float64
Length: 332, dtype: object

In [282]:
# checking columns
numerical

Unnamed: 0.1,Unnamed: 0,ODATEDW,TCODE,DOB,AGE,NUMCHLD,INCOME,WEALTH1,HIT,MBCRAFT,...,CARDGIFT,LASTGIFT,LASTDATE,FISTDATE,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2
0,0,8901,0,3712,60.000000,0.0,5.0,5.0,0,0.0,...,14,10.0,9512,8911,4.0,7.741935,95515,0,4,39.0
1,1,9401,1,5202,46.000000,1.0,6.0,9.0,16,0.0,...,1,25.0,9512,9310,18.0,15.666667,148535,0,2,1.0
2,2,9001,1,0,61.611649,0.0,3.0,1.0,2,0.0,...,14,5.0,9512,9001,12.0,7.481481,15078,1,4,60.0
3,3,8701,0,2801,70.000000,0.0,1.0,4.0,2,0.0,...,7,10.0,9512,8702,9.0,6.812500,172556,1,4,41.0
4,4,8601,0,2001,78.000000,1.0,3.0,2.0,60,1.0,...,8,15.0,9601,7903,14.0,6.864865,7112,1,2,26.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95407,95407,9601,1,0,61.611649,0.0,5.0,5.0,0,0.0,...,0,25.0,9602,9602,9.0,25.000000,184568,0,1,12.0
95408,95408,9601,1,5001,48.000000,1.0,7.0,9.0,1,0.0,...,0,20.0,9603,9603,9.0,20.000000,122706,1,1,2.0
95409,95409,9501,1,3801,60.000000,0.0,5.0,5.0,0,0.0,...,4,10.0,9610,9410,3.0,8.285714,189641,1,3,34.0
95410,95410,8601,0,4005,58.000000,0.0,7.0,5.0,0,0.0,...,18,18.0,9701,8612,4.0,12.146341,4693,1,4,11.0


In [283]:
# checking data types
categorical.dtypes

Unnamed: 0     int64
STATE         object
CLUSTER        int64
HOMEOWNR      object
GENDER        object
DATASRCE       int64
SOLIH          int64
VETERANS      object
RFA_2R        object
RFA_2A        object
GEOCODE2      object
DOMAIN_A      object
DOMAIN_B       int64
dtype: object

In [284]:
categorical

Unnamed: 0.1,Unnamed: 0,STATE,CLUSTER,HOMEOWNR,GENDER,DATASRCE,SOLIH,VETERANS,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A,DOMAIN_B
0,0,IL,36,N,F,0,0,N,L,E,C,T,2
1,1,CA,14,H,M,3,0,N,L,G,A,S,1
2,2,NC,43,U,M,3,0,N,L,E,C,R,2
3,3,CA,44,U,F,3,0,N,L,E,C,R,2
4,4,FL,16,H,F,3,12,N,L,F,A,S,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95407,95407,other,27,N,M,0,0,N,L,G,C,C,2
95408,95408,TX,24,H,M,3,0,N,L,F,A,C,1
95409,95409,MI,30,N,M,0,0,N,L,E,B,C,3
95410,95410,CA,24,H,F,2,12,N,L,F,A,C,1


In [285]:
# drop solih and veterans columns 
categorical = categorical.drop(columns= ['SOLIH','VETERANS','Unnamed: 0'])

In [286]:
numerical = numerical.drop(columns=['Unnamed: 0'])

In [287]:
categorical['DATASRCE'].value_counts()

3    43549
2    23455
0    21280
1     7128
Name: DATASRCE, dtype: int64

In [288]:
# Differentiate between continuous and discrete variables.
# The function takes a numerical-only dataframe and loops through each column.
# If the number of unique values is at most 2% of the total number of rows, the column is treated as discrete;
# otherwise it is treated as continuous.

def diff_variable(x):
    continuous_df = pd.DataFrame()
    discrete_df = pd.DataFrame()
    
    for column in x.columns:
        length = len(x[column])
        uni_length = len(x[column].unique())
        #print (uni_length/length)
        
        if (uni_length/length)*100 <= 2:
            discrete_df[column] = x[column]
        else:
            continuous_df[column] = x[column]
    
    return continuous_df, discrete_df
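
Adding columns one at a time to an empty DataFrame can trigger pandas' fragmentation `PerformanceWarning` on wide frames. A possible alternative, sketched here with a hypothetical helper name (`split_by_cardinality`) and toy data, computes the unique-value share for all columns at once and then selects the two column groups in a single step each.

```python
import pandas as pd

def split_by_cardinality(df, threshold_pct=2.0):
    """Return (continuous, discrete) column subsets of a numeric DataFrame.

    A column is treated as discrete when its share of unique values
    (NaN counted as a value) is at most `threshold_pct` percent of the rows.
    """
    unique_pct = df.nunique(dropna=False) / len(df) * 100
    discrete_cols = unique_pct[unique_pct <= threshold_pct].index
    continuous_cols = unique_pct[unique_pct > threshold_pct].index
    return df[continuous_cols], df[discrete_cols]

# Hypothetical toy frame: "id" is unique per row, "flag" has two values.
toy = pd.DataFrame({"id": range(200), "flag": [0, 1] * 100})
continuous_toy, discrete_toy = split_by_cardinality(toy)
print(list(continuous_toy.columns), list(discrete_toy.columns))
```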

In [289]:
# call the function to separate discrete and continuous
continuous_df, discrete_df = diff_variable(numerical)


In [290]:
continuous_df

Unnamed: 0,POP901,POP902,POP903,HV1,HV2,IC5,AVGGIFT,CONTROLN
0,992,264,332,479,635,12883,7.741935,95515
1,3611,940,998,5468,5218,36175,15.666667,148535
2,7001,2040,2669,497,546,11576,7.481481,15078
3,640,160,219,1000,1263,15130,6.812500,172556
4,2520,627,761,576,594,9836,6.864865,7112
...,...,...,...,...,...,...,...,...
95407,27380,7252,10037,988,1025,18807,25.000000,184568
95408,1254,322,361,1679,1723,26538,20.000000,122706
95409,552,131,205,376,377,12178,8.285714,189641
95410,1746,432,508,2421,2459,15948,12.146341,4693


In [291]:
discrete_df


Unnamed: 0,ODATEDW,TCODE,DOB,AGE,NUMCHLD,INCOME,WEALTH1,HIT,MBCRAFT,MBGARDEN,...,NUMPRM12,NGIFTALL,CARDGIFT,LASTGIFT,LASTDATE,FISTDATE,TIMELAG,HPHONE_D,RFA_2F,CLUSTER2
0,8901,0,3712,60.000000,0.0,5.0,5.0,0,0.0,0.0,...,14,31,14,10.0,9512,8911,4.0,0,4,39.0
1,9401,1,5202,46.000000,1.0,6.0,9.0,16,0.0,0.0,...,13,3,1,25.0,9512,9310,18.0,0,2,1.0
2,9001,1,0,61.611649,0.0,3.0,1.0,2,0.0,0.0,...,14,27,14,5.0,9512,9001,12.0,1,4,60.0
3,8701,0,2801,70.000000,0.0,1.0,4.0,2,0.0,0.0,...,14,16,7,10.0,9512,8702,9.0,1,4,41.0
4,8601,0,2001,78.000000,1.0,3.0,2.0,60,1.0,0.0,...,25,37,8,15.0,9601,7903,14.0,1,2,26.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95407,9601,1,0,61.611649,0.0,5.0,5.0,0,0.0,0.0,...,12,1,0,25.0,9602,9602,9.0,0,1,12.0
95408,9601,1,5001,48.000000,1.0,7.0,9.0,1,0.0,0.0,...,8,1,0,20.0,9603,9603,9.0,1,1,2.0
95409,9501,1,3801,60.000000,0.0,5.0,5.0,0,0.0,0.0,...,17,7,4,10.0,9610,9410,3.0,1,3,34.0
95410,8601,0,4005,58.000000,0.0,7.0,5.0,0,0.0,0.0,...,31,41,18,18.0,9701,8612,4.0,1,4,11.0


In [292]:
discrete_df['ODATEDW'].value_counts()

9501    15358
8601    14596
9401    12065
9601    10122
9101     8552
9001     7718
9201     7539
8801     6669
8901     5342
9301     3921
8701     3451
9701       15
9509        4
9209        4
9212        3
9410        3
9510        3
8912        2
9109        2
9310        2
8501        2
9506        2
9309        2
8910        2
9009        2
9202        2
9302        2
9003        1
9205        1
8909        1
9402        1
9011        1
8707        1
9012        1
8612        1
8604        1
9312        1
9303        1
8401        1
9103        1
8609        1
8702        1
9512        1
8704        1
9010        1
8611        1
8711        1
9102        1
8608        1
9111        1
9511        1
8810        1
8804        1
8306        1
Name: ODATEDW, dtype: int64

In [293]:
# concat numerical and categorical columns
X = pd.concat([numerical,categorical], axis=1)

In [294]:
# drop target d
y=target.drop(columns= ['TARGET_D'])
y = pd.DataFrame(y)

In [295]:
# drop unnamed column 

y = y.drop(columns=['Unnamed: 0'])
y = y.astype(int)

In [296]:
X

Unnamed: 0,ODATEDW,TCODE,DOB,AGE,NUMCHLD,INCOME,WEALTH1,HIT,MBCRAFT,MBGARDEN,...,STATE,CLUSTER,HOMEOWNR,GENDER,DATASRCE,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A,DOMAIN_B
0,8901,0,3712,60.000000,0.0,5.0,5.0,0,0.0,0.0,...,IL,36,N,F,0,L,E,C,T,2
1,9401,1,5202,46.000000,1.0,6.0,9.0,16,0.0,0.0,...,CA,14,H,M,3,L,G,A,S,1
2,9001,1,0,61.611649,0.0,3.0,1.0,2,0.0,0.0,...,NC,43,U,M,3,L,E,C,R,2
3,8701,0,2801,70.000000,0.0,1.0,4.0,2,0.0,0.0,...,CA,44,U,F,3,L,E,C,R,2
4,8601,0,2001,78.000000,1.0,3.0,2.0,60,1.0,0.0,...,FL,16,H,F,3,L,F,A,S,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95407,9601,1,0,61.611649,0.0,5.0,5.0,0,0.0,0.0,...,other,27,N,M,0,L,G,C,C,2
95408,9601,1,5001,48.000000,1.0,7.0,9.0,1,0.0,0.0,...,TX,24,H,M,3,L,F,A,C,1
95409,9501,1,3801,60.000000,0.0,5.0,5.0,0,0.0,0.0,...,MI,30,N,M,0,L,E,B,C,3
95410,8601,0,4005,58.000000,0.0,7.0,5.0,0,0.0,0.0,...,CA,24,H,F,2,L,F,A,C,1


In [297]:
# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [298]:
# numerical train test split
numericals_train = X_train.select_dtypes(np.number)
numericals_test = X_test.select_dtypes(np.number)

# categorical train test split
categoricals_train= X_train.select_dtypes(object)
categoricals_test= X_test.select_dtypes(object)

In [299]:
# scaling numerical values using min max scaler
from sklearn.preprocessing import MinMaxScaler

transformer = MinMaxScaler().fit(numericals_train)
numericals_train_scaled = transformer.transform(numericals_train)
numericals_test_scaled = transformer.transform(numericals_test)

# converting array into dataframe 
numericals_test_scaled= pd.DataFrame(numericals_test_scaled).reset_index(drop=True)
numericals_train_scaled= pd.DataFrame(numericals_train_scaled).reset_index(drop=True)

In [300]:
# encoding categorical variables using one-hot encoder

from sklearn.preprocessing import OneHotEncoder

# Create an instance of OneHotEncoder
encoder = OneHotEncoder(handle_unknown='error').fit(categoricals_train)

# Encode categorical features for both the training and test datasets
categoricals_train_encoded = encoder.transform(categoricals_train).toarray().copy()
categoricals_test_encoded = encoder.transform(categoricals_test).toarray().copy()

categoricals_train_encoded= pd.DataFrame(categoricals_train_encoded).reset_index(drop=True)
categoricals_test_encoded= pd.DataFrame(categoricals_test_encoded).reset_index(drop=True)

In [301]:
X_train=pd.concat([numericals_train_scaled ,categoricals_train_encoded], axis=1)
X_test=pd.concat([numericals_test_scaled ,categoricals_test_encoded], axis=1)

In [302]:
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [303]:
X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,0.211470,0.000017,0.236972,0.762887,0.428571,0.500000,0.666667,0.008299,0.000000,0.00,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.928315,0.000000,0.463543,0.536082,0.000000,0.666667,0.555556,0.000000,0.000000,0.00,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.354839,0.000017,0.381977,0.608247,0.000000,0.666667,0.111111,0.020747,0.000000,0.00,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.426523,0.000017,0.216375,0.783505,0.000000,0.833333,0.666667,0.037344,0.333333,0.00,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.784946,0.000052,0.442945,0.556701,0.000000,0.666667,0.222222,0.087137,0.333333,0.25,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76324,0.211470,0.000000,0.278888,0.711340,0.000000,0.333333,1.000000,0.020747,0.000000,0.00,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
76325,0.928315,0.000034,0.329660,0.670103,0.000000,0.333333,0.666667,0.000000,0.000000,0.00,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
76326,0.283154,0.000017,0.000000,0.624862,0.000000,0.666667,0.555556,0.000000,0.000000,0.00,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
76327,0.211470,0.000017,0.216993,0.773196,0.000000,0.666667,0.333333,0.004149,0.000000,0.00,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [304]:
from sklearn.linear_model import LogisticRegression
classification = LogisticRegression(random_state=1337, solver='lbfgs',
                  multi_class='multinomial').fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
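
The two warnings above point at possible adjustments: passing the target as a 1-D object instead of a one-column DataFrame, and raising `max_iter` so that the solver has room to converge. The sketch below illustrates both on synthetic data from `make_classification`; the data, the value `max_iter=1000`, and the omission of `multi_class` (the target has only two classes) are assumptions, and the lab data may need a different setting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the prepared features and target (not the lab data).
X_toy, y_arr = make_classification(n_samples=200, n_features=5, random_state=0)
y_frame = pd.DataFrame({"TARGET_B": y_arr})  # one-column frame, like y in the notebook

# Select the column to get a 1-D Series; y_frame.values.ravel() would also work.
y_1d = y_frame["TARGET_B"]

# A larger iteration budget than the default of 100.
clf = LogisticRegression(random_state=1337, solver="lbfgs", max_iter=1000)
clf.fit(X_toy, y_1d)

print("iterations used:", clf.n_iter_[0], "of", clf.max_iter)
```

Comparing `n_iter_` with `max_iter` after fitting is a quick way to see whether the solver stopped because it converged or because it ran out of iterations.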


In [305]:
predictions = classification.predict(X_test)
predictions
classification.score(X_test, y_test)

0.9475973379447676

In [306]:
print("precision: ",precision_score(y_test,predictions))
print("recall: ",recall_score(y_test,predictions))
print("f1: ",f1_score(y_test,predictions))

precision:  0.0
recall:  0.0
f1:  0.0


  _warn_prf(average, modifier, msg_start, len(result))
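
An accuracy near 0.95 alongside zero precision and recall is what one would expect if the model predicts class 0 for every row, although the notebook does not show the predictions themselves. The toy example below (made-up labels, not the lab data) shows how a confusion matrix exposes that situation.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print(cm)
print("accuracy:", acc)
print("recall:", rec)
```

The second column of the matrix is all zeros, meaning no row was ever predicted as positive, even though accuracy looks high.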


## Downsampling (undersampling)

In [307]:
from sklearn.utils import resample
train=pd.concat([X_train, y_train], axis = 1)

category_0 = train[train['TARGET_B'] == 0]
category_1 = train[train['TARGET_B'] == 1]

In [308]:
category_0_undersampled = resample(category_0,
                                   replace=False,
                                   n_samples = len(category_1))
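
Without a `random_state`, `resample` draws a different subset on every run, so the scores that follow will vary between executions. A minimal sketch of a reproducible variant is shown below on a toy frame; the frame, the seed value 42, and the final shuffle are assumptions made for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training frame: 16 majority rows, 4 minority rows.
train_toy = pd.DataFrame({"feature": range(20),
                          "TARGET_B": [0] * 16 + [1] * 4})
majority = train_toy[train_toy["TARGET_B"] == 0]
minority = train_toy[train_toy["TARGET_B"] == 1]

# Downsample the majority class without replacement, with a fixed seed.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)

# Combine and shuffle so the classes are not stored in two blocks.
balanced_down = (pd.concat([majority_down, minority], axis=0)
                   .sample(frac=1, random_state=42))

print(balanced_down["TARGET_B"].value_counts())
```

Running the same call again with the same seed returns the same rows, which makes later metric comparisons repeatable.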

In [309]:
print(category_0_undersampled.shape)
print(category_1.shape)

(3843, 367)
(3843, 367)


In [310]:
data_downsampled = pd.concat([category_0_undersampled, category_1], axis=0)

In [311]:
data_downsampled

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,31,TARGET_B
4888,0.354839,0.000017,0.339959,0.659794,0.000000,0.666667,0.777778,0.045643,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
24289,0.856631,0.000000,0.473841,0.525773,0.142857,0.500000,0.888889,0.000000,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
75909,0.211470,0.000034,0.196601,0.793814,0.000000,0.666667,0.555556,0.000000,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
75682,0.856631,0.000000,0.000000,0.624862,0.000000,0.666667,0.555556,0.000000,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
28817,0.784946,0.000017,0.000000,0.624862,0.000000,0.000000,0.333333,0.004149,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76261,0.211470,0.000017,0.638620,0.360825,0.285714,0.500000,0.444444,0.087137,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
76276,0.784946,0.000000,0.206076,0.793814,0.000000,0.500000,0.555556,0.000000,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1
76300,0.211470,0.000034,0.227497,0.762887,0.000000,0.666667,0.555556,0.000000,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1
76301,0.784946,0.000483,0.000000,0.624862,0.000000,0.666667,0.555556,0.000000,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1


In [312]:
data_downsampled['TARGET_B'].value_counts()

0    3843
1    3843
Name: TARGET_B, dtype: int64

In [313]:
X_train_down = data_downsampled.drop(['TARGET_B'],axis=1)
y_train_down = data_downsampled['TARGET_B']

In [314]:
from sklearn.linear_model import LogisticRegression
classification_down = LogisticRegression(random_state=1337, solver='lbfgs',
                  multi_class='multinomial').fit(X_train_down, y_train_down)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [315]:
predictions_down = classification_down.predict(X_test)
predictions_down
classification_down.score(X_test, y_test)

0.5833464339988471

In [316]:
print("precision: ",precision_score(y_test,predictions_down))
print("recall: ",recall_score(y_test,predictions_down))
print("f1: ",f1_score(y_test,predictions_down))

precision:  0.06842170619644852
recall:  0.551
f1:  0.12172760410913509


## Upsampling (oversampling)

In [317]:
category_1_oversampled = resample(category_1,
                                  replace=True,
                                  n_samples = len(category_0))

In [318]:
print(category_0.shape)
print(category_1_oversampled.shape)

(72486, 367)
(72486, 367)


In [319]:
data_upsampled = pd.concat([category_0, category_1_oversampled], axis=0)

In [320]:
data_upsampled['TARGET_B'].value_counts()

0    72486
1    72486
Name: TARGET_B, dtype: int64

In [321]:
X_train_up = data_upsampled.drop(['TARGET_B'],axis=1)
y_train_up = data_upsampled['TARGET_B']

In [322]:
from sklearn.linear_model import LogisticRegression
classification_up = LogisticRegression(random_state=1337, solver='lbfgs',
                  multi_class='multinomial').fit(X_train_up, y_train_up)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [323]:
predictions_up = classification_up.predict(X_test)
predictions_up
classification_up.score(X_test, y_test)

0.6132159513703296

In [324]:
print("precision: ",precision_score(y_test,predictions_up))
print("recall: ",recall_score(y_test,predictions_up))
print("f1: ",f1_score(y_test,predictions_up))

precision:  0.07260549229738782
recall:  0.542
f1:  0.12805670407560546
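
To compare the models side by side, the four metrics can be collected into one table. The sketch below is only an illustration on synthetic data from `make_classification`: the `evaluate` helper, the data, and the parameter values are assumptions, and the numbers it prints say nothing about the lab dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def evaluate(model, X, y):
    """Return accuracy, precision, recall and F1 for a fitted binary classifier."""
    pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y, pred),
        "precision": precision_score(y, pred, zero_division=0),
        "recall": recall_score(y, pred, zero_division=0),
        "f1": f1_score(y, pred, zero_division=0),
    }

# Synthetic imbalanced data (roughly 95% / 5%), not the lab data.
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          stratify=y_toy, random_state=0)

# Upsample the minority class in the training split only.
is_minority = y_tr == 1
X_min_up, y_min_up = resample(X_tr[is_minority], y_tr[is_minority],
                              replace=True,
                              n_samples=int((~is_minority).sum()),
                              random_state=0)
X_up = np.vstack([X_tr[~is_minority], X_min_up])
y_up = np.concatenate([y_tr[~is_minority], y_min_up])

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
upsampled = LogisticRegression(max_iter=1000).fit(X_up, y_up)

# One row per model, one column per metric, all scored on the untouched test split.
results = pd.DataFrame({
    "imbalanced": evaluate(plain, X_te, y_te),
    "upsampled": evaluate(upsampled, X_te, y_te),
}).T
print(results.round(3))
```

Both models are scored on the same untouched test split, so any difference in the table comes from how the training data was balanced.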
