# Income Cluster Prediction with Pipelines and Cross Validation

In this notebook, we will look at creating pipelines, both to streamline our code and to make the productionisation of the models far easier.

We will be using the income dataset. We want to predict if a person earns more than $50k per year.

In [103]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_val_score, train_test_split, cross_validate
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, MaxAbsScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn import metrics

import pandas as pd
import numpy as np
import seaborn as sns

In [104]:
df = pd.read_csv(r"incomedata.csv")
df.head()

Unnamed: 0,ID,age,workclass,fnlwgt,education,educationNum,maritalStatus,occupation,relationship,race,sex,capitalGain,capitalLoss,hoursPerWeek,nativeCountry,income
0,100001,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,100002,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,100003,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,100004,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,100005,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [105]:
df.drop(['nativeCountry'], axis=1, inplace=True)

In [106]:
print(df.shape)
df.isnull().sum()

(32561, 15)


ID                0
age              20
workclass         0
fnlwgt            0
education         0
educationNum      0
maritalStatus     0
occupation        0
relationship      0
race              0
sex               0
capitalGain       0
capitalLoss       0
hoursPerWeek      0
income            0
dtype: int64

In [107]:
df.dtypes

ID                 int64
age              float64
workclass         object
fnlwgt             int64
education         object
educationNum       int64
maritalStatus     object
occupation        object
relationship      object
race              object
sex               object
capitalGain        int64
capitalLoss        int64
hoursPerWeek       int64
income            object
dtype: object

In [108]:
df['income'] = df['income'].astype('category')
df['income'] = df['income'].cat.codes
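
The integer codes replace the original labels, so it helps to know which code stands for which class. The sketch below uses a toy series rather than the real dataframe, and assumes the income labels carry a leading space like the other text columns in this dataset.

```python
import pandas as pd

# Toy stand-in for the income column (assumed label spelling, with leading space).
income = pd.Series([" <=50K", " >50K", " <=50K"]).astype("category")

# Categories are stored in sorted order; each code is a position in that list.
code_to_label = dict(enumerate(income.cat.categories))
codes = income.cat.codes

print(code_to_label)
print(codes.tolist())
```

Under that assumption, code 1 is the positive class (more than $50k), which is the class that precision, recall and F1 are reported for later on.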

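Deal with the nulls. The age column has 20 missing values. Rather than filling them with the overall mean age, we will fill each one with the mean age of people who share the same marital status and education level.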

In [109]:
df[df['maritalStatus']==' Divorced']['age'].mean()

43.01147356580427

In [110]:
divorced_df = df[df['maritalStatus']==' Divorced']
divorced_df['age'].mean()

43.01147356580427

In [111]:
df[(df['maritalStatus']==' Divorced')&(df['education']==' HS-grad')]['age'].mean()

42.31497524752475

In [112]:
divorced_df[divorced_df['education']==' HS-grad']

Unnamed: 0,ID,age,workclass,fnlwgt,education,educationNum,maritalStatus,occupation,relationship,race,sex,capitalGain,capitalLoss,hoursPerWeek,income
2,100003,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,0
24,100025,59.0,Private,109015,HS-grad,9,Divorced,Tech-support,Unmarried,White,Female,0,0,40,0
28,100029,39.0,Private,367260,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Male,0,0,80,0
54,100055,47.0,Self-emp-inc,109832,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Male,0,0,60,0
84,100085,44.0,Private,343591,HS-grad,9,Divorced,Craft-repair,Not-in-family,White,Female,14344,0,40,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32422,132423,47.0,Private,161950,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,0,32,0
32427,132428,41.0,Private,206878,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,0,32,0
32491,132492,33.0,Private,63079,HS-grad,9,Divorced,Adm-clerical,Unmarried,Black,Female,0,0,40,0
32492,132493,42.0,Self-emp-not-inc,217597,HS-grad,9,Divorced,Sales,Own-child,White,Male,0,0,50,0


In [113]:
#df['age'] = df['age'].fillna(df['age'].mean())

In [114]:
df.groupby(['maritalStatus' , 'education'])['age'].transform('mean')

0        30.515895
1        42.646931
2        42.314975
3        42.618644
4        42.646931
           ...    
32556    40.644880
32557    42.845009
32558    59.452785
32559    28.534500
32560    42.845009
Name: age, Length: 32561, dtype: float64

In [115]:
df.groupby(['maritalStatus' , 'education'])['age'].mean()

maritalStatus  education    
 Divorced       10th            44.092437
                11th            41.230769
                12th            41.333333
                1st-4th         62.700000
                5th-6th         53.700000
                                  ...    
 Widowed        HS-grad         59.452785
                Masters         58.341463
                Preschool       58.666667
                Prof-school     55.600000
                Some-college    57.523256
Name: age, Length: 101, dtype: float64

In [116]:
df.groupby(['maritalStatus' , 'education'])['age'].transform('mean')

0        30.515895
1        42.646931
2        42.314975
3        42.618644
4        42.646931
           ...    
32556    40.644880
32557    42.845009
32558    59.452785
32559    28.534500
32560    42.845009
Name: age, Length: 32561, dtype: float64

In [117]:
df['age'] = df['age'].fillna(df.groupby(['maritalStatus' , 'education'])['age'].transform('mean'))
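
To make the group-wise fill easier to follow, here is a minimal sketch on a made-up five-row dataframe (the values are illustrative, not taken from the income data). `transform('mean')` returns one value per row, aligned with the original index, which is what lets `fillna` pick the right group mean for each missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, each with one missing age.
toy = pd.DataFrame({
    "maritalStatus": ["Divorced", "Divorced", "Divorced", "Widowed", "Widowed"],
    "education": ["HS-grad", "HS-grad", "HS-grad", "Masters", "Masters"],
    "age": [40.0, 44.0, np.nan, 60.0, np.nan],
})

# One mean per row, computed within each (maritalStatus, education) group.
group_mean = toy.groupby(["maritalStatus", "education"])["age"].transform("mean")

# Only the missing entries are replaced; existing ages are left untouched.
toy["age"] = toy["age"].fillna(group_mean)
print(toy)
```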

In [118]:
print(df.shape)
df.isnull().sum()

(32561, 15)


ID               0
age              0
workclass        0
fnlwgt           0
education        0
educationNum     0
maritalStatus    0
occupation       0
relationship     0
race             0
sex              0
capitalGain      0
capitalLoss      0
hoursPerWeek     0
income           0
dtype: int64

### OneHotEncoding

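For nominal variables (categories with no natural order), one-hot encoding is usually a better fit than integer codes, because it does not impose an artificial ordering on the categories.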

In the Income dataset, we have several categorical columns. We will do One Hot Encoding for all of them.
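
As a small illustration of what one-hot encoding produces, the sketch below encodes a single made-up column. It uses the encoder's default sparse output and converts it with `.toarray()`, which avoids the version-dependent dense-output argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with two categories.
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(toy[["sex"]]).toarray()

print(encoder.categories_)  # one array of categories per input column
print(encoded)              # one output column per category
```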

In [119]:
# Return a dense array instead of a sparse matrix.
# Note: scikit-learn 1.2 renamed this argument to sparse_output; sparse was removed in 1.4.
ohe = OneHotEncoder(sparse=False)

Select the categorical columns and do the encoding.

In [120]:
categoical = df.select_dtypes(include=['object', 'category']).columns
ohe.fit_transform(df[categoical])

array([[0., 0., 0., ..., 1., 0., 1.],
       [0., 0., 0., ..., 1., 0., 1.],
       [0., 0., 0., ..., 1., 0., 1.],
       ...,
       [0., 0., 0., ..., 1., 1., 0.],
       [0., 0., 0., ..., 1., 0., 1.],
       [0., 0., 0., ..., 1., 1., 0.]])

View the categories

In [121]:
ohe.categories_

[array([' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private',
        ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
       dtype=object),
 array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object),
 array([' Divorced', ' Married-AF-spouse', ' Married-civ-spouse',
        ' Married-spouse-absent', ' Never-married', ' Separated',
        ' Widowed'], dtype=object),
 array([' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair',
        ' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners',
        ' Machine-op-inspct', ' Other-service', ' Priv-house-serv',
        ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support',
        ' Transport-moving'], dtype=object),
 array([' Husband', ' Not-in-family', ' Other-relative', ' Own-child',
        ' Unmarried'

### Column Transformer

Let's create a column transformer. This will create a transformer object, which will be used to do the category encodings, but also preserve the encodings for use later on.

This applies a OneHotEncoder() to the columns whose names appear in the variable **categoical** (the categorical columns selected earlier). The next argument, **remainder**, tells the column transformer what to do with the rest of the columns. In this case, we pass them through unaltered.

In [122]:
col_transform = make_column_transformer((OneHotEncoder(), categoical), remainder='passthrough')
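
The following sketch shows the effect of `remainder='passthrough'` on a made-up two-column dataframe. The column names are borrowed from the dataset but the values are invented, and `sparse_threshold=0` is added here only to force a dense result that is easy to print; the notebook's transformer does not set it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column, one numeric column.
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "hoursPerWeek": [40, 13, 50]})

toy_transform = make_column_transformer(
    (OneHotEncoder(), ["sex"]),  # encode only the listed columns
    remainder="passthrough",     # keep every other column unchanged
    sparse_threshold=0,          # illustration only: always return a dense array
)
toy_out = toy_transform.fit_transform(toy)
print(toy_out)
```

The encoded columns come first and the passed-through columns are appended after them.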

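We will now fit this instance of the column transformer to our dataframe. Note that fitting the pipeline later on refits the transformer on the training data, so this call is a standalone check rather than a required step.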

In [123]:
col_transform.fit(df)

We will now create our attributes set and our labels set.

In [124]:
x = df.drop('income', axis=1)

# y = df.loc[:,['income']]
y = df['income']

In [125]:
x_train, x_val, y_train, y_val = train_test_split(x, y, random_state=0)
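
With no `test_size` given, `train_test_split` holds out 25% of the rows. The sketch below checks this on synthetic arrays and also passes `stratify`, which is a variation not used in the notebook: it keeps the class ratio roughly equal in both parts, which can matter for an imbalanced target such as income.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25% positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 75 + [1] * 25)

# stratify is an optional addition here, not part of the original call.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy
)

print(len(y_tr), len(y_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of the positive class in each part
```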

### Pipelines

Now that we have our datasets, we will build a pipeline. The first thing we will do is call the column transformer we created in the previous section. We will then apply feature scaling and then, finally, we will apply the Logistic Regression algorithm. 

A Pipeline takes its steps as a list of (name, estimator) tuples.

In [126]:
pipeline = Pipeline(steps=[('transform', col_transform), 
                           # MaxAbsScaler also works on sparse matrices, which the column transformer may return
                           ('maxabs', MaxAbsScaler()),
                           ('logreg', LogisticRegression(solver='lbfgs',max_iter=200))])
pipeline

We train the pipeline just as we would a single machine learning model: by calling fit on the training data. The data passes through the pipeline, and each step is fitted and applied in turn.

The pipeline will first run the transformation, then it will run the feature scaling and finally the logistic regression. 
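
To illustrate that order of execution, the sketch below fits a smaller two-step pipeline on synthetic data and then repeats the prediction by hand, applying each fitted step in turn. The data and the step names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic classification data standing in for the real features.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline(steps=[
    ("scale", MaxAbsScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(X_toy, y_toy)

# Manual equivalent of toy_pipeline.predict: transform, then predict.
scaled = toy_pipeline.named_steps["scale"].transform(X_toy)
manual_pred = toy_pipeline.named_steps["logreg"].predict(scaled)
pipeline_pred = toy_pipeline.predict(X_toy)

print(np.array_equal(manual_pred, pipeline_pred))
```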

In [127]:
pipeline.fit(x_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
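
The warning means the lbfgs solver stopped at the iteration limit before converging. One option the message suggests is to raise max_iter. A minimal sketch of doing that on an existing pipeline, using the `step__parameter` naming convention, is shown below; the value 1000 is an arbitrary choice for illustration and has not been checked against this dataset.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Stand-in pipeline with the same final step as the notebook's pipeline.
toy_pipeline = Pipeline(steps=[
    ("scale", MaxAbsScaler()),
    ("logreg", LogisticRegression(solver="lbfgs", max_iter=200)),
])

# Parameters of a step are addressed as <step name>__<parameter name>.
toy_pipeline.set_params(logreg__max_iter=1000)  # 1000 is an illustrative value

print(toy_pipeline.named_steps["logreg"].max_iter)
```

The pipeline has to be fitted again after changing the parameter for the new limit to take effect.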


Just as with a single scikit-learn model, we can run predictions on the pipeline using the **.predict()** method.

In [128]:
pred = pipeline.predict(x_val)

In [129]:
print('Accuracy = ', metrics.accuracy_score(y_val, pred))
print('Precision = ', metrics.precision_score(y_val, pred))
print('Recall = ', metrics.recall_score(y_val, pred))
print('F1 = ', metrics.f1_score(y_val, pred))
print('Confusion Matrix = ', metrics.confusion_matrix(y_val, pred))

Accuracy =  0.8480530647340622
Precision =  0.7309361438313701
Recall =  0.594853683148335
F1 =  0.6559109874826148
Confusion Matrix =  [[5725  434]
 [ 803 1179]]
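
As a cross-check, the four scores can be recomputed from the confusion matrix alone. The sketch below copies the matrix printed above and relies on scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[5725, 434],
               [803, 1179]])

# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the predicted >50K, how many were right
recall = tp / (tp + fn)     # of the actual >50K, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The recall of about 0.59 shows that the model misses roughly four in ten of the higher earners, which the accuracy figure alone does not reveal.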


### Cross Validation

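Cross-validation does not prevent overfitting by itself, but it helps us detect it by giving a more reliable estimate of how the model performs on unseen data.
It divides the training data into several folds (5 here), then repeatedly trains the model on all but one fold and tests it on the held-out fold, using a different fold for testing each time.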
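
The fold-by-fold procedure can be written out by hand. The sketch below does so on synthetic data with a plain logistic regression and compares the result with `cross_val_score`; it assumes the library's default behaviour for classifiers with an integer `cv`, namely unshuffled stratified folds scored by accuracy.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for the training set.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_toy, y_toy, cv=5)
print(manual_scores)
print(library_scores)
```

Passing a whole pipeline instead of a bare model means the preprocessing steps are refitted inside every fold, so the test fold never influences the encoder or the scaler.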

In [130]:
metrics_ = cross_validate(pipeline, cv=5, X=x_train, y=y_train, scoring=['accuracy', 'precision', 'recall','f1'])
# cross_validate(pipeline, cv=5, X=x_train, y=y_train, scoring=['accuracy', 'precision', 'f1'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [131]:
metrics_

{'fit_time': array([0.471843  , 0.40367293, 0.43201447, 0.4547627 , 0.39700127]),
 'score_time': array([0.02299929, 0.02299833, 0.02400327, 0.02299905, 0.02298307]),
 'test_accuracy': array([0.8548321 , 0.8495086 , 0.84807535, 0.84357084, 0.84152334]),
 'test_precision': array([0.75609756, 0.7289916 , 0.73493976, 0.71764706, 0.701417  ]),
 'test_recall': array([0.5824082 , 0.59265585, 0.57301452, 0.57301452, 0.59129693]),
 'test_f1': array([0.6579836 , 0.6537918 , 0.64395393, 0.63722697, 0.64166667])}

Are the mean cross-validated F1 and accuracy scores similar to the scores on the hold-out validation set (computed without CV)? If so, that is a good sign that the model is not overfitting.

In [132]:
metrics_['test_f1'].mean()

0.6469245949336051

In [133]:
metrics_['test_accuracy'].mean()

0.8475020475020475

In [134]:
metrics_['test_precision'].mean()

0.7278185959045045
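
To put the comparison in numbers, the sketch below copies the per-fold scores and the hold-out scores from the outputs above and reports the gap between them. What counts as a small enough gap is a judgment call; no threshold is implied by the notebook.

```python
import numpy as np

# Per-fold scores and hold-out scores copied from the outputs above.
cv_f1 = np.array([0.6579836, 0.6537918, 0.64395393, 0.63722697, 0.64166667])
cv_accuracy = np.array([0.8548321, 0.8495086, 0.84807535, 0.84357084, 0.84152334])
holdout_f1 = 0.6559109874826148
holdout_accuracy = 0.8480530647340622

# Positive gap: the hold-out score is higher than the CV mean.
f1_gap = holdout_f1 - cv_f1.mean()
accuracy_gap = holdout_accuracy - cv_accuracy.mean()

print(f"F1:       CV mean {cv_f1.mean():.4f}, std {cv_f1.std():.4f}, gap {f1_gap:+.4f}")
print(f"Accuracy: CV mean {cv_accuracy.mean():.4f}, std {cv_accuracy.std():.4f}, gap {accuracy_gap:+.4f}")
```

Both gaps are below one percentage point, which is consistent with the reading that the model generalises about as well as cross-validation suggests.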