## The Predictive Machine Learning Model for Predicting Salary in a Population - v2

The aim of this project is to predict whether an individual's salary is above 50K using a linear model (logistic regression), and to measure the accuracy of that model on the adult_census dataset from the 1994 US census.

##### Load the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

##### Load the dataset

In [2]:
data = pd.read_csv('../datafolder/adult-census.csv')
data.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [3]:
# Make a copy of the dataset - Good Practice
df = data.copy()
df.head(3)

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K


In [4]:
# Removing the redundant column - our previous project showed that 'education' and 'education-num'
# are correlated, so we only need to keep one of them.

df = df.drop(columns='education-num')
df.head(3)

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K


In [5]:
# Select the target

target_col = 'class'
target = df[target_col]
target.head()

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object

In [6]:
# Select the remaining features columns

df = df.drop(columns=[target_col])
df.head()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


###### Dynamically selecting the numerical and categorical columns

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   education       48842 non-null  object
 3   marital-status  48842 non-null  object
 4   occupation      48842 non-null  object
 5   relationship    48842 non-null  object
 6   race            48842 non-null  object
 7   sex             48842 non-null  object
 8   capital-gain    48842 non-null  int64 
 9   capital-loss    48842 non-null  int64 
 10  hours-per-week  48842 non-null  int64 
 11  native-country  48842 non-null  object
dtypes: int64(4), object(8)
memory usage: 4.5+ MB


In [8]:
from sklearn.compose import make_column_selector as col_sel

num_col_sel = col_sel(dtype_exclude=object)
cat_col_sel = col_sel(dtype_include=object)

In [9]:
num_cols = num_col_sel(df)
num_cols

['age', 'capital-gain', 'capital-loss', 'hours-per-week']

In [10]:
cat_cols = cat_col_sel(df)
cat_cols

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
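
As a small illustration of how dtype-based selection works, the sketch below applies `make_column_selector` to a hypothetical three-column frame (the `toy` data is an assumption, not the census dataset). It uses `"number"` as the dtype filter, which is an alternative to the `object` filter used above and gives the same split on a frame like this one.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame for illustration only (not the census data).
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "hours-per-week": [40, 50, 40],
})

# A selector is a callable: given a DataFrame, it returns a list of column names.
num_selector = make_column_selector(dtype_include="number")
cat_selector = make_column_selector(dtype_exclude="number")

toy_num_cols = num_selector(toy)
toy_cat_cols = cat_selector(toy)

print(toy_num_cols)
print(toy_cat_cols)
```

Because the selectors inspect dtypes at call time, the same two objects can be reused if columns are later added to or dropped from the frame.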

##### Preprocessing Steps

In [11]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_preprocessor = StandardScaler()   # Initializing
cat_preprocessor = OneHotEncoder(sparse_output=False, handle_unknown='ignore')   # Initializing

In [12]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [('stdScaler-numerical', num_preprocessor, num_cols),
     ('oneHot-categorical', cat_preprocessor, cat_cols)
    ]
)

In [13]:
preprocessor
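
To make the effect of the column transformer concrete, here is a minimal sketch on a hypothetical two-column frame (the `toy` data and the step names `num` and `cat` are assumptions). It scales the numerical column, one-hot encodes the categorical one, and shows what `handle_unknown='ignore'` does with a category that was not seen during fitting. `sparse_threshold=0` is used here only to force a dense array so the result is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame for illustration only (not the census data).
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "sex": ["Male", "Female", "Female", "Male"],
})

toy_preprocessor = ColumnTransformer(
    [("num", StandardScaler(), ["age"]),
     ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"])],
    sparse_threshold=0,  # always return a dense NumPy array
)

# One scaled column plus one column per category seen during fit.
encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)

# A category unseen at fit time is encoded as all zeros instead of raising.
unseen = toy_preprocessor.transform(pd.DataFrame({"age": [35], "sex": ["Other"]}))
print(unseen)
```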

##### Building Pipelines for different models

###### Pipeline for KNeighborsClassifier

In [14]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

KNN_model = make_pipeline(preprocessor, KNeighborsClassifier())
KNN_model

###### Pipeline for LogisticRegression

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

LogReg_model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
LogReg_model

###### Pipeline for More Powerful Model - HistGradientBoostingClassifier

In [16]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline

HGBC_model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
HGBC_model

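The three pipelines above are candidates for predicting whether a salary is above 50K, and cross_validate can be used to assess how well each one generalizes. Only the logistic regression pipeline is fitted and scored below; the cells for the other two models are commented out.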

##### Split the dataset into Train_test sets

In [17]:
df.head()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [18]:
target.head()

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object

In [19]:
from sklearn.model_selection import train_test_split

df_train, df_test, target_train, target_test = train_test_split(
    df, target, test_size=0.25, random_state=42)

Be aware that we use train_test_split here for didactic purposes, in order to show the scikit-learn API. In a real setting one might prefer cross-validation, which also makes it possible to evaluate the uncertainty of our estimate of the model's generalization performance.
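
The point about uncertainty can be sketched as follows: a single split gives a single score, and that score depends on which rows happen to land in the test set. The example uses synthetic data from `make_classification` as a stand-in (an assumption, not the census dataset) and repeats the split with several seeds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the census dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

split_scores = []
for seed in range(5):
    # Same data, same model, different random train/test partition.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = LogisticRegression(max_iter=500).fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

# Each entry is the test accuracy for one random split.
print([round(s, 3) for s in split_scores])
```

Any spread across these values is not visible when only one split is used, which is the motivation for cross-validation.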

##### Training dataset - df_train, target_train - both go into `.fit`.

##### Test dataset - df_test, target_test - df_test goes into `.predict`, and the predictions are compared with target_test.

##### Both df_test and target_test go into `.score` to score the performance of the model.

##### Evaluation of each model

### For KNN

In [20]:
#KNN_model.fit(df_train, target_train)

In [21]:
#KNN_predict = KNN_model.predict(df_test)
#KNN_predict[:5]

### For LogisticRegression

In [22]:
LogReg_model.fit(df_train, target_train)

In [23]:
LogReg_predict = LogReg_model.predict(df_test)
LogReg_predict[:5]

array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)

### For HistGradientBoostingClassifier

In [24]:
#HGBC_model.fit(df_train, target_train)

In [25]:
#HGBC_predict = HGBC_model.predict(df_test)
#HGBC_predict[:5]

##### Evaluating and Scoring the LogisticRegression

In [26]:
LogReg_predict[:10] == target_test[:10]

print("Number of correct predictions using the logistic-regression model is: "
    f"{(target_test[:10] == LogReg_predict[:10]).sum()}/10")

Number of correct predictions using the logistic-regression model is: 9/10


In [27]:
print(f"The fraction of correct predictions using the logistic regression is {(target_test == LogReg_predict).mean()}")

The fraction of correct predictions using the logistic regression is 0.8575874211776268


In [28]:
# scoring the model

score = LogReg_model.score(df_test, target_test)
score
print(f"Accuracy of logistic regression: {score:.3f}")

Accuracy of logistic regression: 0.858
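
The last two cells report the same number because, for a classifier, `.score` returns the accuracy, which is the fraction of predictions equal to the true labels. A minimal sketch of that equivalence on synthetic stand-in data (an assumption, not the census dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the census dataset).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=500).fit(X_train, y_train)
pred = clf.predict(X_test)

manual_accuracy = (pred == y_test).mean()       # fraction of matching labels
score_accuracy = clf.score(X_test, y_test)      # predict + compare in one call
metric_accuracy = accuracy_score(y_test, pred)  # same metric from sklearn.metrics

print(np.isclose(manual_accuracy, score_accuracy))
print(np.isclose(score_accuracy, metric_accuracy))
```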


##### Evaluating the model with cross-validation

In [29]:
from sklearn.model_selection import cross_validate

val_result = cross_validate(LogReg_model, df_train, target_train, cv=5)
val_result

{'fit_time': array([1.07463002, 1.09485078, 1.07962203, 1.02211022, 0.96125174]),
 'score_time': array([0.03878903, 0.04000616, 0.03969979, 0.03476191, 0.03829908]),
 'test_score': array([0.8464583 , 0.85476385, 0.85107835, 0.84821185, 0.84957685])}

In [30]:
scores = val_result["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

The mean cross-validation accuracy is: 0.850 ± 0.003


We can observe that the Logistic Regression model reaches a mean cross-validation accuracy of about 0.85 with a standard deviation of about 0.003, which is consistent with the accuracy of 0.858 measured on the test set. It does so on a dataset with a modest number of features (e.g. fewer than 1000) and a mix of numerical and categorical variables.

This helps explain why logistic regression models are popular among data science practitioners who work with tabular data.