The goal is to compare the generalization performance, measured by accuracy, of the following predictive models using 10-fold cross-validation:

1. a **DummyClassifier** predicting the most frequent class using only the numerical features of the data set
2. a linear model composed of a **StandardScaler and a LogisticRegression** using only the **numerical features** of the data set
3. a linear model composed of **preprocessors (StandardScaler, OneHotEncoder) and a LogisticRegression** using **both numerical and categorical features** of the data set

We will compare the cross-validation test scores of all models fold-to-fold.
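
A fold-to-fold comparison only makes sense if every model is scored on the same splits. To our knowledge, passing an integer `cv` to `cross_validate` with a classifier uses a non-shuffled `StratifiedKFold`, so the folds are deterministic. A minimal sketch that checks this on synthetic data (the `make_classification` data is a stand-in, not the census data set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption: not the adult census data set).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Without shuffling, StratifiedKFold yields the same splits on every call.
cv = StratifiedKFold(n_splits=10)
first_pass = [test_idx for _, test_idx in cv.split(X, y)]
second_pass = [test_idx for _, test_idx in cv.split(X, y)]

same_folds = all(np.array_equal(a, b) for a, b in zip(first_pass, second_pass))
print("Identical test folds across calls:", same_folds)
```

Because the splits depend only on the target and the number of folds, fold `i` of one model can be compared with fold `i` of another as long as both use the same target and `cv` value.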

In [6]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.select_dtypes(["integer", "floating"])
data = data.drop(columns=["education-num"])

### Model 1: DummyClassifier


In [24]:
from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.dummy import DummyClassifier


# Baseline: always predict the most frequent class, ignoring the features.
model1 = make_pipeline(DummyClassifier(strategy="most_frequent"))
cv_results1 = cross_validate(model1, data, target, cv=10,
                             return_estimator=True)
print("Accuracy:", cv_results1['test_score'].mean())

Accuracy: 0.7607182352166999


In [8]:
target.value_counts()

 <=50K    37155
 >50K     11687
Name: class, dtype: int64
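
These class counts explain the baseline score: a classifier that always predicts the most frequent class is right exactly as often as that class occurs. A minimal sketch that rebuilds a target with the counts shown above (the feature matrix is a placeholder of zeros, since `DummyClassifier` ignores the features):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts copied from the value_counts() output above.
n_low, n_high = 37155, 11687
y = np.array([" <=50K"] * n_low + [" >50K"] * n_high)

# Placeholder features: DummyClassifier does not look at them.
X = np.zeros((y.size, 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
accuracy = baseline.score(X, y)
majority_share = n_low / (n_low + n_high)
print(f"Accuracy: {accuracy:.4f}, majority share: {majority_share:.4f}")
```

The majority share is about 0.7607, which agrees with the cross-validated accuracy of model 1 up to small differences in fold sizes.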

### Model 2: StandardScaler and LogisticRegression on numerical features

In [25]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale the numerical features, then fit a logistic regression.
model2 = make_pipeline(StandardScaler(), LogisticRegression())
cv_results2 = cross_validate(model2, data, target, cv=10,
                             return_estimator=True)
print("Accuracy:", cv_results2['test_score'].mean())

Accuracy: 0.7998445658834604


#### Comparing model 2 and model 1

In [17]:
m1 = cv_results1['test_score']
m2 = cv_results2['test_score']
print(f"Model 2 is better than model 1 in {(m2 > m1).sum()}/10 of the folds")

Model 2 is better than model 1 in 10/10 of the folds


### Model 3: preprocessed numerical and categorical features

In [18]:
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Reload the full data set from the same relative path used earlier.
adult_census = pd.read_csv("../datasets/adult-census.csv")
target2 = adult_census["class"]
data2 = adult_census.drop(columns=["class", "education-num"])

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data2)
categorical_columns = categorical_columns_selector(data2)

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)])


model3 = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model3
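
With 10 folds, a rare category can end up only in a test fold. Setting `handle_unknown="ignore"` lets the encoder cope with that by emitting an all-zero row instead of raising an error. A minimal sketch on a toy column (the values are illustrative and not read from the data set):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training and test columns (illustrative values only).
train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test = pd.DataFrame({"workclass": ["Private", "Never-worked"]})

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# The default output is a sparse matrix; convert it for display.
encoded = encoder.transform(test).toarray()
print(encoder.categories_)
print(encoded)
```

The second test row holds a category the encoder never saw during `fit`, so all of its one-hot columns are zero.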

In [26]:
cv_results3 = cross_validate(model3, data2, target2, cv=10,
                             return_estimator=True)
print("Accuracy:", cv_results3['test_score'].mean())

Accuracy: 0.8515007498426126


#### Comparing model 3 and model 2

In [20]:
m2 = cv_results2['test_score']
m3 = cv_results3['test_score']
print(f"Model 3 is better than model 2 in {(m3 > m2).sum()}/10 of the folds")

Model 3 is better than model 2 in 10/10 of the folds
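
Counting wins shows how consistently one model beats another, but not by how much. A small sketch that reports both, where `compare_folds` is a hypothetical helper and the score arrays are made-up values standing in for `cv_results2['test_score']` and `cv_results3['test_score']`:

```python
import numpy as np


def compare_folds(scores_a, scores_b):
    """Return (wins of b over a, mean per-fold gain, std of per-fold gain)."""
    diff = np.asarray(scores_b) - np.asarray(scores_a)
    return int((diff > 0).sum()), diff.mean(), diff.std()


# Hypothetical per-fold accuracies, for illustration only.
scores_a = np.array([0.80, 0.79, 0.81, 0.80, 0.78])
scores_b = np.array([0.85, 0.84, 0.80, 0.86, 0.85])

wins, mean_gain, std_gain = compare_folds(scores_a, scores_b)
print(f"Wins: {wins}/{len(scores_a)}, "
      f"mean gain: {mean_gain:.3f} +/- {std_gain:.3f}")
```

The per-fold difference is only meaningful when both score arrays come from the same splits, which is the case here because both calls use `cv=10` on the same target.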


### Conclusion 

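Model 3, which combines one-hot-encoded categorical features with scaled numerical features, outperforms model 2, which uses only scaled numerical features, in 10 out of 10 cross-validation folds (mean accuracy of about 0.852 versus 0.800). Model 2 in turn beats the most-frequent-class baseline (about 0.761) in every fold.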