# Assignment 3

In the third lab assignment, you are tasked with creating a machine learning pipeline using the [Adult dataset](https://archive.ics.uci.edu/dataset/2/adult).  
The objective of this dataset is to classify whether a particular individual has an income of more than $50,000 per year based on some features.

During the assignment, you are asked to create a complete machine learning pipeline for this dataset. Specifically, the assignment consists of two parts. In the first part of the assignment, you are asked to perform some exploratory data analysis along with some data preprocessing steps to transform your data into the right format. The second part of the assignment involves selecting the best model for your problem using a model selection method introduced in the second lab of the course.

You can download the dataset from the website.  
The downloaded folder contains:  
1. The data that you are going to use to develop your model (or development set): `adult.data`
2. The test set: `adult.test`
3. A description file for the dataset: `adult.names`

You have to complete the current notebook and submit it through the course website.

## Assignment Steps:

### A. Data Analysis and Preprocessing

1. **Load the development data and provide some basic information about the dataset using the `adult.data` file:**
   - Provide a short description of the dataset (a few sentences about the dataset and the features).
   - How many examples (rows) are in the dataset?
   - How many features do you have in your dataset?
   - What are the types of the different features?
   - Do you have any missing data in your dataset? Hint: check for different missing value identifiers.

2. **Transform your data into the right format to be handled by your model:**
    - Handle missing values, and explain your decision.
    - Handle categorical features, and explain your decision.
    - Do you need to scale your data to transform the different features into the same scale?
    
3. Load the test set and perform the same steps as before. Is the test set consistent with the training set?

### B. Model Selection

In the model selection phase, you need to choose the best model between k-nearest neighbors (KNN) and logistic regression classifiers for different model hyperparameters.

1. Explain the different algorithms (provide a few sentences about each model).
2. Define the different hyperparameters and their respective values that you want to explore for each algorithm. For logistic regression, vary only the `fit_intercept` hyperparameter and keep the penalty fixed to `None`, as the remaining hyperparameters will be explained later in the course.

   - Example: `logistic_regression = LogisticRegression(penalty=None, fit_intercept=setting)`

3. Choose a model selection method and explain your decision.
4. Perform model selection and find the best model. Choose between the "train/val" split and the cross-validation method.
5. Test your best algorithm on both the test set and the training set, and explain the results.
6. Compare your results with the benchmarks on the [website](https://archive.ics.uci.edu/dataset/2/adult).

Remember to create different cells and use Markdown to write text in the notebook.

##### Useful Material:
1. [Model Selection and KNN - Lab 2](https://github.com/olethrosdc/machine-learning-MSc/blob/main/src/Generalisation/Lab_2_Generalization.ipynb) 
2. [Data Preprocessing and Logistic Regression - Lab 3](https://github.com/olethrosdc/machine-learning-MSc/blob/main/src/LogisticRegression/Lab_3_Preprocessing_Logistic_Regression.ipynb)
3. Course Book (Chapters 1, 2, 4, 5)
4. [Markdown Guide](https://www.edlitera.com/en/blog/posts/jupyter-markdown-tutorial)


# **Solution**

### A. Data Analysis and Preprocessing

The `adult.names` file contains a description of the dataset.

More specifically, the objective of the dataset is to predict whether a particular individual has an income of more than 50,000 dollars per year based on some features.

The dataset has the following features; each is listed with a short description.

1. age: age of the individual.
    values: continuous.
2. workclass: type of employment.
    values: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3. fnlwgt: a weight assigned by the database publisher. People with similar demographic characteristics should have similar weights.
    values: continuous.
4. education: education level.
    values: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5. education-num: a numerical feature that further describes the education level, although the data description does not give enough information about it.
    values: continuous.
6. marital-status: marital status of the individual.
    values: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7. occupation: profession.
    values: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8. relationship: the individual's relationship within the family.
    values: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9. race: race of the individual.
    values: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. sex: sex of the individual.
    values: Female, Male.
11. capital-gain: the increase in the value of a capital asset when it is sold.
    values: continuous.
12. capital-loss: the loss incurred when a capital asset, such as an investment or real estate, decreases in value.
    values: continuous.
13. hours-per-week: number of hours worked per week.
    values: continuous.
14. native-country: country of origin.
    values: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [142]:
import pandas as pd
import numpy as np

features =[ "age",
            "workclass",
            "final_weight",  # originally it was called fnlwgt
            "education",
            "education_num",
            "marital_status",
            "occupation",
            "relationship",
            "race",
            "sex",
            "capital_gain",
            "capital_loss",
            "hours_per_week",
            "native_country"]

target = ["income_class"]

TRAIN_DATA_PATH = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
raw_train_data = pd.read_csv(
                         TRAIN_DATA_PATH,
                         names= features + target ,
                         skipinitialspace=True,  # Skip spaces after delimiter ", "
                         na_values = "?"
                        ) 

In [143]:
num_of_training_examples = raw_train_data.shape[0]
num_of_features = raw_train_data[features].shape[1]

In [144]:
print("Number of training examples:", num_of_training_examples)
print("Number of features:", num_of_features)

Number of training examples: 32561
Number of features: 14


In [145]:
raw_train_data

Unnamed: 0,age,workclass,final_weight,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [146]:
raw_train_data.dtypes

age                int64
workclass         object
final_weight       int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income_class      object
dtype: object

### Feature selection

When working with a dataset, it is essential to conduct a preliminary analysis of the features to determine their relevance for our application.

For instance, consider a weight feature within the dataset, describing demographic information provided by the data publisher. If this feature lacks sufficient details, it may be advisable to exclude it.

Furthermore, in an effort to prevent discrimination based on different nationalities, we may choose to exclude nationality-related features from the analysis. However, it's important to note that excluding features does not guarantee algorithmic fairness, as other features, such as education, could still be correlated with nationalities. 

[Algorithmic fairness](https://arxiv.org/pdf/2001.09784.pdf) is a subfield of machine learning that addresses concerns related to model fairness and bias.

In [147]:
features_model =[ 
                  "age",
                  "workclass",
                  "education",
                  "education_num",
                  "marital_status",
                  "occupation",
                  "relationship",
                  "race",
                  "sex",
                  "capital_gain",
                  "capital_loss",
                  "hours_per_week"
                 ]

#### Missing data

In [57]:
# percentage of missing data per feature
print("Missing percentage per column")
raw_train_data[features_model].isna().mean()

Missing percentage per column


age               0.000000
workclass         0.056386
education         0.000000
education_num     0.000000
marital_status    0.000000
occupation        0.056601
relationship      0.000000
race              0.000000
sex               0.000000
capital_gain      0.000000
capital_loss      0.000000
hours_per_week    0.000000
dtype: float64

In [149]:
# percentage of rows that contain at least one missing value
missing_row_percentage = raw_train_data[features_model].isna().any(axis=1).sum()/raw_train_data.shape[0]
print("Missing percentage per row:", missing_row_percentage*100,"%")

Missing percentage per row: 5.660145572924664 %
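Only about 5.7% of the rows contain a missing value, and the gaps occur only in categorical columns (`workclass` and `occupation`), so dropping those rows is one reasonable option. An alternative is to impute each gap with the most frequent value of its column. The sketch below contrasts the two options on a small hypothetical frame; `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two categorical columns with gaps, one numeric column.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
    "age": [39, 50, 38, 53],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna().reset_index(drop=True)

# Option 2: fill each gap with the most frequent value (mode) of its column.
modes = toy.mode().iloc[0]
imputed = toy.fillna(modes)

print(len(dropped))                 # number of rows that survive dropping
print(imputed.isna().sum().sum())   # number of missing values left after imputation
```

Dropping keeps only observed values but loses examples, while imputation keeps every example at the cost of introducing guessed values.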


#### Handle categorical features 

To handle the categorical features, we can apply label encoding to each of them.
To encode the features, we can use the unique values of each categorical feature as described in the `adult.names` file.

In [151]:
categ_dictionary = {
              
              "workclass": ["Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov", "Local-gov", "State-gov",
                            "Without-pay", "Never-worked"],
    
              "education":  ["Bachelors","Some-college","11th","HS-grad","Prof-school","Assoc-acdm","Assoc-voc",
                             "9th","7th-8th", "12th", "Masters","1st-4th","10th","Doctorate","5th-6th","Preschool"],

              "marital_status": [ "Married-civ-spouse", "Divorced", "Never-married", "Separated", "Widowed", 
                                  "Married-spouse-absent", "Married-AF-spouse"],
              
              "occupation": ["Tech-support","Craft-repair","Other-service","Sales","Exec-managerial",
                             "Prof-specialty","Handlers-cleaners","Machine-op-inspct","Adm-clerical",
                             "Farming-fishing","Transport-moving","Priv-house-serv", "Protective-serv",
                             "Armed-Forces"],

              "relationship": ["Wife", "Own-child", "Husband", "Not-in-family", "Other-relative", "Unmarried"],
    
              "race": [ "White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"],
    
              "sex": ["Female", "Male"]
}

In [152]:
def create_categorical_maping(categ_dictionary):
    map_dict = {}
    for feature in categ_dictionary.keys():
        tmp_list_categories = categ_dictionary[feature]
        tmp_dict = {cat:i for i, cat in enumerate(tmp_list_categories)}
        map_dict[feature] = tmp_dict
    return map_dict

In [153]:
label_enc_dict = create_categorical_maping(categ_dictionary)

In [154]:
label_enc_dict

{'workclass': {'Private': 0,
  'Self-emp-not-inc': 1,
  'Self-emp-inc': 2,
  'Federal-gov': 3,
  'Local-gov': 4,
  'State-gov': 5,
  'Without-pay': 6,
  'Never-worked': 7},
 'education': {'Bachelors': 0,
  'Some-college': 1,
  '11th': 2,
  'HS-grad': 3,
  'Prof-school': 4,
  'Assoc-acdm': 5,
  'Assoc-voc': 6,
  '9th': 7,
  '7th-8th': 8,
  '12th': 9,
  'Masters': 10,
  '1st-4th': 11,
  '10th': 12,
  'Doctorate': 13,
  '5th-6th': 14,
  'Preschool': 15},
 'marital_status': {'Married-civ-spouse': 0,
  'Divorced': 1,
  'Never-married': 2,
  'Separated': 3,
  'Widowed': 4,
  'Married-spouse-absent': 5,
  'Married-AF-spouse': 6},
 'occupation': {'Tech-support': 0,
  'Craft-repair': 1,
  'Other-service': 2,
  'Sales': 3,
  'Exec-managerial': 4,
  'Prof-specialty': 5,
  'Handlers-cleaners': 6,
  'Machine-op-inspct': 7,
  'Adm-clerical': 8,
  'Farming-fishing': 9,
  'Transport-moving': 10,
  'Priv-house-serv': 11,
  'Protective-serv': 12,
  'Armed-Forces': 13},
 'relationship': {'Wife': 0,
  'Ow

In [156]:
raw_train_data[features_model].replace(label_enc_dict)

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week
0,39,5.0,0,13,2,8.0,3,0,1,2174,0,40
1,50,1.0,0,13,0,4.0,2,0,1,0,0,13
2,38,0.0,3,9,1,6.0,3,0,1,0,0,40
3,53,0.0,2,7,0,6.0,2,4,1,0,0,40
4,28,0.0,0,13,0,5.0,0,4,0,0,0,40
...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,0.0,5,12,0,0.0,0,0,0,0,0,38
32557,40,0.0,3,9,0,7.0,2,0,1,0,0,40
32558,58,0.0,3,9,4,8.0,5,0,0,0,0,40
32559,22,0.0,3,9,2,8.0,1,0,1,0,0,20
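Label encoding assigns each category an integer, which implicitly imposes an order and a distance between categories (for example, `Private` becomes 0 and `State-gov` becomes 5). Distance-based models such as KNN and linear models such as logistic regression treat those integers as magnitudes. One-hot encoding is a common alternative that avoids this. The following is a minimal sketch with `pandas.get_dummies` on a hypothetical frame (`toy` and the shortened category lists are made up for illustration); it is not the encoding used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
    "age": [39, 28, 53],
})

# Fix the category lists up front so that train and test get identical columns,
# even if some category never appears in one of the two sets.
categories = {"sex": ["Female", "Male"], "race": ["White", "Black", "Other"]}
for column, values in categories.items():
    toy[column] = pd.Categorical(toy[column], categories=values)

# One 0/1 indicator column per category.
one_hot = pd.get_dummies(toy, columns=["sex", "race"], dtype=int)
print(one_hot.columns.tolist())
```

The price of one-hot encoding is a wider feature matrix, which can matter for KNN because distances become less informative as the number of dimensions grows.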


### Let's create a function with all the cleaning steps for our data
In general, it is more convenient to create a function that performs the data cleaning and preprocessing.

In [157]:
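def clean_data(raw_data, features, rename_dictionery):
    """Select features, drop rows with missing values and label-encode the categories."""
    # select features
    cleaned = raw_data[features]

    # drop rows with missing values and rebuild the index
    cleaned = cleaned.dropna()
    cleaned = cleaned.reset_index(drop=True)

    # label encode data using the mapping passed by the caller
    cleaned = cleaned.replace(rename_dictionery)
    return cleaned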

In [158]:
train_data = clean_data(raw_data = raw_train_data,
                        features = features_model + target,
                        rename_dictionery = label_enc_dict)

In [159]:
train_data

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,income_class
0,39,5,0,13,2,8,3,0,1,2174,0,40,<=50K
1,50,1,0,13,0,4,2,0,1,0,0,13,<=50K
2,38,0,3,9,1,6,3,0,1,0,0,40,<=50K
3,53,0,2,7,0,6,2,4,1,0,0,40,<=50K
4,28,0,0,13,0,5,0,4,0,0,0,40,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
30713,27,0,5,12,0,0,0,0,0,0,0,38,<=50K
30714,40,0,3,9,0,7,2,0,1,0,0,40,>50K
30715,58,0,3,9,4,8,5,0,0,0,0,40,<=50K
30716,22,0,3,9,2,8,1,0,1,0,0,20,<=50K


### Data Scaling
As already mentioned, different algorithms are affected by differences in feature scale.  
So we have to normalise (scale) our data.

The most common methods for scaling a dataset are listed below.
1. min-max scaling
    * $x_{scaled} = (x - min) / (max-min)$
    * $x_{scaled} \in [0,1]$
    * not a good technique when you have outliers
    
    
2. standard scaling
   * $x_{scaled} = (x - mean) / std$
   * $mean(x_{scaled}) = 0$
   * $std(x_{scaled}) = 1$
   * less sensitive to outliers than min-max scaling, although the mean and standard deviation are still affected by them
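As a small worked example of the two formulas, the sketch below scales a hypothetical one-dimensional feature (`x` is made up for illustration) with plain NumPy. A single large value squeezes the min-max scaled values of the remaining points towards 0. Note that `np.std` uses the population standard deviation by default, which is also what `StandardScaler` uses.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # the last value is an outlier

# Min-max scaling: maps the smallest value to 0 and the largest to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: zero mean and unit standard deviation.
x_standard = (x - x.mean()) / x.std()

print(x_minmax.round(3))
print(x_standard.round(3))
```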

In [160]:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
standard_scaler.fit(train_data[features_model])

train_data_norm = standard_scaler.transform(train_data[features_model])

# sklearn returns a NumPy array, so we convert it back to a pandas DataFrame
train_data_norm = pd.DataFrame(train_data_norm, columns=features_model)
train_data_norm[target[0]] = train_data[target[0]]

In [161]:
train_data_norm

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,income_class
0,0.042416,2.921147,-0.981585,1.119909,0.805738,1.097488,0.492328,-0.376320,0.691144,0.142438,-0.219179,-0.079207,<=50K
1,0.880958,0.181057,-0.981585,1.119909,-0.898501,-0.247158,-0.320798,-0.376320,0.691144,-0.147516,-0.219179,-2.331988,<=50K
2,-0.033815,-0.503966,-0.111430,-0.441111,-0.046381,0.425165,0.492328,-0.376320,0.691144,-0.147516,-0.219179,-0.079207,<=50K
3,1.109651,-0.503966,-0.401482,-1.221621,-0.898501,0.425165,-0.320798,2.944026,0.691144,-0.147516,-0.219179,-0.079207,<=50K
4,-0.796125,-0.503966,-0.981585,1.119909,-0.898501,0.089003,-1.947051,2.944026,-1.446877,-0.147516,-0.219179,-0.079207,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
30713,-0.872356,-0.503966,0.468674,0.729654,-0.898501,-1.591804,-1.947051,-0.376320,-1.446877,-0.147516,-0.219179,-0.246080,<=50K
30714,0.118647,-0.503966,-0.111430,-0.441111,-0.898501,0.761326,-0.320798,-0.376320,0.691144,-0.147516,-0.219179,-0.079207,>50K
30715,1.490806,-0.503966,-0.111430,-0.441111,2.509978,1.097488,2.118581,-0.376320,-1.446877,-0.147516,-0.219179,-0.079207,<=50K
30716,-1.253512,-0.503966,-0.111430,-0.441111,0.805738,1.097488,-1.133924,-0.376320,0.691144,-0.147516,-0.219179,-1.747934,<=50K


## B. Model selection

1. **k-Nearest Neighbors**: KNN is a simple and intuitive machine learning algorithm used for classification and regression. It makes predictions based on labels of the 'k' nearest data points from the training set.

2. **Logistic Regression**: Logistic Regression is a linear machine learning method used for binary classification. During training, the model adjusts its parameters to find a hyperplane that separates the two classes as well as possible based on the features.
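To make the two prediction rules concrete, here is a minimal from-scratch sketch, assuming Euclidean distance and majority voting for KNN, and hand-picked (not fitted) weights for logistic regression. The helper names `knn_predict` and `logistic_predict_proba` and the toy data are hypothetical; the notebook itself uses the sklearn implementations.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Majority vote among the k training points closest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                     # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


def logistic_predict_proba(x, w, b=0.0):
    """Probability of the positive class: sigmoid of a linear score."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))


# Hypothetical 2-D toy data with two well-separated groups.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"])
x_low, x_high = np.array([0.5, 0.5]), np.array([5.5, 5.5])

print(knn_predict(X_train, y_train, x_low), knn_predict(X_train, y_train, x_high))

# Hand-picked (not fitted) weights; b plays the role of the intercept.
w = np.array([1.0, 1.0])
print(logistic_predict_proba(x_low, w, b=-5.0))  # boundary shifted away from the origin
print(logistic_predict_proba(x_low, w, b=0.0))   # no intercept: boundary passes through the origin
```

The last two lines hint at what `fit_intercept` controls: without an intercept the decision boundary is forced through the origin, which can misclassify points that a shifted boundary would handle correctly.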

Cross-validation provides a more reliable estimate of a model's performance by testing it on multiple subsets of the data. This helps to mitigate the risk of getting a good or bad result by chance when splitting data into a single training and validation set.


In general, cross-validation makes efficient use of the data when the dataset is limited. However, it does come with additional computational requirements. So, if we have the computational resources available, performing model selection with cross-validation is the best choice.
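The idea behind k-fold cross-validation can be sketched as an explicit loop: split the data into folds, train on all but one fold, validate on the held-out fold, and average the scores. The example below is a simplified illustration on hypothetical synthetic data (two Gaussian blobs). It uses plain `KFold` for readability, whereas `cross_val_score` uses stratified folds for classifiers when `cv` is an integer.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical two-class toy data: two Gaussian blobs.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

fold_scores = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in splitter.split(X):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])                     # train on 4 folds
    fold_scores.append(model.score(X[val_idx], y[val_idx]))   # accuracy on the held-out fold

print(np.mean(fold_scores), np.std(fold_scores))
```

Reporting the standard deviation next to the mean helps to judge whether the difference between two candidate models is larger than the fold-to-fold variation.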

In [162]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [75]:
different_k = [1, 10, 20, 30, 50, 200]
different_fit_intercept = [True, False]

In [163]:
from sklearn.model_selection import cross_val_score

number_of_folder=10

cv_average_results = {}
for k in different_k:        
    cv_scores = cross_val_score(estimator=KNeighborsClassifier(n_neighbors=k),
                                X=train_data_norm[features_model],
                                y=train_data_norm[target[0]],
                                cv=number_of_folder,
                                n_jobs=-1)
    cv_average_results[f"knn-k={k}"] = np.mean(cv_scores)

In [164]:
for fit_intercept in different_fit_intercept:        
    cv_scores = cross_val_score(estimator=LogisticRegression(penalty=None, fit_intercept=fit_intercept),
                                X=train_data_norm[features_model],
                                y=train_data_norm[target[0]],
                                cv=number_of_folder,
                                n_jobs=-1)
    cv_average_results[f"logistic_regresion-b={fit_intercept}"] = np.mean(cv_scores)

In [165]:
performance = pd.DataFrame([cv_average_results], index = ["average_cv_performance"]).T
performance

Unnamed: 0,average_cv_performance
knn-k=1,0.800931
knn-k=10,0.837197
knn-k=20,0.837001
knn-k=30,0.838141
knn-k=50,0.834234
knn-k=200,0.828667
logistic_regresion-b=True,0.837294
logistic_regresion-b=False,0.820138


We can observe that the different models yield similar results. Additionally, the fact that logistic regression performs about as well as the best KNN model suggests that a linear decision boundary captures much of the structure in the data, although an accuracy of about 84% also shows that the classes are not perfectly linearly separable. In the case of k-nearest neighbors (KNN), we notice that performance degrades after k=30, which appears to be the best choice among the values we tried.

In [167]:
best_model = performance.idxmax()["average_cv_performance"]
best_model

'knn-k=30'

In [168]:
best_k = int(best_model.split("=")[-1])

### Retrain using all available data

In [169]:
final_model = KNeighborsClassifier(n_neighbors = best_k)
final_model = final_model.fit(train_data_norm[features_model], train_data_norm[ target[0] ] )

### Estimate future performance based on the test set

In [170]:
TEST_DATA_PATH = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
raw_test_data = pd.read_csv(
                         TEST_DATA_PATH,
                         names= features + target ,
                         skipinitialspace=True,  # Skip spaces after delimiter ","
                         na_values = "?"
                        ) 

In [171]:
raw_test_data

Unnamed: 0,age,workclass,final_weight,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_class
0,|1x3 Cross validator,,,,,,,,,,,,,,
1,25,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K.
2,38,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K.
3,28,Local-gov,336951.0,Assoc-acdm,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K.
4,44,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16277,39,Private,215419.0,Bachelors,13.0,Divorced,Prof-specialty,Not-in-family,White,Female,0.0,0.0,36.0,United-States,<=50K.
16278,64,,321403.0,HS-grad,9.0,Widowed,,Other-relative,Black,Male,0.0,0.0,40.0,United-States,<=50K.
16279,38,Private,374983.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K.
16280,44,Private,83891.0,Bachelors,13.0,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455.0,0.0,40.0,United-States,<=50K.


In [172]:
# skip the first row, which is not a data record
raw_test_data = pd.read_csv(
                         TEST_DATA_PATH,
                         names= features + target ,
                         skipinitialspace=True,  # Skip spaces after delimiter ","
                         na_values = "?", 
                         skiprows=1,
                        ) 
raw_test_data

Unnamed: 0,age,workclass,final_weight,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
16277,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
16278,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
16279,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


In [86]:
# inconsistency in the class labels: the test labels end with a period
raw_test_data["income_class"]

0        <=50K.
1        <=50K.
2         >50K.
3         >50K.
4        <=50K.
          ...  
16276    <=50K.
16277    <=50K.
16278    <=50K.
16279    <=50K.
16280     >50K.
Name: income_class, Length: 16281, dtype: object

In [174]:
fix_income_class = {"<=50K.": "<=50K",
                    ">50K.": ">50K"}
raw_test_data = raw_test_data.replace(fix_income_class)
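The trailing period could also be stripped with a string method instead of an explicit mapping, which avoids listing every label by hand. A small sketch on a hypothetical series (`labels` is made up for illustration):

```python
import pandas as pd

# Hypothetical labels in the format used by the test file.
labels = pd.Series(["<=50K.", ">50K.", "<=50K."])

# Remove the trailing period so the labels match the training set.
cleaned_labels = labels.str.rstrip(".")
print(cleaned_labels.tolist())
```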

In [175]:
raw_test_data

Unnamed: 0,age,workclass,final_weight,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
16277,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K
16278,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
16279,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [176]:
test_data = clean_data(raw_data = raw_test_data,
                       features=features_model+target,
                       rename_dictionery=label_enc_dict)

In [177]:
test_data

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,income_class
0,25,0,2,7,2,7,1,4,1,0,0,40,<=50K
1,38,0,3,9,0,9,2,0,1,0,0,50,<=50K
2,28,4,5,12,0,12,2,0,1,0,0,40,>50K
3,44,0,1,10,0,7,2,4,1,7688,0,40,>50K
4,34,0,12,6,2,2,3,0,1,0,0,30,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15310,33,0,0,13,2,5,1,0,1,0,0,40,<=50K
15311,39,0,0,13,1,5,3,0,0,0,0,36,<=50K
15312,38,0,0,13,0,5,2,0,1,0,0,50,<=50K
15313,44,0,0,13,1,8,1,1,1,5455,0,40,<=50K


### Estimate performance on the test set

In [178]:
# note that we must not fit the scaler again: we reuse the statistics learned on the training set
test_data_norm = standard_scaler.transform(test_data[features_model])
test_data_norm = pd.DataFrame(test_data_norm, columns=features_model)
test_data_norm[target[0]] = test_data[target[0]]
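The reason for reusing the fitted scaler is that the test set has to be transformed with exactly the same mean and standard deviation as the training set; otherwise the model would see features on a different scale than the one it was trained on, and information from the test set would leak into preprocessing. A minimal sketch with hypothetical numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler().fit(train)   # statistics come from the training set only
test_scaled = scaler.transform(test)   # reuse the training mean and standard deviation

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

As a consequence, the scaled test features do not necessarily have zero mean and unit variance, which is expected.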

In [179]:
test_data_norm

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,income_class
0,-1.024819,-0.503966,-0.401482,-1.221621,0.805738,0.761326,-1.133924,2.944026,0.691144,-0.147516,-0.219179,-0.079207,<=50K
1,-0.033815,-0.503966,-0.111430,-0.441111,-0.898501,1.433649,-0.320798,-0.376320,0.691144,-0.147516,-0.219179,0.755156,<=50K
2,-0.796125,2.236124,0.468674,0.729654,-0.898501,2.442134,-0.320798,-0.376320,0.691144,-0.147516,-0.219179,-0.079207,>50K
3,0.423571,-0.503966,-0.691533,-0.050856,-0.898501,0.761326,-0.320798,2.944026,0.691144,0.877859,-0.219179,-0.079207,>50K
4,-0.338739,-0.503966,2.499036,-1.611876,0.805738,-0.919481,0.492328,-0.376320,0.691144,-0.147516,-0.219179,-0.913570,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15310,-0.414970,-0.503966,-0.981585,1.119909,0.805738,0.089003,-1.133924,-0.376320,0.691144,-0.147516,-0.219179,-0.079207,<=50K
15311,0.042416,-0.503966,-0.981585,1.119909,-0.046381,0.089003,0.492328,-0.376320,-1.446877,-0.147516,-0.219179,-0.412953,<=50K
15312,-0.033815,-0.503966,-0.981585,1.119909,-0.898501,0.089003,-0.320798,-0.376320,0.691144,-0.147516,-0.219179,0.755156,<=50K
15313,0.423571,-0.503966,-0.981585,1.119909,-0.046381,1.097488,-1.133924,0.453767,0.691144,0.580036,-0.219179,-0.079207,<=50K


In [180]:
from sklearn.metrics import accuracy_score

predictions = final_model.predict(test_data_norm[features_model])

test_acc = accuracy_score(y_true=test_data_norm[target[0]],
                          y_pred=predictions)

In [182]:
print("Accuracy of final model on test set:",test_acc)

Accuracy of final model on test set: 0.8382631407117206


# Extra: Sklearn Pipeline to perform model selection

Sklearn has a lot of additional useful tools to create machine learning pipelines and perform model selection.  
In the following section we give a quick demonstration of them.

## 1. Single model selection
The method that searches over all the different models and their hyperparameters is called grid search or exhaustive search.  
Instead of defining custom training loops, we can use sklearn's method to efficiently implement grid search on different pipelines.
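"Exhaustive" means that every combination of the listed hyperparameter values is evaluated. The sketch below enumerates such combinations with sklearn's `ParameterGrid`; the grid itself is a hypothetical example with two hyperparameters of `KNeighborsClassifier` and is not the grid used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid with two hyperparameters of KNeighborsClassifier.
grid = {"n_neighbors": [1, 10, 30], "weights": ["uniform", "distance"]}

# Every combination of the listed values: 3 * 2 candidates.
candidates = list(ParameterGrid(grid))
print(len(candidates))
for params in candidates:
    print(params)
```

The number of candidates grows multiplicatively with every hyperparameter added, and each candidate is trained once per cross-validation fold.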

In [94]:
from sklearn.model_selection import GridSearchCV

In [183]:
parameters = {"n_neighbors":different_k}
grid_search = GridSearchCV(KNeighborsClassifier(), parameters, cv =10)

In [184]:
parameters

{'n_neighbors': [1, 10, 20, 30, 50, 200]}

In [185]:
grid_search.fit(train_data_norm[features_model],train_data_norm[target[0]])

In [186]:
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.020175,0.004076,0.181139,0.002795,1,{'n_neighbors': 1},0.79362,0.798503,0.804362,0.799154,0.801758,0.800781,0.803385,0.805339,0.808857,0.793553,0.800931,0.004666,6
1,0.018654,0.0001,0.309462,0.00275,10,{'n_neighbors': 10},0.827474,0.833008,0.84375,0.822266,0.837565,0.830404,0.845703,0.85026,0.844676,0.836861,0.837197,0.008505,2
2,0.018704,0.000205,0.370817,0.001949,20,{'n_neighbors': 20},0.822266,0.833008,0.839518,0.831055,0.835612,0.835938,0.84375,0.848633,0.843373,0.836861,0.837001,0.007066,3
3,0.018808,0.000482,0.415738,0.002874,30,{'n_neighbors': 30},0.827148,0.834635,0.836589,0.83138,0.835612,0.838542,0.844727,0.850911,0.844025,0.837838,0.838141,0.00655,1
4,0.018622,0.000165,0.480974,0.001752,50,{'n_neighbors': 50},0.823568,0.833008,0.835612,0.830729,0.832357,0.83431,0.835938,0.847005,0.838815,0.831,0.834234,0.005748,4
5,0.018679,9.2e-05,0.761572,0.011751,200,{'n_neighbors': 200},0.817383,0.823893,0.832682,0.819661,0.830078,0.83724,0.829753,0.838542,0.827743,0.829697,0.828667,0.006507,5


In [188]:
grid_search.best_params_

{'n_neighbors': 30}

## 2. Define pipelines

In [107]:
from sklearn.pipeline import Pipeline

ml_pipe = Pipeline(
                    [
                    ('scaler', StandardScaler()),
                    ('classifier', KNeighborsClassifier())
                    ]
                )

In [108]:
ml_pipe.fit(train_data[features_model], train_data[target[0]])

In [189]:
predictions = ml_pipe.predict(test_data[features_model])
test_acc = accuracy_score(y_true=test_data[target[0]],
                          y_pred=predictions)

In [190]:
print("accuracy of a pipeline",test_acc)

accuracy of a pipeline 0.8286647078028077
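To make explicit what the pipeline does, the sketch below compares it with the manual version: fit the scaler on the training data only, transform both splits, then fit and apply the classifier. It uses a synthetic dataset from `make_classification` as a stand-in for the Adult data, so the variable names (`X_train`, `X_test`, `pipe`, `knn`) are illustrative and not the ones used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any numeric classification set works here).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline version: scaling and classification in one estimator.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("classifier", KNeighborsClassifier())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual version: the scaler is fit on the training split only,
# then the same fitted scaler transforms the test split.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
manual_pred = knn.predict(scaler.transform(X_test))

same = bool(np.array_equal(pipe_pred, manual_pred))
print(same)
```

Both versions should give identical predictions; the pipeline simply removes the risk of forgetting a step or fitting the scaler on the test data.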


## 3 Sklearn pipeline + gridsearch

In [191]:
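# Step parameters are addressed as <step name>__<parameter name>
parameters = {"classifier__n_neighbors": different_k}
# cv is left at its default (5-fold), unlike the 10-fold search above
grid_search = GridSearchCV(ml_pipe, parameters, n_jobs=-1)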

In [192]:
parameters

{'classifier__n_neighbors': [1, 10, 20, 30, 50, 200]}

In [193]:
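# Note: assuming train_data_norm is the already-scaled copy created earlier,
# the pipeline's StandardScaler rescales it again inside each fold;
# passing the raw train_data would be the more usual choice here.
grid_search.fit(train_data_norm[features_model], train_data_norm[target[0]])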

In [194]:
grid_search.best_params_

{'classifier__n_neighbors': 30}

In [195]:
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.036137,0.011344,0.782562,0.070113,1,{'classifier__n_neighbors': 1},0.796387,0.800618,0.799479,0.802214,0.797493,0.799238,0.002097,6
1,0.036192,0.006742,1.400751,0.025249,10,{'classifier__n_neighbors': 10},0.829915,0.830892,0.833496,0.844538,0.839167,0.835602,0.005505,3
2,0.027993,0.003777,1.448927,0.111072,20,{'classifier__n_neighbors': 20},0.826497,0.833171,0.836426,0.846655,0.838841,0.836318,0.006627,2
3,0.03314,0.006293,1.548224,0.069313,30,{'classifier__n_neighbors': 30},0.828939,0.831706,0.836914,0.84519,0.839329,0.836416,0.005724,1
4,0.031462,0.000237,1.726712,0.104724,50,{'classifier__n_neighbors': 50},0.828288,0.831055,0.8361,0.841771,0.838841,0.835211,0.004944,4
5,0.035868,0.008918,2.161008,0.190482,200,{'classifier__n_neighbors': 200},0.818685,0.82487,0.831055,0.832818,0.829237,0.827333,0.005068,5


## 4 Compare different pipelines

In [199]:
from sklearn.base import BaseEstimator
from sklearn.preprocessing import MinMaxScaler

# Both steps start as "passthrough" placeholders; the grids below fill them in.
ml_pipe_1 = Pipeline([('scaler', "passthrough"),
                      ('classifier', "passthrough")])

params = [
    # First grid: KNN with two candidate scalers
    {
        "scaler": [StandardScaler(), MinMaxScaler()],
        "classifier": [KNeighborsClassifier()],
        "classifier__n_neighbors": different_k,
    },
    # Second grid: logistic regression without regularization
    {
        "scaler": [StandardScaler()],
        "classifier": [LogisticRegression(penalty=None)],
        "classifier__fit_intercept": different_fit_intercept,
    },
]

grid_search = GridSearchCV(ml_pipe_1, params, n_jobs=-1, scoring="accuracy", cv=10)
grid_search.fit(train_data[features_model], train_data[target[0]])

In [200]:
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier,param_classifier__n_neighbors,param_scaler,param_classifier__fit_intercept,params,split0_test_score,...,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.038095,0.009701,0.406493,0.036925,KNeighborsClassifier(n_neighbors=30),1.0,StandardScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.791016,...,0.79362,0.800781,0.804036,0.800456,0.801758,0.807229,0.793553,0.799824,0.005339,13
1,0.037365,0.006008,0.187769,0.052625,KNeighborsClassifier(n_neighbors=30),1.0,MinMaxScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.783203,...,0.783203,0.794922,0.794271,0.793294,0.791667,0.801042,0.788994,0.792142,0.006495,14
2,0.036755,0.010178,0.64479,0.081424,KNeighborsClassifier(n_neighbors=30),10.0,StandardScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.822266,...,0.82194,0.838867,0.833984,0.845052,0.848307,0.844676,0.832628,0.837164,0.009131,4
3,0.034572,0.00351,0.309497,0.041303,KNeighborsClassifier(n_neighbors=30),10.0,MinMaxScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.811523,...,0.820638,0.833984,0.830078,0.836263,0.840169,0.831651,0.82579,0.829676,0.008354,8
4,0.035699,0.006253,0.792543,0.073835,KNeighborsClassifier(n_neighbors=30),20.0,StandardScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.818685,...,0.832031,0.839518,0.839193,0.84375,0.847331,0.845002,0.833605,0.837685,0.007919,2
5,0.03261,0.003037,0.385646,0.05157,KNeighborsClassifier(n_neighbors=30),20.0,MinMaxScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.813477,...,0.829753,0.829427,0.832682,0.840495,0.842448,0.834907,0.824813,0.831174,0.00835,7
6,0.038471,0.004572,0.865268,0.047538,KNeighborsClassifier(n_neighbors=30),30.0,StandardScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.822917,...,0.831706,0.837565,0.843099,0.845703,0.847656,0.845979,0.835233,0.838694,0.007186,1
7,0.033523,0.00253,0.451726,0.029507,KNeighborsClassifier(n_neighbors=30),30.0,MinMaxScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.816406,...,0.823568,0.832031,0.833008,0.834635,0.842773,0.841094,0.824487,0.831662,0.007749,6
8,0.037452,0.005987,0.986883,0.057247,KNeighborsClassifier(n_neighbors=30),50.0,StandardScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.820638,...,0.829753,0.835286,0.835286,0.838867,0.845378,0.838489,0.828395,0.834625,0.006547,5
9,0.037429,0.007583,0.536585,0.055022,KNeighborsClassifier(n_neighbors=30),50.0,MinMaxScaler(),,{'classifier': KNeighborsClassifier(n_neighbor...,0.817057,...,0.820638,0.831706,0.831055,0.835938,0.840495,0.831325,0.824813,0.829676,0.006801,9


In [201]:
grid_search.best_params_

{'classifier': KNeighborsClassifier(n_neighbors=30),
 'classifier__n_neighbors': 30,
 'scaler': StandardScaler()}
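Once the search has picked a configuration, the usual last step is to score it once on the held-out test set. With the default `refit=True`, `GridSearchCV` refits the best pipeline on all the training data, and calling `predict` on the search object delegates to that refitted `best_estimator_`. The sketch below shows this on synthetic data with a smaller grid; the data, the grid values and the names `search` and `param_grid` are assumptions for illustration, and `LogisticRegression()` is used with its default settings rather than the configuration above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the development and test sets.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", "passthrough"), ("classifier", "passthrough")])
param_grid = [
    # 2 scalers x 3 values of k = 6 candidates
    {"scaler": [StandardScaler(), MinMaxScaler()],
     "classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [1, 10, 30]},
    # 1 scaler x 2 intercept settings = 2 candidates
    {"scaler": [StandardScaler()],
     "classifier": [LogisticRegression()],
     "classifier__fit_intercept": [True, False]},
]

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# predict() on the search object uses the refitted best pipeline.
test_pred = search.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)
n_candidates = len(search.cv_results_["params"])

print(search.best_params_)
print("candidates evaluated:", n_candidates)
print("test accuracy of the best pipeline:", test_acc)
```

The test set is touched only once, after model selection has finished, so the reported accuracy is not biased by the search itself.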