# Homework 2, Problem 4

The goal of this problem is for you to try and classify whether or not an individual is likely to make more or less than 50K per year.  Carry out this task.  Try at least five machine algorithms, report precision, recall and f1 score on a test set.

For each of the parts, report your preformance in terms not of just numbers but in terms of graphs. When you have training and validation data, please show the curves as the training progresses. You should know when you are overfitting or underfitting. Don't just report bare numbers. **You are free to add implmentation or markdown cells to make your notebook clearer!!**

## Data:

The following dataset was taken from the first dataset repository: http://archive.ics.uci.edu/ml/datasets/Adult

As the original task of the dataset lays out, 
Please note:
* the continuous variable fnlwgt represents final weight, which is the number of units in the target population that the responding unit represents.

## Part 1: Dealing with Missing Values

What should you do about dealing with missing values - do you just drop those rows?

One of the most common problems we come accross in working with data "in the wild" is missing data. Often we will have observations (rows) that have only some of the needed attributes. Different rows will have different attributes missing. There are a number of strategies for dealing with the missing values. Clearly one could be dropping the column (attribute), or row (observation). Unfortuntely if you drop columns you may lose critical information that is helpful for classification and may be present in most (many) of the rows. You can drop rows but if many rows have at least one missing value, you may loose too much data. Do you try to impute (i. e. fill in) the missing data?  If so how?  

Explain why you chose the strategy you did.

*Hint - '?' denotes a missing value.*

### Some possible strategies for dealing with missing data

1. Whenever there is pleanty of data and very little missing data, you should consider dropping rows and/or columns. This may introduce some bias in the data but again, if the problem is limited to a very few rows or columns, it is easy in training to reproduce.

2. Fill with fixed value using sklearn.impute.SimpleImputer.
    a. 'constant' 0. Rarely a good idea but sometimes, if we can assume that when it is missing it is basically 0, this might be a good idea. For example a data may list number of house fires in a zip code and a missing value just means none.
    b. 'mean' if the data is numeric, the mean is meaningfull.
    c. 'median' may be more sensible if the data is integer or ordered. When the mean and median are very different it is important to understand what a "typical" example might mean. When considering "income", for example, a few large outliers will mess up the mean.
    d. "most_frequent' when you have categorical (nominal) labels, mean and median don't make any sense. Most probable label is what you need to use. This is also known as "mode".

3. sklearn.impute.MissingIndicator: Sometimes the fact that a value is missing, is itself an important indicator. One can create a new feature/attribute that indicates a certain attribute is missing. If you later build a classifier by hand you can explicitly wieght each variable using the missing variable weights so that for that example (row) that attribute won't contribute to the classifier. In a deep neural network it is possible that the network can learn to do that automatically.

4. One can use the sklearn.impute.KNNImputer which will look for rows to fill in the data.

5. Fill with sklearn.impute.IterativeImputer scikit-learn provides a sophisticated imputation strategy. You can read up on this in the documentation, but it will fix on of the columns (attributes), and try to use the other features to predict similar to KNN but more sophisticated.

6. Train a classifier: You can build your own classifier using machine learning. This is kind of a problem within a problem but if done correctly, it has the potential to be more accurate than a simpler method. Of course, if done badly it could be worse.

7. Manually impute the missing values. You may know enough about the problem to build an ad-hoc way to fill in the missing values for each column in a way that makes the most sense. This almost always requires a great deal of domain expertise. 



In [125]:
# Add your code for filling in the data here. Please end by using the appropriate method for data filling.
# to show the amount of missing data (which in the end should not be any since you dropped or filled in data)


import pandas as pd

columns = [
    "age",
    "work_class",
    "fnlwgt",
    "education",
    "education_num",
    "marital_status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital_gain",
    "capital_loss",
    "hours_per_week",
    "native_country",
    "target"
]
df = pd.read_csv("C:/Users/Cemil Turhan/Downloads/adult.data", names=columns)
for column in df.columns:
    if df[column].dtype == "object":
        df[column] = df[column].str.strip()
#print(df.describe)
#print(df.info)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
work_class        32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
target            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


## Part 2: Train Test Validate Split

Ideally you will split the data and use the train data filling in proceedure for the test data. Because this is expensive you can do experiments initially to see if this matters. Just keep carefully in mind what you will know and what you can't know during the test evaluation. Both sklearn and tensorflow provide facilities for train test split. Take your pick.

At the end of this you should have a train, validate and test split. In the next part you are going to do preliminary testing of your model with your train+validation sets to get some idea of good canditates for hyperparameters. Later you will merge your training and validation set and resplit them up using cross validation to get better estimates for setting hyper-parameters

**NOTE: It is very important that you record very carefully any parameters you have for filling in data in step 1. For example if you you build a "fit" using some training data, later you will need to use the this "fit" to transform the data, you can not re-fit on new data. In other words if your "pipline" in training takes the mean of the input to fill in the first column, you need to fill with exactly that number, when you get new data for testing. Don't take the mean of the test data.**

In [150]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.dummy import DummyClassifier

In [127]:
# Fill your solution for a train-test split in here.


#convert all numericals to float
df.age = df.age.astype(float)
df.fnlwgt = df.fnlwgt.astype(float)
df.education_num = df.education_num.astype(float)
df.hours_per_week = df.hours_per_week.astype(float)

#has many zeros, not meaningful
df.drop("capital_loss", axis=1, inplace=True,)
df.drop("capital_gain", axis=1, inplace=True,)


num_quest=df[df["work_class"]=="?"]
#print(num_quest)
df.work_class.value_counts() #Private is the most seen one   

df.work_class.replace("?", "Private")



0               State-gov
1        Self-emp-not-inc
2                 Private
3                 Private
4                 Private
5                 Private
6                 Private
7        Self-emp-not-inc
8                 Private
9                 Private
10                Private
11              State-gov
12                Private
13                Private
14                Private
15                Private
16       Self-emp-not-inc
17                Private
18                Private
19       Self-emp-not-inc
20                Private
21                Private
22            Federal-gov
23                Private
24                Private
25              Local-gov
26                Private
27                Private
28                Private
29                Private
               ...       
32531             Private
32532             Private
32533             Private
32534             Private
32535             Private
32536             Private
32537             Private
32538       

In [128]:
num_quest=df[df["fnlwgt"]=="?"]  #checked fnlwgt, education, education_num, marital_status, relationship, race, sex,capital_gain, capital_loss, hours_per_week
#print(num_quest)

num_quest=df[df["occupation"]=="?"]
#print(num_quest)
df.occupation.value_counts()    

df=df[df["occupation"]!="?"]  #I drop the null values in occupation



  result = method(y)


In [129]:
num_quest=df[df["native_country"]=="?"]
#print(num_quest)
df.native_country.value_counts()
df.native_country.replace("?", "United-States") #United_States is the most seen one


df["target"] = df["target"].map({ "<=50K": -1, ">50K": 1 })

#one-hot encoding for non-numerical columns
df = pd.get_dummies(df, columns=[
    "work_class", "education", "marital_status", "occupation", "relationship",
    "race", "sex", "native_country",
])


In [131]:
# split features and target variable
data_x = df.iloc[:,:-1]
data_y = df.iloc[:, -1]

X_train, X_test, y_train, y_test= train_test_split(data_x, data_y, test_size=0.25, stratify=data_y, random_state=0) 
X_tr, X_val, y_tr, y_val= train_test_split(X_train, y_train, test_size=0.25,  stratify=y_train, random_state=42) 



## Part 3: Build different five different variations sklearn models and a Dummy

You will need to use a baseline classifier. Sklearn has sklearn.dummy.DummyClassifier which you can use as a benchmark for a braindead classifier. Pick 5 classifiers including simple ones like Knn or linear like logistic regression, and sophistocated ones like random forest and svm. Use the training and validation data from above (don't look at the testing data). Get a baseline for performance.

Create bar graphs using different metrics including accuracy, recall, precision and f1 score accross the different algorithms.

In [151]:
# Get baseline results here with logisic regression and random forest

# Set up your models here

def model_one():
    logreg=LogisticRegression()
    fit=logreg.fit(X_tr, y_tr)
    y_pred=logreg.predict(X_val)
    score=logreg.score(X_val,y_val)
    conf_mat=confusion_matrix(y_val,y_pred)
    clas_rep=classification_report(y_val,y_pred)
    print(score)
    print(conf_mat)
    print(clas_rep)

def model_two():
    knn=KNeighborsClassifier(n_neighbors=6)
    fit=knn.fit(X_tr, y_tr)
    y_pred=knn.predict(X_val)
    score_2=knn.score(X_val,y_val)
    conf_mat_2=confusion_matrix(y_val,y_pred)
    clas_rep_2=classification_report(y_val,y_pred)
    print(score_2)
    print(conf_mat_2)
    print(clas_rep_2)

def model_three():
    dt=DecisionTreeClassifier(random_state=0)
    fit=dt.fit(X_tr, y_tr)
    y_pred=dt.predict(X_val)
    score_3=dt.score(X_val,y_val)
    conf_mat_3=confusion_matrix(y_val,y_pred)
    clas_rep_3=classification_report(y_val,y_pred)
    print(score_3)
    print(conf_mat_3)
    print(clas_rep_3)

def model_four():
    svm=LinearSVC(random_state=0)
    fit=svm.fit(X_tr, y_tr)
    y_pred=svm.predict(X_val)
    score_4=svm.score(X_val,y_val)
    conf_mat_4=confusion_matrix(y_val,y_pred)
    clas_rep_4=classification_report(y_val,y_pred)
    print(score_4)
    print(conf_mat_4)
    print(clas_rep_4)

def model_five():
    rf=RandomForestClassifier(max_depth=3, random_state=0)
    fit=rf.fit(X_tr, y_tr)
    y_pred=rf.predict(X_val)
    score_5=rf.score(X_val,y_val)
    conf_mat_5=confusion_matrix(y_val,y_pred)
    clas_rep_5=classification_report(y_val,y_pred)
    print(score_5)
    print(conf_mat_5)
    print(clas_rep_5)
    

def model_baseline():
    dummy = DummyClassifier(strategy="stratified")
    fit=dummy.fit(X_tr, y_tr)
    y_pred=dummy.predict(X_val)
    score_base=dummy.score(X_val,y_val)
    conf_mat_base=confusion_matrix(y_val,y_pred)
    clas_rep_base=classification_report(y_val,y_pred)
    print(score_base)
    print(conf_mat_base)
    print(clas_rep_base)
    
# Perform preliminary evaluations here



In [158]:
model_baseline()
model_one()
model_two()
model_three()
model_four()
model_five()

0.9994791666666667
[[5754    3]
 [   3    0]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5757
           1       0.00      0.00      0.00         3

    accuracy                           1.00      5760
   macro avg       0.50      0.50      0.50      5760
weighted avg       1.00      1.00      1.00      5760



  'precision', 'predicted', average, warn_for)


0.9994791666666667
[[5757    0]
 [   3    0]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5757
           1       0.00      0.00      0.00         3

    accuracy                           1.00      5760
   macro avg       0.50      0.50      0.50      5760
weighted avg       1.00      1.00      1.00      5760



  'precision', 'predicted', average, warn_for)


0.9994791666666667
[[5757    0]
 [   3    0]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5757
           1       0.00      0.00      0.00         3

    accuracy                           1.00      5760
   macro avg       0.50      0.50      0.50      5760
weighted avg       1.00      1.00      1.00      5760

0.9987847222222223
[[5753    4]
 [   3    0]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5757
           1       0.00      0.00      0.00         3

    accuracy                           1.00      5760
   macro avg       0.50      0.50      0.50      5760
weighted avg       1.00      1.00      1.00      5760

0.9994791666666667
[[5757    0]
 [   3    0]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5757
           1       0.00      0.00      0.00         3

    accuracy                           1.00 

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


### Preliminary conclusions on your models

Include some graphs and peformance metrics

## Part 4: Cross-validation
We really should have used k-fold (eg. k=5) crossvalidation here, to not only evaluate our five keras/tensorflow models. See how your preliminary results change. Now that we have validation results with uncertainy (+- standard deviation), do your prior conclusion change. 

In [157]:
# Part 4 inplement cross validation here
from sklearn.model_selection import cross_val_score
logreg=LogisticRegression()   
scores_1=cross_val_score(logreg, data_x, data_y, cv=10)
print(scores_1.mean())

knn=KNeighborsClassifier(n_neighbors=6)   
scores_2=cross_val_score(knn, data_x, data_y, cv=10)
print(scores_2.mean())

dt=DecisionTreeClassifier(random_state=0)   
scores_3=cross_val_score(dt, data_x, data_y, cv=10)
print(scores_3.mean())

svm=LinearSVC(random_state=0)  
scores_4=cross_val_score(svm, data_x, data_y, cv=10)
print(scores_4.mean())

rf=RandomForestClassifier(max_depth=3, random_state=0)  
scores_5=cross_val_score(rf, data_x, data_y, cv=10)
print(scores_5.mean())




0.9994791666390718
0.9994791666390718
0.9986653327148509




0.9994791666390718




0.9994791666390718


### Fill in your Part 4 Conclusion here 

After cross validation, accuracies didn't change too much because the previous accuracies were also high.

## Part 5: Refining with Regularization

We know that our biggest problem, if our models are flexibile enough, will be overfitting. Please try to regularize your best 2 models to see if you can improve their results. Not all algorithms have regularization but analyze two that do. Make sure you show graph performance has you change the regularization parameters.
Look at these questions:

* Try regularizing each of your two best models, does the generalizability increase?  of Decrease?  
* Is one more sensitive than the other? Why might this happen and why?  
* Please try this with all of your features and then with the reduced set of features.  
* Report your precision, recall and f1 score on the train and validation sets (no cross validatio yet).  
* Next carry out cross validation.  Does regularization reduce under or overfitting?   Why or why not?  

** Hint: Try both L1 or L2 norm for regularization or dropout **


In [161]:
# Fill in your code analysis for part 5 here
logreg=LogisticRegression(penalty="l1")
fit=logreg.fit(X_tr, y_tr)
y_pred=logreg.predict(X_val)
score=logreg.score(X_val,y_val)
conf_mat=confusion_matrix(y_val,y_pred)
clas_rep=classification_report(y_val,y_pred)
print(score)
print(conf_mat)
print(clas_rep)


logreg=LogisticRegression(penalty="l2")
fit=logreg.fit(X_tr, y_tr)
y_pred=logreg.predict(X_val)
score=logreg.score(X_val,y_val)
conf_mat=confusion_matrix(y_val,y_pred)
clas_rep=classification_report(y_val,y_pred)
print(score)
print(conf_mat)
print(clas_rep)
    
svm=LinearSVC(random_state=0, penalty="l2")
fit=svm.fit(X_tr, y_tr)
y_pred=svm.predict(X_val)
score_4=svm.score(X_val,y_val)
conf_mat_4=confusion_matrix(y_val,y_pred)
clas_rep_4=classification_report(y_val,y_pred)
print(score_4)
print(conf_mat_4)
print(clas_rep_4)

  'precision', 'predicted', average, warn_for)


0.9994791666666667
[[5757    0]
 [   3    0]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5757
           1       0.00      0.00      0.00         3

    accuracy                           1.00      5760
   macro avg       0.50      0.50      0.50      5760
weighted avg       1.00      1.00      1.00      5760

0.9994791666666667
[[5757    0]
 [   3    0]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5757
           1       0.00      0.00      0.00         3

    accuracy                           1.00      5760
   macro avg       0.50      0.50      0.50      5760
weighted avg       1.00      1.00      1.00      5760

0.9994791666666667
[[5757    0]
 [   3    0]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5757
           1       0.00      0.00      0.00         3

    accuracy                           1.00 

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


### Fill in your part 5 conclusions here
Since first results were also good, they did not change.

## Overall Conclusion

Conclude with a full report here on what we know now about this problem. How well it does verses baseline, what the best Keras archtecture is, what features should be used, how the data should be cleaned etc.