<a href="https://colab.research.google.com/github/nawazf/BSMM-8740-1/blob/main/Adult_Income_Model_Selection_Optimization_pynb_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0\. Background

*   Your goal is to predict whether income exceeds $50K/yr based on census data.
*   You have a sample dataset with a binary <font color="red">income</font> target. There are two possible values: **'>50K'** and **'<=50K'**.



This dataset is sourced from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/dataset/2/adult



In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

## 1\. ***Load the Dataset into a Pandas DataFrame***

In [2]:
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ["age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
                "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
                "hours_per_week", "native_country", "income"]
data = pd.read_csv(url, names=column_names, skipinitialspace=True)

## 2\. ***Data Exploration***

### 2\.1 Display the top 5 rows and bottom 5 rows of the DataFrame

In [3]:
data

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### 2\.2 Identify which columns have nulls

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [5]:
data.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
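A zero count from `isnull()` does not necessarily mean the data is complete. The UCI Adult files mark unknown values with a literal `?`, which pandas reads as an ordinary string. The sketch below illustrates this on a small toy frame; the rows and the names `toy`, `placeholder_counts`, and `toy_clean` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; "?" stands in for an unknown value.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "?", "Adm-clerical"],
    "age": [39, 50, 38],
})

# isnull() reports nothing, because "?" is a regular string.
nulls_before = toy.isnull().sum()

# Count the placeholder directly, then convert it to NaN.
placeholder_counts = (toy == "?").sum()
toy_clean = toy.replace("?", np.nan)
nulls_after = toy_clean.isnull().sum()

print(placeholder_counts)
print(nulls_after)
```

On the real file, the same comparison (`(data == "?").sum()`) should reveal which columns carry the placeholder. Passing `na_values="?"` to `pd.read_csv` is a common way to convert it at load time, though it is worth confirming the counts afterwards.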

### 2\.3 Identify the number of unique values in each column in your dataset

In [6]:
data.nunique()

age                  73
workclass             9
fnlwgt            21648
education            16
education_num        16
marital_status        7
occupation           15
relationship          6
race                  5
sex                   2
capital_gain        119
capital_loss         92
hours_per_week       94
native_country       42
income                2
dtype: int64

### 2\.4 For your target variable, identify the percentage of records for each class


In [7]:
data["income"].value_counts()

income
<=50K    24720
>50K      7841
Name: count, dtype: int64

In [8]:
data["income"].value_counts(normalize = True)

income
<=50K    0.75919
>50K     0.24081
Name: proportion, dtype: float64
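To express the proportions as percentages, and to see what they imply for model evaluation, here is a short sketch that uses the class counts printed above. The variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"<=50K": 24720, ">50K": 7841}, name="count")

# Convert counts to percentages of all records.
percentages = (counts / counts.sum() * 100).round(2)
print(percentages)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts.max() / counts.sum()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Because roughly three quarters of the records fall in one class, accuracy alone can look good for a model that learns very little, which is one reason AUC is also reported later.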

## 3\. ***Pre-process the data***

### 3\.1 Organize your columns into categorical features, numeric features, and a target variable. Assign your features to an X variable and your target to a y variable. Display both X and y.


In [9]:
numeric_features = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
categorical_features = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
target = 'income'

In [10]:
X = data[numeric_features + categorical_features]
y = data[target]

In [11]:
X

,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass,education,marital_status,occupation,relationship,race,sex,native_country
0,39,77516,13,2174,0,40,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,50,83311,13,0,0,13,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,38,215646,9,0,0,40,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,53,234721,7,0,0,40,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,28,338409,13,0,0,40,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,12,0,0,38,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,United-States
32557,40,154374,9,0,0,40,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States
32558,58,151910,9,0,0,40,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,United-States
32559,22,201490,9,0,0,20,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,United-States


In [12]:
y

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
32556    <=50K
32557     >50K
32558    <=50K
32559    <=50K
32560     >50K
Name: income, Length: 32561, dtype: object

### 3\.2 Label Encoding


Notice the binary target variable has text values.  
The target variable needs to be converted to numeric values before using it to build your predictive model.

In [13]:
from sklearn.preprocessing import LabelEncoder

# Encode the text labels as integers ('<=50K' -> 0, '>50K' -> 1)
le = LabelEncoder()
y = le.fit_transform(y)
class_names = le.classes_

In [14]:
class_names

array(['<=50K', '>50K'], dtype=object)

In [15]:
y

array([0, 0, 0, ..., 0, 0, 1])
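`LabelEncoder` assigns integers in sorted order of the class labels, and the mapping can be reversed with `inverse_transform`. A minimal sketch on three made-up labels (the names `le_demo`, `mapping`, and `decoded` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Three made-up labels using the same two classes as the target.
labels = ["<=50K", ">50K", "<=50K"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted, so its index positions are the encoded integers.
mapping = {str(c): int(i) for i, c in enumerate(le_demo.classes_)}
print(mapping)

# Map integer predictions back to the original text labels.
decoded = le_demo.inverse_transform(encoded)
print(list(decoded))
```

Knowing that `'>50K'` maps to 1 matters later, because metrics such as AUC treat class 1 as the positive class.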

### 3\.3 Feature Transformation

*  Perform One Hot Encoding for categorical features


In [22]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

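preprocessor = ColumnTransformer(
    transformers=[
        # One-hot encode categorical columns; unseen categories become all zeros
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        # Fill any missing numeric values with the column median
        ('num', SimpleImputer(strategy='median'), numeric_features)
    ])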

In [None]:
X = preprocessor.fit_transform(X)
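To see what this kind of transformer produces, the sketch below applies the same two steps to a hypothetical four-row frame with one categorical and one numeric column. The data and the names `toy`, `demo_preprocessor`, and `dense` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: three distinct categories and one missing age.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Local-gov"],
    "age": [39.0, None, 38.0, 53.0],
})

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
        ("num", SimpleImputer(strategy="median"), ["age"]),
    ]
)

encoded = demo_preprocessor.fit_transform(toy)

# The result may be a SciPy sparse matrix or a NumPy array, depending on density.
dense = encoded.toarray() if hasattr(encoded, "toarray") else encoded

# Three one-hot columns (one per category) followed by the imputed age column.
print(dense.shape)
print(dense)
```

The output has one column per category plus one per numeric feature, so on the full dataset the transformed `X` will be much wider than the original 14 columns.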

## 4\. ***Train and Evaluate Classification Models***

### 4\.1 Train/Test Split

Split your data into 80% training and 20% testing sets, using a random seed of 42.

<font color="red"><h1>Submit Answer Below [3] - 1 point:</h1></font>

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
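Since the classes are imbalanced, one optional variation is a stratified split, which keeps the class ratio the same in both partitions. The exercise does not require it; the sketch below only illustrates the `stratify` argument on synthetic arrays (`X_demo` and `y_demo` are made up).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 25% in the positive class.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 25 + [0] * 75)

# stratify=y_demo preserves the 25/75 ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```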

### 4\.2 Train and Evaluate Models

For all your trained models, report:


*   Confusion Matrix
*   Accuracy Score
*   AUC Score
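The three metrics can be gathered in one helper so that every model is reported the same way. The sketch below is one possible arrangement, not a required part of the assignment: `report_metrics` is a hypothetical helper, and it is demonstrated on synthetic data from `make_classification` rather than on the census data. It assumes the classifier exposes `predict_proba`, and it computes AUC from the predicted probabilities of class 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


def report_metrics(clf, X_eval, y_eval):
    """Return confusion matrix, accuracy, and AUC for a fitted binary classifier."""
    y_pred = clf.predict(X_eval)
    # AUC is a ranking metric, so score it on probabilities, not hard labels.
    y_score = clf.predict_proba(X_eval)[:, 1]
    return {
        "confusion_matrix": pd.DataFrame(confusion_matrix(y_eval, y_pred)),
        "accuracy": accuracy_score(y_eval, y_pred),
        "auc": roc_auc_score(y_eval, y_score),
    }


# Synthetic stand-in data, used only to show the helper running.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

demo_clf = LogisticRegression(max_iter=1000).fit(X_a, y_a)
results = report_metrics(demo_clf, X_b, y_b)

print(results["confusion_matrix"])
print(f"Accuracy: {results['accuracy']:.4f}")
print(f"AUC: {results['auc']:.4f}")
```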



#### Logistic Regression

In [27]:
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)

ValueError: could not convert string to float: 'Local-gov'

<font color="red"><h1>Challenge 1 - Fix the above error</h1></font>
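The message suggests that `X_train` still contains raw strings such as `'Local-gov'`, which would happen if the `preprocessor.fit_transform(X)` cell (shown as `In [None]`) was not executed before the split. One possible fix, sketched below, is to chain the preprocessing and the model in a `Pipeline` so that encoding always runs before fitting and is learned from the training rows only. The toy frame is hypothetical, and `max_iter=1000` is an assumption added to reduce the chance of convergence warnings on unscaled features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the raw (untransformed) feature frame.
toy_X = pd.DataFrame({
    "workclass": ["Private", "Local-gov", "State-gov", "Private"] * 5,
    "hours_per_week": [40, 13, 40, 60] * 5,
})
toy_y = [0, 0, 1, 1] * 5

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
        ("num", SimpleImputer(strategy="median"), ["hours_per_week"]),
    ]
)

# The pipeline encodes the string columns before they reach the model.
lr_pipeline = Pipeline(steps=[
    ("preprocess", demo_preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

lr_pipeline.fit(toy_X, toy_y)
preds = lr_pipeline.predict(toy_X)
print(preds)
```

In the notebook itself, the equivalent would be to split the raw feature DataFrame and call `fit` on a pipeline built from `preprocessor` and `LogisticRegression`. Simply running the skipped transformation cell before the split should also clear the error.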

In [None]:
# y_test_pred = lr_clf.predict(X_test)
# # AUC should be computed from predicted probabilities of class 1, not hard labels
# y_test_proba = lr_clf.predict_proba(X_test)[:, 1]
# print(f'Area under ROC curve: {roc_auc_score(y_test, y_test_proba): 0.4f}')
# print(f'Accuracy {accuracy_score(y_test, y_test_pred): 0.4f}')
# print(f'Weighted F1 score {f1_score(y_test, y_test_pred, average="weighted"): 0.4f}')
# print(classification_report(y_test, y_test_pred, digits=4))
# pd.DataFrame(confusion_matrix(y_test, y_test_pred))

<font color="red"><h1>Challenge 2 - Try to achieve the best AUC score in the class. Let's see who will win this challenge. Use every trick at your disposal.</h1></font>
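One way to compare candidate models fairly is cross-validated AUC on the training data, keeping the test set for a single final check. The sketch below is only a starting point under stated assumptions: the data is synthetic, and the two candidates and their settings are arbitrary choices, not recommendations from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

# Arbitrary example candidates; swap in any estimators you want to compare.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean 5-fold AUC per candidate.
cv_auc = {
    name: cross_val_score(model, X_syn, y_syn, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_auc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```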