# Income Classification using Logistic Regression

In this project, I use data from the 1994 U.S. Census to build a logistic regression model that predicts whether a person earns more than $50,000 a year.


### Dataset

The dataset is available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/20/census+income).

### Features

Input and output features:
* age: continuous
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
* sex: Female, Male
* capital-gain: continuous
* capital-loss: continuous
* hours-per-week: continuous
* native-country: discrete
* income: discrete, `>50K` or `<=50K`

------

### EDA and Logistic Regression Assumptions
1. The dataset is loaded into a DataFrame named `df`. The outcome variable is `income`. Check whether the classes in `income` are imbalanced.
2. The variable `feature_cols` lists the columns used as predictor variables. Transform these predictors into dummy variables and save the result in a new DataFrame called `X`.
3. Using `X`, create a heatmap of the correlation values.
4. Determine whether scaling is needed for `X` prior to modeling. Then create the binary output variable `y`: `0` when income is at most $50K (`<=50K`), `1` when it is greater than $50K (`>50K`).
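One way to approach the scaling question in step 4 is to compare the ranges of the predictors. The sketch below uses a small made-up frame (`X_toy` and its values are hypothetical, not taken from the census file) and standardizes it with `StandardScaler`. Because the model later in this notebook uses an L1 penalty, features on very different scales are penalized unevenly, which is the usual argument for scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X; the values are invented for illustration.
X_toy = pd.DataFrame({
    "age": [25, 38, 52, 61],
    "capital-gain": [0, 0, 15000, 99999],
    "sex_Male": [1, 0, 1, 0],
})

# Very different min/max ranges across columns suggest scaling is worthwhile.
print(X_toy.describe().loc[["min", "max"]])

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)
print(X_scaled.round(2))
```

In a full workflow the scaler would normally be fitted on the training split only and then applied to the test split.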

### Logistic Regression Models and Evaluation
5. Split the data into training and test sets, setting `random_state` to `1` and `test_size` to `0.2`. Then fit the logistic regression model on the training set and predict on the test set.
6. Print the model parameters (`intercept` and `coefficients`).
7. Evaluate the predictions of the model on the `test set`. Print the `confusion matrix` and `accuracy score`.
8. Create a new DataFrame of the model coefficients and variable names. Sort values based on coefficient and exclude any that are equal to zero. Print the values of the DataFrame.
9. Create a `barplot` of the coefficients sorted in ascending order.
10. Plot the `ROC curve` and print the `AUC value`.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns


col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
             'marital-status', 'occupation', 'relationship', 'race', 'sex',
             'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df = pd.read_csv('adult.data', header=None, names=col_names)

#Clean columns by stripping extra whitespace for columns of type "object"
for c in df.select_dtypes(include=['object']).columns:
    df[c] = df[c].str.strip()
print(df.head())

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country income  
0          2174             0              40  United-States  <=50K  
1             0             0             

In [None]:
#1. Check Class Imbalance
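
A minimal sketch of the imbalance check, assuming a tiny hand-made frame (`toy` is hypothetical and only mimics the `income` column of `df`). `value_counts` gives the absolute counts, and `normalize=True` gives the share of each class.

```python
import pandas as pd

# Hypothetical miniature stand-in for df; the real notebook loads adult.data.
toy = pd.DataFrame({"income": ["<=50K", "<=50K", "<=50K", ">50K"]})

# Absolute count and relative share of each outcome class.
class_counts = toy["income"].value_counts()
class_shares = toy["income"].value_counts(normalize=True)

print(class_counts)
print(class_shares)
```

On the real data the same two calls would be applied to `df['income']`. A share far from 0.5 indicates imbalance, in which case accuracy alone can be a misleading metric.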


In [None]:
#2. Create feature dataframe X with feature columns and dummy variables for categorical features
feature_cols = ['age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex', 'race', 'education']
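
One possible way to build the dummy-variable frame is `pd.get_dummies`. The sketch below runs on a hypothetical four-row frame (`toy`, `toy_feature_cols`, and `X_toy` are illustrative names). `drop_first=True` is an assumption, not something the project statement requires: it drops one level per categorical column so the dummies are not perfectly collinear.

```python
import pandas as pd

# Hypothetical rows shaped like a few of the census columns.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours-per-week": [40, 13, 40, 40],
    "sex": ["Male", "Male", "Female", "Female"],
    "race": ["White", "White", "Black", "Black"],
})
toy_feature_cols = ["age", "hours-per-week", "sex", "race"]

# Numeric columns pass through unchanged; object columns become 0/1 indicators.
X_toy = pd.get_dummies(toy[toy_feature_cols], drop_first=True)
print(X_toy.columns.tolist())
```

With the real data this would read `X = pd.get_dummies(df[feature_cols], drop_first=True)`.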

In [None]:
#3. Create a heatmap of X data to see feature correlation
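
A hedged sketch of the correlation heatmap, using randomly generated columns as a stand-in for `X` (the column names and values in `X_toy` are hypothetical). It computes the pairwise correlation matrix with `DataFrame.corr` and passes it to `sns.heatmap`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical numeric feature frame standing in for X.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=50),
    "hours-per-week": rng.integers(10, 60, size=50),
    "sex_Male": rng.integers(0, 2, size=50),
})

# Pairwise Pearson correlations between all feature columns.
corr = X_toy.corr()

# Diverging colormap centered at zero so positive and negative values stand out.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, center=0, cmap="coolwarm", annot=True, fmt=".2f", ax=ax)
ax.set_title("Feature correlation (toy data)")
fig.tight_layout()
```

Strongly correlated predictors are worth noting here because logistic regression assumes limited multicollinearity among features.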


In [None]:
#4. Create output variable y which is binary, 0 when income is <=50K, 1 when it is >50K
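
The binary target can be derived with `np.where`, as in this minimal sketch on a hypothetical `income` column (`toy` and `y_toy` are illustrative names). The labels match the stripped strings `<=50K` and `>50K` used in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical income labels, already stripped of whitespace.
toy = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K"]})

# 0 for the "<=50K" class, 1 for everything else (here only ">50K").
y_toy = np.where(toy["income"] == "<=50K", 0, 1)
print(y_toy)
```

For the real frame the equivalent line would be `y = np.where(df.income == '<=50K', 0, 1)`.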


In [None]:
#5a. Split data into a train and test set


#5b. Fit LR model with sklearn on the train set and predict on the test set
log_reg = LogisticRegression(C=0.05, penalty='l1', solver='liblinear')
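
A sketch of steps 5a and 5b on synthetic data, assuming a made-up feature matrix in which only the first column drives the label (`X_toy`, `y_toy`, and `toy_model` are hypothetical names). The split settings and the model hyperparameters are the ones given in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 3 features, label mostly determined by column 0.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

# 5a. Hold out 20% of the rows for testing, with a fixed seed.
x_train, x_test, y_train, y_test = train_test_split(
    X_toy, y_toy, random_state=1, test_size=0.2
)

# 5b. Fit the L1-regularized model on the training rows, predict on the test rows.
toy_model = LogisticRegression(C=0.05, penalty="l1", solver="liblinear")
toy_model.fit(x_train, y_train)
y_pred = toy_model.predict(x_test)
print(y_pred[:10])
```

In the notebook itself, `X` and `y` from the earlier steps would replace the synthetic arrays, and the fitted estimator would be `log_reg`.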

In [None]:
#6. Print model parameters (intercept and coefficients)
print('Model Parameters, Intercept:')

print('Model Parameters, Coeff:')
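
After fitting, scikit-learn exposes the parameters as the `intercept_` and `coef_` attributes. The sketch below fits a model on synthetic data (all names and values are hypothetical) only to show their shapes: for a binary problem, `intercept_` has one entry and `coef_` has shape `(1, n_features)`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data.
rng = np.random.default_rng(2)
x_train_toy = rng.normal(size=(150, 4))
y_train_toy = (x_train_toy[:, 0] - x_train_toy[:, 1] > 0).astype(int)

toy_model = LogisticRegression(C=0.05, penalty="l1", solver="liblinear")
toy_model.fit(x_train_toy, y_train_toy)

# intercept_ is a length-1 array; coef_ holds one row of per-feature weights.
print("Model Parameters, Intercept:")
print(toy_model.intercept_)
print("Model Parameters, Coeff:")
print(toy_model.coef_)
```

With an L1 penalty some coefficients can be shrunk to exactly zero, which is why step 8 filters zero-valued entries.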

In [None]:
#7. Evaluate the predictions of the model on the test set. Print the confusion matrix and accuracy score.
print('Confusion Matrix on test set:')
print('Accuracy Score on test set:')
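
A small worked example of the two metrics, using hand-written label vectors rather than model output (`y_true_toy` and `y_pred_toy` are hypothetical). In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for eight test rows.
y_true_toy = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred_toy = [0, 0, 1, 1, 0, 1, 0, 1]

# Layout: [[TN, FP], [FN, TP]] for labels 0 and 1.
cm = confusion_matrix(y_true_toy, y_pred_toy)
acc = accuracy_score(y_true_toy, y_pred_toy)

print("Confusion Matrix on test set:")
print(cm)
print("Accuracy Score on test set:")
print(acc)
```

Here three negatives and three positives are classified correctly, with one false positive and one false negative, so the accuracy is 6/8 = 0.75. In the notebook the arguments would be `y_test` and the predictions from `log_reg`.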

In [None]:

#8. Create new DataFrame of the model coefficients and variable names; sort values based on coefficient and exclude zeros
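
One way to build the coefficient table, sketched with an invented coefficient array in place of `log_reg.coef_` and invented names in place of `X.columns` (`toy_names` and `toy_coef` are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and a coef_-shaped array of weights.
toy_names = ["age", "capital-gain", "sex_Male", "race_White"]
toy_coef = np.array([[0.04, 0.0, 0.9, -0.2]])

# Pair each name with its weight, drop exact zeros, sort ascending.
coef_df = pd.DataFrame({"var": toy_names, "coef": toy_coef[0]})
coef_df = coef_df[coef_df["coef"].abs() > 0].sort_values("coef")
print(coef_df)
```

The zero-weight `capital-gain` row is removed, leaving `race_White`, `age`, and `sex_Male` in ascending order of coefficient.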

In [None]:

#9. Create a barplot of the coefficients sorted in ascending order
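
A possible version of the coefficient barplot with seaborn, again on invented values (the frame below is hypothetical and stands in for the sorted table from step 8).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical non-zero coefficients, sorted in ascending order.
coef_df = pd.DataFrame({
    "var": ["race_White", "age", "sex_Male"],
    "coef": [-0.2, 0.04, 0.9],
}).sort_values("coef")

# One bar per variable; rotate labels so long dummy names stay readable.
fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(data=coef_df, x="var", y="coef", ax=ax)
ax.tick_params(axis="x", rotation=90)
ax.set_title("LR coefficient values (toy data)")
fig.tight_layout()
```

Bars below zero lower the predicted log-odds of earning more than $50K, and bars above zero raise them.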

In [None]:
#10. Plot the ROC curve and print the AUC value.
#y_pred_prob = log_reg.predict_proba(x_test)
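
A sketch of the ROC step on a hand-written probability array (`proba_toy` is hypothetical and mimics the two-column output of `predict_proba`). Column 0 holds the probability of class `0` and column 1 the probability of class `1`, so `roc_curve` and `roc_auc_score` take column 1.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predict_proba-style output for four rows.
y_true_toy = np.array([0, 0, 1, 1])
proba_toy = np.array([[0.90, 0.10],
                      [0.60, 0.40],
                      [0.65, 0.35],
                      [0.20, 0.80]])

# Use the positive-class column as the score.
scores = proba_toy[:, 1]
fpr, tpr, thresholds = roc_curve(y_true_toy, scores)
auc = roc_auc_score(y_true_toy, scores)
print("AUC:", auc)

# ROC curve with the chance diagonal as a reference.
fig, ax = plt.subplots(figsize=(5, 4))
ax.plot(fpr, tpr, label=f"ROC (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend()
fig.tight_layout()
```

The AUC here is 0.75 because three of the four positive/negative pairs are ranked correctly. In the notebook the scores would come from `log_reg.predict_proba(x_test)[:, 1]`.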