# Logistic Regression Example
The goal of this project is to predict adult income based on the available data using logistic regression.
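
For readers new to the method, the following is a minimal sketch of what a fitted logistic regression computes for one person: a weighted sum of the features, passed through the sigmoid function, then compared against a threshold. The weights and intercept below are hypothetical values picked by hand for illustration; they are not the coefficients learned later in this notebook.

```python
import math

def sigmoid(z):
    """Map any real number into the interval [0, 1] without overflowing."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, intercept):
    """Estimated probability of class 1 under a logistic regression model."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, intercept, threshold=0.5):
    """Return 1 when the estimated probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, intercept) >= threshold else 0

# Hypothetical coefficients, for illustration only.
weights = [0.04, 0.35]   # per year of age, per education level
intercept = -6.5

person = [38, 9]         # age 38, educational-num 9
p = predict_proba(person, weights, intercept)
print(round(p, 3), predict(person, weights, intercept))
```

Fitting the model means choosing the weights and intercept that make these probabilities match the training labels as closely as possible.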

# Import Libraries

In [54]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Importing and exploring the data

In [55]:
data = pd.read_csv('adult.csv')
print(data.shape)
print(list(data.columns))

(48842, 15)
['age', 'workclass', 'fnlwgt', 'education', 'educational-num', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']


The data has 48,842 rows (about 49k) and 15 columns. The value to predict is income. Let's have a look at the first few rows of the data.

In [56]:
data.head()

,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
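
The last row of the preview has `?` in the workclass and occupation columns, which looks like a placeholder for missing values. The sketch below counts such placeholders per column on two rows copied from the preview and reduced to three columns. Treating `?` as the missing-value marker is an assumption drawn from this preview, not something the notebook states.

```python
import csv
import io
from collections import Counter

# Two rows copied from the preview above, reduced to three columns.
sample = """age,workclass,occupation
25,Private,Machine-op-inspct
18,?,?
"""

MISSING = "?"  # assumption: "?" marks a missing value in this file

missing_per_column = Counter()
for row in csv.DictReader(io.StringIO(sample)):
    for column, value in row.items():
        if value.strip() == MISSING:
            missing_per_column[column] += 1

print(dict(missing_per_column))
```

With pandas, passing `na_values='?'` to `pd.read_csv` would turn these placeholders into NaN at load time. None of the columns used for modelling in this notebook show a `?` in the preview.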


# Very Simple First Analysis
Let's start as simple as possible. Education and age are likely to be strong predictors of income, so let's build a model on these two features first and see later whether it can be improved. They are also a convenient starting point because both are already numeric, and we assume for now that they do not interact, that is, that the effect of age on income does not depend on education level.

In [57]:
X = data[['age','educational-num']]
X.describe()

,age,educational-num
count,48842.0,48842.0
mean,38.643585,10.078089
std,13.71051,2.570973
min,17.0,1.0
25%,28.0,9.0
50%,37.0,10.0
75%,48.0,12.0
max,90.0,16.0


In [58]:
# 'age' and 'educational-num' are conveniently already numerical.
# now we need to transform the categorical value y into a number:
y = data['income'].map(lambda x: 0 if x == "<=50K" else 1)
print(y.head())

0    0
1    0
2    1
3    1
4    0
Name: income, dtype: int64
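
The lambda above maps every value other than "<=50K" to 1, so a label with stray whitespace, a typo or a missing value would silently be counted as high income. Whether the file contains such values is not shown here. As a precaution, a stricter encoder could look like the sketch below; `LABELS` and `encode_income` are hypothetical names introduced for this example.

```python
# Explicit mapping of the two labels seen in the preview.
LABELS = {"<=50K": 0, ">50K": 1}

def encode_income(raw):
    """Encode an income label, failing loudly on anything unexpected."""
    label = raw.strip()
    if label not in LABELS:
        raise ValueError(f"unexpected income label: {raw!r}")
    return LABELS[label]

print([encode_income(v) for v in ["<=50K", ">50K", " <=50K "]])
```

In the notebook itself the same check could be applied with `data['income'].map(encode_income)`.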


# Modelling & testing

In [59]:
# Create the train/test split (a new random split is drawn on every run)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the model on the training data
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Check the accuracy of the model on the held-out test data
y_pred = logreg.predict(X_test)
print(logreg.score(X_test, y_test))

0.7835558103349439
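
Because `train_test_split` is called without a seed, every run draws a different split and the accuracy shifts slightly from run to run. In scikit-learn the split can be pinned by passing an integer `random_state`. The sketch below shows the underlying idea in plain Python, using the five rows from the preview; `shuffle_split` is a hypothetical helper written for this example, not scikit-learn's implementation.

```python
import random

def take(seq, indices):
    """Select the items of seq at the given positions."""
    return [seq[i] for i in indices]

def shuffle_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle row positions with a fixed seed, then cut off a test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (take(rows, train_idx), take(rows, test_idx),
            take(labels, train_idx), take(labels, test_idx))

# [age, educational-num] and income label for the five preview rows.
features = [[25, 7], [38, 9], [28, 12], [44, 10], [18, 10]]
labels = [0, 0, 1, 1, 0]

first = shuffle_split(features, labels, test_fraction=0.4, seed=42)
second = shuffle_split(features, labels, test_fraction=0.4, seed=42)
print(first == second)  # the same seed reproduces the same split
```

A fixed seed makes it possible to compare two models on exactly the same test rows.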


# Intermediate Conclusion
This means that by looking only at education and age, we can already predict with about 78% accuracy whether a person earns more than 50K. We can make the model more complicated to get a better result, but this seems to be a good start! (Keep in mind that a coin flip would give 50% only if both income classes were equally common, so the share of the larger class is the fairer starting point to compare against.)
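
To put an accuracy figure in context, it helps to know what a model would score by always predicting the most frequent class. The sketch below computes that majority-class baseline. The toy labels are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Toy labels for illustration only; not taken from the dataset.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0]
label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)
```

On the notebook's own target, `y.value_counts(normalize=True).max()` gives the same quantity. The 78% above is only informative to the extent that it exceeds this number.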

# Add more variables
Let's add hours per week, gender and race to the model next and see how this improves the outcome.

In [60]:
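# Build the feature frame from a copy, so new values are not written to a slice of data
X = data[['age', 'educational-num', 'gender', 'race', 'hours-per-week']].copy()
print(X.head())
# Encode gender as 1 for "Male" and 0 otherwise
X['gender'] = X['gender'].map(lambda x: 1 if x == "Male" else 0)
# Encode race as 1 for "Black" and 0 for every other value
X['race'] = X['race'].map(lambda x: 1 if x == "Black" else 0)
print(X.head())
X.describe()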

   age  educational-num  gender   race  hours-per-week
0   25                7    Male  Black              40
1   38                9    Male  White              50
2   28               12    Male  White              40
3   44               10    Male  Black              40
4   18               10  Female  White              30
   age  educational-num  gender  race  hours-per-week
0   25                7       1     1              40
1   38                9       1     0              50
2   28               12       1     0              40
3   44               10       1     1              40
4   18               10       0     0              30
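
The race encoding above keeps a single indicator (Black versus everything else), so any other categories in the column are merged into 0. The preview only shows Black and White, and the full column may contain more values. One common alternative is one-hot encoding, sketched here in plain Python; `one_hot` is a hypothetical helper for this example.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Race values from the five rows printed above.
race = ["Black", "White", "White", "Black", "White"]
categories, encoded = one_hot(race)
print(categories)
for row in encoded:
    print(row)
```

In pandas, `pd.get_dummies(data['race'])` produces the same kind of indicator columns. When the model has an intercept, one indicator is usually dropped (`drop_first=True`) because it is implied by the others.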


,age,educational-num,gender,race,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,10.078089,0.668482,0.095922,40.422382
std,13.71051,2.570973,0.470764,0.294487,12.391444
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,1.0,0.0,40.0
75%,48.0,12.0,1.0,0.0,45.0
max,90.0,16.0,1.0,1.0,99.0
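
The summary shows features on quite different scales: 0/1 indicators next to age (17 to 90) and hours per week (1 to 99). scikit-learn's `LogisticRegression` applies L2 regularization by default, which penalizes coefficients differently depending on feature scale, so standardizing the inputs is a common preprocessing step. The sketch below standardizes the first row by hand using the means and standard deviations from the table. Whether scaling changes the accuracy on this dataset is not tested in this notebook.

```python
def standardize(value, mean, std):
    """Convert a raw value to a z-score."""
    return (value - mean) / std

# (mean, std) pairs copied from the describe() output above.
STATS = {
    "age": (38.643585, 13.71051),
    "educational-num": (10.078089, 2.570973),
    "hours-per-week": (40.422382, 12.391444),
}

# First row of the feature frame printed above.
first_row = {"age": 25, "educational-num": 7, "hours-per-week": 40}

scaled = {name: standardize(first_row[name], *STATS[name]) for name in STATS}
for name, z in scaled.items():
    print(f"{name}: {z:+.2f}")
```

The `preprocessing` module imported at the top of the notebook provides `StandardScaler` for this purpose. It divides by the population standard deviation, whereas `describe()` reports the sample standard deviation; with this many rows the difference is negligible.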


# More complex modelling & testing
Initially we got about 78% accuracy based on education and age only. Now let's see what the improvement is if we add gender, hours per week and race.

In [61]:
# Create the train/test split (a new random split, not the one used for the first model)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the model on the training data
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Check the accuracy of the model on the held-out test data
y_pred = logreg.predict(X_test)
print(logreg.score(X_test, y_test))

0.7990336581770535
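
The two accuracies come from different random splits, so it is worth asking how much of the gap could be sampling noise. The rough calculation below uses the binomial standard error of an accuracy estimate. It assumes a test set of 12,211 rows, which is 25% of 48,842 rounded up (the default share when `test_size` is not given) and is consistent with both printed scores (9,568 and 9,757 correct predictions). Treating the two estimates as independent is a simplification, since both splits are drawn from the same data.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Accuracies printed by the two modelling cells.
acc_small = 0.7835558103349439   # age + education
acc_large = 0.7990336581770535   # plus gender, race, hours per week

# Assumption: 25% of the 48842 rows were held out, rounded up.
n_test = math.ceil(0.25 * 48842)

se_small = accuracy_standard_error(acc_small, n_test)
se_large = accuracy_standard_error(acc_large, n_test)

difference = acc_large - acc_small
# Simplification: treat the two estimates as independent.
se_difference = math.sqrt(se_small ** 2 + se_large ** 2)

print(f"difference: {difference:.4f}")
print(f"standard error of the difference: {se_difference:.4f}")
print(f"ratio: {difference / se_difference:.1f}")
```

A gap of several standard errors suggests the improvement is more than split-to-split noise, although scoring both models on one fixed split would be a cleaner comparison.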


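# Conclusion
Adding gender, race and hours per week raises the test accuracy from 78.4% to 79.9%, an improvement of about 1.5 percentage points. Because each model was scored on a different random split, part of that difference may be split-to-split variation. Age and education alone already give a prediction that is about 78% accurate.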