Source: https://archive.ics.uci.edu/ml/datasets/Adult

Use a statistical learning technique (logistic regression) to predict whether a person's annual income exceeds $50K.

Import libraries:


In [2]:
import numpy as np
import pandas as pd

Load and inspect dataset:

In [3]:
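# adult.data has no header row, so read_csv treats the first record as column
# names; that record is lost when the names are overwritten below.
df = pd.read_csv("adult.data")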
df_columns = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation","serv","relationship","race","sex","capital-gain","capital-loss","hours-per-week","native-country"]
df.columns = df_columns
df.to_csv('adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,serv,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


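The column names are misaligned: the list contains an extra name, serv, so every label from serv onward is shifted by one position relative to the data. Reading the values, serv holds relationship, relationship holds race, race holds sex, sex holds capital gain, capital-gain holds capital loss, capital-loss holds hours per week, hours-per-week holds native country, and native-country holds the income label. The response variable is therefore the column labeled native-country, which records whether income exceeds 50K.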
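The misalignment could be avoided at load time. The following is a minimal sketch, not the notebook's own code: it assumes the standard 15-column layout of adult.data, uses `income` as an illustrative name for the label column, and reads two records from an in-memory string in place of the real file.

```python
import io

import pandas as pd

# Assumed column order of adult.data; "income" is an illustrative name
# for the <=50K / >50K label in the last position.
COLUMN_NAMES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Two records in the raw file format, standing in for adult.data.
sample = io.StringIO(
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
    "38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, "
    "Not-in-family, White, Male, 0, 0, 40, United-States, <=50K\n"
)


def load_adult(source):
    """Read the headerless file so that no record is consumed as a header."""
    return pd.read_csv(
        source,
        header=None,            # the file has no header row
        names=COLUMN_NAMES,     # supply all 15 names explicitly
        skipinitialspace=True,  # drop the space that follows each comma
    )


fixed = load_adult(sample)
print(fixed[["age", "sex", "hours-per-week", "income"]])
```

With `skipinitialspace=True` the string values no longer carry a leading space, so later comparisons can use `"Male"` rather than `" Male"`.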

Drop columns with too many unique values to reduce the chance of overfitting. Note that the column labeled hours-per-week actually contains native country:

In [4]:
df = df.drop(['occupation', 'hours-per-week', 'workclass'], axis=1)

Print the unique values of each remaining column:

In [5]:
for col_name in df.columns:
    print(col_name)
    print(df[col_name].unique())

age
[50 38 53 28 37 49 52 31 42 30 23 32 40 34 25 43 54 35 59 56 19 39 20 45
 22 48 21 24 57 44 41 29 18 47 46 36 79 27 67 33 76 17 55 61 70 64 71 68
 66 51 58 26 60 90 75 65 77 62 63 80 72 74 69 73 81 78 88 82 83 84 85 86
 87]
fnlwgt
[ 83311 215646 234721 ...  34066  84661 257302]
education
[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
education-num
[13  9  7 14  5 10 12 11  4 16 15  3  6  2  1  8]
marital-status
[' Married-civ-spouse' ' Divorced' ' Married-spouse-absent'
 ' Never-married' ' Separated' ' Married-AF-spouse' ' Widowed']
serv
[' Husband' ' Not-in-family' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']
relationship
[' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other']
race
[' Male' ' Female']
sex
[    0 14084  5178  5013  2407 14344 15024  7688 34095  4064  4386  7298
  1409  3674  1055  3464  2050  2176  217
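When the goal is to count distinct values rather than list them, `DataFrame.nunique` gives a compact summary. A small sketch on a toy frame (the `toy` data is made up for illustration and only borrows column names from the notebook):

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame.
toy = pd.DataFrame({
    "education": [" Bachelors", " HS-grad", " 11th", " Bachelors"],
    "race": [" White", " White", " Black", " Black"],
    "fnlwgt": [83311, 215646, 234721, 338409],
})

# Number of distinct values per column, largest first.
unique_counts = toy.nunique().sort_values(ascending=False)
print(unique_counts)
```

Columns near the top of this ranking are the candidates for dropping or regrouping.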

In [6]:
df.columns

Index(['age', 'fnlwgt', 'education', 'education-num', 'marital-status', 'serv',
       'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
       'native-country'],
      dtype='object')

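Replace string values with numeric codes. The column labeled native-country holds the income label and the column labeled race holds sex; education is mapped to a hand-assigned ordinal scale: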

In [7]:
# Income label (stored under 'native-country'): 0 for <=50K, 1 for >50K
df['native-country'] = df['native-country'].replace([' <=50K', ' >50K'], [0, 1])
# Sex (stored under 'race'): 0 for Male, 1 for Female
df['race'] = df['race'].replace([' Male', ' Female'], [0, 1])
# Education level as a hand-assigned ordinal code
df['education'] = df['education'].replace(
    [' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',
     ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th', ' Doctorate',
     ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th', ' Preschool', ' 12th'],
    [12, 9, 7, 13, 5,
     11, 11, 11, 4, 14,
     11, 3, 6, 2, 1, 8])
df.head(10)

Unnamed: 0,age,fnlwgt,education,education-num,marital-status,serv,relationship,race,sex,capital-gain,capital-loss,native-country
0,50,83311,12,13,Married-civ-spouse,Husband,White,0,0,0,13,0
1,38,215646,9,9,Divorced,Not-in-family,White,0,0,0,40,0
2,53,234721,7,7,Married-civ-spouse,Husband,Black,0,0,0,40,0
3,28,338409,12,13,Married-civ-spouse,Wife,Black,1,0,0,40,0
4,37,284582,13,14,Married-civ-spouse,Wife,White,1,0,0,40,0
5,49,160187,5,5,Married-spouse-absent,Not-in-family,Black,1,0,0,16,0
6,52,209642,9,9,Married-civ-spouse,Husband,White,0,0,0,45,1
7,31,45781,13,14,Never-married,Not-in-family,White,1,14084,0,50,1
8,42,159449,12,13,Married-civ-spouse,Husband,White,0,5178,0,40,1
9,37,280464,11,10,Married-civ-spouse,Husband,Black,0,0,0,80,1
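An alternative way to write this kind of encoding is to strip the leading whitespace first and then apply a dictionary with `Series.map`. The sketch below uses a toy frame with correctly named columns; the names `SEX_CODES` and `INCOME_CODES` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame with the same leading-space strings as the raw data.
toy = pd.DataFrame({
    "sex": [" Male", " Female", " Male"],
    "income": [" <=50K", " >50K", " <=50K"],
})

# Illustrative code tables; keys are the stripped category labels.
SEX_CODES = {"Male": 0, "Female": 1}
INCOME_CODES = {"<=50K": 0, ">50K": 1}

encoded = toy.copy()
# str.strip removes the leading space; map turns any label missing
# from the dictionary into NaN, which makes unexpected values visible.
encoded["sex"] = toy["sex"].str.strip().map(SEX_CODES)
encoded["income"] = toy["income"].str.strip().map(INCOME_CODES)
print(encoded)
```

Unlike `replace`, which leaves unmatched strings in place, `map` turns them into NaN, so a later `isna().sum()` check would catch a misspelled category.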


Drop the remaining string-valued columns so that only numeric predictors are left:

In [8]:
df = df.drop(["marital-status" , "serv", "relationship"], axis=1)

Run logistic regression:

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Instantiate the model
logreg = LogisticRegression(max_iter=150)

# Split the data: the column labeled 'native-country' is the response
X = df.drop(["native-country"], axis=1)
y = df["native-country"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg.fit(X_train, y_train)

LogisticRegression(max_iter=150)
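The fit above uses unscaled predictors, and fnlwgt is several orders of magnitude larger than the 0/1 columns, which tends to slow the solver's convergence. One possible variation, sketched here on synthetic data rather than on the Adult set, is to standardize the features inside a pipeline. The data and the names `scaled_logreg` and `test_accuracy` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real predictors.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X[:, 0] *= 100_000  # inflate one feature to mimic the scale of fnlwgt

# stratify=y keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fitted on the training split only, then applied to both.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=150))
scaled_logreg.fit(X_tr, y_tr)

test_accuracy = scaled_logreg.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the test split never influences the scaling statistics.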

Compute the confusion matrix and the accuracy, precision, and recall metrics:

In [10]:
# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

from sklearn.metrics import confusion_matrix, classification_report

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[4760  152]
 [1165  435]]
              precision    recall  f1-score   support

           0       0.80      0.97      0.88      4912
           1       0.74      0.27      0.40      1600

    accuracy                           0.80      6512
   macro avg       0.77      0.62      0.64      6512
weighted avg       0.79      0.80      0.76      6512



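The model's test accuracy is 80 percent, but always predicting the majority class (income at or below 50K) would already score about 75 percent (4912 of 6512). Recall for the >50K class is only 0.27, so the model misses most high earners and has clear room for improvement.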
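The headline figures can be re-derived directly from the confusion matrix printed above. A short sketch of the arithmetic, using the four counts from that output (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the notebook output: [[TN, FP], [FN, TP]].
cm = np.array([[4760, 152],
               [1165, 435]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted >50K that are truly >50K
recall = tp / (tp + fn)        # share of true >50K that the model finds
baseline = (tn + fp) / total   # accuracy of always predicting class 0

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```

Comparing `accuracy` with `baseline` shows how much the model adds over a constant prediction.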