This notebook contains my analysis of a model that predicts the income of US citizens.

Specifically, the model predicts whether a person's income is lower or higher than 50,000 US$.

In [1]:
from pathlib import Path
import pandas as pd
folder_data=Path('us_census_full')

In [2]:
df=pd.read_csv(folder_data / "census_income_learn.csv",header=None)

The categorical variables were identified beforehand: a variable is labelled categorical if its values are not continuous.

In [3]:
columns=pd.read_csv(folder_data/"column_data.csv",sep=';')

In [4]:
categorical_variables=columns.loc[columns['Type']=='CAT','Column']
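
The selection above relies on a small metadata table. Here is a minimal sketch of that step, assuming `column_data.csv` has one row per column with a `Column` name and a `Type` flag. The `NUM` label, the sample rows and the `_demo` names are illustrative assumptions; only the `CAT` label appears in the notebook.

```python
import io
import pandas as pd

# Hypothetical excerpt of column_data.csv; only the 'CAT' label is confirmed by the notebook.
metadata_csv = io.StringIO(
    "Column;Type\n"
    "age;NUM\n"
    "class of worker;CAT\n"
    "education;CAT\n"
    "wage per hour;NUM\n"
)
columns_demo = pd.read_csv(metadata_csv, sep=";")

# Boolean mask on 'Type', then keep the matching names from 'Column'.
categorical_demo = columns_demo.loc[columns_demo["Type"] == "CAT", "Column"]
print(list(categorical_demo))
```

The result is a pandas Series of column names, which can be used directly to index the data frame.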

Name the columns according to the dictionary provided in census_income_metadata.txt, lines 143 to 184.

In [5]:
df.columns=list(columns['Column'])

In [6]:
df[categorical_variables]=df[categorical_variables].apply(lambda s:s.astype(object))

In [125]:
df.dtypes

age                                             int64
class of worker                                object
detailed industry recode                       object
detailed occupation recode                     object
education                                      object
wage per hour                                   int64
enroll in edu inst last wk                     object
marital stat                                   object
major industry code                            object
major occupation code                          object
race                                           object
hispanic origin                                object
sex                                            object
member of a labor union                        object
reason for unemployment                        object
full or part time employment stat              object
capital gains                                   int64
capital losses                                  int64
dividends from stocks       

# Data exploration

In [126]:
df.head()

Unnamed: 0,age,class of worker,detailed industry recode,detailed occupation recode,education,wage per hour,enroll in edu inst last wk,marital stat,major industry code,major occupation code,...,country of birth father,country of birth mother,country of birth self,citizenship,own business or self employed,fill inc questionnaire for veteran's admin,veterans benefits,weeks worked in year,year,target
0,73,Not in universe,0,0,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
1,58,Self-employed-not incorporated,4,34,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
2,18,Not in universe,0,0,10th grade,0,High school,Never married,Not in universe or children,Not in universe,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
3,9,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
4,10,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.


# Build variables, target and weights 

In [210]:
df.columns

Index(['age', 'class of worker', 'detailed industry recode',
       'detailed occupation recode', 'education', 'wage per hour',
       'enroll in edu inst last wk', 'marital stat', 'major industry code',
       'major occupation code', 'race', 'hispanic origin', 'sex',
       'member of a labor union', 'reason for unemployment',
       'full or part time employment stat', 'capital gains', 'capital losses',
       'dividends from stocks', 'tax filer stat',
       'region of previous residence', 'state of previous residence',
       'detailed household and family stat',
       'detailed household summary in household', 'instance weight',
       'migration code-change in msa', 'migration code-change in reg',
       'migration code-move within reg', 'live in this house 1 year ago',
       'migration prev res in sunbelt', 'num persons worked for employer',
       'family members under 18', 'country of birth father',
       'country of birth mother', 'country of birth self', 'citizenship',
 

In [211]:
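# Scratch check of the encoder on three columns.
# Note: ColumnTransformer outputs the encoded columns first and the passthrough
# columns after them, so labelling the result with df.columns misaligns names and values.
test=OrdinalEncoder()
columns=df.columns
# columns=['class of worker','year','target','wage per hour']
category_var=['class of worker','year','target']
# category_var=categorical_variables
preprocessor = ColumnTransformer([
    ('categorical', test, category_var)],
    remainder="passthrough")
z=pd.DataFrame(data=preprocessor.fit_transform(df[columns]),columns=columns)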


In [212]:
z['target'].value_counts()

0     95983
52    70314
40     2790
50     2304
26     2268
48     1806
12     1780
30     1378
20     1330
8      1126
36     1108
16      945
32      883
44      845
51      819
24      767
4       757
46      708
35      704
10      694
45      669
6       646
39      602
42      573
28      568
49      509
13      496
1       464
2       458
25      457
3       417
38      380
43      374
22      370
15      353
17      331
5       309
47      278
18      272
14      257
9       239
34      230
7       152
21      135
37      123
41       88
33       81
11       78
27       76
23       67
29       63
31       51
19       48
Name: target, dtype: int64

In [183]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# encoder=OrdinalEncoder()
encoder = OrdinalEncoder(handle_unknown="use_encoded_value",unknown_value=-1)

preprocessor = ColumnTransformer([
    ('categorical', encoder, categorical_variables)],
    remainder="passthrough")
preprocessor.fit(df)

ColumnTransformer(remainder='passthrough',
                  transformers=[('categorical',
                                 OrdinalEncoder(handle_unknown='use_encoded_value',
                                                unknown_value=-1),
                                 1                                class of worker
2                       detailed industry recode
3                     detailed occupation recode
4                                      education
6                     enroll in edu inst last wk
7                                   marital stat
8                            major industry code
9                          major occupation code
10                                          race
11                               hispanic origin
12                                           sex
13                       member of a labor union
14                       reason for unemploy...
25                  migration code-change in msa
26                  migration code-chang

In [206]:
df['target'].value_counts()

 - 50000.    187141
 50000+.      12382
Name: target, dtype: int64
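
These counts show a strong class imbalance, which gives a useful reference point for any accuracy figure. The sketch below works it out from the two counts printed above. It ignores the instance weights, so the weighted figure may differ.

```python
# Class counts copied from the value_counts() output above.
n_low = 187141   # ' - 50000.'
n_high = 12382   # ' 50000+.'

total = n_low + n_high
positive_rate = n_high / total
# Accuracy of a model that always answers ' - 50000.' (instance weights ignored).
majority_baseline = n_low / total

print(f"positive rate: {positive_rate:.3f}")
print(f"majority-class accuracy: {majority_baseline:.3f}")
```

A classifier needs to beat the majority-class accuracy to be doing anything useful on the unweighted data.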

In [185]:
df_transform=pd.DataFrame(data=preprocessor.transform(df),columns=df.columns)

In [186]:
df_transform['target'].value_counts()

0.0     95983
52.0    70314
40.0     2790
50.0     2304
26.0     2268
48.0     1806
12.0     1780
30.0     1378
20.0     1330
8.0      1126
36.0     1108
16.0      945
32.0      883
44.0      845
51.0      819
24.0      767
4.0       757
46.0      708
35.0      704
10.0      694
45.0      669
6.0       646
39.0      602
42.0      573
28.0      568
49.0      509
13.0      496
1.0       464
2.0       458
25.0      457
3.0       417
38.0      380
43.0      374
22.0      370
15.0      353
17.0      331
5.0       309
47.0      278
18.0      272
14.0      257
9.0       239
34.0      230
7.0       152
21.0      135
37.0      123
41.0       88
33.0       81
11.0       78
27.0       76
23.0       67
29.0       63
31.0       51
19.0       48
Name: target, dtype: int64
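
The counts above run from 0 to 52 instead of showing two classes, which suggests that the column labelled `target` actually holds `weeks worked in year`. A likely cause is that `ColumnTransformer` places the transformed columns first and the passthrough columns after them, so reusing `df.columns` shifts the labels. The sketch below reproduces the effect on toy data and shows one possible fix. The toy frame and the `_demo` names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with the same kind of layout: numeric and categorical columns mixed.
toy_demo = pd.DataFrame({
    "age": [30, 45, 22],
    "sex": ["F", "M", "F"],
    "weeks worked in year": [52, 0, 40],
    "target": [" - 50000.", " 50000+.", " - 50000."],
})
cat_cols_demo = ["sex", "target"]

pre_demo = ColumnTransformer(
    [("categorical", OrdinalEncoder(), cat_cols_demo)],
    remainder="passthrough",
)
out_demo = pre_demo.fit_transform(toy_demo)

# Labelling with the original order puts 'target' on the last passthrough column.
mislabelled_demo = pd.DataFrame(out_demo, columns=toy_demo.columns)

# Output order is: encoded columns first, then the remaining columns in their original order.
ordered_names = cat_cols_demo + [c for c in toy_demo.columns if c not in cat_cols_demo]
fixed_demo = pd.DataFrame(out_demo, columns=ordered_names)

print(list(mislabelled_demo["target"]))  # values of 'weeks worked in year'
print(list(fixed_demo["target"]))        # encoded target classes
```

Recent scikit-learn versions also expose `get_feature_names_out()` on a fitted `ColumnTransformer`, which avoids rebuilding the name list by hand. If the same shift applies here, `instance weight` is mislabelled too, so the weights passed to the model would need the same correction.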

In [136]:
def get_Xyw(df):
    """Split a frame into features X, target y and instance weights w."""
    X=df.drop(columns=['instance weight','target'])
    y=df['target']
    w=df['instance weight']
    return X,y,w

In [137]:
X,y,w=get_Xyw(df_transform)

# Hyper parameter tuning 

Stratification is handled by the GridSearchCV object: with an integer cv and a classifier, it uses stratified folds.
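
As a quick check of this behaviour, the sketch below uses scikit-learn's `check_cv` helper, which resolves an integer `cv` for a classifier in the way the GridSearchCV documentation describes. The toy labels and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Imbalanced toy labels: 18 negatives, 2 positives.
y_demo = np.array([0] * 18 + [1] * 2)

# An integer cv with classifier=True resolves to a stratified splitter.
cv_demo = check_cv(cv=2, y=y_demo, classifier=True)
print(type(cv_demo).__name__)

# Count the positives that land in each test fold.
X_placeholder = np.zeros((len(y_demo), 1))
positives_per_fold = [
    int(y_demo[test_idx].sum()) for _, test_idx in cv_demo.split(X_placeholder, y_demo)
]
print(positives_per_fold)
```

Each test fold receives one of the two positives, which is what stratification is meant to guarantee.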

In [138]:
from lightgbm import LGBMClassifier

In [139]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Full grid, kept for reference; the reduced grid below is the one used for this quick run.
# params={"n_estimators":[50,100,200],"max_depth":[5,10,20]}
params={"max_iter":[50],"max_depth":[3]}

gs=GridSearchCV(LGBMClassifier(),param_grid=params,cv=2,verbose=2)

To do: improve the metric. The classes are imbalanced, so plain accuracy is dominated by the majority class.

In [140]:
gs.fit(X,y,sample_weight=w)

Fitting 2 folds for each of 1 candidates, totalling 2 fits
[CV] END ...........................max_depth=3, max_iter=50; total time=  19.7s
[CV] END ...........................max_depth=3, max_iter=50; total time=  19.0s

GridSearchCV(cv=2, estimator=LGBMClassifier(),
             param_grid={'max_depth': [3], 'max_iter': [50]}, verbose=2)

In [141]:
model=gs.best_estimator_

In [142]:
model.score(X,y,sample_weight=w)

0.7996878456756287
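
For a classifier, `score` with `sample_weight` returns a weighted accuracy: the share of total weight carried by correctly predicted rows. The sketch below spells that out on made-up numbers. The helper name `weighted_accuracy` is hypothetical.

```python
def weighted_accuracy(y_true, y_pred, weights):
    """Share of total weight carried by correctly predicted rows."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)

y_true_demo = [0, 0, 1, 1]
y_pred_demo = [0, 1, 1, 1]            # one mistake, on the second row
weights_demo = [1.0, 3.0, 2.0, 2.0]   # the misclassified row carries the largest weight

# Correct weight is 1 + 2 + 2 = 5 out of a total of 8.
print(weighted_accuracy(y_true_demo, y_pred_demo, weights_demo))
```

Unweighted, the same predictions would score 3 out of 4, so heavy rows can move the figure noticeably.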

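In [None]:
from sklearn.metrics import RocCurveDisplay, auc, roc_curve

In [None]:
# auc() needs the ROC curve points as arguments; this assumes a binary y
fpr, tpr, _ = roc_curve(y, model.predict_proba(X)[:, 1], sample_weight=w)
auc(fpr, tpr)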
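
`sklearn.metrics.auc` integrates a curve from its x and y coordinates, so it is normally paired with `roc_curve`. `roc_auc_score` gives the same number directly from labels and scores. The sketch below shows both on four made-up points.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

# Step 1: turn labels and scores into ROC curve points.
fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, scores_demo)
# Step 2: integrate the curve with the trapezoidal rule.
area_from_curve = auc(fpr_demo, tpr_demo)

# Shortcut: same quantity in a single call.
area_direct = roc_auc_score(y_true_demo, scores_demo)
print(area_from_curve, area_direct)
```

Three of the four positive-negative pairs are ranked correctly here, which is where the area of 0.75 comes from.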

# Prediction on test dataset 

In [143]:
df_test=pd.read_csv(folder_data / "census_income_test.csv",header=None)
df_test.columns=df.columns
df_test[categorical_variables]=df_test[categorical_variables].apply(lambda s: s.astype('object'))

In [145]:
df['target'].value_counts()

 - 50000.    187141
 50000+.      12382
Name: target, dtype: int64

In [152]:
preprocessor.transform(df_test)

array([[4.00000e+00, 6.00000e+00, 3.60000e+01, ..., 1.03238e+03,
        4.00000e+00, 1.20000e+01],
       [6.00000e+00, 3.70000e+01, 1.20000e+01, ..., 1.46233e+03,
        1.00000e+00, 2.60000e+01],
       [3.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.60175e+03,
        0.00000e+00, 0.00000e+00],
       ...,
       [6.00000e+00, 1.00000e+00, 4.30000e+01, ..., 2.08376e+03,
        2.00000e+00, 5.20000e+01],
       [4.00000e+00, 4.50000e+01, 2.00000e+00, ..., 1.68006e+03,
        5.00000e+00, 5.20000e+01],
       [3.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.58248e+03,
        0.00000e+00, 0.00000e+00]])

In [153]:
df_test_transform=pd.DataFrame(data=preprocessor.transform(df_test),columns=df_test.columns)

In [156]:
df_test_transform['target'].value_counts()

0.0     47889
52.0    35052
40.0     1362
26.0     1125
50.0     1110
48.0     1010
12.0      953
20.0      725
30.0      725
8.0       564
36.0      553
44.0      459
16.0      450
32.0      419
51.0      413
24.0      405
10.0      384
4.0       361
35.0      360
45.0      349
6.0       315
49.0      308
46.0      302
42.0      299
28.0      282
13.0      267
39.0      260
25.0      253
1.0       226
2.0       209
3.0       207
15.0      191
22.0      190
43.0      188
38.0      180
17.0      151
14.0      136
18.0      130
5.0       127
9.0       126
47.0      124
34.0      116
7.0        71
37.0       70
21.0       60
11.0       48
33.0       48
27.0       41
41.0       37
23.0       36
29.0       35
31.0       32
19.0       29
Name: target, dtype: int64

In [157]:
X_test,y_test,w_test=get_Xyw(df_test_transform)

In [158]:
model.score(X_test,y_test,sample_weight=w_test)

0.7943285165227298

# Synthesis 

Analysis of feature importance
Train and test scores (weighted accuracy): 0.800 on the training set, 0.794 on the test set