## Use pandas & sklearn to predict income
Predict whether a person's income exceeds $50K/yr (two-class classification).  
* Dataset: http://archive.ics.uci.edu/ml/datasets/Adult
* *adult.data* is the training set and *adult.test* is the test set, so you do not need to split the data manually
* *adult.names* describes the features
* Present the classification **accuracy** and try to get close to the official accuracy (85\%)
* You can use any ML model you like, see https://scikit-learn.org/stable/supervised_learning.html
* Try to figure out why some models are better than others

### Download dataset

In [1]:
import os
import urllib.request

print('Begin downloading adult dataset...')

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
description = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names'
test_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
files = {
    'adult.data': data_url,
    'adult.names': description,
    'adult.test': test_url,
}
# Download each file only if it is not already present locally.
for filename, url in files.items():
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(url, filename)

Begin downloading adult dataset...


## Data Processing

In [2]:
import numpy as np
import pandas as pd

### Read dataset
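The files are in `csv` format (i.e. items separated by `,`). In this dataset each comma is followed by a space, so string values are read with a leading space (e.g. `" ?"`).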

See [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for more usage
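As a hedged sketch of some `pd.read_csv` options that may help with this dataset: `skipinitialspace=True` drops the space after each comma, `na_values="?"` turns the missing-value marker into NaN at load time, and `skiprows=1` skips the non-data first line of the test file. The two short strings below are made-up stand-ins for `adult.data` and `adult.test`, and only four of the columns are used for brevity.

```python
import io
import pandas as pd

# Hypothetical excerpts standing in for adult.data / adult.test.
cols = ["age", "workclass", "hours-per-week", "money"]
train_text = "39, State-gov, 40, <=50K\n50, ?, 13, >50K\n"
test_text = "|1x3 Cross validator\n25, Private, 40, <=50K.\n18, ?, 30, >50K.\n"

# Options shared by both reads: strip the space after each comma
# and treat "?" as a missing value.
read_opts = dict(names=cols, skipinitialspace=True, na_values="?")

train_df = pd.read_csv(io.StringIO(train_text), **read_opts)
# The test file starts with a non-data line, so skip it.
test_df = pd.read_csv(io.StringIO(test_text), skiprows=1, **read_opts)

print(train_df.isna().sum())
print(test_df.dtypes)
```

Reading the test file this way keeps the numeric columns as integers, because the stray first line never enters the frame.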

In [3]:
attr = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","native-country","money"]
data = pd.read_csv("adult.data",names=attr)
test = pd.read_csv("adult.test",names=attr)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
test.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money
0,|1x3 Cross validator,,,,,,,,,,,,,,
1,25,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K.
2,38,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K.
3,28,Local-gov,336951.0,Assoc-acdm,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K.
4,44,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K.


In [5]:
test = test.drop([0])

In [6]:
test.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money
1,25,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K.
2,38,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K.
3,28,Local-gov,336951.0,Assoc-acdm,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K.
4,44,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K.
5,18,?,103497.0,Some-college,10.0,Never-married,?,Own-child,White,Female,0.0,0.0,30.0,United-States,<=50K.


### Missing data
The dataset contains many `?` entries, which mark missing values.
* Use [`data.isna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) to find NaN values
* First, however, replace `?` with NaN (the raw strings carry a leading space, hence `" ?"` in the next cell)

In [7]:
test.replace(" ?", np.nan, inplace=True)

In [8]:
test.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money
1,25,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K.
2,38,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K.
3,28,Local-gov,336951.0,Assoc-acdm,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K.
4,44,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K.
5,18,,103497.0,Some-college,10.0,Never-married,,Own-child,White,Female,0.0,0.0,30.0,United-States,<=50K.


In [9]:
test.isna().sum()

age                 0
workclass         963
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
money               0
dtype: int64

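To deal with missing data, we can
* Delete rows that contain a missing item (this may discard a large share of the data)
* Fill with **means / modes / maximums / other meaningful statistics**

The following cell uses a naive method: every missing value is filled with the mode of its column.

In practice, you **should** use different statistics for different types of attributes!

Note that the cells here clean only `test`; `data` still contains `?` entries, which the label encoding below treats as an ordinary category.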
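One possible way to fill by attribute type is sketched below: the median for numeric columns and the mode for categorical ones. The small frame is invented for illustration. In a real pipeline the fill values would normally be computed on the training set and reused for the test set.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numeric and one missing categorical value.
df = pd.DataFrame({
    "age": [25.0, np.nan, 38.0, 44.0],
    "workclass": ["Private", np.nan, "Private", "Local-gov"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

# Median for numeric columns, most frequent value for categorical columns.
fill_values = {c: df[c].median() for c in num_cols}
fill_values.update({c: df[c].mode().iloc[0] for c in cat_cols})

filled = df.fillna(fill_values)
print(filled)
```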

In [10]:
modes = test.mode().iloc[0]
test.fillna(modes,inplace=True)

In [11]:
test.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
money             0
dtype: int64

In [12]:
test.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money
1,25,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K.
2,38,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K.
3,28,Local-gov,336951.0,Assoc-acdm,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K.
4,44,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K.
5,18,Private,103497.0,Some-college,10.0,Never-married,Prof-specialty,Own-child,White,Female,0.0,0.0,30.0,United-States,<=50K.


### Separate data and label
We use all columns except the last as features (X) to predict the income label (Y, the last column, `money`)
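A minimal sketch of this split, assuming the label strings look like those shown in the outputs above (`<=50K` in the training file, `<=50K.` with a trailing period in the test file). The helper name `split_features_label` and the two tiny frames are hypothetical.

```python
import pandas as pd

# Toy frames; the test labels carry a trailing period as in adult.test.
train_df = pd.DataFrame({"age": [39, 50], "money": ["<=50K", ">50K"]})
test_df = pd.DataFrame({"age": [25, 28], "money": ["<=50K.", ">50K."]})

def split_features_label(df, label_col="money"):
    """Return (X, y) where y is 1 for '>50K' and 0 for '<=50K'."""
    # Normalize the label text so both files share one mapping.
    labels = df[label_col].str.strip().str.rstrip(".")
    y = labels.map({"<=50K": 0, ">50K": 1})
    X = df.drop(columns=[label_col])
    return X, y

X_train, y_train = split_features_label(train_df)
X_test, y_test = split_features_label(test_df)
print(y_train.tolist(), y_test.tolist())
```

Using one explicit mapping for both files avoids relying on two separately fitted encoders happening to agree.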

### Change text data (categorical) into number
* Use numbers to denote different categories
* Change categorical features into [one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)
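For comparison with integer codes, here is a small sketch of one-hot encoding with scikit-learn's `OneHotEncoder`. The encoder is fitted on the training data only and then applied to the test data, so both share the same columns. The two frames are made up, and `handle_unknown="ignore"` is one possible policy for categories that appear only in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column; "Never-worked" appears only in the test data.
train_cat = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test_cat = pd.DataFrame({"workclass": ["State-gov", "Never-worked"]})

# Fit on the training data, then reuse the same encoder for the test data.
encoder = OneHotEncoder(handle_unknown="ignore")
train_onehot = encoder.fit_transform(train_cat).toarray()
test_onehot = encoder.transform(test_cat).toarray()

print(encoder.categories_)
print(test_onehot)
```

With this policy an unseen category becomes an all-zero row instead of raising an error.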

In [13]:
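from sklearn.preprocessing import LabelEncoder
# Columns to integer-encode. Note that this list also contains numeric columns
# (age, fnlwgt, capital-gain, capital-loss), whose values are replaced by ranks.
attr_1 = ["age","workclass","fnlwgt","education","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","native-country","money"]
# Caution: a separate encoder is fitted on each split, so a category maps to the
# same integer in both only if the two splits contain the same set of values.
for a in attr_1:
    a_le = LabelEncoder()
    test[a] = a_le.fit_transform(test[a].values)
for a in attr_1:
    a_le = LabelEncoder()
    data[a] = a_le.fit_transform(data[a].values)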

In [14]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money
0,22,7,2671,9,13,4,1,1,4,1,25,0,40,39,0
1,33,6,2926,9,13,2,4,0,4,1,0,0,13,39,0
2,21,4,14086,11,9,0,6,1,4,1,0,0,40,39,0
3,36,4,15336,1,7,2,6,0,2,1,0,0,40,39,0
4,11,4,19355,9,13,2,10,5,2,0,0,0,40,5,0


In [15]:
test.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,money
1,8,3,8931,1,7.0,4,6,3,2,1,0,0,40.0,37,0
2,21,3,1888,11,9.0,2,4,0,4,1,0,0,50.0,37,0
3,11,1,11540,7,12.0,2,10,0,4,1,0,0,40.0,37,1
4,27,3,5146,15,10.0,2,6,0,2,1,90,0,40.0,37,1
5,1,3,2450,15,10.0,4,9,3,4,0,0,0,30.0,37,0


In [16]:
y_train = data["money"]
data.drop(["money"],axis=1,inplace=True)
data.drop(["fnlwgt"],axis=1,inplace=True)

In [17]:
data.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,22,7,9,13,4,1,1,4,1,25,0,40,39
1,33,6,9,13,2,4,0,4,1,0,0,13,39
2,21,4,11,9,0,6,1,4,1,0,0,40,39
3,36,4,1,7,2,6,0,2,1,0,0,40,39
4,11,4,9,13,2,10,5,2,0,0,0,40,5


In [18]:
y_test = test["money"]
test.drop(["money"],axis=1,inplace=True)
test.drop(["fnlwgt"],axis=1,inplace=True)

In [19]:
y_train

0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    1
32558    0
32559    0
32560    1
Name: money, Length: 32561, dtype: int32

In [20]:
y_train = y_train[:].to_numpy().astype(np.float64)

In [21]:
y_train

array([0., 0., 0., ..., 0., 0., 1.])

In [22]:
y_test

1        0
2        0
3        1
4        1
5        0
        ..
16277    0
16278    0
16279    0
16280    0
16281    1
Name: money, Length: 16281, dtype: int32

In [23]:
test.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
1,8,3,1,7.0,4,6,3,2,1,0,0,40.0,37
2,21,3,11,9.0,2,4,0,4,1,0,0,50.0,37
3,11,1,7,12.0,2,10,0,4,1,0,0,40.0,37
4,27,3,15,10.0,2,6,0,2,1,90,0,40.0,37
5,1,3,15,10.0,4,9,3,4,0,0,0,30.0,37


### Feature selection
* Variance
* Pearson correlation $R$
* $\chi^2$ test
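The three criteria can be tried out on synthetic data, as in the sketch below. The data and the choice of `k=1` are illustrative assumptions; note that scikit-learn's `chi2` scorer expects non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    np.ones(n),                           # constant column: zero variance
    y * 3 + rng.integers(0, 2, size=n),   # strongly tied to the label
    rng.integers(0, 5, size=n),           # unrelated noise
])

# 1) Variance: drop features that never change.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# 2) Pearson correlation between each remaining feature and the label.
pearson_r = [np.corrcoef(X_var[:, j], y)[0, 1] for j in range(X_var.shape[1])]

# 3) Chi-squared test: keep the single highest-scoring feature.
selector = SelectKBest(chi2, k=1).fit(X_var, y)

print(X_var.shape, pearson_r, selector.get_support())
```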

### Dimensionality reduction
* Principal Component Analysis (PCA)
* Linear Discriminant Analysis (LDA)

For more methods, please see <https://www.zhihu.com/question/29316149/answer/110159647>
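As a small illustration of PCA (LDA is not shown), the sketch below builds three synthetic features, one of which nearly duplicates another, and projects them onto two components. The data and `n_components=2` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# The third feature is almost a copy of the first, so it adds little information.
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.01 * rng.normal(size=100),
])

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

Because one feature is redundant, two components retain almost all of the variance.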

### Scaling
* Standardization: $x'=\frac{x-\bar{x}}{\sigma}$
* MinMaxScaling: $x'=\frac{x-\min}{\max-\min}$

In [24]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler = StandardScaler().fit(data)
data = scaler.transform(data)
test = scaler.transform(test)
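To connect the two formulas with the scikit-learn classes, the sketch below computes both by hand on a toy column and compares the results with `StandardScaler` and `MinMaxScaler`. The input values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# x' = (x - mean) / sigma, using the population standard deviation (ddof=0).
manual_std = (x - x.mean(axis=0)) / x.std(axis=0)
# x' = (x - min) / (max - min)
manual_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

sk_std = StandardScaler().fit_transform(x)
sk_minmax = MinMaxScaler().fit_transform(x)

print(np.allclose(manual_std, sk_std), np.allclose(manual_minmax, sk_minmax))
```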

In [25]:
data

array([[ 0.0307785 ,  2.15057856, -0.33543693, ..., -0.20417671,
        -0.03542945,  0.29156857],
       [ 0.83750854,  1.46373585, -0.33543693, ..., -0.20417671,
        -2.22215312,  0.29156857],
       [-0.04256059,  0.09005041,  0.18133163, ..., -0.20417671,
        -0.03542945,  0.29156857],
       ...,
       [ 1.4242213 ,  0.09005041,  0.18133163, ..., -0.20417671,
        -0.03542945,  0.29156857],
       [-1.21598611,  0.09005041,  0.18133163, ..., -0.20417671,
        -1.65522476,  0.29156857],
       [ 0.98418673,  0.77689313,  0.18133163, ..., -0.20417671,
        -0.03542945,  0.29156857]])

## Training & Evaluation

In [26]:
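from sklearn import linear_model
# Create the model. BayesianRidge is a regression model; its continuous output
# is thresholded at 0.5 below to obtain class labels.
# Note: the `normalize` argument was removed in scikit-learn 1.2; on recent
# versions omit it (the features are already standardized above).
reg = linear_model.BayesianRidge(normalize=True)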

In [27]:
# model fitting
reg.fit(data, y_train)

BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, alpha_init=None,
              compute_score=False, copy_X=True, fit_intercept=True,
              lambda_1=1e-06, lambda_2=1e-06, lambda_init=None, n_iter=300,
              normalize=True, tol=0.001, verbose=False)

In [28]:
# prediction
y_pred = reg.predict(test)

In [29]:
y_pred

array([-0.00584884,  0.24755473,  0.33617386, ...,  0.43788636,
        0.73027191,  0.44200389])

In [30]:
t = [1 if x >= 0.5 else 0 for x in y_pred]
t

[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...]


In [31]:
# element-wise comparison of the predicted labels with the true labels
tt = (t == y_test)

In [32]:
np.mean(tt)

0.8182544069774584
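Since the task is classification, it may be worth comparing dedicated classifiers and scoring them with `accuracy_score`. The sketch below does this on synthetic data from `make_classification`, not on the Adult dataset, so the printed numbers say nothing about the 85% target. The two models and their settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class problem standing in for the real features and labels.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # accuracy_score is the fraction of test samples predicted correctly.
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Running the same loop on the prepared Adult features would give a like-for-like comparison between models.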