# Machine Learning (Pandas & Sklearn)

* [Pandas](https://pandas.pydata.org/): Python Data Analysis Library
    * Pandas is usually used for data reading and preprocessing
    * `pip install pandas`
    * Tutorial: <https://pandas.pydata.org/docs/getting_started/index.html>
    * Cheat sheet: <https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf>
* [Scikit-learn](http://scikit-learn.org/) (sklearn)
    * Scikit-learn provides a wide range of machine learning algorithms, covering classification, regression, and clustering
    * `pip install scikit-learn`
    * Tutorial: <https://scikit-learn.org/stable/getting_started.html>
    * API Reference: <https://scikit-learn.org/stable/modules/classes.html>
    * Cheat sheet: <https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf>
    
In this demo, we use a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php). Specifically, we use the [Adult](https://archive.ics.uci.edu/ml/datasets/Adult) dataset: given census attributes of a person (age, education, occupation, weekly working hours, and so on), we predict whether their annual income exceeds 50K (the `income` column).

## Download dataset

In [637]:
import os
import urllib.request

print('Begin downloading Adult dataset...')

# We use the UCI Machine Learning Repository's Adult dataset here
datatrain_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
datatest_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
description_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names'

# Download each file only if it is not already on disk.
# Note: the first line of adult.test is a banner ("|1x3 Cross validator"), not a data row.
for url, filename in [(datatrain_url, 'adult.data'),
                      (datatest_url, 'adult.test'),
                      (description_url, 'adult.names')]:
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(url, filename)

Begin downloading Adult dataset...


### Attribute Information

|     Attribute             |  Attribute Range |
| :---: | :---: |
|  1. age                   |  continuous |
|  2. workclass             |  Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
|  3. fnlwgt                |  continuous |
|  4. education             |  Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool |
|  5. education-num         |  continuous |
|  6. marital-status        |  Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse |
|  7. occupation            |  Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces |
|  8. relationship          |  Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
|  9. race                  |  White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black |
| 10. sex                   |  Female, Male |
| 11. capital-gain          |  continuous |
| 12. capital-loss          |  continuous |
| 13. hours-per-week        |  continuous |
| 14. native-country        |  United-States, Cuba, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, Vietnam, Yugoslavia, and others |
| 15. income                |  >50K, <=50K |

## Data Processing

In [638]:
import numpy as np
import pandas as pd

### Read dataset
Commonly in `csv` format (i.e. items separated by `,`)

See [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for more usage

In [639]:
attr = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]
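traindata = pd.read_csv("adult.data", names=attr)
# the first line of adult.test is a "|1x3 Cross validator" banner rather than data, so skip it
testdata = pd.read_csv("adult.test", names=attr, skiprows=1)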
testdata.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.
5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K.
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K.
7,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K.
8,24,Private,369667,Some-college,10,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K.
9,55,Private,104996,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K.


### Different types of data
Ref: <https://towardsdatascience.com/data-types-in-statistics-347e152e8bee>

* Categorical
    * Nominal: No order, e.g. sex
    * Ordinal: With order, e.g. education

* Numerical
    * Discrete: Can't be measured but can be counted, e.g. the number of times an event occurs
    * Continuous: Can't be counted but can be measured, e.g. temperature
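
As a rough illustration of how these types can be represented in pandas, the sketch below builds a small hypothetical frame (the rows and the education ordering are assumptions made for this example, not taken from the dataset documentation). Nominal data maps to an unordered `Categorical`, ordinal data to an ordered one.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset used in this demo
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female"],
    "education": ["HS-grad", "Bachelors", "11th"],
    "hours-per-week": [40, 13, 30],
})

# Nominal: categories without any order
df["sex"] = pd.Categorical(df["sex"])

# Ordinal: categories with an explicit order (this ordering is an assumption)
edu_order = ["11th", "HS-grad", "Bachelors"]
df["education"] = pd.Categorical(df["education"], categories=edu_order, ordered=True)

# Ordered categoricals expose integer codes and comparisons that respect the order
edu_codes = df["education"].cat.codes.tolist()
above_hs = (df["education"] > "HS-grad").tolist()
```

Numerical columns such as `hours-per-week` simply keep their integer or float dtype.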

### Missing data
This dataset contains many `?` entries, which mark missing values
* Use [`data.isna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) to find NaN values
* But we must first replace `?` with NaN (note the leading space in `' ?'`, because the fields in the file are separated by `, `)
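
An alternative to replacing after loading is to let `pd.read_csv` do the conversion while parsing. The sketch below is not what this notebook does; it uses a two-line inline sample (made up to mimic the file layout) and assumes the `na_values` and `skipinitialspace` options are acceptable for your data.

```python
import io
import pandas as pd

# Two sample rows in the same ", "-separated layout as the data file
raw = "39, State-gov, 77516\n18, ?, 103497\n"

df = pd.read_csv(
    io.StringIO(raw),
    names=["age", "workclass", "fnlwgt"],
    na_values="?",            # treat "?" as missing
    skipinitialspace=True,    # strip the space that follows each comma
)

n_missing = df.isna().sum()
```

With this approach the category strings no longer carry a leading space, so later string comparisons would need to use `'Private'` rather than `' Private'`.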

In [640]:
testdata.replace(' ?', np.nan, inplace=True)
traindata.replace(' ?', np.nan, inplace=True)
testdata.replace(' <=50K.', 0, inplace=True)
traindata.replace(' <=50K', 0, inplace=True)
traindata.replace(' >50K', 1, inplace=True)
testdata.replace(' >50K.', 1, inplace=True)

In [641]:
traindata.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
income               0
dtype: int64

To deal with missing data, we can
* Delete rows with missing items (this may discard a large share of the data)
* Fill with **means / modes / maximums / other meaningful metrics**

The following is only a naive approach: every missing value is filled with the mode of its column.

In practice, you **should** use different metrics for different types of attributes!
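
A minimal sketch of type-dependent filling, assuming the median is a reasonable choice for numeric columns and the mode for categorical ones (the toy frame below is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "workclass": ["Private", np.nan, "Private", "State-gov"],
})

filled = df.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # numeric column: fill with the median, which is robust to outliers
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # categorical column: fill with the most frequent value
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
```

In a train/test setting, one would typically compute the fill values on the training set only and reuse them for the test set, so that no test-set statistics influence preprocessing.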

In [642]:
modes = testdata.mode().iloc[0]
testdata.fillna(modes,inplace=True)
modes = traindata.mode().iloc[0]
traindata.fillna(modes,inplace=True)

In [643]:
traindata.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64

In [644]:
traindata.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


### Separate data and label
We use the remaining features (X) to predict income (Y, the last column). The `fnlwgt` column is dropped as well.

In [645]:
testy = testdata["income"]
trainy = traindata["income"]
testdata.drop(["income"],axis=1,inplace=True)
traindata.drop(["income"],axis=1,inplace=True)
testdata.drop(["fnlwgt"],axis=1,inplace=True)
traindata.drop(["fnlwgt"],axis=1,inplace=True)

### Change text data (categorical) into number
* Use numbers to denote different categories
* Change categorical features into [one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)

In [646]:
cat_attr = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]
for a in cat_attr:
    testdata[a] = pd.Categorical(testdata[a]) # change column type to categorical
    traindata[a] = pd.Categorical(traindata[a])
    testdummies = pd.get_dummies(testdata[a],prefix="{}_category".format(a))
    traindummies = pd.get_dummies(traindata[a],prefix="{}_category".format(a))
    testdata = pd.concat([testdata,testdummies],axis=1)
    traindata = pd.concat([traindata,traindummies],axis=1)
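# one-hot columns that exist only in the training set (categories never seen in the test set)
diff = [x for x in traindata.columns if x not in testdata.columns]
# drop the original text columns, then the train-only columns, so both frames share the same features
testdata.drop(cat_attr, axis=1, inplace=True)
traindata.drop(cat_attr, axis=1, inplace=True)
traindata.drop(diff, axis=1, inplace=True)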
traindata.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_category_ Federal-gov,workclass_category_ Local-gov,workclass_category_ Never-worked,workclass_category_ Private,workclass_category_ Self-emp-inc,...,native-country_category_ Portugal,native-country_category_ Puerto-Rico,native-country_category_ Scotland,native-country_category_ South,native-country_category_ Taiwan,native-country_category_ Thailand,native-country_category_ Trinadad&Tobago,native-country_category_ United-States,native-country_category_ Vietnam,native-country_category_ Yugoslavia
0,39,13,2174,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,13,0,0,13,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,9,0,0,40,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
3,53,7,0,0,40,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
4,28,13,0,0,40,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
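
The loop above keeps the two frames compatible by dropping training columns that have no counterpart in the test set. Another way to get matching columns, sketched here on hypothetical toy frames, is to encode both frames and then reindex the test frame to the training columns, so unseen categories become all-zero columns instead of being discarded.

```python
import pandas as pd

# Hypothetical toy data: "Holland" occurs only in the training frame
train = pd.DataFrame({"age": [39, 50, 28], "country": ["US", "Cuba", "Holland"]})
test = pd.DataFrame({"age": [25, 38], "country": ["US", "US"]})

train_enc = pd.get_dummies(train, columns=["country"], dtype=int)
test_enc = pd.get_dummies(test, columns=["country"], dtype=int)

# Give the test frame exactly the training columns;
# categories that never appear in the test data are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

Which variant is preferable depends on the model; dropping the column loses a (rare) signal in training, while reindexing keeps it.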


### Feature selection
* Variance
* Pearson correlation $R$
* $\chi^2$ test
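
The three criteria can be sketched with scikit-learn and NumPy as follows. The data is synthetic and the thresholds (`threshold=0.0`, `k=1`) are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    np.ones(n),                      # constant feature: zero variance
    y + rng.integers(0, 2, size=n),  # feature correlated with the label
    rng.integers(0, 5, size=n),      # pure noise
])

# 1) Variance: drop features whose variance does not exceed the threshold
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)
kept_by_variance = vt.get_support()

# 2) Pearson correlation R between each remaining feature and the label
pearson_r = [np.corrcoef(X_var[:, j], y)[0, 1] for j in range(X_var.shape[1])]

# 3) chi-squared test: keep the single best feature (chi2 requires non-negative values)
skb = SelectKBest(chi2, k=1).fit(X_var, y)
kept_by_chi2 = skb.get_support()
```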

### Dimensionality reduction
* Principal Component Analysis (PCA)
* Linear Discriminant Analysis (LDA)

For more methods, please see <https://www.zhihu.com/question/29316149/answer/110159647>
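
As a small illustration of PCA, the sketch below projects synthetic 5-dimensional data, constructed from two latent factors plus a little noise, down to two components. The data and `n_components=2` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))

# Five observed features that are linear combinations of two latent ones, plus small noise
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + base[:, 1],
    2 * base[:, 0],
    -base[:, 1],
]) + 0.01 * rng.normal(size=(100, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of the total variance retained by the two components
explained = pca.explained_variance_ratio_.sum()
```

Because the data is essentially two-dimensional, almost all variance survives the projection; on real data the retained share is usually lower and worth checking.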

### Separate train and test data
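The Adult dataset ships with a separate test file (`adult.test`), so we use it directly as the test set. When no test/validation data are available, the data has to be split manually into train and test sets.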

In [675]:
X_train = traindata
# `np.int` was removed in NumPy 1.24; the built-in `int` behaves the same here
y_train = trainy.to_numpy().astype(int)
X_test = testdata
y_test = testy.to_numpy().astype(int)
print("Train size: {}".format(len(X_train)))
print("Test size: {}".format(len(X_test)))

Train size: 32561
Test size: 16281


In [676]:
"""
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, y, random_state=3)
"""

'\nfrom sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(data, y, random_state=3)\n'

### Scaling
* Standardization (z-score normalization): $x'=\frac{x-\bar{x}}{\sigma}$
* MinMaxScaling: $x'=\frac{x-\min}{\max-\min}$
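
Both formulas can be checked against scikit-learn on a tiny hypothetical column. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with four samples (values chosen for illustration)
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Standardization: subtract the mean, divide by the standard deviation
z_manual = (x - x.mean()) / x.std()
z_sklearn = StandardScaler().fit_transform(x)

# Min-max scaling: map the observed range onto [0, 1]
mm_manual = (x - x.min()) / (x.max() - x.min())
mm_sklearn = MinMaxScaler().fit_transform(x)
```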

In [677]:
X_test

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_category_ Federal-gov,workclass_category_ Local-gov,workclass_category_ Never-worked,workclass_category_ Private,workclass_category_ Self-emp-inc,...,native-country_category_ Portugal,native-country_category_ Puerto-Rico,native-country_category_ Scotland,native-country_category_ South,native-country_category_ Taiwan,native-country_category_ Thailand,native-country_category_ Trinadad&Tobago,native-country_category_ United-States,native-country_category_ Vietnam,native-country_category_ Yugoslavia
0,25,7,0,0,40,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,38,9,0,0,50,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
2,28,12,0,0,40,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,44,10,7688,0,40,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
4,18,10,0,0,30,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,13,0,0,36,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
16277,64,9,0,0,40,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
16278,38,13,0,0,50,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
16279,44,13,5455,0,40,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0


In [664]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [665]:
X_test

array([[ 2.50000000e+01,  7.00000000e+00, -1.30931372e-17, ...,
         1.00000000e+00, -2.30220996e-17, -5.23725489e-18],
       [ 3.80000000e+01,  9.00000000e+00, -1.30931372e-17, ...,
         1.00000000e+00, -2.30220996e-17, -5.23725489e-18],
       [ 2.80000000e+01,  1.20000000e+01, -1.30931372e-17, ...,
         1.00000000e+00, -2.30220996e-17, -5.23725489e-18],
       ...,
       [ 3.80000000e+01,  1.30000000e+01, -1.30931372e-17, ...,
         1.00000000e+00, -2.30220996e-17, -5.23725489e-18],
       [ 4.40000000e+01,  1.30000000e+01,  5.45500000e+03, ...,
         1.00000000e+00, -2.30220996e-17, -5.23725489e-18],
       [ 3.50000000e+01,  1.30000000e+01, -1.30931372e-17, ...,
         1.00000000e+00, -2.30220996e-17, -5.23725489e-18]])

## Training & Evaluation

In [699]:
from sklearn.linear_model import LinearRegression

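# create model
# note: the `normalize` argument was removed in scikit-learn 1.2;
# on newer versions use LinearRegression() (the features are already scaled above)
lr = LinearRegression(normalize=True)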

In [700]:
# model fitting
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

In [701]:
# prediction
y_pred = lr.predict(X_test)

In [702]:
# threshold the continuous regression output at 0.5 to obtain 0/1 class labels
y_pred[y_pred>0.5] = 1
y_pred[y_pred!=1] = 0

In [703]:
1-abs(y_pred - y_test).sum()/len(y_test)

0.8412873902094467

In [704]:
# evaluate: for 0/1 labels the mean squared error equals the error rate (1 - accuracy)
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred)

0.1587126097905534
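
The manual accuracy expression and the MSE above can also be obtained from `sklearn.metrics`. The sketch below uses a handful of made-up labels and scores (not results from this notebook) to show that the quantities agree, and adds a confusion matrix for a per-class view.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Hypothetical true labels and continuous regression outputs
y_true = np.array([0, 1, 1, 0, 1])
scores = np.array([0.2, 0.7, 0.4, 0.6, 0.9])

# Threshold at 0.5, as done above
y_hat = (scores > 0.5).astype(int)

manual_acc = 1 - np.abs(y_hat - y_true).sum() / len(y_true)
acc = accuracy_score(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
```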

In [705]:
from sklearn import linear_model
reg = linear_model.BayesianRidge()

In [706]:
reg.fit(X_train,y_train)

BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, alpha_init=None,
              compute_score=False, copy_X=True, fit_intercept=True,
              lambda_1=1e-06, lambda_2=1e-06, lambda_init=None, n_iter=300,
              normalize=False, tol=0.001, verbose=False)

In [707]:
y_pred=reg.predict(X_test)

In [708]:
y_pred[y_pred>0.5] = 1

In [709]:
y_pred[y_pred!=1] = 0

In [710]:
1-abs(y_pred - y_test).sum()/len(y_test)

0.8406731773232603