# Study Unit 5 Data Analytics in Python
Yao Renjie (rjyao001@suss.edu.sg)

## scikit-learn

* Simple and efficient tools for **predictive data analysis**
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license

## Algorithms in scikit-learn

1. Supervised learning: learn with labeled data
2. Unsupervised learning: learn with unlabeled data
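
As a minimal sketch of the difference, the example below fits one supervised and one unsupervised estimator on the same toy data. The data values and the choice of `LinearRegression` and `KMeans` are illustrative assumptions, not part of the course dataset. The key point is that the supervised `fit` receives labels `y`, while the unsupervised `fit` receives only `X`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Toy data (made up for illustration): one feature, four samples
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])  # labels, used only by the supervised model

# Supervised: fit() receives both the features X and the labels y
reg = LinearRegression().fit(X, y)
prediction = reg.predict(np.array([[5.0]]))

# Unsupervised: fit() receives only X and groups the samples on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_labels = km.labels_

print(prediction)
print(cluster_labels)
```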

In [33]:
# Install a pip package in the current Jupyter kernel
# https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/
import sys
!{sys.executable} -m pip install scikit-learn



## Page 5: Install and import scikit-learn

Importing the whole `sklearn` package takes longer. Usually we import only the submodules and estimators we need.

In [34]:
!pip install scikit-learn



In [35]:
# import sklearn

from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

## Page 7 Activity

In [36]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
from sklearn import tree
from sklearn.cluster import KMeans
from sklearn import decomposition 

## Page 8: Discussion

1. What is the difference between supervised and unsupervised machine learning?
2. Is it sensible to use an alias when importing a module or an estimator from the scikit-learn package?

## Page 10: Specify and remove missing values

1. If `na_filter` is True (the default), pandas detects missing-value markers, such as empty fields `""` and the strings listed in `na_values`, and converts them to `NaN`.
2. With `na_values`, we can declare additional strings in the file to be recognised as missing values. 
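
The two options can be tried without any file on disk. The sketch below reads a small in-memory CSV whose contents are hypothetical, chosen only to include one `na_string` marker and one empty field. It compares `na_filter=True` with `na_filter=False`, then removes incomplete rows with `dropna`.

```python
import io
import pandas as pd

# Hypothetical CSV content: one 'na_string' marker and one empty Price field
csv_text = "Make,Price\nAcura,na_string\nAlfa Romeo,40350\nAudi,\n"

# na_filter=True: empty fields and the strings in na_values become NaN
with_filter = pd.read_csv(io.StringIO(csv_text), na_values='na_string', na_filter=True)

# na_filter=False: missing-value detection is switched off, so the markers stay as text
without_filter = pd.read_csv(io.StringIO(csv_text), na_values='na_string', na_filter=False)

# dropna returns a new DataFrame, so the result must be assigned to keep it
cleaned = with_filter.dropna(axis=0, how='any')

print(with_filter)
print(without_filter)
print(cleaned)
```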


In [37]:
# make sure you upload the csv file first

#car_model = pd.read_csv('./car_model.csv', na_values='na_string', na_filter=True)
# car_price = pd.read_csv('./car_price_model.csv', na_values='na_string', na_filter=True)
#car_df = car_model.merge(car_price, how='inner', on=['Year', 'Make', 'Model'])
car_df = pd.read_csv('./car_price_model.csv', na_values='na_string', na_filter=True)
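# dropna() returns a new DataFrame; the result is not assigned here,
# so car_df itself still contains the rows with NaN (see the output below)
car_df.dropna(axis=0, how='any')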
car_df.head()
#display(car_df)

Unnamed: 0,Year,Make,Model,Price
0,2021,Acura,ILX,
1,2021,Acura,RDX,
2,2021,Acura,TLX,37500.0
3,2021,Alfa Romeo,Giulia,40350.0
4,2021,Alfa Romeo,Stelvio,42350.0


## Page 14-16: Selection using iloc vs. loc

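Selecting by label with `loc` avoids "magic numbers": hard-coded positions such as `0:1` in `iloc`, whose meaning is unclear to the reader and which break silently when the column order changes.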

In [38]:
display(car_df.loc[:, ['Year']].head(2))
display(car_df.iloc[:, 0:1].head(2))

Unnamed: 0,Year
0,2021
1,2021


Unnamed: 0,Year
0,2021
1,2021
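
To illustrate why label-based selection is more robust, the sketch below reorders the columns of a small stand-in frame (hypothetical, not the real `car_df`) and repeats both selections. The `loc` call still returns `Year`, while the `iloc` call returns whatever column happens to be in position 0.

```python
import pandas as pd

# Hypothetical stand-in for car_df with three columns
df = pd.DataFrame({'Year': [2021, 2021],
                   'Make': ['Acura', 'Acura'],
                   'Model': ['ILX', 'RDX']})

# Reorder the columns, as might happen after a merge or a change in the source file
reordered = df[['Make', 'Year', 'Model']]

by_label = reordered.loc[:, ['Year']]    # still selects the Year column
by_position = reordered.iloc[:, 0:1]     # now selects Make, with no error raised

print(list(by_label.columns))
print(list(by_position.columns))
```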


## Page 17: Rename variables

Standardize the column names.

In [39]:
car_df = car_df.rename(columns={'model': 'Model', 'category': 'Category'})

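# rename() ignores keys that are not existing column names, so columns already named 'Model' are unaffected
# if you do not assign back to car_df, you will still see the old column names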
car_df.head()

Unnamed: 0,Year,Make,Model,Price
0,2021,Acura,ILX,
1,2021,Acura,RDX,
2,2021,Acura,TLX,37500.0
3,2021,Alfa Romeo,Giulia,40350.0
4,2021,Alfa Romeo,Stelvio,42350.0


## Page 18: Create dummy variables

Dummy coding turns categorical data into 0/1 indicator columns.
In our example, it is a 2-bit code.
1. `Acura` => `10`
2. `Alfa Romeo` => `01`

But, can we use a 1-bit code? For example
1. `Acura` => `0`
2. `Alfa Romeo` => `1`

In general, we only need `K-1` bits to represent `K` categories, which is what `drop_first=True` produces.

In [40]:
column_make = car_df['Make'].head(5)
display(column_make)
display(pd.get_dummies(column_make, drop_first=False))
display(pd.get_dummies(column_make, drop_first=True))

0         Acura
1         Acura
2         Acura
3    Alfa Romeo
4    Alfa Romeo
Name: Make, dtype: object

Unnamed: 0,Acura,Alfa Romeo
0,1,0
1,1,0
2,1,0
3,0,1
4,0,1


Unnamed: 0,Alfa Romeo
0,0
1,0
2,0
3,1
4,1
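
The same idea with three categories (K = 3) can be sketched as follows. The extra make `Audi` is added purely for illustration. With `drop_first=True`, the first category in sorted order is dropped and is identified by all remaining dummy columns being 0.

```python
import pandas as pd

# Three categories (K = 3); 'Audi' is a made-up addition for this example
makes = pd.Series(['Acura', 'Alfa Romeo', 'Audi', 'Acura'], name='Make')

full = pd.get_dummies(makes, drop_first=False)    # K columns
reduced = pd.get_dummies(makes, drop_first=True)  # K - 1 columns

# The dropped category is recovered as the rows where every remaining dummy is 0
is_acura = (reduced.sum(axis=1) == 0)

print(full.shape[1], reduced.shape[1])
print(is_acura.tolist())
```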


In [41]:
car_df.head()

Unnamed: 0,Year,Make,Model,Price
0,2021,Acura,ILX,
1,2021,Acura,RDX,
2,2021,Acura,TLX,37500.0
3,2021,Alfa Romeo,Giulia,40350.0
4,2021,Alfa Romeo,Stelvio,42350.0


In [42]:
year_str_column = car_df.astype({'Year': str})['Year']

In [43]:
year_str_column.head()

0    2021
1    2021
2    2021
3    2021
4    2021
Name: Year, dtype: object

In [44]:
display(type(year_str_column[0])) # element type of the Series after `astype`
display(type(car_df['Year'][0])) # element type in the original DataFrame

str

numpy.int64
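
One plausible reason for casting `Year` to `str` (an assumption; the notes do not state it) is that `pd.get_dummies` applied to a whole DataFrame only encodes object, string, or category columns by default and passes numeric columns through unchanged. The sketch below uses a small hypothetical frame to show the difference.

```python
import pandas as pd

# Hypothetical frame with a numeric Year column
df = pd.DataFrame({'Year': [2020, 2021, 2021],
                   'Make': ['Acura', 'Acura', 'Audi']})

# Year is numeric: it is left as a single numeric column
numeric_year = pd.get_dummies(df)

# Year is cast to str: it is dummy coded like any other categorical column
string_year = pd.get_dummies(df.astype({'Year': str}))

print(list(numeric_year.columns))
print(list(string_year.columns))
```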

## Page 19: *Data Transformation


* The `StandardScaler` assumes your data is normally distributed within each feature and will scale them such that the distribution is now centred around 0, with a standard deviation of 1. `z = (x - u) / s, u: mean, s: standard deviation.` **If data is not normally distributed, this is not the best scaler to use.**

![](./images/StandardScaler.png)
* The `Normalizer` scales each value by dividing each value by its magnitude in n-dimensional space for n number of features. Each sample is rescaled independently of other samples so that its **norm (l1, l2 or inf) equals one**.
For l2 norm, [1, 2, 3] => [1/sqrt(1\*1 + 2\*2 + 3\*3), ...]

![](./images/Normalizer.png)

In [45]:
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[0, 2, 8], 
              [2, 3, 4],
              [3, 3, 4]])
print(np.mean(X, axis=0))
print(np.std(X, axis=0))

# works on **columns**
# the first column [0,2,3] has mean=1.667 and population standard deviation=1.247
# 0 becomes (0-1.667)/1.247 = -1.336
# 2 becomes (2-1.667)/1.247 = 0.267
scaler = preprocessing.StandardScaler()
# fit(X) computes and stores the per-column mean and standard deviation used by transform(X)
scaler.fit(X)
scaler.transform(X)

[1.66666667 2.66666667 5.33333333]
[1.24721913 0.47140452 1.88561808]


array([[-1.33630621, -1.41421356,  1.41421356],
       [ 0.26726124,  0.70710678, -0.70710678],
       [ 1.06904497,  0.70710678, -0.70710678]])
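
As a check on the formula `z = (x - u) / s`, the sketch below computes the z-scores by hand with NumPy and compares them with the `StandardScaler` result on the same matrix. It assumes the population standard deviation (`ddof=0`, NumPy's default), which matches the values printed above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 2, 8],
              [2, 3, 4],
              [3, 3, 4]], dtype=float)

# Manual z-score per column, using the population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# fit() learns the column means and standard deviations; transform() applies them
scaler = StandardScaler().fit(X)
scaled = scaler.transform(X)

print(np.allclose(manual, scaled))
print(scaler.mean_)
print(scaled.std(axis=0))
```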

In [46]:
X = np.array([[1, 2, 7], 
              [3, 3, 4]])

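# works on **rows**: with norm='l1', each row is divided by the sum of its absolute values
# [1,2,7] becomes [1/10, 2/10, 7/10]
normalizer = preprocessing.Normalizer(norm='l1') 
# Normalizer is stateless: fit(X) learns nothing from the data and returns the estimator unchanged
normalizer.fit(X)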
normalizer.transform(X)

array([[0.1, 0.2, 0.7],
       [0.3, 0.3, 0.4]])
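
The l2 case mentioned earlier, `[1, 2, 3] => [1/sqrt(1*1 + 2*2 + 3*3), ...]`, can be verified in the same way. This is a minimal sketch with made-up rows. It compares `Normalizer(norm='l2')` with a manual row-wise division and confirms that every row ends up with norm 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 7.0]])

# l2: divide each row by the square root of the sum of its squared values
l2_rows = Normalizer(norm='l2').fit_transform(X)
manual_l2 = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

# l1: divide each row by the sum of its absolute values
l1_rows = Normalizer(norm='l1').fit_transform(X)

print(np.allclose(l2_rows, manual_l2))
print(np.linalg.norm(l2_rows, axis=1))
print(l1_rows[1])
```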

## Page 20: Training and Testing Data

In [50]:
train_df, test_df = train_test_split(car_df, train_size=0.75, test_size=0.25)
# DataFrame.size counts cells (rows x columns), not rows; use len(df) for the number of rows
print(f'car_df size: {car_df.size}, train_df size: {train_df.size}, test_df size: {test_df.size}')

car_df size: 960, train_df size: 720, test_df size: 240
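
The sketch below, on a small hypothetical frame, shows two details of `train_test_split` that are easy to miss. First, `len` gives the number of rows while `.size` gives rows times columns. Second, passing `random_state` (an addition here; the cell above does not set it) makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with 8 rows and 2 columns
df = pd.DataFrame({'x': range(8), 'y': range(8, 16)})

# The same random_state gives the same split on every call
train_a, test_a = train_test_split(df, train_size=0.75, random_state=10)
train_b, test_b = train_test_split(df, train_size=0.75, random_state=10)

print(len(train_a), len(test_a))    # number of rows
print(train_a.size, test_a.size)    # number of cells (rows x columns)
print(train_a.index.equals(train_b.index))
```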


## Page 23: Activity

In [51]:
car = pd.read_csv('./car_model.csv', na_values='na_string', na_filter=True)
new_car_df = car.drop(columns=['Model'])
new_car_df.head()

Unnamed: 0,Year,Make,Category
0,2021,Acura,Sedan
1,2021,Acura,SUV
2,2021,Acura,Sedan
3,2021,Alfa Romeo,Sedan
4,2021,Alfa Romeo,SUV


In [52]:
#print(new_car_df['Category'])
c_unique = new_car_df['Category'].unique()
#print(c_unique)

# some entries list several categories separated by ',' (e.g. 'Convertible,Sedan,Coupe')
# for each such entry, we split the string on ','
# and keep only the first element of the split

# d is a new dictionary that stores data values in key:value pairs.
d = {}

for c in c_unique:
    # print(c)
    if ',' in c:
        # d has c as the key, and the first item before ',' as the value
        # example: the key 'Convertible,Sedan,Coupe' maps to the value 'Convertible'
        d[c] = c.split(',')[0]

print("===")
print(d)

# the same dictionary as a one-line dict comprehension (more compact, but harder to read)
# d = {category: category.split(',')[0] for category in new_car_df['Category'].unique() if ',' in category}

===
{'Convertible,Sedan,Coupe': 'Convertible', 'Convertible,Coupe': 'Convertible', 'Coupe,Convertible': 'Coupe', 'Hatchback,Sedan': 'Hatchback', 'Wagon,Sedan': 'Wagon'}


In [None]:
# verify that the command works
# each entry of d is a key:value pair
# the key (before ':') is a unique category found in new_car_df['Category'].unique()
# the value (after ':') is the item before the first ',' in the key
# note: entries with a single category contain no ',' and are therefore not captured
d

In [53]:
# show all unique entries in the column, Category
display(new_car_df['Category'].unique())

array(['Sedan', 'SUV', 'Wagon', 'Convertible,Sedan,Coupe', 'Hatchback',
       'Coupe', 'Pickup', 'Convertible,Coupe', 'Van/Minivan',
       'Coupe,Convertible', 'Hatchback,Sedan', 'Convertible',
       'Wagon,Sedan'], dtype=object)

In [54]:
# replace the entries of Category, using the dictionary d
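# assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not propagate to new_car_df
new_car_df['Category'] = new_car_df['Category'].replace(to_replace=d)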
new_car_df

Unnamed: 0,Year,Make,Category
0,2021,Acura,Sedan
1,2021,Acura,SUV
2,2021,Acura,Sedan
3,2021,Alfa Romeo,Sedan
4,2021,Alfa Romeo,SUV
...,...,...,...
235,2021,Volvo,Wagon
236,2021,Volkswagen,SUV
237,2021,Volvo,SUV
238,2021,Volvo,SUV


In [55]:
# check that the replacement works, by looking at unique entries in Category
display(new_car_df['Category'].unique())

array(['Sedan', 'SUV', 'Wagon', 'Convertible', 'Hatchback', 'Coupe',
       'Pickup', 'Van/Minivan'], dtype=object)
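
As an alternative to building the dictionary `d`, the same "keep the first category" rule can be sketched with pandas string methods. The short Series below is a hypothetical sample of the `Category` values. The snippet checks that the vectorized version agrees with the dictionary-based replacement.

```python
import pandas as pd

# Hypothetical sample of Category values
category = pd.Series(['Sedan', 'Convertible,Sedan,Coupe', 'Coupe,Convertible', 'Van/Minivan'])

# Dictionary approach, as in the cells above
d = {c: c.split(',')[0] for c in category.unique() if ',' in c}
via_dict = category.replace(d)

# Vectorized alternative: split every entry on ',' and keep the first piece
via_str = category.str.split(',').str[0]

print(via_str.tolist())
print(via_dict.tolist() == via_str.tolist())
```

The string-method version needs no explicit loop, while the dictionary version makes the mapping itself easy to inspect.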

In [56]:
new_car_df

Unnamed: 0,Year,Make,Category
0,2021,Acura,Sedan
1,2021,Acura,SUV
2,2021,Acura,Sedan
3,2021,Alfa Romeo,Sedan
4,2021,Alfa Romeo,SUV
...,...,...,...
235,2021,Volvo,Wagon
236,2021,Volkswagen,SUV
237,2021,Volvo,SUV
238,2021,Volvo,SUV


In [57]:
# split into a training dataframe, and a testing dataframe with a 70%-30% split
train_df, test_df = train_test_split(new_car_df, train_size=0.7, random_state=10)

In [59]:
# we hold out a testing dataframe to check whether the ML method has learned a relationship that generalises
# by definition, the testing dataframe is not used in the training process
# note: .size counts cells (rows x columns), not rows
print(f'car_df size: {new_car_df.size}, train_df size: {train_df.size}, test_df size: {test_df.size}')

car_df size: 720, train_df size: 504, test_df size: 216
