# Machine Learning

## Introduction

- Applications
  - natural language processing,
  - search engines,
  - medical diagnosis,
  - detecting credit card fraud,
  - stock market analysis,
  - bio-informatics, e.g. classifying DNA sequences,
  - speech and handwriting recognition,
  - object recognition in computer vision,
  - playing games – learning by self-play: Checkers, Backgammon.
  - robot locomotion.

- Categories
  - Supervised Machine Learning
    - Regression
    - Classification
      - Binary Classification
      - Multi-Class Classification
    - Gradient Boosting
  - Unsupervised Machine Learning
    - Association
      - Self-organized mapping
      - Nearest-neighbor mapping
    - Clustering
      - K-means Clustering
    - Density Estimation
      - Singular value decomposition
    - Dimensionality Reduction
      - PCA: Principal Component Analysis
  - Reinforcement Learning
    - Decision Making under Uncertainty

- Model/Algorithm
  - Clustering
    - KM: K-Means Clustering
  - Classification
    - KNN: K-Nearest Neighbor
    - RF: Random Forest
  - Regression
  - Dimensionality Reduction

- Analysis
  - Exploratory
  - Confirmatory
  - Explanatory

- Bias-Variance Trade-off
  - underfitting: low variance, high bias
  - overfitting: high variance, low bias
  - $Error = Bias^2 + Variance + \text{Irreducible Error}$
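The decomposition can be checked numerically for a single target value. The sketch below is illustrative only: the helper name `bias_variance` and the two prediction lists are made up, and the noise term is left out, so the mean squared error reduces to squared bias plus variance.

```python
def bias_variance(predictions, target):
    """Split the mean squared error of repeated predictions of one target."""
    n = len(predictions)
    mean_pred = sum(predictions) / n
    bias = mean_pred - target
    # Population variance of the predictions around their own mean
    variance = sum((p - mean_pred) ** 2 for p in predictions) / n
    mse = sum((p - target) ** 2 for p in predictions) / n
    return bias, variance, mse


# Hypothetical predictions of the same target (5.0) from models
# trained on different samples of the data
underfit = [4.0, 4.1, 3.9, 4.0]  # tightly grouped but off target: high bias
overfit = [3.0, 7.5, 5.5, 4.0]   # centred on target but spread out: high variance

bv_under = bias_variance(underfit, 5.0)
bv_over = bias_variance(overfit, 5.0)
print(bv_under, bv_over)
```

In both cases the third value (MSE) equals the first value squared plus the second, which is the identity stated in the list.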

- Labeled Data
  - Training Set
  - Test Set

- Data Types and Transformation
  - Numeric
    - Normalized
    - Binning with One Hot Encoding
  - Text
    - NGRAM Transformation
    - OSB Transformation: Orthogonal Sparse Bigram
    - Stemming
  - Categorical
    - One Hot Encoding
    - Cartesian Transformation
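Two of the transformations listed above are small enough to sketch by hand. This is a simplified illustration, not how any particular library implements them: `ngrams` and `one_hot` are hypothetical helper names, tokenization is a plain whitespace split, and categories are ordered alphabetically.

```python
def ngrams(text, n=2):
    """Return the word n-grams of a text as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def one_hot(values):
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


print(ngrams("The quick brown fox"))
print(one_hot(["NY", "CA", "NY"]))
```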

## DevOps

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Pipeline

#### Data Acquisition

#### Data Cleaning

- Rogue Data Types
  - Outliers
  - Missing Data
  - Malicious Data
  - Erroneous Data
  - Irrelevant Data
  - Inconsistent Data
  - Data Formatting/Transformation

In [None]:
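from sklearn.impute import SimpleImputer  # replaces Imputer, removed from sklearn.preprocessing in scikit-learn 0.22

- Fill NA

In [None]:
# X is a numeric feature array and df a DataFrame (placeholders for your data)
imputer = SimpleImputer(strategy='mean')  # fill each column's NaN with the column mean
imputer.fit(X)                            # learn the fill values
X = imputer.transform(X)                  # apply them
df.fillna(0)  # constant fill
df.ffill()    # forward fill: propagate the last valid value
df.bfill()    # backward fill: use the next valid value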

- Encoding Categorical Variable

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [None]:
# labelencoder.fit_transform(values) maps each distinct value to an integer label

def labelEncoding(X, cols):
    """
    Encode the given columns of X into numeric labels (modifies X in place).
    """
    for i in cols:
        labelencoder = LabelEncoder()
        X[:, i] = labelencoder.fit_transform(X[:, i])

labelEncoding(X, [1, 2])

In [None]:
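# onehotencoder.fit_transform(X) returns one indicator column per category;
# drop='first' omits one column per feature to avoid the dummy variable trap

def onehotEncoding(XC):
    """
    Encode categorical columns into one-hot labels and return the dense array.
    """
    onehotencode = OneHotEncoder(categories='auto', drop='first')
    return onehotencode.fit_transform(XC).toarray()

XC = onehotEncoding(XC)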

X = np.concatenate((XN, XC), axis=1)

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

- Training/Testing Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

- Feature Scaling
  - Standardization: $x_{std} = \frac{x-\mu}{\sigma}$
  - Normalization: $x_{norm} = \frac{x-\min(X)}{\max(X)-\min(X)}$
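Both formulas can be applied by hand to a single column, as in the sketch below. The helper names `standardize` and `normalize` are illustrative. The standard deviation is the population one (dividing by n), which is also what scikit-learn's `StandardScaler` uses.

```python
import math


def standardize(xs):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]


def normalize(xs):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))  # mean 5, standard deviation 2
print(normalize([10, 20, 30]))
```

Neither helper guards against a constant column, where the denominator would be zero.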

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
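# fit_transform() learns the mean and standard deviation from the data, then scales it
# transform() reuses the already learned mean and standard deviation

standardscalar_X = StandardScaler()
X_train = standardscalar_X.fit_transform(X_train)
X_test = standardscalar_X.transform(X_test)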

#### Modeling

- Explore

- Build

- Train

- Test

#### Testing

- Test-Train Splitting

- K-fold Cross Validation
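K-fold cross validation splits the samples into k folds and uses each fold once as the test set while training on the rest. The generator below is a minimal sketch of that index bookkeeping. `k_fold_indices` is a hypothetical helper, and it does not shuffle, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

The model score is then averaged over the k train/test pairs.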

#### Deployment

# Regression

## SLR/Simple Linear Regression

### Intro

- Math
  - Independent Variable X and Dependent Variable Y: $y = b_0 + b_1 x_1$
  - Coefficient: $b_1$
  - Intercept: $b_0$
  - Cost Function: 
    - $$ RSS = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 $$
    - $$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 $$
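For a single feature, the coefficients that minimize RSS have a closed form: $b_1 = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}$ and $b_0 = \bar{y} - b_1\bar{x}$. The sketch below applies it to a small made-up data set. The function names are illustrative, and this is not how `LinearRegression` is implemented internally.

```python
def fit_slr(x, y):
    """Ordinary least squares for y = b0 + b1 * x with one feature."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def rss(y, y_hat):
    """Residual sum of squares."""
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))


def mse(y, y_hat):
    """Mean squared error: RSS divided by the number of samples."""
    return rss(y, y_hat) / len(y)


x = [1, 2, 3, 4]          # made-up feature values
y = [2, 4, 5, 8]          # made-up targets
b0, b1 = fit_slr(x, y)
y_hat = [b0 + b1 * xi for xi in x]
print(b0, b1, rss(y, y_hat), mse(y, y_hat))
```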

### Demo/Salary_Data

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
filename='Salary_Data.csv'
rds = pd.read_csv('../00Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
rds.columns

In [None]:
X = rds['YearsExperience'].values
y = rds['Salary'].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

In [None]:
X_train.reshape(-1,1).shape

In [None]:
slr = LinearRegression()
slr.fit(X_train.reshape(-1,1),y_train)

In [None]:
y_pred = slr.predict(X_test.reshape(-1,1))

In [None]:
plt.figure(figsize=(10,5))
# seaborn 0.12+ requires x and y to be passed as keyword arguments
sns.regplot(x=X_train, y=y_train, color='gray')
sns.scatterplot(x=X_test, y=y_pred, color='red', s=100, marker='<')
sns.scatterplot(x=X_test, y=y_test, color='green', s=100, marker='>')

### Demo/USA_Housing

- Data Acquisition

In [None]:
filename='USA_Housing.csv'
rds = pd.read_csv('Data/'+filename)

In [None]:
rds.describe()

- Data Cleaning

In [None]:
for col in rds.columns:
    print(rds.isna()[col].value_counts())

In [None]:
ds = rds

In [None]:
sns.pairplot(ds)

In [None]:
sns.histplot(ds['Price'], kde=True)  # distplot is deprecated since seaborn 0.11

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(ds.corr(numeric_only=True), annot=True)  # skip non-numeric columns (pandas 1.5+)

In [None]:
ds.columns

In [None]:
X = ds[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population']]
y = ds[['Price']]

- Modeling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
lm = LinearRegression()


- Training

In [None]:
lm.fit(X_train, y_train)

In [None]:
lm.coef_, lm.intercept_

In [None]:
cdf = pd.DataFrame(data=lm.coef_, columns=X.columns, index=['Coef'])
cdf

- Testing

In [None]:
y_test

In [None]:
y_pred = lm.predict(X_test)

In [None]:
y_test

In [None]:
y_pred

In [None]:
metrics.mean_absolute_error(y_test.values, y_pred)

In [None]:
metrics.mean_squared_error(y_test.values, y_pred)

In [None]:
root_mean_squared_error = np.sqrt(metrics.mean_squared_error(y_test.values, y_pred))
root_mean_squared_error

- Visualization

In [None]:
y_test.values, y_pred

In [None]:
sns.jointplot(x=y_test.values.ravel(), y=y_pred.ravel(), kind='reg')  # flatten (n, 1) arrays to 1-D

In [None]:
sns.histplot((y_test.values - y_pred).ravel(), kde=True)  # distribution of residuals

- Deployment

## MLR/Multiple Linear Regression

### Intro

- Math
  - Independent Variables (X1,...,Xn) and Dependent Variable Y: $y = b_0 + \sum_{i=1}^{n}b_i x_i$
  - Coefficient: $B=(b_1,...,b_n)$
  - Intercept: $b_0$
- Dummy Variable
  - $X_c$ is categorical
  - Turn each distinct value of the column into its own indicator column
  - Drop one value's column to avoid the dummy variable trap
- Modeling Guide
  - all-in
  - Backward Elimination
  - Forward Selection
  - Bidirectional Elimination
  - Score Comparison
- Multicollinearity
  - VIF: Variance Inflation Factor
  - BIC: Bayesian Information Criterion
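The variance inflation factor of predictor j is $VIF_j = \frac{1}{1-R_j^2}$, where $R_j^2$ comes from regressing predictor j on the other predictors. With only two predictors, $R_j^2$ is the squared Pearson correlation between them, which keeps the sketch below short. The helper names and data are made up, and the general case needs a full regression per predictor.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    s_ab = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    s_aa = sum((x - a_mean) ** 2 for x in a)
    s_bb = sum((y - b_mean) ** 2 for y in b)
    return s_ab / (s_aa * s_bb) ** 0.5


def vif_two_predictors(a, b):
    """VIF of either predictor when the model has exactly two predictors."""
    r_squared = pearson_r(a, b) ** 2
    return 1.0 / (1.0 - r_squared)


x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 11]    # nearly a multiple of x1: strong collinearity
x3 = [1, -1, 0, -1, 1]   # uncorrelated with x1

print(vif_two_predictors(x1, x2))  # large value signals multicollinearity
print(vif_two_predictors(x1, x3))  # 1.0 means no inflation
```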

### Demo/50_Startups

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
filename='50_Startups.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
rds.info()

In [None]:
XN = rds[['R&D Spend', 'Administration', 'Marketing Spend']].values
XC = rds[['State']].values
y = rds[['Profit']].values

In [None]:
onehotencode = OneHotEncoder(categories='auto', drop='first')
XC = onehotencode.fit_transform(XC).toarray()

In [None]:
X = np.concatenate((XN, XC), axis=1)

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
y_pred = regressor.predict(X_test)

In [None]:
sns.scatterplot(x=y_test[:,0], y=y_pred[:,0], hue=y_pred[:,0]-y_test[:,0], palette='viridis')
sns.scatterplot(x=y_train[:,0], y=y_train[:,0], color='gray')

- Backward Elimination

In [None]:
import statsmodels.api as sm

In [None]:
X_be = np.append(arr=np.ones((len(X),1)).astype(int), values=X, axis=1)  # prepend the intercept column
X_be.shape

In [None]:
X_opt = X_be[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

In [None]:
X_be = np.append(arr=np.ones((len(X),1)).astype(int), values=X, axis=1)
X_opt = X_be[:, [0, 1, 2, 3, 4]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

In [None]:
X_be = np.append(arr=np.ones((len(X),1)).astype(int), values=X, axis=1)
X_opt = X_be[:, [0, 1, 2, 3]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

In [None]:
X_be = np.append(arr=np.ones((len(X),1)).astype(int), values=X, axis=1)
X_opt = X_be[:, [0, 1, 3]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

In [None]:
rds.columns

In [None]:
X_be = np.append(arr=np.ones((len(X),1)).astype(int), values=X, axis=1)
X_opt = X_be[:, [0, 1]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

## PLR/Polynomial Linear Regression

### Intro

- Math
  - Independent Variable X1 and Dependent Variable Y: $y = b_0 + \sum_{i=1}^{n}b_i x_1^i$
  - Coefficient: $B=(b_1,...,b_n)$
  - Intercept: $b_0$
- Modeling Guide
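Polynomial regression is still linear in its coefficients: the single feature is expanded into the columns $1, x_1, x_1^2, \dots, x_1^n$ and an ordinary linear model is fitted on them. The sketch below shows only the expansion and the evaluation of the formula above. `polynomial_features` and `predict` are illustrative names, and the leading column of ones mirrors the bias column that `PolynomialFeatures` adds by default.

```python
def polynomial_features(xs, degree):
    """Expand each x into the row [1, x, x**2, ..., x**degree]."""
    return [[x ** d for d in range(degree + 1)] for x in xs]


def predict(coefs, x):
    """Evaluate y = b0 + b1*x + b2*x**2 + ... for coefs = [b0, b1, b2, ...]."""
    return sum(b * x ** i for i, b in enumerate(coefs))


print(polynomial_features([2, 3], degree=3))
print(predict([1, 0, 2], 3))  # 1 + 0*3 + 2*9
```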

### Demo/Position_Salaries

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
filename='Position_Salaries.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
sns.scatterplot(data=rds, x='Level', y='Salary')

In [None]:
rds.info()

In [None]:
X = rds[['Level']].values
y = rds[['Salary']].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

In [None]:
linreg = LinearRegression()
linreg.fit(X, y)

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
polyreg = PolynomialFeatures(degree=2)
X_poly = polyreg.fit_transform(X)

In [None]:
linreg2 = LinearRegression()
linreg2.fit(X_poly, y)

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:,0], y=y[:,0], color='gray')
sns.lineplot(x=X[:,0], y=linreg.predict(X)[:,0], color='red')
sns.lineplot(x=X[:,0], y=linreg2.predict(X_poly)[:,0], color='blue')

In [None]:
def plotpolynomial(degree=2, X=X, y=y):
    """Fit a polynomial regression of the given degree and plot it over the data."""
    polyreg = PolynomialFeatures(degree=degree)
    X_poly = polyreg.fit_transform(X)
    linreg = LinearRegression()
    linreg.fit(X_poly, y)
    plt.figure(figsize=(8,6))
    sns.scatterplot(x=X[:,0], y=y[:,0], color='gray')
    sns.lineplot(x=X[:,0], y=linreg.predict(X_poly)[:,0], color='blue')

plotpolynomial(degree=3), plotpolynomial(degree=4)

## SVR/Support Vector Machine Regression

### Intro

- Math
  - training set: $T = (\overrightarrow{X}, \overrightarrow{Y})$
  - model function: $\overrightarrow{Y} = F(\overrightarrow{X})$
  - support vector: given y, $\overrightarrow{x}$ is a support vector of y implies $D(y,\overrightarrow{x}) = \min(D(y, \overrightarrow{x_i})), \overrightarrow{x_i} \in \overrightarrow{X}$
  - kernel
    - Linear
    - Polynomial
    - RBF
  - regularization
  - correlation matrix: $K_{i,j} = e^{\sum_k\theta_k|x_k^i-x_k^j|^2}+\epsilon\delta_{i,j}$
  - fitting to calculate vector $\overrightarrow{\alpha}$: $\bar{K}\overrightarrow{\alpha}=\overrightarrow{Y}$
- Modeling Guide
  - supports both linear and nonlinear regression
  - in the classification domain, the support vectors determine a hyperplane that separates two classes
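The correlation matrix from the Math list can be built directly from its formula. The sketch below is a literal transcription under stated assumptions: `correlation_matrix` is a hypothetical helper, the sample points are made up, and $\theta_k$ is chosen negative so that the correlation decays with distance, as with an RBF kernel.

```python
import math


def correlation_matrix(points, theta, eps):
    """K[i][j] = exp(sum_k theta[k] * |x_k^i - x_k^j|**2) + eps * delta(i, j)."""
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            exponent = sum(t * abs(a - b) ** 2
                           for t, a, b in zip(theta, points[i], points[j]))
            # eps is added on the diagonal only (the Kronecker delta term)
            K[i][j] = math.exp(exponent) + (eps if i == j else 0.0)
    return K


points = [[0.0], [1.0], [2.0]]  # three made-up one-dimensional samples
K = correlation_matrix(points, theta=[-0.5], eps=1e-3)
for row in K:
    print(row)
```

The matrix is symmetric, and the diagonal term $\epsilon$ acts as regularization when solving for $\overrightarrow{\alpha}$.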

### Demo/Position_Salaries

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

In [None]:
filename='Position_Salaries.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
rds['Salary'].values

In [None]:
X = rds[['Level']].values
y = rds[['Salary']].values
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y).flatten()

In [None]:
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)

In [None]:
# Predicting a new result
regressor.predict(X)

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:,0], y=y, color='gray')
sns.lineplot(x=X[:,0], y=regressor.predict(X), color='blue')

## DTR/Decision Tree Regression

### Intro

- Intuition
  - separate points in n-D space into different regions by planes
  - 2D: transform a Tree into 2D plane division
- Math
  - TODO
- Information Entropy
- Modeling Guide
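For one feature, the "separate into regions" idea comes down to choosing a threshold. The sketch below picks the threshold that minimizes the summed squared error of the two resulting regions, each predicted by its mean. This is an illustrative single split, assuming a squared-error criterion (entropy applies to classification trees). `best_split` and the data are made up.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty region)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, cost) of the best single split on one feature."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best[1]:
            best = (threshold, cost)
    return best


x = [1, 2, 3, 4, 5, 6]
y = [10, 11, 10, 50, 52, 51]  # a clear jump between x = 3 and x = 4
print(best_split(x, y))
```

A full tree applies the same search recursively inside each region until a stopping rule is met.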

### Demo/Position_Salaries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

In [None]:
filename='Position_Salaries.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
X = rds[['Level']].values
y = rds[['Salary']].values

In [None]:
regressor = DecisionTreeRegressor(random_state=101)
regressor.fit(X, y)

In [None]:
regressor.predict([[6.5]]), regressor.predict(X)

In [None]:
y.flatten().shape

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:,0], y=y.flatten(), color='gray')
sns.lineplot(x=X[:,0], y=regressor.predict(X), color='blue')

In [None]:
X_grid = np.arange(X.min(), X.max(), 0.1)  # scalar bounds instead of 1-element arrays

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:,0], y=y.flatten(), color='gray')
sns.lineplot(x=X_grid, y=regressor.predict(X_grid.reshape(-1,1)), color='blue')

## RFR/Random Forest Regression

### Intro

- Intuition
  - TODO
- Math
  - TODO
- Information Entropy
- Modeling Guide

### Demo/Position_Salaries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [None]:
filename='Position_Salaries.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
X = rds[['Level']].values
y = rds['Salary'].values

In [None]:
regressor = RandomForestRegressor(n_estimators=10, random_state=101)
regressor.fit(X, y)

In [None]:
regressor.predict([[6.5]]), regressor.predict(X)

In [None]:
y.flatten().shape

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:,0], y=y.flatten(), color='gray')
sns.lineplot(x=X[:,0], y=regressor.predict(X), color='blue')

In [None]:
X_grid = np.arange(X.min(), X.max(), 0.01)  # scalar bounds instead of 1-element arrays

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:,0], y=y.flatten(), color='gray')
sns.lineplot(x=X_grid, y=regressor.predict(X_grid.reshape(-1,1)), color='blue')

In [None]:
np.linspace(min(X), max(X), 100).shape

- Plot Estimator Variance

In [None]:
def plot_random_forest(n_estimator=10, X=X, y=y):
    regressor = RandomForestRegressor(n_estimators=n_estimator, random_state=101)
    regressor.fit(X, y)
    X_grid = np.linspace(min(X), max(X), 100)
    plt.figure(figsize=(8,6))
    sns.scatterplot(x=X[:,0], y=y.flatten(), color='gray')
    sns.lineplot(x=X_grid[:,0], y=regressor.predict(X_grid), color='blue')

In [None]:
plot_random_forest(n_estimator=300)

# Classification

## Logistic Regression Classification

### Intro

- Math
  - Logistic/Sigmoid Function: $\phi(x) = \frac{1}{1+e^{-x}}$
  - Linear Regression Form: $\ln(\frac{p}{1-p})=b_0+b_1x$
- Intuition
  - Bisection data classification
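The two formulas in the Math list are inverses of each other: applying the sigmoid to $b_0 + b_1x$ gives the probability $p$, and taking the log-odds of $p$ recovers the linear form. A small sketch, with made-up coefficients and illustrative function names:

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds: the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


def predict_proba(x, b0, b1):
    """Probability of the positive class for a one-feature logistic model."""
    return sigmoid(b0 + b1 * x)


b0, b1 = -4.0, 2.0                  # hypothetical fitted coefficients
print(predict_proba(2.0, b0, b1))   # x = -b0/b1 is the decision boundary
print(logit(sigmoid(1.7)))          # round trip returns the input
```

Predicting class 1 whenever the probability exceeds 0.5 is equivalent to checking the sign of $b_0 + b_1x$.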

- Confusion Matrix
  - TP: True Positive
  - TN: True Negative
  - FP: False Positive
  - FN: False Negative
- Error Rate
  - (FP+FN)/Total
- Type I Error
  - False Positive
- Type II Error
  - False Negative
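The four counts and the error rate can be computed directly from the true and predicted labels. The sketch below assumes binary labels where 1 is the positive class. The label lists are made up and the helper names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type I error
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type II error
    return tp, tn, fp, fn


def error_rate(tp, tn, fp, fn):
    """Share of all predictions that are wrong: (FP + FN) / Total."""
    return (fp + fn) / (tp + tn + fp + fn)


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts, error_rate(*counts))
```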

In [None]:
x = np.linspace(-10,10,500)
y = 1/(1+np.power(np.e,-x))

x1 = np.linspace(-2,2,500)
y1 = x1+0.5

x2 = np.linspace(-10,10,500)
y2 = x2-x+0.5

plt.ylim(0, 1)
sns.lineplot(x=x, y=y)
sns.lineplot(x=x2, y=y2)
sns.lineplot(x=x1, y=y1)

### Demo/Social_Network_Ads

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression

In [None]:
filename='Social_Network_Ads.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
X = rds[['Age', 'EstimatedSalary']].values
y = rds['Purchased'].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

standardscalar_X = StandardScaler()
X_train = standardscalar_X.fit_transform(X_train)
X_test = standardscalar_X.transform(X_test)

In [None]:
classifier = LogisticRegression(random_state=0, solver='lbfgs')
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# df_train = pd.DataFrame(np.concatenate((X_train, np.zeros((len(y_train),1)), y_train.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Train', 'Purchased_train'])

# df_test = pd.DataFrame(np.concatenate((X_test, np.ones((len(y_test),1)), y_test.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Test', 'Purchased_test'])

df_pred_T = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred+y_test).reshape(-1,1)), 
                                        axis=1), 
                         columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_pred_F = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred-y_test).reshape(-1,1)), 
                                        axis=1),
                       columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

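df_TP = df_pred_T[df_pred_T['Purchased'] > 1]  # predicted 1, actual 1
df_TN = df_pred_T[df_pred_T['Purchased'] < 1]  # predicted 0, actual 0

# .copy() so the Z assignments below modify independent frames, not views
df_FP = df_pred_F[df_pred_F['Purchased'] > 0].copy()  # predicted 1, actual 0
df_FP.loc[:, 'Z'] = 0.5
df_FN = df_pred_F[df_pred_F['Purchased'] < 0].copy()  # predicted 0, actual 1
df_FN.loc[:, 'Z'] = -0.5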

In [None]:
df_TP.shape, df_TN.shape, df_FP.shape, df_FN.shape

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
colors = ['green', 'greenyellow', 'red', 'red']
names = ['True Positive', 'True Negative', 'False Positive', 'False Negative']

for index, df in enumerate([df_TP, df_TN, df_FP, df_FN]):
    fig.add_scatter3d(x=df['Age'], y=df['EstimatedSalary'], 
                      z=df['Z'], 
                      name=names[index],
                      mode='markers',
                      marker=dict(
                        size=12,
                        color=colors[index],        
                        colorscale='Viridis',
                        opacity=0.8)
                     )
    
fig.write_html(f'IMG/{filename}_LogiRegC.html',auto_open=True)

### Demo/Titanic

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

- Data Acquisition

In [None]:
filename = 'titanic_train.csv'

In [None]:
rds = pd.read_csv('Data/'+filename)

- Data Cleaning

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
sns.boxplot(data=rds, x='Pclass', y='Age')

In [None]:
ds1 = rds
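age_filling = rds.groupby('Pclass')['Age'].mean().apply(int)  # mean age per passenger class
def impute_age(cols):
    # Replace a missing age with the mean age of the passenger's class
    if pd.isnull(cols['Age']):
        return age_filling[int(cols['Pclass'])]
    return cols['Age']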

In [None]:
ds1['Age'] = rds[['Pclass','Age']].apply(impute_age, axis=1)

In [None]:
ds1.drop('Cabin', axis=1, inplace=True)

In [None]:
ds1['Embarked'] = ds1['Embarked'].ffill()  # assign back instead of chained inplace fill

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(data=ds1.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
sex_cat = pd.get_dummies(ds1['Sex'],drop_first=True)
embarked_cat = pd.get_dummies(ds1['Embarked'], drop_first=True)

In [None]:
ds = pd.concat([ds1, sex_cat, embarked_cat], axis=1)
ds.drop(['PassengerId','Name','Sex','Ticket','Embarked'], axis=1, inplace=True)
ds

In [None]:
X = ds.drop('Survived', axis=1)
y = ds['Survived']

- Modeling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

In [None]:
logReg = LogisticRegression()

- Training

In [None]:
logReg.fit(X_train, y_train)

- Testing

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
y_pred = logReg.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
confusion_matrix(y_test, y_pred)

- Visualization

- Deployment

## K-NN/K-Nearest Neighbor

### Intro

- Supervised ML

- Modeling
  - Formula
    $$
       y = f(x); y \in Y, x \in X \\
       X = \{\text{Featurization of Dataset}\} \\
       Y = \{\text{Predicted Class Labels}\}
    $$
  - Distance
    - Minkowski: p-norm: $D(a, b)=(\sum_{i=1}^{n}|a_i-b_i|^p)^{\frac{1}{p}}$
    - Euclidean: 2-norm Minkowski distance:  $L_2(a, b)=(\sum_{i=1}^{n}|a_i-b_i|^2)^{\frac{1}{2}}$
    - Manhattan: 1-norm Minkowski distance $L_1(a, b)=\sum_{i=1}^{n}|a_i-b_i|$
    - Hamming: $L = Count(a \oplus b)$
    - Cosine: $L_c(a, b) = 1 - \frac{a \cdot b}{||a|| \cdot ||b||}$
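The distance formulas above translate almost one to one into code, and a K-NN prediction is then a majority vote among the k closest training points. The sketch below is a brute-force illustration with made-up data and hypothetical helper names. It is not how `KNeighborsClassifier` is implemented.

```python
from collections import Counter


def minkowski(a, b, p):
    """p-norm distance; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def hamming(a, b):
    """Number of positions where the two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return 1 - dot / (norm_a * norm_b)


def knn_predict(train_points, train_labels, query, k=3, p=2):
    """Majority label among the k training points closest to the query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_points = [[1, 1], [1, 2], [5, 5], [6, 5], [5, 6]]
train_labels = [0, 0, 1, 1, 1]
print(minkowski([0, 0], [3, 4], 2), minkowski([0, 0], [3, 4], 1))
print(knn_predict(train_points, train_labels, [2, 1], k=3))
```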
  

### Demo/Social_Network_Ads

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

In [None]:
filename='Social_Network_Ads.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')


In [None]:
rds.columns

In [None]:
X = rds[['Age', 'EstimatedSalary']].values
y = rds['Purchased'].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

standardscalar_X = StandardScaler()
X_train = standardscalar_X.fit_transform(X_train)
X_test = standardscalar_X.transform(X_test)

In [None]:
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# df_train = pd.DataFrame(np.concatenate((X_train, np.zeros((len(y_train),1)), y_train.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Train', 'Purchased_train'])

# df_test = pd.DataFrame(np.concatenate((X_test, np.ones((len(y_test),1)), y_test.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Test', 'Purchased_test'])

df_pred_T = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred+y_test).reshape(-1,1)), 
                                        axis=1), 
                         columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_pred_F = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred-y_test).reshape(-1,1)), 
                                        axis=1),
                       columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

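df_TP = df_pred_T[df_pred_T['Purchased'] > 1]  # predicted 1, actual 1
df_TN = df_pred_T[df_pred_T['Purchased'] < 1]  # predicted 0, actual 0

# .copy() so the Z assignments below modify independent frames, not views
df_FP = df_pred_F[df_pred_F['Purchased'] > 0].copy()  # predicted 1, actual 0
df_FP.loc[:, 'Z'] = 0.5
df_FN = df_pred_F[df_pred_F['Purchased'] < 0].copy()  # predicted 0, actual 1
df_FN.loc[:, 'Z'] = -0.5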

In [None]:
df_TP.shape, df_TN.shape, df_FP.shape, df_FN.shape

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
colors = ['green', 'greenyellow', 'red', 'red']
names = ['True Positive', 'True Negative', 'False Positive', 'False Negative']

for index, df in enumerate([df_TP, df_TN, df_FP, df_FN]):
    fig.add_scatter3d(x=df['Age'], y=df['EstimatedSalary'], 
                      z=df['Z'], 
                      name=names[index],
                      mode='markers',
                      marker=dict(
                        size=12,
                        color=colors[index],        
                        colorscale='Viridis',
                        opacity=0.8)
                     )
    
fig.write_html(f'IMG/{filename}_KNN.html',auto_open=True)

### Demo/Classified Data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

- Data Acquisition

In [None]:
filename = 'Classified Data'
rds = pd.read_csv('Data/'+filename, index_col=0)

In [None]:
rds

- Data Cleaning

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
ds = rds

In [None]:
X1 = ds.drop('TARGET CLASS', axis=1)
y1 = ds['TARGET CLASS']

In [None]:
ss = StandardScaler()
ss.fit(X1)
X2 = ss.transform(X1)

In [None]:
X = pd.DataFrame(X2, columns=X1.columns)
y=y1

- Modeling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

- Training

In [None]:
knn.fit(X_train, y_train)

- Testing

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
y_pred = knn.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
confusion_matrix(y_test, y_pred)

- Visualization

In [None]:
err_rate = {}

for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    err_rate[k] = np.mean(y_test != y_pred)

In [None]:
err_rate.keys()

In [None]:
plt.figure(figsize=(12,6))
sns.lineplot(x=list(err_rate.keys()), y=list(err_rate.values()), marker='o')  # error rate per k

- Deployment

## SVC/Support Vector Machine Classifier

### Intro

- Intuition
  - 2-D: given two separated sets $A, B$, how to draw a line that separates them
  - Maximum Margin: 
    - $\overrightarrow{oa} \in A, \overrightarrow{ob} \in B, 
    \{\overrightarrow{op}, \overrightarrow{v}\}$ separates A, B
    - $\max(||\overrightarrow{ab}\cdot \overrightarrow{pv}||)$
    - $\overrightarrow{oa}, \overrightarrow{ob}$ are called support vectors
    - $\{\overrightarrow{op}, \overrightarrow{v}\}$ is called the maximum-margin hyperplane
    
- Hyperplane/Classifier
    - Positive Hyperplane
    - Negative Hyperplane
- Kernel:
    - Linear
    - Polynomial
    - RBF
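For a linear classifier written as $w \cdot x + b = 0$, the geometry above can be computed in a few lines. The sketch uses the common convention that the positive and negative hyperplanes are $w \cdot x + b = \pm 1$, so the margin width is $2/||w||$. The weights and points are made up, and the helper names are illustrative.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def classify(w, b, x):
    """Class +1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if signed_distance(w, b, x) >= 0 else -1


w, b = [3.0, 4.0], -5.0  # hypothetical separating hyperplane
print(signed_distance(w, b, [3.0, 4.0]), classify(w, b, [3.0, 4.0]))
print(signed_distance(w, b, [0.0, 0.0]), classify(w, b, [0.0, 0.0]))
print(margin_width(w))
```

Maximizing the margin is therefore the same as minimizing $||w||$ subject to every training point lying on or outside its class's hyperplane.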
   

### Demo/Social_Network_Ads

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

In [None]:
filename='Social_Network_Ads.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')


In [None]:
rds.columns

In [None]:
X = rds[['Age', 'EstimatedSalary']].values
y = rds['Purchased'].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

standardscalar_X = StandardScaler()
X_train = standardscalar_X.fit_transform(X_train)
X_test = standardscalar_X.transform(X_test)

In [None]:
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# df_train = pd.DataFrame(np.concatenate((X_train, np.zeros((len(y_train),1)), y_train.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Train', 'Purchased_train'])

# df_test = pd.DataFrame(np.concatenate((X_test, np.ones((len(y_test),1)), y_test.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Test', 'Purchased_test'])

df_pred_T = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred+y_test).reshape(-1,1)), 
                                        axis=1), 
                         columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_pred_F = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred-y_test).reshape(-1,1)), 
                                        axis=1),
                       columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

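df_TP = df_pred_T[df_pred_T['Purchased'] > 1]  # predicted 1, actual 1
df_TN = df_pred_T[df_pred_T['Purchased'] < 1]  # predicted 0, actual 0

# .copy() so the Z assignments below modify independent frames, not views
df_FP = df_pred_F[df_pred_F['Purchased'] > 0].copy()  # predicted 1, actual 0
df_FP.loc[:, 'Z'] = 0.5
df_FN = df_pred_F[df_pred_F['Purchased'] < 0].copy()  # predicted 0, actual 1
df_FN.loc[:, 'Z'] = -0.5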

In [None]:
df_TP.shape, df_TN.shape, df_FP.shape, df_FN.shape

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
colors = ['green', 'greenyellow', 'red', 'red']
names = ['True Positive', 'True Negative', 'False Positive', 'False Negative']

for index, df in enumerate([df_TP, df_TN, df_FP, df_FN]):
    fig.add_scatter3d(x=df['Age'], y=df['EstimatedSalary'], 
                      z=df['Z'], 
                      name=names[index],
                      mode='markers',
                      marker=dict(
                        size=12,
                        color=colors[index],        
                        colorscale='Viridis',
                        opacity=0.8)
                     )
    
fig.write_html(f'IMG/{filename}_SVM.html',auto_open=True)

## Kernel SVC/Kernel Support Vector Machine Classifier

### Intro

- Intuition
  - 2-D: two separate sets $A, B$ that are not linearly separable, e.g. the separating region has a hole.
  - Maximum Margin: 
    - $\overrightarrow{oa} \in A, \overrightarrow{ob} \in B, 
    \{\overrightarrow{op}, \overrightarrow{v}\}$ separates $A, B$
    - $\max(||\overrightarrow{ab}\cdot \overrightarrow{pv}||)$
    - $\overrightarrow{oa}, \overrightarrow{ob}$ are called support vectors
    - $\{\overrightarrow{op}, \overrightarrow{v}\}$ is called the maximum margin Hyperplane/Classifier
    - Positive Hyperplane
    - Negative Hyperplane
  - The kernel trick
    - The Gaussian RBF (Radial Basis Function) Kernel, where $\overrightarrow{l^i}$ is the $i$-th landmark
      > $$K(\overrightarrow{x}, \overrightarrow{l^i})=e^{-\frac{||\overrightarrow{x}-\overrightarrow{l^i}||^2}{2 \sigma^2}}$$
    -  The Sigmoid Kernel
      > $$K(X, Y) = \tanh(\gamma \cdot X^T Y + r)$$
    -  The Polynomial Kernel
      > $$K(X, Y) = (\gamma \cdot X^T Y + r)^d$$   
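
The three kernels above can be evaluated directly with NumPy. This is a minimal sketch for illustration, not the scikit-learn implementation; the function names and the default values of `sigma`, `gamma`, `r`, and `d` are assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel(x, landmark, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x - l||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(landmark, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    """Sigmoid kernel: tanh(gamma * x^T y + r)."""
    return np.tanh(gamma * np.dot(x, y) + r)

def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=2):
    """Polynomial kernel: (gamma * x^T y + r)^d."""
    return (gamma * np.dot(x, y) + r) ** d

x = np.array([1.0, 2.0])
landmark = np.array([1.0, 2.0])
far_point = np.array([4.0, 6.0])

k_same = rbf_kernel(x, landmark)           # identical points give the maximum value 1
k_far = rbf_kernel(x, far_point)           # squared distance 25, so exp(-25 / 2)
k_poly = polynomial_kernel(x, far_point)   # (1*4 + 2*6 + 1)^2 = 289
```

scikit-learn's `SVC(kernel='rbf')` parameterizes the same kernel through `gamma`, which corresponds to $1/(2\sigma^2)$ in the formula above.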

### Demo/Social_Network_Ads

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

In [None]:
filename='Social_Network_Ads.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')


In [None]:
rds.columns

In [None]:
X = rds[['Age', 'EstimatedSalary']].values
y = rds['Purchased'].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

standardscalar_X = StandardScaler()
X_train = standardscalar_X.fit_transform(X_train)
X_test = standardscalar_X.transform(X_test)
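
The scaling step above fits the scaler on the training set only and reuses those statistics for the test set. A minimal NumPy sketch of the same idea follows; the toy arrays are hypothetical and stand in for the `Age` and `EstimatedSalary` columns.

```python
import numpy as np

# Hypothetical training and test rows: [Age, EstimatedSalary]
train = np.array([[20.0, 20000.0], [40.0, 60000.0], [60.0, 100000.0]])
test = np.array([[40.0, 60000.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics, never refit on test data
```

Refitting on the test set would leak information about it into the preprocessing, which is why only `transform` is called on `X_test`.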

In [None]:
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# df_train = pd.DataFrame(np.concatenate((X_train, np.zeros((len(y_train),1)), y_train.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Train', 'Purchased_train'])

# df_test = pd.DataFrame(np.concatenate((X_test, np.ones((len(y_test),1)), y_test.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Test', 'Purchased_test'])

df_pred_T = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred+y_test).reshape(-1,1)), 
                                        axis=1), 
                         columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_pred_F = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred-y_test).reshape(-1,1)), 
                                        axis=1),
                       columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_TP = df_pred_T[df_pred_T['Purchased'] > 1]
df_TN = df_pred_T[df_pred_T['Purchased'] < 1]

df_FP = df_pred_F[df_pred_F['Purchased'] > 0].copy()  # copy so the assignment below does not hit a view
df_FP.loc[:, 'Z'] = 0.5
df_FN = df_pred_F[df_pred_F['Purchased'] < 0].copy()
df_FN.loc[:, 'Z'] = -0.5

In [None]:
df_TP.shape, df_TN.shape, df_FP.shape, df_FN.shape
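
The sum/difference encoding above (`y_pred + y_test` is 2 for true positives and 0 for true negatives; `y_pred - y_test` is 1 for false positives and -1 for false negatives) can also be expressed with boolean masks. The sketch below uses hypothetical labels and names (`y_true_demo`, `y_pred_demo`, `quadrant_counts`) purely for illustration.

```python
import numpy as np

# Hypothetical labels and predictions for six samples.
y_true_demo = np.array([1, 0, 1, 1, 0, 0])
y_pred_demo = np.array([1, 0, 0, 1, 1, 0])

tp = (y_pred_demo == 1) & (y_true_demo == 1)   # predicted 1, actually 1
tn = (y_pred_demo == 0) & (y_true_demo == 0)   # predicted 0, actually 0
fp = (y_pred_demo == 1) & (y_true_demo == 0)   # predicted 1, actually 0
fn = (y_pred_demo == 0) & (y_true_demo == 1)   # predicted 0, actually 1

quadrant_counts = {"TP": int(tp.sum()), "TN": int(tn.sum()),
                   "FP": int(fp.sum()), "FN": int(fn.sum())}
```

The masks can index `X_test` directly (for example `X_test[tp]`), which avoids building intermediate data frames when only the point coordinates are needed.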

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
colors = ['green', 'greenyellow', 'red', 'red']
names = ['True Positive', 'True Negative', 'False Positive', 'False Negative']

for index, df in enumerate([df_TP, df_TN, df_FP, df_FN]):
    fig.add_scatter3d(x=df['Age'], y=df['EstimatedSalary'], 
                      z=df['Z'], 
                      name=names[index],
                      mode='markers',
                      marker=dict(
                        size=12,
                        color=colors[index],        
                        colorscale='Viridis',
                        opacity=0.8)
                     )
    
fig.write_html(f'IMG/{filename}_KSVM.html',auto_open=True)

## Naive Bayes/Naive Bayes Classifier

### Intro

- Math
  - Bayes Theorem:
    > $$P(A|B)=\frac{P(B|A)\,P(A)}{P(B)}$$
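
Bayes' theorem can be checked numerically. The sketch below is only an illustration: the function name `posterior` and the spam-filter numbers are hypothetical, and $P(B)$ is expanded with the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) via Bayes' theorem."""
    # Total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers: 1% of emails are spam (A); a keyword (B) appears in
# 90% of spam and in 5% of non-spam emails.
p_spam_given_keyword = posterior(0.01, 0.90, 0.05)
print(round(p_spam_given_keyword, 4))
```

Even with a strong likelihood, the small prior keeps the posterior low, which is the effect a Naive Bayes classifier combines across features under its independence assumption.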

### Demo/Social_Network_Ads

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

In [None]:
filename='Social_Network_Ads.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')


In [None]:
rds.columns

In [None]:
X = rds[['Age', 'EstimatedSalary']].values
y = rds['Purchased'].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

standardscalar_X = StandardScaler()
X_train = standardscalar_X.fit_transform(X_train)
X_test = standardscalar_X.transform(X_test)

In [None]:
classifier = GaussianNB()
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# df_train = pd.DataFrame(np.concatenate((X_train, np.zeros((len(y_train),1)), y_train.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Train', 'Purchased_train'])

# df_test = pd.DataFrame(np.concatenate((X_test, np.ones((len(y_test),1)), y_test.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Test', 'Purchased_test'])

df_pred_T = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred+y_test).reshape(-1,1)), 
                                        axis=1), 
                         columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_pred_F = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred-y_test).reshape(-1,1)), 
                                        axis=1),
                       columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_TP = df_pred_T[df_pred_T['Purchased'] > 1]
df_TN = df_pred_T[df_pred_T['Purchased'] < 1]

df_FP = df_pred_F[df_pred_F['Purchased'] > 0].copy()  # copy so the assignment below does not hit a view
df_FP.loc[:, 'Z'] = 0.5
df_FN = df_pred_F[df_pred_F['Purchased'] < 0].copy()
df_FN.loc[:, 'Z'] = -0.5

In [None]:
df_TP.shape, df_TN.shape, df_FP.shape, df_FN.shape

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
colors = ['green', 'greenyellow', 'red', 'red']
names = ['True Positive', 'True Negative', 'False Positive', 'False Negative']

for index, df in enumerate([df_TP, df_TN, df_FP, df_FN]):
    fig.add_scatter3d(x=df['Age'], y=df['EstimatedSalary'], 
                      z=df['Z'], 
                      name=names[index],
                      mode='markers',
                      marker=dict(
                        size=12,
                        color=colors[index],        
                        colorscale='Viridis',
                        opacity=0.8)
                     )
    
fig.write_html(f'IMG/{filename}_NaiveBayes.html',auto_open=True)

## Decision Tree Classifier

### Intro

- Math
  - Information Entropy: $H(S) = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of samples in class $i$
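
Information entropy drives the `criterion='entropy'` setting used in the demo below: a split is preferred when it reduces entropy the most. The following is a minimal sketch with hypothetical helper names (`entropy`, `information_gain`) and toy labels, not scikit-learn's internal code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum(p_i * log2(p_i)) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return entropy(parent) - (weight_left * entropy(left)
                              + weight_right * entropy(right))

# Hypothetical labels: a perfectly mixed node split into two pure children.
parent = [0, 0, 1, 1]
gain = information_gain(parent, [0, 0], [1, 1])
```

A mixed node with two equally frequent classes has entropy 1 bit, and a split into pure children recovers all of it, so the gain here is the largest possible for a binary problem.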

### Demo/Social_Network_Ads

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

In [None]:
filename='Social_Network_Ads.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')


In [None]:
rds.columns

In [None]:
X = rds[['Age', 'EstimatedSalary']].values
y = rds['Purchased'].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

standardscalar_X = StandardScaler()
X_train = standardscalar_X.fit_transform(X_train)
X_test = standardscalar_X.transform(X_test)

In [None]:
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# df_train = pd.DataFrame(np.concatenate((X_train, np.zeros((len(y_train),1)), y_train.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Train', 'Purchased_train'])

# df_test = pd.DataFrame(np.concatenate((X_test, np.ones((len(y_test),1)), y_test.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Test', 'Purchased_test'])

df_pred_T = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred+y_test).reshape(-1,1)), 
                                        axis=1), 
                         columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_pred_F = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred-y_test).reshape(-1,1)), 
                                        axis=1),
                       columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_TP = df_pred_T[df_pred_T['Purchased'] > 1]
df_TN = df_pred_T[df_pred_T['Purchased'] < 1]

df_FP = df_pred_F[df_pred_F['Purchased'] > 0].copy()  # copy so the assignment below does not hit a view
df_FP.loc[:, 'Z'] = 0.5
df_FN = df_pred_F[df_pred_F['Purchased'] < 0].copy()
df_FN.loc[:, 'Z'] = -0.5

df_TP.shape, df_TN.shape, df_FP.shape, df_FN.shape

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
colors = ['green', 'greenyellow', 'red', 'red']
names = ['True Positive', 'True Negative', 'False Positive', 'False Negative']

for index, df in enumerate([df_TP, df_TN, df_FP, df_FN]):
    fig.add_scatter3d(x=df['Age'], y=df['EstimatedSalary'], 
                      z=df['Z'], 
                      name=names[index],
                      mode='markers',
                      marker=dict(
                        size=12,
                        color=colors[index],        
                        colorscale='Viridis',
                        opacity=0.8)
                     )
    
fig.write_html(f'IMG/{filename}_DecisionTreeClassifier.html',auto_open=True)

### Demo with PySpark/PastHires

In [None]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark import SparkConf, SparkContext
from numpy import array

In [None]:
# Boilerplate Spark stuff:
conf = SparkConf().setMaster("local").setAppName("SparkDecisionTree")
sc = SparkContext(conf = conf)

In [None]:
# Some functions that convert our CSV input data into numerical
# features for each job candidate
def binary(YN):
    if (YN == 'Y'):
        return 1
    else:
        return 0

def mapEducation(degree):
    if (degree == 'BS'):
        return 1
    elif (degree =='MS'):
        return 2
    elif (degree == 'PhD'):
        return 3
    else:
        return 0

# Convert a list of raw fields from our CSV file to a
# LabeledPoint that MLLib can use. All data must be numerical...
def createLabeledPoints(fields):
    yearsExperience = int(fields[0])
    employed = binary(fields[1])
    previousEmployers = int(fields[2])
    educationLevel = mapEducation(fields[3])
    topTier = binary(fields[4])
    interned = binary(fields[5])
    hired = binary(fields[6])

    return LabeledPoint(hired, array([yearsExperience, employed,
        previousEmployers, educationLevel, topTier, interned]))

#Load up our CSV file, and filter out the header line with the column names
rawData = sc.textFile("Data/PastHires.csv")
header = rawData.first()
rawData = rawData.filter(lambda x:x != header)

# Split each line into a list based on the comma delimiters
csvData = rawData.map(lambda x: x.split(","))

# Convert these lists to LabeledPoints
trainingData = csvData.map(createLabeledPoints)

# Create a test candidate, with 10 years of experience, currently employed,
# 3 previous employers, a BS degree, but from a non-top-tier school where
# he or she did not do an internship. You could of course load up a whole
# huge RDD of test candidates from disk, too.
testCandidates = [ array([10, 1, 3, 1, 0, 0])]
testData = sc.parallelize(testCandidates)

In [None]:
testData

In [None]:
# Train our DecisionTree classifier using our data set
model = DecisionTree.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={1:2, 3:4, 4:2, 5:2},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Now get predictions for our unknown candidates. (Note, you could separate
# the source data into a training set and a test set while tuning
# parameters and measure accuracy as you go!)
predictions = model.predict(testData)
print('Hire prediction:')
results = predictions.collect()
for result in results:
    print(result)

# We can also print out the decision tree itself:
print('Learned classification tree model:')
print(model.toDebugString())


### Demo/loan_data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

- Data Acquisition

In [None]:
filename = "loan_data.csv"
rds = pd.read_csv('Data/'+filename)
rds

- Data Cleaning

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
# sns.pairplot(data=rds)

In [None]:
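ds = rds
# NOTE: 'Kyphosis' is the label column of the kyphosis dataset; with
# loan_data.csv, substitute that file's own target column here.
X = ds.drop('Kyphosis', axis=1)
y = ds['Kyphosis']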


- Modeling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
dtree = DecisionTreeClassifier(max_depth=6)

- Training

In [None]:
dtree.fit(X_train, y_train)

- Testing

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
y_pred = dtree.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
confusion_matrix(y_test, y_pred)

- Visualization

In [None]:
type(y.values)

In [None]:
y_train.value_counts()

In [None]:
class_values = y.unique().tolist()
for i, v in enumerate(class_values):
    print(i, v)

In [None]:
from sklearn.tree import export_graphviz
from IPython.display import display, HTML  # IPython.core.display is deprecated for these names
from dtreeviz.trees import *

In [None]:
viz = dtreeviz(dtree,
               X_train=X_train,
               y_train=y_train,
               feature_names=X.columns.to_list(),
               target_name='Kyphosis',
               class_names=y.unique().tolist(),
               fancy=True)
display(HTML(viz.svg()))

- Deployment

In [None]:
from sklearn.datasets import *
from sklearn import tree
from dtreeviz.trees import *

In [None]:
classifier = tree.DecisionTreeClassifier(max_depth=2)  # limit depth of tree
iris = load_iris()
classifier.fit(iris.data, iris.target)

viz = dtreeviz(classifier,
               iris.data,
               iris.target,
               target_name='variety',
               feature_names=iris.feature_names,
               class_names=["setosa", "versicolor", "virginica"]  # need class_names for classifier
               )
              
viz.view() 

## Random Forest Classifier

### Intro

- Math
  - Information Entropy
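
A random forest trains each tree on a bootstrap sample and combines the per-tree predictions. The sketch below illustrates only those two ingredients under simplifying assumptions: the helper names are hypothetical, the tree predictions are made-up values, and hard majority voting is used for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_indices(n_samples, rng):
    """Sample row indices with replacement, as used to train each tree."""
    return rng.integers(0, n_samples, size=n_samples)

def majority_vote(tree_predictions):
    """Combine per-tree class predictions (n_trees x n_samples) by majority."""
    tree_predictions = np.asarray(tree_predictions)
    n_classes = tree_predictions.max() + 1
    # Count the votes per class for every sample (one column per sample).
    votes = np.apply_along_axis(np.bincount, 0, tree_predictions,
                                minlength=n_classes)
    return votes.argmax(axis=0)

idx = bootstrap_indices(10, rng)

# Hypothetical predictions of three trees for four samples.
preds = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 1, 0, 0]]
forest_pred = majority_vote(preds)
```

scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this sketch shows the idea and not that library's exact aggregation rule.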

### Demo/Social_Network_Ads

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [None]:
filename='Social_Network_Ads.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')


In [None]:
rds.columns

In [None]:
X = rds[['Age', 'EstimatedSalary']].values
y = rds['Purchased'].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

standardscalar_X = StandardScaler()
X_train = standardscalar_X.fit_transform(X_train)
X_test = standardscalar_X.transform(X_test)

In [None]:
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# df_train = pd.DataFrame(np.concatenate((X_train, np.zeros((len(y_train),1)), y_train.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Train', 'Purchased_train'])

# df_test = pd.DataFrame(np.concatenate((X_test, np.ones((len(y_test),1)), y_test.reshape(-1,1)), axis=1),
#                        columns=['Age', 'EstimatedSalary','Test', 'Purchased_test'])

df_pred_T = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred+y_test).reshape(-1,1)), 
                                        axis=1), 
                         columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_pred_F = pd.DataFrame(np.concatenate((X_test, 
                                         np.zeros((len(y_pred),1)), 
                                         (y_pred-y_test).reshape(-1,1)), 
                                        axis=1),
                       columns=['Age', 'EstimatedSalary','Z', 'Purchased'])

df_TP = df_pred_T[df_pred_T['Purchased'] > 1]
df_TN = df_pred_T[df_pred_T['Purchased'] < 1]

df_FP = df_pred_F[df_pred_F['Purchased'] > 0].copy()  # copy so the assignment below does not hit a view
df_FP.loc[:, 'Z'] = 0.5
df_FN = df_pred_F[df_pred_F['Purchased'] < 0].copy()
df_FN.loc[:, 'Z'] = -0.5

df_TP.shape, df_TN.shape, df_FP.shape, df_FN.shape

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
colors = ['green', 'greenyellow', 'red', 'red']
names = ['True Positive', 'True Negative', 'False Positive', 'False Negative']

for index, df in enumerate([df_TP, df_TN, df_FP, df_FN]):
    fig.add_scatter3d(x=df['Age'], y=df['EstimatedSalary'], 
                      z=df['Z'], 
                      name=names[index],
                      mode='markers',
                      marker=dict(
                        size=12,
                        color=colors[index],        
                        colorscale='Viridis',
                        opacity=0.8)
                     )
    
fig.write_html(f'IMG/{filename}_RandomForestClassifier_100.html',auto_open=True)

# Clustering

## K-means Clustering

### Intro

### Demo/Mall_Customers

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans

In [None]:
filename='Mall_Customers.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
sns.scatterplot(data=rds, x='Annual Income (k$)', y='Spending Score (1-100)', 
                hue='Age')

In [None]:
rds.columns

In [None]:
X = rds[['Annual Income (k$)', 'Spending Score (1-100)']].values

In [None]:
def show_elbow(X=X, filename=filename):
    wcss = []
    fig = go.Figure()
    for n_clusters in range(1, 20):
        classifier = KMeans(n_clusters=n_clusters, 
                            init='k-means++', max_iter=500, 
                            n_init=20, random_state=0)
        classifier.fit(X)
        wcss.append(classifier.inertia_)
    fig.add_trace(go.Scatter(x=np.arange(1, len(wcss)+1), y=wcss, name='wcss'))
    fig.write_html(f'IMG/{filename}_KMeans_elbow.html',auto_open=True)

In [None]:
show_elbow()
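
The elbow plot tracks `inertia_`, the within-cluster sum of squares (WCSS): the sum of squared distances from each point to its assigned centroid. A minimal NumPy sketch on hypothetical toy data (all `*_toy` names are illustrative) shows how the value drops as clusters are added.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two tight groups on a line.
X_toy = np.array([[0.0], [2.0], [10.0], [12.0]])
labels_toy = np.array([0, 0, 1, 1])
centroids_toy = np.array([[1.0], [11.0]])

# Two clusters: 1 + 1 + 1 + 1 = 4
wcss_two = wcss(X_toy, labels_toy, centroids_toy)
# One cluster centred at 6: 36 + 16 + 16 + 36 = 104
wcss_one = wcss(X_toy, np.zeros(4, dtype=int), np.array([[6.0]]))
```

WCSS keeps decreasing as the number of clusters grows, so the elbow method looks for the point where the decrease flattens rather than for a minimum.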

In [None]:
classifier = KMeans(n_clusters=5, 
                    init='k-means++', max_iter=500, 
                    n_init=20, random_state=0)
y_pred = classifier.fit_predict(X)

In [None]:
rds['Type'] =y_pred
centroids = pd.DataFrame(classifier.cluster_centers_, columns=['x', 'y'])
centroids['color']=centroids.shape[0]*['red']
centroids['size']=centroids.shape[0]*[16]

In [None]:
fig = px.scatter(data_frame=rds, x='Annual Income (k$)', y='Spending Score (1-100)', 
                 color='Type')
# Overlay the cluster centroids as large red markers
fig.add_trace(go.Scatter(x=centroids['x'], y=centroids['y'], 
                        mode='markers',
                        marker=dict(
                            size=centroids['size'],
                            color='red', #set color equal to a variable
                            colorscale='Viridis', # one of plotly colorscales
                            showscale=True
                        ),
                        showlegend=False,
                        name='centroids'))
fig.write_html(f'IMG/{filename}_KMeans.html',auto_open=True)

## Hierarchical Clustering

### Intro

- Agglomerative
- Divisive
- Dendrogram
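
Agglomerative clustering starts with one cluster per point and repeatedly merges the two closest clusters; the order of merges is what a dendrogram draws. The sketch below is a naive illustration that uses single linkage (closest pair of members) as a simpler stand-in for the Ward linkage used in the demo; the function name and toy points are hypothetical.

```python
import numpy as np

def agglomerative_single_linkage(points, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]   # start: one cluster per point
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                dist = min(np.linalg.norm(points[i] - points[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # merge cluster b into cluster a
        del clusters[b]
    return clusters

# Hypothetical 1-D points forming two obvious groups.
toy_points = [[0.0], [1.0], [1.5], [10.0], [11.0]]
toy_clusters = agglomerative_single_linkage(toy_points, n_clusters=2)
```

This brute-force loop is only practical for tiny inputs; `scipy.cluster.hierarchy.linkage` and `AgglomerativeClustering`, used in the demo, are the tools for real data.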

### Demo/Mall_Customers

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch

In [None]:
filename='Mall_Customers.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

In [None]:
sns.scatterplot(data=rds, x='Annual Income (k$)', y='Spending Score (1-100)', 
                hue='Age')

In [None]:
rds.columns

In [None]:
X = rds[['Annual Income (k$)', 'Spending Score (1-100)']].values

In [None]:
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))

In [None]:
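# Ward linkage only supports Euclidean distance, which is the default;
# the `affinity` argument was deprecated in scikit-learn 1.2 in favor of `metric`.
classifier = AgglomerativeClustering(n_clusters=5, linkage='ward')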
y_pred = classifier.fit_predict(X)

In [None]:
rds['Type'] =y_pred
# centroids = pd.DataFrame(classifier.cluster_centers_, columns=['x', 'y'])
# centroids['color']=centroids.shape[0]*['red']
# centroids['size']=centroids.shape[0]*[16]

In [None]:
fig = px.scatter(data_frame=rds, x='Annual Income (k$)', y='Spending Score (1-100)', 
                 color='Type')
# fig.add_trace(go.Scatter(x=centroids['x'], y=centroids['y'], 
#                         mode='markers',
#                         marker=dict(
#                             size=centroids['size'],
#                             color='red', #set color equal to a variable
#                             colorscale='Viridis', # one of plotly colorscales
#                             showscale=True
#                         ),
#                         showlegend=False,
#                         name='centroids'))
fig.write_html(f'IMG/{filename}_HierarchicalCluster.html',auto_open=True)

# Association Rule Learning

## Apriori

### Intro
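
Apriori ranks association rules by support, confidence, and lift, and keeps only itemsets whose support reaches a chosen threshold. The sketch below computes these three measures in plain Python. It is not a full Apriori implementation (there is no level-wise candidate pruning), and the transactions, function names, and the 0.5 threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions, not taken from Market_Basket_Optimisation.csv.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

def lift(antecedent, consequent, transactions):
    """Confidence divided by the baseline support of the consequent."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

# Frequent pairs: keep 2-itemsets whose support reaches the threshold.
items = sorted(set().union(*transactions))
frequent_pairs = [pair for pair in combinations(items, 2)
                  if support(pair, transactions) >= 0.5]
```

A lift above 1 suggests the antecedent makes the consequent more likely than its baseline rate; a lift below 1, as for bread and milk in this toy data, suggests the opposite.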

### Demo/Market_Basket_Optimisation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans

In [None]:
filename='Market_Basket_Optimisation.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')

## Eclat

# Reinforcement Learning

- Agent
- Environment
- Policy
- Reward Function
- Value Function
- Environment Model

## UCB/Upper Confidence Bound

### Demo/Ads_CTR_Optimisation
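
The Upper Confidence Bound strategy shows, in each round, the ad whose average reward plus an exploration bonus is highest. The sketch below simulates this on made-up click-through rates instead of the `Ads_CTR_Optimisation` data; the rates, the number of rounds, and the constant 1.5 in the bonus term are assumptions (the constant is one common choice).

```python
import math
import random

def ucb_select(counts, sums, round_index):
    """Pick the arm with the highest upper confidence bound."""
    # Play every arm once before using the bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    bounds = [sums[arm] / counts[arm]
              + math.sqrt(1.5 * math.log(round_index + 1) / counts[arm])
              for arm in range(len(counts))]
    return bounds.index(max(bounds))

# Hypothetical click-through rates for three ads (simulated, not real data).
true_ctr = [0.05, 0.20, 0.10]
random.seed(0)
n_rounds = 2000
counts = [0] * len(true_ctr)    # times each ad was shown
sums = [0.0] * len(true_ctr)    # total reward collected per ad

for t in range(n_rounds):
    arm = ucb_select(counts, sums, t)
    reward = 1.0 if random.random() < true_ctr[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
```

Over many rounds the bonus shrinks for frequently shown ads, so selections tend to concentrate on the ad with the highest observed rate while the others are still sampled occasionally.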

## Markov Decision Process

## Dynamic Programming

# Dimensionality Reduction/Factor Analysis

- Feature Selection
  - Backward Elimination
  - Forward Selection
  - Bidirectional Elimination
  - Score Comparison
- Feature Extraction
  - PCA
  - LDA
  - Kernel PCA

## PCA: Principal Component Analysis

### Intro
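
PCA projects centered data onto the directions of largest variance, which are the eigenvectors of the covariance matrix. The sketch below is a minimal NumPy version for intuition, not scikit-learn's `PCA` (which uses an SVD); the function name `pca` and the synthetic data are assumptions.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via the covariance matrix."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained_ratio = eigvals / eigvals.sum()
    return X_centered @ eigvecs[:, :n_components], explained_ratio

# Hypothetical data: the second feature is twice the first plus small noise,
# so one component should carry almost all of the variance.
rng = np.random.default_rng(0)
first = rng.normal(size=200)
X_demo = np.column_stack([first, 2 * first + 0.01 * rng.normal(size=200)])
X_proj, ratio = pca(X_demo, n_components=1)
```

The `ratio` array plays the same role as `explained_variance_ratio_` in the demo below: it indicates how many components are worth keeping.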

### Demo/Wine

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

In [None]:
filename='Wine.csv'
rds = pd.read_csv('Data/'+filename)
plt.figure(figsize=(12,8))
sns.heatmap(data=rds.isnull(), yticklabels=False, cbar=False, cmap='viridis')


In [None]:
rds.columns

In [None]:
X = rds.iloc[:, 0:13].values
y = rds['Customer_Segment'].values

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

standardscalar_X = StandardScaler()
X_train = standardscalar_X.fit_transform(X_train)
X_test = standardscalar_X.transform(X_test)

In [None]:
def show_main_components(X_train, X_test):
    # Fit PCA with all components to inspect how much variance each one explains
    decomposer = PCA(n_components=None)
    X_train = decomposer.fit_transform(X_train)
    X_test = decomposer.transform(X_test)
    explained_variance = decomposer.explained_variance_ratio_
    sns.scatterplot(x=range(len(explained_variance)), y=explained_variance, hue=explained_variance)
    print(explained_variance)

show_main_components(X_train, X_test)

In [None]:
decomposor = PCA(n_components=2)
X_train = decomposor.fit_transform(X_train)
X_test = decomposor.transform(X_test)

In [None]:
classifier = LogisticRegression(solver='lbfgs',random_state=0, multi_class='auto')
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

In [None]:
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
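    # `color=` takes a single RGBA tuple; passing it as `c=` makes matplotlib warn about an ambiguous value.
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green', 'blue'))(i), label = j)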
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

## Kernel PCA: Kernel Principal Component Analysis

### Intro

### Demo
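
A minimal sketch of what a demo could look like, mirroring the PCA workflow above. Everything here is an assumption for illustration: the synthetic `make_circles` data, the RBF kernel with `gamma=10`, and the helper `score_after` are not from the original notebook.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two concentric rings: a dataset that no straight line can separate.
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_c, y_c, test_size=0.25, random_state=0)


def score_after(decomposer):
    """Project with `decomposer`, fit a linear classifier, return test accuracy."""
    Z_train = decomposer.fit_transform(Xc_train)
    Z_test = decomposer.transform(Xc_test)
    clf = LogisticRegression(random_state=0).fit(Z_train, yc_train)
    return clf.score(Z_test, yc_test)


linear_acc = score_after(PCA(n_components=2))
kernel_acc = score_after(KernelPCA(n_components=2, kernel='rbf', gamma=10))
print("linear PCA:", linear_acc, "kernel PCA:", kernel_acc)
```

The same logistic regression is used in both runs, so any difference in accuracy comes from the projection alone.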

## LDA: Linear Discriminant Analysis

### Intro

### Demo
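
A minimal sketch of an LDA demo following the same steps as the PCA section. The Wine data bundled with scikit-learn and the `_w` variable names are assumptions for illustration. Unlike PCA, LDA is supervised, so `fit_transform` also receives the labels, and `n_components` can be at most the number of classes minus one.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the 3-class Wine data stands in for the notebook's dataset.
X_w, y_w = load_wine(return_X_y=True)
Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    X_w, y_w, test_size=0.2, random_state=0)

# Scale using statistics from the training split only.
scaler = StandardScaler()
Xw_train = scaler.fit_transform(Xw_train)
Xw_test = scaler.transform(Xw_test)

# 3 classes allow at most 2 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
Xw_train_lda = lda.fit_transform(Xw_train, yw_train)  # labels are required
Xw_test_lda = lda.transform(Xw_test)

clf = LogisticRegression(random_state=0).fit(Xw_train_lda, yw_train)
lda_cm = confusion_matrix(yw_test, clf.predict(Xw_test_lda))
print(lda_cm)
```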

## Factorization Machine

# Time Series Analysis

In [None]:
# Candidate forecasting models: DeepAR, ARIMA, ETS

# Optimization

## K-Fold Cross Validation
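
A minimal sketch of k-fold cross-validation with scikit-learn. The Iris data, the 10 folds, and the scaler-plus-logistic-regression pipeline are assumptions chosen for illustration. Putting the scaler inside the pipeline means it is refit on each training fold, so no statistics leak from the held-out fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_i, y_i = load_iris(return_X_y=True)

# Preprocessing and estimator travel together through every fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Iris is sorted by class, so shuffle before splitting into folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X_i, y_i, cv=cv, scoring='accuracy')

# The mean estimates generalisation accuracy; the spread shows how much
# that estimate depends on the particular split.
print(scores.mean(), scores.std())
```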

## XGBoost

### Intro

### Demo

In [None]:
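import xgboost as xgb
from sklearn.datasets import fetch_california_housing

# load_boston was removed in scikit-learn 1.2; California housing is used here as a stand-in regression dataset.
housing = fetch_california_housing()

# XGBoost native API example
# 'gpu_hist' needs a CUDA-enabled build; XGBoost >= 2.0 spells this tree_method='hist' with device='cuda'.
params = {'tree_method': 'gpu_hist', 'max_depth': 3, 'learning_rate': 0.1}
dtrain = xgb.DMatrix(housing.data, label=housing.target)
xgb.train(params, dtrain, evals=[(dtrain, "train")])

# sklearn API example
# The old `silent` flag was replaced by `verbosity` (0 = silent ... 3 = debug).
gbm = xgb.XGBRegressor(verbosity=1, n_estimators=10, tree_method='gpu_hist')
gbm.fit(housing.data, housing.target, eval_set=[(housing.data, housing.target)])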

In [None]:
import xgboost as xgb
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
import time

# Fetch dataset using sklearn
cov = fetch_covtype()
X = cov.data
y = cov.target

# Create 0.75/0.25 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, train_size=0.75,
                                                    random_state=42)

# Specify sufficient boosting iterations to reach a minimum
num_round = 3000

# Leave most parameters as default
param = {'objective': 'multi:softmax', # Specify multiclass classification
         'num_class': 8, # Covertype labels run 1..7; 8 keeps every label inside [0, num_class)
         'tree_method': 'gpu_hist' # GPU-accelerated histogram algorithm (XGBoost >= 2.0: 'hist' with device='cuda')
         }

# Convert input data from numpy to XGBoost format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

gpu_res = {} # Store accuracy result
tmp = time.time()
# Train model
xgb.train(param, dtrain, num_round, evals=[(dtest, 'test')], evals_result=gpu_res)
print("GPU Training Time: %s seconds" % (str(time.time() - tmp)))

# Repeat for CPU algorithm
tmp = time.time()
param['tree_method'] = 'hist'
cpu_res = {}
xgb.train(param, dtrain, num_round, evals=[(dtest, 'test')], evals_result=cpu_res)
print("CPU Training Time: %s seconds" % (str(time.time() - tmp)))

In [None]:
# Convert two timings from seconds to minutes (presumably the GPU and CPU runs above)
1024/60, 12307/60

# End

## Pipeline