# Practical Machine Learning

# What is Machine Learning?
### Machine learning teaches computers to do what comes naturally to humans and animals: learn from experience. Machine learning algorithms use computational methods to “learn” information directly from data without relying on a predetermined equation as a model. The algorithms adaptively improve their performance as the number of samples available for learning increases.

### Machine learning algorithms find natural patterns in data that generate insight and help you make better decisions and predictions.

![ML](machine-learning1.png)
Image source: https://www.slideshare.net/awahid/big-data-and-machine-learning-for-businesses

# Data Exploration and Analysis

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
%matplotlib inline

In [None]:
print(np.__version__)
print(pd.__version__)
print(sns.__version__)
import sys
print(sys.version)

In [None]:
path='C:/2021/chinaM/Oct21/data/'
df = pd.read_csv(path+'iris.data')

In [None]:
df.head()

In [None]:
df = pd.read_csv(path+'iris.data', header=None)
df.head()

In [None]:
col_name = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']

In [None]:
df.columns = col_name

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
print(df.info())

In [None]:
print(df.groupby('species').size())

# Visualisation

In [None]:
sns.pairplot(df, hue='species', height=3, aspect=1);

In [None]:
df.hist(edgecolor='black', linewidth=1.2, figsize=(12,8));
plt.show();

In [None]:
# One violin plot per feature, grouped by species.
# Column names contain spaces, as set in col_name above.
plt.figure(figsize=(12,8));
plt.subplot(2,2,1)
sns.violinplot(x='species', y='sepal length', data=df)
plt.subplot(2,2,2)
sns.violinplot(x='species', y='sepal width', data=df)
plt.subplot(2,2,3)
sns.violinplot(x='species', y='petal length', data=df)
plt.subplot(2,2,4)
sns.violinplot(x='species', y='petal width', data=df);


In [None]:
df.boxplot(by='species', figsize=(12,8));

In [None]:
pd.plotting.scatter_matrix(df, figsize=(12,10))
plt.show()

In [None]:
sns.pairplot(df, hue="species", diag_kind="kde");

# scikit-learn

url = [http://scikit-learn.org/stable/](http://scikit-learn.org/stable/)

In [None]:
%%HTML
<iframe width=100% height=500 src='http://scikit-learn.org/stable/'></iframe>

## Key points

* Data in the form of a table
* Features in the form of a matrix
* Label or target array
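
As a minimal sketch of these three ideas, the snippet below builds a small table and splits it into a features matrix and a target array. The table `toy` and its values are made up for illustration; they are not read from the iris file.

```python
import pandas as pd

# Hypothetical toy table: one row per sample, one column per measurement.
toy = pd.DataFrame({
    'sepal length': [5.1, 7.0, 6.3],
    'petal length': [1.4, 4.7, 6.0],
    'species': ['setosa', 'versicolor', 'virginica'],
})

# Features matrix: 2-D, shape (n_samples, n_features).
X_toy = toy.drop(columns='species').values

# Target array: 1-D, length n_samples.
y_toy = toy['species'].values

print(X_toy.shape, y_toy.shape)
```

The same split applies to any table in which one column holds the label and the remaining columns hold the features.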

# Scikit-Learn API

url = [https://arxiv.org/abs/1309.0238](https://arxiv.org/abs/1309.0238)

## General principles

* **Consistency**. All objects (basic or composite) share a consistent interface composed of a limited set of methods. This interface is documented in a consistent manner for all objects. 

* **Inspection**. Constructor parameters and parameter values determined by learning algorithms are stored and exposed as public attributes. 

* **Non-proliferation of classes**. Learning algorithms are the only objects to be represented using custom classes. Datasets are represented as NumPy arrays or SciPy sparse matrices. Hyper-parameter names and values are represented as standard Python strings or numbers whenever possible. This keeps scikit-learn easy to use and easy to combine with other libraries.

* **Composition**. Many machine learning tasks are expressible as sequences or combinations of transformations to data. Some learning algorithms are also naturally viewed as meta-algorithms parametrized on other algorithms. Whenever feasible, such algorithms are implemented and composed from existing building blocks. 

* **Sensible defaults**. Whenever an operation requires a user-defined parameter, an appropriate default value is defined by the library. The default value should cause the operation to be performed in a sensible way (giving a baseline solution for the task at hand).
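
A short sketch of how some of these principles look in practice. The data `X_demo` and `y_demo` are synthetic and assumed here purely for illustration; `LinearRegression`, `StandardScaler` and `make_pipeline` are standard scikit-learn objects.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): y = 2*x + 1 exactly.
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + 1

# Sensible defaults: no arguments are needed to get a baseline model.
reg = LinearRegression()

# Inspection: constructor parameters are exposed through get_params().
print(reg.get_params()['fit_intercept'])

# Consistency: estimators are trained with fit(); values learned from
# the data are stored in attributes whose names end with an underscore.
reg.fit(X_demo, y_demo)
print(reg.coef_, reg.intercept_)

# Composition: a pipeline chains a transformer and an estimator and
# exposes the same fit/predict interface as a single estimator.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_demo, y_demo)
print(pipe.predict([[10.0]]))
```

Because the pipeline behaves like any other estimator, it can be dropped into the same fit and predict workflow used throughout this notebook.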



# Basic Steps of Using Scikit-Learn API

1. Choose a class of model
2. Choose model hyperparameters
3. Arrange data into features matrix and target array
4. Fit model to data
5. Apply trained model to new data



# Supervised Learning: Simple Linear Regression

In [None]:
generate_random = np.random.RandomState(0)
x = 10 * generate_random.rand(100)

In [None]:
# Draw the noise from the same seeded generator so the data is reproducible
y = 3 * x + generate_random.randn(100)

In [None]:
plt.figure(figsize = (10, 8))
plt.scatter(x, y);

## Step 1. Choose a class of model

In [None]:
from sklearn.linear_model import LinearRegression

## Step 2. Choose model hyperparameters

In [None]:
model = LinearRegression(fit_intercept=True)

In [None]:
model

## Step 3. Arrange data into features matrix and target array


In [None]:
X = x.reshape(-1, 1)
X.shape

## Step 4. Fit model to data


In [None]:
model.fit(X, y)

In [None]:
model.coef_

In [None]:
model.intercept_

## Step 5. Apply trained model to new data

In [None]:
x_fit = np.linspace(-1, 11)

In [None]:
X_fit = x_fit.reshape(-1,1)

In [None]:
y_fit = model.predict(X_fit)

## Visualise

In [None]:
plt.figure(figsize = (10, 8))
plt.scatter(x, y)
plt.plot(x_fit, y_fit);

# Robust Regression

Outlier Demo: [http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html](http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html)

In [None]:
# sep=r'\s+' splits on runs of whitespace; delim_whitespace is deprecated in recent pandas
df = pd.read_csv(path+'housing.data', sep=r'\s+', header=None)

In [None]:
col_name = ['CRIM', 'ZN' , 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

In [None]:
df.columns = col_name

In [None]:
df.head()

## RANdom SAmple Consensus (RANSAC) Algorithm

link = [http://scikit-learn.org/stable/modules/linear_model.html#ransac-regression](http://scikit-learn.org/stable/modules/linear_model.html#ransac-regression)

Each iteration performs the following steps:

1. Select `min_samples` random samples from the original data and check whether the set of data is valid (see `is_data_valid`).

2. Fit a model to the random subset (`estimator.fit`) and check whether the estimated model is valid (see `is_model_valid`).

3. Classify all data as inliers or outliers by calculating the residuals to the estimated model (`estimator.predict(X) - y`). All data samples with absolute residuals smaller than the `residual_threshold` are considered inliers.

4. Save fitted model as best model if number of inlier samples is maximal. In case the current estimated model has the same number of inliers, it is only considered as the best model if it has better score.
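
The four steps above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's implementation: the function `ransac_line`, its default arguments, and the synthetic data are all assumptions made for the example. It fits only a straight line, skips the `is_model_valid` check and the score-based tie-break, and adds a final refit on the inliers that is not one of the four steps.

```python
import numpy as np

def ransac_line(x, y, n_iter=100, min_samples=2, residual_threshold=1.0, seed=0):
    """Simplified RANSAC for a line y = a*x + b (illustrative only)."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(x), dtype=bool)
    for _ in range(n_iter):
        # Step 1: draw a random subset; skip it if it cannot define a line.
        idx = rng.choice(len(x), size=min_samples, replace=False)
        if x[idx].max() == x[idx].min():
            continue
        # Step 2: fit a candidate model to the subset only.
        coef = np.polyfit(x[idx], y[idx], deg=1)
        # Step 3: samples with a small absolute residual are inliers.
        mask = np.abs(np.polyval(coef, x) - y) < residual_threshold
        # Step 4: keep the candidate with the largest set of inliers.
        if mask.sum() > best_mask.sum():
            best_mask = mask
    if best_mask.sum() < 2:
        raise ValueError("no usable set of inliers found")
    # Extra step in this sketch: refit the line on all inliers.
    best_coef = np.polyfit(x[best_mask], y[best_mask], deg=1)
    return best_coef, best_mask

# Synthetic data: a noisy line with five gross outliers.
rng = np.random.default_rng(1)
x_demo = np.linspace(0, 10, 50)
y_demo = 2 * x_demo + 1 + rng.normal(scale=0.1, size=50)
y_demo[::10] += 20

coef, mask = ransac_line(x_demo, y_demo)
print(coef, mask.sum())
```

The returned mask plays the same role as `inlier_mask_` on a fitted `RANSACRegressor`: it marks which samples agreed with the best candidate model.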

In [None]:
X = df['RM'].values.reshape(-1,1)
y = df['MEDV'].values

In [None]:
from sklearn.linear_model import RANSACRegressor

In [None]:
ransac = RANSACRegressor()

In [None]:
ransac.fit(X, y)

In [None]:
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

In [None]:
np.arange(3, 10, 1)

In [None]:
line_X = np.arange(3, 10, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))

In [None]:
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,8));
plt.scatter(X[inlier_mask], y[inlier_mask], 
            c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
            c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper left')
plt.show()

In [None]:
ransac.estimator_.coef_

In [None]:
ransac.estimator_.intercept_

***

In [None]:
X = df['LSTAT'].values.reshape(-1,1)
y = df['MEDV'].values
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
line_X = np.arange(0, 40, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))

In [None]:
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,8));
plt.scatter(X[inlier_mask], y[inlier_mask], 
            c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
            c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('% lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper right')
plt.show()

***

# Performance Evaluation of Regression Model

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#X = df['LSTAT'].values.reshape(-1,1)
X = df.iloc[:, :-1].values

In [None]:
y = df['MEDV'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X_train, y_train)

In [None]:
y_train_pred = lr.predict(X_train)

In [None]:
y_test_pred = lr.predict(X_test)

***

# Method 1: Residual Analysis

In [None]:
plt.figure(figsize=(12,8))
plt.scatter(y_train_pred, y_train_pred - y_train, c='blue', marker='o', label='Training data')
plt.scatter(y_test_pred, y_test_pred - y_test, c='orange', marker='*', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='k')
plt.xlim([-10, 50])
plt.show()

***

# Method 2: Mean Squared Error (MSE)

$$MSE=\frac{1}{n}\sum^n_{i=1}(y_i-\hat{y}_i)^2$$

* The average of the squared errors, i.e. the sum of squared errors (SSE) cost function divided by the number of samples $n$

* Useful for comparing different regression models

* Useful as a scoring criterion when tuning parameters via grid search and cross-validation

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
mean_squared_error(y_train, y_train_pred)

In [None]:
mean_squared_error(y_test, y_test_pred)
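
To make the formula concrete, the sketch below computes the MSE by hand and compares it with `mean_squared_error`. The arrays `y_true_demo` and `y_pred_demo` are made-up values for illustration, not results from the housing model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observed and predicted values.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# The library function should give the same value.
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(mse_manual, mse_sklearn)
```

Both values equal 5/3, the mean of the three squared errors.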

# Method 3: Coefficient of Determination, $R^2$

$$R^2 = 1 - \frac{SSE}{SST}$$

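SSE: sum of squared errors, $\sum^n_{i=1}(y_i-\hat{y}_i)^2$

SST: total sum of squares, $\sum^n_{i=1}(y_i-\bar{y})^2$, where $\bar{y}$ is the mean of the observed values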

In [None]:
from sklearn.metrics import r2_score

In [None]:
r2_score(y_train, y_train_pred)

In [None]:
r2_score(y_test, y_test_pred)
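
As a check on the definition, the sketch below computes $R^2$ by hand, taking SSE as the sum of squared differences between observed and predicted values and SST as the sum of squared differences between observed values and their mean. The arrays are made-up values for illustration, not results from the housing model.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# SSE: squared errors 1, 0 and 4.
sse = np.sum((y_true_demo - y_pred_demo) ** 2)

# SST: the mean of y_true_demo is 5, so the squared deviations are 4, 0 and 4.
sst = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - sse / sst
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

Here SSE is 5 and SST is 8, so $R^2 = 1 - 5/8 = 0.375$. A model that always predicts the mean would score 0, and a perfect model would score 1.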