# Introduction to Machine Learning

According to **Arthur Samuel**, **machine learning** is a field of computer science that gives computers the ability to learn without being explicitly programmed.

A machine learning project is typically classified into **two categories**, depending on its learning system:

*   Supervised Learning
*   Unsupervised Learning
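
The difference between the two categories can be sketched in a few lines. This is only an illustration, assuming the iris data set bundled with scikit-learn; the choice of **KNeighborsClassifier** and **KMeans** and the variable names are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the model is fitted on the features AND the known labels.
clf = KNeighborsClassifier().fit(X, y)
supervised_pred = clf.predict(X[:3])

# Unsupervised: the model sees only the features and groups them on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(supervised_pred)
print(cluster_ids[:3])
```

The cluster ids are arbitrary group numbers; they are not guaranteed to match the original label values.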





# Steps in a Machine Learning Project

A Machine Learning Project involves the following steps:

**Defining the Problem**: Define a problem statement that addresses a business problem.

**Obtaining the Source Data**: The raw data required to build a model may come from one or more sources, such as relational databases and social networking sites.

**Understanding Data Through Visualization**: Explore the data and understand important characteristics, such as its mean and spread.

**Preparing Data for Machine Learning Algorithms**: Captured raw data usually cannot be used directly to train a machine learning algorithm. The raw datasets have to be manipulated or transformed through one or more pre-processing steps.

**Choosing an Algorithm**: Based on the features of the data set, pick a suitable algorithm.

**Building the Model**: Train the algorithm with the chosen training data set and verify its performance through a metric.

**Fine-tuning the Model**: Identify the values of the key parameters associated with the chosen model that give better performance.

**Using the Best Model**: Use the best-performing model to address the defined problem.

# Introduction to scikit-learn

**scikit-learn** is a machine learning **toolkit in Python**. The package contains efficient tools used for **Data Mining and Data Analysis**.

It is built on the **NumPy, SciPy, and matplotlib** packages. It is **open source** and commercially usable under the **BSD license**.

## scikit-learn Utilities

The scikit-learn library has many utilities that can be used to perform the following machine learning tasks:

*   Preprocessing
*   Model Selection
*   Classification
*   Regression
*   Clustering
*   Dimensionality Reduction


## Steps with scikit-learn

Typically, one would perform the following steps while working on a machine learning problem with **scikit-learn**:


1.   Cleaning the raw data set.
2.   Transforming it further with scikit-learn pre-processing utilities.
3.   Splitting the data into train and test sets with the train_test_split utility.
4.   Creating a suitable model with default parameters.
5.   Training the model using the fit function.
6.   Evaluating the model and fine-tuning it.
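
The steps above can be sketched end to end as follows. This is a minimal sketch, assuming the breast cancer data set bundled with scikit-learn (already clean), a **StandardScaler** as the transformation, and a **KNeighborsClassifier** as the model; these choices and the variable names are assumptions for illustration. Here the scaler is fitted after the split, on the training set only, so that no information from the test set is used during training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()                      # step 1: an already clean data set

# step 3: split into train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, stratify=data.target, random_state=42)

scaler = StandardScaler().fit(X_tr)              # step 2: transformation learned on train data
model = KNeighborsClassifier()                   # step 4: model with default parameters
model.fit(scaler.transform(X_tr), y_tr)          # step 5: training with fit

# step 6: evaluation on the held-out test set
test_accuracy = model.score(scaler.transform(X_te), y_te)
print(test_accuracy)
```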



# Gathering Data from Multiple Sources

## Reading Data for ML

Any machine learning algorithm requires **data** for building a model.

The data can be obtained from multiple sources such as **HTTP and FTP repositories, databases, local repositories, etc**.

Often, raw data read from a source **cannot be used directly** by an ML algorithm for building a model.

So, raw data has to be **cleaned, processed, and transformed (if required)** before it is passed to an ML algorithm.

## Example Data - Breast Cancer Dataset

The **Breast Cancer data** set is a popular one, which contains details of **30** features obtained from **569** cancer patients.

We will perform the following tasks to make the cancer data set ready for ML:


*   Reading raw data from the UCI archive
*   Extracting features from the raw data
*   Naming or labelling the features
*   Extracting target values from the raw data
*   Naming or labelling the target values








### Reading Data from UCI Archive

The raw data set from UCI archive can be read with the following code snippet.

In [77]:
import pandas as pd

cancer_set = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', 
                        header = None)
print(cancer_set.shape)

(569, 32)


The raw dataset that was read contains **32** columns.

The **1st column** has **patient ID** details, and the **2nd** one has the **tumor type, i.e., malignant or benign**.

The remaining **30** columns represent various features obtained from **each patient**.

### Extracting Features from Raw Set

All columns representing features are extracted with the following code snippet.

In [78]:
cancer_features = cancer_set.iloc[:,2:]

print(cancer_features.shape)
print(type(cancer_features))

(569, 30)
<class 'pandas.core.frame.DataFrame'>


**cancer_features** is a **DataFrame**. It is converted to a **NumPy array** with the code below.

In [79]:
cancer_features = cancer_features.values
print(type(cancer_features))
print(cancer_features.shape)

<class 'numpy.ndarray'>
(569, 30)


### Naming features

The **30** features in the **cancer_features** dataset are labeled with the names listed below.

In [0]:
cancer_features_names = ['mean radius', 
'mean texture', 'mean perimeter', 
'mean area', 'mean smoothness', 
'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry',
'mean fractal dimension','radius error',
'texture error','perimeter error',
'area error', 'smoothness error',
'compactness error','concavity error',
'concave points error','symmetry error',
'fractal dimension error','worst radius',
'worst texture', 'worst perimeter', 
'worst area','worst smoothness', 
'worst compactness', 'worst concavity',
'worst concave points','worst symmetry',
'worst fractal dimension']

### Extracting target values from Raw data

The target value of each patient is extracted with the code snippet below.

In [81]:
cancer_target = cancer_set.iloc[:, 1]

# Mapping 'M' (malignant) to 0 and 'B' (benign) to 1
cancer_target = cancer_target.map({'M': 0, 'B': 1})

# Converting to numpy array
cancer_target = cancer_target.values

print(type(cancer_target))
print(cancer_target.shape)

<class 'numpy.ndarray'>
(569,)


The **cancer_features** and **cancer_target** arrays obtained this way can be used by an ML algorithm.

### scikit-learn Datasets

**scikit-learn** comes with a few popular datasets by default.

They can be loaded into your working environment and used.

You can learn more about these datasets in the scikit-learn documentation.

### Reading Cancer Data from scikit-learn

Previously, you read breast cancer data from the **UCI archive** and derived the **cancer_features** and **cancer_target** arrays.

The **same processed data** is available in **scikit-learn**. The code snippet below illustrates accessing the features and target arrays.

In [82]:
import sklearn.datasets as datasets

cancer = datasets.load_breast_cancer()

print(cancer.data.shape)
print(cancer.target.shape)

(569, 30)
(569,)
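
Besides the data and target arrays, the object returned by **load_breast_cancer** also carries the feature and target names, so they need not be typed by hand. A small sketch follows; the variable names are chosen only for this illustration.

```python
from sklearn.datasets import load_breast_cancer

cancer_bunch = load_breast_cancer()

# Names of the 30 feature columns and of the two target classes
feature_names = list(cancer_bunch.feature_names)
target_names = list(cancer_bunch.target_names)

print(feature_names[:3])
print(target_names)
```

The order of **target_names** shows which class each integer stands for: index 0 is malignant and index 1 is benign, matching the encoding used for **cancer_target** earlier.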


In [83]:
from sklearn import datasets

iris = datasets.load_iris()
type(iris)

sklearn.utils.Bunch

# Preprocessing with scikit-learn

**Preprocessing** is a step in which raw data is modified or transformed into a format suitable for further downstream processing.

scikit-learn provides many preprocessing utilities, such as:


*   Standardization (mean removal)
*   Scaling
*   Normalization
*   Binarization
*   One Hot Encoding
*   Label Encoding
*   Imputation


## Standardization

**Standardization**, or **Mean Removal**, is the process of transforming each feature so that it has mean **0** and variance **1**. It shifts and rescales the values; it does not change the shape of their distribution.

This can be achieved using **StandardScaler**.

An example with its output is shown below.

In [0]:
import sklearn.preprocessing as preprocessing

In [85]:
standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(cancer.data)
breast_cancer_standardized = standardizer.transform(cancer.data)

print('Mean of each feature after Standardization :\n\n')
print(breast_cancer_standardized.mean(axis=0))
print('\nStd. of each feature after Standardization :\n\n')
print(breast_cancer_standardized.std(axis=0))

Mean of each feature after Standardization :


[-3.16286735e-15 -6.53060890e-15 -7.07889127e-16 -8.79983452e-16
  6.13217737e-15 -1.12036918e-15 -4.42138027e-16  9.73249991e-16
 -1.97167024e-15 -1.45363120e-15 -9.07641468e-16 -8.85349205e-16
  1.77367396e-15 -8.29155139e-16 -7.54180940e-16 -3.92187747e-16
  7.91789988e-16 -2.73946068e-16 -3.10823423e-16 -3.36676596e-16
 -2.33322442e-15  1.76367415e-15 -1.19802625e-15  5.04966114e-16
 -5.21317026e-15 -2.17478837e-15  6.85645643e-16 -1.41265636e-16
 -2.28956670e-15  2.57517109e-15]

Std. of each feature after Standardization :


[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
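
To see what **StandardScaler** computes, the same result can be reproduced by hand: subtract the column mean and divide by the column standard deviation. The sketch below is illustrative only and reloads the data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# z = (x - mean) / std, computed per feature (column)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

same = np.allclose(manual, scaled)
print(same)
```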


## Scaling

Scaling transforms existing data values to lie between a **minimum** and **maximum** value.

**MinMaxScaler** transforms data to the range **0** to **1** by default.

**MaxAbsScaler** transforms data to the range **-1** to **1**.

Transforming the **breast_cancer** dataset through scaling is shown below.

### Using MinMaxScaler

**MinMaxScaler** can be used with a specified range.

Here, data is transformed to the range **0** to **10**.

In [0]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 10)).fit(cancer.data)

breast_cancer_minmaxscaled10 = min_max_scaler.transform(cancer.data)

In [87]:
breast_cancer_minmaxscaled10

array([[5.21037437, 0.22658099, 5.45988529, ..., 9.12027491, 5.98462448,
        4.18863964],
       [6.43144493, 2.72573554, 6.15783291, ..., 6.39175258, 2.33589592,
        2.22878132],
       [6.01495575, 3.90260399, 5.95743211, ..., 8.35051546, 4.03705894,
        2.13433032],
       ...,
       [4.55251077, 6.21237741, 4.45788128, ..., 4.87285223, 1.28720678,
        1.51908697],
       [6.44564343, 6.63510315, 6.65537972, ..., 9.10652921, 4.97141731,
        4.52315361],
       [0.36868759, 5.01521813, 0.28539838, ..., 0.        , 2.57441356,
        1.00682146]])

By default, the transformation maps data to the range **0** to **1**, as shown below. The **feature_range** argument, used in the previous example, customizes this range.

In [0]:
min_max_scaler = preprocessing.MinMaxScaler().fit(cancer.data)

breast_cancer_minmaxscaled = min_max_scaler.transform(cancer.data)

In [89]:
breast_cancer_minmaxscaled

array([[0.52103744, 0.0226581 , 0.54598853, ..., 0.91202749, 0.59846245,
        0.41886396],
       [0.64314449, 0.27257355, 0.61578329, ..., 0.63917526, 0.23358959,
        0.22287813],
       [0.60149557, 0.3902604 , 0.59574321, ..., 0.83505155, 0.40370589,
        0.21343303],
       ...,
       [0.45525108, 0.62123774, 0.44578813, ..., 0.48728522, 0.12872068,
        0.1519087 ],
       [0.64456434, 0.66351031, 0.66553797, ..., 0.91065292, 0.49714173,
        0.45231536],
       [0.03686876, 0.50152181, 0.02853984, ..., 0.        , 0.25744136,
        0.10068215]])
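
The min-max transformation can also be written out by hand, which makes the role of **feature_range** clearer. The sketch below is a self-contained illustration; the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

X = load_breast_cancer().data

# (x - min) / (max - min) per feature maps every column to [0, 1]
unit = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

default_scaled = MinMaxScaler().fit_transform(X)
wide_scaled = MinMaxScaler(feature_range=(0, 10)).fit_transform(X)

# A custom range (0, 10) is the unit result stretched by a factor of 10
print(np.allclose(unit, default_scaled))
print(np.allclose(unit * 10, wide_scaled))
```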

### Using MaxAbsScaler

Using **MaxAbsScaler**, the maximum absolute value of each feature is scaled to unit size, i.e., **1**.

It is intended for data that is already centered at **zero**, or for **sparse** data.

**MaxAbsScaler** transforms data to the range **-1** to **1**.

In [0]:
max_abs_scaler = preprocessing.MaxAbsScaler().fit(cancer.data)

breast_cancer_maxabsscaled = max_abs_scaler.transform(cancer.data)

In [91]:
breast_cancer_maxabsscaled

array([[0.63998577, 0.26425662, 0.65145889, ..., 0.91202749, 0.69313046,
        0.57301205],
       [0.73176805, 0.45239308, 0.70503979, ..., 0.63917526, 0.41428141,
        0.42901205],
       [0.70046247, 0.54098778, 0.68965517, ..., 0.83505155, 0.54429045,
        0.42207229],
       ...,
       [0.59053718, 0.71486762, 0.57453581, ..., 0.48728522, 0.33413679,
        0.37686747],
       [0.73283529, 0.74669043, 0.74323607, ..., 0.91065292, 0.6156975 ,
        0.59759036],
       [0.27605834, 0.62474542, 0.25421751, ..., 0.        , 0.43250979,
        0.33922892]])
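
The breast cancer features are all non-negative, so the output above stays between 0 and 1. A small toy array with negative values (made up for this illustration) shows the full **-1** to **1** range:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy data: the largest absolute value is 4 in both columns
X_toy = np.array([[-4.0, 2.0],
                  [2.0, -1.0],
                  [1.0, 4.0]])

toy_scaled = MaxAbsScaler().fit_transform(X_toy)
print(toy_scaled)
# [[-1.    0.5 ]
#  [ 0.5  -0.25]
#  [ 0.25  1.  ]]
```

Each column is simply divided by its largest absolute value, so signs and zeros are preserved.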

## Normalization

**Normalization** scales each sample to have a unit norm.

**Normalization** can be achieved with the **'l1', 'l2', and 'max'** norms.

The **'l1'** norm makes the **sum of absolute values** of each row equal to **1**, and the **'l2'** norm makes the **sum of squares** of each row equal to **1**.

The **'l1'** norm is less sensitive to **outliers**.

By **default**, the **l2** norm is used.

Hence, removing **outliers** is recommended before applying the **l2** norm.

In [0]:
normalizer = preprocessing.Normalizer(norm='l1').fit(cancer.data)

breast_cancer_normalized = normalizer.transform(cancer.data)

In [93]:
sum(breast_cancer_normalized[3,:])

0.9999999999999999
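
The three norms can be compared on a single toy row. The row [3, 4] below is made up for this illustration because its norms are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

row = np.array([[3.0, 4.0]])

l1_row = Normalizer(norm='l1').fit_transform(row)    # divide by |3| + |4| = 7
l2_row = Normalizer(norm='l2').fit_transform(row)    # divide by sqrt(9 + 16) = 5
max_row = Normalizer(norm='max').fit_transform(row)  # divide by max(|3|, |4|) = 4

print(l1_row)   # [[3/7, 4/7]]
print(l2_row)   # [[0.6, 0.8]]
print(max_row)  # [[0.75, 1.0]]
```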

## Binarization

**Binarization** is the process of transforming data points to **0** or **1** based on a given threshold.


*   Any value above the threshold is transformed to **1**, and any value less than or equal to the threshold is transformed to **0**.
*   By **default**, a threshold of **0** is used.


In [94]:
binarizer = preprocessing.Binarizer(threshold=3.0).fit(cancer.data)
breast_cancer_binarized = binarizer.transform(cancer.data)
print(breast_cancer_binarized[:5,:5])

[[1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]]
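
The behaviour exactly at the threshold is easy to check on a toy row (values made up for this illustration): a value equal to the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

values = np.array([[1.0, 3.0, 5.0]])

# Only values strictly greater than the threshold become 1
toy_binarized = Binarizer(threshold=3.0).fit_transform(values)
print(toy_binarized)  # [[0. 0. 1.]]
```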


## OneHotEncoder

**OneHotEncoder** converts categorical integer values into one-hot vectors.

In a **one-hot vector**, every category is transformed into a binary attribute having only **0 and 1 values**.

The example below creates two binary attributes for the categorical integers 1 and 2.

In [95]:
onehotencoder = preprocessing.OneHotEncoder()
onehotencoder = onehotencoder.fit([[1], [1], [1], [2], [2], [1]])

# Transforming category values 1 and 2 to one-hot vectors
print(onehotencoder.transform([[1]]).toarray())
print(onehotencoder.transform([[2]]).toarray())

[[1. 0.]]
[[0. 1.]]
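
In scikit-learn 0.20 and later, **OneHotEncoder** also accepts string categories directly. The sketch below uses made-up region codes and the **handle_unknown='ignore'** option, under which a category not seen during fitting is encoded as a row of zeros; treat the data and names as assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cities = np.array([['HYD'], ['CHN'], ['MUM'], ['HYD']])

encoder = OneHotEncoder(handle_unknown='ignore').fit(cities)

# Categories are stored in sorted order: CHN, HYD, MUM
categories = list(encoder.categories_[0])

# 'DEL' was never seen during fit, so it becomes an all-zero row
encoded = encoder.transform(np.array([['CHN'], ['DEL']])).toarray()

print(categories)
print(encoded)
# [[1. 0. 0.]
#  [0. 0. 0.]]
```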


## Imputation

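**Imputation** replaces **missing values** with the **median, mean, or most common value** of the **column** in which the missing values exist.

The example below replaces **missing** values, represented by **np.nan**, with the **mean** of the respective column. It uses **SimpleImputer** from **sklearn.impute**, which replaces the **Imputer** class removed from **sklearn.preprocessing** in scikit-learn 0.22.

In [0]:
import numpy as np
from sklearn.impute import SimpleImputer

# Replace every np.nan entry with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer = imputer.fit(cancer.data)
breast_cancer_imputed = imputer.transform(cancer.data)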

In [97]:
breast_cancer_imputed

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])
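
The breast cancer data has no missing values, so the output above equals the input. A toy array with gaps (made up for this illustration) shows the effect more clearly. The sketch uses **SimpleImputer** from **sklearn.impute**, the imputer available in current scikit-learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [5.0, np.nan]])

# Column means ignore the missing entries: (1 + 5) / 2 = 3 and (2 + 4) / 2 = 3
toy_imputed = SimpleImputer(strategy='mean').fit_transform(X_missing)
print(toy_imputed)
# [[1. 2.]
#  [3. 4.]
#  [5. 3.]]
```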

## Label Encoding

**Label Encoding** is a step in which categorical features are represented as **categorical integers**.

An example of transforming the categorical values **["benign", "malignant"]** into **[0, 1]** is shown below.

In [0]:
labels = ['malignant', 'benign', 'malignant', 'benign']

labelencoder = preprocessing.LabelEncoder()

labelencoder = labelencoder.fit(labels)

bc_labelencoded = labelencoder.transform(cancer.target_names)

In [99]:
bc_labelencoded

array([1, 0])
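
The mapping learned by **LabelEncoder** can be inspected through its **classes_** attribute and reversed with **inverse_transform**. A short, self-contained sketch with the same labels:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['malignant', 'benign', 'malignant', 'benign']

encoder = LabelEncoder().fit(labels)

# classes_ holds the sorted categories; the position is the integer code
classes = list(encoder.classes_)
codes = list(encoder.transform(labels))
restored = list(encoder.inverse_transform(codes))

print(classes)   # ['benign', 'malignant']
print(codes)     # [1, 0, 1, 0]
print(restored)  # ['malignant', 'benign', 'malignant', 'benign']
```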

## HandsOn

Import two modules, **sklearn.datasets** and **sklearn.preprocessing**.

Load the popular **iris data set** from the **sklearn.datasets** module and assign it to the variable '**iris**'.

Perform **Imputation** on '**iris.data**' and save the transformed data in the variable '**iris_imputed**'.

- ***Hint***: Use the **SimpleImputer** API from **sklearn.impute** (the replacement for the removed **Imputer** class) and replace **np.nan** values with the **mean** of the corresponding column.

In [0]:
import numpy as np
from sklearn import datasets
from sklearn.impute import SimpleImputer
import sklearn.preprocessing as preprocessing

iris = datasets.load_iris()
# Replace np.nan entries with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(iris.data)
iris_imputed = imputer.transform(iris.data)

Perform **Standardization** transformation on **iris.data** and save the transformed data in the variable **iris_standarized**.

***Hint***: Use the **StandardScaler** API.

In [0]:
standardizer=preprocessing.StandardScaler()
standardizer=standardizer.fit(iris.data)
iris_standarized=standardizer.transform(iris.data)

Convert the **categorical integer** list **iris.target** into a **three binary attribute** representation and store the result in the variable **iris_target_onehot**.

***Hint***: Use **reshape(-1,1)** on **iris.target** and **OneHotEncoder**.

Transform **iris_target_onehot** to an **array** representation and display the **first five rows** of it.

***Hint***: Use the **toarray** method.

In [102]:
reshape_iris_target = iris.target.reshape(-1, 1)
# Create a fresh encoder for the iris targets
onehotencoder = preprocessing.OneHotEncoder()
onehotencoder = onehotencoder.fit(reshape_iris_target)
iris_target_onehot = onehotencoder.transform(reshape_iris_target).toarray()
iris_target_onehot[:5]

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])

## OtherHandsOn

In [103]:
regions = ['HYD', 'CHN', 'MUM', 'HYD', 'KOL', 'CHN']
print(preprocessing.LabelEncoder().fit(regions).transform(regions))

[1 0 3 1 2 0]


# Nearest Neighbors Technique

The **nearest neighbors** method finds a predefined number of training data points that are closest to a sample point and predicts its label from them.

**sklearn.neighbors** provides utilities for **unsupervised** and **supervised** neighbors-based learning methods.

scikit-learn implements **two** different nearest neighbors classifiers:


*   KNeighborsClassifier
*   RadiusNeighborsClassifier




## Nearest Neighbor Classifiers


*   **KNeighborsClassifier** classifies based on the **k** nearest neighbors of every query point, where **k** is an **integer** value specified by the user.

*   **RadiusNeighborsClassifier** classifies based on the neighbors present in a fixed radius **r** of every query point, where **r** is specified by the user.



## Nearest Neighbors Regression

**scikit-learn** implements the following two regressors:

*   **KNeighborsRegressor** predicts based on the **k** nearest neighbors of each query point.
*   **RadiusNeighborsRegressor** predicts based on the neighbors present in a fixed radius **r** of the query point.
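
The two regressors can be compared on a tiny one-dimensional example. The data below is made up so that the prediction is easy to verify by hand: the query point 1.5 lies midway between the training points 1 and 2.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor

X_line = np.array([[0.0], [1.0], [2.0], [3.0]])
y_line = np.array([0.0, 1.0, 2.0, 3.0])
query = np.array([[1.5]])

# k = 2: averages the targets of the two closest points (1 and 2)
knn_pred = KNeighborsRegressor(n_neighbors=2).fit(X_line, y_line).predict(query)

# r = 1.0: averages the targets of all points within distance 1.0 (again 1 and 2)
radius_pred = RadiusNeighborsRegressor(radius=1.0).fit(X_line, y_line).predict(query)

print(knn_pred)     # [1.5]
print(radius_pred)  # [1.5]
```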


## Demo of KNeighborsClassifier

In [0]:
import sklearn.datasets as datasets

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

cancer = datasets.load_breast_cancer()  # Loading the data set

### Building a Model of KNN classifier

The following code creates **training and test data sets**, initializes a **KNN classifier**, and fits it with training data.

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,
                                                    stratify=cancer.target,
                                                    random_state=42)

knn_classifier = KNeighborsClassifier()   

knn_classifier = knn_classifier.fit(X_train, Y_train) 

### Determining Accuracy of the Model

In [106]:
print('Accuracy of Train Data :', knn_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', knn_classifier.score(X_test,Y_test))

Accuracy of Train Data : 0.9460093896713615
Accuracy of Test Data : 0.9300699300699301


## HandsOn

Import the **three** modules **sklearn.datasets**, **sklearn.model_selection**, and **sklearn.neighbors**.

Load the popular **iris** data set from the **sklearn.datasets** module and assign it to the variable **iris**.

**Split iris.data** into two sets named **x_train** and **x_test**. Also, **split iris.target** into two sets named **y_train** and **y_test**.

***Hint***: Use the **train_test_split** method from **sklearn.model_selection**; set **random_state** to **30** and perform stratified sampling.

In [0]:
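import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
# Note: stratify=iris.target is not passed here, so this split is not stratified
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=30)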

Fit a **K nearest neighbors model** on the **x_train data** with **default parameters**. Name the model **knn_clf**.

Evaluate the model **accuracy on the x_train and x_test sets**.

In [108]:
knn_clf=KNeighborsClassifier()
knn_clf=knn_clf.fit(x_train,y_train)
print('Accuracy on train: ',knn_clf.score(x_train,y_train))
print('Accuracy on test: ',knn_clf.score(x_test,y_test))

Accuracy on train:  0.9821428571428571
Accuracy on test:  0.9210526315789473


Fit **multiple** K nearest neighbors models on the x_train data, with the **n_neighbors** parameter value changing from **3 to 10**.

Evaluate each model's **accuracy on the x_train and x_test sets**.

***Hint***: Make use of a **for loop**.

In [109]:
for r in range(3,11):
  knn_clf=KNeighborsClassifier(n_neighbors=r)
  knn_clf=knn_clf.fit(x_train,y_train)
  print('Accuracy on train for n_neighbors %s:%s ' %(r,knn_clf.score(x_train,y_train)))
  print('Accuracy on test for n_neighbors %s:%s '%(r,knn_clf.score(x_test,y_test)))

Accuracy on train for n_neighbors 3:0.9732142857142857 
Accuracy on test for n_neighbors 3:0.9210526315789473 
Accuracy on train for n_neighbors 4:0.9732142857142857 
Accuracy on test for n_neighbors 4:0.9473684210526315 
Accuracy on train for n_neighbors 5:0.9821428571428571 
Accuracy on test for n_neighbors 5:0.9210526315789473 
Accuracy on train for n_neighbors 6:0.9732142857142857 
Accuracy on test for n_neighbors 6:0.9473684210526315 
Accuracy on train for n_neighbors 7:0.9732142857142857 
Accuracy on test for n_neighbors 7:0.9736842105263158 
Accuracy on train for n_neighbors 8:0.9642857142857143 
Accuracy on test for n_neighbors 8:0.9473684210526315 
Accuracy on train for n_neighbors 9:0.9642857142857143 
Accuracy on test for n_neighbors 9:0.9473684210526315 
Accuracy on train for n_neighbors 10:0.9642857142857143 
Accuracy on test for n_neighbors 10:0.9473684210526315 
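
The loop above compares models on a single train/test split, so the ranking can change with the split. As a hedged alternative (a different technique from the exercise, not part of it), cross-validation from **sklearn.model_selection** averages the accuracy over several splits. The sketch below assumes 5 folds and uses illustrative variable names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate n_neighbors
cv_scores = {}
for k in range(3, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    cv_scores[k] = scores.mean()

best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores[best_k])
```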


# Decision Trees Technique

## Decision Trees

**Decision Trees** are another Supervised Learning method used for **Classification** and **Regression**.



*   **Decision Trees** learn **simple decision rules** from **training data** and build a **model**.
*   **DecisionTreeClassifier** and **DecisionTreeRegressor** are the two utilities from **sklearn.tree**, which can be used for **classification** and **regression** respectively.



## Advantages of Decision Trees

*   Decision Trees are **easy** to understand.
*   They often do **not require any preprocessing**.
*   Decision Trees can learn from both **numerical** and **categorical** data.






## Disadvantages of Decision Trees


*   **Decision trees** sometimes become **complex**; such trees do not **generalize** well, which leads to **overfitting**. **Overfitting** can be addressed by setting the **minimum number of samples** required at a **leaf node** or by limiting the **maximum depth** of the tree.
*   A **small variation** in data can result in a **completely different tree**. This problem can be **addressed** by using decision trees within an **ensemble**.
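
The two controls mentioned above correspond to the **min_samples_leaf** and **max_depth** parameters of the tree estimators. The sketch below compares an unconstrained tree with a constrained one on the breast cancer data; the specific values 3 and 5 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Unconstrained tree: grows until the training data is fitted
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Constrained tree: at most 3 levels, at least 5 samples in every leaf
small_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(X_tr, y_tr)

for name, tree in [('full', full_tree), ('small', small_tree)]:
    print(name, tree.get_depth(), tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```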



## Building a Decision Tree Classifier Model

The following code builds a Decision Tree Classifier model.

Before executing this code, import the required modules, load the cancer dataset, and create the train and test data sets as shown in the KNeighborsClassifier example.

In [0]:
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier()   

dt_classifier = dt_classifier.fit(X_train, Y_train) 

## Determining Accuracy of the Model

The code below determines the model accuracy. You can observe that the model is **overfitted**: it scores perfectly on the training data but lower on the test data.

In [111]:
print('Accuracy of Train Data :', dt_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', dt_classifier.score(X_test,Y_test))

Accuracy of Train Data : 1.0
Accuracy of Test Data : 0.9090909090909091


## Fine Tuning the Model

Next, the model is constrained by setting the **max_depth** value to **2**, which narrows the gap between training and test accuracy.

In [112]:
dt_classifier = DecisionTreeClassifier(max_depth=2)   

dt_classifier = dt_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', dt_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', dt_classifier.score(X_test,Y_test))

Accuracy of Train Data : 0.9577464788732394
Accuracy of Test Data : 0.9090909090909091


## HandsOn

Import the **three** modules **sklearn.datasets, sklearn.model_selection, and sklearn.tree**.

Load the popular **Boston** dataset from the **sklearn.datasets** module and assign it to the variable **boston**.

**Split boston.data** into two sets named **x_train** and **x_test**. Also, **split boston.target** into two sets named **y_train** and **y_test**.

***Hint***: Use the **train_test_split** method from **sklearn.model_selection**; set **random_state** to **30**.

In [0]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# load_boston was removed in scikit-learn 1.2; this cell needs an earlier release
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=30)

Build a **Decision Tree Regressor** model from the **x_train set** with **default** parameters. Name the model **dt_reg**.

Evaluate the model **accuracy on the x_train** and **x_test** sets. For regressors, the **score** method returns the R² coefficient of determination.

Predict the **housing price for the first two samples** of the **x_test** set.

In [114]:
dt_reg=DecisionTreeRegressor()
dt_reg=dt_reg.fit(x_train,y_train)
dt_reg.predict(x_test[:2])

array([18.2, 12.8])

Fit **multiple** Decision Tree regressors on the **x_train** data, with the **max_depth** parameter value changing from **2 to 5**.

Evaluate each model's **accuracy on the x_train and x_test** sets.

In [115]:
for m in range(2,6):
  dt_reg=DecisionTreeRegressor(max_depth=m)
  dt_reg=dt_reg.fit(x_train,y_train)
  # The varied parameter is max_depth, so the messages name it accordingly
  print('Accuracy on train for max_depth %s:%s '%(m,dt_reg.score(x_train,y_train)))
  print('Accuracy on test for max_depth %s:%s '%(m,dt_reg.score(x_test,y_test)))

Accuracy on train for max_depth 2:0.6939571491936384 
Accuracy on test for max_depth 2:0.6876109752166819 
Accuracy on train for max_depth 3:0.8205778867160011 
Accuracy on test for max_depth 3:0.6962264524668584 
Accuracy on train for max_depth 4:0.8990286027546579 
Accuracy on test for max_depth 4:0.7086640885662667 
Accuracy on train for max_depth 5:0.9333783943749767 
Accuracy on test for max_depth 5:0.5657838515086966 


# Ensemble Methods

**Ensemble** methods **combine the predictions** of **other learning algorithms** to **improve generalization**.

Ensemble methods are of **two** types:

**Averaging Methods**: They build several base estimators **independently** and finally **average** their predictions.

***E.g.***: Bagging Methods, Forests of randomized trees

**Boosting Methods**: They build base estimators **sequentially** and try to **reduce** the bias of the **combined** estimator.

***E.g.***: AdaBoost, Gradient Tree Boosting

## Bagging Methods

**Bagging Methods** draw **random** subsets of the **original dataset**, build an **estimator** on each subset, and **aggregate** the individual results to form a **final one**.

**BaggingClassifier** and **BaggingRegressor** are the utilities from **sklearn.ensemble** to deal with Bagging.
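
A minimal sketch of bagging follows, assuming decision trees as the base estimator and 25 estimators; both are arbitrary choices for illustration. The base estimator is passed positionally because the name of that keyword argument differs between scikit-learn releases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# 25 trees, each fitted on a random sample of the training set
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                            random_state=0).fit(X_tr, y_tr)

bagging_test_accuracy = bagging.score(X_te, y_te)
print(len(bagging.estimators_), bagging_test_accuracy)
```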

## Randomized Trees

**sklearn.ensemble** offers **two** types of algorithms based on **randomized trees**: **Random Forest** and **Extremely Randomized Trees (Extra-Trees)** algorithms.


*   **RandomForestClassifier** and **RandomForestRegressor** classes are used to deal with random forests.
*   In random forests, each estimator is built from a sample drawn with replacement from the training set.
*   **ExtraTreesClassifier** and **ExtraTreesRegressor** classes are used to deal with extremely randomized forests.
*   In extremely randomized forests, more randomness is introduced, which further reduces the variance of the model.
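
Both kinds of forest share the same interface, so they can be swapped freely. The sketch below fits one of each on the breast cancer data; **n_estimators=100** and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

forest_scores = {}
for name, model in [
    ('random forest', RandomForestClassifier(n_estimators=100, random_state=0)),
    ('extra trees', ExtraTreesClassifier(n_estimators=100, random_state=0)),
]:
    # Same fit/score calls for both ensemble types
    forest_scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)

print(forest_scores)
```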







## Demo of Random Forest Classifier

In [116]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier()

rf_classifier = rf_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', rf_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', rf_classifier.score(X_test,Y_test))

Accuracy of Train Data : 0.9929577464788732
Accuracy of Test Data : 0.951048951048951


## Boosting Methods

Boosting Methods **combine several weak models** to create an **improved** ensemble.

**sklearn.ensemble** also provides the following boosting algorithms:


*   AdaBoostClassifier
*   GradientBoostingClassifier
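
The two boosting classifiers listed above can be tried with the same fit and score calls used elsewhere in this course. The sketch below keeps the default parameters apart from **random_state**; the data split and variable names are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

boosting_scores = {}
for name, model in [
    ('adaboost', AdaBoostClassifier(random_state=0)),
    ('gradient boosting', GradientBoostingClassifier(random_state=0)),
]:
    # Each model builds its base estimators one after another
    boosting_scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)

print(boosting_scores)
```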



## HandsOn

Import the **three** modules **sklearn.datasets, sklearn.model_selection, and sklearn.ensemble**.

Load the popular **boston** data set from the **sklearn.datasets** module and assign it to the variable **boston**.

**Split boston.data** into two sets named **x_train** and **x_test**. Also, **split boston.target** into two sets named **y_train** and **y_test**.

***Hint***: Use the **train_test_split** method from **sklearn.model_selection**; set **random_state** to **30**.

In [0]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# load_boston was removed in scikit-learn 1.2; this cell needs an earlier release
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=30)

Build a **Random Forest Regressor** model from **x_train** set, with **default** parameters. Name the model as **rf_reg**.

Evaluate the model **accuracy** on **x_train** and **x_test** sets.

In [118]:
rf_reg=RandomForestRegressor()
rf_reg=rf_reg.fit(x_train,y_train)
print('Accuracy of Train Data :', rf_reg.score(x_train,y_train))

print('Accuracy of Test Data :', rf_reg.score(x_test,y_test))

Accuracy of Train Data : 0.9792263412328398
Accuracy of Test Data : 0.8749645488648881


Build **multiple** Random Forest regressors on the **x_train** data, with the **max_depth** parameter value changing from **3 to 5** and **n_estimators** set to each of the values **50, 100, 200**.

Evaluate each model's accuracy on the **x_train** and **x_test** sets.

In [119]:
for m in range(3,6):
  for ne in [50,100,200]:
    rf_reg=RandomForestRegressor(max_depth=m,n_estimators=ne)
    rf_reg=rf_reg.fit(x_train,y_train)
    print('Accuracy of Train Data for max_depth:%s and n_estimators:%s is:%s'%(m,ne, rf_reg.score(x_train,y_train)))
    print('Accuracy of Test Data for max_depth:%s and n_estimators:%s is:%s:' %(m,ne,rf_reg.score(x_test,y_test)))

Accuracy of Train Data for max_depth:3 and n_estimators:50 is:0.8825584651199202
Accuracy of Test Data for max_depth:3 and n_estimators:50 is:0.8366357831878104:
Accuracy of Train Data for max_depth:3 and n_estimators:100 is:0.8812826442500727
Accuracy of Test Data for max_depth:3 and n_estimators:100 is:0.8359930752874545:
Accuracy of Train Data for max_depth:3 and n_estimators:200 is:0.8751069140883474
Accuracy of Test Data for max_depth:3 and n_estimators:200 is:0.8311704016002732:
Accuracy of Train Data for max_depth:4 and n_estimators:50 is:0.9202654884394835
Accuracy of Test Data for max_depth:4 and n_estimators:50 is:0.8567989885308496:
Accuracy of Train Data for max_depth:4 and n_estimators:100 is:0.9204019581331715
Accuracy of Test Data for max_depth:4 and n_estimators:100 is:0.8616681828203232:
Accuracy of Train Data for max_depth:4 and n_estimators:200 is:0.920846952484561
Accuracy of Test Data for max_depth:4 and n_estimators:200 is:0.8572532514231476:
Accuracy of Train Dat

# Support Vector Machines Technique

**Support Vector Machines (SVMs)** separate data points using **decision planes**, which divide objects belonging to different **classes** in a **higher dimensional space**.

*   An SVM relies on a suitable **kernel**, capable of **separating** data points into **two or more classes**.
*   Commonly used kernels are **linear**, **polynomial**, **rbf**, and **sigmoid**.
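
The effect of the kernel is easiest to see on data that no straight line can separate. The sketch below uses the synthetic **make_circles** generator from **sklearn.datasets** (one class forms a ring around the other); the data set and its parameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    kernel_scores[kernel] = SVC(kernel=kernel).fit(X_tr, y_tr).score(X_te, y_te)

print(kernel_scores)
```

On this kind of data the **rbf** kernel should clearly outperform the **linear** one, since it can form a closed decision boundary.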




## Support Vector Classification

**scikit-learn** provides the following **three** utilities for performing Support Vector Classification.



*   **SVC**
*   **NuSVC**: Same as SVC, but uses a **parameter** to control the **number of support vectors**.
*   **LinearSVC**: Similar to SVC with the **kernel** parameter set to **linear**.







## Support Vector Regression

**scikit-learn** provides the following **three** utilities for performing Support Vector Regression.


*   **SVR**
*   **NuSVR**
*   **LinearSVR**






## Advantages of SVMs

*   SVM can distinguish the classes in a **higher dimensional space**.
*   SVM algorithms are **memory efficient**.
*   SVMs are **versatile**: a **different kernel** can be used by the **decision function**.





## Disadvantages of SVMs


*   SVMs do **not perform** well on **high dimensional data** with **many samples**.
*   SVMs **work** well only with **preprocessed data**.
*   They are **harder to visualize**.





## Demo of Support Vector Classification


In [120]:
from sklearn.svm import SVC

svm_classifier = SVC()

svm_classifier = svm_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))

Accuracy of Train Data : 1.0
Accuracy of Test Data : 0.6293706293706294


## Improving Accuracy Using Scaled Data

In [0]:
import sklearn.preprocessing as preprocessing

standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(cancer.data)
cancer_standardized = standardizer.transform(cancer.data)
X_train,X_test,Y_train,Y_test=train_test_split(cancer_standardized,cancer.target,random_state=30)

svm_classifier = SVC()

svm_classifier = svm_classifier.fit(X_train, Y_train) 

## Determining Accuracy of New Model

In [122]:
print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))

Accuracy of Train Data : 0.9835680751173709
Accuracy of Test Data : 0.9790209790209791
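
In the demo above, the scaler is fitted on the whole data set before the split, so the test rows influence the scaling statistics. A common alternative is to put the scaler and the classifier in a pipeline that is fitted on the training rows only. The sketch assumes that **cancer** was loaded with **load_breast_cancer()**; the names **scaled_svm**, **X_tr**, and **X_te** are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: `cancer` in the demo is the breast cancer data set.
cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(cancer.data, cancer.target, random_state=30)

# The scaler learns its mean and variance from the training rows only;
# the same transformation is then applied to the test rows inside score().
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)

test_accuracy = scaled_svm.score(X_te, y_te)
print("Accuracy of Test Data :", test_accuracy)
```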


## Viewing the Classification Report

In [123]:
from sklearn import metrics

Y_pred = svm_classifier.predict(X_test)

print('Classification report : \n',metrics.classification_report(Y_test, Y_pred))

Classification report : 
              precision    recall  f1-score   support

          0       0.98      0.96      0.97        52
          1       0.98      0.99      0.98        91

avg / total       0.98      0.98      0.98       143



## HandsOn

Import the **three** modules **sklearn.datasets, sklearn.model_selection, and sklearn.svm**.

Load the popular **digits** dataset from the **sklearn.datasets** module and assign it to the variable **digits**.

**Split digits.data** into two sets named **x_train** and **x_test**. Also **split digits.target** into two sets, **y_train** and **y_test**.

***Hint***: Use **train_test_split** method from **sklearn.model_selection**; set **random_state** to **30**.

In [0]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
digits=datasets.load_digits()
x_train,x_test,y_train,y_test=train_test_split(digits.data,digits.target,random_state=30)

Build an SVM classifier from **x_train** set, with **default** parameters. Name the model as **svm_clf**.

Predict the class of samples in **x_test** set and get the **classification_report** of the prediction.


In [125]:
from sklearn import metrics
svm_clf=SVC()
svm_clf=svm_clf.fit(x_train,y_train)
y_pred=svm_clf.predict(x_test)
print('Classification report : \n',metrics.classification_report(y_test, y_pred))

Classification report : 
              precision    recall  f1-score   support

          0       1.00      0.43      0.60        44
          1       1.00      0.25      0.41        51
          2       1.00      0.22      0.37        49
          3       0.12      1.00      0.22        37
          4       1.00      0.40      0.57        50
          5       1.00      0.23      0.38        47
          6       1.00      0.48      0.65        54
          7       1.00      0.56      0.72        41
          8       1.00      0.03      0.05        40
          9       1.00      0.62      0.77        37

avg / total       0.93      0.41      0.48       450



Build **multiple** SVM classifiers on **x_train** data, setting the **C** parameter to each of the values **5, 100, 400** in turn.

For each model, predict the class of samples in the **x_test** set and get the **classification_report** of the prediction.

In [126]:
for c in [5,100,400]:
  svm_clf=SVC(C=c)
  svm_clf=svm_clf.fit(x_train,y_train)
  y_pred=svm_clf.predict(x_test)
  print('Classification report with C parameter:%s :%s \n'%(c,metrics.classification_report(y_test, y_pred)))
  
  

Classification report with C parameter:5 :             precision    recall  f1-score   support

          0       1.00      0.48      0.65        44
          1       1.00      0.25      0.41        51
          2       1.00      0.37      0.54        49
          3       0.14      1.00      0.24        37
          4       1.00      0.52      0.68        50
          5       1.00      0.30      0.46        47
          6       1.00      0.57      0.73        54
          7       1.00      0.63      0.78        41
          8       1.00      0.12      0.22        40
          9       1.00      0.62      0.77        37

avg / total       0.93      0.48      0.55       450
 

Classification report with C parameter:100 :             precision    recall  f1-score   support

          0       1.00      0.48      0.65        44
          1       1.00      0.25      0.41        51
          2       1.00      0.37      0.54        49
          3       0.14      1.00      0.24        37
       

# Clustering Technique

## Introduction to Clustering

**Clustering** is one of the **unsupervised learning techniques**.


*   The technique is typically used to **group data points** into **clusters** based on a **specific algorithm**.
*   Major clustering algorithms that can be implemented using scikit-learn are:    

    *   K-means Clustering
    *   Agglomerative clustering  
    *   DBSCAN clustering
    *   Mean-shift clustering  
    *   Affinity propagation
    *   Spectral clustering



## K-Means Clustering

In **K-means Clustering**, the entire data set is **grouped into k clusters**.

Steps involved are:


*   **k** centroids are chosen **randomly**.
*   The distance of each data point from **k** centroids is calculated. A data point is assigned to the **nearest cluster**.
*   Centroids of **k** clusters are recomputed.
*   The previous two steps are **iterated** until the set of data points in each **cluster reaches convergence**, that is, until the assignments stop changing.

**KMeans** from **sklearn.cluster** can be used for **K-means clustering**.
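
The steps above can be sketched in plain NumPy. This is a teaching sketch under simple assumptions (random data points as initial centroids, Euclidean distance, a hypothetical helper named **kmeans_sketch**); it is not how **KMeans** is implemented in scikit-learn.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Hypothetical helper: basic K-means loop on a float array X."""
    rng = np.random.default_rng(seed)
    # Choose k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each centroid as the mean of its assigned points
        # (a centroid with no points is left where it is).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```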

## Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering is a **bottom-up** approach.

Steps involved are:

*   **Each data point** is treated as a **single cluster** at the **beginning**.
*   The distance between each pair of clusters is **computed**, and the **two** nearest clusters are **merged** together.
*   The above step is **iterated** till a **single cluster** is formed.
*   **AgglomerativeClustering** from **sklearn.cluster** can be used for achieving this.
*   The **merging of two clusters** can use any of the following linkage types: **ward**, **complete**, or **average**.
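
A short sketch of the three linkage types on a tiny assumed data set. Note that **n_clusters=2** stops the merging once two clusters remain, rather than continuing to a single cluster.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two small, clearly separated groups of points (illustrative only).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

# Cluster the same data with each linkage type.
linkage_labels = {}
for linkage in ["ward", "complete", "average"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    linkage_labels[linkage] = model.fit_predict(X)
    print(linkage, linkage_labels[linkage])
```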



## Density Based Clustering

**DBSCAN** from **sklearn.cluster** is used for this purpose.

The **DBSCAN** algorithm requires two parameters: **epsilon** (**eps** in scikit-learn) and **minimum points** (**min_samples**).

It classifies data points into three types:


*   Core points
*   Border points
*   Outlier points

Steps involved are:


*   Pick a **random point** that is **not assigned** to a **cluster or labelled as an outlier**. Determine whether it is a **core point**; if not, label the point as an **outlier**.
*   Once a core point is identified, add **all directly reachable points to its cluster**. Then do **neighbor jumps** to each reachable point and add them to the **cluster**. If an **outlier** has been added, label it as a **border point**.
*   The above steps are **iterated** till **all points** are assigned to a **cluster or labelled as outliers**.
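
The three point types can be recovered from a fitted **DBSCAN** model. The sketch below uses a hand-made data set and assumed values **eps=0.35** and **min_samples=4**; in scikit-learn, outliers get the label **-1** and core points are listed in **core_sample_indices_**.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.1, -0.1],  # dense group
              [0.5, 0.0],                                        # edge of the group
              [5.0, 5.0]])                                       # isolated point

db = DBSCAN(eps=0.35, min_samples=4).fit(X)

# Core points: at least min_samples points (including itself) within eps.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
# Outliers: points that were not assigned to any cluster.
outlier_mask = db.labels_ == -1
# Border points: in a cluster, but not core points themselves.
border_mask = ~core_mask & ~outlier_mask

print("labels :", db.labels_)
print("core   :", np.where(core_mask)[0])
print("border :", np.where(border_mask)[0])
print("outlier:", np.where(outlier_mask)[0])
```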


## Mean Shift Clustering

**Mean Shift Clustering** aims at discovering **dense** areas.

Steps Involved:


*   Identify **blob areas** with **randomly guessed centroids**.
*   Calculate the **centroid** of each **blob** area and **shift** to a new one, if there is a **difference**.
*   **Repeat** the above step till the **centroids** converge.

**make_blobs** from **sklearn.datasets** can be used to generate sample **blob** data. **MeanShift** from **sklearn.cluster** can be used to perform **Mean Shift clustering**.
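
A minimal sketch of Mean Shift on generated blob data. **make_blobs** is imported from **sklearn.datasets**, where it is defined; the blob centers, spread, and the **bandwidth=2.0** value are assumptions chosen so that the dense areas are easy to find.

```python
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Three compact, well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=30)

# bandwidth sets the radius of the window used to compute each mean.
ms = MeanShift(bandwidth=2.0).fit(X)

print("number of clusters found:", len(ms.cluster_centers_))
print(ms.cluster_centers_)
```

Unlike K-means, the number of clusters is not passed in; it follows from the data and the bandwidth.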



## Affinity Propagation

**Affinity Propagation** generates clusters by **passing messages** between **pairs of data points**, until convergence.

*   **AffinityPropagation** class from **sklearn.cluster** can be used.
*   The above class can be controlled with **two** major parameters:
    *   **preference**: It controls the **number of exemplars** to be chosen by the algorithm.
    *   **damping**: It controls **numerical oscillations** while updating messages.
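
A small sketch of the two parameters, assuming generated blob data and illustrative values for **preference** and **damping**. Lower preference values generally lead to fewer exemplars; the exact counts depend on the data and on whether the algorithm converges.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Three compact blobs (illustrative only).
X, _ = make_blobs(n_samples=60, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=0.5, random_state=30)

# Fit with two different preference values and count the exemplars chosen.
exemplar_counts = {}
for preference in [-50, -5]:
    ap = AffinityPropagation(preference=preference, damping=0.9,
                             random_state=0).fit(X)
    exemplar_counts[preference] = len(ap.cluster_centers_indices_)
    print("preference:", preference, "exemplars:", exemplar_counts[preference])
```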



## Spectral Clustering

**Spectral Clustering** is ideal to cluster data that is connected, and may not be in a compact space.

In general, the following steps are followed:


*   Build an **affinity matrix** of data points.
*   **Embed data points** in a **lower dimensional space**.
*   Use a clustering method like **k-means** to partition the points on **lower dimensional space**.

The **SpectralClustering** class, or the **spectral_clustering** function for a precomputed affinity matrix, from **sklearn.cluster** can be used for achieving this.
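
As an illustration of connected but non-compact data, the sketch below clusters two interleaving half-moons with the **SpectralClustering** class. The data set and the parameter values (**nearest_neighbors** affinity, **n_neighbors=10**) are assumptions for this example.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half-moons: connected shapes that are not compact blobs.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=30)

# Affinity matrix from a nearest-neighbor graph, then k-means in the
# lower dimensional embedding.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans",
                        random_state=30)
spectral_labels = sc.fit_predict(X)
print(spectral_labels[:20])
```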



## Demo of KMeans

In [127]:
from sklearn.cluster import KMeans
kmeans_cluster = KMeans(n_clusters=2)
kmeans_cluster = kmeans_cluster.fit(X_train) 
kmeans_cluster.predict(X_test)

array([0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1], dtype=int32)

## Evaluating a Clustering algorithm

A clustering algorithm is commonly evaluated using the following scores:


*   **Homogeneity**: Evaluates if each cluster contains only members of a single class.
*   **Completeness**: Evaluates if all members of a given class are assigned to the same cluster.
*   **V-measure**: Harmonic mean of **Homogeneity** and **Completeness**.
*   **Adjusted Rand index**: Measures the **similarity** of two assignments, adjusted for chance.
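
A tiny worked example can make the four scores concrete. The scikit-learn metric functions take the ground-truth labels first and the predicted labels second; homogeneity and completeness trade places when the two arguments are swapped, while V-measure and the adjusted Rand index are symmetric. The label lists below are made up for illustration.

```python
from sklearn import metrics

labels_true = [0, 0, 1, 1]
labels_pred = [0, 1, 2, 3]   # every point placed in its own cluster

# Each cluster holds a single class, so homogeneity is perfect,
# but each class is spread over two clusters, so completeness is not.
h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)
v = metrics.v_measure_score(labels_true, labels_pred)
ari = metrics.adjusted_rand_score(labels_true, labels_pred)
print(h, c, v, ari)

# Swapping the arguments turns homogeneity into completeness.
h_swapped = metrics.homogeneity_score(labels_pred, labels_true)
print(h_swapped)
```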

## Evaluation with scikit-learn

In [128]:
from sklearn import metrics
# scikit-learn metrics expect (labels_true, labels_pred)
print(metrics.homogeneity_score(Y_test, kmeans_cluster.predict(X_test)))
print(metrics.completeness_score(Y_test, kmeans_cluster.predict(X_test)))
print(metrics.v_measure_score(Y_test, kmeans_cluster.predict(X_test)))
print(metrics.adjusted_rand_score(Y_test, kmeans_cluster.predict(X_test)))

0.5112944931413127
0.5252075430986379
0.5181576401272907
0.6417696880050958


## HandsOn

Import the **three** modules **sklearn.datasets, sklearn.model_selection, and sklearn.cluster**.

Load the popular **iris** data set from the **sklearn.datasets** module and assign it to the variable **iris**.

In [0]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
iris=datasets.load_iris()
x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,random_state=30)

Cluster **x_train** set into **3** clusters using **K-means** with **default** parameters. Name the model as **km_cls**.

Predict the cluster of samples in **x_test** and determine the **homogeneity score** of the model.

In [130]:
from sklearn import metrics
km_cls=KMeans(n_clusters=3)
km_cls=km_cls.fit(x_train)
y_pred=km_cls.predict(x_test)
print("Homogeneity Score:%s"%(metrics.homogeneity_score(y_pred,y_test)))

Homogeneity Score:0.7899456437990248


Cluster the **x_train** set using **Mean shift** with **default** parameters. Name the model as **ms_cls**.

Predict the cluster of samples in **x_test** and determine the **homogeneity score** of the model.

In [131]:
from sklearn import metrics
from sklearn.cluster import MeanShift
ms_cls=MeanShift()
ms_cls=ms_cls.fit(x_train)
y_pred=ms_cls.predict(x_test)
print("Homogeneity Score:%s"%(metrics.homogeneity_score(y_pred,y_test)))

Homogeneity Score:1.0000000000000004


# Other HandsOn

In [132]:
import sklearn.preprocessing as preprocessing

x = [[7.8], [1.3], [4.5], [0.9]]
print(preprocessing.Binarizer().fit(x).transform(x).shape)

(4, 1)


In [133]:
import sklearn.preprocessing as preprocessing

x = [[7.8], [1.3], [4.5], [0.9]]
print(preprocessing.Binarizer().fit(x).transform(x))

[[1.]
 [1.]
 [1.]
 [1.]]


In [134]:
import sklearn.preprocessing as preprocessing

x = [[0, 0], [0, 1], [2,0]]
enc = preprocessing.OneHotEncoder()
print(enc.fit(x).transform([[1, 1]]).toarray())

[[0. 0. 0. 1.]]
