# Python for Machine Learning

The basic idea behind any machine learning model is to supply it with a large number of inputs together with the corresponding outputs. As it analyses more and more data, the model tries to figure out the relationship between input and output.

Consider a very simple example: you have to decide whether to wear a jacket depending on the outside temperature. You have the data below, which we call training data.


| Outside Temperature | Wear a Jacket |
|---------------------|---------------|
| 90°F                | No            |
| 80°F                | No            |
| 70°F                | No            |
| 60°F                | Yes           |
| 20°F                | Yes           |

From this data, we work out a connection between the input (temperature) and the output (the decision to wear a jacket).
So, if the temperature is 65°F, you would still wear a jacket although you were never told the outcome for that particular temperature.
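
One way to make that reasoning concrete is a threshold rule learned from the table. The sketch below is plain Python and only an illustration: the midpoint rule and the names `learn_threshold` and `wear_jacket` are assumptions made here, not something the workshop or scikit-learn prescribes.

```python
# Training data from the table above: (temperature in °F, wear a jacket?)
training_data = [(90, "No"), (80, "No"), (70, "No"), (60, "Yes"), (20, "Yes")]

def learn_threshold(data):
    """Place the cut-off halfway between the warmest 'Yes' and the coolest 'No'."""
    warmest_yes = max(t for t, label in data if label == "Yes")
    coolest_no = min(t for t, label in data if label == "No")
    return (warmest_yes + coolest_no) / 2

def wear_jacket(temperature, threshold):
    """Predict 'Yes' at or below the learned threshold, 'No' above it."""
    return "Yes" if temperature <= threshold else "No"

threshold = learn_threshold(training_data)
print(threshold)                   # 65.0
print(wear_jacket(65, threshold))  # Yes
print(wear_jacket(75, threshold))  # No
```

The rule was never shown 65°F or 75°F, yet it still produces a decision for both, which is the point of learning from training data.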

Now, let's move on to a slightly better problem which the computer will solve for us.
Before we begin, we need to import the scikit-learn package. It provides easy-to-use functions and a large collection of machine learning models, and we will use it throughout today's workshop.

```python
# install the scikit-learn package (run this in a terminal, not inside Python)
conda install -c conda-forge scikit-learn
```

In [None]:
import sklearn
print("sklearn version is:" + sklearn.__version__)

Sample Training Set
Here, X is the input and y is the output.

| x1 | x2 | y  |
|----|----|----|
| 1  | 2  | 5  |
| 4  | 5  | 14 |
| 11 | 12 | 35 |
| 21 | 22 | 65 |
| 5  | 5  | 15 |

Given the training set, you could easily guess that the output (y) is `x1 + 2*x2`.
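
The guess can also be checked numerically. The sketch below fits `y = w1*x1 + w2*x2` to the five rows by least squares, solving the 2x2 normal equations with Cramer's rule in plain Python. It assumes a model without an intercept, which is a simplification chosen here for brevity and not what `LinearRegression` does by default.

```python
# Rows of the sample training set: (x1, x2, y)
rows = [(1, 2, 5), (4, 5, 14), (11, 12, 35), (21, 22, 65), (5, 5, 15)]

# Sums that appear in the normal equations for y = w1*x1 + w2*x2
s11 = sum(x1 * x1 for x1, x2, y in rows)
s12 = sum(x1 * x2 for x1, x2, y in rows)
s22 = sum(x2 * x2 for x1, x2, y in rows)
s1y = sum(x1 * y for x1, x2, y in rows)
s2y = sum(x2 * y for x1, x2, y in rows)

# Solve the 2x2 linear system with Cramer's rule
det = s11 * s22 - s12 * s12
w1 = (s1y * s22 - s12 * s2y) / det
w2 = (s11 * s2y - s12 * s1y) / det
print(w1, w2)  # 1.0 2.0
```

The recovered weights match the coefficients 1 and 2 in the guessed formula.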

## How to Generate a Data Set

In [None]:
# import randint function from random package
from random import randint

# Create two empty lists to store the training data
TrainInput = list()
TrainOutput = list()

# Generate 100 random pairs of x1 and x2
for i in range(100):
    x1 = randint(0, 1000) # generate random integers between 0 and 1000. 
    x2 = randint(0, 1000)
    y = x1 + (2*x2)
    # append the [x1, x2] pair to the input list and y to the output list
    TrainInput.append([x1, x2]) 
    TrainOutput.append(y)

In [None]:
TrainInput[0:6]

In [None]:
TrainOutput[0:6]

## The Machine Learning Model: Linear Regression
Working with a linear regression model is simple: create a model, train it, and then use it.

### Train the Model
We have generated the training data already, so create a linear regression model and pass it the training data. 


In [None]:
from sklearn.linear_model import LinearRegression
predictor=LinearRegression()
predictor.fit(X=TrainInput, y=TrainOutput)
coefficient=predictor.coef_
print('Coefficient : {}.'.format(coefficient))

### Test Data

X = [[10, 20]]

The outcome should be `10 + 2*20 = 50`. Let us see what we get.

In [None]:
Xtest = [[10, 20]]
outcome = predictor.predict(X=Xtest)
print('Outcome: {}'.format(outcome))

## Another Linear Regression Example

The Advertising data records sales (in thousands of units) of a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media. Suppose that, in our role as statistical consultants, we are asked to address the following:
1. Find a function that, given input budgets for TV, radio, and newspaper, predicts the output sales.
2. Which media contribute to sales?
3. Visualize the relationship between the features and response.

Reference: https://medium.com/simple-ai/linear-regression-intro-to-machine-learning-6-6e320dbdaf06

In [8]:
#import the packages
import pandas as pd
# import the Advertising.csv data
data =  pd.read_csv(r"I:\Classes\OIT_Training\Python for Machine Learning\Advertising.csv")

In [7]:
# view the first 5 rows of the data, then the summary statistics
data.head()
data.describe()

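|       | TV        | radio     | newspaper | sales    |
|-------|-----------|-----------|-----------|----------|
| count | 200.0     | 200.0     | 200.0     | 200.0    |
| mean  | 147.0425  | 23.264    | 30.554    | 14.0225  |
| std   | 85.854236 | 14.846809 | 21.778621 | 5.217457 |
| min   | 0.7       | 0.0       | 0.3       | 1.6      |
| 25%   | 74.375    | 9.975     | 12.75     | 10.375   |
| 50%   | 149.75    | 22.9      | 25.75     | 12.9     |
| 75%   | 218.825   | 36.525    | 45.1      | 17.4     |
| max   | 296.4     | 49.6      | 114.0     | 27.0     |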


In [None]:
# create a Python list of feature names
feature_names = ['TV', 'radio', 'newspaper']
feature_names

In [None]:
# use the list to select a subset of the original DataFrame
X = data[feature_names]

In [None]:
# select the response column (sales) from the original DataFrame
y = data['sales']

In [None]:
# visualize relationship between the variables
from yellowbrick.features import Rank2D
visualizer = Rank2D(algorithm="pearson")
visualizer.fit_transform(data)
visualizer.poof()

In [None]:
# import train_test_split function
from sklearn.model_selection import train_test_split
# Split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) # random_state=int, random_state is the seed used by the random number generator
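
To see what a split does, here is a minimal hand-rolled sketch in plain Python. It is not scikit-learn's implementation and will not reproduce the same shuffling order as `train_test_split`; the function name `simple_train_test_split` is hypothetical. It does use the same default hold-out fraction of 25% that `train_test_split` uses when no size is given.

```python
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=1):
    """Shuffle row indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(200))  # stand-in for the 200 rows of the advertising data
train, test = simple_train_test_split(rows)
print(len(train), len(test))  # 150 50
```

Fixing the seed plays the same role as `random_state`: the same rows end up in the training and test sets every time the code is run.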

In [None]:
#import model
from sklearn.linear_model import LinearRegression
# Linear Regression Model
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# Check the coefficient
linreg.coef_

In [None]:
# compute the R-squared value on the test set
from sklearn.metrics import r2_score
print('R2 Score:', r2_score(y_test,y_pred))
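
For reference, the R-squared value is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few made-up numbers; the values and the function name `r2` are illustrative assumptions, not output from the advertising model.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical observed and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
print(round(r2(y_true, y_pred), 4))  # 0.9625
```

A value of 1 means the predictions match the observations exactly, and a value near 0 means the model does no better than always predicting the mean.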

In [None]:
from yellowbrick.regressor import ResidualsPlot
# Visualize the residuals for the training and test data, with a residual histogram
visualizer = ResidualsPlot(linreg)
visualizer.fit(X_train, y_train)  # Fit the training data to the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.poof()                 # Draw/show/poof the data

Exercise 1:
1. Import the packages using the python code below:
```Python
#import the packages
import pandas as pd
#import model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from yellowbrick.features import Rank2D
from yellowbrick.regressor import ResidualsPlot
```
2. Import the data named "bikeshare.csv" as a DataFrame.
```python
data =  pd.read_csv(r"C:\Users\XZHU8\Documents\OIT related\Workshop\Python for Machine Learning\bikeshare.csv")
```
3. Create a Python list of feature names:
    1. X is "season", "month", "hour", "holiday", "weekday", "workingday",
    "weather", "temp", "feelslike", "humidity", "windspeed".
4. Use the list to select a subset, X, of the original DataFrame 

5. Use the Python code below to show the Pearson correlation matrix:
```python
visualizer = Rank2D(algorithm="pearson")
visualizer.fit_transform(X)
visualizer.poof()
```
6. Use the list to select a subset, y, of the original DataFrame 
    1. Y is "riders"
7. Split X and y into training and testing sets  
8. Create a linear regression model, fit it to the training data, and test it using the testing data.
9. Check the coefficients.
10. Visualize the residuals of the model on the training and testing data.

## A Simple Classification Problem

The fruits dataset was created by Dr. Iain Murray from the University of Edinburgh. He bought a few dozen oranges, lemons, and apples of different varieties and recorded their measurements in a table. Professors at the University of Michigan then formatted the fruits data slightly. Let us import the data and look at the first several rows.

In [9]:
import pandas as pd
fruits = pd.read_table(r"I:\Classes\OIT_Training\Python for Machine Learning\fruit_data_with_colors.txt")
fruits.head()

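|   | fruit_label | fruit_name | fruit_subtype | mass | width | height | color_score |
|---|-------------|------------|---------------|------|-------|--------|-------------|
| 0 | 1           | apple      | granny_smith  | 192  | 8.4   | 7.3    | 0.55        |
| 1 | 1           | apple      | granny_smith  | 180  | 8.0   | 6.8    | 0.59        |
| 2 | 1           | apple      | granny_smith  | 176  | 7.4   | 7.2    | 0.6         |
| 3 | 2           | mandarin   | mandarin      | 86   | 6.2   | 4.7    | 0.8         |
| 4 | 2           | mandarin   | mandarin      | 84   | 6.0   | 4.6    | 0.79        |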


Each row of the dataset represents one piece of fruit, described by the features in the table's columns.

We have 59 pieces of fruit and 7 features in the dataset:

In [None]:
print(fruits.shape)

We have four types of fruit in the dataset: apple, mandarin, orange, and lemon.

In [None]:
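# count how many pieces of each fruit type we have
print(fruits.groupby('fruit_name').size())
import seaborn as sns
import matplotlib.pyplot as plt
# pass the column by keyword; recent seaborn versions do not accept it positionally
sns.countplot(x='fruit_name', data=fruits, color='b')
plt.show()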

In [None]:
# Scatter plot
import seaborn as sns
sns.scatterplot(x="width", y="height",hue="fruit_name", style="fruit_name", data=fruits)
plt.show()

In [None]:
# Descriptive Statistics
fruits.describe()

In [None]:
# histogram for each numeric input variable
fruits.drop('fruit_label' ,axis=1).hist(bins=30, figsize=(9,9))
plt.show()

Some numerical features are not on the same scale, so we need to scale them. We split the dataset into training and test sets first, then fit the scaler on the training data only and apply the same scaling to the test set. In practice the test data stands in for data you have not seen yet, so it must not influence how the scaler or the model is fitted; it is used only for evaluation.

Here we use MinMaxScaler, which rescales the data set such that all feature values are in the range [0, 1]. 

The transformation formula is given by:

$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$
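
The formula can be sketched in a few lines of plain Python. The minimum and maximum are learned from the training values only and then reused for the test values. The function names and the two test values are assumptions made for illustration; the training values are the `mass` entries from the rows shown earlier.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values only."""
    return min(values), max(values)

def min_max_transform(values, lo, hi):
    """Apply z = (x - min) / (max - min) with the learned min and max."""
    return [(x - lo) / (hi - lo) for x in values]

# 'mass' values from the first rows of the fruits table, used here as training data
train_mass = [192, 180, 176, 86, 84]
# hypothetical test values, not taken from the dataset
test_mass = [100, 200]

lo, hi = fit_min_max(train_mass)
train_scaled = min_max_transform(train_mass, lo, hi)
test_scaled = min_max_transform(test_mass, lo, hi)

print(train_scaled[0], train_scaled[-1])  # 1.0 0.0
print(test_scaled)
```

Because the test values are scaled with the training minimum and maximum, a test value larger than anything seen in training ends up slightly above 1. That is expected and is not an error.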




In [None]:
#Create a feature list and y
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']
#split the dataset into training and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
#Use minmax
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Fit a logistic regression model
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.
Reference: https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html
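
As a rough illustration of the sigmoid step, the sketch below covers the two-class case only. Here `z` stands for the linear score the model computes from the features; that name and the 0.5 threshold are assumptions made for the example. How scikit-learn handles more than two classes is not shown.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Map the probability to a class label (1 if at or above the threshold)."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4, 0, 4):
    print(z, round(sigmoid(z), 3), predict_class(z))
```

Large negative scores give probabilities near 0, large positive scores give probabilities near 1, and a score of 0 sits exactly at 0.5.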

In [None]:
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ConfusionMatrix
logreg = LogisticRegression()
# The ConfusionMatrix visualizer takes a model
cm = ConfusionMatrix(logreg, classes=[1,2,3,4])
#cm = ConfusionMatrix(logreg, classes=[1,2,3,4], label_encoder={1: 'apple', 2: 'mandarin', 3: 'orange', 4:'lemon'} )
# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

## Decision Tree 
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
Reference: https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb

In [None]:
from sklearn.tree import DecisionTreeClassifier
dct = DecisionTreeClassifier()
cm = ConfusionMatrix(dct, classes=[1,2,3,4])
#cm = ConfusionMatrix(logreg, classes=[1,2,3,4], label_encoder={1: 'apple', 2: 'mandarin', 3: 'orange', 4:'lemon'} )
# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(dct.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(dct.score(X_test, y_test)))

## K-Nearest Neighbors
The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.
Reference: https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
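
The idea can be sketched from scratch: measure the distance from a new point to every training point, keep the `k` closest, and let them vote. The code below is a plain-Python illustration, not how `KNeighborsClassifier` is implemented. It uses the unscaled width and height of the five fruits shown earlier, and the two query points are made up.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# (width, height) of the five fruits shown in the table above
points = [(8.4, 7.3), (8.0, 6.8), (7.4, 7.2), (6.2, 4.7), (6.0, 4.6)]
labels = ["apple", "apple", "apple", "mandarin", "mandarin"]

# Hypothetical new fruits, not taken from the dataset
print(knn_predict(points, labels, (6.1, 4.8)))  # mandarin
print(knn_predict(points, labels, (8.1, 7.0)))  # apple
```

Because the method relies on distances, features on very different scales would dominate the vote, which is one reason the data is scaled before fitting.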

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
cm = ConfusionMatrix(knn, classes=[1,2,3,4])
# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))


## Linear Discriminant Analysis

Linear Discriminant Analysis is a dimensionality reduction technique used as a preprocessing step in machine learning and pattern classification applications.
The main goal of dimensionality reduction techniques is to reduce the number of dimensions by removing redundant and dependent features, transforming the features from a higher-dimensional space to a space with fewer dimensions. scikit-learn's `LinearDiscriminantAnalysis` can also be used directly as a classifier, which is how it is used here.
Reference: https://medium.com/@srishtisawla/linear-discriminant-analysis-d38decf48105

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()

cm = ConfusionMatrix(lda, classes=[1,2,3,4])
#cm = ConfusionMatrix(logreg, classes=[1,2,3,4], label_encoder={1: 'apple', 2: 'mandarin', 3: 'orange', 4:'lemon'} )
# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()

print('Accuracy of LDA classifier on training set: {:.2f}'
     .format(lda.score(X_train, y_train)))
print('Accuracy of LDA classifier on test set: {:.2f}'
     .format(lda.score(X_test, y_test)))

## Gaussian Naive Bayes
It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

### Pros:

* It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
* When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
* It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

### Cons:

* If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of 0 (zero) and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique; one of the simplest is called Laplace estimation.
* Naive Bayes is also known to be a bad estimator, so the probability outputs from `predict_proba` should not be taken too seriously.
* Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
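
The zero-frequency problem and the Laplace fix can be sketched as follows. This is plain Python with made-up observations. The function name `smoothed_probabilities` and the parameter `alpha` are assumptions for illustration. `GaussianNB`, used in this workshop, works on numerical features and does not use this kind of count smoothing.

```python
from collections import Counter

def smoothed_probabilities(observations, categories, alpha=1):
    """Estimate P(category) with add-alpha (Laplace) smoothing."""
    counts = Counter(observations)
    total = len(observations) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Hypothetical training observations of a categorical feature;
# "yellow" never appears in training but may appear in test data.
seen = ["red", "red", "red", "green"]
categories = ["red", "green", "yellow"]

probs = smoothed_probabilities(seen, categories)
print(probs["yellow"])  # small but non-zero (1/7)
```

Adding one to every count means an unseen category receives a small probability instead of zero, so a single unseen value no longer wipes out the whole prediction.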

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
cm = ConfusionMatrix(gnb, classes=[1,2,3,4])
#cm = ConfusionMatrix(logreg, classes=[1,2,3,4], label_encoder={1: 'apple', 2: 'mandarin', 3: 'orange', 4:'lemon'} )
# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))

## Random Forest

Keywords: bootstrap aggregation or bagging
          ensemble model, voting

### RF Algorithm

Fits a bunch of trees on "random samples
from our sample" (called bootstrap samples)
and they all vote on the best class. The votes
are aggregated to choose the winning class.

The concept of combining predictions 
like this is called "bagging" or 
Bootstrap AGgregation.

Details:

1. Randomly select (with replacement) both
   N observations and a subset of predictors
   to create 100-500 subsets of data.

2. Fit a "bushy" tree to each sample
   e.g. no pruning so each WILL overfit!

3. At each split, variables are randomly sampled

4. Have each model make a prediction, then 
   count (when classifying) or average 
   (when regressing) their predictions
   (weights may be used based on model accuracy)
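
The two building blocks of bagging, bootstrap sampling and voting, can be sketched in plain Python. The trees themselves are left out: the votes below are made up, and the helper names are hypothetical rather than part of `RandomForestClassifier`.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
rows = list(range(10))          # stand-in for 10 training observations
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for s in samples:
    print(sorted(s))            # rows may repeat or be left out

# Hypothetical votes from three trees for one test observation
print(majority_vote(["apple", "orange", "apple"]))  # apple
```

Each tree sees a slightly different version of the data, so the individual trees overfit in different ways and the vote averages much of that out.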
   
  

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
cm = ConfusionMatrix(rfc, classes=[1,2,3,4])
#cm = ConfusionMatrix(logreg, classes=[1,2,3,4], label_encoder={1: 'apple', 2: 'mandarin', 3: 'orange', 4:'lemon'} )
# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()
print('Accuracy of RFC classifier on training set: {:.2f}'
     .format(rfc.score(X_train, y_train)))
print('Accuracy of RFC classifier on test set: {:.2f}'
     .format(rfc.score(X_test, y_test)))

## Neural Networks (NN)

### Neural Network Algorithm

* Simulates human brain by weighting
  "neurons" stored in "hidden layers"
  
* Each training observation adjusts
  the impact of each neuron, often
  through "back-propagation"

* Deep Learning uses many layers
  and many neurons per layer
  

### Neural Network Advantages
  
* Excellent results for extreme 
  complexity e.g. voice, 
  image recognition
  
* Though the model consists of a set
  of equations that is fairly small
  compared to many other methods
  
* Model is quick to apply to new data

    
### Neural Network Disadvantages

* Computationally intensive
  to train the model
    
* Performance on numerical data
  rarely much better than faster
  methods e.g. rf, gbm
    
* Models are impossible to interpret

* Extremely sensitive to multicollinearity
  (use PCA first or method = "pcaNNet")


### Neural Network Tuning Parameters

* Depends on specific type but
  in general:
  
* Number of hidden layers

* Number of neurons per layer

* Type of feedback mechanism
  e.g. back-propagation

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(100,100,100), max_iter=500, alpha=0.0001)

cm = ConfusionMatrix(mlp, classes=[1,2,3,4])

# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()
print('Accuracy of MLP classifier on training set: {:.2f}'
     .format(mlp.score(X_train, y_train)))
print('Accuracy of MLP classifier on test set: {:.2f}'
     .format(mlp.score(X_test, y_test)))

## Model Evaluation
Common metrics for evaluating classifiers:

Precision is the number of correct positive results divided by the number of all positive results (e.g. How many of the mushrooms we predicted would be edible actually were?).

Recall is the number of correct positive results divided by the number of positive results that should have been returned (e.g. How many of the mushrooms that were poisonous did we accurately predict were poisonous?).

The F1 score is a measure of a test’s accuracy. It considers both the precision and the recall of the test to compute the score. The F1 score is the harmonic mean of the precision and recall; it reaches its best value at 1 and its worst at 0.
``` python
precision = true positives / (true positives + false positives)

recall = true positives / (false negatives + true positives)

F1 score = 2 * ((precision * recall) / (precision + recall))
```
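
The three formulas can be applied by hand to a small set of labels. The sketch below is plain Python with made-up labels; the function name is an assumption, and for real work the scikit-learn metrics are the safer choice.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 3/4.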

In [None]:
from yellowbrick.classifier import ClassificationReport

# Instantiate the classification model and visualizer
visualizer = ClassificationReport(knn, classes=[1,2,3,4], support=True)

visualizer.fit(X_train, y_train)  # Fit the visualizer and the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data
fruits.head()

## Predicted vs Actual in test data

In [None]:
y_new=knn.predict(X_test)
test = list(zip(y_new, y_test))
print(test)

## Exercise 2

The dataset we will be working with in this tutorial is the Breast Cancer Wisconsin Diagnostic Database. The dataset includes various information about breast cancer tumors, as well as classification labels of malignant or benign. The dataset has 569 instances, or data, on 569 tumors and includes information on 30 attributes, or features, such as the radius of the tumor, texture, smoothness, and area.

Using this dataset, we will build a machine learning model to use tumor information to predict whether or not a tumor is malignant or benign.

The code below imports and loads the dataset, splits the data into training and test sets, and applies a Gaussian Naive Bayes model to the data.
You need to fit the Logistic Regression and Random Forest models and find the best one.

```python
from sklearn.datasets import load_breast_cancer
# Load dataset
data = load_breast_cancer()
# Organize our data
y_names = data['target_names']
y = data['target']
feature_names = data['feature_names']
X = data['data']
# Split our data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=42)
# fit the Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ConfusionMatrix
gnb = GaussianNB()
cm = ConfusionMatrix(gnb, classes=y_names, label_encoder={0:'malignant', 1: 'benign'}  )
# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)
# How did we do?
cm.poof()
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))
     

```


In [None]:
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
# Organize our data
y_names = data['target_names']
y = data['target']
feature_names = data['feature_names']
X = data['data']


# Split our data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=42)

# fit the Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ConfusionMatrix
gnb = GaussianNB()
cm = ConfusionMatrix(gnb, classes=y_names, label_encoder={0:'malignant', 1: 'benign'}  )
# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))

# logistic regression





In [None]:
# Random Forest model
import os
os.getcwd()

## Unsupervised Learning

So far, we have only explored supervised Machine Learning algorithms and techniques to develop models where the data had labels previously known. In other words, our data had some target variables with specific values that we used to train our models.
However, when dealing with real-world problems, data often does not come with predefined labels. In that case we want machine learning models that find commonalities in the features by themselves and use them to group the data and to assign classes to new data.

## Unsupervised Learning Analysis Process

The main applications of unsupervised learning are:
* Segmenting datasets by some shared attributes.
* Detecting anomalies that do not fit into any group.
* Simplifying datasets by aggregating variables with similar attributes.

In summary, the main goal is to study the intrinsic (and commonly hidden) structure of the data.
These techniques can be condensed into two main types of problems that unsupervised learning tries to solve:

* Clustering
* Dimensionality Reduction

In this workshop, we will cover clustering problems.

## Clustering Analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
![image.png](Unsupervise3.png)

Clustering, however, has many different names (with respect to the fields it is being applied):

* Cluster analysis
* Automatic classification
* Data segmentation

All of the above names essentially mean clustering.

Cluster analysis has an incredibly wide range of applications and is quite useful for solving real-world problems such as anomaly detection, recommender systems, document grouping, or finding customers with common interests based on their purchases.
Some of the most common clustering algorithms, and the ones that will be explored in the workshop, are:
* K-Means
* Hierarchical Clustering
* Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
* Gaussian Mixture Model

## Choosing a Problem

We will take market segmentation as an example. A market is segmented according to certain features, and we will try to analyse the types of customers in the market based on those features. The data set consists of 30 samples, and the features are satisfaction and loyalty.

## K-means Cluster

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster (Definition from Wiki).

### How the K-means algorithm works
To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.
It halts creating and optimizing clusters when either:
* The centroids have stabilized: there is no change in their values because the clustering has been successful.
* The defined number of iterations has been reached.
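
The loop described above can be sketched from scratch in plain Python. This is an illustration, not scikit-learn's `KMeans`: the starting centroids are passed in explicitly instead of being chosen at random, so that the result is repeatable, and the points are made up.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        new_centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:   # centroids have stabilized
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two well-separated groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Both stopping rules from the text appear in the code: the `break` fires when the centroids no longer change, and `max_iter` caps the number of iterations.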

reference: https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1

In [10]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
# import data
data=pd.read_csv(r"I:\Classes\OIT_Training\Python for Machine Learning\kmeans clustering.csv")
data.head()


|   | Satisfaction | Loyalty |
|---|--------------|---------|
| 0 | 4            | -1.33   |
| 1 | 6            | -0.28   |
| 2 | 5            | -0.99   |
| 3 | 7            | -0.29   |
| 4 | 4            | 1.06    |


In [None]:
# scatter plot
sns.scatterplot(x="Satisfaction", y="Loyalty",  data=data)
plt.show()


In [None]:
# copy the data and store it in a variable x
x = data.copy()

# create a KMeans object, passing 2 as the number of clusters
kmeans = KMeans(2)
kmeans.fit(x)

# Clustering result
clusters = x.copy()
clusters['cluster_pred'] = kmeans.fit_predict(x)

Plot

In [None]:
plt.scatter(clusters['Satisfaction'], clusters['Loyalty'],c=clusters['cluster_pred'], cmap = 'rainbow')
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
plt.show()

## The Problem

The biggest problem here is that the clustering is driven by Satisfaction, while Loyalty has been neglected. We can see in the figure that all the elements to the right of 6 form one cluster and those on the left form another. This is a biased result because the algorithm has effectively discarded the Loyalty feature and clustered only on the basis of Satisfaction. This does not give a result we can analyze meaningfully.
Satisfaction dominated because it has larger values, and K-means relies on distances.
So the problem is that the two features are not on the same scale. First we have to standardize the data so that both features have equal weight in our clustering.
We can’t neglect Loyalty, as it plays an important role in the analysis of market segmentation.

We will not go into depth on this, as sklearn helps us scale the data.
The data is scaled to zero mean and unit variance. Now both features are on the same scale, so both have an equal chance of influencing the clustering.

In [None]:
from sklearn import preprocessing
x_scaled = preprocessing.scale(x)
x_scaled
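
What `preprocessing.scale` does to each column can be sketched by hand: subtract the column mean and divide by the column's population standard deviation. The plain-Python version below uses only the five rows shown earlier, so its numbers will differ from the result on the full 30-row data set; the function name `standardize` is an assumption.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Satisfaction and Loyalty values from the first five rows shown above
satisfaction = [4, 6, 5, 7, 4]
loyalty = [-1.33, -0.28, -0.99, -0.29, 1.06]

sat_scaled = standardize(satisfaction)
loy_scaled = standardize(loyalty)
print([round(v, 2) for v in sat_scaled])
print([round(v, 2) for v in loy_scaled])
```

After scaling, both columns are centred on zero with the same spread, so neither one dominates the distance calculations in K-means.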

## The Elbow Method
Have you wondered why we initialized kmeans with only 2 clusters?
We could have initialized it with any value and obtained any number of clusters, but the analysis becomes difficult when there are a large number of clusters. So how do we know how many clusters to start with? Note that there is no single exact number, as it changes with the problem at hand.
The elbow method comes in handy when we are unsure how many clusters we need. It fits K-means for every number of clusters from one up to roughly the number of samples in our data, and we use the K-means inertia value to determine an appropriate number of clusters.
Remember our goal: to minimize the within-cluster sum of squares (WCSS) and maximize the distance between clusters.
With the short loop below we get the inertia value, or within-cluster sum of squares, for each number of clusters.

In [None]:
wcss = []
for i in range (1, 30):
    kmeans = KMeans(i)
    kmeans.fit(x_scaled)
    wcss.append(kmeans.inertia_)
# visualize the elbow method
plt.plot(range(1,30), wcss)
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
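
The inertia value collected in the loop is the within-cluster sum of squares: for every point, the squared distance to the centroid of its own cluster, added up over all points. The sketch below computes it by hand in plain Python on made-up points, with the grouping supplied manually rather than found by K-means.

```python
def centroid(cluster):
    """Mean point of a cluster (a list of equal-length tuples)."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def wcss(clusters):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in cluster)
    return total

# Hypothetical 2-D points, grouped in two different ways
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = [points]
two_clusters = [points[:2], points[2:]]

print(wcss(one_cluster))   # 104.0
print(wcss(two_clusters))  # 4.0
```

Going from one cluster to two cuts the WCSS sharply because the points really do form two groups. Adding further clusters to this toy data could only remove the remaining 4.0, which is the flattening the elbow plot looks for.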


This graph looks like an elbow, and we have to determine the elbow point.
Here the elbow point comes at around 4, so this is the number of clusters we should choose for the data above.
If we look at the figure carefully, increasing the number of clusters beyond 4 produces no big change in the WCSS; it stays nearly constant.
Hurrah! We have found a suitable number of clusters for our problem.
We will now quickly perform K-means clustering with the new number of clusters, which is 4, and then dive into some analysis.


In [None]:
kmeans_new = KMeans(4)
kmeans_new.fit(x_scaled)
cluster_new = x.copy()
cluster_new['cluster_pred']=kmeans_new.fit_predict(x_scaled)
cluster_new

## Plot the new clusters

In [None]:
plt.scatter(cluster_new['Satisfaction'], cluster_new['Loyalty'],c=cluster_new['cluster_pred'], cmap = 'rainbow')
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
plt.show()

## Hierarchical clustering

### How Hierarchical Clustering works
Hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge the two most similar clusters. This continues until all the clusters are merged together. This is illustrated in the diagrams below.
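
The two repeated steps can be sketched in plain Python. Note the swap: this illustration measures the distance between clusters with single linkage (the distance between their closest members) because it is short to write, whereas the workshop code uses Ward linkage. The points and function names are made up.

```python
from itertools import combinations
from math import dist

def single_linkage(a, b):
    """Distance between two clusters = distance between their closest members."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, n_clusters):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Step 1: identify the two clusters that are closest together
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        # Step 2: merge them (j is always greater than i, so deleting j is safe)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2-D points: two close pairs and one isolated point
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, n_clusters=3))
```

Letting the loop run until a single cluster remains, and recording the distance at each merge, gives the information a dendrogram displays.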

In [None]:
# import hierarchical clustering
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

### Dendrogram
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters.

The key to interpreting a dendrogram is to focus on the height at which any two objects are joined together. In the example below, we can see that 4 and 22 are most similar, as the height of the link that joins them together is the smallest.

Observations are allocated to clusters by drawing a horizontal line through the dendrogram. Observations that are joined together below the line belong to the same cluster.

Reference: https://www.displayr.com/what-is-dendrogram/
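
Drawing the horizontal line has a direct counterpart in SciPy: `fcluster` cuts a linkage matrix either at a given height or into a given number of clusters. The sketch below is a small illustration on made-up points; the cut height of 1.0 is an assumption that suits this toy data only.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical data: two tight groups of three points each
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

Z = hierarchy.linkage(points, 'ward')

# Cut the tree at height 1.0: observations joined below the line share a cluster
labels_by_height = hierarchy.fcluster(Z, t=1.0, criterion='distance')

# Alternatively, ask for a fixed number of clusters
labels_by_count = hierarchy.fcluster(Z, t=2, criterion='maxclust')

print(labels_by_height)
print(labels_by_count)
```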

In [None]:
# plot dendrogram 
from scipy.cluster import hierarchy
Z = hierarchy.linkage(x_scaled,'ward')
dn = hierarchy.dendrogram(Z)
plt.show()


In [None]:
# Create dendrogram
dendrogram = sch.dendrogram(sch.linkage(x_scaled, method='ward'))

# Create clusters (Ward linkage always uses Euclidean distance, the default)
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')

# save clusters for chart
cluster_hc = x.copy()
cluster_hc['cluster_pred'] = hc.fit_predict(x_scaled)
cluster_hc


In [None]:
# plot the cluster
plt.close()
plt.scatter(cluster_hc['Satisfaction'], cluster_hc['Loyalty'], c=cluster_hc['cluster_pred'], cmap='rainbow')
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
plt.show()

## Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Compared to centroid-based clustering like K-Means, density-based clustering works by identifying “dense” clusters of points, allowing it to learn clusters of arbitrary shape and identify outliers in the data. 

**The DBSCAN algorithm requires two parameters:**

**eps**: Defines the neighborhood around a data point: if the distance between two points is lower than or equal to ‘eps’, they are considered neighbors. If eps is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will end up in the same cluster. One way to find the eps value is based on the k-distance graph.

**MinPts**: Minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1, and MinPts should be at least 3.

In this algorithm, we have 3 types of data points.

**Core Point**: A point is a core point if it has at least MinPts points within eps.

**Border Point**: A point which has fewer than MinPts points within eps but is in the neighborhood of a core point.

**Noise or outlier**: A point which is neither a core point nor a border point.

Reference: https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/
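
The three point types can be read off a fitted scikit-learn `DBSCAN` model: `core_sample_indices_` lists the core points and a label of -1 marks noise, so whatever is left is a border point. The sketch below uses five made-up points on a line; note that scikit-learn's `min_samples` counts the point itself. The `eps` and `min_samples` values are assumptions chosen for this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three close points, one point at the edge of the group, one far away
points = np.array([[0.0, 0.0], [0.4, 0.0], [0.8, 0.0], [1.7, 0.0], [10.0, 0.0]])
db = DBSCAN(eps=1.0, min_samples=3).fit(points)

is_core = np.zeros(len(points), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = db.labels_ == -1
# Border points belong to a cluster without being core points
is_border = ~is_core & ~is_noise

kinds = np.where(is_core, 'core', np.where(is_border, 'border', 'noise'))
for point, kind in zip(points, kinds):
    print(point, kind)
```

The point at 1.7 has only one other point within eps, so it is not a core point, but it lies within eps of the core point at 0.8 and is therefore a border point.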

In [None]:
from sklearn.cluster import DBSCAN
db_default = DBSCAN(eps=0.55, min_samples=3).fit(x_scaled)
# cluster label of each row; -1 marks noise points
labels = db_default.labels_
# save clusters for chart
cluster_db = x.copy()
cluster_db['cluster_pred'] = labels
cluster_db

# plot the clusters
plt.close()
plt.scatter(cluster_db['Satisfaction'], cluster_db['Loyalty'], c=cluster_db['cluster_pred'], cmap='rainbow')
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
plt.show()
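
The eps value used with DBSCAN here was fixed by hand. The k-distance graph mentioned earlier is one way to pick it, and its values can be sketched with `NearestNeighbors`. This is a minimal illustration on synthetic data; `points` stands in for `x_scaled`, and `min_samples = 3` simply mirrors the setting used in the workshop cell.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for x_scaled
points, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.4, random_state=0)

min_samples = 3
# When the training data is passed back in as the query, each point is its own
# nearest neighbour (distance 0), matching how min_samples counts the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(points)
distances, _ = nn.kneighbors(points)

# Distance from every point to the farthest of those neighbours, sorted;
# plotting this curve and reading off its knee suggests a candidate eps
k_distances = np.sort(distances[:, -1])
print(k_distances[:5], k_distances[-5:])
```

Most points sit on the flat part of the curve; the few with a much larger k-distance are the ones DBSCAN would tend to mark as noise.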


## Gaussian Mixture Model

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models. It can also draw confidence ellipsoids for multivariate models, and compute the Bayesian Information Criterion to assess the number of clusters in the data.

Reference: https://scikit-learn.org/stable/modules/mixture.html
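
As a rough sketch of how the Bayesian Information Criterion can guide the number of components, the code below fits mixtures of increasing size to synthetic data and keeps the one with the lowest BIC. It also prints the soft assignments from `predict_proba`, which is where a mixture model differs from K-Means. The data and the range of component counts are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for x_scaled with three well-separated groups
points, _ = make_blobs(n_samples=300, centers=[(-5, 0), (0, 5), (5, 0)],
                       cluster_std=0.6, random_state=0)

# Lower BIC is better: it rewards fit and penalizes extra parameters
bic = {}
for k in range(1, 7):
    gmm_k = GaussianMixture(n_components=k, random_state=0).fit(points)
    bic[k] = gmm_k.bic(points)

best_k = min(bic, key=bic.get)
print("lowest BIC at", best_k, "components")

# Soft assignment: probability of each component for the first three points
best = GaussianMixture(n_components=best_k, random_state=0).fit(points)
probs = best.predict_proba(points[:3])
print(probs.round(3))
```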

In [None]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4).fit(x_scaled)
# most likely component for each row
labels = gmm.predict(x_scaled)
labels
# plot the clusters
plt.close()
plt.scatter(x['Satisfaction'], x['Loyalty'], c=labels, cmap='rainbow')
plt.xlabel('Satisfaction')
plt.ylabel('Loyalty')
plt.show()

## Analysis (the final step)
The following can be read from the figure:

The sky blue dots are the people who are less satisfied and less loyal and can therefore be termed alienated.
The yellow dots are people with high loyalty but low satisfaction; they are the supporters.
The purple dots are the people with high loyalty and high satisfaction; they are the fans.
The red dots are the people who are in the midst of things.
The ultimate goal of any business would be to have as many people as possible in the fans category. With this analysis in hand, we can target each audience accordingly. For example, the supporters can be turned into fans by raising their satisfaction level.
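
The same reading can be backed up with numbers by averaging each feature per cluster. The sketch below does this on a small made-up table; `segments` is a hypothetical stand-in for a DataFrame such as `cluster_new`, and its values are invented for the illustration.

```python
import pandas as pd

# Hypothetical stand-in for cluster_new: two scores plus a cluster label per person
segments = pd.DataFrame({
    'Satisfaction': [-1.2, -0.9, 1.1, 0.8, -1.0, 1.3],
    'Loyalty':      [-1.0, -1.1, 1.2, 0.9,  1.1, -0.2],
    'cluster_pred': [0, 0, 1, 1, 2, 3],
})

# Mean satisfaction and loyalty of each cluster, plus how many people it holds
profile = segments.groupby('cluster_pred')[['Satisfaction', 'Loyalty']].mean()
profile['size'] = segments.groupby('cluster_pred').size()
print(profile)
```

A cluster with low means on both columns would correspond to the alienated group, and one with high means on both to the fans.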

References:
1. https://medium.com/code-to-express/k-means-clustering-for-beginners-using-python-from-scratch-f20e79c8ad00
2. https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py
3. https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html#spectral-clustering

## Another example

In [None]:
import numpy as np
from sklearn.datasets import make_moons
from matplotlib import pyplot
from pandas import DataFrame
np.random.seed(0)
# generate 2d classification dataset
X, y = make_moons(n_samples=1000, noise=0.05)
# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:, 0], y=X[:, 1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()

In [None]:
# K-Means method
x = df[['x', 'y']]
kmeans = KMeans(2)
cluster_moon = df.copy()
# fit_predict fits the model and returns the cluster label of each row
cluster_moon['cluster_kmean'] = kmeans.fit_predict(x)
cluster_moon.head(n=10)

In [None]:
# plot the output
plt.scatter(cluster_moon['x'], cluster_moon['y'],c=cluster_moon['cluster_kmean'], cmap = 'rainbow')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

In [None]:
# hierarchical clustering with Ward linkage

# Create clusters (Ward linkage always uses Euclidean distance, the default)
hc = AgglomerativeClustering(n_clusters=2, linkage='ward')

# save clusters for chart
cluster_moon['cluster_ward'] = hc.fit_predict(x)
cluster_moon.head(n=10)

In [None]:
# plot the output
plt.scatter(cluster_moon['x'], cluster_moon['y'],c=cluster_moon['cluster_ward'], cmap = 'rainbow')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

## Exercise 3
Use the data above to try the single linkage clustering method and plot the result.

```python
# hierarchical clustering with single linkage
# Create clusters (Euclidean distance is the default)
hc = AgglomerativeClustering(n_clusters=2, linkage='single')
```

## Answers

Exercise 1
1. Import the packages using the Python code below:
```python
# import the packages
import pandas as pd
# import model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from yellowbrick.features import Rank2D
from yellowbrick.regressor import ResidualsPlot
```
2. Import the data file "bikeshare.csv" as a DataFrame.
```python
# adjust the path to wherever bikeshare.csv is saved
data = pd.read_csv("bikeshare.csv")
```
3. Create a Python list of feature names:
    X is "season", "month", "hour", "holiday", "weekday", "workingday",
    "weather", "temp", "feelslike", "humidity", "windspeed".

4. Use the list to select a subset, X, of the original DataFrame.

5. Use the Python code below to show the Pearson correlation matrix.

6. Select the target, y, from the original DataFrame:
    y is "riders".

7. Split X and y into training and testing sets.

8. Create a linear regression model, fit it to the training data, and test it using the testing data.

9. Check the coefficients.

10. Visualize the residuals of the training and testing sets.

```python
# Create a Python list of feature names
FeatureNames = ["season", "month", "hour", "holiday", "weekday", "workingday",
    "weather", "temp", "feelslike", "humidity", "windspeed"]
```
```python
# Select a subset, X, of the original DataFrame 
X = data[FeatureNames]
```

```python
# visualization
visualizer = Rank2D(algorithm="pearson")
visualizer.fit_transform(X)
visualizer.poof()
```
```python
# Select the target, y, from the original DataFrame
y = data["riders"]
```
```python
# Split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```
```python
# Linear Regression Model
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# Check the coefficients
linreg.coef_
```


```python
# Visualize the residuals of the training and testing sets
visualizer = ResidualsPlot(linreg)
visualizer.fit(X_train, y_train)  # Fit the training data to the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.poof()                 # Draw/show/poof the data
```
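
The answer imports `metrics` and computes `y_pred` but never scores the predictions. As a hedged sketch of how that could be done, the code below reports the root mean squared error and R² with `sklearn.metrics`. It runs on synthetic data because bikeshare.csv is not reproduced here; the two made-up features and their coefficients are assumptions for the illustration.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bikeshare features and the "riders" target
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test)

# Root mean squared error is in the units of y; R^2 is the share of variance explained
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
r2 = metrics.r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}  R^2: {r2:.3f}")
```

With the real data, the same two calls on `y_test` and `y_pred` from the answer above would give the test-set scores.
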
Exercise 2

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ConfusionMatrix
# Load dataset
data = load_breast_cancer()
# Organize our data
y_names = data['target_names']
y = data['target']
feature_names = data['feature_names']
X = data['data']
# Split our data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Fit a Gaussian Naive Bayes model
gnb = GaussianNB()
cm = ConfusionMatrix(gnb, classes=y_names, label_encoder={0: 'malignant', 1: 'benign'})
# fit() fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)
# To create the ConfusionMatrix, we need some test data. score() runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)
# How did we do?
cm.poof()
```
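
For readers who want the numbers behind the plot, the same counts can be obtained directly from scikit-learn. This is a minimal sketch that repeats the split and model from the answer and prints the confusion matrix and accuracy; it is an alternative view, not part of the original exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)

gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
counts = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(counts)
print("accuracy:", accuracy)
```

The diagonal of `counts` holds the correctly classified samples, so the accuracy is the diagonal sum divided by the total.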

Exercise 3
```python
# hierarchical clustering with single linkage
# Create clusters (Euclidean distance is the default)
hc = AgglomerativeClustering(n_clusters=2, linkage='single')
# save clusters for chart
cluster_moon['cluster_single'] = hc.fit_predict(x)
cluster_moon.head(n=10)
# plot the output
plt.scatter(cluster_moon['x'], cluster_moon['y'],c=cluster_moon['cluster_single'], cmap = 'rainbow')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```
