In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Machine Learning
Machine Learning covers algorithms that find patterns in existing data in order to automate tasks on that data or on new data.
This notebook covers two types of Machine Learning algorithms:

|Supervised Learning|Unsupervised Learning|
|-------------------|---------------------|
|Requires labelled data, split into train and test sets|Requires only one set of data, with no labels|
|Used to predict on new data|Used to find patterns in existing data|
|Inferential|Descriptive|
|Examples: Linear Regression, SVM, Neural Networks|Examples: Clustering|

### What is training data?
Training data is the data the model learns from, i.e. the data it is `fit` on. Training data has both the input features and the output target.
### What is testing data?
Based on what it learned from the training data, the model tries to `predict` on the test data. In test data the input features are known but the output target is not given to the model; the model tries to find it.
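
The split into training and testing data can be sketched as follows. This assumes scikit-learn's `train_test_split` helper and the built-in iris data; the 25% test size and `random_state=0` are arbitrary choices for illustration, not something prescribed above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 rows, 4 input features, 1 target column

# Hold out 25% of the rows as test data (assumed ratio, for illustration only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

The model would be fit on `X_train` and `y_train`, while `y_test` is kept aside and only used to check the predictions made on `X_test`.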

## Sklearn (Scikit-learn) Templates for Algorithms
### Machine Learning Algorithms
Assume the model class is `xyz` and it lives in the submodule `abc` (both names are placeholders).

1. Import the ML model<br>
    `from sklearn.abc import xyz`
2. Create an object of the ML algorithm<br>
    `model = xyz()`
3. Fit on the training data<br>
    `model.fit(X_train, y_train)`
4. Predict on test data<br>
    `model.predict(X_test)`
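
To make the template concrete, here is a minimal sketch with the placeholder `xyz` replaced by `LinearRegression` from `sklearn.linear_model`. The toy arrays are made up for this example and follow y = 2x + 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # step 1: import the model

# Toy training data (made up for illustration): y = 2x + 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])
X_test = np.array([[4.0], [5.0]])

model = LinearRegression()       # step 2: create an object of the algorithm
model.fit(X_train, y_train)      # step 3: fit on the training data
y_pred = model.predict(X_test)   # step 4: predict on the test data
print(y_pred)
```

Any other scikit-learn estimator can be swapped in at steps 1 and 2 without changing steps 3 and 4.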
    
### Data Transformation Algorithms
1. Import the transformer<br>
    `from sklearn.abc import xyz`
2. Create an object of the transformer<br>
    `model = xyz()`
3. Fit on the training data<br>
    `model.fit(X_train)`
4. Transform the test data<br>
    `model.transform(X_test)`<br>
When the same data is both fitted and transformed, steps 3 & 4 can be combined using
`model.fit_transform(X)`
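
As a minimal sketch of the transformer template, the example below uses `StandardScaler` from `sklearn.preprocessing` as one possible stand-in for the placeholder `xyz`. The small arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler  # step 1: import the transformer

# Toy data (made up for illustration)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()                  # step 2: create the object
scaler.fit(X_train)                        # step 3: learn per-column mean and std
X_test_scaled = scaler.transform(X_test)   # step 4: apply them to the test data

# fit_transform is fit followed by transform on the same data
X_train_scaled = StandardScaler().fit_transform(X_train)
print(X_train_scaled)
print(X_test_scaled)
```

Note that the test data is only transformed, never fitted, so it is scaled with the statistics learned from the training data.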

## Unsupervised Learning
### Clustering
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).[[1]](https://en.wikipedia.org/wiki/Cluster_analysis)<br>
#### K-Means Clustering
K-means works by selecting k central points, or means, hence K-Means. Each mean is used as the centroid of one cluster: every point is assigned to the cluster of the mean it is closest to.

Once all points are assigned, move through each cluster and take the average of all points it contains. This new ‘average’ point is the new mean of the cluster.

Just repeat these two steps over and over again until the point assignments stop changing.
![](method-k-means-steps-example.png)
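
The two repeated steps can be sketched in plain NumPy. This is an illustrative sketch only, not scikit-learn's implementation: the function name `kmeans_sketch`, the random choice of starting points, and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain NumPy k-means (hypothetical helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the point assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each centroid to the average of its assigned points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated toy groups (made-up data)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(points, k=2)
print(centroids)
print(labels)
```

On data this cleanly separated, the first three points and the last three points should end up in different clusters, with each centroid at the average of its group.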


In [None]:
df = pd.read_csv('data/train.csv')

In [None]:
# implementation
from sklearn.cluster import KMeans

In [None]:
kmeans = KMeans(n_clusters=3)

In [None]:
# preprocess the dataset
# kmeans.fit(df)

In [None]:
# kmeans.cluster_centers_

### Elbow Method:
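The elbow method is a heuristic for choosing the number of clusters: run K-means for a range of k values, plot the within sum of squares against k, and pick the k at the bend (the "elbow") of the curve.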
![](elbow.png)
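
The within sum of squares (WSS) that the elbow method relies on can be written out explicitly. The notation here is an assumption introduced for this sketch: $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is that cluster's centroid.

$$
\mathrm{WSS}(k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
$$

WSS generally shrinks as $k$ grows and reaches zero when every point is its own cluster, which is why the bend in the curve is used rather than its minimum.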

In [None]:
# implementation of elbow method.

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

In [None]:
iris = load_iris()
data = pd.DataFrame(data=iris.data,columns=['f1','f2','f3','f4'])

In [None]:
x = list(range(1,10)) # candidate numbers of clusters
y = [] # within sum of squares for each candidate
for N in x:
    kmeans = KMeans(n_clusters=N, n_init=10, random_state=0) # create cluster object (fixed seed for reproducibility)
    kmeans.fit(data) # fit the data using kmeans
    data1 = data.copy()
    data1['cluster_index'] = kmeans.predict(data) # predict using the kmeans object
    cluster_centers = kmeans.cluster_centers_ # store the centers of clusters.
    number_of_clusters = cluster_centers.shape[0]
    wss = 0
    for i in range(number_of_clusters):
        # add the within sum of squares of cluster i to the total
        wss += np.square(data[data1['cluster_index'] == i] - cluster_centers[i,:]).values.sum()
    # wss should match kmeans.inertia_ up to floating point error
    y.append(wss) # store this value in a list.

# visualization
plt.plot(x,y)
plt.scatter(x,y)
plt.xlabel("Number Of Clusters")
plt.ylabel("Within Sum Of Squares")

In [None]:
# Your Task
# 1. Run KMeans on the iris data set with n_clusters = 3
# 2. Find the center of the cluster centers
# 3. Plot the cluster centers and the center of the cluster centers (only the first two columns) as a scatter plot

In [None]:
# Your Task
# 4. Plot all the rows as points in a scatter plot (only the first two columns)
# 5. Plot all the points in the same cluster with the same color
# Hint: to set the color in a plot, pass the parameter color = 'r' or 'b' or 'g' etc.

### Pros Of Kmeans
* Computationally fast, even with a large number of data points

### Cons of Kmeans
* Difficult to find the right k value
* The result depends on the initialization of the means
* Can give sub-optimal results in some cases
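
The dependence on initialization can be probed with a small experiment. This sketch assumes the iris data and scikit-learn's `KMeans` with `init="random"` and `n_init=1`, so that each run uses exactly one random start; the ten seeds are an arbitrary choice, and the size of the spread they produce is not guaranteed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# One random initialization per run, so different seeds can end in different solutions
inertias = []
for seed in range(10):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)  # within sum of squares of the final clustering

print(min(inertias), max(inertias))
```

If the largest value is clearly above the smallest, some starts converged to a sub-optimal clustering; running several initializations and keeping the best one (a larger `n_init`) is the usual way to reduce this effect.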

In [31]:
# Home task 
# Find cases where kmeans gives bad results.

# Reading Work
* Read about other clustering methods.
* There is a third type of machine learning algorithm called Reinforcement Learning. Read about it.