# Supervised Learning - Classification
## Classification
One of the best-known classification problems in machine learning is the Iris flower classification problem. We want to predict the species (class) of an Iris flower given its sepal and petal lengths and widths. The data we will use are in a file called Iris_Data.csv, found at

https://raw.githubusercontent.com/nyp-sit/data/master/Iris_Data.csv

## Basic Data Analysis
Q1. Let's load the data from the url and perform basic data analysis.


*   Check the sample size
*   Check the features
*   Check whether there are any missing values



In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/nyp-sit/data/master/Iris_Data.csv'
df = pd.read_csv(url)
print(df.head())

# Check the sample size

# Check the features

# Check for missing values


In [None]:
#@title
import pandas as pd
url = 'https://raw.githubusercontent.com/nyp-sit/data/master/Iris_Data.csv'
df = pd.read_csv(url)
print(df.head())
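print('>>> Check the sample size:')
print(df.shape)  # (number of rows, number of columns)
print('>>> Check the features:')
print(df.columns.tolist())  # column names
print(df.describe())  # summary statistics of the numeric features
print('>>> Check for missing values:')
print(df.isnull().sum())  # count of missing values per column
df.info()  # info() prints its summary directly and returns None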


Let us first understand the dataset. It consists of:
*   150 rows of data
*   The 3 **labels** are Iris-virginica, Iris-setosa and Iris-versicolor
*   The 4 **features** are Sepal length, Sepal width, Petal length, Petal width in cm
*   There are no missing values

This is a **multi-class classification** problem, as there are more than 2 classes to be predicted.  
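
The summary above can be checked programmatically. The following is a minimal sketch that does not download the CSV: it assumes scikit-learn's bundled copy of the Iris data as a stand-in, and the column names are assumed to match those in Iris_Data.csv. Note that the bundled copy labels the classes `setosa`, `versicolor` and `virginica`, without the `Iris-` prefix.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for Iris_Data.csv (assumption): scikit-learn's bundled Iris data.
# The column names below are assumed to match the CSV used in this notebook.
iris = load_iris()
demo_df = pd.DataFrame(
    iris.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
demo_df['species'] = [iris.target_names[i] for i in iris.target]

n_rows, n_cols = demo_df.shape                    # sample size and column count
n_missing = int(demo_df.isnull().sum().sum())     # total number of missing cells
class_counts = demo_df['species'].value_counts()  # number of rows per class

print(n_rows, n_cols)
print(n_missing)
print(class_counts)
```

More than two distinct values in the label column is what makes this a multi-class rather than a binary problem.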



## Exploratory Data Analysis

Q2. Let's perform a univariate analysis on the data with

*   a count plot to show the counts of each category of Iris species.  
*   histograms to show the distribution of the 4 features: petal_width, petal_length, sepal_length, sepal_width



In [None]:
# count plot using matplotlib



In [None]:
#@title
# count plot using matplotlib

import matplotlib.pyplot as plt

df['species'].value_counts().plot(kind='bar')
plt.show()

In [None]:
# count plot using seaborn



In [None]:
#@title
# count plot using seaborn

import seaborn as sns
ax = sns.countplot(x='species', data=df)

In [None]:
# Histogram to show distribution of features

In [None]:
#@title
# Histogram to show distribution of features
df.hist()
plt.show()

In [None]:
# Histogram with density plot (kde) using seaborn

In [None]:
#@title
# Histogram with density plot (kde) using seaborn
# histplot is used here because distplot has been deprecated since seaborn 0.11

for col in ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']:
    sns.histplot(df[col], kde=True)
    plt.show()

Q3. Let's perform multivariate analysis on the data with

*  Scatter matrix 
*  Box plot

In [None]:
# Scatter matrix / pair plots

In [None]:
#@title
from pandas.plotting import scatter_matrix
scatter_matrix(df)
plt.show()

In [None]:
# pairplots using sns

In [None]:
#@title
sns.pairplot(df, hue='species')
plt.show()

In [None]:
# box plot using sns

In [None]:
#@title

sns.boxplot(x='species', y='sepal_length', data=df)
plt.show()
sns.boxplot(x='species', y='sepal_width', data=df)
plt.show()
sns.boxplot(x='species', y='petal_length', data=df)
plt.show()
sns.boxplot(x='species', y='petal_width', data=df)
plt.show()


## Data Modelling

Q4. Iris class prediction is a multi-class classification problem in which the target variable has three classes. The goal is to construct a function that correctly predicts the class to which a new data point belongs.

We need some held-out data to validate the accuracy of our model. We will split the loaded dataset into two parts: 80% to train our models and 20% to validate them.

*  Create a validation test set using the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function
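
As an illustration of what an 80/20 split produces, here is a small sketch on hypothetical toy data (the names `X_demo` and `y_demo` are made up for this example). It also passes the optional `stratify` argument, which is an addition not required by the exercise: it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 150 samples, 4 features, 3 balanced classes.
X_demo = np.arange(600).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
# stratify=y_demo (optional) preserves the class balance in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # rows and columns in each part
print(np.bincount(y_te))       # number of held-out samples per class
```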


In [None]:
# Create a validation test set by splitting the data into 80/20 
from sklearn.model_selection import train_test_split



In [None]:
#@title
# Create a validation test set by splitting the data into 80/20 
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.2, random_state=7)

X = train.values[:,0:4]
Y = train.values[:,4]
x_test = test.values[:,0:4]
y_test = test.values[:,4]


Q5. Train Logistic Regression and K Nearest Neighbours models on the training data. Compute the accuracy score and confusion matrix for both algorithms.

[Accuracy Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

[Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[Nearest Neighbours](https://scikit-learn.org/stable/modules/neighbors.html)
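
To see how the two metrics are read, here is a small worked example on hand-made labels (the label lists are hypothetical and not taken from the Iris data). Accuracy is the fraction of predictions that match the true labels. In the confusion matrix, rows are the true classes and columns are the predicted classes, in sorted label order.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six flowers.
y_true_demo = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred_demo = ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'virginica']

# 5 of the 6 predictions are correct.
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

# Row i, column j counts samples of true class i predicted as class j.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

print(acc_demo)
print(cm_demo)
```

Here the single off-diagonal entry is the one versicolor flower that was predicted as virginica.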


In [None]:
#@title
from sklearn import linear_model, neighbors
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# Logistic Regression: fit on the training set, evaluate on the held-out test set
classifier = linear_model.LogisticRegression(solver='liblinear', multi_class='ovr')
classifier.fit(X,Y)
predictions=classifier.predict(x_test)
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))

# K Nearest Neighbours (default n_neighbors=5): same procedure
classifier = neighbors.KNeighborsClassifier()
classifier.fit(X,Y)
predictions=classifier.predict(x_test)
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))



Q6. Use the K-fold cross-validation technique to randomly split the training set into 10 distinct subsets (folds), then train and evaluate the Logistic Regression and KNN models 10 times, each time holding out a different fold, and compare their results.  
[Cross Validation](https://scikit-learn.org/stable/modules/cross_validation.html)
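
To make the fold mechanics concrete, here is a small sketch on hypothetical toy data with 5 folds instead of 10. The `shuffle=True` and `random_state` arguments are choices made for this illustration. Each sample is held out exactly once across the folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features

# shuffle=True randomises which samples land in which fold;
# random_state makes the assignment repeatable.
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=7)

held_out = []
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(X_demo)):
    # train_idx: rows used for fitting; test_idx: rows held out for scoring
    print(fold, train_idx, test_idx)
    held_out.extend(test_idx.tolist())
```

`cross_val_score` runs this loop internally: it fits a fresh copy of the model on each training part and scores it on the corresponding held-out part.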

In [None]:
#@title
from sklearn import model_selection

models = {}
models['LR'] = linear_model.LogisticRegression(solver='liblinear', multi_class='ovr')
models['KNN'] = neighbors.KNeighborsClassifier()

results = []
names = []
score = 'accuracy'

for name in models:
  model = models.get(name)
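  # shuffle=True makes the fold assignment random; random_state keeps it repeatable
  kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)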
  cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=score)
  results.append(cv_results)
  names.append(name)

  print('{}: {} ({})'.format(name, cv_results.mean(), cv_results.std()))

How would you apply a decision tree and an SVM to this problem?
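
One possible starting point is sketched below. It is not a definitive answer: it assumes scikit-learn's bundled Iris data as a stand-in for Iris_Data.csv, uses `DecisionTreeClassifier` and `SVC` with default hyperparameters, and the names `extra_models` and `extra_scores` are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled Iris dataset.
X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=7
)

# Both estimators share the same fit/predict interface as the models above.
extra_models = {
    'CART': DecisionTreeClassifier(random_state=7),  # decision tree
    'SVM': SVC(),                                    # support vector machine, RBF kernel by default
}

extra_scores = {}
for name, model in extra_models.items():
    model.fit(X_tr, y_tr)
    extra_scores[name] = accuracy_score(y_te, model.predict(X_te))
    print('{}: {}'.format(name, extra_scores[name]))
```

Because these estimators follow the same interface, they can also be added to the `models` dictionary used for cross-validation and compared with Logistic Regression and KNN in the same loop.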