<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti103/blob/master/session-2/Introduction_to_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Machine Learning
## Classification
The Iris flower classification problem is one of the best-known classification problems in machine learning. We want to predict the species of an iris from the lengths and widths of its sepals and petals. The data we will use are in a file called Iris_Data.csv, found at:

https://raw.githubusercontent.com/nyp-sit/data/master/Iris_Data.csv

## Basic Data Analysis
Q1. Let's load the data from the URL and perform a basic data analysis:


*   Check the sample size
*   Check the features
*   Check whether there are any missing values



In [0]:
import pandas as pd
url = 'https://raw.githubusercontent.com/nyp-sit/data/master/Iris_Data.csv'
df = pd.read_csv(url)
print(df.head())

# Check the sample size

# Check the features

# Check for missing values


In [0]:
#@title
import pandas as pd
url = 'https://raw.githubusercontent.com/nyp-sit/data/master/Iris_Data.csv'
df = pd.read_csv(url)
print(df.head())
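print('>>> Check the sample size:')
print(df.shape)
print('>>> Check the features:')
print(df.columns.tolist())
print(df.describe())
print('>>> Check for missing values:')
df.info()                  # info() prints its report directly and returns None
print(df.isnull().sum())   # number of missing values per column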


Let us first understand the dataset. It consists of:
*   150 rows of data
*   3 **labels**: Iris-virginica, Iris-setosa and Iris-versicolor
*   4 **features**: sepal length, sepal width, petal length and petal width, all in cm
*   No missing values

This is a **multi-class classification** problem, as there are more than 2 classes to be predicted.  
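
As a quick sanity check of the summary above, the sketch below recomputes the same facts from a DataFrame. To stay self-contained it assumes scikit-learn's bundled copy of the Iris data (`load_iris`) rather than the CSV file, so the names `iris_df` and the column labels are illustrative and differ slightly from those in Iris_Data.csv.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
iris = load_iris(as_frame=True)
iris_df = iris.data.copy()
# Map the integer targets (0, 1, 2) to their species names.
iris_df['species'] = iris.target.map(lambda i: iris.target_names[i])

n_rows, n_cols = iris_df.shape                    # sample size and column count
class_counts = iris_df['species'].value_counts()  # number of rows per label
n_missing = int(iris_df.isnull().sum().sum())     # total number of missing cells

print(n_rows, 'rows,', n_cols - 1, 'features')
print(class_counts)
print('missing values:', n_missing)
```

Three classes with equal counts also tell us the dataset is balanced, which is why plain accuracy is a reasonable metric later on.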



## Exploratory Data Analysis

Q2. Let's perform a univariate analysis on the data with

*   a count plot to show the counts of each category of Iris species.  
*   histograms to show the distribution of the 4 features: petal_width, petal_length, sepal_length and sepal_width



In [0]:
# count plot using matplotlib



In [0]:
#@title
# count plot using matplotlib

import matplotlib.pyplot as plt

df['species'].value_counts().plot(kind='bar')
plt.show()

In [0]:
# count plot using seaborn



In [0]:
#@title
# count plot using seaborn

import seaborn as sns
ax = sns.countplot(x='species', data=df)

In [0]:
# Histogram to show distribution of features

In [0]:
#@title
# Histogram to show distribution of features
df.hist()
plt.show()

In [0]:
# Histogram with density plot (kde) using seaborn

In [0]:
#@title
# Histogram with density plot (kde) using seaborn
# (histplot replaces distplot, which is deprecated in recent seaborn releases)

for feature in ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']:
    sns.histplot(df[feature], kde=True)
    plt.show()

Q3. Let's perform multivariate analysis on the data with

*  Scatter matrix 
*  Box plot

In [0]:
# Scatter matrix / pair plots

In [0]:
#@title
from pandas.plotting import scatter_matrix
scatter_matrix(df)
plt.show()

In [0]:
# pairplots using sns

In [0]:
#@title
sns.pairplot(df, hue='species')
plt.show()

In [0]:
# box plot using sns

In [0]:
#@title

sns.boxplot(x='species', y='sepal_length', data=df)
plt.show()
sns.boxplot(x='species', y='sepal_width', data=df)
plt.show()
sns.boxplot(x='species', y='petal_length', data=df)
plt.show()
sns.boxplot(x='species', y='petal_width', data=df)
plt.show()
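
A box plot is a picture of a few summary statistics: the median and the quartiles of each group. As a rough numeric companion to the plots above, the sketch below computes the per-species quartiles of petal length. It assumes scikit-learn's bundled Iris data, whose column name `petal length (cm)` differs from the CSV's `petal_length`.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
iris = load_iris(as_frame=True)
iris_df = iris.data.copy()
iris_df['species'] = iris.target.map(lambda i: iris.target_names[i])

# 25th, 50th (median) and 75th percentiles per species: the lower edge,
# centre line and upper edge of each box in the box plot.
quartiles = (
    iris_df.groupby('species')['petal length (cm)']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

When the quartile ranges of a feature barely overlap between species, that feature on its own already separates the classes well.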


## Data Modelling

Q4. Iris species prediction is a multi-class classification problem in which the target variable has three classes. The goal is to construct a function that correctly predicts the class to which a new data point belongs.

We need some held-out data to validate the accuracy of our model. We will split the loaded dataset in two: 80% to train our models and 20% to validate them.

*  Create a validation set using the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function


In [0]:
# Create a validation test set by splitting the data into 80/20 
from sklearn.model_selection import train_test_split



In [0]:
#@title
# Create a validation set by splitting the data 80/20
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=7)

# The first four columns are the features; the fifth is the species label
X = train.iloc[:, 0:4].values
Y = train.iloc[:, 4].values
x_test = test.iloc[:, 0:4].values
y_test = test.iloc[:, 4].values
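
The split above is purely random, so the three species are not guaranteed to be equally represented in the 20% hold-out set. One common refinement, which this notebook does not use, is to pass `stratify=` to `train_test_split` so that each class keeps its proportion in both subsets. A minimal sketch, assuming scikit-learn's bundled Iris data in place of the CSV:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
features, labels = load_iris(return_X_y=True)

# stratify=labels keeps the 50/50/50 class balance in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=7, stratify=labels
)

print(X_tr.shape, X_te.shape)   # sizes of the training and test sets
print(np.bincount(y_te))        # number of test rows per class
```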


Q5. Train Logistic Regression and K Nearest Neighbours classifiers on the training data, then compute the accuracy score of each on the test set.

[Accuracy Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[Nearest Neighbours](https://scikit-learn.org/stable/modules/neighbors.html)

In [0]:
#@title
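from sklearn import linear_model, neighbors
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

# Logistic Regression, one-vs-rest over the three species. The explicit wrapper
# replaces multi_class='ovr', which recent scikit-learn releases deprecate.
classifier = OneVsRestClassifier(linear_model.LogisticRegression(solver='liblinear'))
classifier.fit(X, Y)
predictions = classifier.predict(x_test)
print('LR accuracy:', accuracy_score(y_test, predictions))

# K Nearest Neighbours with the default of 5 neighbours
classifier = neighbors.KNeighborsClassifier()
classifier.fit(X, Y)
predictions = classifier.predict(x_test)
print('KNN accuracy:', accuracy_score(y_test, predictions))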
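
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this against `accuracy_score` on a small, made-up set of labels; the values are illustrative and are not taken from the Iris results.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for five flowers.
y_true = np.array(['setosa', 'setosa', 'versicolor', 'virginica', 'virginica'])
y_pred = np.array(['setosa', 'setosa', 'virginica', 'virginica', 'virginica'])

manual_accuracy = np.mean(y_true == y_pred)        # 4 correct out of 5
library_accuracy = accuracy_score(y_true, y_pred)  # same computation via scikit-learn

print(manual_accuracy, library_accuracy)
```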




Q6. Use K-fold cross-validation to split the training set into 10 distinct subsets (folds). Train and evaluate the Logistic Regression and KNN models 10 times, each time holding out a different fold for validation, and compare their results.
[Cross Validation](https://scikit-learn.org/stable/modules/cross_validation.html)

In [0]:
#@title
from sklearn import model_selection
from sklearn.multiclass import OneVsRestClassifier

# linear_model and neighbors were imported in the previous cell
models = {
    'LR': OneVsRestClassifier(linear_model.LogisticRegression(solver='liblinear')),
    'KNN': neighbors.KNeighborsClassifier(),
}

results = []
names = []
score = 'accuracy'

for name, model in models.items():
    # 10 folds: train on 9, validate on the remaining one, 10 times over
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=score)
    results.append(cv_results)
    names.append(name)
    # mean accuracy across the folds, with its standard deviation in brackets
    print('{}: {:.3f} ({:.3f})'.format(name, cv_results.mean(), cv_results.std()))
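
To see what K-fold cross-validation does with the data, the sketch below prints the index split that `KFold` produces. It uses 5 folds on ten toy samples, rather than the 10 folds on the training set above, purely to keep the output short.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)     # ten toy samples, identified by their index
kfold = KFold(n_splits=5)   # 5 folds of 2 samples each

held_out = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(samples)):
    print('fold', fold, 'train:', train_idx, 'test:', test_idx)
    held_out.extend(test_idx.tolist())
```

Every sample is held out exactly once, so each model is always scored on data it did not see during that round of training.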

## Regression

In machine learning, regression predicts a continuous numeric outcome from its relationship with other variables in the dataset. Suppose we want to know whether money makes people happy.

Q7. Let's load the [Better Life Index data](https://raw.githubusercontent.com/nyp-sit/data/master/Better_Life.csv) provided by the OECD. Examine and transform the data accordingly.



In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model

url = 'https://raw.githubusercontent.com/nyp-sit/data/master/Better_Life.csv'
# load the data set

In [0]:
#@title
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model

url = 'https://raw.githubusercontent.com/nyp-sit/data/master/Better_Life.csv'
df = pd.read_csv(url)
life_df = df.loc[(df['Indicator'] == 'Life satisfaction') & (df['Inequality'] == 'Total')]
life_df
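
The filter above combines two conditions with `&` inside `.loc`. The sketch below shows the same pattern on a small hypothetical table that mimics the long format of the Better Life data; the rows and values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: one line per (country, indicator, inequality group).
toy_df = pd.DataFrame({
    'Country':    ['A', 'A', 'A', 'B', 'B'],
    'Indicator':  ['Life satisfaction', 'Life satisfaction', 'Air pollution',
                   'Life satisfaction', 'Air pollution'],
    'Inequality': ['Total', 'Men', 'Total', 'Total', 'Total'],
    'Value':      [7.3, 7.2, 13.0, 6.9, 20.0],
})

# Each comparison yields a boolean Series; & combines them row by row.
mask = (toy_df['Indicator'] == 'Life satisfaction') & (toy_df['Inequality'] == 'Total')
toy_life_df = toy_df.loc[mask]
print(toy_life_df)
```

The parentheses around each comparison are required, because `&` binds more tightly than `==` in Python.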

Q8. Load the [GDP per capita](https://raw.githubusercontent.com/nyp-sit/data/master/WEO_Data.csv) data provided by the IMF.

In [0]:
url1 = 'https://raw.githubusercontent.com/nyp-sit/data/master/WEO_Data.csv'
# load the data set

In [0]:
#@title
url1 = 'https://raw.githubusercontent.com/nyp-sit/data/master/WEO_Data.csv'
gdp_df = pd.read_csv(url1, encoding='latin-1', thousands=',')
gdp_df.head()
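
The `thousands=','` argument matters because figures written like `43,603.12` would otherwise be read as text. A minimal sketch on a hypothetical two-row extract (the country names and numbers are made up):

```python
import io
import pandas as pd

# Hypothetical extract; quoted fields keep the thousands commas inside one cell.
raw = 'Country,2015\n"Country A","43,603.12"\n"Country B","9,009.28"\n'

plain = pd.read_csv(io.StringIO(raw))                   # commas left in place
parsed = pd.read_csv(io.StringIO(raw), thousands=',')   # commas stripped before parsing

print(plain['2015'].dtype)    # dtype when the commas are kept
print(parsed['2015'].dtype)   # dtype when the commas are removed
print(parsed)
```

Only the second frame can be used directly for plotting and regression, since its `2015` column holds numbers rather than strings.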

Q9. Prepare the data by merging the OECD's life satisfaction data with the IMF's GDP per capita data.

*   Rename the columns to Country, Life satisfaction and GDP per capita
*   Set the index to 'Country'


In [0]:
#@title
# Keep only the countries present in both data sets
pdf = pd.merge(gdp_df[['Country', '2015']], life_df[['Country', 'Value']], on='Country', how='inner')
pdf.rename(columns={'Value': 'Life satisfaction', '2015': 'GDP per capita'}, inplace=True)
pdf.set_index('Country', inplace=True)
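
An inner merge keeps only the keys that appear in both tables. The sketch below walks through the merge, rename and index steps on two small hypothetical frames; the countries and values are made up for illustration.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy_gdp = pd.DataFrame({'Country': ['A', 'B', 'C'], '2015': [50000.0, 20000.0, 9000.0]})
toy_life = pd.DataFrame({'Country': ['B', 'C', 'D'], 'Value': [6.5, 5.8, 7.1]})

# how='inner' keeps only the countries present in both frames (B and C).
toy_merged = pd.merge(toy_gdp, toy_life, on='Country', how='inner')
toy_merged = toy_merged.rename(columns={'2015': 'GDP per capita', 'Value': 'Life satisfaction'})
toy_merged = toy_merged.set_index('Country')
print(toy_merged)
```

Countries A and D are dropped because each is missing from one of the two sources.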

Q10. Draw a scatter plot to show the relationship between Life satisfaction and GDP per capita.

In [0]:
#@title
ax1=pdf.plot.scatter(x = 'GDP per capita', y = 'Life satisfaction', c='blue', ylim=[0,10])
plt.show()

Q11. Train a linear regression model and predict the life satisfaction of Cyprus, which has a GDP per capita of 22587.

In [0]:
#@title
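# np.c_ turns each column into a 2-D array of shape (n_samples, 1)
X = np.c_[pdf['GDP per capita']]
y = np.c_[pdf['Life satisfaction']]

lin_reg_model = sklearn.linear_model.LinearRegression()
lin_reg_model.fit(X, y)

# Predict the life satisfaction for Cyprus from its GDP per capita
X_new = [[22587]]
print(lin_reg_model.predict(X_new))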
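
With a single feature, a fitted linear regression is a straight line: life satisfaction ≈ θ0 + θ1 × GDP per capita. The sketch below reads θ0 and θ1 from a fitted model and reproduces `predict` by hand. The data points are hypothetical and are not the OECD or IMF figures, so the printed number says nothing about Cyprus.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical (GDP per capita, life satisfaction) pairs, for illustration only.
X_toy = np.array([[10000.0], [20000.0], [30000.0], [40000.0]])
y_toy = np.array([5.0, 5.6, 6.1, 6.9])

toy_model = LinearRegression().fit(X_toy, y_toy)
theta0 = toy_model.intercept_   # intercept of the fitted line
theta1 = toy_model.coef_[0]     # slope: change in satisfaction per unit of GDP

x_new = 22587.0
manual = theta0 + theta1 * x_new               # evaluate the line by hand
from_model = toy_model.predict([[x_new]])[0]   # same value via predict
print(manual, from_model)
```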