## Scikit-learn Library

Scikit-learn (sklearn) is a Python library that provides many supervised and unsupervised learning algorithms. It is built on NumPy and SciPy, and it works well alongside tools you might already be familiar with, such as pandas and Matplotlib.

The functionality that scikit-learn provides includes:

 - Regression, including Linear Regression
 - Classification, including K-Nearest Neighbors and Logistic Regression (a classifier, despite its name)
 - Clustering, including K-Means and K-Means++
 - Model selection
 - Preprocessing, including Min-Max Normalization
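
As a small illustration of the preprocessing item, Min-Max normalization rescales each feature to the range [0, 1] using (x - min) / (max - min). The sketch below uses scikit-learn's `MinMaxScaler` on a toy matrix and compares it with the manual formula. The array `X_demo` and its values are made up for illustration and are not part of this notebook's data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature matrix: two columns on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

scaler = MinMaxScaler()                  # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X_demo)  # learns per-column min/max, then rescales

# Same result computed by hand, column by column: (x - min) / (max - min)
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

Scaling matters for distance-based methods such as KNN, because a feature with a large numeric range would otherwise dominate the distance.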

Machine learning has three main types:
 - Supervised machine learning (class/label/target is given)
 - Unsupervised machine learning (class/label/target is not given)
 - Reinforcement learning (reward-based)

<b>Supervised algorithms:</b>
   
   - <b>Classification algorithms</b> (like KNN, Decision Tree, Naive Bayes, Logistic Regression, etc.)
      
   - <b>Regression algorithms</b> (like linear regression, etc.)
      
      <br>
      
<b>Unsupervised algorithms:</b>
   - <b>Clustering algorithms</b> (like K-Means, K-Means++, etc.)
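
In scikit-learn, the difference between the two families shows up in how `fit` is called: a supervised estimator receives both features and labels, while an unsupervised one receives features only. The following is a minimal sketch, assuming the copy of the Iris data bundled with scikit-learn (`load_iris`) rather than the CSV file used later. The variable names and the choice of `n_neighbors=3` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Bundled Iris data: 150 rows, 4 features, integer labels 0/1/2
X_iris, y_iris = load_iris(return_X_y=True)

# Supervised: the labels y_iris are passed to fit()
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_iris, y_iris)
predicted_labels = clf.predict(X_iris[:5])

# Unsupervised: only the features are passed; the cluster ids are arbitrary
# numbers and do not correspond to the class labels
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X_iris)

print(predicted_labels)
print(cluster_ids[:5])
```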

# Supervised Learning Algorithms

## K-Nearest Neighbor

In [34]:
# Load libraries
import pandas
import numpy as np
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

In [4]:
# Let's apply the K-Nearest Neighbors algorithm to the Iris flower dataset
# Each row has four attributes (the first four columns) and a class label that gives the flower type, out of 3 types in total

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(r"C:\Users\sohai\Desktop\iris.data", names=names)

dataset

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [11]:
dataset.shape

(150, 5)

In [12]:
dataset.groupby('class').size()

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64

In [15]:
# X holds the first four columns of every row, i.e. sepal-length, sepal-width, petal-length, petal-width
# Y holds the fifth column (index 4) of every row, which is the class/label, i.e. Iris-setosa, Iris-versicolor or Iris-virginica

array = dataset.values
X = array[:,0:4]
Y = array[:,4]
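
One detail worth knowing: because the frame mixes numbers and strings, `dataset.values` produces an array of dtype `object`. Scikit-learn converts this internally, but selecting columns by name keeps the features numeric from the start. The sketch below shows this alternative on a hypothetical three-row stand-in for the dataset. The names `demo`, `X_demo` and `Y_demo` are illustrative only.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Iris DataFrame loaded above
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
demo = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
     [7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'],
     [6.3, 3.3, 6.0, 2.5, 'Iris-virginica']],
    columns=names,
)

# A mixed-type frame converts to a single object-dtype array
print(demo.values.dtype)

# Selecting by column name keeps the features as floats
X_demo = demo[names[:4]].to_numpy(dtype=float)
Y_demo = demo['class'].to_numpy()

print(X_demo.dtype, X_demo.shape)
print(Y_demo)
```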



In [16]:


t_size = 0.20   # test size is 0.2 (20 percent); the training size is automatically set to 1-0.2 = 0.8 (80 percent)
seed = 7        # setting the seed to a constant makes the random split identical on every run
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=t_size,random_state=seed) #dividing into training and testing


#check by printing what data came into these four variables

print("X_train: ", X_train)
print("Y_train : ", Y_train)
print("X_test : ", X_test)
print("Y_test : ", Y_test)
 



X_train:  [[6.2 2.8 4.8 1.8]
 [5.7 2.6 3.5 1.0]
 [4.6 3.6 1.0 0.2]
 [6.9 3.1 5.4 2.1]
 [6.4 2.9 4.3 1.3]
 [4.8 3.0 1.4 0.3]
 [5.5 3.5 1.3 0.2]
 [5.4 3.9 1.7 0.4]
 [5.1 3.5 1.4 0.3]
 [7.1 3.0 5.9 2.1]
 [6.7 3.3 5.7 2.1]
 [6.8 2.8 4.8 1.4]
 [6.4 2.8 5.6 2.2]
 [6.5 3.0 5.5 1.8]
 [5.7 3.0 4.2 1.2]
 [5.0 3.3 1.4 0.2]
 [6.7 3.1 4.4 1.4]
 [6.0 2.2 4.0 1.0]
 [6.4 2.7 5.3 1.9]
 [4.7 3.2 1.6 0.2]
 [4.6 3.1 1.5 0.2]
 [5.1 3.4 1.5 0.2]
 [7.7 3.8 6.7 2.2]
 [4.3 3.0 1.1 0.1]
 [6.3 3.3 6.0 2.5]
 [5.5 2.4 3.7 1.0]
 [5.0 2.0 3.5 1.0]
 [6.5 2.8 4.6 1.5]
 [5.0 3.4 1.6 0.4]
 [4.4 2.9 1.4 0.2]
 [5.0 3.5 1.6 0.6]
 [6.7 3.1 4.7 1.5]
 [7.3 2.9 6.3 1.8]
 [5.5 2.6 4.4 1.2]
 [5.2 2.7 3.9 1.4]
 [5.7 4.4 1.5 0.4]
 [7.2 3.2 6.0 1.8]
 [5.4 3.4 1.7 0.2]
 [5.8 4.0 1.2 0.2]
 [6.1 2.6 5.6 1.4]
 [5.7 2.5 5.0 2.0]
 [4.8 3.0 1.4 0.1]
 [6.5 3.0 5.8 2.2]
 [4.6 3.2 1.4 0.2]
 [6.6 2.9 4.6 1.3]
 [6.7 3.0 5.2 2.3]
 [6.1 3.0 4.6 1.4]
 [5.7 3.8 1.7 0.3]
 [7.0 3.2 4.7 1.4]
 [4.7 3.2 1.3 0.2]
 [6.5 3.0 5.2 2.0]
 [7.7 2.6 6.9 2.3]
 [
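
Printing the full arrays gets long, so checking the shapes is often a quicker sanity test of the split. The sketch below assumes the Iris copy bundled with scikit-learn (`load_iris`) instead of the CSV file, and the variable names are illustrative. It also shows the optional `stratify` argument, which keeps the class proportions equal in both parts. That is an addition for illustration, not something the cell above uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)   # 150 rows, 4 features

# Same settings as above: 20 percent held out for testing, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.20, random_state=7
)
print(X_tr.shape, X_te.shape)   # 120 training rows, 30 test rows

# Optional: stratify on the labels so each class is equally represented
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_all, y_all, test_size=0.20, random_state=7, stratify=y_all
)
print(np.bincount(y_te_s))      # number of test rows per class
```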

In [17]:

knn = KNeighborsClassifier(n_neighbors=2) # two nearest neighbors, i.e. K=2
knn.fit(X_train, Y_train)      # training the model
predictions = knn.predict(X_test)  # predicting the class of each test sample
print(accuracy_score(Y_test, predictions)) # fraction of predictions that match the actual labels


0.9333333333333333


In [10]:
# evaluation metrics other than accuracy
print(classification_report(Y_test, predictions))
print(confusion_matrix(Y_test, predictions))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.86      1.00      0.92        12
 Iris-virginica       1.00      0.82      0.90        11

       accuracy                           0.93        30
      macro avg       0.95      0.94      0.94        30
   weighted avg       0.94      0.93      0.93        30

[[ 7  0  0]
 [ 0 12  0]
 [ 0  2  9]]
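
The numbers in the report can be recomputed directly from the confusion matrix, where rows are the actual classes and columns are the predicted classes. The sketch below copies the matrix printed above into a NumPy array (the name `cm` is illustrative) and derives accuracy, per-class precision and per-class recall from it.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = np.array([[7, 0, 0],
               [0, 12, 0],
               [0, 2, 9]])

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()           # 28 / 30

# Precision: correct predictions over everything predicted as that class (column sums)
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions over everything that truly is that class (row sums)
recall = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 4))
print(precision.round(2))
print(recall.round(2))
```

For example, two Iris-virginica samples were predicted as Iris-versicolor, so the precision of Iris-versicolor is 12/14 and the recall of Iris-virginica is 9/11, matching the 0.86 and 0.82 in the report.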


## KNN with k-fold cross-validation instead of a train-test split

In [36]:
# Suppose we apply 5-fold cross-validation

knn = KNeighborsClassifier(n_neighbors=10)
accuracy = cross_val_score(knn, X, Y, cv=5)   # one accuracy score per fold
accuracy = np.mean(accuracy)                  # average over the 5 folds
print(accuracy)

0.9800000000000001
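
It can help to look at the individual fold scores before averaging them. The sketch below assumes the Iris copy bundled with scikit-learn (`load_iris`), so its numbers may differ slightly from those obtained with the CSV file. It also spells out the splitter: for a classifier, passing an integer `cv` makes `cross_val_score` use stratified folds, which the explicit `StratifiedKFold` call reproduces. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_cv, y_cv = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=10)

# One accuracy per fold; each fold serves as the test set exactly once
scores = cross_val_score(model, X_cv, y_cv, cv=5)

# Equivalent call with the splitter made explicit
explicit = cross_val_score(model, X_cv, y_cv, cv=StratifiedKFold(n_splits=5))

print(scores)
print(scores.mean(), scores.std())
print(np.allclose(scores, explicit))
```

Reporting the spread across folds alongside the mean gives a rough idea of how sensitive the accuracy is to the particular split.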


# Tasks

<b>Q1)</b> The Occupancy dataset contains four attributes, i.e. "Humidity, Light, CO2 and Humidity ratio". Apply KNN to predict occupancy (0 or 1) based on "Humidity, Light and Humidity Ratio" only. Train on "Occupancy_train.txt" and test on "Occupancy_test.txt".
Then do the following:
- Run the KNN algorithm for <b>n_neighbors</b> (K) from 1 to 10. This gives 10 different accuracies. Print all the accuracies, then print the highest accuracy and the value of K at which you got it.
- Run the KNN algorithm for different random seed values from 1 to 10. Print all the accuracies and then print the highest accuracy.

<b>Q2)</b> Now, instead of using the built-in library, write your own kNN classifier from scratch. Run it on the Iris dataset with an 80/20 split. Print the accuracy and the confusion matrix at the end. You must use the following chi-squared distance function:
![image.png](attachment:image.png)


