# Lab Session 1 - Introduction to Machine Learning

In this lab session, we will walk through a simple machine learning application and create our first model. We will use the **Fruit Dataset** to build a classifier that predicts the fruit type (apple, mandarin, orange, or lemon).

## Import required modules

In [3]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D

## First Things First: Look at Your Data

### Question 0

Scikit-learn works with lists, NumPy arrays, SciPy sparse matrices, and pandas DataFrames. A DataFrame often lets you do more with less code, so let's practice creating a classifier with a pandas DataFrame.

**Using `read_csv`, create a DataFrame. Keep in mind that the dataset file `fruits.txt` should be in the same folder as your Python file.**

In [4]:
fruits = pd.read_csv("fruits.txt")

### Question 1

How many data points (**Number of Instances**) and features (**Number of Attributes**) does the fruit dataset have?

(Hint: use `shape`)

In [11]:
fruits.shape

(59, 6)

### Question 2
What is the class distribution? (i.e. how many instances of `apple`, `mandarin`, `orange`, and `lemon`)

(Hint: use `value_counts()`)

In [9]:
fruits["name"].value_counts()

apple       19
orange      19
lemon       16
mandarin     5
Name: name, dtype: int64
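
The counts above show that the classes are not equally represented. As a minimal sketch, the same counts can be turned into class proportions, which also gives the accuracy a classifier would reach by always predicting the most frequent class. The variable names are illustrative and not part of the lab; the counts are copied from the output above.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"apple": 19, "orange": 19, "lemon": 16, "mandarin": 5}

total = sum(class_counts.values())

# Fraction of the dataset that belongs to each class.
proportions = {name: count / total for name, count in class_counts.items()}

# Accuracy of a baseline that always predicts the most frequent class.
majority_baseline = max(class_counts.values()) / total

for name, share in proportions.items():
    print(f"{name:<9}{share:.2f}")
print(f"majority-class baseline: {majority_baseline:.2f}")
```

Any classifier trained later should score above this baseline to be considered useful.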

### Question 3

Using `head`, display the first 8 instances of the fruit dataset.

In [12]:
fruits.head(8)

| | name | subtype | mass | width | height | color_score |
|---|---|---|---|---|---|---|
| 0 | apple | granny_smith | 192 | 8.4 | 7.3 | 0.55 |
| 1 | apple | granny_smith | 180 | 8.0 | 6.8 | 0.59 |
| 2 | apple | granny_smith | 176 | 7.4 | 7.2 | 0.60 |
| 3 | mandarin | mandarin | 86 | 6.2 | 4.7 | 0.80 |
| 4 | mandarin | mandarin | 84 | 6.0 | 4.6 | 0.79 |
| 5 | mandarin | mandarin | 80 | 5.8 | 4.3 | 0.77 |
| 6 | mandarin | mandarin | 80 | 5.9 | 4.3 | 0.81 |
| 7 | mandarin | mandarin | 76 | 5.8 | 4.0 | 0.81 |


## Building a Model

### Question 4
Split the DataFrame into `X` (the data) and `y` (the labels).

*After the split:*
* `X` *should have shape* `(59, 3)`
* `y` *should have shape* `(59,)`

**For this example, use only the `mass`, `width`, and `height` features of each fruit instance.**

In [14]:
X = fruits[['mass', 'width', 'height']]
y = fruits['name']

print("X has shape", X.shape)
print("y has shape", y.shape)

X has shape (59, 3)
y has shape (59,)
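
The two shapes differ because of how the columns are selected: indexing with a list of column names returns a two-dimensional DataFrame, while indexing with a single column name returns a one-dimensional Series. The sketch below shows this on a hypothetical three-row frame (`demo`, built from rows displayed earlier), not on the full dataset.

```python
import pandas as pd

# Hypothetical miniature frame using three rows from the head() output above.
demo = pd.DataFrame({
    "name": ["apple", "mandarin", "mandarin"],
    "mass": [192, 86, 84],
    "width": [8.4, 6.2, 6.0],
    "height": [7.3, 4.7, 4.6],
})

X_demo = demo[["mass", "width", "height"]]  # list of names -> DataFrame (2-D)
y_demo = demo["name"]                       # single name -> Series (1-D)

print(type(X_demo).__name__, X_demo.shape)
print(type(y_demo).__name__, y_demo.shape)
```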


### Question 5
Using `train_test_split`, split `X` and `y` into training and test sets (`X_train`, `X_test`, `y_train`, and `y_test`).

**Set the random number generator state to 0 using `random_state=0`**

The call returns four objects, in the order `X_train, X_test, y_train, y_test`.
Print the shape of each of these 4 elements.


In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("X_train has shape", X_train.shape)
print("X_test has shape", X_test.shape)
print("y_train has shape", y_train.shape)
print("y_test has shape", y_test.shape)

X_train has shape (44, 3)
X_test has shape (15, 3)
y_train has shape (44,)
y_test has shape (15,)
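
The 44/15 split follows from the default test fraction. As a sketch of the arithmetic, assuming `train_test_split` is called without `test_size` or `train_size` (so the test fraction defaults to 0.25 and the test count is rounded up):

```python
import math

n_samples = 59            # rows in the fruit dataset
default_test_size = 0.25  # scikit-learn's default when neither size is given

n_test = math.ceil(default_test_size * n_samples)  # 14.75 rounds up to 15
n_train = n_samples - n_test                       # the rest is used for training

print(n_train, n_test)
```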


## Building Your First Model: k-Nearest Neighbors

### Question 6
Using `KNeighborsClassifier`, create a classifier object that uses five nearest neighbors (`n_neighbors=5`).

*The result should be a `KNeighborsClassifier` instance (from `sklearn.neighbors`).*

In [16]:
knn = KNeighborsClassifier(n_neighbors=5)
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')


Using your classifier object `knn`, train the classifier on `X_train` and `y_train` (fit the estimator).

In [17]:
result = knn.fit(X_train, y_train)

### Question 7
Use the trained k-NN classifier model to classify new, previously unseen objects.

**Use the following inputs:**
* a fruit with mass `20 g`, width `4.3 cm`, height `5.5 cm`
* a small fruit with mass `100 g`, width `6.3 cm`, height `8.5 cm`


In [18]:
fruit_prediction = knn.predict([[20, 4.3, 5.5]])
print(fruit_prediction)
fruit_prediction = knn.predict([[100, 6.3, 8.5]])
print(fruit_prediction)

['mandarin']
['lemon']
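
To make the prediction step concrete, here is a minimal from-scratch sketch of k-nearest-neighbors voting with Euclidean distance. It is not scikit-learn's implementation: the helper `knn_predict` is hypothetical, and the training points are only the eight rows shown in the `head` output, so predictions for other inputs can differ from those of the lab's model.

```python
import math
from collections import Counter

# ((mass, width, height), label), copied from the first eight rows shown earlier.
train_points = [
    ((192, 8.4, 7.3), "apple"),
    ((180, 8.0, 6.8), "apple"),
    ((176, 7.4, 7.2), "apple"),
    ((86, 6.2, 4.7), "mandarin"),
    ((84, 6.0, 4.6), "mandarin"),
    ((80, 5.8, 4.3), "mandarin"),
    ((80, 5.9, 4.3), "mandarin"),
    ((76, 5.8, 4.0), "mandarin"),
]

def knn_predict(query, points, k=3):
    """Return the majority label among the k training points closest to query."""
    by_distance = sorted(points, key=lambda item: math.dist(query, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

print(knn_predict((20, 4.3, 5.5), train_points))   # a very light fruit
print(knn_predict((185, 8.0, 7.0), train_points))  # a heavy fruit
```

Because mass is measured in grams while width and height are in centimeters, the mass differences dominate the Euclidean distance in this sketch.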


## Evaluating the Model

### Question 8
We can measure how well the model works by computing its accuracy on the test data. This is the fraction of fruits for which the right fruit type was predicted.

**Use `score` to estimate the accuracy of the classifier on future data, using the test data.**

In [19]:
test_score = knn.score(X_test, y_test)
print("Test set score: {:.2f}".format(test_score))

Test set score: 0.53
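
Accuracy is the number of correct predictions divided by the number of test samples. A small sketch with hypothetical labels (the `accuracy` helper and both lists are illustrative, not the lab's actual predictions):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical true and predicted labels for five test fruits.
y_true = ["apple", "lemon", "orange", "mandarin", "lemon"]
y_pred = ["apple", "orange", "orange", "mandarin", "apple"]

print("Accuracy: {:.2f}".format(accuracy(y_true, y_pred)))
```

With 15 test samples, a score that prints as 0.53 corresponds to 8 correct predictions (8/15 ≈ 0.533).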


## Improving the Model

### Question 9
Try to improve the accuracy by changing the number of neighbors. What is the optimal number of neighbors?

Now try adding distance weighting by changing the default value of the `weights` parameter (as described at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

What is the best accuracy you get? What is the optimal number of neighbors with distance weighting? 
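
With `weights='distance'`, each neighbor's vote is weighted by the inverse of its distance, so closer neighbors count more than distant ones. A minimal sketch of that idea, using hypothetical neighbor distances and a hypothetical helper `weighted_vote` (this is not scikit-learn's code):

```python
from collections import Counter, defaultdict

def weighted_vote(neighbors):
    """Pick the label with the largest sum of 1/distance over the neighbors.

    neighbors: list of (distance, label) pairs with distance > 0.
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        totals[label] += 1.0 / distance
    return max(totals, key=totals.get)

# Hypothetical five nearest neighbors of some query point.
neighbors = [(0.5, "lemon"), (2.0, "orange"), (2.5, "orange"),
             (3.0, "orange"), (4.0, "lemon")]

# Uniform weighting: every neighbor counts as one vote.
uniform_winner = Counter(label for _, label in neighbors).most_common(1)[0][0]

print(uniform_winner, weighted_vote(neighbors))
```

In this example the uniform vote picks the label held by three of the five neighbors, while the weighted vote follows the single very close neighbor.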

In [27]:
knn = KNeighborsClassifier(n_neighbors=7, weights='distance')
result = knn.fit(X_train, y_train)
test_score = knn.score(X_test, y_test)

print("Test set score: {:.2f}".format(test_score))

Test set score: 0.67
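
Rather than trying values by hand, the search over `n_neighbors` and `weights` can be written as a loop. The sketch below assumes synthetic stand-in data from `make_blobs` so that it runs without `fruits.txt`; the names `X_demo`, `y_demo`, and `scores` are illustrative, and the scores it prints say nothing about the fruit dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 60 points, 3 features, 4 classes.
X_demo, y_demo = make_blobs(n_samples=60, centers=4, n_features=3,
                            cluster_std=3.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Test accuracy for every combination of weighting scheme and k.
scores = {}
for weights in ("uniform", "distance"):
    for k in range(1, 11):
        model = KNeighborsClassifier(n_neighbors=k, weights=weights)
        model.fit(X_tr, y_tr)
        scores[(weights, k)] = model.score(X_te, y_te)

best_weights, best_k = max(scores, key=scores.get)
print(best_weights, best_k, "{:.2f}".format(scores[(best_weights, best_k)]))
```

Because the same test set is used both to choose the settings and to report the score, the best score found this way tends to be optimistic.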
