<a href="https://colab.research.google.com/github/medh01/Intro-to-machine-learning/blob/main/01_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Build a classification model</h1>

For this tutorial, we're going to use scikit-learn.<br>
Scikit-learn is a popular, widely used open-source machine learning library for Python.<br>
Google Colab comes with the main Python libraries preinstalled, so we don't have to install scikit-learn.<br>
To check which version of scikit-learn is installed, we execute the cell below.

In [4]:
pip show scikit-learn

Name: scikit-learn
Version: 1.2.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /usr/local/lib/python3.10/dist-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: fastai, imbalanced-learn, librosa, lightgbm, mlxtend, qudida, sklearn-pandas, yellowbrick


# Loading a Dataset
Scikit-learn has built-in datasets that are ready to use for machine learning.<br> Scikit-learn provides 6 different `toy datasets`.
* iris dataset (classification)
* diabetes dataset (regression)
* digits dataset (classification)
* the physical exercise Linnerud dataset (regression)
* the wine dataset (classification)
* the breast cancer wisconsin dataset (classification)

For this tutorial, we're going to use the `iris dataset`.<br>
**But how do we load a dataset in scikit-learn?**<br>
Each dataset has a corresponding loading function named after it, following the pattern `load_<name>()`.<br> For example, to load the wine dataset we use the `load_wine()` function.<br>
We find these functions in `sklearn.datasets`.
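
As a quick illustration of the naming pattern, here is a minimal sketch that loads the wine dataset the same way. It assumes scikit-learn is installed (as it is on Colab); the variable name `wine` is only illustrative.

```python
from sklearn.datasets import load_wine

# Same naming pattern as load_iris(): load_<name>()
wine = load_wine()

# The wine dataset has 178 samples, each described by 13 numeric features
print(wine.data.shape)
# The class labels of the wine dataset
print(list(wine.target_names))
```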


In [5]:
from sklearn.datasets import load_iris
iris = load_iris()
print(iris)

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

A good question to ask right now is: what is the return type of the load function?<br>
Execute the cell below to find out.

In [6]:
type(iris)

sklearn.utils._bunch.Bunch

So the return type of the load function is a `Bunch`.<br>
But what's a `Bunch`?<br>
According to the scikit-learn documentation, a `Bunch` is a container object exposing keys as attributes.<br>Bunch objects extend dictionaries by enabling values to be accessed by key, `bunch["KEYNAME"]`, or by attribute, `bunch.KEYNAME`.<br>So, a `Bunch` is essentially a dictionary whose values can also be read as attributes. <br>Let's take a look at the keys of our `Bunch` object.

In [7]:
print(iris.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


*   **data:** our features data
*   **target:** our target data
*   **DESCR:** a description of our dataset 
*   **feature_names:** the names of our features, in other words the column headers for our data
*   **target_names:** the names of the target classes: "setosa", "versicolor", "virginica"
*   **filename:** the CSV file that stores the data

To look at each key's value, we can type `iris.KEYNAME` or `iris["KEYNAME"]`.<br>
For example, to see the value of DESCR, we run the following line of code:

In [12]:
iris.DESCR
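
The two access styles can be compared directly. The sketch below reloads the dataset so it runs on its own; the name `same_object` is a hypothetical helper variable introduced only for this illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Attribute access and key access return the same underlying object
same_object = iris.data is iris["data"]
print(same_object)

# The feature matrix has one row per flower and one column per feature
print(iris.data.shape)
print(iris.feature_names)
print(iris.target_names)
```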





At this point, we should ask: why is our target data numerical instead of strings?<br>
The answer is that most machine learning algorithms operate on numbers, so text labels are encoded as numerical values.<br>
In our example:


*   0 represents setosa 
*   1 represents versicolor
*   2 represents virginica
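
To go back from the numeric codes to the species names, one option is to index `target_names` with the `target` array. This is a small sketch rather than a required step; `codes` and `species` are illustrative names.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# target holds integer codes; target_names holds the matching labels
codes = iris.target
# NumPy indexing with an integer array maps every code to its species name
species = iris.target_names[codes]

# Compare the first and last few codes with their decoded names
print(codes[:3], species[:3])
print(codes[-3:], species[-3:])
```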



# Visualize our dataset
If we want to print our dataset in a tabular format, we can use the `pandas` library.

In [14]:
import pandas as pd
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
print(df)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2   
1                  4.9               3.0                1.4               0.2   
2                  4.7               3.2                1.3               0.2   
3                  4.6               3.1                1.5               0.2   
4                  5.0               3.6                1.4               0.2   
..                 ...               ...                ...               ...   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

     target  
0         0  
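
As an alternative to building the DataFrame by hand, `load_iris` accepts an `as_frame=True` argument that returns pandas objects directly. The sketch below assumes pandas is installed; `iris_frame` and `df_alt` are illustrative names, not part of the tutorial's code.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects
iris_frame = load_iris(as_frame=True)

# .frame holds the four feature columns plus a "target" column
df_alt = iris_frame.frame
print(df_alt.shape)
print(df_alt.columns.tolist())
```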

Pandas provides a wide range of functions and methods to manipulate dataframes, for example:
* To print the first n rows of our dataframe we use the method `head(n)`<br>
* To print the last n rows of our dataframe we use the method `tail(n)`

In [16]:
print(df.head(10))

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   
5                5.4               3.9                1.7               0.4   
6                4.6               3.4                1.4               0.3   
7                5.0               3.4                1.5               0.2   
8                4.4               2.9                1.4               0.2   
9                4.9               3.1                1.5               0.1   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
5       0  
6       0  
7       0  
8       0  
9       0 

In [17]:
print(df.tail(10))

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
140                6.7               3.1                5.6               2.4   
141                6.9               3.1                5.1               2.3   
142                5.8               2.7                5.1               1.9   
143                6.8               3.2                5.9               2.3   
144                6.7               3.3                5.7               2.5   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

     target  
140       2  
141       2  
142       2  
143       2  
144       2  
145       2  
146       

# Building a classification model using a decision tree
A `Decision Tree` is a supervised machine learning algorithm that can be used for both regression and classification problems.
<br>In this tutorial, we're not going to explain how decision trees work; we're going to use one as a tool.

### Split our dataset into train/test datasets
The first thing we need to do is split our dataset in two: one part for training the model and the other for testing the performance of our model.<br>
Scikit-learn has a built-in function called `train_test_split()` that takes as parameters:
* our features data 
* our target data 
* optional arguments:
  + train_size: the size of our training dataset. If you provide a float, it should be between 0 and 1. For example, if train_size = 0.7,<br>70% of our dataset will be used for training the model and the rest (30%) will be used to test our model.
  + test_size: the size of our test dataset. You only need to provide one of train_size or test_size; the other is set to its complement. If neither is provided, test_size is 0.25 by default.
  + random_state: our random seed. By setting random_state, we ensure that the data is split the same way each time we run our code.
  + shuffle: a boolean value that determines whether the data should be shuffled before splitting or not (True by default).




In [26]:
# Import scikit-learn's train_test_split
from sklearn.model_selection import train_test_split

# Store the features data
X = iris.data
# Store the target data
y = iris.target
# Keep 80% of the rows for training; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=10, shuffle=True
)
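
To see what the split produces and what `random_state` buys us, the following sketch repeats the same call twice and compares the results. It reloads the data so it runs on its own; `identical` and the `*2` variables are names introduced only for this check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Same call as in the cell above: 80% train, 20% test, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=10, shuffle=True
)
# 150 rows are divided into 120 training rows and 30 test rows
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state yields the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, train_size=0.8, random_state=10, shuffle=True
)
identical = np.array_equal(X_test, X_test2) and np.array_equal(y_test, y_test2)
print(identical)
```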

### Build the decision tree model

In [None]:
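from sklearn.tree import DecisionTreeClassifier
# Create a decision tree classifier with default hyperparameters
model = DecisionTreeClassifier()
# Train the model on the training data
model.fit(X_train, y_train)
# Predict the class of each sample in the test set
y_pred = model.predict(X_test)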
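
The predictions are numeric class codes, so it can help to print a few of them as species names next to the true labels. The sketch below is self-contained and makes one assumption that is not in the cell above: it passes `random_state=0` to the classifier so that the fitted tree is the same on every run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.8, random_state=10, shuffle=True
)

# random_state=0 is an assumption added here for repeatable trees;
# the tutorial cell uses the default (unseeded) classifier
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Show the first few predictions next to the true labels, as species names
for predicted, actual in zip(y_pred[:5], y_test[:5]):
    print(iris.target_names[predicted], "|", iris.target_names[actual])
```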

### Evaluating performance