<a href="https://colab.research.google.com/github/nunocesarsa/SENSECO_School_2021/blob/main/ColabNotebooks/SENSECO_00_Your_%22First%22_ML_model_in_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Scikit-learn:

"Scikit-learn is a free software machine learning library for the Python programming language."

Apart from the Python libraries used for deep learning (e.g. Keras, PyTorch), Scikit-learn is one of the most commonly used ML libraries. AutoSklearn (the AutoML library we will use) is built on top of it.

Find more info about Scikit-learn at: 
- https://scikit-learn.org/stable/getting_started.html


# Packages & data

- In Colab we can install packages using a "special" command: !pip install "package name"


In [None]:
#this updates scikit-learn on Google Colab from 0.22.2.post1 to 0.23.2

#telling Colab to install a package - sometimes, installing a package will require "restarting runtime" to make it work.
!pip install scikit-learn==0.23.2

#and importing it
import sklearn

#importing packages that are used "here and there"

#general plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

#to control the color maps
from matplotlib import cm

#probably the most important packages
import numpy as np
import pandas as pd
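
After the install cell has run, it can help to check which version pip has actually installed. The sketch below uses only the standard library and assumes Python 3.8 or newer; `installed_version` is a hypothetical helper name, not part of any package. Note that it reports what is installed on disk: a module that was already imported keeps its old version until the runtime is restarted.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# prints the version found on disk, or None if the package is missing
print("scikit-learn:", installed_version("scikit-learn"))
```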

## Exploring a dataset

Adapted from: https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

Iris is a "famous" dataset that is used as an example in many tutorials, which is why we also use it for this first part.




We can load it as a NumPy object (NumPy is a Python library for working with multidimensional data).



In [None]:
from sklearn import datasets

#as a numpy object
iris = datasets.load_iris()
X = iris.data
y = iris.target

X
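
The object returned by `load_iris()` holds more than the two arrays used above. As a small sketch, the shape of the data and the stored column and class names can be inspected like this:

```python
from sklearn import datasets

iris = datasets.load_iris()

# data is a 2D array: one row per flower, one column per measurement
print(iris.data.shape)

# the column names and the class names are stored alongside the arrays
print(iris.feature_names)
print(iris.target_names)
```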

Or we can load it as a pandas DataFrame, which is a very similar format to the R data frame object.

In [None]:
#as a pandas dataframe (loading the dataset once and taking both parts from it)
iris_frame = datasets.load_iris(as_frame=True)
data = iris_frame['data']
target = iris_frame['target']

data
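
One advantage of the pandas format is that quick summaries take a single line. The following sketch counts the samples per class and averages each measurement within each class; the variable names are only illustrative.

```python
from sklearn import datasets

iris_frame = datasets.load_iris(as_frame=True)
data = iris_frame["data"]
target = iris_frame["target"]

# number of samples per class (0 = setosa, 1 = versicolor, 2 = virginica)
class_counts = target.value_counts().sort_index()
print(class_counts)

# mean of each measurement within each class
print(data.groupby(target).mean())
```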

### Plotting

Colormap reference: https://matplotlib.org/stable/tutorials/colors/colormaps.html



In [None]:
#Plotting the first and second column using numpy:
plt.figure(2, figsize=(8, 6))
plt.clf()

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

In [None]:
#Plotting the same, using pandas

data.plot(x='sepal length (cm)',y='sepal width (cm)',kind="scatter", #this part is specific to the pandas plot function
          figsize=(8, 6), #from here it's basically a repetition of the numpy version - because this plot function calls matplotlib
          colormap=plt.cm.Set1,
          c=y,
          edgecolor='k')


### And even a 3D plot

In [None]:
from sklearn.decomposition import PCA #I'm loading packages as "needed" but the best practice is to load everything together, first, in one place.


# To get a better understanding of the interaction of the dimensions,
# plot the first three PCA dimensions
fig = plt.figure(1, figsize=(8, 6))
# add_subplot attaches the 3D axes to the figure (recent matplotlib versions do not do this for Axes3D(fig))
ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])
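
To judge how much information the three plotted components keep, one option is to look at the explained variance ratio of the fitted PCA. This is a minimal sketch that refits the PCA on its own so that it runs independently of the plot:

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=3).fit(iris.data)

# fraction of the total variance captured by each of the three components
ratios = pca.explained_variance_ratio_
print(ratios)
print("total:", ratios.sum())
```

For Iris the first component alone carries most of the variance, which is why the classes already separate well along the first axis.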

# Machine learning


There are dozens of possible algorithms (although some of them are different versions of the same one). Explore them here:

https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

We'll just try out a few. Scikit-learn keeps classification and regression models in separate classes, so you have to import the one that matches your task.




In [7]:
# import the classifier for random forests
from sklearn.ensemble import RandomForestClassifier
# import support vector machine - for classifications
from sklearn.svm import SVC
# importing neural networks - MLP stands for "Multi-layer perceptron" which is a technical name for the more basic types of neural networks
from sklearn.neural_network import MLPClassifier
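
The separation between classification and regression can be illustrated with the random forest, which exists in both flavours. A small sketch, using the `is_classifier` and `is_regressor` helpers from `sklearn.base`:

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier()  # predicts discrete classes
reg = RandomForestRegressor()   # predicts continuous values

print(is_classifier(clf), is_regressor(clf))  # True False
print(is_classifier(reg), is_regressor(reg))  # False True
```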

In ML there are often very complex methods for validation, but because we only have 150 samples in this dataset, we can't really do much with them.

Scikit-learn, again, offers a number of tools for dealing with this, e.g. folding: https://scikit-learn.org/0.16/modules/classes.html#module-sklearn.cross_validation (in current versions these tools live in the sklearn.model_selection module).
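
As a rough idea of what folding looks like, here is a sketch of 5-fold cross-validation. It assumes a scikit-learn version in which `cross_val_score` is imported from `sklearn.model_selection` (true for 0.23.2); the choice of model and of 5 folds is only for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# 5 folds: each fold is used once for testing while the other 4 are used for training
# random_state is fixed here only to make the sketch repeatable
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

print(scores)
print("mean accuracy:", scores.mean())
```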

This time, though, we will make use of pandas' built-in functions for simplicity.



In [8]:
train_data = data.sample(frac=0.8,random_state=200)
test_data  = data.drop(train_data.index)

#select the targets by index so that each label stays aligned with its (shuffled) row in train_data
train_target = target.loc[train_data.index]
test_target  = target.loc[test_data.index]
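
The same kind of split can also be made with scikit-learn itself. The sketch below uses `train_test_split` and, as an extra assumption not present in the pandas version, asks for a stratified split so that every class is equally represented in the test set. The names `X_train`, `X_test`, `y_train` and `y_test` are chosen here so they do not overwrite the variables above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris_frame = datasets.load_iris(as_frame=True)
data, target = iris_frame["data"], iris_frame["target"]

# 80% training, 20% testing; stratify keeps the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=200, stratify=target)

print(X_train.shape, X_test.shape)
# number of test samples per class
print(np.bincount(y_test))
```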

## Step 1: Create model object

Random forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Support vector machine: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

Artificial neural network: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html


In [9]:
mdl_RFC = RandomForestClassifier()
mdl_SVC = SVC()
mdl_MLP = MLPClassifier()
mdl_MLP 

MLPClassifier()

The model hyperparameters are one of the most important components of machine learning applications. They can vastly affect your model results, and it's highly recommended to always explore the effect of different parameters.

The current values can be inspected using the following commands, and they can be set using either a direct expression (keyword arguments) or a dictionary.




In [None]:
mdl_RFC.get_params()
#mdl_SVC.get_params()
#mdl_MLP.get_params()

In [None]:
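mdl_RFC_2 = RandomForestClassifier(n_estimators=250) #direct expression
mdl_RFC_3 = RandomForestClassifier(**{"n_estimators":500}) #dictionary, unpacked with **

#a notebook cell only displays its last expression, so we print both
print(mdl_RFC_2.get_params())
print(mdl_RFC_3.get_params())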
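
Hyperparameters can also be changed on a model object that already exists, using `set_params`. A minimal sketch; the values 250 and 5 are arbitrary and only for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

mdl = RandomForestClassifier()
print(mdl.get_params()["n_estimators"])

# set_params updates the hyperparameters in place; the model must be (re)fitted afterwards
mdl.set_params(n_estimators=250, max_depth=5)
print(mdl.get_params()["n_estimators"], mdl.get_params()["max_depth"])
```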

## Step 2: Fit the model

In [None]:
mdl_RFC.fit(train_data,train_target)
mdl_SVC.fit(train_data,train_target)
mdl_MLP.fit(train_data,train_target)
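
Once a model is fitted, it can be asked to classify measurements it has not seen. The sketch below is self-contained and, for brevity, trains on the full dataset; the measurements of the "new" flower are made up for illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris_frame = datasets.load_iris(as_frame=True)
data, target = iris_frame["data"], iris_frame["target"]

# random_state is fixed only to make the sketch repeatable
mdl = RandomForestClassifier(random_state=0).fit(data, target)

# one new flower, using the same column names as the training data
new_flower = pd.DataFrame([[5.0, 3.4, 1.5, 0.2]], columns=data.columns)

predicted_class = mdl.predict(new_flower)[0]
class_probabilities = mdl.predict_proba(new_flower)[0]

print(predicted_class, iris_frame["target_names"][predicted_class])
print(class_probabilities)
```

`predict` returns the most likely class, while `predict_proba` returns one probability per class.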

## Step 3: Test the models

Scikit-learn has, again, "dozens!" of methods for testing your accuracy. Naturally, the usefulness of each depends on your application.

Take a look: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

In [None]:
from sklearn.metrics import accuracy_score

print("Random forest: " +  str(accuracy_score(test_target,mdl_RFC.predict(test_data))))
print("Support vector machines: " +  str(accuracy_score(test_target,mdl_SVC.predict(test_data))))
print("Artificial neural networks: " +  str(accuracy_score(test_target,mdl_MLP.predict(test_data))))
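
Accuracy is a single number, so it hides which classes get confused with each other. One of the other metrics, the confusion matrix, shows this. The sketch below is self-contained and makes its own stratified split with `train_test_split`, which is an assumption of this example rather than the split used above.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=200, stratify=y)

predictions = SVC().fit(X_train, y_train).predict(X_test)

# rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, predictions)
print(cm)

# accuracy is the diagonal (correct predictions) divided by the total
print(cm.trace() / cm.sum(), accuracy_score(y_test, predictions))
```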



The results can differ from model to model, and also between runs, because the random forest and the neural network both involve randomness.

What happens, though, if we change some of the parameters?

In [None]:
mdl_RFC_2 = RandomForestClassifier(n_estimators=200) #doubling the number of trees (the default is 100)
mdl_SVC_2 = SVC(kernel='sigmoid') # changing the kernel from the default ('rbf') to the sigmoid type
mdl_MLP_2 = MLPClassifier(hidden_layer_sizes=(20,20))  # changing the neural network structure from one layer with 100 neurons to two layers with 20 neurons each


mdl_RFC_2.fit(train_data,train_target)
mdl_SVC_2.fit(train_data,train_target)
mdl_MLP_2.fit(train_data,train_target)

print("Random forest: " +  str(accuracy_score(test_target,mdl_RFC_2.predict(test_data))))
print("Support vector machines: " +  str(accuracy_score(test_target,mdl_SVC_2.predict(test_data))))
print("Artificial neural networks: " +  str(accuracy_score(test_target,mdl_MLP_2.predict(test_data))))
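
To see what `hidden_layer_sizes` does to the network, one can look at the shapes of the fitted weight matrices. This is a sketch on the full dataset; `max_iter` and `random_state` are set here only to help convergence and make it repeatable.

```python
from sklearn import datasets
from sklearn.neural_network import MLPClassifier

X, y = datasets.load_iris(return_X_y=True)

mdl = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=2000, random_state=0).fit(X, y)

# one weight matrix per connection between consecutive layers:
# 4 inputs -> 20 neurons -> 20 neurons -> 3 output classes
layer_shapes = [w.shape for w in mdl.coefs_]
print(layer_shapes)
```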

Changing the hyperparameters can substantially change the results of the classification.
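
Instead of changing parameters by hand, the search can be automated. The following is a minimal sketch using `GridSearchCV` on the support vector machine; the candidate values in `param_grid` are illustrative assumptions, not recommendations.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# candidate values to try: 3 kernels x 3 values of C = 9 combinations
param_grid = {"kernel": ["rbf", "sigmoid", "linear"], "C": [0.1, 1, 10]}

# every combination is scored with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The best combination is chosen by cross-validated accuracy, so it is worth keeping a separate test set aside if an unbiased final score is needed.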