# __Introduction to Python for Data Science__
## _CSE Mentor Program - University of Colorado, Denver. Spring-2019_

This workshop introduces Python to undergraduate and graduate students in the context of data science techniques.

Over three sessions we will cover the basics of the Python language, the use of Pandas to access and manipulate data, and the Scikit-Learn library to do some basic analysis.

# Session 3 - Introduction to Scikit-Learn
In this session we will focus on Scikit-Learn, a library that specializes in machine learning and data science tools.
We will cover only a few models in this workshop. They illustrate the process of doing data analysis, but we will not cover the formal model behind each algorithm.

<hr/>


In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Scikit-Learn

Now we will use the library to classify or predict new values, using models (some of them probabilistic) fitted to a historical training sample.

In [None]:
from sklearn import datasets  # contains sample datasets
import sklearn.metrics as metrics  # a module to measure how good our model is

## 1. Classifying Flowers using Gaussian Naive-Bayes

In machine learning, Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

A feature is any single characteristic recorded in our dataset, such as age, gender, height, or weight.

__Gaussian Naive-Bayes relies on the following equation__
\begin{equation*}
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)
\end{equation*}
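
The equation can be evaluated directly. The following is a minimal sketch, assuming that $\mu_y$ and $\sigma^2_y$ are estimated as the mean and variance of one feature among the training samples of class $y$. The helper `gaussian_likelihood` is a hypothetical name introduced only for this illustration; it is not part of Scikit-Learn.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

def gaussian_likelihood(x, mu, var):
    """Gaussian density of x for a feature with mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Likelihood of the first sample's sepal length under each class.
feature = 0
sample_value = X[0, feature]
for label in np.unique(y):
    values = X[y == label, feature]      # this feature, restricted to one class
    mu, var = values.mean(), values.var()
    p = gaussian_likelihood(sample_value, mu, var)
    print("P(x={} | y={}) = {:.4f}".format(sample_value, iris.target_names[label], p))
```

The class whose mean lies closest to the sample value (relative to its variance) gets the largest likelihood. Naive Bayes multiplies these per-feature likelihoods together with the class prior to score each class.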

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
iris = datasets.load_iris()

In [None]:
print(iris['DESCR'])
print("="*30)
print("There are {} iris data points.".format(len(iris.data)))
print("Features",iris.feature_names)
print("Data",iris.data[:5])
print("Target Names:",iris.target_names)
print("Target",iris.target[:5])

<hr>

#### Create a model and train it. 

In [None]:
gnb = GaussianNB()  #Gaussian Naive-Bayes

## Fit the Model
gnb.fit(iris.data, iris.target)

# Predict the iris type
type_predictions = gnb.predict(iris.data)

type_predictions

In [None]:
for i in range(45,80):
    real_type      = iris.target_names[iris.target[i]]
    predicted_type = iris.target_names[type_predictions[i]]
    
    text = "Specimen "+str(i)+ " predicted " + predicted_type + " real type is " + real_type +"."
    
    if real_type != predicted_type:
        text += " The Prediction was NOT CORRECT."
    
    print (text)


In [None]:
number_of_points  = iris.data.shape[0]  # is the same as len(iris.data)
mislabeled_points = (iris.target != type_predictions).sum()
print("Out of a total of {} points, {} were mislabeled.".format(number_of_points, mislabeled_points))

## 2. Metrics on the predictions

#### Accuracy
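Accuracy is the fraction of predictions that match the true label: (tp + tn) / (tp + tn + fp + fn), where tp, tn, fp and fn are the numbers of true positives, true negatives, false positives and false negatives. The best value is 1 and the worst value is 0. Precision, which the F1 score also uses, is the ratio tp / (tp + fp).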

#### Recall
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The best value is 1 and the worst value is 0.

#### F1
The F1 score is a measure of a test's accuracy. It combines the precision p and the recall r of the test into a single score, their harmonic mean:

\begin{equation*}
F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
\end{equation*}

<img width=30% src="./files/recall.png">
<img width=50% src="./files/accuracy_and_all.png">
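
As a worked example of these definitions, the sketch below computes each metric by hand from the four counts. The labels are hypothetical values made up for the illustration; they do not come from the iris data.

```python
import numpy as np

# Hypothetical binary labels, used only to illustrate the formulas.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print("tp={} fp={} fn={} tn={}".format(tp, fp, fn, tn))
print("accuracy={} precision={} recall={} f1={}".format(accuracy, precision, recall, f1))
```

Here one positive sample is missed and one negative sample is wrongly flagged, so precision and recall are both 3/4 while accuracy is 8/10.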


In [None]:
true_values = iris.target
print("Accuracy is {}".format(metrics.accuracy_score(true_values, type_predictions)))
print("Recall is {}".format  (metrics.recall_score  (true_values, type_predictions, average="weighted", labels=np.unique(true_values) ) ) )
print("F1-Score is {}".format(metrics.f1_score      (true_values, type_predictions, average="weighted", labels=np.unique(true_values) ) ) )

### What are the issues in our code?

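Well, we are training our model with all our data and then evaluating it on that same data, so the accuracy and recall are expected to be high: the model has already seen every sample it is asked to predict. These scores cannot reveal __overfitting__, which is when a model fits its training data well but performs poorly on new data.

We usually split our sample into a train dataset and a test dataset to get a better picture of what is going on.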
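
One common way to make such a split is Scikit-Learn's `train_test_split`. The sketch below is only an illustration: the 20% test size, the `random_state` value and the use of `stratify` are choices made for this example, not something prescribed by the workshop.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import sklearn.metrics as metrics

iris = datasets.load_iris()

# Hold out 20% of the samples. The split is shuffled, and stratify keeps
# the class proportions the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target)

gnb_split = GaussianNB()
gnb_split.fit(X_train, y_train)
test_predictions = gnb_split.predict(X_test)

print("Train size:", len(X_train), "Test size:", len(X_test))
print("Accuracy on held-out data is {}".format(metrics.accuracy_score(y_test, test_predictions)))
```

Shuffling matters for iris because the samples are stored ordered by class, so a plain slice of the last rows contains a single species.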


In [None]:
len(iris.data)

In [None]:
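gnb = GaussianNB()  # Gaussian Naive-Bayes

# Fit the model on the first 120 samples
gnb.fit(iris.data[:120], iris.target[:120])

# Predict the iris type of the remaining 30 samples.
# Note: the iris samples are ordered by class, so these 30 all belong to the last class.
type_predictions = gnb.predict(iris.data[120:])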


In [None]:
true_values = iris.target[120:]
print("Accuracy is {}".format(metrics.accuracy_score(true_values, type_predictions) ))
print("Recall is {}".format(  metrics.recall_score  (true_values, type_predictions, average="weighted", labels=np.unique(true_values)) ))
print("F1-Score is {}".format(metrics.f1_score      (true_values, type_predictions, average="weighted", labels=np.unique(true_values)) ))

## Handwriting Recognition using SVM

A support vector machine (SVM) is a supervised machine learning algorithm for classification.
Given a set of training examples, each marked as belonging to a category, an SVM training algorithm builds a model that assigns new examples to a category, making it a non-probabilistic binary linear classifier.

In other words, a linear SVM separates the points of two categories with a line (a hyperplane when there are more than two features). With a kernel, the separating boundary can be non-linear in the original feature space.

<span style="border:5px solid black"><img src="./files/Kernel_Machine.png" width=40% style="width:20%;border:5px solid black"></span>
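
To make the idea of a separating line concrete, here is a minimal sketch of a linear SVM on a tiny two-class dataset. The points are hypothetical values chosen to be clearly separable, and `clf_linear` is a name used only in this example.

```python
import numpy as np
from sklearn import svm

# Hypothetical, linearly separable 2-D points: four per class.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.0, 2.5],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0], [6.0, 6.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf_linear = svm.SVC(kernel="linear", C=1.0)
clf_linear.fit(X, y)

# With a linear kernel the boundary is the line w . x + b = 0.
w = clf_linear.coef_[0]
b = clf_linear.intercept_[0]
print("w =", w, "b =", b)

# New points are classified by the side of the line they fall on.
print("Prediction for (1.2, 1.8):", clf_linear.predict([[1.2, 1.8]])[0])
print("Prediction for (5.8, 5.2):", clf_linear.predict([[5.8, 5.2]])[0])
```

The `coef_` attribute is only available for the linear kernel. With other kernels, such as the default RBF kernel, the boundary has no such simple form in the original feature space.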



In [None]:
from sklearn import datasets
from sklearn import svm

In [None]:
digits = datasets.load_digits()
print(digits['DESCR'])
print("="*30)
print("There are {} digits data points.".format(len(digits.data)))
print("Data",digits.data[:5])
print("Target Names:",digits.target_names)
print("Target",digits.target[:5])


In [None]:
imageId = -1
img2 = digits.images[imageId]   # each image is already an 8x8 array
plt.imshow(img2, cmap=plt.cm.gray_r)   # gray_r is the reversed gray map, so higher values are drawn darker
plt.show()
print("The image represents", digits.target[imageId])

In [None]:
#
# Creating our SVM model 
#
clf = svm.SVC(gamma=0.001, C=100.)   #C-Support Vector Classification.

In [None]:
print("The array has",len(digits.data),"images")

In [None]:
#
# TRAINING PHASE
#
#using the first 1500 images for training the model and the rest for testing.

clf.fit(digits.data[:1500], digits.target[:1500])

In [None]:
#
#  PREDICTION PHASE
#
def predict(imageID,display=False):
    if display:
        img2 = (digits.images[imageID]).reshape([8,8])
        plt.imshow(img2,cmap=plt.cm.gray_r)
        plt.show()
    prediction = clf.predict(digits.data[imageID:imageID+1])[0]
    text = "it is predicted to be a:"+str(prediction)+", when it really is a: "+str(digits.target[imageID:imageID+1][0])
    return prediction == digits.target[imageID:imageID+1][0], text

In [None]:
for i in range(1500, len(digits.data)):   # every image that was not used for training
    ok, text = predict(i)
    if not ok:
        print(">>> id: "+str(i)+" <<<",text)

In [None]:
predict(1573,True)

## Clustering for Wine Classification.

Clustering is a technique that tries to separate the dataset into _n_ subsets. For each subset it defines a center that represents the characteristics of the subset. Each point in the dataset is assigned to the subset (cluster) whose center is closest to it.

There are several techniques to implement clustering. The most widely used are K-Means (and its variants) and DBSCAN.


<span style="border:5px solid black"><img src="./files/k-means.jpg" width=30% style="width:50%;border:5px solid black"></span>
<span style="border:5px solid black"><img src="./files/dbscan.png" width=30% style="width:50%;border:5px solid black"></span>
<span style="border:5px solid black"><img src="./files/dbscan2.png" width=30% style="width:50%;border:5px solid black"></span>
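
The assignment rule described above can be sketched in a few lines of NumPy. The points and centers are hypothetical values picked for the illustration. The final update step, where each center moves to the mean of its points, is how standard K-Means refines the centers; it is shown here as one iteration only, not as a full implementation.

```python
import numpy as np

# Hypothetical 2-D points and cluster centers, chosen only for illustration.
points  = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0]])
centers = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]])

# Euclidean distance from every point to every center: shape (n_points, n_centers).
distances = np.linalg.norm(points[:, np.newaxis, :] - centers[np.newaxis, :, :], axis=2)

# Assignment step: each point goes to its closest center.
assignments = distances.argmin(axis=1)
print("Assignments:", assignments)

# Update step: each center moves to the mean of the points assigned to it.
new_centers = np.array([points[assignments == k].mean(axis=0) for k in range(len(centers))])
print("Updated centers:\n", new_centers)
```

K-Means repeats these two steps until the assignments stop changing.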



In [None]:
wine = datasets.load_wine()
print(wine["DESCR"])
print("="*30)
print("There are {} wine sample data points.".format(len(wine.data)))
print("Data",wine.data[:5])
print("Target Names:",wine.target_names)
print("Target",wine.target[:5])

In [None]:
from sklearn.cluster import KMeans
import numpy as np

In [None]:
print("Wine Features")
print("-------------")
for i,data in enumerate(wine.feature_names):
    print(i,data)

In [None]:
#
# Let's use the first two features only (alcohol & malic_acid)
#
wine2d = wine.data[:, :2]
wine2d[:10]

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(wine2d)
print("Labels\n",kmeans.labels_)
print("Centers\n",kmeans.cluster_centers_)

In [None]:
x = wine2d[:, 0]   # first feature (alcohol)
y = wine2d[:, 1]   # second feature (malic_acid)
colorLabel = kmeans.labels_
colorMap   = matplotlib.colors.ListedColormap(['green','blue','orange'])

centroids_x = kmeans.cluster_centers_[:, 0]
centroids_y = kmeans.cluster_centers_[:, 1]



fig = plt.figure(1, figsize=(8,8))     # Figure 8 inches wide, 8 inches tall.

# A single subplot (1 row, 1 column, first position)
ax=fig.add_subplot(1, 1, 1)

ax.set_xlabel(wine.feature_names[0])
ax.set_ylabel (wine.feature_names[1])

ax.scatter(x,y, c=colorLabel, cmap=colorMap , s=20 )
ax.scatter(centroids_x,centroids_y, s=100, c="red")

ax.scatter([13,14,12.75],[5,1,1], s=100, c="black",marker="^")
ax.annotate("(13,5)",(13,5))
ax.annotate("(14,1)",(14,1))
ax.annotate("(12.75,1)",(12.75,1))
ax.minorticks_on()

plt.ylim(bottom=0)
plt.xlim(left=10,right=16)
plt.title("My Scatter Plot")

custom_lines = [matplotlib.lines.Line2D([0], [0], marker='o', color="green", lw=0),
                matplotlib.lines.Line2D([0], [0], marker='o', color="blue", lw=0),
               matplotlib.lines.Line2D([0], [0], marker='o', color="orange", lw=0)]

ax.legend(custom_lines, ['Cluster_1','Cluster_2','Cluster_3'])

plt.show()


In [None]:
predict_classes = [[13,5],[14,1],[12.75,1]]
for value in predict_classes:
    predicted_class = kmeans.predict([value])[0]
    centroid = kmeans.cluster_centers_[predicted_class]
    print("Kmeans predicts that point {:^11} belongs to class {}, the cluster with center: {}".format(str(value), predicted_class, centroid))

In [None]:
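print('-'*120)
# Note: cluster ids from K-Means are arbitrary, so this score is only meaningful if the ids happen to match wine.target.
print("Using Kmeans with 3 clusters, with 2 features, we achieved an accuracy of {}".format(metrics.accuracy_score( wine.target,kmeans.labels_)))
print('-'*120)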
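
Because K-Means numbers its clusters arbitrarily, comparing cluster ids to class labels with `accuracy_score` can understate (or misstate) how well the clusters match the classes. The sketch below illustrates this with hypothetical labels and shows one permutation-invariant alternative, `adjusted_rand_score`. It is offered as an option under these assumptions, not as the workshop's chosen metric.

```python
import numpy as np
import sklearn.metrics as metrics

# Hypothetical labels: the grouping is perfect, but the cluster ids are permuted.
true_classes   = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_labels = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])

accuracy = metrics.accuracy_score(true_classes, cluster_labels)
ari      = metrics.adjusted_rand_score(true_classes, cluster_labels)

print("accuracy_score:", accuracy)        # compares ids position by position
print("adjusted_rand_score:", ari)        # compares groupings, ignoring the ids
```

On the wine data the same call would be `metrics.adjusted_rand_score(wine.target, kmeans.labels_)`.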

In [None]:
best_size=0
best_acc=0
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, random_state=0).fit(wine2d)
    accuracy = metrics.accuracy_score( wine.target,kmeans.labels_)
    if accuracy>best_acc:
        best_acc= accuracy
        best_size=i
    print('-'*120)
    print("Using Kmeans with {} clusters, with two features we achieved an accuracy of {}".format(i,accuracy))
    
print('-'*120,"\n")
print("*"*120)
print("Best results were achieved with {} clusters, reaching an accuracy of {}.".format(best_size, best_acc))
print("*"*120,"\n")


#### Let's use all the features

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(wine.data)
print("Labels\n","-"*30,'\n',kmeans.labels_)
print("Centers\n","-"*30,'\n',kmeans.cluster_centers_)
print('-'*120)
print("Using Kmeans with 3 clusters, with all the features we achieved an accuracy of {}".format(metrics.accuracy_score( wine.target,kmeans.labels_)))
print('-'*120)

In [None]:
best_size=0
best_acc=0
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, random_state=0).fit(wine.data)
    accuracy = metrics.accuracy_score( wine.target,kmeans.labels_)
    if accuracy>best_acc:
        best_acc= accuracy
        best_size=i
    print('-'*120)
    print("Using Kmeans with {} clusters, with all the features we achieved an accuracy of {}".format(i,accuracy))
    
print('-'*120,"\n")
print("*"*120)
print("Best results were achieved with {} clusters, reaching an accuracy of {}.".format(best_size, best_acc))
print("*"*120,"\n")


In [None]:
# Use three features: hue (10), flavanoids (6) and color_intensity (9)
wine3f = wine.data[:, [10, 6, 9]]
best_size=0
best_acc=0
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, random_state=0).fit(wine3f)
    accuracy = metrics.accuracy_score( wine.target,kmeans.labels_)
    if accuracy>best_acc:
        best_acc= accuracy
        best_size=i
    print('-'*120)
    print("Using Kmeans with {} clusters, with three features we achieved an accuracy of {}".format(i,accuracy))
    
print('-'*120,"\n")
print("*"*120)
print("Best results were achieved with {} clusters, reaching an accuracy of {}.".format(best_size, best_acc))
print("*"*120,"\n")