# Data Science in Real World

### Two types of inference tasks:

***Classification*** - output (**Y**) is categorical, ordinal, or otherwise discrete in value

e.g. 
- **X**: personal bank information,  **Y**: Risk Level for Loan (High, Medium, Low Risk)
- **X**: youtube watch history, **Y**: Recommendation to watch a certain video (Yes, No)

***Regression*** - output (**Y**) is a numeric and continuous value

e.g. 
- **X**: car attributes, **Y**: price of the car
- **X**: camera view from a car, **Y**: The location of the car (x, y coordinates)

**Question:**

   <ul>
    <li> X: Previous stock values, Y: Stock value on the next trading day </li>
    <li> X: Animal audio, Y: Identity of the animal </li>
    <li> X: Different sensory inputs, Y: Potability of water </li>
   </ul>
Which type of problem does each task listed above belong to, classification or regression?

## <span style="color:#1A9FFF">1. scikit-learn Library</span>

[scikit-learn](http://scikit-learn.org/stable/) is a machine learning library for Python. It includes many simple and efficient tools for data mining and data analysis. scikit-learn is built on NumPy, SciPy, and matplotlib. Many popular machine learning models are included and can be used directly from the `sklearn` package.

It can be used to solve problems such as:

<ul>
    <li> Classification </li>
    <li> Regression </li>
    <li> Clustering </li>
    <li> Dimensionality Reduction </li>
    <li> Data Preprocessing </li>
    <li> Model Selection </li>
    <li> ... </li>
</ul>

In this tutorial we focus only on applying scikit-learn to supervised learning.

scikit-learn can be installed using `conda`. Just follow the steps in [Installing scikit-learn](http://scikit-learn.org/stable/install.html).
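Once the installation finishes, a quick sanity check is to import the package and print its version. Note that the package is imported under the name `sklearn`; the variable name `installed_version` below is only illustrative.

```python
import sklearn

# The import name is `sklearn`, not `scikit-learn`
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```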

# <span style="color:#1A9FFF"> 2. Decision Trees </span>

### Warming Up Example

How can we decide the species of a bear if we know its **color** and **where it lives**?

<ul>
    <li> feature: **(color, habitat)** </li>
    <li> Is the bear black or white? </li>
    <li> Does the bear live in China? </li>
</ul>

<table><tr>
<td> <img src="pic/polarbear.jpg" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="pic/blackbear.jpg" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="pic/panda.jpg" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>


<img src="pic/illuDeciTree.png">
We can solve this classification problem by asking a few questions, which can be implemented with `if` statements.
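As a rough sketch, the bear example could be written as a handful of `if` statements. The function name `classify_bear`, the string values for the colors, and the order of the questions are assumptions made for illustration; the tree in the figure may ask the questions in a different order.

```python
def classify_bear(color, lives_in_china):
    """Toy rule-based classifier for the three bears pictured above."""
    # Question 1: is the bear entirely white?
    if color == "white":
        return "polar bear"
    # Question 2: does the bear live in China?
    if lives_in_china:
        return "panda"
    return "black bear"


print(classify_bear("white", False))
print(classify_bear("black and white", True))
print(classify_bear("black", False))
```

Each `if` plays the role of one question in the tree, and each `return` is one possible answer.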

## <span style="color:#1A9FFF">2.1. Intro to Decision Trees</span>

- hierarchical data structure implementing the divide-and-conquer strategy 
- a series of IF-THEN rules
- can be used for classification and regression problems
- two parts: decision nodes and leaf nodes

**Decision Node (internal node)** - Implements a rule/function with discrete outcomes that label the branches; the branch taken is decided by the input data. Decision nodes help determine the sub-space.

**Leaf Node (terminal node)** - The "decision" or value that constitutes the output. Leaf nodes help determine the label of the sub-space.
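To make the two node types concrete, here is a minimal sketch of a tree as a data structure. It is not how scikit-learn stores trees internally; the `Node` class, the `predict` helper, and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Decision-node fields: compare one feature against a threshold
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None   # branch taken otherwise
    # Leaf-node field: the output label
    label: Optional[str] = None

    def is_leaf(self):
        return self.label is not None


def predict(node, x):
    """Follow the decision nodes from the root until a leaf is reached."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


# A hand-built tree with two decision nodes and three leaves (made-up thresholds)
toy_tree = Node(
    feature=0, threshold=0.5,
    left=Node(label="A"),
    right=Node(feature=1, threshold=2.0,
               left=Node(label="B"),
               right=Node(label="C")),
)

print(predict(toy_tree, [0.2, 5.0]))
print(predict(toy_tree, [0.9, 1.0]))
print(predict(toy_tree, [0.9, 3.0]))
```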

The decision tree recursively splits the feature space so that most of the samples in each resulting sub-space belong to the same class.

<img src="pic/splitspace.png">
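One common way to quantify how mixed the classes in a sub-space are is the Gini impurity: one minus the sum of the squared class fractions. It is the default split criterion of `DecisionTreeClassifier`. The sketch below uses hypothetical helper names (`gini_impurity`, `split_impurity`) and made-up labels to show why a split that separates the classes is preferred.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum_k p_k^2, where p_k is the fraction of samples in class k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left_labels, right_labels):
    """Weighted average impurity of the two sub-spaces produced by a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini_impurity(left_labels)
            + len(right_labels) * gini_impurity(right_labels)) / n


mixed = [0, 0, 1, 1]
print(gini_impurity(mixed))            # 0.5: two classes, evenly mixed
print(split_impurity([0, 0], [1, 1]))  # 0.0: each side contains a single class
print(split_impurity([0, 1], [0, 1]))  # 0.5: the split did not help
```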

## <span style="color:#1A9FFF">2.2 Decision Trees for Classification</span>

 <ul>
    <li> A Simple Case ( Data Handling, Training, Testing ) </li>
    <li> Real Cases </li>
    <li> Tree Visualization </li>
</ul>

### A Simple Case

#### Import Library

In [None]:
from sklearn import tree

#### Data
<ul>
    <li> Input : **X** ( Feature Matrix ) with size **( n_samples, n_features )**</li>
    <li> Output : **Y** ( Labels ) with size **( n_samples )** </li>
</ul>

In [None]:
X = [[0, 0, 0], [1, 1, 1]]  # input feature matrix, size: [2, 3], 2 samples with 3 features each
Y = [0, 1]                  # labels, size: [2], one label per sample

#### Training

`DecisionTreeClassifier` defines a decision tree model, which we use as the classifier.

Use the `.fit()` method to train the model on the given data **X** and **Y**.

In [None]:
# Define a model
classifier = tree.DecisionTreeClassifier()

# Training
classifier = classifier.fit(X, Y)

#### Testing ( Prediction )

Use the `.predict()` method to predict labels for the test data.

In [None]:
X_test = [[2, 2, 1]]  # testing feature matrix, size: [1, 3], one sample with dimension 3

# Testing (Prediction)
print ("The output label is", classifier.predict(X_test))

# Check the predicted probability of each class
print ("The probabilities to the two classes are", classifier.predict_proba(X_test))

`DecisionTreeClassifier` is capable of both binary classification ( where the labels are, for example, [ -1, 1 ] ) and multi-class classification ( where the labels are, for example, [ 0, … , K-1 ] ).
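As a small illustration of the multi-class case, the sketch below trains on three classes with string labels. The toy data and the names `X_toy`, `Y_toy`, and `toy_clf` are made up for this example; `random_state=0` is an arbitrary seed.

```python
from sklearn import tree

# Toy data: three samples, three classes with string labels (made up)
X_toy = [[0, 0], [1, 1], [2, 2]]
Y_toy = ["low", "medium", "high"]

toy_clf = tree.DecisionTreeClassifier(random_state=0)
toy_clf.fit(X_toy, Y_toy)

# classes_ lists the class labels seen during training, in sorted order
print(toy_clf.classes_)

predicted = toy_clf.predict([[2, 2]])
print(predicted)
```

The columns returned by `predict_proba` follow the order of `classes_`.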

### Real Case 1 -  Iris dataset

#### Data

The dataset we use is `Iris`, which is a classic and very easy multi-class classification dataset. 

***Summary of the Iris Dataset***

<ul>
    <li> 3 classes </li>
    <li> 50 samples per class </li>
    <li> 150 samples in total </li>
    <li> Each sample has 4 features ( sepal length, sepal width, petal length, petal width ) </li>
</ul>


<table><tr>
<td> Iris-setosa<img src="pic/iris-setosa.jpg" alt="Drawing" style="width: 250px;"/> </td>
<td> Iris-versicolor<img src="pic/iris-versicolor.jpg" alt="Drawing" style="width: 250px;"/> </td>
<td> Iris-virginica<img src="pic/iris-virginica.jpg" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>


In [None]:
# Load data directly from the scikit-learn package
from sklearn.datasets import load_iris

# Load data
iris = load_iris()

# Class names
print ("The names of 3 classes: ", list(iris.target_names))

# Feature names
print ("The names of 4 features:", list(iris.feature_names))

In [None]:
# Get samples
X_iris = iris.data
print ("The shape of Iris dataset :", X_iris.shape)

# Size of data
n_samples  = X_iris.shape[0]
n_features = X_iris.shape[1]
print ("The number of samples:     ", n_samples)
print ("The number of features:    ", n_features)

# Get labels
Y_iris = iris.target
print ("\nThe shape of labels in Iris dataset :", Y_iris.shape)

In [None]:
# Draw the first two dimensions of samples to see their distribution in feature space

%matplotlib inline
import matplotlib.pyplot as plt

# Select the data to plot
X = X_iris[:, :2]  # we only take the first two features.
y = Y_iris

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

plt.figure(2, figsize=(8, 6))
plt.clf()

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='grey', s=40)
plt.title('Sample Distribution in Feature Space')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

#### Training

Training Data: 80% of the samples in Iris dataset

In [None]:
import numpy as np

# Randomly permute IDs
rand_ids = np.random.permutation(n_samples)

# Training IDs
n_train = int(n_samples*0.8)
train_ids = rand_ids[0:n_train]

# Get training samples
X_iris_train = X_iris[train_ids]
Y_iris_train = Y_iris[train_ids]


# Training
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier()
tree_clf = tree_clf.fit(X_iris_train, Y_iris_train)
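The manual permutation above makes the split explicit. As an alternative sketch, scikit-learn's `train_test_split` helper does the shuffling and slicing in one call. The variable names and the seed `random_state=0` are arbitrary choices for illustration; fixing the seed makes the split repeatable between runs.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 20% of the samples; random_state=0 is an arbitrary seed for repeatability
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```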

#### Tree Visualization

Use [`Graphviz`](http://www.graphviz.org) to visualize the tree structure. Once the decision tree is trained, we can export the tree in Graphviz format using the `export_graphviz` exporter. If Anaconda is installed, use the command below to install `Graphviz`.

**Type in terminal:** `conda install python-graphviz`

In [None]:
# Import the Graphviz package
import graphviz

tree.export_graphviz(
                         tree_clf,
                         feature_names = iris.feature_names,  
                         class_names   = iris.target_names,  
                         filled  = True, 
                         rounded = True,  
                         special_characters = True,
                         out_file = 'tree.dot'
                    )  # save tree as .dot file

with open("tree.dot") as f:
    dot_graph = f.read()      # read from .dot file
    
    
graph = graphviz.Source(dot_graph)
graph

# gini    : Gini impurity of the samples in the current node (0 means the node is pure)
# samples : The number of samples in the current node
# value   : The number of samples of each class in the current node
# class   : The majority class of the samples in the current node
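If Graphviz is not available, a text rendering of the tree can be a useful fallback. The sketch below uses `export_text` from `sklearn.tree`. It trains a separate, deliberately shallow tree so the printout stays short; the name `small_tree` and the settings `max_depth=2` and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 is an arbitrary choice to keep the printout short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Each indented line is one decision rule; lines starting with "class:" are leaves
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```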

#### Testing

Testing Data: 20% of the samples in Iris dataset

In [None]:
# Testing IDs
test_ids = rand_ids[n_train:]
n_test   = test_ids.shape[0]


# Get testing samples
X_iris_test = X_iris[test_ids]
Y_iris_test = Y_iris[test_ids]


# Testing
Y_iris_pred = tree_clf.predict(X_iris_test)


# Check accuracy
results = (Y_iris_pred == Y_iris_test)
results = results.astype(int)

accuracy = sum(results)*100.0/n_test
print ("The testing accuracy is %0.2f%%" % accuracy)
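The same accuracy can also be computed with `accuracy_score` from `sklearn.metrics`. The sketch below compares it with the manual calculation on made-up labels (`y_true` and `y_pred` are illustrative, not the Iris predictions).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1])

# Manual: fraction of matching entries, as a percentage
manual_accuracy = np.mean(y_pred == y_true) * 100.0

# Library: accuracy_score returns a fraction between 0 and 1
library_accuracy = accuracy_score(y_true, y_pred) * 100.0

print("Manual accuracy:  %0.2f%%" % manual_accuracy)
print("Library accuracy: %0.2f%%" % library_accuracy)
```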

## Try this:
* **1. Vary the percentage of training and testing data (say, 30% and 70% respectively) and rerun the above code. Do you notice any change in the training and testing accuracy?**
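One possible way to set up this experiment is sketched below. It is only a starting point: the training fractions, the seed `random_state=0`, and the variable names are arbitrary assumptions, and the accuracies you observe will depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
accuracies = {}

for train_fraction in (0.3, 0.5, 0.8):
    # The remaining samples (1 - train_fraction) are used for testing
    X_train, X_test, Y_train, Y_test = train_test_split(
        iris.data, iris.target, train_size=train_fraction, random_state=0
    )
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)
    # .score() returns the fraction of correctly classified samples
    accuracies[train_fraction] = (clf.score(X_train, Y_train),
                                  clf.score(X_test, Y_test))

for train_fraction, (train_acc, test_acc) in accuracies.items():
    print("train %.0f%%: training accuracy %.2f, testing accuracy %.2f"
          % (train_fraction * 100, train_acc, test_acc))
```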
    
