# Advanced Topics in Data Science (CS5661). Cal State Univ. LA, CS Dept.
### Instructor: Dr. Mohammad Pourhomayoun
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------


# Data Science in Python

#### This is a review of data science libraries/packages in Python. Feel free to refer to the suggested resources and documentation for more details.

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------


# Scikit-Learn Library (sklearn):
Scikit-learn is a widely used Python machine learning library. It includes efficient implementations of various classification, regression, and clustering algorithms. It also includes many functions for data preprocessing and processing, along with a number of built-in datasets to work with.
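
As a quick illustration of the built-in datasets mentioned above, the following minimal sketch loads sklearn's own copy of the iris data. It assumes a scikit-learn version that supports the `as_frame` option; the names `iris_bunch`, `builtin_X`, and `builtin_y` are only illustrative and are not used elsewhere in this tutorial.

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch object; as_frame=True asks for pandas objects
iris_bunch = load_iris(as_frame=True)

builtin_X = iris_bunch.data      # DataFrame with one row per flower, one column per feature
builtin_y = iris_bunch.target    # Series of integer labels (0, 1, 2)

print(builtin_X.shape)                   # (150, 4)
print(iris_bunch.target_names.tolist())  # ['setosa', 'versicolor', 'virginica']
```

Note that the column names in this built-in copy (for example `sepal length (cm)`) differ from the column names in the CSV file used below.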


## The Main Steps to build (train) and use (test/predict) a predictive model in sklearn:

## Step1: Importing the sklearn class (machine learning algorithm) that you would like to use for modeling:

In [2]:
# The following line will import DecisionTreeClassifier "Class"
# DecisionTreeClassifier is name of a "sklearn class" to perform "Decision Tree Classification" 

from sklearn.tree import DecisionTreeClassifier

In [3]:
# Importing the required packages and libraries
# we will need numpy and pandas later
import numpy as np
import pandas as pd


## Step2: Set up the Feature Matrix and Label Vector:

## Let's start with iris data as a popular and simple dataset:


In [4]:
# Reading a CSV file directly from the web and storing it in a pandas DataFrame:
# "read_csv" is a pandas function that reads CSV files from a URL or a local path:

iris_df = pd.read_csv('https://raw.githubusercontent.com/mpourhoma/CS5661/master/iris.csv')


In [5]:
# checking the dataset by printing every 10th row:
iris_df[0::10]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
10,5.4,3.7,1.5,0.2,setosa
20,5.4,3.4,1.7,0.2,setosa
30,4.8,3.1,1.6,0.2,setosa
40,5.0,3.5,1.3,0.3,setosa
50,7.0,3.2,4.7,1.4,versicolor
60,5.0,2.0,3.5,1.0,versicolor
70,5.9,3.2,4.8,1.8,versicolor
80,5.5,2.4,3.8,1.1,versicolor
90,5.5,2.6,4.4,1.2,versicolor


In [6]:
# Defining a function to convert "categorical" labels to "numerical" labels (Optional)
# Notice that recent versions of Scikit-Learn can also handle categorical (string) labels. So, this step is optional.

def categorical_to_numeric(x):
    if x == 'setosa':
        return 0
    elif x == 'versicolor':
        return 1
    elif x == 'virginica':
        return 2
    
# Applying the function to the species column and adding the corresponding label column:
iris_df['label'] = iris_df['species'].apply(categorical_to_numeric)

# checking the dataset by printing every 10th row:
iris_df[0::10]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,label
0,5.1,3.5,1.4,0.2,setosa,0
10,5.4,3.7,1.5,0.2,setosa,0
20,5.4,3.4,1.7,0.2,setosa,0
30,4.8,3.1,1.6,0.2,setosa,0
40,5.0,3.5,1.3,0.3,setosa,0
50,7.0,3.2,4.7,1.4,versicolor,1
60,5.0,2.0,3.5,1.0,versicolor,1
70,5.9,3.2,4.8,1.8,versicolor,1
80,5.5,2.4,3.8,1.1,versicolor,1
90,5.5,2.6,4.4,1.2,versicolor,1
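
As an alternative to the hand-written function above, sklearn provides `LabelEncoder` for the same conversion. The sketch below is only an illustration: it uses a small stand-in Series (`species`) instead of the real `iris_df['species']` column, and the names `encoder` and `labels` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the 'species' column of iris_df
species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'])

encoder = LabelEncoder()
labels = encoder.fit_transform(species)   # classes are numbered in sorted order

print(labels)                                        # [0 1 2 0]
print(encoder.classes_.tolist())                     # ['setosa', 'versicolor', 'virginica']
print(encoder.inverse_transform([2, 0]).tolist())    # ['virginica', 'setosa']
```

Because the classes are sorted alphabetically, the numbering here happens to match the mapping in `categorical_to_numeric`; `inverse_transform` converts numerical labels back to species names.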


In [7]:
# Creating the Feature Matrix for iris dataset:

# create a python list of the feature names that we would like to pick from the dataset:
feature_cols = ['sepal_length','sepal_width','petal_length','petal_width']

# use the above list to select the features from the original DataFrame
X = iris_df[feature_cols]  

# print the first 10 rows
X.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


In [8]:
# select the label column (a pandas Series) from the DataFrame
y = iris_df['species']  # or: iris_df['label']

# checking the label vector by printing every 10th value
y[::10]

0          setosa
10         setosa
20         setosa
30         setosa
40         setosa
50     versicolor
60     versicolor
70     versicolor
80     versicolor
90     versicolor
100     virginica
110     virginica
120     virginica
130     virginica
140     virginica
Name: species, dtype: object

## Step3: Defining (instantiating) an "object" from the sklearn class:

In [9]:
# In the following line, "my_decisiontree" is instantiated as an "object" of DecisionTreeClassifier "class". 

my_decisiontree = DecisionTreeClassifier()


## Step4: Training Stage: Training a predictive model using the training dataset:
#### The Training Stage is called Fitting in sklearn
#### Method "fit" is used for many sklearn classes

In [10]:
# We can use the method "fit" of the "object my_decisiontree" along with training dataset and labels to train the model.

my_decisiontree.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
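
Once a tree has been fitted, it can be inspected. The following is a minimal sketch, assuming sklearn's built-in copy of the iris data as a stand-in for the CSV file above; `demo_tree` is an illustrative name, and `random_state=0` is added only to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the built-in iris copy stands in for the course CSV file
iris = load_iris()

demo_tree = DecisionTreeClassifier(random_state=0)  # fixed seed for repeatable tie-breaking
demo_tree.fit(iris.data, iris.target)

# Size of the learned tree
print(demo_tree.get_depth(), demo_tree.get_n_leaves())

# Relative importance of each feature (the values sum to 1)
for name, importance in zip(iris.feature_names, demo_tree.feature_importances_):
    print(name, round(float(importance), 3))

# Plain-text view of the first levels of the learned rules
print(export_text(demo_tree, feature_names=list(iris.feature_names), max_depth=2))
```

The exact depth, importances, and rules depend on the data and on the sklearn version, so no particular output is assumed here.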

## Step5: Testing (Prediction) Stage: Making prediction on new observations (Testing Data) using the trained model:
##### Now, suppose that we have a new observation (a new data sample) with known features [6, 3, 5.9, 2.9] and an unknown label. What would be our prediction for the label of this new observation?
#### Testing Stage is called Predict in sklearn
#### Method "predict" is used for many sklearn classes

In [11]:
# We can use the method "predict" of the *trained* object my_decisiontree on one or more testing data samples to perform prediction:

X_Testing = [[6, 3, 5.9, 2.9]]

y_predict = my_decisiontree.predict(X_Testing)

print(y_predict)

['virginica']


In [12]:
# We can use the method "predict" of the *trained* object my_decisiontree on one or more testing data samples to perform prediction:
# Two new data samples:

X_Testing = [[6, 3, 5.9, 2.9],[3.2, 3, 1.9, 0.3]]

y_predict = my_decisiontree.predict(X_Testing)

print(y_predict)

['virginica' 'setosa']
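
Besides "predict", sklearn classifiers such as DecisionTreeClassifier also offer the method "predict_proba", which returns one estimated probability per class. The sketch below assumes sklearn's built-in iris copy with string labels as a stand-in for the CSV data; `demo_tree`, `species_names`, and `new_samples` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in iris copy stands in for the course CSV file
iris = load_iris()
species_names = iris.target_names[iris.target]   # string labels, like the species column

demo_tree = DecisionTreeClassifier(random_state=0)
demo_tree.fit(iris.data, species_names)

# The same two new observations as in the cell above
new_samples = [[6, 3, 5.9, 2.9], [3.2, 3, 1.9, 0.3]]
probabilities = demo_tree.predict_proba(new_samples)

print(demo_tree.classes_)   # column order of the probabilities
print(probabilities)        # one row per sample, one column per class
```

For a fully grown tree these probabilities are often exactly 0 or 1, because each leaf usually contains training samples from a single class.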


# Evaluating the accuracy of our classifier:

#### 1- Let's split the iris dataset RANDOMLY into two new datasets: Training Set (e.g. 70% of the dataset) and Testing Set (30% of the dataset).
#### 2- Let's pretend that we do NOT know the label of the Testing Set!
#### 3- Let's Train the model on only Training Set, and then Predict on the Testing Set!
#### 4- After prediction, we can compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy of our Decision Tree Classifier!

#### We will learn more about model and accuracy evaluation in future tutorials!

In [15]:
# Randomly splitting the original dataset into training set and testing set
# The function "train_test_split" from the "sklearn.model_selection" module performs random splitting.
# "test_size=0.3" means: pick 30% of the data samples for the testing set, and the rest (70%) for the training set.
# "random_state=1" fixes the random seed so that the same split is produced on every run.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [16]:
# print the size of the training set:
print(X_train.shape)
print(y_train.shape)


(105, 4)
(105,)


In [17]:
# print the size of the testing set:
print(X_test.shape)
print(y_test.shape)


(45, 4)
(45,)
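
The sizes above follow from `test_size=0.3`: 30% of 150 samples is 45 for testing, leaving 105 for training. The following sketch uses dummy data (`demo_X`, `demo_y` are illustrative, not the iris values) to show the sizes and to show that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

demo_X = np.arange(150).reshape(-1, 1)   # 150 dummy samples, one feature each
demo_y = np.repeat([0, 1, 2], 50)        # 50 dummy labels per class

# Two independent calls with the same random_state
split_a = train_test_split(demo_X, demo_y, test_size=0.3, random_state=1)
split_b = train_test_split(demo_X, demo_y, test_size=0.3, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)             # (105, 1) (45, 1)
print(np.array_equal(X_te, split_b[1]))   # True: same seed, same split
```

Changing or omitting `random_state` generally gives a different random split, and therefore a slightly different accuracy.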


In [18]:
print(X_test)
print('\n')
print(y_test)

     sepal_length  sepal_width  petal_length  petal_width
14            5.8          4.0           1.2          0.2
98            5.1          2.5           3.0          1.1
75            6.6          3.0           4.4          1.4
16            5.4          3.9           1.3          0.4
131           7.9          3.8           6.4          2.0
56            6.3          3.3           4.7          1.6
141           6.9          3.1           5.1          2.3
44            5.1          3.8           1.9          0.4
29            4.7          3.2           1.6          0.2
120           6.9          3.2           5.7          2.3
94            5.6          2.7           4.2          1.3
5             5.4          3.9           1.7          0.4
102           7.1          3.0           5.9          2.1
51            6.4          3.2           4.5          1.5
78            6.0          2.9           4.5          1.5
42            4.4          3.2           1.3          0.2
92            

#### Training ONLY on the training set:

In [19]:
# Training ONLY on the training set:

my_decisiontree.fit(X_train, y_train)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

### Testing on the testing set:

In [20]:
# Testing on the testing set:

y_predict = my_decisiontree.predict(X_test)

print(y_predict)

['setosa' 'versicolor' 'versicolor' 'setosa' 'virginica' 'versicolor'
 'virginica' 'setosa' 'setosa' 'virginica' 'versicolor' 'setosa'
 'virginica' 'versicolor' 'versicolor' 'setosa' 'versicolor' 'versicolor'
 'setosa' 'setosa' 'versicolor' 'versicolor' 'virginica' 'setosa'
 'virginica' 'versicolor' 'setosa' 'setosa' 'versicolor' 'virginica'
 'versicolor' 'virginica' 'versicolor' 'virginica' 'virginica' 'setosa'
 'versicolor' 'setosa' 'versicolor' 'virginica' 'virginica' 'setosa'
 'versicolor' 'virginica' 'versicolor']


# Accuracy Evaluation:
#### After prediction, we can now compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy of our Decision Tree Classifier!

In [21]:
# Function "accuracy_score" from "sklearn.metrics" performs an element-to-element comparison and returns the 
# fraction of correct predictions:

from sklearn.metrics import accuracy_score

# Example:
y_pred    = [0, 2, 1, 1]
y_actual  = [0, 1, 2, 1]

score = accuracy_score(y_actual, y_pred)

print(score)

0.5
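
To see where the 0.5 comes from, the same number can be computed by hand with numpy. This is only a sketch of the idea behind the score; the names `y_pred_demo`, `y_actual_demo`, `matches`, and `manual_score` are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_pred_demo   = np.array([0, 2, 1, 1])
y_actual_demo = np.array([0, 1, 2, 1])

# Element-to-element comparison: True where the prediction is correct
matches = (y_pred_demo == y_actual_demo)   # [True, False, False, True]

# The mean of a boolean array is the fraction of True values: 2 out of 4
manual_score = matches.mean()

print(manual_score)                                                 # 0.5
print(manual_score == accuracy_score(y_actual_demo, y_pred_demo))   # True
```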


In [22]:
# We can now compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy. 
# Function "accuracy_score" from "sklearn.metrics" performs the element-to-element comparison and returns the 
# fraction of correct predictions:

from sklearn.metrics import accuracy_score

score = accuracy_score(y_test, y_predict)

print(score)

0.9555555555555556
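
A score of 0.9555555555555556 on 45 testing samples corresponds to 43 correct predictions (43/45). With `normalize=False`, "accuracy_score" returns the number of correct predictions instead of the fraction. The sketch below uses made-up stand-in lists (`actual`, `predicted`) with two deliberate mistakes, not the real `y_test` and `y_predict`.

```python
from sklearn.metrics import accuracy_score

# Stand-in labels: 45 samples, then two predictions changed on purpose
actual = ['setosa'] * 15 + ['versicolor'] * 15 + ['virginica'] * 15
predicted = list(actual)
predicted[15] = 'virginica'   # first deliberate mistake
predicted[16] = 'virginica'   # second deliberate mistake

# normalize=False gives the count of correct predictions
n_correct = int(accuracy_score(actual, predicted, normalize=False))

print(n_correct, len(actual))     # 43 45
print(n_correct / len(actual))    # 0.9555555555555556
```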


### checking the results:

In [23]:
results = pd.DataFrame()

results['actual'] = y_test 
results['prediction'] = y_predict 

print(results)

         actual  prediction
14       setosa      setosa
98   versicolor  versicolor
75   versicolor  versicolor
16       setosa      setosa
131   virginica   virginica
56   versicolor  versicolor
141   virginica   virginica
44       setosa      setosa
29       setosa      setosa
120   virginica   virginica
94   versicolor  versicolor
5        setosa      setosa
102   virginica   virginica
51   versicolor  versicolor
78   versicolor  versicolor
42       setosa      setosa
92   versicolor  versicolor
66   versicolor  versicolor
31       setosa      setosa
35       setosa      setosa
90   versicolor  versicolor
84   versicolor  versicolor
77   versicolor   virginica
40       setosa      setosa
125   virginica   virginica
99   versicolor  versicolor
33       setosa      setosa
19       setosa      setosa
73   versicolor  versicolor
146   virginica   virginica
91   versicolor  versicolor
135   virginica   virginica
69   versicolor  versicolor
128   virginica   virginica
114   virginica   vi
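
In a long table like the one above, it is easier to list only the misclassified samples. The following sketch shows one way to do that with a boolean filter; `demo_results` is a small stand-in that copies four rows from the printed table (indices 14, 98, 77, 40), and `mistakes` is an illustrative name.

```python
import pandas as pd

# Four rows copied from the printed results table
demo_results = pd.DataFrame(
    {
        'actual':     ['setosa', 'versicolor', 'versicolor', 'setosa'],
        'prediction': ['setosa', 'versicolor', 'virginica',  'setosa'],
    },
    index=[14, 98, 77, 40],
)

# Keep only the rows where the prediction differs from the actual label
mistakes = demo_results[demo_results['actual'] != demo_results['prediction']]

print(mistakes)
print(len(mistakes))   # 1
```

The same filter can be applied to the full `results` DataFrame from the cell above.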

In [None]:
# How about using only two features rather than all 4 for classification?
# Try this:
# feature_cols = ['sepal_length','sepal_width']
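
One possible way to run this experiment is sketched below. It assumes sklearn's built-in iris copy as a stand-in for the CSV file, so the two sepal columns are named `sepal length (cm)` and `sepal width (cm)` here; `X2`, `y2`, `tree2`, and `score2` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in iris copy stands in for the course CSV file
iris = load_iris(as_frame=True)

# Keep only the two sepal features
two_feature_cols = ['sepal length (cm)', 'sepal width (cm)']
X2 = iris.data[two_feature_cols]
y2 = iris.target

# Same split settings as in the tutorial above
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.3, random_state=1)

# Train on the training set only, then evaluate on the testing set
tree2 = DecisionTreeClassifier(random_state=1)
tree2.fit(X2_train, y2_train)
score2 = accuracy_score(y2_test, tree2.predict(X2_test))

print(score2)
```

Comparing `score2` with the accuracy obtained from all four features indicates how much information the petal measurements contribute.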
