<a href="https://colab.research.google.com/github/jrandym/OCSTA-Spring-Conference-2019/blob/master/Intro_to_DS_Spring_Conference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Data Science - OCSTA Spring Conference



**Instructor:**  
Sylvana Yelda  
*Sr. Data Scientist, Kollective*  
syelda@gmail.com

IMPORTANT: Please copy this notebook so that you can edit your own version. Go to File --> Save a copy in Drive. Then you may close the original version, and rename your copy to whatever you would like.

## Introduction

Data science combines statistics, data analysis and visualization, and machine learning with the goal of analyzing and extracting insight from large amounts of data.

In this notebook, you will get a brief introduction to a typical data science workflow using the Python programming language. Don't worry if you aren't familiar with Python, or even programming in general. The goal of this hour-long course is to give you a basic understanding of the steps a data scientist might take when trying to answer a specific question with data. During the 3-day course, we will go into more detail on each of these steps, including a tutorial on Python.

<b>Python</b> is an open-source general-purpose programming language used in a wide variety of applications. Python was created by Guido van Rossum in the early 1990s and is now used extensively at places like NASA, Google, Amazon, Netflix, YouTube, and Apple.

<b>Google Colaboratory</b> is a research tool for data science and machine learning education and collaboration. It is based on the open-source <b>Jupyter Notebooks</b>, interactive documents that include text and code. The documents are ordinary files with the suffix .ipynb. They can be uploaded, downloaded, and shared like any other digital document. The notebooks are composed of individual cells containing either text, programming code, or the output of a calculation. The code cells can contain code written in many different programming languages such as Python, R, and JavaScript.

## Notebook Basics

* Each cell in this notebook can contain either text or code. This cell contains text written in 'Markdown'.

* You can "run" a cell (whether it has text or code) in a few different ways. With the cell selected, press either:
    - Shift-Enter
    - Alt-Enter (this will run the cell, then insert a new cell below)

* Code cells can also be run by clicking the 'Run Cell' icon on the far left of the cell.

* New cells can be inserted above or below existing cells, either from the Insert menu or by pointing your mouse just above or below the middle of an existing cell.

* Don't forget to save your work from time to time!

## What we'll do today

The steps taken in any data science project typically are not linear, and you may go back and forth between some steps. Below are many of the most common tasks, some of which we will cover today:


*   define question
*   gather and prepare data
*   exploratory analysis
*   machine learning
*   evaluate results
*   present findings

## Defining the Problem

Very broadly, machine learning can be split into regression problems and classification problems. *Today, we will build a machine learning model that classifies iris flowers based on a few characteristics.* We will use the well-known "iris dataset", which contains 150 observations of iris flowers. 

Three types of iris flowers are found in the dataset: 
* iris setosa
* iris virginica
* iris versicolor

Each observation contains the following measurements in centimeters (cm):
* sepal length
* sepal width
* petal length
* petal width



## Load and Prepare Data

In [0]:
import pandas as pd
import numpy as np

Let's read in the data. The location of the file is below. Create a variable called 'url' with the link stored as a string. Then read it using pandas' read_csv() function.

https://raw.githubusercontent.com/syelda/ocsta-intro-to-ds/master/iris_data_intro_to_ds.csv

Call the data 'data'. 
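One way to complete this step is sketched below. The names `url` and `data` come from the instructions above. Because fetching the file needs an internet connection, the runnable part of the sketch reads a tiny made-up sample with the same `read_csv()` call; the sample values are hypothetical, and the column order is an assumption based on the column names used later in this notebook.

```python
import io

import pandas as pd

# Location of the workshop file, stored as a string
url = ('https://raw.githubusercontent.com/syelda/ocsta-intro-to-ds/'
       'master/iris_data_intro_to_ds.csv')

# In the notebook (with internet access) you would run:
# data = pd.read_csv(url)

# Offline stand-in: three made-up rows with the assumed column layout.
# These values are illustrative only and are NOT taken from the real file.
sample_csv = io.StringIO(
    'sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm,class\n'
    '5.0,3.4,1.5,0.2,Iris-setosa\n'
    '6.1,2.8,4.7,1.2,Iris-versicolor\n'
    '6.5,3.0,5.8,2.2,Iris-virginica\n'
)
sample = pd.read_csv(sample_csv)

# (rows, columns) of the dataframe we just read
print(sample.shape)
```

`read_csv()` accepts a URL, a local file path, or a file-like object, which is why the same call works for both the real link and the in-memory sample.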

When we first load up a dataset, we want to start exploring the data by printing out the data and various metrics, as well as visualizing the data. Data is almost always messy, and we need to understand it and identify any potential problems before beginning any analysis. Some of the things we might look for are any missing data, weird data points, and outliers, just to name a few.

Let's start by examining the top 5 rows of the dataset using the head() method.

### A quick sidenote about dataframes

Dataframes are a common tool used in data science. A dataframe is like a spreadsheet. It is simply a table filled with data, where each row represents a single record (or observation), and each column represents a field. Dataframes make it very easy to work with data at all stages of data science.

Subsets of a dataframe or individual columns can be extracted using square brackets:  
* df[['column1', 'column2']]  # notice the double brackets -- this returns a dataframe
* df['column1']  # single brackets return a series (a single column)

Methods (i.e., functions) can be applied to dataframes using the 'dot' notation:  
* df.head()
* df.describe()

Finally, dataframes can be filtered by using conditions in the square brackets:  
* df.loc[df['column1'] > 5]
* df.loc[(df['column1'] > 5) & (df['column2'] == 'blue')]
* df.loc[df['column1'] > 5, 'column2']
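The patterns listed in this sidenote can be tried out on a small dataframe. The sketch below uses a made-up table (the values in `column1` and `column2` are hypothetical) so that each expression has something to act on.

```python
import pandas as pd

# A tiny made-up dataframe (hypothetical values) to try the patterns on
df = pd.DataFrame({
    'column1': [2, 6, 9, 4],
    'column2': ['red', 'blue', 'blue', 'green'],
})

two_columns = df[['column1', 'column2']]  # double brackets -> DataFrame
one_column = df['column1']                # single brackets -> Series

first_rows = df.head(2)   # first 2 rows
summary = df.describe()   # summary statistics of the numeric columns

# Rows where column1 is greater than 5
big = df.loc[df['column1'] > 5]

# Rows matching both conditions; each condition needs its own parentheses
big_and_blue = df.loc[(df['column1'] > 5) & (df['column2'] == 'blue')]

# Same row filter, but keep only column2 (this returns a Series)
colours = df.loc[df['column1'] > 5, 'column2']

print(type(two_columns).__name__, type(one_column).__name__)
print(big_and_blue)
```

Note the `&` operator and the parentheses around each condition: Python's `and` keyword does not work element by element on pandas columns.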

---

Ok, back to the analysis...

It is important to know if we have any missing data, as this can affect our models later on. We can use the isnull() and sum() methods together on the dataset to count how many values are missing in each column.

Let's look at the rows that have missing values.
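A minimal sketch of both checks is shown below. It runs on a small made-up dataframe called `data_demo` (a hypothetical stand-in, not the workshop data); in the notebook the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the workshop data, with two missing petal widths
data_demo = pd.DataFrame({
    'petal_length_cm': [1.4, 1.5, 4.7, 5.8],
    'petal_width_cm': [0.2, np.nan, 1.4, np.nan],
    'class': ['Iris-setosa', 'Iris-setosa',
              'Iris-versicolor', 'Iris-virginica'],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column
missing_per_column = data_demo.isnull().sum()
print(missing_per_column)

# any(axis=1) is True for a row if at least one of its cells is missing
rows_with_missing = data_demo.loc[data_demo.isnull().any(axis=1)]
print(rows_with_missing)
```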

The describe() method is a quick way to get summary statistics on our dataframe.
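As an illustration, here is `describe()` applied to a tiny made-up dataframe (the name `measurements` and its values are hypothetical). The result is itself a dataframe, so individual statistics can be looked up by row and column label.

```python
import pandas as pd

# Made-up measurements, only to show what describe() returns
measurements = pd.DataFrame({
    'sepal_length_cm': [5.0, 6.0, 7.0],
    'sepal_width_cm': [3.0, 3.5, 2.5],
})

# count, mean, std, min, quartiles, and max for each numeric column
stats = measurements.describe()
print(stats)

# Pull out a single statistic by row label and column name
mean_sepal_length = stats.loc['mean', 'sepal_length_cm']
print(mean_sepal_length)
```

Comparing the `min` and `max` rows with the ranges you expect is a quick way to spot suspicious values before plotting.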

Visualizations can be very insightful at this stage of our analysis. Let's plot every variable against every other variable. This can be done easily using seaborn's pairplot.

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
# Note that you have to drop NAs for this to work
sns.pairplot(data.dropna(), hue='class');

What are some observations you can make from the pairplot above? Any issues you can identify?

## Data Wrangling

First let's address the outliers. We saw that there were several sepal length data points for the iris versicolor flower that have near zero values. Let's examine these further and decide what to do with them.

In [0]:
data.loc[data['class'] == 'Iris-versicolor', 'sepal_length_cm'].hist();

In [0]:
data.loc[(data['class'] == 'Iris-versicolor') &
         (data['sepal_length_cm'] < 1.0)]

Any ideas what may be going on here? Do you see anything in common amongst these 5 data points?

Could these observations have been recorded in units of meters instead of centimeters? It looks like that may be the case. 

Let's say we checked our hunch with the field researchers and we found that this is indeed what happened. Now we can fix this error in the dataset.

In [0]:
data.loc[((data['class'] == 'Iris-versicolor') &
         (data['sepal_length_cm'] < 1.0)),
         'sepal_length_cm'] *= 100.0

data.loc[data['class'] == 'Iris-versicolor', 'sepal_length_cm'].hist();

Now let's handle the observations with missing data. Recall that we were missing 5 data points for petal_width_cm.

In [0]:
data.loc[data['petal_width_cm'].isnull()]

All of the missing points are for iris setosa and for the same type of measurement (petal width). It would not be ideal to just remove these, as this could potentially bias our analysis. Instead, we can fill in the missing data (a process known as 'data imputation'). There are many clever ways of doing this. For this exercise, we will use mean imputation. If we know that the values for a measurement fall in a certain range, we can fill in empty values with the average of that measurement. Here, that means the average petal width of the iris setosa observations that do have a value.

In [0]:
average_petal_width = data.loc[data['class'] == 'Iris-setosa', 
                               'petal_width_cm'].mean()

data.loc[((data['class'] == 'Iris-setosa') &
         (data['petal_width_cm'].isnull())),
         'petal_width_cm'] = average_petal_width

Print out the rows for which we just filled in data.

In [0]:
data.loc[(data['class'] == 'Iris-setosa') &
         (data['petal_width_cm'] == average_petal_width)]

Let's check for null values one more time.

Let's look at our data one more time now that it's all cleaned up!

In [0]:
sns.pairplot(data, hue='class');

Let's review the general takeaways so far.

* Examine the observed ranges and compare with the expected ranges (if possible); it is OK to use domain knowledge whenever possible to define that expected range

* Address missing data in one way or another; data can be replaced or dropped, but justification is needed

* Never clean/transform your data manually as that is not easily reproducible; always use code as a record of how you cleaned your data

* Visualize as much as you can about the data at this stage of the analysis so you can confirm everything looks correct

## Exploratory Analysis

Now to the fun part! We can start exploring the data at a deeper level using visualizations and statistics, as needed. This will give us the insight we may need while building and interpreting our machine learning model. Some of the questions we want to answer here include:
* how are my data distributed?
* are there any correlations in the data?
* are there any confounding factors that may explain the apparent correlations?

Let's start by plotting the pairplot again, this time without any color-coding.

Much of the data appear to be normally distributed, which may be important if any models we use assume a normal distribution. However, there are some interesting features in the petal distributions. Could this be because of the different species? Let's look at this again but color-code by species.

The strange distributions in the petal measurements appear to be related to the fact that we've plotted data for different species together. This is important for our classification model! We see just with this plot that Iris-setosa has a completely different distribution in petal width and length as compared to the other two species. This will really help our model distinguish between the species. However, our model may have a slightly harder time distinguishing between Iris-versicolor and Iris-virginica, whose distributions have more of an overlap.

From the plot above, can you identify any possible correlations? It looks like there are correlations between petal length and petal width, as well as sepal length and sepal width. 
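Correlations can also be checked numerically. The sketch below is an assumption-laden stand-in: it uses the copy of the iris data bundled with scikit-learn (loaded with `load_iris`), not the workshop CSV, so the column names differ from the ones in this notebook. It prints the correlation matrix for all flowers pooled together and then the sepal length/width correlation within each species, since the two views can tell different stories.

```python
from sklearn.datasets import load_iris

# scikit-learn's bundled iris data (NOT the workshop CSV), as a dataframe
iris = load_iris(as_frame=True)
measurements = iris.data  # the four measurement columns

# Pearson correlation between every pair of measurements, all species pooled
corr = measurements.corr()
print(corr.round(2))

# Map the numeric targets (0, 1, 2) to species names
species = iris.target.map(dict(enumerate(iris.target_names)))

# Correlation between sepal length and sepal width within each species
within_species = {}
for name, group in measurements.groupby(species):
    within_species[str(name)] = group['sepal length (cm)'].corr(
        group['sepal width (cm)'])
print(within_species)
```

In the notebook, `data.drop(columns='class').corr()` would be one way to get the pooled matrix for the workshop data.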

## Build a Model with ML

We are finally at the stage where we can model the data! The large majority of a data scientist's time is spent gathering, cleaning, and understanding the data. Only after those critical steps are completed do we actually start any work on modeling. Bad data will only lead to bad models; in other words, **garbage in = garbage out**.

When building a machine learning model, we must split our dataset into a training set and a test set. 

* A training set is a subset of the data, selected randomly, that is used to train our models.

* A testing set is the remaining subset of the data (it shares no observations with the training set) that is used to validate (or 'score') our models. It is critical that you test your model with new data (i.e., data that was *not* used to build the model).

In [0]:
# We can extract the data in this format from pandas like this:
all_inputs = data[['sepal_length_cm', 'sepal_width_cm',
                   'petal_length_cm', 'petal_width_cm']]

# Similarly, we can extract the class labels
all_labels = data['class']

Let's look at a subset of the inputs. Print out the first 5 entries of all_inputs.

Let's split the data now.

In [0]:
from sklearn.model_selection import train_test_split

(training_inputs,
 testing_inputs,
 training_classes,
 testing_classes) = train_test_split(all_inputs, all_labels, 
                                     test_size=0.2, random_state=1)

There are many different types of models that we can use for this problem. We will use decision tree classifiers today, as they are simple to understand and yet can be quite powerful. 

Decision trees can be thought of as a series of yes/no questions about the data, each time getting closer to finding out the class of each entry — until they either classify the data set perfectly or simply can't differentiate a set of entries. 
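To make the "series of yes/no questions" concrete, the sketch below trains a small tree and prints its rules as text using scikit-learn's `export_text`. It relies on assumptions that are not part of this workshop: the data is scikit-learn's bundled iris copy (not the workshop CSV), and `max_depth=2` and `random_state=0` are chosen only to keep the printout short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# scikit-learn's bundled iris data (NOT the workshop CSV)
iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is one yes/no question on a single measurement;
# the 'class' lines show the prediction at the end of a branch
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows the same path the classifier takes for a single flower.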

Decision tree classifiers can take many different parameters. But for the purposes of this workshop, we will just use a basic decision tree.

More info can be found here:  
https://scikit-learn.org/stable/modules/tree.html

In [0]:
from sklearn.tree import DecisionTreeClassifier

# Create ('instantiate') the classifier
decision_tree_classifier = DecisionTreeClassifier()

# Train the classifier on the training data
decision_tree_classifier.fit(training_inputs, training_classes)

Next, we need to validate the classifier on the testing set using classification accuracy.

In [0]:
decision_tree_classifier.score(testing_inputs, testing_classes)

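Wow! Our simple model scored very high! Let's think about this for a minute. We used a random subset of 80% of our data to train our model. What if we used a different subset? What if one subset of our data has mostly data for two species of flower and not the third? The score from a single split depends on which rows happen to land in each subset, so it can be misleadingly high or low, and it can hide overfitting (a model that fits its training data closely but generalizes poorly to new data).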
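One way to see how much a single split matters is to repeat the split with different random seeds and compare the scores. The sketch below does this under stated assumptions: it uses scikit-learn's bundled iris data rather than the workshop CSV, and the choice of 20 seeds is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# scikit-learn's bundled iris data (NOT the workshop CSV)
X, y = load_iris(return_X_y=True)

scores = []
for seed in range(20):
    # A different seed gives a different random 80/20 split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)

    # Fix the tree's own randomness so only the split changes
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print('min: {:.2f}  max: {:.2f}  mean: {:.2f}'.format(
    min(scores), max(scores), np.mean(scores)))
```

If the minimum and maximum differ noticeably, a score from any one split should be read with caution.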

This is where cross validation comes in.

**k-fold cross validation** involves partitioning the original dataset into k parts of equal or near-equal size, each called a fold. A series of k models is trained, one per fold. For example, in 10-fold cross validation, model 1 is trained using folds 2-10 as the training set and evaluated using fold 1 as the test set, model 2 is trained on folds 1 and 3-10 and evaluated on fold 2, and so on. Every data point is therefore used exactly once to test a model. This results in 10 accuracy values, one per fold. Their mean and spread give a more reliable estimate of how well the model performs on unseen data than a single train/test split does.

https://scikit-learn.org/stable/_images/grid_search_cross_validation.png
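The fold-by-fold procedure described above can be written out as an explicit loop. This is a sketch under assumptions: it uses scikit-learn's bundled iris data instead of the workshop CSV, and `StratifiedKFold` with shuffling and fixed seeds, which keeps the species balanced in every fold and makes the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# scikit-learn's bundled iris data (NOT the workshop CSV)
X, y = load_iris(return_X_y=True)

# 10 folds, each with roughly the same mix of the three species
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

fold_scores = []
tested_rows = []
for train_idx, test_idx in folds.split(X, y):
    # Train a fresh model on the 9 folds that are not held out
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])

    # Score it on the held-out fold
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    tested_rows.extend(test_idx.tolist())

# One accuracy value per fold
print(fold_scores)
```

Collecting `tested_rows` makes it easy to confirm that every observation lands in a test fold exactly once.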

In [0]:
from sklearn.model_selection import cross_val_score

decision_tree_classifier = DecisionTreeClassifier()

# cross_val_score returns a NumPy array with one score per fold, which we
# can visualize to get a reasonable estimate of our classifier's performance
cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_labels, 
                            cv=10)
plt.hist(cv_scores)
plt.title('Average score: {:.2f} +/- {:.2f}'.format(np.mean(cv_scores), np.std(cv_scores)));

With k-fold cross validation, we estimate that our classifier has a mean classification accuracy of about 0.96.

Next steps that could be taken (if we only had the time!):

* parameter tuning using **grid search** 
* train other types of models and compare to our decision tree classifier
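As a rough idea of what the grid search step could look like, the sketch below tries several combinations of two decision tree parameters and keeps the best one by cross-validated accuracy. The parameter grid, the fixed `random_state`, and the use of scikit-learn's bundled iris data (rather than the workshop CSV) are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# scikit-learn's bundled iris data (NOT the workshop CSV)
X, y = load_iris(return_X_y=True)

# Hypothetical grid: every combination of these values is tried
param_grid = {
    'max_depth': [1, 2, 3, 4, 5],
    'max_features': [1, 2, 3, 4],
}

# Each combination is scored with 10-fold cross validation
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid, cv=10)
grid_search.fit(X, y)

print('Best parameters:', grid_search.best_params_)
print('Best mean accuracy: {:.2f}'.format(grid_search.best_score_))
```

The same pattern works for other model types: swap in a different estimator and a grid of its parameters, then compare the best scores.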

## Conclusions

We built a simple, but quite accurate, machine learning model that can predict the species of an iris flower based on a set of measurements. We learned about the general process a data scientist goes through and many of the tools used in data projects. I hope you found it fun and interesting!

### Acknowledgements

Some of the analysis in this notebook was adapted from educational material created by Randal S. Olson under the [Creative Commons Attribution License](https://creativecommons.org/licenses/by/4.0/). 