# oneEMITUniversity: Machine Learning in Python

## Agenda 

- Introduction to Jupyter Notebook Environment
- Introduction to Pandas
- Introduction to Scikit-learn and Toy Datasets
- Machine Learning Example: Iris Dataset (K nearest neighbors)
- Machine Learning Example: Movie Review Classification (Natural Language Processing)

## Introduction to Pandas

"Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series." Visit https://pandas.pydata.org/ for more information.


In [None]:
# import the relevant libraries
import pandas as pd
import numpy as np

# create a new dataframe of random integers between 0 and 10 with 8 rows and 5 columns
#df = pd.DataFrame(np.random.randint(low=0, high=10, size=(8, 5)), columns=['a', 'b', 'c', 'd', 'e'])

rocket = ['Rocket','Brown', 12,'Dog']
mrbusiness = ['Mr. Business','Orange', 6, 'Cat']
jerome = ['Jerome', 'White', 1000, 'Horse']
nacho = ['Nacho', 'Gray', 100, 'Pterodactyl']
lassie = ['Lassie','Brown', 20, 'Dog']
piglet = ['Piglet', 'Pink', 5,'Pig']
marvin = ['Marvin','Green',8, np.nan]

df = pd.DataFrame([rocket, mrbusiness, jerome, nacho, lassie, piglet, marvin],columns=['Name','Color','Weight (lbs)','Species'])
print('This is the original dataframe:')
display(df)

In [None]:
# Generates descriptive statistics that summarize the central tendency, 
# dispersion and shape of a dataset’s distribution, excluding NaN values.
print('Descriptive statistical summary of the dataframe:')
display(df.describe())

In [None]:
# Verify the shape of the dataframe
print('Shape of your dataframe: {}'.format(df.shape))
print('There are {} rows.'.format(df.shape[0]))
print('There are {} columns.'.format(df.shape[1]))

In [None]:
# Look at the first three rows
print('These are the first three rows of the original dataframe:')
display(df.head(3))

# Look at the last three rows
print('These are the last three rows of the original dataframe:')
display(df.tail(3))

In [None]:
# look at a subset of your dataframe
print('Grab only the rows where the color is brown.')
display(df[df['Color'] == 'Brown'])

In [None]:
# count the values in a certain column
print('How many of each value are there in the Species column?')
display(df['Species'].value_counts())

In [None]:
# see if any values are null
print('Check how many null values are in each column')
display(df.isnull().sum())
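
Once you know where the nulls are, two common options are to drop the affected rows or to fill the gaps with a placeholder. The sketch below uses a small hypothetical table (`pets`, not the `df` above) and the placeholder label `'Unknown'`, both of which are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table with one missing species value.
pets = pd.DataFrame(
    [['Rocket', 12, 'Dog'], ['Piglet', 5, 'Pig'], ['Marvin', 8, np.nan]],
    columns=['Name', 'Weight (lbs)', 'Species'],
)

# Option 1: drop every row that has a null in any column.
pets_dropped = pets.dropna()

# Option 2: keep every row and fill the gap with a placeholder label.
pets_filled = pets.fillna({'Species': 'Unknown'})

print(pets_dropped)
print(pets_filled)
```

Which option is appropriate depends on the analysis; dropping rows loses data, while a placeholder keeps the row but adds an artificial category.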

In [None]:
# You can also read in information from csv, xlsx, txt files
df = pd.read_excel('example_dataframe.xlsx')
df

## Introduction to Scikit-learn and Toy Datasets

### Scikit-learn
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, and k-means. Visit http://scikit-learn.org/stable/index.html for more information.

### We've learned two ways so far to import your data: creating your own dataframe and reading in info from a file. You can also work with one of sklearn's toy datasets for a super quick start.

scikit-learn comes with a few small standard datasets that do not require downloading any file from an external website.
- load_boston():	Load and return the boston house-prices dataset (regression). Note: this loader was removed in scikit-learn 1.2.
- load_iris():	Load and return the iris dataset (classification).
- load_diabetes():	Load and return the diabetes dataset (regression).
- load_digits():	Load and return the digits dataset (classification).
- load_linnerud():	Load and return the linnerud dataset (multivariate regression).
- load_wine():	Load and return the wine dataset (classification).
- load_breast_cancer():	Load and return the breast cancer wisconsin dataset (classification).
- load_sample_images():	Load sample images for image manipulation.

These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in the scikit. They are however often too small to be representative of real world machine learning tasks. Visit http://scikit-learn.org/stable/datasets/index.html for more information.
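
As a quick illustration, a toy dataset can be loaded straight into pandas. This is a minimal sketch that assumes scikit-learn 0.23 or newer, where the loaders accept `as_frame=True`; the variable names `wine` and `wine_df` are placeholders.

```python
from sklearn.datasets import load_wine

# as_frame=True asks scikit-learn to return pandas objects instead of arrays.
wine = load_wine(as_frame=True)

# wine.frame holds the feature columns plus a final 'target' column.
wine_df = wine.frame
print(wine_df.shape)
print(wine_df['target'].value_counts())
```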

## What can we do with these datasets? Many types of analysis. Let's try classification first.
## What is classification?
One of the major areas of data science problems is classification. With classification algorithms, you take an existing dataset and use what you know about it to generate a predictive model for use in classification of future data points. If your goal is to use your dataset and its known subsets to build a model for predicting the categorization of future data points, you’ll want to use classification algorithms.

When implementing supervised classification, you should already know your data’s subsets — these subsets are called categories. Classification helps you see how well your data fits into the dataset’s predefined categories so that you can then build a predictive model for use in classifying future data points.


To classify something, you look at its characteristics (often called features), such as height or length.

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

- Deciding whether an email is spam or not.
- Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
- Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

### The Iris Dataset

This example follows https://github.com/amueller/introduction_to_ml_with_python/blob/master/01-introduction.ipynb.

Let's work through a classification problem using one of the toy datasets. Let's see if we can identify the species of an iris based on only a few of its physical characteristics, namely the length and width of its petals and sepals.

#### Flower Education Break for the non-botanists in the room:

<img src='https://www.wpclipart.com/plants/diagrams/plant_parts/petal_sepal_label.png' width=200 height=200>

The iris dataset contains information about 150 different irises. There are 3 different species with 50 examples of each.

What we want to predict: species of iris plant. 

Information we know about each plant: 
1. sepal length in cm 
2. sepal width in cm 
3. petal length in cm 
4. petal width in cm 

#### Based on this, can we predict the classification of other iris flowers? Let's find out.


In [172]:
# First let's load the dataset
from sklearn.datasets import load_iris
iris_dataset = load_iris()

# These datasets have a slightly unusual format. Let's explore it briefly.
print("What's contained in the iris_dataset?")
print("Keys of iris_dataset: {}".format(iris_dataset.keys()))

What's contained in the iris_dataset?
Keys of iris_dataset: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])


In [None]:
# Find out the different types of each piece of information.
print("Type of 'data' data: {}".format(type(iris_dataset['data'])))
print("Type of 'target' data: {}".format(type(iris_dataset['target'])))
print("Type of 'target_names' data: {}".format(type(iris_dataset['target_names'])))
print("Type of 'DESCR' data: {}".format(type(iris_dataset['DESCR'])))
print("Type of 'feature_names' data: {}".format(type(iris_dataset['feature_names'])))

In [None]:
print('Check the Iris Dataset Description for detailed info: \n')
print(iris_dataset['DESCR'][:970] + "\n...")

### The "raw data" contains the information about the length and width of the petal and sepal.

In [173]:
# View the raw data.
print('What does the raw data look like? \n')
print("The shape of raw data is: {}".format(iris_dataset['data'].shape))
print("First five rows of data:\n{}\n".format(iris_dataset['data'][:5]))

# Find out the feature names.
print("What does each column refer to? Check the feature_names.")
print("Feature names:\n{}".format(iris_dataset['feature_names']))

What does the raw data look like? 

The shape of raw data is: (150, 4)
First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

What does each column refer to? Check the feature_names.
Feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


### The "target data" contains the information about the species of the irises.

In [174]:
# View the target data.
print("What does the target data look like?\n")
print("The Shape of the target data: {}".format(iris_dataset['target'].shape))
print("The Target data:\n{}\n".format(iris_dataset['target']))

# Find out the target names.
print("Each row corresponds to one flower. Each flower has a type and so there is one target label for each row of data. \n")
print("What are the target names? {}".format(iris_dataset['target_names']))

What does the target data look like?

The Shape of the target data: (150,)
The Target data:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Each row corresponds to one flower. Each flower has a type and so there is one target label for each row of data. 

What are the target names? ['setosa' 'versicolor' 'virginica']


### Put the iris data in a pandas dataframe for easy data manipulation!

In [None]:
# create dataframe from the iris data
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(iris_dataset['data'], columns=iris_dataset.feature_names)
display(iris_dataframe.head())
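
One thing the dataframe makes easy is attaching the species name to each row and summarizing by species. The sketch below reloads the data so it runs on its own; the names `iris_df` and `species_means` and the added `species` column are assumptions for illustration, not part of the cells above.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# target holds integers 0-2; index into target_names to get readable labels.
iris_df['species'] = iris.target_names[iris.target]

# Average of each measurement per species.
species_means = iris_df.groupby('species').mean()
print(species_means)
```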

### Let's visualize our data to start to increase our understanding of what we're working with.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Plot petal length (y-axis) against sepal length (x-axis)
plt.scatter(iris_dataframe['sepal length (cm)'], iris_dataframe['petal length (cm)'], c=iris_dataset['target'])
plt.title("Petal Length vs Sepal Length")
plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')

In [None]:
# Plot petal width (y-axis) against sepal width (x-axis)
plt.scatter(iris_dataframe['sepal width (cm)'], iris_dataframe['petal width (cm)'], c=iris_dataset['target'])
plt.title("Petal Width vs Sepal Width")
plt.xlabel('sepal width (cm)')
plt.ylabel('petal width (cm)')

## Machine Learning Example: The Iris Dataset

### Let's get down to business and see if we can build a classification model that can predict the species of an iris based on its petal/sepal length/width.

## Train Set / Test Set

The simplest way for us to get a handle on the ability of a predictive model to perform on future data is to try to simulate this eventuality. Although we cannot literally gain access to the future before it occurs, we can reserve some of our currently available data and treat it as if it were data from the future.

The simplest partition possible for cross-sectional data is a two-way random partition to generate a learning (or training) set and a test set (sometimes instead referred to as a validation set). The thinking underlying such a division is that:

- The data available for analytics fairly represents the real world processes we wish to model
- The real world processes we wish to model are expected to remain relatively stable over time so that a well-constructed model built on last month’s data is reasonably expected to perform adequately on next month’s data 

If our assumptions are more or less correct then the data we have today is a reasonable representation of the data we expect to have in the future. Holding back some of today’s data for testing is therefore a fair approximation to having future data for testing.  

The learn partition has a single and essential role: it provides the raw material from which the predictive model is generated. 

Among other uses, the test partition is employed to evaluate the performance of the model. A model built on the learn data commits to specific predictions, which can be compared to the actual outcomes in the test data.
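
To make the idea concrete, here is a minimal sketch of a two-way random partition written with only the standard library. The function name `random_partition`, the 25% test fraction, and the placeholder records are assumptions for illustration; in practice a library helper does this for you.

```python
import random

def random_partition(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into learn and test partitions."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, learn_idx = indices[:n_test], indices[n_test:]
    learn = [rows[i] for i in learn_idx]
    test = [rows[i] for i in test_idx]
    return learn, test

# Twenty placeholder records standing in for a real dataset.
records = list(range(20))
learn, test = random_partition(records)
print(len(learn), len(test))
```

Every record lands in exactly one partition, so the model never sees the test rows while it is being built.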

In [None]:
# Build training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)

# X refers to our iris data
# y refers to our target data
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

### What is knn? 
This example follows from https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/.

K-nearest neighbors (KNN) is a widely used classification technique. Our focus will be primarily on how the algorithm works and how the input parameter K affects the prediction. KNN can be used for both classification and regression problems. However, it is more widely used for classification problems in industry.

Let’s take a simple case to understand this algorithm. Following is a spread of red circles (RC) and green squares (GS) :

Insert picture here.

You intend to find out the class of the blue star (BS). BS can be either RC or GS and nothing else. The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let’s say K = 3. Hence, we will now draw a circle with BS as its center, just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:

Insert next picture here.

The three closest points to BS are all RC. Hence, with a good level of confidence, we can say that BS should belong to the class RC. Here, the choice was obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm. Next we will look at the factors to consider when choosing K.

How do we choose the factor K?
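
One common way to approach this question is to try several values of K and compare how well each one generalizes. The sketch below is one possible approach, not the method of the article this section follows: it scores each candidate with 5-fold cross-validation on the training data only. The range of K values and the names `cv_scores` and `best_k` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Score each candidate k with 5-fold cross-validation on the training data,
# so the test set stays untouched for the final evaluation.
cv_scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print('Mean CV accuracy per k:', cv_scores)
print('Best k on this split:', best_k)
```

A very small K tends to follow noise in the training data, while a very large K blurs the boundaries between classes, so the best value usually sits somewhere in between.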

We can implement a KNN model by following the steps below:

1. Load the data
2. Initialise the value of k
3. To get the predicted class, iterate from 1 to the total number of training data points:
    1. Calculate the distance between test data and each row of training data. Here we will use Euclidean distance as our distance metric since it’s the most popular method. The other metrics that can be used are Chebyshev, cosine, etc.
    2. Sort the calculated distances in ascending order based on distance values
    3. Get top k rows from the sorted array
    4. Get the most frequent class of these rows
    5. Return the predicted class
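
The steps above can be sketched in a few lines of plain Python. This is an illustrative version only (Python 3.8+ for `math.dist`); the function name `knn_predict` and the six invented 2-D points are assumptions, and a real project would normally rely on a library implementation.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k):
    """Classify one query point by majority vote among its k nearest rows."""
    # Step 3.1: Euclidean distance from the query to every training row.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Steps 3.2 and 3.3: sort ascending and keep the k closest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 3.4 and 3.5: return the most frequent label among them.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented 2-D points: red circles (RC) cluster low, green squares (GS) high.
train_rows = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ['RC', 'RC', 'RC', 'GS', 'GS', 'GS']

print(knn_predict(train_rows, train_labels, query=(2, 2), k=3))
```

With K = 3, the query point (2, 2) has three RC points as its nearest neighbors, so the call prints `RC`.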

In [180]:
# Lucky for us we don't have to implement the KNN algorithm from scratch. 
# These libraries have already done this for us!
# Let's import it.
from sklearn.neighbors import KNeighborsClassifier

# Initialize a KNN classifier for k=2
k = 2
knn = KNeighborsClassifier(n_neighbors=k)

# Fit the model using X as training data and y as target values
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=2, p=2,
           weights='uniform')

### Let's see what it looks like to make a prediction on one random example. Create a new random example of an iris flower and see if we can predict what species it would be.

In [181]:
# Create a new example.
# This example would have a sepal length of 5 cm, sepal width of 2.9 cm, petal length of 1 cm, and petal width of 0.2 cm.
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))
print("X_new:", X_new)

X_new.shape: (1, 4)
X_new: [[5.  2.9 1.  0.2]]


### Predict a response for our new example.

In [182]:
# predict the response
prediction = knn.predict(X_new)
print("Prediction Class: {}".format(prediction))
print("Predicted target name: {}".format(
       iris_dataset['target_names'][prediction]))

### Let's now make predictions on our entire test set and evaluate our accuracy.

In [179]:
y_pred = knn.predict(X_test)
print("We have {} test examples.".format(X_test.shape[0]))
print("The test set predictions:\n {}".format(y_pred))
print("The actual test set target values are:\n{}".format(y_test))


### Let's evaluate the accuracy.

In [None]:
print("How many of the predictions are the same as the actual values?? How many are different??")
print(y_pred == y_test)

num_diff = len(y_pred) - (y_pred == y_test).sum()

print('You can see above that {} prediction(s) is/are incorrect.'.format(num_diff))

### Use knn.score to evaluate accuracy.

In [None]:
# Returns the mean accuracy on the given test data and labels.
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
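
For a classifier, `knn.score` is simply the fraction of test predictions that match the true labels. The sketch below reloads the data so it runs on its own and computes that fraction by hand for comparison; `manual_accuracy` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

knn = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy is the fraction of predictions that equal the true label.
manual_accuracy = np.mean(y_pred == y_test)
print('Manual accuracy: {:.2f}'.format(manual_accuracy))
print('knn.score:       {:.2f}'.format(knn.score(X_test, y_test)))
```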

## Test example (Delete below for final version)

In [None]:
marvin = ['Marvin','Green',10, 'Martian']
martian2 = ['Marvin','Green',9, 'Martian']
piglet2 = ['Marvin','Pink',8, 'Pig']
martian3 = ['Marvin','Green',9, 'Martian']
piglet3 = ['Marvin','Pink',8, 'Pig']
dog2 = ['Marvin','Brown',15, 'Dog']

animals = [rocket, mrbusiness, jerome, nacho, lassie, piglet, marvin, martian2, piglet2, martian3, piglet3, dog2]
df = pd.DataFrame(animals,columns=['Name','Color','Weight (lbs)','Species'])
s1 = pd.get_dummies(df['Color'])
df = pd.concat([df.drop('Color',axis=1), s1], axis=1)

print('This is the original dataframe:')
display(df)

# Split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['Name','Species'],axis=1), df['Species'], random_state=0)
display(X_train)
display(X_test)

# Initialize a KNN classifier for k=2
knn = KNeighborsClassifier(n_neighbors=2)

# Fit the model using X as training data and y as target values
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate (returns the mean accuracy on the given test data and labels)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
print('Prediction')
print(y_pred)
print('Actuals')
print(y_test)

## Machine Learning Example: Movie Review Classification (NLP)

This example leverages the Document Classification activity from Chapter 6 of Natural Language Processing with Python: https://www.nltk.org/book/ch06.html. There are many other exercises in the book we recommend.

As with sklearn, nltk provides some example datasets in its corpus package. One such example is a collection of movie reviews.

In [None]:
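# Load the movie_reviews corpus (the download is a no-op if it is already installed).
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews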

Let's store our movie reviews in a list. Each element of the list will store information about one review. We will have the review itself and also its corresponding category (either positive or negative).

In [None]:
# Create a list of tuples where each tuple contains (the movie review, its corresponding category)
documents = [(list(movie_reviews.words(fileid)), category) \
             for category in movie_reviews.categories() for fileid in movie_reviews.fileids(category)]

Let's look at the first two examples of our reviews.

In [None]:
# Example review 1 (lists are zero-indexed, so the first review is documents[0])
print("Let's look at our first review.")
print('The classification of this review is: {}'.format(documents[0][1]))
print('The Review Text:\n{}...'.format(" ".join(documents[0][0][:100])))
print('--------------------------------')
print('\n')

# Example review 2
print("Let's look at our second review.")
print('The classification of this review is: {}'.format(documents[1][1]))
print('The Review Text:\n{}...'.format(" ".join(documents[1][0][:100])))
print('--------------------------------')

The main idea here is to look at which words are contained in the review text and see if we can use that information to help us decide which class the review belongs to. For example, if a review contains the word "amazing" many times, it is probably positive. What are the steps we need to take to accomplish this?

- Make a list of all the words in the entire corpus.
- Count how many times each word appears in the corpus.
- Sort the list from most frequently used word to least frequently used.
- Select the top 2000 words.
- For each document, record whether each of these 2000 words appears in it.
    - Right now, each review is a series of sentences that we parse into meaningful messages as we read them. The machine cannot read in the same sense that we can. Instead, we'll get the machine to represent each document by which of these frequent words it contains.
- Separate the reviews into test set and train set.
- Train/fit a Naive Bayes Classifier.
- Test the model's accuracy.
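
The feature-building steps above can be sketched on a tiny invented corpus using only the standard library. The three mini-reviews, the cutoff of 3 words instead of 2000, and the names `toy_documents`, `top_words`, and `toy_features` are all assumptions made for illustration.

```python
from collections import Counter

# Three invented mini-reviews standing in for the movie review corpus.
toy_documents = [
    (['a', 'great', 'fun', 'movie'], 'pos'),
    (['a', 'dull', 'boring', 'movie'], 'neg'),
    (['great', 'cast', 'great', 'story'], 'pos'),
]

# Count every word across the corpus and keep the most frequent ones.
word_counts = Counter(word for words, _ in toy_documents for word in words)
top_words = [word for word, _ in word_counts.most_common(3)]

def toy_features(words):
    """Represent one document by which of the top words it contains."""
    present = set(words)
    return {'contains({})'.format(w): (w in present) for w in top_words}

toy_featuresets = [(toy_features(words), label) for words, label in toy_documents]
print(top_words)
print(toy_featuresets[0])
```

Each review ends up as a dictionary of true/false flags paired with its label, which is the input format NLTK classifiers expect.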

In [None]:
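# Shuffle so the train/test split mixes positive and negative reviews
# (documents lists every negative review before the positive ones).
import random
random.shuffle(documents)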

import nltk

# Create a frequency dictionary with all the words 
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Take the 2000 most frequent words (iterating a FreqDist directly is not guaranteed to be frequency-ordered).
word_features = [word for word, _ in all_words.most_common(2000)]

# Function to determine words in one specific review.
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Create the review representations for each review.
featuresets = [(document_features(d), c) for (d,c) in documents]

Our classifier is based on the Naive Bayes algorithm.  In order to find the
probability for a label, this algorithm first uses the Bayes rule to
express P(label|features) in terms of P(label) and P(features|label):

\begin{equation*}
P(\text{label | features}) = \frac{\text{P(label)}*\text{P(features | label)}}{\text{P(features)}}
\end{equation*}

The algorithm then makes the 'naive' assumption that all features are
independent, given the label:

\begin{equation*}
P(\text{label | features}) = \frac{\text{P(label)}*P(f_1\text{| label)}*...*P(f_n\text{| label})}{\text{P(features)}}
\end{equation*}

Rather than computing P(features) explicitly, the algorithm just
calculates the numerator for each label, and normalizes them so they
sum to one:

\begin{equation*}
P(\text{label | features}) = \frac{\text{P(label)}*P(f_1\text{| label)}*...*P(f_n\text{| label})}
{\sum_{l} P(l)*P(f_1|l)*...*P(f_n|l)}
\end{equation*}
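
The normalization step can be checked with a small worked example. The probabilities below are invented for a two-word vocabulary, and the sketch only multiplies in the words that are observed, which is a simplification of what a full classifier does; the names `prior`, `likelihood`, and `posterior` are assumptions.

```python
# Invented probabilities for a two-word vocabulary, for illustration only.
prior = {'pos': 0.5, 'neg': 0.5}
likelihood = {
    'pos': {'great': 0.8, 'boring': 0.1},
    'neg': {'great': 0.2, 'boring': 0.6},
}

def posterior(observed_words):
    """Return P(label | features) for each label via the normalized numerator."""
    numerators = {}
    for label in prior:
        score = prior[label]
        for word in observed_words:
            score *= likelihood[label][word]
        numerators[label] = score
    total = sum(numerators.values())
    return {label: score / total for label, score in numerators.items()}

result = posterior(['great'])
print(result)
```

With these made-up numbers the numerators are 0.5 × 0.8 = 0.4 for pos and 0.5 × 0.2 = 0.1 for neg, so after dividing by their sum the posterior is about 0.8 for pos and 0.2 for neg.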


In [None]:
# Split the reviews into a training set and a test set.
train_set, test_set = featuresets[100:], featuresets[:100]

# Initialize and train our Naive Bayes Classifier. (Documentation at https://www.nltk.org/_modules/nltk/classify/naivebayes.html)
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluate the accuracy of the classifier on the test set.
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
# Let's take a look at one example:
review_features, classification = test_set[0]

print('The classification of this example review is: {}.'.format(classification))
for feature in list(review_features.keys())[:10]:
    print(feature, review_features[feature])

In [None]:
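# Show the classifier's prediction for the example review from the previous cell.
print('Predicted: {}. Actual: {}.'.format(classifier.classify(review_features), classification))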


The cell below lists the features the classifier found most informative. For each word, it reports the ratio of occurrences in negative to positive reviews, or vice versa. In one run, for example, the term "recognizes" appeared 8.1 times as often in positive reviews as in negative reviews, and "unimaginative" appeared 7.8 times as often in negative reviews as in positive reviews. The exact values depend on which reviews end up in the training set.

In [None]:
# Determine the most relevant features, and display them.
classifier.show_most_informative_features(20)
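
To see where such a ratio comes from, here is a minimal sketch built on invented counts. It divides the share of reviews containing a word in one class by the share in the other; NLTK additionally smooths its probability estimates, which this sketch leaves out. The counts and the name `likelihood_ratio` are assumptions for illustration.

```python
# Invented counts: in how many reviews of each class a word appears.
reviews_per_class = {'pos': 1000, 'neg': 1000}
reviews_containing = {
    'unimaginative': {'pos': 5, 'neg': 39},
}

def likelihood_ratio(word, label_a, label_b):
    """Ratio of P(word present | label_a) to P(word present | label_b)."""
    p_a = reviews_containing[word][label_a] / reviews_per_class[label_a]
    p_b = reviews_containing[word][label_b] / reviews_per_class[label_b]
    return p_a / p_b

ratio = likelihood_ratio('unimaginative', 'neg', 'pos')
print('neg : pos = {:.1f} : 1.0'.format(ratio))
```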