# Introduction to Machine Learning 1

General machine learning workflow:
1. Choose a class of model
2. Choose model hyperparameters
3. Fit the model to the training data ("training")
4. Use the model to predict labels for new data
    - If labels are known (test data, aka 'gold' data), evaluate the performance. 
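
The four steps above can be sketched end to end on made-up data. This is only an illustration: the toy data, the variable names, and the choice of `LinearRegression` are assumptions for the sketch, not part of the salary example that follows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data: y is roughly 3*x + 2 plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(40, 1))
y_demo = 3 * X_demo[:, 0] + 2 + rng.normal(0, 0.5, size=40)

# Hold out a quarter of the rows as test ("gold") data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

demo_model = LinearRegression(fit_intercept=True)  # steps 1-2: model class + hyperparameters
demo_model.fit(X_tr, y_tr)                         # step 3: fit to the training data
y_hat = demo_model.predict(X_te)                   # step 4: predict for unseen data
demo_score = demo_model.score(X_te, y_te)          # evaluate: R^2 on the held-out rows
print(demo_score)
```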

### Three types of ML:
https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html

1. Regression: predicting continuous values
2. Classification: predicting discrete labels
3. Clustering: inferring labels on unlabeled data
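
Regression and classification are both worked through later in this notebook; clustering is not. As a rough sketch of what "inferring labels on unlabeled data" can look like, here is `KMeans` on six made-up points. The points and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: two obvious groups, no labels given
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = km.fit_predict(points)   # one inferred cluster id per point
print(cluster_ids)
```

The cluster ids themselves (0 vs. 1) are arbitrary; what matters is which points end up sharing an id.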

In [None]:
# Turns on/off pretty printing 
%pprint

# Every returned Out[] is displayed, not just the last one. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn               # sklearn is the ML package we will use
import seaborn as sns        # seaborn graphical package

## Regression: predicting continuous labels
- Given years of experience (2 years, 10 years), predict the salary (50K? 90K?) 
- The target (salary) is a continuous numerical value --> regression (here the input feature happens to be continuous too)
- We'll load CSV files directly from a web address. (Yes we can do that!)  

In [None]:
# CSV file found on 
# https://github.com/csjcode/course-machinelearning-az/blob/master/Machine-Learning-A-Z/Part%202%20-%20Regression/Section%204%20-%20Simple%20Linear%20Regression/Salary_Data.csv
# CSV files on GitHub are rendered. Click on "Raw" to get to the raw file. 
# This salary data has cleaner correlation. 
url = "https://raw.githubusercontent.com/csjcode/course-machinelearning-az/master/Machine-Learning-A-Z/Part%202%20-%20Regression/Section%204%20-%20Simple%20Linear%20Regression/Salary_Data.csv"
dataset = pd.read_csv(url)
dataset.columns = ['years_experience', 'salary']

# https://github.com/bokeh/bokeh/blob/master/examples/app/export_csv/salary_data.csv
# This salary data has more variability. 
# url = "https://raw.githubusercontent.com/bokeh/bokeh/master/examples/app/export_csv/salary_data.csv"
# dataset = pd.read_csv(url)

In [None]:
dataset.info()
dataset.head()
dataset.tail()

In [None]:
plt.scatter(dataset['years_experience'], dataset['salary'])
plt.show()

### Preparing data for machine learning. 
Need to create:
- Input data, which we will call X. 1+ columns of data points ("features"). 
    - We have only 1 "feature", however, which is years of experience.  
- Target data, which we will call y. A series of data points. 
    - Target is salary dollar amount. 

In [None]:
x = dataset['years_experience']    # series: lower-case x
X = dataset[['years_experience']]  # dataframe with only one column. Uppercase X. 
y = dataset['salary']              # series

In [None]:
x.head()         # Series version; we won't be using it, just for illustration
X.head()         # input feature(s)
y.head()         # output target values

In [None]:
# sklearn provides a function for splitting data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

In [None]:
len(X_train)
len(X_test)
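
To see what `train_test_split` does with `test_size` and `random_state`, here is a small sketch on made-up arrays (the names `X_toy`, `y_toy`, etc. are hypothetical). It checks the split sizes, that X and y rows stay paired, and that the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, y is always 10 times x
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.arange(20) * 10

Xa, Xb, ya, yb = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
Xa2, Xb2, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

print(len(Xa), len(Xb))             # 15 training rows, 5 test rows
print(np.array_equal(Xb, Xb2))      # same random_state -> same split
print(np.all(yb == Xb[:, 0] * 10))  # rows of X and y are shuffled together
```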

In [None]:
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
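
After fitting, a `LinearRegression` object stores the learned line in its `coef_` (slope) and `intercept_` attributes. A minimal sketch on made-up, perfectly linear data (not the salary data) shows that a prediction is just slope * x + intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on y = 2x + 1
X_line = np.array([[1.0], [2.0], [3.0], [4.0]])
y_line = 2.0 * X_line[:, 0] + 1.0

line_model = LinearRegression().fit(X_line, y_line)
slope = line_model.coef_[0]        # one coefficient per feature
intercept = line_model.intercept_

# A prediction for x = 5 computed by hand from the fitted line
manual = slope * 5.0 + intercept
print(slope, intercept, manual)
```

For the salary model, `regressor.coef_[0]` would be read the same way: the predicted change in salary per additional year of experience.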

In [None]:
# Predicting the Test set results
y_pred = regressor.predict(X_test)

In [None]:
X_test[:5]    # test set, years of experience
y_test[:5]    # test set, real salaries
y_pred[:5]    # salaries predicted by regressor
                 # <-- hopefully not too far away from real numbers! 
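
Eyeballing the numbers is a start, but "not too far away" can be quantified. One possible way is sklearn's `mean_absolute_error` and `r2_score`. The salary figures below are made up for the sketch; in the notebook you would pass `y_test` and `y_pred` instead.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical real vs. predicted salaries
y_true_demo = [50000, 60000, 80000, 100000]
y_pred_demo = [52000, 58000, 83000, 97000]

mae = mean_absolute_error(y_true_demo, y_pred_demo)  # average miss, in dollars
r2 = r2_score(y_true_demo, y_pred_demo)              # 1.0 would be a perfect fit
print(mae)
print(r2)
```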

### Plotting data and prediction
1. On training set
2. On test set

In [None]:
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

In [None]:
plt.scatter(X_test, y_test, color = 'red')
# Same fitted line as in the previous plot: the model only ever saw the training set
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

In [None]:
# How about someone with just 0.5 year of experience? How about 15? 
newdf = pd.DataFrame({'years_experience':[0.5, 15]})
newdf
regressor.predict(newdf)

## Classification: predicting discrete labels

- Textbook example using one of sklearn's built-in dataset loaders (the data is downloaded on first use).
- For detailed explanation, see the textbook section:
 https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html
- Given a short text, can we identify topic labels? 

In [None]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
data.target_names

In [None]:
categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [None]:
type(train)

In [None]:
dir(train)

In [None]:
train.data[5]

In [None]:
train.target[5]

In [None]:
train.target_names

In [None]:
len(train.data)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [None]:
model.fit(train.data, train.target)
labels = model.predict(test.data)

In [None]:
type(labels)
labels[:10]
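
The same vectorizer + Naive Bayes pipeline can be tried on a tiny made-up corpus, which makes it easier to see what goes in and what comes out. The texts and the 0/1 label scheme below are assumptions for this sketch, not the newsgroup data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training texts; label 0 = space, 1 = religion
toy_texts = ["the rocket launched into orbit",
             "astronauts aboard the space station",
             "the church held a prayer service",
             "faith and prayer in the bible"]
toy_labels = [0, 0, 1, 1]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)   # raw strings in, integer labels in

toy_preds = toy_model.predict(["rocket to the space station",
                               "a prayer at church"])
print(toy_preds)                       # one integer label per input text
```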

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, labels)

In [None]:
mat
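
In sklearn's `confusion_matrix`, rows are true labels and columns are predicted labels, which is why the heatmap cell in this notebook plots `mat.T` to put true labels on the x-axis. A small sketch with made-up labels, also showing that accuracy is the diagonal divided by the total:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted class ids
true_ids = [0, 0, 1, 1, 1]
pred_ids = [0, 1, 1, 1, 0]

cm = confusion_matrix(true_ids, pred_ids)
print(cm)   # cm[i, j] = number of items with true class i predicted as class j

# Correct predictions sit on the diagonal
acc_from_cm = np.trace(cm) / cm.sum()
print(acc_from_cm, accuracy_score(true_ids, pred_ids))
```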

In [None]:
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()

In [None]:
tests = ['sending a payload to the ISS', 'I met Santa Claus once']
preds = model.predict(tests)
print(preds)

In [None]:
print(train.target_names[1])
print(train.target_names[2])
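
Rather than looking up each index by hand, predicted integer labels can be mapped to names in one go. In this sketch `names` and `pred_ids_demo` are hypothetical stand-ins for `train.target_names` and the output of `model.predict(...)`.

```python
import numpy as np

# Stand-in for train.target_names (assumed here to be in alphabetical order)
names = ['comp.graphics', 'sci.space',
         'soc.religion.christian', 'talk.religion.misc']
# Stand-in for the integer array returned by model.predict(...)
pred_ids_demo = np.array([1, 2])

pred_names = [names[i] for i in pred_ids_demo]
print(pred_names)
```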

## Under the hood with CountVectorizer and TF-IDF
`TfidfVectorizer()` actually takes care of multiple steps:
- Lowercases and tokenizes text, dropping punctuation and one-character tokens (stop words are kept unless you pass `stop_words`)
- Builds a token count vector
- Converts raw token count into TF-IDF (Term Frequency - Inverse Document Frequency)

Textbook section on TF-IDF: https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html#Text-Features

Better explanation here: http://www.tfidf.com/

In [None]:
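# import CountVectorizer and NLTK 
# nltk.word_tokenize needs NLTK's tokenizer data; if it is missing, NLTK's
# error message names the resource to fetch with nltk.download(...)
from sklearn.feature_extraction.text import CountVectorizer
import nltk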

In [None]:
sents = ['A rose is a rose is a rose is a rose.',
         'Oh, what a fine day it is.',
        "It ain't over till it's over, I tell you!!"]

In [None]:
# Initialize a CountVectorizer to use NLTK's tokenizer instead of its 
# default one (which ignores punctuation and one-character tokens). 
# Minimum document frequency set to 1, but with larger corpora you can set it to a higher number.  
foovec = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)

In [None]:
# sents turned into a sparse matrix of word frequency counts (one row per sentence)
sents_counts = foovec.fit_transform(sents)
# foovec now contains vocab dictionary which maps unique words to indexes
foovec.vocabulary_

In [None]:
# sents_counts has a dimension of 3 (document count) by 19 (# of unique words)
sents_counts.shape

In [None]:
# this vector is small enough to view in full! 
sents_counts.toarray()
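
For comparison, here is a sketch of what the default `CountVectorizer()` tokenizer does with the same three sentences (no NLTK involved). `sents_demo` and the other names are local to this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

sents_demo = ['A rose is a rose is a rose is a rose.',
              'Oh, what a fine day it is.',
              "It ain't over till it's over, I tell you!!"]

default_vec = CountVectorizer()        # default token pattern, no stop word list
default_counts = default_vec.fit_transform(sents_demo)

vocab = sorted(default_vec.vocabulary_)
print(vocab)
print(default_counts.shape)
```

Punctuation and one-character tokens such as "a" and "I" are gone, so the vocabulary is smaller than with NLTK's tokenizer, but common words like "is" and "it" are still counted.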

In [None]:
# Convert raw frequency counts into TF-IDF (Term Frequency -- Inverse Document Frequency) values
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

In [None]:
# TF-IDF values
# each row (document) is scaled to unit length, which factors out document length; 
# terms that are found across many docs are weighted down
sents_tfidf.toarray()
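
To make the TF-IDF values less of a black box, the sketch below recomputes them by hand for a small made-up count matrix and compares against `TfidfTransformer`. It assumes the transformer's default settings (smoothed IDF, L2 row normalization); the count matrix is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[4, 3, 0],
                   [1, 0, 1],
                   [0, 0, 2]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # in how many docs each term occurs
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed inverse document frequency
weighted = counts * idf                      # term frequency times idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2 per row

sk_tfidf = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
print(np.allclose(manual_tfidf, sk_tfidf))
```

The first term appears in two of the three documents, so it gets a lower IDF than the second term, which appears in only one.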