# Machine Learning Recipes

## 0. General info
- Course: Machine Learning Recipes
- Channel: YouTube
- Written by: Josh Gordon
- Course [link](https://www.youtube.com/playlist?list=PLOU2XLYxmsIIuiBfYad6rFYQU_jL2ryal)

## 4. Basic Pipeline
"Learning is the process of using training data to adjust the parameters 
of the model."
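
As a rough illustration of what "adjusting the parameters" can mean, here is a minimal sketch that assumes the model is a straight line `y = m*x + b` trained with plain gradient descent. The function name `train_line`, the toy data, the learning rate, and the number of epochs are all assumptions made for this example, not something taken from the course.

```python
def train_line(points, learning_rate=0.01, epochs=5000):
    """Fit y = m*x + b to (x, y) pairs by gradient descent on mean squared error."""
    m, b = 0.0, 0.0  # the parameters start at arbitrary values
    n = len(points)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to m and b
        grad_m = sum(2 * (m * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in points) / n
        # Adjust each parameter a small step against its gradient
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Toy training data generated from y = 2x + 1 (an assumption for illustration)
training_data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]
m, b = train_line(training_data)
print(round(m, 2), round(b, 2))  # m and b end up close to 2 and 1
```

The training data never changes; only `m` and `b` move, which is the sense in which the model "learns".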

In [1]:
# import iris dataset
from sklearn import datasets
iris = datasets.load_iris()

# Classifier as f(x) = y
X = iris.data   # features
y = iris.target  # labels

from sklearn.model_selection import train_test_split
# Randomly split the given dataset in half - train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5)


def fitAndScore(classifier):
    # Train the classifier passed in as the argument (not a global variable)
    classifier.fit(X_train, y_train)
    # Predict labels from the test set
    predictions = classifier.predict(X_test)

    from sklearn.metrics import accuracy_score
    # Measure the classifier score based on the predictions vs the known labels
    return accuracy_score(y_test, predictions)

from sklearn import tree
my_classifier = tree.DecisionTreeClassifier()
print("DecisionTreeClassifier:", fitAndScore(my_classifier))

from sklearn.neighbors import KNeighborsClassifier
my_classifier = KNeighborsClassifier()
print("KNeighborsClassifier:", fitAndScore(my_classifier))

DecisionTreeClassifier: 0.946666666667
KNeighborsClassifier: 0.946666666667
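
To make the two helper steps of this pipeline more concrete, here is a rough pure-Python sketch of what a random half split and an accuracy score compute. The names `split_in_half` and `accuracy` are hypothetical stand-ins written for this note; they are not sklearn's implementation.

```python
import random

def split_in_half(features, labels, seed=None):
    """Shuffle the row indices and return train/test halves (hypothetical helper)."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    middle = len(indices) // 2
    train_idx, test_idx = indices[:middle], indices[middle:]
    # Features and labels are selected with the same indices so rows stay aligned
    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the known labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 3 of 4 correct -> 0.75
```

Because the split is random, the scores printed above change from run to run. Passing `random_state` to sklearn's `train_test_split` fixes the split and makes the numbers reproducible.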


**TensorFlow playground**: Adjust the parameters and watch a neural network learn ([link](http://goo.gl/cv7Dq5)).

## 5. Writing a Classifier
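The following classifier is *based* on the k-Nearest Neighbors algorithm. In this algorithm, the classifier predicts the label of an input from the labels of the k nearest training points, where "nearest" is measured by distance. The implementation in this section is the simplest case, k = 1: it returns the label of the single closest training point.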

For a 2D example (a dataset with 2 features), the Euclidean distance is calculated as: `distance = sqrt((X2-X1)² + (Y2-Y1)²)`

The Euclidean distance works the same way regardless of the number of dimensions (features), hence: `distance = sqrt((X2-X1)² + (Y2-Y1)² + ... + (N2-N1)²)`
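
The formula can be sketched in a few lines of plain Python. The function name `euclidean` and the sample points are chosen here for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared difference of each coordinate pair, then take the root
    return math.sqrt(sum((q - p) ** 2 for p, q in zip(a, b)))

print(euclidean((0, 0), (3, 4)))              # 2 features -> 5.0
print(euclidean((1, 2, 3, 4), (1, 2, 3, 4)))  # identical points -> 0.0
```

The same function handles 2 features or 4 (as in the iris dataset) without any change, which is the point of the general formula.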

Pros and cons of this algorithm:
- Pros: It's a relatively simple algorithm.
- Cons: It's computationally intensive, since each prediction has to iterate through the whole training dataset. Additionally, it's hard to represent relationships between features.

In [17]:
from scipy.spatial import distance
#import random

def euc(a,b):
    return distance.euclidean(a,b)

# Basic classifier based on k-Nearest Neighbors
class ScrappyKNN():
    # Minimal methods for a classifier: fit, predict
    # fit requires the feature and label sets (parameters)
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
        
    # predict requires the features test set - outputs the predicted labels
    def predict(self, X_test):
        predictions = []
        for row in X_test:
            #label = random.choice(self.y_train)
            label = self.closest(row)
            predictions.append(label)
        return predictions
    
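    def closest(self, row):
        # Linear scan: start with the first training point as the best match
        closest_dist = euc(row, self.X_train[0])
        closest_index = 0
        for i in range(1, len(self.X_train)):
            dist = euc(row, self.X_train[i])
            # Keep the training point with the smallest distance seen so far
            if dist < closest_dist:
                closest_dist = dist
                closest_index = i
        # Return the label of the nearest training point (k = 1)
        return self.y_train[closest_index]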

# import iris dataset
from sklearn import datasets
iris = datasets.load_iris()

# Classifier as f(x) = y
X = iris.data   # features
y = iris.target  # labels

from sklearn.model_selection import train_test_split
# Randomly split the given dataset in half - train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5)

my_classifier = ScrappyKNN()
my_classifier.fit(X_train, y_train)
predictions = my_classifier.predict(X_test)

from sklearn.metrics import accuracy_score
# Measure the classifier score based on the predictions vs the known labels
print(accuracy_score(y_test, predictions))

0.96
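
`ScrappyKNN` looks only at the single closest training point (k = 1). The following is a hedged sketch of how the same idea could extend to a majority vote among k neighbors. The class name `ScrappyKNNVote`, the default `k=3`, the toy data, and the tie-breaking behavior are assumptions made for illustration; they are not part of the course code. It uses `math.dist`, which needs Python 3.8 or later.

```python
import math
from collections import Counter

class ScrappyKNNVote:
    def __init__(self, k=3):
        self.k = k  # number of neighbors that get a vote

    def fit(self, X_train, y_train):
        # k-NN has no real training step; it just memorizes the data
        self.X_train = list(X_train)
        self.y_train = list(y_train)

    def predict(self, X_test):
        return [self._vote(row) for row in X_test]

    def _vote(self, row):
        # Distance from the query row to every training point
        distances = [
            (math.dist(row, point), label)
            for point, label in zip(self.X_train, self.y_train)
        ]
        # Sort by distance only and keep the labels of the k closest points
        distances.sort(key=lambda pair: pair[0])
        nearest_labels = [label for _, label in distances[: self.k]]
        # Majority label; among tied counts, the label seen first (closest) wins
        return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: three points near the origin labeled "a", three near (5, 5) labeled "b"
clf = ScrappyKNNVote(k=3)
clf.fit(
    [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)],
    ["a", "a", "a", "b", "b", "b"],
)
print(clf.predict([(0.5, 0.5), (5.5, 5.5)]))  # ['a', 'b']
```

With `k=1` this sketch behaves like `ScrappyKNN`. Larger values of k make a prediction less sensitive to a single mislabeled or unusual training point, at the cost of sorting all distances for every query.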


## 6. Image Classifier with TensorFlow
TensorFlow Codelab. In this codelab we retrain an existing image classifier, Inception, a technique known as transfer learning. Inception was trained on 1.2 million images and is open source. Retraining in this example takes 20 minutes, versus the 2 weeks the initial training took on a computer with 8 GPUs.
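
The reason retraining is so much cheaper can be sketched conceptually: the pretrained layers stay frozen and only a small new final layer is trained on top of their output. The toy below is not Inception and does not use TensorFlow. The `frozen_features` function (mean brightness and contrast of a flat pixel list), the tiny "images", and the hyperparameters are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math

def frozen_features(image):
    """Stand-in for the pretrained layers: fixed, never updated during retraining."""
    mean = sum(image) / len(image)       # hypothetical feature 1: brightness
    contrast = max(image) - min(image)   # hypothetical feature 2: contrast
    return [mean, contrast]

def retrain_final_layer(images, labels, learning_rate=0.5, epochs=2000):
    """Train only a new logistic-regression layer on top of the frozen features."""
    features = [frozen_features(img) for img in images]  # computed once, then reused
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1 / (1 + math.exp(-z))   # predicted probability of label 1
            error = p - y
            # Only the final layer's weights and bias are adjusted
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def predict(image, weights, bias):
    x = frozen_features(image)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

# Toy "images" as flat pixel lists: label 0 = dark, label 1 = bright
dark = [[0.1, 0.2, 0.1, 0.2], [0.0, 0.1, 0.2, 0.1]]
bright = [[0.8, 0.9, 0.9, 0.8], [0.9, 1.0, 0.7, 0.8]]
weights, bias = retrain_final_layer(dark + bright, [0, 0, 1, 1])
print(predict([0.2, 0.1, 0.3, 0.2], weights, bias))  # a dark-ish image
print(predict([0.7, 0.9, 0.8, 0.9], weights, bias))  # a bright-ish image
```

In the real codelab the frozen part is a deep network that took weeks to train, so reusing it and fitting only the last layer is where the time saving comes from.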

**The keys to training a good classifier are diversity (different kinds of images for each label) and quantity of images.**

This lab uses TensorFlow for Poets, a codelab built on TensorFlow that offers a high-level workflow similar to sklearn (fit, predict, ...).

## 7. Image Classifier with TF.Learn
Based on a TensorFlow Docker image, the episode walks through the whole process: installation, downloading the dataset, visualizing the images, training a classifier, evaluating its accuracy, and finally visualizing the weights graphically.