
# Machine Learning
*I am always ready to learn although I do not always like being taught. -Winston Churchill*

In [0]:
# This cell is only needed on Google Colab; skip it (or ignore any error) when running locally or in Binder
# Authorise and mount google drive to access code and data files
import os
from google.colab import drive
drive.mount('/content/drive')
project_folder = '/content/drive/My Drive/git_repos/data-science/'
os.chdir(project_folder)

In [0]:
%%capture
# To suppress the output when calling another file
%run ./WorkingWithData.ipynb

### Overfitting and underfitting
The simplest way to check for overfitting or underfitting is to split the data set so that part of it (say, two-thirds) is used to train the model, after which we measure the model's performance on the remaining third.

A second problem arises if we use the train/test split not just to judge a model but also to choose among many models. In that case, although each individual model may not be overfit, "choose the model that performs best on the test set" is a meta-training step that makes the test set function as a second training set. (Of course the model that performed best on the test set is going to perform well on the test set.)<br/>
In such a situation, we should split the data into three parts: a *training* set for building models, a *validation* set for choosing among trained models, and a *test* set for judging the final model.

In [0]:
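import random

def split_data(data, prob):
    """Split data into fractions [prob, 1 - prob]."""
    results = [], []
    for row in data:
        # each row lands in the first list with probability prob
        if random.random() < prob:
            results[0].append(row)
        else:
            results[1].append(row)
    return results

def train_test_split(x, y, test_pct):
    """Randomly assign roughly test_pct of the (x, y) pairs to the test set."""
    data = list(zip(x, y))                    # keep each x paired with its y
    test, train = split_data(data, test_pct)  # first list gets ~test_pct of the rows
    # note: zip(*...) raises ValueError if either split happens to be empty
    x_test, y_test   = zip(*test)
    x_train, y_train = zip(*train)
    return x_train, x_test, y_train, y_test

y = list(range(100, 110))
x = list(range(10))
test_pct = 0.3

print(train_test_split(x, y, test_pct))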

((0, 1, 2, 4, 5, 6, 8), (3, 7, 9), (100, 101, 102, 104, 105, 106, 108), (103, 107, 109))
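
The three-way split described above (training, validation, test) can be sketched in the same spirit. This is a minimal illustration and not code from the original notebook: the function name `three_way_split`, the 60/20/20 fractions, and the fixed seed are all assumptions chosen for the example. It also shuffles and slices rather than flipping a coin per row, so the split sizes are exact rather than approximate.

```python
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of data and cut it into train / validation / test lists.

    The fractions and the seed are illustrative defaults, not values from the text.
    """
    rows = list(data)
    random.Random(seed).shuffle(rows)        # local RNG, leaves the global state alone
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]                   # used to fit each candidate model
    validation = rows[n_train:n_train + n_val]  # used to choose among the models
    test = rows[n_train + n_val:]            # whatever remains, used once at the end
    return train, validation, test

pairs = list(zip(range(10), range(100, 110)))
train, validation, test = three_way_split(pairs)
print(len(train), len(validation), len(test))
```

The test portion should be touched only once, after a model has been chosen on the validation portion; otherwise it starts acting as another training set again.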


### Correctness

|         *         |   Spam           |  not Spam      |
|:------------------|:----------------:|:--------------:|
|predict "Spam"     | True Positive    | False Positive |
|predict "not Spam" | False Negative   | True Negative  |

**Accuracy** Fraction of all predictions that were correct  
**Precision** Fraction of *positive* predictions that were correct  
**Recall** Fraction of actual positives that the model identified  
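
The four cells of the table above can be tallied directly from paired predictions and true labels. The sketch below is an illustration only: the helper name `confusion_counts` and the toy data are made up for this example and are not part of the original notebook. It returns the counts in the order `(tp, fp, fn, tn)`.

```python
def confusion_counts(predicted, actual):
    """Return (tp, fp, fn, tn) for paired boolean predictions and true labels."""
    tp = fp = fn = tn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1      # predicted spam, really spam
        elif p and not a:
            fp += 1      # predicted spam, really not spam
        elif not p and a:
            fn += 1      # predicted not spam, really spam
        else:
            tn += 1      # predicted not spam, really not spam
    return tp, fp, fn, tn

# Toy data, made up for illustration: True means "spam"
predicted = [True, True, False, False, True, False]
actual    = [True, False, True, False, True, False]
tp, fp, fn, tn = confusion_counts(predicted, actual)
print(tp, fp, fn, tn)
```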

*Accuracy* by itself can be misleading. Consider a "test" that predicts leukemia whenever a person is named Luke:

|      *    |   Leukemia |  no Leukemia |   Total   |
|:----------|:----------:|:------------:|:---------:|
|"Luke"     | 70         | 4,930        | 5,000     |
|not "Luke" | 13,930     | 981,070      | 995,000   |
|total      | 14,000     | 986,000      | 1,000,000 |


In [0]:
def accuracy(tp, fp, fn, tn):
    correct = tp + tn
    total   = tp + fp + tn + fn
    return correct/total

def precision(tp, fp, fn, tn):
    return (tp)/(tp+fp)

def recall(tp, fp, fn, tn):
    return (tp)/(tp+fn)

print("Accuracy (very high):", round(accuracy(70,4930,13930,981070),4))
print("Precision (poor):", round(precision(70,4930,13930,981070),4))
print("Recall (extremely poor):", round(recall(70,4930,13930,981070),4))

def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2/(1/p+1/r)

print("F1 Score (extremely poor):", round(f1_score(70,4930,13930,981070),4))

Accuracy (very high): 0.9811
Precision (poor): 0.014
Recall (extremely poor): 0.005
F1 Score (extremely poor): 0.0074
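
The expression `2/(1/p+1/r)` in `f1_score` is the harmonic mean of precision and recall. Written out with the counts from the leukemia table (TP = 70, FP = 4,930, FN = 13,930), the arithmetic behind the printed values is:

```latex
\begin{aligned}
p   &= \frac{TP}{TP + FP} = \frac{70}{70 + 4930} = 0.014 \\
r   &= \frac{TP}{TP + FN} = \frac{70}{70 + 13930} = 0.005 \\
F_1 &= \frac{2}{\frac{1}{p} + \frac{1}{r}} = \frac{2pr}{p + r}
     = \frac{2 \times 0.014 \times 0.005}{0.014 + 0.005} \approx 0.0074
\end{aligned}
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model scores well on F1 only when precision and recall are both reasonably high.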
