# Learning Machine Learning

These are notes made while following Udacity's intro to ML course.

The code snippets are the key lines for each concept being worked on, but mostly can't be run in this notebook.

## Lesson 2: Naive Bayes

This is a supervised classification algorithm. The idea is to work out which source something came from, based on how probable its observed features are under each source.

Think of text analysis: two people send emails, each with their own word probabilities. Given an email containing certain words, you can work out the probability that it came from each person. It's "naive" because it treats every feature as independent of the others; for text, this means it ignores word order.

One particular feature of Naive Bayes is that it’s a good algorithm for working with text classification. When dealing with text, it’s very common to treat each unique word as a feature, and since the typical person’s vocabulary is many thousands of words, this makes for a large number of features. The relative simplicity of the algorithm and the independent features assumption of Naive Bayes make it a strong performer for classifying texts.
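
The two-sender email idea can be sketched in a few lines of plain Python. The senders, words and probabilities below are made up for illustration; this is a toy version of the calculation, not what `GaussianNB` does internally.

```python
import math

# Hypothetical per-word probabilities for two senders (made-up numbers).
word_probs = {
    "alice": {"love": 0.1, "deal": 0.8, "life": 0.1},
    "bob": {"love": 0.5, "deal": 0.2, "life": 0.3},
}
# Assume both senders are equally likely before we read the email.
priors = {"alice": 0.5, "bob": 0.5}


def posterior(words):
    """Return P(sender | words), treating every word as independent."""
    # Work in log space so that long emails do not underflow to zero.
    log_scores = {}
    for sender, probs in word_probs.items():
        log_scores[sender] = math.log(priors[sender]) + sum(
            math.log(probs[w]) for w in words
        )
    # Normalise so that the posteriors sum to one.
    total = sum(math.exp(s) for s in log_scores.values())
    return {sender: math.exp(s) / total for sender, s in log_scores.items()}


result = posterior(["life", "deal"])
print(result)
```

Because the word probabilities are simply multiplied together, `posterior(["deal", "life"])` gives the same answer: word order makes no difference.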

In [8]:
import numpy as np

# These numbers are random, so this isn't a very useful classifier
features_train=np.array([[1,1],[5,8],[7,0],[1,5],[4,5],[1,2],[7,7],[3,1],[0,5]])
labels_train=np.array([1,2,1,1,2,2,1,1,1])


# Import gaussian naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Generate a classifier
clf=GaussianNB()
clf.fit(features_train, labels_train)

print(clf.predict([[2,3]]))


[1]


Below is the full code written at the end of the lesson (it won't run on its own).

The timing measurements show that prediction is much faster than training, by a factor of roughly 30.

In [None]:
from time import time  # needed for the timing calls below
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
print ("training time:", round(time()-t0, 3), "s")


t1 = time()
pred = clf.predict(features_test)
print ("prediction time:", round(time()-t1, 3), "s")

from sklearn.metrics import accuracy_score

Accuracy = accuracy_score(labels_test, pred)

print(Accuracy)


## Lesson 3: Support Vector Machines (SVMs)

Another classifier, and a very popular one that often performs well.

Broadly, SVMs maximise the margin between the classes, but can tolerate some outliers.
The important thing about how we use these is kernels. A kernel maps a low-dimensional input space to a higher-dimensional feature space, where the SVM can then find a linear separation.

Training time scales somewhere between quadratically and cubically with the number of samples, so they are difficult to use on large datasets. They are also sensitive to noise and vulnerable to overfitting.

A few important parameters:
- Kernel
- Gamma
- C

Control of these parameters is important to avoid overfitting.

### C
Controls the trade-off between classifying training points correctly and having a smooth decision boundary. A large value of C means more training points will be classified correctly (fewer points are allowed to be outliers), at the cost of a more complicated boundary.

### Gamma
Defines the reach of each training example. Low values mean a far reach, high values a close reach.
A low value (far reach) tends to smooth the decision boundary, because points far from the boundary also influence it, which reduces the relative impact of the points close to the boundary.
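
To make "reach" concrete, here is a small sketch of the RBF kernel, which scores the similarity of two points as exp(-gamma * squared distance). The points and gamma values are arbitrary, chosen only for illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2) between two points."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


training_point = (0.0, 0.0)
far_point = (3.0, 0.0)  # three units away from the training point

# Low gamma: the training point still "reaches" the far point.
low_gamma = rbf_kernel(training_point, far_point, gamma=0.01)
# High gamma: the influence has died away almost completely.
high_gamma = rbf_kernel(training_point, far_point, gamma=10.0)

print(low_gamma, high_gamma)
```

With the low gamma the similarity stays close to 1, while with the high gamma it is practically zero, so only very close neighbours matter.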






In [None]:
from time import time  # needed for the timing calls below
from sklearn.svm import SVC

clf = SVC(kernel='linear')

t0 = time()
clf.fit(features_train, labels_train)
print ("training time:", round(time()-t0, 3), "s")

t1 = time()
pred = clf.predict(features_test)
print ("prediction time:", round(time()-t1, 3), "s")

from sklearn.metrics import accuracy_score

Accuracy = accuracy_score(labels_test, pred)

print(Accuracy)

## Lesson 4: Decision Trees

Decision trees allow you to ask multiple linear questions, one after the other. They are prone to overfitting, but the process is very simple. You can build very big classifiers with these.

There are several important parameters for tuning the accuracy and avoiding overfitting, such as:

__Minimum samples split (`min_samples_split`)__
This governs how small a node can get before you stop splitting it.
The default is 2, but this can be increased to avoid overfitting.

### Entropy
The decision tree makes decisions by splitting the data into subsets that are as pure as possible.
Entropy goes between 0 (all one class) and 1.0 (evenly split between two classes).
The entropy is calculated as:

$$-\sum_{i} P_{i}\log_{2}(P_{i})$$

where $P_{i}$ is the fraction of the samples that belong to class $i$.

### Information Gain

The decision tree acts to maximise information gain, which is the entropy of the parent minus the weighted average of the entropy of the children that could be created.
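
Entropy and information gain are easy to compute by hand. The sketch below uses a made-up example of four samples labelled "slow" or "fast"; the function names are my own, not part of sklearn.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["slow", "slow", "fast", "fast"]

# A split that leaves one child mixed and one child pure.
partial_split = [["slow", "slow", "fast"], ["fast"]]
# A split that separates the classes perfectly.
perfect_split = [["slow", "slow"], ["fast", "fast"]]

print(entropy(parent))
print(information_gain(parent, partial_split))
print(information_gain(parent, perfect_split))
```

The parent is evenly split, so its entropy is 1.0. The perfect split recovers all of that as information gain, while the partial split gains only about 0.31.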

### Bias-Variance dilemma

High bias means the algorithm pays little attention to the data, so it doesn't learn much from it.
High variance means it pays too much attention to the training data, so it doesn't generalise well to new data.
The balance between these two is key to optimising any algorithm.


In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

clf.fit(features_train, labels_train)

pred = clf.predict(features_test)


## Lesson 5: Choose your own: Adaboost

AdaBoost (adaptive boosting), like Random Forest, is an ensemble method that combines the outputs of various weak classifiers to give a better classification. As it goes on, it identifies hard-to-classify data, and later classifiers focus on these.

Each iteration gives a classification, which is then evaluated. Wrongly classified datapoints have their weight increased, while correctly classified ones have their weight decreased, and a new weak classifier is trained on this modified dataset. This process is iterated and the classifiers are then combined. When combining the stumps, the weight of each stump is taken into account.

__Basic ideas:__

1. Combines weak learners, usually stumps (a tree with just two leaves)
2. Each stump gets a weighting
3. Each stump takes the previous mistakes into account

__Detailed:__

1. Each datapoint initially has an equal weight.
2. Find the best initial weak classifier.
3. Give the stump a weight based on its total error, which is the sum of the weights of the incorrectly classified samples (low is good): $\text{stump weight} = \frac{1}{2}\ln\left(\frac{1 - \text{total error}}{\text{total error}}\right)$
4. A negative weight means the stump's answers are reversed.
5. The weights of incorrectly classified samples are increased to $\text{sample weight} \times e^{\text{stump weight}}$
6. The weights of correctly classified samples are decreased to $\text{sample weight} \times e^{-\text{stump weight}}$
7. These weights then need normalising so that they sum to one.
8. A new collection of samples is built by randomly picking from the original, but using the weights to give the probability of each sample. Hence previously wrongly classified samples are likely to appear multiple times.
9. We then go back to step 1 with the new samples, as many times as we like.
10. We then use the weighted vote of all the stumps as the final classification.
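
One round of the weight update can be sketched as follows. The list of correct/incorrect results is made up, and the stump weight uses the usual formula of half the log of (1 - error) / error, which is an assumption here about the exact variant being described.

```python
import math

# Hypothetical result of one stump: True where it classified the sample correctly.
correct = [True, True, True, False, True]

# Every sample starts with the same weight.
weights = [1 / len(correct)] * len(correct)

# Total error is the sum of the weights of the misclassified samples.
error = sum(w for w, ok in zip(weights, correct) if not ok)

# Stump weight: large and positive when the error is small.
stump_weight = 0.5 * math.log((1 - error) / error)

# Scale misclassified samples up and correctly classified samples down.
new_weights = [
    w * math.exp(-stump_weight if ok else stump_weight)
    for w, ok in zip(weights, correct)
]

# Normalise so that the weights sum to one again.
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(stump_weight)
print(new_weights)
```

After this round the single misclassified sample carries half of the total weight, so it is much more likely to be picked when the next collection of samples is drawn.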

`n_estimators` is a key parameter: it sets the maximum number of weak classifiers. A large number makes overfitting more likely.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

clf= AdaBoostClassifier(n_estimators=25)

## Lesson 6: Datasets and Questions
### Finding Fraudsters in Enron email

Persons of interest (POI) are people who were indicted or settled out of court.

- 35 POIs, 30 of whom worked at Enron.
- Emails from about 150 people in total.
- Only 4 of the email inboxes belong to POIs.

This is not necessarily a complete list. Many people may not have been caught and many people

What's more, we don't really know whether there are enough of these people to give the accuracy we want. Accuracy vs. Training set size needs consideration. Using more data is often a better way of improving outcomes, more than 


This lesson just makes you explore the data. One point highlighted was that if data comes from different sources, you have to be careful to merge it without introducing biases.

## Lesson 7: Regression

Previous work has had a constrained output, a discrete, binary answer. We now look at continuous supervised learning, where the output is no longer discrete. This implies there's an order, like age or salary.

We are trying to find an equation for a line of best fit to the data. These algorithms work to minimise the sum of squared errors. The errors are squared to ensure that there is one unique solution, as well as to make the computation easier.

Two common algorithms for minimising the sum of squared errors are:

- Ordinary least squares
- Gradient descent

We evaluate the fit with R-squared.
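
For a single feature, ordinary least squares has a closed-form answer, and R-squared follows directly from the errors. This is a minimal sketch with made-up ages and net worths; `fit_line` and `r_squared` are my own helper names.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """1 - (sum of squared errors) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data: net worth rises roughly linearly with age.
ages = [20, 30, 40, 50]
net_worths = [110, 190, 260, 340]

slope, intercept = fit_line(ages, net_worths)
score = r_squared(ages, net_worths, slope, intercept)
print(slope, intercept, score)
```

An R-squared close to 1 means the line explains almost all of the variation in the target. A value near 0 means it does little better than always predicting the mean.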



In [None]:
from sklearn.linear_model import LinearRegression

reg=LinearRegression()

reg.fit(feature_train, target_train)

print("Slope is: \t"+ str(reg.coef_))
print("Intercept is: \t"+ str(reg.intercept_))
print("R-squared for training data is: \t"+ str(reg.score(feature_train, target_train)))
print("R-squared for test data is: \t"+ str(reg.score(feature_test, target_test)))


## Lesson 8: Outliers in Regression

As we saw in Lesson 7, outliers can have a very strong effect on regression. Swapping an outlier from the training set to the test set can make a huge difference to the prediction and the score.

Outliers are often problems that we want to ignore, although sometimes they could be the most interesting/key points!

A simple process to detect outliers is:

1. train
2. remove data with highest residual error (say 10% highest)
3. train again

2 and 3 could be iterated


In [None]:
def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where 
        each tuple is of the form (age, net_worth, error).
    """
    import numpy as np
    
    cleaned_data = []

    errors = abs(np.subtract(net_worths,predictions))
    maxErr = np.percentile(errors,90)
    
    for error,age,net_worth in zip(errors,ages,net_worths):
        if error<=maxErr:
            cleaned_data.append((age,net_worth,error))
    print (len(cleaned_data))
    
    return cleaned_data


## Lesson 9: Clustering

Unsupervised learning is for datasets without labels. Clustering is one such method, which finds groups in the data, e.g. Netflix wanting to know which movies to suggest: do they fall in your cluster?

### k-means algorithm

Each cluster could be given a cluster centre and it is this idea that is used in k-means.

1. Determine how many clusters to look for (n).
2. Create n random cluster centres
3. Assign each point to the nearest cluster centre
4. Optimise: Move the centres to minimise the sum of squared distances for each cluster
5. Iterate 3. and 4.

You need to choose the number of clusters, and can also set the maximum number of iterations and the number of initialisations that are tried.

The initial conditions are key to the end result. There is always a risk of ending up in a local minimum, so we reduce this risk by running the algorithm several times from different starting points and keeping the best outcome.
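
The steps above, including the repeated random starts, can be sketched in plain Python. The points are made up, and starting from randomly chosen data points is just one possible way of creating the initial centres.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=10, seed=0):
    """One k-means run; returns (centres, sum of squared distances)."""
    rng = random.Random(seed)
    # Step 2: start from k randomly chosen data points (an assumption;
    # other initialisation schemes exist).
    centres = rng.sample(points, k)
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)
        # Step 4: move each centre to the mean of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:
                centres[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    inertia = sum(min(sq_dist(p, c) for c in centres) for p in points)
    return centres, inertia


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Run from several random starts and keep the run with the lowest total.
best_centres, best_inertia = min(
    (kmeans(points, k=2, seed=s) for s in range(5)), key=lambda run: run[1]
)
print(sorted(best_centres), best_inertia)
```

Moving each centre to the mean of its points is what minimises the sum of squared distances within that cluster, which is why step 4 is just an average.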





In [None]:
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=2)
clf.fit(finance_features)

pred=clf.labels_

## Lesson 10: Feature Scaling

Features that you want to use should be on a similar scale, e.g. normalised to lie between 0 and 1.

$$ x' = \frac{x-x_{min}}{x_{max}-x_{min}}$$

Outliers will distort this scaling function quite severely.
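
A quick sketch of the formula, with made-up salaries, shows how a single outlier squashes everything else towards zero. `min_max_scale` is my own helper, not the sklearn scaler.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale values that are all identical")
    return [(v - lo) / (hi - lo) for v in values]


salaries = [20_000, 40_000, 60_000]
scaled = min_max_scale(salaries)
print(scaled)

# Add one extreme value and rescale.
scaled_with_outlier = min_max_scale(salaries + [1_000_000])
print(scaled_with_outlier)
```

Without the outlier the three salaries are spread evenly over 0 to 1. With it, they are all crammed below 0.05, so the differences between them almost vanish.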

In [1]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

scaler=MinMaxScaler()
finance_features_scaled=scaler.fit_transform(finance_features)


## Lesson 11: Text Learning

Text is key to many web 

### Bag of words

Make a frequency count for each key word, which you can then use as your features.

Stopwords are words that you want to ignore, e.g. "the".



In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
StopW=stopwords.words('english')

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?']
vectoriser = CountVectorizer(stop_words=StopW)
X = vectoriser.fit_transform(corpus)
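# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectoriser.get_feature_names_out().tolist())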
print(X.toarray())

print (vectoriser.vocabulary_.get("first"))
print (vectoriser.vocabulary_.get("first"))



['document', 'first', 'one', 'second', 'third']
[[1 1 0 0 0]
 [2 0 0 1 0]
 [0 0 1 0 1]
 [1 1 0 0 0]]
1
1


In [19]:
len(StopW)

179

### Stemming

Equally, lots of words take different forms but are basically the same, e.g. response, responded, responsive.

These can all be grouped into one stem e.g. "respon". This is normally taken from a tool made by professional linguists.


In [3]:
from nltk.stem.snowball import SnowballStemmer
import string

text_string='Python string method translate() returns a copy of the string in which all characters have been translated using table (constructed with the maketrans() function in the string module), optionally deleting all characters found in the string deletechars.'

# Remove punctuation
text_string = text_string.translate(text_string.maketrans("", "", string.punctuation))

stemmer = SnowballStemmer("english")

words=text_string.split(" ")

stems=[]
for word in words:
    stems.append(stemmer.stem(word))

print(stems)


['python', 'string', 'method', 'translat', 'return', 'a', 'copi', 'of', 'the', 'string', 'in', 'which', 'all', 'charact', 'have', 'been', 'translat', 'use', 'tabl', 'construct', 'with', 'the', 'maketran', 'function', 'in', 'the', 'string', 'modul', 'option', 'delet', 'all', 'charact', 'found', 'in', 'the', 'string', 'deletechar']


### TF IDF

- Term frequency (TF): like bag of words, a count of how often a word appears in a document.
- Inverse document frequency (IDF): weights each word by how rare it is across the whole corpus.

This rates the rare words higher than the more common words.
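
A hand-rolled version makes the weighting visible. This sketch uses the plain textbook form, count times log(N / document frequency), on a made-up tokenised corpus. sklearn's `TfidfVectorizer` uses a smoothed and normalised variant, so its numbers will differ.

```python
import math

# Made-up corpus, already tokenised with stopwords removed.
docs = [
    ["first", "document"],
    ["second", "document", "document"],
    ["third", "one"],
]


def tf_idf(term, doc, corpus):
    """Term count in doc times log(N / number of docs containing the term)."""
    tf = doc.count(term)
    # The term must appear somewhere in the corpus, otherwise df is zero.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf


rare_score = tf_idf("first", docs[0], docs)
common_score = tf_idf("document", docs[0], docs)
print(rare_score, common_score)
```

Both words appear once in the first document, but "first" occurs in only one document while "document" occurs in two, so "first" gets the higher score.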

In [None]:


from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
StopW=stopwords.words('english')

vectoriser = TfidfVectorizer(stop_words=StopW)
X = vectoriser.fit_transform(word_data)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectoriser.get_feature_names_out())
print(X.toarray())

## Lesson 12: Feature Selection

"Make things as simple as possible, but no simpler." Albert Einstein

Choosing the features used to train your algorithms is vital to making them work, so this needs to be done very carefully.

### Adding a new feature

- Use intuition: what might be a useful indicator of what you're looking for
- Code it up
- Visualise
- Learn and repeat

Beware of bugs! If it looks too good to be true, you are probably tracking labels inaccurately.

### Getting rid of a feature

Why get rid?
- It's noisy
- It's highly correlated with another feature
- It slows down the computation time

Features != Information

There are several go-to methods of automatically selecting your features in sklearn. Many of them fall under the umbrella of univariate feature selection, which treats each feature independently and asks how much power it gives you in classifying or regressing.

There are two big univariate feature selection tools in sklearn: SelectPercentile and SelectKBest. The difference is pretty apparent by the names: SelectPercentile selects the X% of features that are most powerful (where X is a parameter) and SelectKBest selects the K features that are most powerful (where K is a parameter).

### Regularisation

There is a sweet spot between too few features (high bias) and too many (high variance). Finding the features that give us this sweet spot can be done automatically through regularisation.

#### Lasso Regression

Lasso regression aims to minimise the sum of squared errors (SSE), but also the number of features used. The gain from a new feature has to be large enough to outweigh a penalty for including that additional feature.

Each feature is assessed, and its regression coefficient is set to zero if the feature is not informative enough.
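
The way the penalty pushes a coefficient to exactly zero can be seen in the simplest case: one feature, no intercept. Here the Lasso solution has a closed form known as soft thresholding. The data and the penalty value are made up, and the objective is assumed to be (1 / (2n)) * SSE + alpha * |w|.

```python
def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty, clipping at zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_1d(xs, ys, alpha):
    """Lasso coefficient for one feature with no intercept.

    Minimises (1 / (2n)) * sum((y - w * x)^2) + alpha * |w|.
    """
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n  # feature-target agreement
    z = sum(x * x for x in xs) / n  # feature scale
    return soft_threshold(rho, alpha) / z


xs = [1.0, 2.0, 3.0, 4.0]
informative = [2.0, 4.0, 6.0, 8.0]  # target follows the feature closely
noisy = [0.3, -0.2, 0.1, 0.2]  # target barely relates to the feature

w_informative = lasso_1d(xs, informative, alpha=1.0)
w_noisy = lasso_1d(xs, noisy, alpha=1.0)
print(w_informative, w_noisy)
```

The informative feature keeps a coefficient, shrunk a little below the unpenalised value of 2, while the weak feature cannot overcome the penalty and is dropped completely.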

