#### Model:

"A specification of a mathematical (or probabilistic) relationship that exists between different variables."

#### Machine Learning:

"refer\[s\] to creating and using models that are learned from data."

#### Overfitting:

"producing a model that performs well on the data you train it on but that generalizes poorly to any new data."

* "This could involve learning *noise* in the data."
* "Or this could involve learning to identify specific inputs rather than whatever factors are actually predictive for the desired output."

"The other side of it is underfitting, producing a model that doesn't perform well even on the training data, although typically when this happens you decide your model isn't good enough and keep looking for a better one."

* "So how do we make sure our models aren't too complex?"
- "The most fundamental approach involves using different data to train the model and to test the model."

"The simplest way to do this is to split your data set, so that (for example) two-thirds of it is used to train the model, after which we measure the model's performance on the remaining third:"

In [1]:
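import random

def split_data(data, prob):
    """
    Splits data into fractions [prob, 1 - prob].
    """
    results = [], []  # (rows for the first part, rows for the second part)
    for row in data:
        # Each row independently lands in the first list with probability prob.
        results[0 if random.random() < prob else 1].append(row)
    return results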

"Often, we'll have a matrix *x* of input variables and a vector *y* of output variables. In that case, we need to make sure to put corresponding values together in either the training data or the test data:"

In [2]:
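def train_test_split(x, y, test_pct):
    # Pair each input row with its output so the two stay together through the split.
    data = zip(x, y)
    train, test = split_data(data, 1 - test_pct)
    # Unzip the (x, y) pairs back into separate sequences of inputs and outputs.
    x_train, y_train = zip(*train)
    x_test, y_test = zip(*test)
    return x_train, x_test, y_train, y_test

"If the model was overfit to the training data, then it will hopefully perform really poorly on the (completely separate) test data."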

* "However, there are a couple of ways this can go wrong."
* "The first is if there are common patterns in the test and train data that wouldn't generalize to a larger data set."
    * "For example, imagine that your data set consists of user activity, one row per user per week. In such a case, most users will appear in both the training data and the test data, and certain models might learn to *identify* users rather than discover relationships involving *attributes*."
* "A bigger problem is if you use the test/train split not just to judge a model but also to *choose* from among many models."
    * "In that case, although each individual model may not be overfit, the 'choose a model that performs best on the test set' is a meta-training that makes the test set function as a second training set."
    * "In such a situation, you should split the data into three parts: a *training* set for building models, a *validation* set for choosing among trained models, and a *test* set for judging the final model."
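
The three-way split described above could be sketched as follows. This is only an illustration: the helper name `three_way_split`, the 60/20/20 proportions, and the shuffle-then-slice approach are assumptions, not taken from the book (whose `split_data` assigns each row by an independent coin flip instead).

```python
import random

def three_way_split(data, train_frac=0.6, validation_frac=0.2, seed=None):
    """Shuffle a copy of data and cut it into train / validation / test lists."""
    rng = random.Random(seed)      # local generator so the split is reproducible
    rows = list(data)
    rng.shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_validation = round(len(rows) * validation_frac)
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_validation]
    test = rows[n_train + n_validation:]   # whatever is left over
    return train, validation, test

train, validation, test = three_way_split(range(100), seed=0)
print(len(train), len(validation), len(test))  # 60 20 20
```

The test set is touched only once, after a model has been chosen using the validation set.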

#### Correctness

"Imagine building a model to make a *binary* judgment: 'Is this email spam?'"

"Given a set of labeled data and such a predictive model, every data point lies in one of four categories:

* True positive: 'This message is spam, and we correctly predicted spam';
* False positive (Type I Error): 'This message is not spam, but we predicted spam';
* False negative (Type II Error): 'This message is spam, but we predicted not spam';
* True negative: 'This message is not spam, and we correctly predicted not spam'."

#### Confusion Matrix ####

|                    | Actually Spam      | Actually Not Spam  |
|--------------------|--------------------|--------------------|
| predicted Spam     | True positive      | False positive     |
| predicted Not Spam | False negative     | True negative      |
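
To make the four cells concrete, here is a minimal sketch that tallies them from labeled data. The function name `confusion_counts` and the toy lists are hypothetical; labels are assumed to be booleans with `True` meaning spam.

```python
def confusion_counts(actual, predicted):
    """Count (tp, fp, fn, tn) from parallel lists of booleans (True = spam)."""
    tp = fp = fn = tn = 0
    for is_spam, said_spam in zip(actual, predicted):
        if is_spam and said_spam:
            tp += 1          # spam, predicted spam
        elif not is_spam and said_spam:
            fp += 1          # not spam, predicted spam
        elif is_spam and not said_spam:
            fn += 1          # spam, predicted not spam
        else:
            tn += 1          # not spam, predicted not spam
    return tp, fp, fn, tn

actual    = [True, True, False, False, False]
predicted = [True, False, True, False, False]
print(confusion_counts(actual, predicted))  # (1, 1, 1, 2)
```

The returned tuple uses the same `(tp, fp, fn, tn)` order as the metric functions in these notes.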

##### Accuracy #####

In [3]:
def accuracy(tp,fp,fn,tn):
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct/total

In [4]:
print(accuracy(70,4930,13930,981070))

0.98114


"That seems like a pretty impressive number. But clearly this is not a good test, which means that we probably shouldn't put a lot of credence in raw accuracy."

"It's common to look at the combination of *precision* and *recall*."

**"Precision measures how accurate our positive predictions were:"**

In [5]:
def precision(tp,fp,fn,tn):
    return tp/(tp+fp)

In [6]:
print(precision(70,4930,13930,981070))

0.014


**"And recall measures what fraction of the positives our model identified: "**

In [7]:
def recall(tp,fp,fn,tn):
    return tp/(tp+fn)

In [8]:
print(recall(70,4930,13930,981070))

0.005


"Sometimes precision and recall are combined into the **F1 score**, which is defined as: "

In [9]:
def f1_score(tp,fp,fn,tn):
    p = precision(tp,fp,fn,tn)
    r = recall(tp,fp,fn,tn)

    return 2 * p * r / (p + r)

In [10]:
print(f1_score(70,4930,13930,981070))

0.00736842105263158


**"This is the *harmonic mean* of precision and recall and necessarily lies between them."** 
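
As a quick check with the numbers above (writing $P$ for precision and $R$ for recall, symbols that are not used in the original):

$$
F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R} = \frac{2 \times 0.014 \times 0.005}{0.014 + 0.005} = \frac{0.00014}{0.019} \approx 0.00737
$$

This does fall between $R = 0.005$ and $P = 0.014$, and it sits closer to the smaller of the two.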

"Usually the choice of a model involves a trade-off between precision and recall.
* A model that predicts 'yes' when it's even a little bit confident will probably have a high recall but a low precision;
* a model that predicts 'yes' only when it's extremely confident is likely to have a low recall and a high precision."

"Alternatively, you can think of this as a trade-off between false positives and false negatives:
* Saying 'yes' too often will give you lots of false positives;
* saying 'no' too often will give you lots of false negatives."
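
The trade-off can be seen by moving the confidence threshold at which a model says "yes". The sketch below assumes the model outputs a score between 0 and 1; the function name `precision_recall_at` and the scores and labels are made up purely for illustration.

```python
def precision_recall_at(threshold, scores, labels):
    """Predict 'yes' when score >= threshold; return (precision, recall)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up model scores and true labels.
scores = [0.95, 0.90, 0.60, 0.55, 0.30, 0.20]
labels = [True, True, False, True, False, False]

for threshold in (0.1, 0.5, 0.9):
    print(threshold, precision_recall_at(threshold, scores, labels))
```

On this toy data, the lowest threshold gives precision 0.5 with recall 1.0, while the highest gives precision 1.0 with recall of about 0.67.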

#### The Bias-Variance Trade-off ####

"If your model has high bias (which means it performs poorly even on your training data) then one thing to try is adding more features."

"If your model has high variance, then you can similarly remove features. But another solution is to obtain more data (if you can)."

#### Feature Extraction and Selection

Features: "whatever inputs we provide to our model."

"When your data doesn't have enough features, your model is likely to underfit. And when your data has too many features, it's easy to overfit. But what are features and where do they come from?"

"In the simplest case, features are simply given to you. If you want to predict someone's salary based on her years of experience, then years of experience is the only feature you have."

"Things become more interesting as your data becomes more complicated. Imagine trying to build a spam filter to predict whether an email is junk or not. Most models won't know what to do with a raw email, which is just a collection of text. You'll have to extract features. For example:

* Does the email contain the word 'Viagra'?
* How many times does the letter 'd' appear?
* What was the domain of the sender?

The first is simply a yes or no, which we typically encode as a 1 or 0.
The second is a number.
And the third is a choice from a discrete set of options.

**Pretty much always, we'll extract features from our data that fall into one of these three categories. What's more, the type of features we have constrains the type of models we can use.**"
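
The three example features could be extracted along these lines. This is a sketch only: the function name `extract_features`, the dictionary keys, and the choice to ignore letter case are assumptions, and the sample email is invented.

```python
def extract_features(email_text, sender_address):
    """Turn one raw email into the three kinds of features described above."""
    lowered = email_text.lower()   # assumption: matching and counting ignore case
    return {
        "contains_viagra": 1 if "viagra" in lowered else 0,           # yes/no as 1/0
        "d_count": lowered.count("d"),                                # a number
        "sender_domain": sender_address.rsplit("@", 1)[-1].lower(),   # discrete choice
    }

print(extract_features("Discount Viagra, order today!", "deals@example.com"))
# {'contains_viagra': 1, 'd_count': 3, 'sender_domain': 'example.com'}
```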

"How do we choose features? That's where a combination of *experience* and *domain expertise* comes into play."

* "If you've received lots of emails, then you probably have a sense that the presence of certain words might be a good indicator of spamminess."
* "And you might also have a sense that the number of d's is likely not a good indicator of spamminess."