# INFO-4604/5604 HW2: Linear Classification 

### Solution by: Jessie Smith (Jess)


## Assignment overview

News agencies, governments and corporations sometimes track social media during natural disasters to try to monitor unfolding events. But because no single person or group of people can read all available Twitter data, organizations may turn to natural language processing methods to try and understand what is happening as disasters unfold. 

While this approach is powerful, inferring events from NLP can be tricky. For instance, say a person [tweets](https://twitter.com/AnyOtherAnnaK/status/629195955506708480) that "LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE." This tweet includes the word "ablaze", which may signal to a computer that there is an unfolding disaster. However, in this particular case, the person is speaking metaphorically. A simple computer system using keywords (e.g. ablaze) might be fooled into thinking the tweet is reporting an actual fire.

In this assignment, you will predict if a given tweet actually refers to a natural disaster. This exercise is motivated by real-world disaster monitoring systems, and can help you to gain practice with supervised binary classification and natural language processing.

__Note__: This dataset originally comes from [Kaggle](https://www.kaggle.com/c/nlp-getting-started/overview). But it has been modified for this problem set. Information about the data from this problem set that you find on Kaggle will almost certainly be wrong.

### What to hand in

You will submit the assignment on Canvas. Submit a single Jupyter notebook named `hw2lastname.ipynb`, where lastname is replaced with your last name. **Please also submit a PDF or HTML version of your notebook to Canvas**.

Please clearly mark all deliverables. You are encouraged to create additional cells in whatever way makes the presentation more organized and easy to follow. You are allowed to import additional Python libraries.

### Submission policies

- **Collaboration:** You are allowed to work with one partner. You are still expected to write up your own solution. Each individual must turn in their own submission and list their collaborators after their name.
- **Late submissions:** Each student may use up to 5 late days over the semester. You have late days, not late hours. This means that if your submission is late by any amount of time past the deadline, then this will use up a late day. If it is late by any amount beyond 24 hours past the deadline, then this will use a second late day, and so on. Once you have used up all late days, late assignments will be given at most 80% credit after one day and 60% credit after two days.


## Getting started

In this assignment, you will experiment with perceptron and logistic regression in `sklearn`. Much of the code has already been written for you. We will use a class called `SGDClassifier` (which you should read about in the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)), which  implements stochastic gradient descent (SGD) for a variety of loss functions, including both perceptron and logistic regression, so this will be a way to easily move between the two classifiers.

The code below will load the datasets. There are two data collections: the "training" data, which contains the tweets that you will use for training the classifiers, and the "testing" data, which are tweets that you will use to measure the classifier accuracy. The test tweets are instances the classifier has never seen before, so they are a good way to see how the classifier will behave on data it hasn't seen before. However, we still know the labels of the test tweets, so we can measure the accuracy.

For this problem, we will use what are called "bag of words" features, which are commonly used when doing classification with text. Each feature is a word, and the value of a feature for a particular tweet is the number of times the word appears in the tweet (with value $0$ if the word does not appear in the tweet).
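
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch. The function name `bag_of_words` and the two example tweets are made up for illustration, and the lowercase whitespace split is an assumption; `CountVectorizer` applies its own tokenization rules, so its vocabulary may differ.

```python
from collections import Counter

def bag_of_words(tweets):
    """Return (vocabulary, count_rows) for a list of tweets.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption for illustration; CountVectorizer uses its own tokenizer.
    """
    tokenized = [tweet.lower().split() for tweet in tweets]
    # One feature per distinct word, in sorted order
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        # Words missing from this tweet get the value 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Hypothetical tweets, not taken from the dataset
vocabulary, rows = bag_of_words(["fire fire near the forest", "the sky was ablaze"])
print(vocabulary)
print(rows)
```

Each row has one entry per vocabulary word, which is why the real feature matrix has as many columns as there are distinct words in the training tweets.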

A note on labels: **If `Y_train` or `Y_test` are 1 this means the tweet refers to a real disaster; if the values are 0, it means the tweet does not refer to a real disaster** 

Run the block of code below to load the data. You don't need to do anything yet. Move on to "Problem 1" next.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

df_train = pd.read_csv('train.csv')

Y_train = df_train["target"]
text_train = df_train["text"]

vec = CountVectorizer()
X_train = vec.fit_transform(text_train)
feature_names = np.asarray(vec.get_feature_names())

df_test = pd.read_csv('test.csv')
Y_test = df_test["target"]
text_test = df_test["text"]

X_test = vec.transform(text_test)


In [6]:
df_train.head()

Unnamed: 0.1,Unnamed: 0,id,keyword,location,text,target
0,0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,2,5,,,All residents asked to 'shelter in place' are ...,1
2,3,6,,,"13,000 people receive #wildfires evacuation or...",1
3,4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
4,5,8,,,#RockyFire Update => California Hwy. 20 closed...,1


In [8]:
df_test.head()

Unnamed: 0.1,Unnamed: 0,id,keyword,location,text,target
0,1,4,,,Forest fire near La Ronge Sask. Canada,1
1,7,13,,,I'm on top of the hill and I can see a fire in...,1
2,16,24,,,I love fruits,0
3,18,26,,,My car is so fast,0
4,26,38,,,Was in NYC last week!,0


In [5]:
#Deliverable 1
print(len(df_train), len(df_test))

6120 1493


In [16]:
#ROWS, COLUMNS OF TRAINING DATA (Number of columns are the features)
X_train.get_shape()

(6120, 18594)

In [22]:
X_train[0]

<1x18594 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [24]:
#Percentage of instances about actual disasters in the training data
print(len(df_train), df_train['target'].sum())
print(df_train['target'].sum()/len(df_train))

6120 2644
0.4320261437908497


## Problem 1: Understand the data [3 points]

Before doing anything else, take time to understand the code above.

The variables `df_train` and `df_test` are dataframes that store the training (and testing) datasets, which are contained in comma-separated files where the first column is the label and the second column is the text of the tweet.

The [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class converts the raw text into a bag-of-words feature vector representation that `sklearn` can use.

You should print out the values of the variables and write any other code needed to answer the following questions.

#### Deliverable 1.1

How many training instances are in the dataset? How many test instances?

Training instances: 6120
Testing instances: 1493

#### Deliverable 1.2

How many features are in the training data?

There are 18594 features (columns) in the training data


#### Deliverable 1.3

What is the distribution of labels in the training data? That is, what percentage of instances are about actual disasters?

About 43% of the instances are about actual disasters

## Problem 2: Perceptron [3 points]

The code below trains an [`SGDClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) using the perceptron loss, then it measures the accuracy of the classifier on the test data, using `sklearn`'s [`accuracy_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function. 

The `fit` function trains the classifier. The feature weights are stored in the `coef_` variable after training. The `predict` function of the trained `SGDClassifier` outputs the predicted label for a given instance or list of instances.

Additionally, this code displays the features and their weights in sorted order, which you may want to examine to understand what the classifier is learning. In general, in binary classification, the 0 class is considered the "negative" class.
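
As a rough illustration of how feature weights turn into a predicted label, the sketch below applies a linear decision rule by hand. The helper `predict_label`, the dictionary layout, and the weight values are all hypothetical; the real classifier works on the sparse count matrix and also learns an intercept.

```python
def predict_label(weights, intercept, word_counts):
    """Linear decision rule: predict 1 when the weighted sum is positive.

    weights and word_counts are dicts keyed by word; a word missing from
    weights contributes nothing to the score.
    """
    score = intercept
    for word, count in word_counts.items():
        score += weights.get(word, 0.0) * count
    return 1 if score > 0 else 0

# Hypothetical weights, not the values learned in this notebook
weights = {"earthquake": 1.25, "sun": -0.5, "better": -0.5}
print(predict_label(weights, 0.0, {"earthquake": 1, "sun": 1}))  # score 0.75
print(predict_label(weights, 0.0, {"sun": 1, "better": 2}))      # score -1.5
```

Words with large positive weights push the score toward class 1, and words with large negative weights push it toward class 0, which is why sorting the weights is a quick way to inspect what the model has learned.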

There are 3 keyword arguments that have been added to the code below. It is important you keep the same values of these arguments whenever you create an `SGDClassifier` instance in this assignment so that you get consistent results. They are:

- `max_iter` is one of the stopping criteria, which is the maximum number of iterations/epochs the algorithm will run for.

- `tol` is the other stopping criterion, which is how small the difference between the current loss and previous loss should be before stopping.

- `random_state` is a seed for pseudorandom number generation. The algorithm uses randomness in the way the training data are sorted, which will affect the solution that is learned, and even the accuracy of that solution.
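
The interplay of the two stopping criteria can be sketched as follows. This is a simplified illustration, not the exact rule `sklearn` applies internally (see the `SGDClassifier` documentation for that); the function `epochs_until_stop` and the loss values are hypothetical.

```python
def epochs_until_stop(loss_per_epoch, max_iter, tol):
    """Return how many epochs run before a stopping criterion fires.

    loss_per_epoch is a hypothetical list with one loss value per epoch.
    Training stops early once the improvement over the previous epoch
    is smaller than tol; otherwise it stops at max_iter.
    """
    previous_loss = float("inf")
    for epoch, loss in enumerate(loss_per_epoch[:max_iter], start=1):
        if previous_loss - loss < tol:
            return epoch  # improvement too small: treat as converged
        previous_loss = loss
    return min(max_iter, len(loss_per_epoch))

losses = [1.0, 0.5, 0.4, 0.39999, 0.39998]  # made-up loss curve
print(epochs_until_stop(losses, max_iter=1000, tol=1e-3))
print(epochs_until_stop(losses, max_iter=3, tol=1e-3))
```

With a very small `tol` such as `1.0e-12`, the improvement test rarely fires, so `max_iter` is more likely to be the criterion that ends training.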

Note: *Wait a minute $-$ in class we learned that the loss function is convex, so the algorithm will find the same minimum regardless of how it is trained. Why is there random variation in the output? The reason is that even though there is only one minimum value of the loss, there may be different weights that result in the same loss, so randomness is a matter of tie-breaking. What's more, while different weights may have the same loss, they could lead to different classification accuracies, because the loss function is not the same as accuracy. (Unless accuracy was your loss function... which is possible, but uncommon because it turns out to be a difficult function to optimize.)
Note that different computers may still give different answers, despite keeping these settings the same, because of how pseudorandom numbers are generated with different operating systems and Python environments.*

To begin, run the code in the cell below without modification.

In [25]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_score

classifier = SGDClassifier(loss='perceptron', max_iter=1000, tol=1.0e-12, random_state=123, eta0=100)
classifier.fit(X_train, Y_train)

print("Number of SGD iterations: %d" % classifier.n_iter_)
print("Training accuracy: %0.6f" % accuracy_score(Y_train, classifier.predict(X_train)))
print("Testing accuracy: %0.6f" % accuracy_score(Y_test, classifier.predict(X_test)))

print("\nFeature weights:")
args = np.argsort(classifier.coef_[0])

print("\n - lowest")
for a in args[0:5]:
    print("\t %s: %0.4f" % (feature_names[a], classifier.coef_[0][a]))
   
print("\n - highest")
for a in args[-5:]:
    print("\t %s: %0.4f" % (feature_names[a], classifier.coef_[0][a]))

Number of SGD iterations: 35
Training accuracy: 0.987908
Testing accuracy: 0.781648

Feature weights:

 - lowest
	 zy3hpdjnwg: -0.7900
	 qzlpfhpwdo: -0.6970
	 better: -0.5112
	 f7wqpcekg2: -0.5112
	 sun: -0.5112

 - highest
	 storm: 0.8829
	 sunburned: 0.9294
	 hurricane: 0.9294
	 massacre: 1.0688
	 earthquake: 1.2547


#### Deliverable 2.1

Based on the training accuracy, do you conclude that the data are (mostly) linearly separable? Why or why not?

Yes, because the training accuracy is nearly 1.0 (an accuracy of exactly 1.0 would indicate that the data are perfectly linearly separable). There may be a few outliers, or we may have just reached the stopping criteria before 1.0 accuracy was reached.

#### Deliverable 2.2

Which feature most increases the likelihood that the tweet does not refer to a real disaster, and which feature most increases the likelihood that the tweet refers to a real disaster? 

1. The feature that most increases the likelihood of NOT referring to a real disaster is the word 'zy3hpdjnwg' existing in the tweet (umm... not sure who would even be tweeting that gibberish?!), because it has the lowest weight: -0.7900
2. The feature that most increases the likelihood of REFERRING to a real disaster is the word 'earthquake' existing in the tweet, because it has the highest weight: 1.2547 (also, this data may be a bit biased if that is the highest weighted feature, because it means there were a lot of positively classified tweets with 'earthquake' but there are other natural disasters not represented in this list of the 5 highest words!)

#### Deliverable 2.3 
One technique for improving the resulting model with perceptron is to take an average of the weight vectors learned at different iterations of the algorithm, rather than only using the final weights that minimize the loss. That is, calculate $\bar{\mathbf{w}} = \frac{1}{T} \sum_{t=1}^T \mathbf{w}^{(t)}$ where $\mathbf{w}^{(t)}$ is the weight vector at iteration $t$ of the algorithm and $T$ is the number of iterations, and then use $\bar{\mathbf{w}}$ when making classifications on new data.

To use this technique in your classifier, add the keyword argument `average=True` to the `SGDClassifier` function. Try it now using the cells below.

Compare the initial training/test accuracies to the training/test accuracies after doing averaging. What happens? Why do you think averaging the weights from different iterations has this effect?
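
For intuition, here is a small pure-Python sketch of weight averaging in a perceptron. It is an illustration under stated assumptions, not how `sklearn` implements `average=True`: labels are taken to be in {-1, +1} rather than {0, 1}, the weights are averaged after every training step, and the function name `train_perceptron` and the toy data are made up.

```python
def train_perceptron(examples, epochs):
    """Train a perceptron; return (final_weights, averaged_weights).

    examples is a list of (features, label) pairs with label in {-1, +1}.
    """
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    weight_sum = [0.0] * n_features
    steps = 0
    for _ in range(epochs):
        for features, label in examples:
            score = sum(w * x for w, x in zip(weights, features))
            if label * score <= 0:  # misclassified: move weights toward the label
                weights = [w + label * x for w, x in zip(weights, features)]
            # Accumulate the weights after every step, not only after mistakes
            weight_sum = [s + w for s, w in zip(weight_sum, weights)]
            steps += 1
    # The average is the running sum divided by the number of steps
    averaged = [s / steps for s in weight_sum]
    return weights, averaged

toy_data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]  # hypothetical, linearly separable
final_weights, averaged_weights = train_perceptron(toy_data, epochs=2)
print(final_weights, averaged_weights)
```

The averaged vector gives more influence to weights that survived many steps unchanged, so a late update caused by a single unusual example moves it less than it moves the final weights.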

In [26]:
classifier = SGDClassifier(loss='perceptron', max_iter=1000, tol=1.0e-12, random_state=123, eta0=100, average=True)
classifier.fit(X_train, Y_train)

print("Number of SGD iterations: %d" % classifier.n_iter_)
print("Training accuracy: %0.6f" % accuracy_score(Y_train, classifier.predict(X_train)))
print("Testing accuracy: %0.6f" % accuracy_score(Y_test, classifier.predict(X_test)))

print("\nFeature weights:")
args = np.argsort(classifier.coef_[0])

print("\n - lowest")
for a in args[0:5]:
    print("\t %s: %0.4f" % (feature_names[a], classifier.coef_[0][a]))
   
print("\n - highest")
for a in args[-5:]:
    print("\t %s: %0.4f" % (feature_names[a], classifier.coef_[0][a]))

Number of SGD iterations: 35
Training accuracy: 0.977124
Testing accuracy: 0.811788

Feature weights:

 - lowest
	 full: -1.6394
	 sun: -1.3797
	 better: -1.3130
	 also: -1.3088
	 book: -1.2684

 - highest
	 storm: 2.0315
	 earthquake: 2.0326
	 floods: 2.0351
	 fires: 2.1129
	 hiroshima: 2.1832


## Deliverable (interpretation): 

Using the average of the weights rather than the final weights made the training accuracy decrease by a small amount (about -0.01, from 0.988 to 0.977), but made the testing accuracy increase by a decent amount (about +0.03, from 0.782 to 0.812)! This could have happened because, without averaging, the final weights depend heavily on the last few updates, which may have been fit to a handful of hard or noisy training examples. By averaging, we use information from the entire training process (not just where we ended up), which smooths out those late swings and makes for a more generalizable solution. That would explain the better testing accuracy: the averaged model is less overfit to the training data.

## Problem 3: Logistic regression [4 points]

For this problem, create a new `SGDClassifier`, this time setting the `loss` argument to `'log'`, which will train a logistic regression classifier. Set `average=False` for the remaining problems.

Once you have trained the classifier, you can use the `predict` function to get the classifications, as with perceptron. Additionally, logistic regression provides probabilities for the predictions. You can get the probabilities by calling the `predict_proba` function. This will give a list of two numbers; the first is the probability that the class is $0$ and the second is the probability that the class is $1$.


For the first task, add the keyword argument `alpha` to the `SGDClassifier` function. This is the regularization strength, called $\lambda$ in lecture. If you don't specify `alpha`, it defaults to $0.0001$. Experiment with other values and see how this affects the outcome.

#### Deliverable 3.1: 

Calculate the training and testing accuracy when `alpha` is one of $[0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]$. Create a plot where the x-axis is `alpha` and the y-axis is accuracy, with two lines (one for training and one for testing). You can borrow the code from HW1 for generating plots in Python. Use [a log scale for the x-axis](https://matplotlib.org/examples/pylab_examples/log_demo.html) so that the `alpha` values are spaced evenly.

[your solution should be plotted below]

In [61]:
%matplotlib inline
import matplotlib.pyplot as plt
import altair as alt

# starter code
alphas = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0]
training_acc = []
testing_acc = []
for a in alphas:
    classifier = SGDClassifier(loss='log', max_iter=1000, alpha=a, tol=1.0e-12, random_state=123, eta0=100, average=False)
    classifier.fit(X_train, Y_train)
    train_acc = accuracy_score(Y_train, classifier.predict(X_train))
    test_acc = accuracy_score(Y_test, classifier.predict(X_test))
    print('ALPHA: ', a)
    print("Training accuracy: %0.6f" % train_acc)
    print("Testing accuracy: %0.6f" % test_acc)
    training_acc.append(train_acc)
    testing_acc.append(test_acc)

test_or_train = ['train'] * len(alphas) + ['test'] * len(alphas)
df = pd.DataFrame({"alpha": alphas + alphas,
                   "data type":test_or_train,
                   "accuracy":training_acc + testing_acc})

alt.Chart(df).mark_line().encode(
    alt.X('alpha', scale=alt.Scale(type='log')),
    color="data type", 
    y='accuracy')

ALPHA:  0.0001
Training accuracy: 0.983333
Testing accuracy: 0.807770
ALPHA:  0.001
Training accuracy: 0.901634
Testing accuracy: 0.813798
ALPHA:  0.01
Training accuracy: 0.808660
Testing accuracy: 0.773610
ALPHA:  0.1
Training accuracy: 0.710784
Testing accuracy: 0.695244
ALPHA:  1.0
Training accuracy: 0.674183
Testing accuracy: 0.665104
ALPHA:  10.0
Training accuracy: 0.657026
Testing accuracy: 0.661755
ALPHA:  100.0
Training accuracy: 0.567484
Testing accuracy: 0.580040


#### Deliverable 3.2

Examine the classifier probabilities using the `predict_proba` function when training with different values of `alpha`. What do you observe? How does `alpha` affect the prediction probabilities, and why do you think this happens?


## My Answer: 
(Code is below to support my theory)
It would appear that as the alpha value is increased, the probability of predicting a "1" (probability of predicting that this text is indicative of a natural disaster) decreases, and the probability of predicting a "0" (that this text is not indicative of a natural disaster) increases. With a really low alpha value, the probability of predicting a "1" is really high and the probability of predicting a "0" is really low. But as alpha is increased, the probability of predicting either a "0" or a "1" is almost the same, it's basically 50/50.

I think this happens because alpha is the regularization strength. When alpha is really low, the penalty has little impact, so the algorithm can overfit to the training data. But when alpha is really big, the penalty term dominates the loss, so the only way to minimize it is to make the weights really SMALL, which pushes every score toward 0 and every probability toward 0.5. So basically our algorithm starts off overfit with a small alpha (very confident predictions and high training accuracy), but with an alpha that is too big, our algorithm is underfit because we added too much bias.
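
One way to see why smaller weights pull the probabilities toward 0.5 is to compute the logistic function by hand. The sketch below is illustrative only: it omits the intercept, the feature and weight values are made up, and uniformly scaling the weights by `shrink` is a stand-in for the effect of stronger regularization, not what the optimizer actually does.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def class_probabilities(weights, features):
    """Return [P(y=0), P(y=1)], the same layout predict_proba uses per row."""
    score = sum(w * x for w, x in zip(weights, features))
    p_one = sigmoid(score)
    return [1.0 - p_one, p_one]

features = [1.0, 2.0]        # hypothetical word counts
base_weights = [1.5, 0.75]   # hypothetical weights, giving a score of 3.0
for shrink in [1.0, 0.1, 0.01]:
    weights = [shrink * w for w in base_weights]
    print(shrink, class_probabilities(weights, features))
```

As the weights shrink, the score approaches 0 and the sigmoid of 0 is exactly 0.5, so both class probabilities converge to 0.5 regardless of the input.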

In [71]:
#Code to justify my thoughts for ^^
for a in alphas:
    classifier = SGDClassifier(loss='log', max_iter=1000, alpha=a, tol=1.0e-12, random_state=123, eta0=100, average=False)
    classifier.fit(X_train, Y_train)
    print("ALPHA VALUE: ", a)
    print(classifier.predict_proba(X_train))

ALPHA VALUE:  0.0001
[[0.1073517  0.8926483 ]
 [0.07474336 0.92525664]
 [0.04847949 0.95152051]
 ...
 [0.00976061 0.99023939]
 [0.04878059 0.95121941]
 [0.00382533 0.99617467]]
ALPHA VALUE:  0.001
[[0.35321656 0.64678344]
 [0.16869801 0.83130199]
 [0.13702526 0.86297474]
 ...
 [0.03844073 0.96155927]
 [0.18279678 0.81720322]
 [0.03216899 0.96783101]]
ALPHA VALUE:  0.01
[[0.53629373 0.46370627]
 [0.32635411 0.67364589]
 [0.30038956 0.69961044]
 ...
 [0.18228622 0.81771378]
 [0.32799007 0.67200993]
 [0.18481425 0.81518575]]
ALPHA VALUE:  0.1
[[0.54681201 0.45318799]
 [0.46917619 0.53082381]
 [0.44543801 0.55456199]
 ...
 [0.44182658 0.55817342]
 [0.46094343 0.53905657]
 [0.41797239 0.58202761]]
ALPHA VALUE:  1.0
[[0.51459256 0.48540744]
 [0.50241214 0.49758786]
 [0.49491548 0.50508452]
 ...
 [0.5094737  0.4905263 ]
 [0.49789611 0.50210389]
 [0.49055035 0.50944965]]
ALPHA VALUE:  10.0
[[0.50239332 0.49760668]
 [0.50106973 0.49893027]
 [0.50015043 0.49984957]
 ...
 [0.50220857 0.49779143]


#### Deliverable 3.3: 

Now remove the `alpha` argument so that it goes back to the default value. We'll now look at the effect of the learning rate. By default, `sklearn` uses an "optimal" learning rate based on some heuristics that work well for many problems. However, it can be good to see how the learning rate can affect the algorithm.

For this task, add the keyword argument `learning_rate` to the `SGDClassifier` function and set the value to `invscaling`. This defines the learning rate at iteration $t$ as: $\eta_t = \frac{\eta_0}{t^a}$, where $\eta_0$ and $a$ are both arguments you have to define in the `SGDClassifier` function, called `eta0` and `power_t`, respectively. Experiment with different values of `eta0` and `power_t` and see how they affect the number of iterations it takes the algorithm to converge. You will often find that it will not finish within the maximum of $1000$ iterations.
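
The schedule $\eta_t = \frac{\eta_0}{t^a}$ is easy to tabulate directly. The helper `invscaling_rate` below is a hypothetical name used only for this sketch, and the step numbers are arbitrary sample points.

```python
def invscaling_rate(eta0, power_t, t):
    """Learning rate at step t (t >= 1) under eta_t = eta0 / t**power_t."""
    return eta0 / (t ** power_t)

# Compare how quickly the rate decays for each power_t, with eta0 fixed at 10
for power_t in [0.5, 1.0, 2.0]:
    rates = [invscaling_rate(10.0, power_t, t) for t in (1, 100, 10000)]
    print(power_t, rates)
```

At $t = 1$ the rate equals `eta0` for every `power_t`; after that, a larger `power_t` makes the rate fall off much faster, so later updates become tiny.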

In [79]:
etas = [10, 100, 1000, 10000]
power_ts = [0.5, 1.0, 2.0]
for e in etas:
    for p in power_ts:
        classifier = SGDClassifier(loss='log', 
                                   max_iter=1000, 
                                   learning_rate='invscaling', 
                                   eta0=e, 
                                   power_t=p, 
                                   tol=1.0e-12, 
                                   random_state=123,
                                   average=False)
        classifier.fit(X_train, Y_train)
        print("ETA_0: ", e)
        print("POWER_T: ", p)
        print("CONVERGENCE ITERS: ", classifier.n_iter_)

ETA_0:  10
POWER_T:  0.5
CONVERGENCE ITERS:  177
ETA_0:  10
POWER_T:  1.0
CONVERGENCE ITERS:  1000
ETA_0:  10
POWER_T:  2.0
CONVERGENCE ITERS:  1000
ETA_0:  100
POWER_T:  0.5
CONVERGENCE ITERS:  94
ETA_0:  100
POWER_T:  1.0
CONVERGENCE ITERS:  1000
ETA_0:  100
POWER_T:  2.0
CONVERGENCE ITERS:  1000
ETA_0:  1000
POWER_T:  0.5
CONVERGENCE ITERS:  105
ETA_0:  1000
POWER_T:  1.0
CONVERGENCE ITERS:  1000
ETA_0:  1000
POWER_T:  2.0
CONVERGENCE ITERS:  1000
ETA_0:  10000
POWER_T:  0.5
CONVERGENCE ITERS:  88
ETA_0:  10000
POWER_T:  1.0
CONVERGENCE ITERS:  126
ETA_0:  10000
POWER_T:  2.0
CONVERGENCE ITERS:  1000




Fill in the table below with the number of iterations for values of `eta0` in $[10.0, 100.0, 1000.0, 10000.0]$ and values of `power_t` in $[0.5, 1.0, 2.0]$. You may find it easier to write python code that can output the markdown for the table, but if you do that place the output here. If it does not converge within the maximum number of iterations (set to $1000$ by `max_iter`), record $1000$ as the number of iterations. You will need to read the documentation for this class to learn how to recover the actual number of iterations before reaching the stopping criterion.

| `eta0`   | `power_t` | # Iterations |
|:----------|:-:|:------------:|
| $10.0$    | $0.5$     |    $177$       |
| $10.0$    | $1.0$     |       $1000$       |
| $10.0$    | $2.0$     |   $1000$           |
| $100.0$   | $0.5$     |   $94$           |
| $100.0$   | $1.0$     |   $1000$           |
| $100.0$   | $2.0$     |   $1000$           |
| $1000.0$  | $0.5$     |   $105$           |
| $1000.0$  | $1.0$     |   $1000$           |
| $1000.0$  | $2.0$     |   $1000$           |
| $10000.0$ | $0.5$     |   $88$           |
| $10000.0$ | $1.0$     |   $126$           |
| $10000.0$ | $2.0$     |   $1000$           |

$\eta_t = \frac{\eta_0}{t^a}$

#### Deliverable 3.4

Describe how `eta0` and `power_t` affect the learning rate based on the formula (e.g., if you increase `power_t`, what will this do to the learning rate?), and connect this to what you observe in the table above.

## My answer: 
eta0 and power_t affect the learning rate because $\eta_t = \frac{\eta_0}{t^a}$, so the larger eta0 is, the larger the learning rate is. Conversely, the larger power_t is, the faster the learning rate shrinks as $t$ grows, so the smaller the learning rate is. This means that in theory, with a large eta0 (and large learning rate) the model might learn VERY quickly because it is allowed to take bigger steps to minimize loss. Making power_t larger (and making the learning rate smaller) might make the model learn more slowly because it takes small steps towards the minimum loss; it might even never converge within the max of 1000 iterations. That being said, with a large learning rate, you could also overshoot the minimum continuously and never converge either, so you have to be careful to not let it get TOO large.

Now taking this theoretical background and applying it to the results in the table, the theory holds up reasonably well. The largest eta0 (10000) produced some of the lowest iteration counts before convergence (88 and 126), and it was the only eta0 that converged with power_t = 1.0. And the largest power_t (2.0) consistently led to the model not converging within 1000 iterations!

<hr/>

Now remove the `learning_rate`, `eta0`, and `power_t` arguments so that the learning rate returns to the default setting. For this final task, we will experiment with how high the probability must be before an instance is classified as positive.

The code below includes a function called `threshold` which takes as input the classification probabilities of the data (called `probs`, which is given by the function `predict_proba`) and a threshold (called `tau`, a scalar that should be a value between $0$ and $1$). It will classify each instance as $1$ if the probability of being $1$ is greater than `tau`, otherwise it will classify the instance as $0$. Note that if you set `tau` to $0.5$, the `threshold` function should give you exactly the same output as the classifier `predict` function.

You should find that increasing the threshold causes the accuracy to drop. This makes sense, because you are classifying some things as 0 even though it's more probable that they are 1. So why do this? Suppose you care more about accurately identifying tweets about natural disasters than missing tweets about disasters (e.g. maybe you forward these tweets to first responders.) You thus want to be confident that when you classify a tweet as 1 that it really is 1.

There is a metric called _precision_ which measures something like accuracy but for one specific class. Whereas accuracy is the percentage of tweets that were correctly classified, the precision of 1 would be the percentage of tweets classified as 1 that were correctly classified. (In other words, the number of tweets classified as 1 whose correct label was 1, divided by the number of tweets classified as 1.)

You can use the [`precision_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) function from `sklearn` to calculate the precision. It works much like the `accuracy_score` function.
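
The definition of precision can be checked by hand on a few made-up predictions. The sketch below is a pure-Python illustration, not the `sklearn` implementation; the function names, labels, and probabilities are hypothetical, and returning 0.0 when nothing is classified as 1 is a convention chosen for this example.

```python
def threshold_labels(prob_of_one, tau):
    """Label an instance 1 only when its probability of class 1 exceeds tau."""
    return [1 if p > tau else 0 for p in prob_of_one]

def precision(true_labels, predicted_labels):
    """Fraction of instances predicted as 1 whose true label is 1."""
    predicted_positive = [t for t, p in zip(true_labels, predicted_labels) if p == 1]
    if not predicted_positive:
        return 0.0  # convention chosen here when nothing is predicted as 1
    return sum(predicted_positive) / len(predicted_positive)

true_labels = [1, 1, 0, 1, 0]                 # hypothetical ground truth
prob_of_one = [0.95, 0.85, 0.70, 0.60, 0.20]  # hypothetical model outputs
for tau in [0.5, 0.8]:
    predictions = threshold_labels(prob_of_one, tau)
    print(tau, predictions, precision(true_labels, predictions))
```

Raising `tau` from 0.5 to 0.8 drops the one false positive (probability 0.70) but also drops a true positive (probability 0.60), which is the trade-off between precision and coverage described above.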

#### Deliverable 3.5

Calculate the testing precision when the value of `tau` for thresholding is one of $[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]$. Create a plot where the x-axis is `tau` and the y-axis is precision.

[your solution should be plotted below]

In [97]:
# use this function for deliverable 3.5
def threshold(probs, tau):
    return np.where(probs[:,1] > tau, 1, 0)

# your logistic regression code here
taus = [0.5,0.6,0.7,0.8,0.9,0.95,0.99]
precisions = []
# Fit once: the trained model does not depend on tau, only the thresholding does
classifier = SGDClassifier(loss='log', max_iter=1000, tol=1.0e-12, random_state=123)
classifier.fit(X_train, Y_train)
proba = classifier.predict_proba(X_test)
for t in taus:
    this_precision = precision_score(Y_test, threshold(proba, t))
    precisions.append(this_precision)
    print("TAU: ", t, "Testing precision: %0.6f" % this_precision)
    
df = pd.DataFrame({"tau": taus,
                   "precision":precisions})

alt.Chart(df).mark_line().encode(
    x="tau",
    y=alt.Y("precision",scale=alt.Scale(zero=False)))

TAU:  0.5 Testing precision: 0.809091
TAU:  0.6 Testing precision: 0.848178
TAU:  0.7 Testing precision: 0.890661
TAU:  0.8 Testing precision: 0.920548
TAU:  0.9 Testing precision: 0.954704
TAU:  0.95 Testing precision: 0.972350
TAU:  0.99 Testing precision: 0.989474


#### Deliverable 3.6

Describe what you observe with thresholding (e.g., what happens to precision as the threshold increases?), and explain why you think this happens.

## My answer
As the threshold increases, the precision increases, which makes sense because precision is the percentage of tweets we classified as 1 that really are 1. If we only classify a tweet as 1 when it has a 99% probability of being a 1 (e.g., if tau is 0.99), then our precision is going to be really high, because we are only labeling tweets as 1 when we are REALLY sure (99% sure) they are actually "1"s. Alternatively, if we set our tau really low, then our precision will be lower, because if we allow things that have just over a 50% probability of being a 1 (if tau is 0.5) to be classified as a 1, then we are more likely to get some incorrectly classified 1s in there, making our precision value decrease. All of this is represented in the chart: as tau gets higher, so does the precision, and as tau gets lower, so does the precision.
<hr/>

## Problem 4: Sparse learning [5604: 5 points; 4604: +3 EC points]

Add the `penalty` argument to `SGDClassifier` and set the value to `'l1'`, which tells the algorithm to use L1 regularization instead of the default L2. Recall from lecture that L1 regularization encourages weights to stay at exactly $0$, resulting in a more "sparse" model than L2. You should see this effect if you examine the values of `classifier.coef_`.
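
To illustrate why L1 produces exact zeros while L2 does not, the sketch below compares two shrinkage steps on a small weight vector. It is a conceptual illustration only and is not how `SGDClassifier` applies its penalties; the function names, the weights, and the shrinkage amounts are all hypothetical.

```python
def l1_shrink(weights, amount):
    """Soft-thresholding: move each weight toward 0 by `amount`, clipping at 0."""
    shrunk = []
    for w in weights:
        if w > amount:
            shrunk.append(w - amount)
        elif w < -amount:
            shrunk.append(w + amount)
        else:
            shrunk.append(0.0)  # small weights land exactly on zero
    return shrunk

def l2_shrink(weights, factor):
    """Multiplicative shrinkage: weights get smaller but stay nonzero."""
    return [w * factor for w in weights]

def count_nonzero(weights):
    return sum(1 for w in weights if w != 0.0)

weights = [2.0, -0.05, 0.5, -1.0]  # hypothetical weights
print(l1_shrink(weights, 0.5), count_nonzero(l1_shrink(weights, 0.5)))
print(l2_shrink(weights, 0.5), count_nonzero(l2_shrink(weights, 0.5)))
```

The L1-style step subtracts a fixed amount, so any weight smaller than that amount is set to exactly 0, whereas the L2-style step multiplies by a factor and only ever makes weights smaller.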

#### Deliverable 4.1: 

Write a function to calculate the number of features whose weights are nonzero when using L1 regularization. Calculate the number of nonzero feature weights when `alpha` is one of $[0.00001, 0.0001, 0.001, 0.01, 0.1]$. Create a plot where the x-axis is `alpha` and the y-axis is the number of nonzero weights, using a log scale for the x-axis.

[your solution should be plotted below]

In [109]:
# your code here
def numNonZero(a):
    classifier = SGDClassifier(alpha=a, loss='log', penalty='l1', max_iter=1000, tol=1.0e-12, random_state=123)
    classifier.fit(X_train, Y_train)
    # Count the weights that L1 regularization did not drive to exactly zero
    return int(np.count_nonzero(classifier.coef_[0]))
    
alphas = [0.00001,0.0001,0.001,0.01,0.1]
num_nonzeros = []
for a in alphas:
    num_nonzeros.append(numNonZero(a))
    
df = pd.DataFrame({"alpha": alphas,
                   "non zero weights":num_nonzeros})


alt.Chart(df).mark_line().encode(
    alt.X('alpha', scale=alt.Scale(type='log')),
    y='non zero weights')



[Briefly explain your plot in a few sentences. What happens as you increase `alpha`; does that make sense?]

## My answer:
As alpha is increased, the number of nonzero weights drops toward zero (and eventually reaches zero). This makes sense because as we raise the L1 regularization hyperparameter, the model is penalized more heavily for every nonzero weight, so more and more weights are pushed to exactly zero. The chart is consistent with this: as alpha is increased, most of the weights (or all of the weights) become zero.