# CPSC 330 hw4

Note: this assignment covers 2 weeks' worth of material (lectures 6-9).

Note: the assignments will get gradually more open-ended as we progress through the course. In many cases there is no single correct solution.

In [110]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Add more imports below as needed



In [2]:
plt.rcParams['font.size'] = 16

## Instructions
rubric={points:5}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). 

In particular, please make sure all your output is showing up in the final rendered version. **Also, please do not delete the question cells or move the questions around.** This will make things easier for the TAs.

Also: if available, you are welcome to use scikit-learn functions for any of the tasks below, such as confusion matrix. You are not required to implement them yourselves. 

## Writing quality/quantity
rubric={points:5}

The TAs have reported a couple of issues with the first few assignments. In some cases, submissions simply show the code output with no commentary; please write at least a sentence explaining your output in each question. In other cases, the TAs have come across multi-paragraph answers where a couple of sentences would have sufficed. Thus, we are now allocating the above points for well-structured answers of a reasonable length. In general, 1-3 sentences is good.

## Customer churn data

[Customer churn](https://en.wikipedia.org/wiki/Customer_attrition) refers to the notion of customers leaving a subscription service like Netflix. In this exercise, we will try to predict customer churn in a dataset where most of the customers stay with the service and a small minority cancel their subscription. To start, please download the [Kaggle telecom customer churn dataset](https://www.kaggle.com/becksddf/churn-in-telecoms-dataset). As usual, do not push the CSV to your repo. Once you have the data, you should be able to run the following code:

In [3]:
df = pd.read_csv('bigml_59c28831336c6604c800002a.csv', encoding='latin-1')

In [4]:
df_train, df_test = train_test_split(df, test_size=0.1, random_state=123)
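
To illustrate what this split does, here is a minimal sketch on a made-up toy DataFrame (`toy` is hypothetical and is not the churn data). It assumes only that `test_size=0.1` holds out 10% of the rows and that fixing `random_state` makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame({"x": np.arange(100), "churn": np.arange(100) % 10 == 0})

train_a, test_a = train_test_split(toy, test_size=0.1, random_state=123)
train_b, test_b = train_test_split(toy, test_size=0.1, random_state=123)

print(train_a.shape, test_a.shape)        # 90 rows for training, 10 for testing
print(test_a.index.equals(test_b.index))  # same random_state gives the same split
```

The shuffled index values in the held-out rows are why the index of `df_train` is not in sorted order.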

In [5]:
df_train.head()

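|  | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
| ---: | :--- | ---: | ---: | :--- | :--- | :--- | ---: | ---: | ---: | ---: | :---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | :--- |
| 380 | SD | 137 | 510 | 373-5732 | no | no | 0 | 242.1 | 118 | 41.16 | ... | 93 | 16.24 | 218.6 | 50 | 9.84 | 14.7 | 2 | 3.97 | 3 | False |
| 2352 | VA | 118 | 408 | 404-2877 | no | no | 0 | 154.8 | 71 | 26.32 | ... | 73 | 20.74 | 159.6 | 81 | 7.18 | 12.8 | 4 | 3.46 | 0 | False |
| 693 | NJ | 92 | 510 | 420-8242 | no | yes | 29 | 155.4 | 110 | 26.42 | ... | 104 | 16.02 | 254.9 | 118 | 11.47 | 8.0 | 4 | 2.16 | 3 | False |
| 527 | NJ | 95 | 415 | 379-6652 | no | yes | 22 | 40.9 | 126 | 6.95 | ... | 90 | 11.34 | 264.2 | 91 | 11.89 | 11.9 | 7 | 3.21 | 0 | False |
| 2556 | WA | 118 | 510 | 422-2571 | no | no | 0 | 113.0 | 80 | 19.21 | ... | 87 | 12.76 | 204.3 | 115 | 9.19 | 10.8 | 4 | 2.92 | 2 | False |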


The last column (`churn`) is the target. "True" means the customer left the subscription (churned) and "False" means they stayed.


## Exercise 1

#### 1(a)
rubric={points:4} 

Perform some exploratory data analysis on the training set. In particular:

- How many rows and columns are there?
- How many True/False target values are there?

Come up with **two** more exploratory questions you would like to answer (similar to the above two), and explore those as well. Briefly discuss your results in 1-3 sentences.

You are welcome to use `pandas_profiling` (see Lecture 6) but you don't have to.

#### 1(b)
rubric={points:10}

In preparation for building a classifier, set up a `ColumnTransformer` that performs whatever feature transformations you deem sensible. This can include dropping features if you think they are not helpful. Remember that by default `ColumnTransformer` will drop any columns that aren't accounted for when it's created.

In each case, briefly explain your rationale with 1-2 sentences. You do not need an explanation for every feature, but for every group of features that are being transformed the same way. For example, "I am doing transformation X to the following categorical features: `a`, `b`, `c` because of reason Y," etc.
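
As a reminder of the mechanics only (not a suggested answer), here is a minimal sketch of a `ColumnTransformer` on a made-up toy frame. The column names `minutes`, `plan`, and `id` are hypothetical, and the choice of transformers is purely illustrative. The `id` column is not listed in any transformer, so it is dropped by the default `remainder="drop"`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; column names are illustrative only
toy = pd.DataFrame({
    "minutes": [10.0, 20.0, 30.0, 40.0],
    "plan": ["no", "yes", "no", "yes"],
    "id": ["a1", "b2", "c3", "d4"],  # not listed below, so it is dropped
})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["minutes"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])  # remainder="drop" is the default

transformed = ct.fit_transform(toy)
print(transformed.shape)  # 4 rows; 1 scaled column + 2 one-hot columns
```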

In [23]:
X_train = df_train.drop(columns=["churn"])
X_test  = df_test.drop(columns=["churn"])

y_train = df_train["churn"]
y_test  = df_test["churn"]

#### 1(d)
rubric={points:3}

The original dataset had a feature called `area code`. Let's assume we encoded this feature with one-hot encoding.

1. The area codes were numbers to begin with. Why do we want to use one-hot encoding on this feature?
2. What were the possible values of `area code`? 
3. What new binary feature(s) were created to replace `area code`? 
4. For each possible value of `area code`, how is this value represented in the transformed data? For example, a different feature called `international plan` has two values, "no" and "yes". In the transformed data, there is a new feature called `international plan_yes`, where "yes" is represented as 1.0 and "no" is represented as 0.0. (There may also be a new feature called `international plan_no` depending on what hyperparameters you used in your `OneHotEncoder` - either way is fine.)
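
The `international plan` example in item 4 can be reproduced on toy data. This is a minimal sketch, assuming `drop="if_binary"` as the hyperparameter choice (one of several valid options); the three-row frame is made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows mimicking the `international plan` feature
toy = pd.DataFrame({"international plan": ["no", "yes", "no"]})

# drop="if_binary" keeps a single column for two-valued features
enc = OneHotEncoder(drop="if_binary")
encoded = enc.fit_transform(toy).toarray()

print(enc.categories_)  # the categories the encoder found
print(encoded)          # one column: 1.0 where the plan is "yes", 0.0 otherwise
```

Without the `drop` argument, the encoder would instead produce one column per category.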

#### 1(e)
rubric={points:4}

Create a `DummyClassifier` using `strategy='prior'`. Report the following scoring metrics via cross-validation: accuracy, precision, recall, F1-score. Briefly comment on your results, including any warnings the code produces (2 sentences max).
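
One way to collect several metrics in a single cross-validation run is to pass a list to the `scoring` argument of `cross_validate`. The sketch below uses synthetic labels (an assumption, not the churn data) and may emit warnings of the kind mentioned above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data (assumption): 90 negatives, 10 positives
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)

scoring = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(
    DummyClassifier(strategy="prior"), X, y, cv=5, scoring=scoring
)

# Each metric is stored under the key "test_<metric>"
for metric in scoring:
    print(metric, results[f"test_{metric}"].mean())
```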

#### 1(f)
rubric={points:10} 

- Train and score a logistic regression classifier on the dataset. 
- Report the same metrics as in the previous part.
- Are you satisfied with the results? Use your `DummyClassifier` results as a reference point. Discuss in a few sentences. 

#### 1(g)
rubric={points:5}

Set the `class_weight` parameter of your logistic regression model to `'balanced'`. Report the same metrics as in the previous part. Do you prefer this model to the one in the previous part? Discuss your results in a few sentences.
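
As background, scikit-learn documents the `'balanced'` weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below checks that formula against `compute_class_weight` on synthetic labels (an assumption, not the churn data).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels (assumption): 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)

# The documented formula, written out for two classes
manual = len(y) / (2 * np.bincount(y))
from_sklearn = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

print(manual)        # weight for class 0, weight for class 1
print(from_sklearn)  # should match the manual computation
```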

## Exercise 2: Hyperparameter optimization

#### 2(a)
rubric={points:5}

Now let's tune the hyperparameters of our `LogisticRegression` using `GridSearchCV` to maximize cross-validation accuracy. Jointly optimize `C` (choose some reasonable values) and `class_weight` (`None` vs. `'balanced'`). What values of `C` and `class_weight` are chosen and what is the best cross-validation accuracy?
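
For reference, here is a minimal sketch of how a grid over pipeline hyperparameters can be written. It runs on synthetic data, and both the pipeline (`StandardScaler` followed by `LogisticRegression`) and the `C` values are illustrative assumptions rather than recommendations. The `stepname__parameter` keys follow the lowercase step names that `make_pipeline` generates.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption), roughly 85% negatives
X, y = make_classification(
    n_samples=300, n_features=5, weights=[0.85], random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {
    "logisticregression__C": [0.1, 1.0, 10.0],  # illustrative values only
    "logisticregression__class_weight": [None, "balanced"],
}

# The `scoring` argument selects the metric being maximized
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```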

#### 2(b)
rubric={points:3}

This time, do the same optimization but have `GridSearchCV` optimize F1-score instead of accuracy. Discuss any changes in your results.

_Checkpoint_: It should be choosing `class_weight='balanced'` when you're optimizing F1-score. I am getting an F1-score around 0.48. You don't need to have exactly this, but if you're seeing something completely different there may be a problem.

#### 2(c)
rubric={points:5}

This dataset is fairly small so the code should be fairly fast to run. This makes it a good time to use more folds in cross-validation. Using your best model from the previous part, compute the F1-score using 20-fold cross-validation and print out the sub-scores. How big would you say the "error bars" are on your average score? (The answer is somewhat subjective, just try to be reasonable.) Briefly discuss.
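
A minimal sketch of printing per-fold sub-scores, on synthetic data with an arbitrary model (both are assumptions, not your tuned pipeline). The standard deviation of the fold scores is one possible starting point for thinking about the spread.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data (assumption), roughly 80% negatives
X, y = make_classification(
    n_samples=400, n_features=5, weights=[0.8], random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
scores = cross_val_score(model, X, y, cv=20, scoring="f1")

print(scores)  # one F1 sub-score per fold
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```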

## Exercise 3: test set

#### 3(a)
rubric={points:5}

- Evaluate your final model's F1 score on the test data. 
- Is it within your "error bars" from the previous part? (If it isn't, that doesn't mean you need to redo the previous part!) 
- Do you think you overfit on the cross-validation sets?
- Briefly discuss your results (1-2 sentences).

#### 3(b)
rubric={points:2}

Create a confusion matrix for your final model - one on the training set and one on the test set.
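
As a reminder of the layout scikit-learn uses, rows of `confusion_matrix` are the true classes and columns are the predicted classes. The sketch below uses made-up labels to show the four cells for a binary problem.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() returns the cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```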

#### 3(c)
rubric={points:5}

It is possible to read the precision and recall directly off these confusion matrices by normalizing them - see the `normalize` argument in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html). 

1. Compute your final model's precision on the test set.
2. Normalize the confusion matrix such that you can read the precision off it, and show how you can do this.
3. Compute your final model's recall on the test set.
4. Normalize the confusion matrix such that you can read the recall off it, and show how you can do this.

#### 3(d)
rubric={points:5}

The function below plots histograms of the predicted probability, split by the **true class**, for a given model and dataset. This is similar to the animated plots from lecture. 

Call this function twice, once for each of two models:

1. A pipeline with your chosen value of `C` and `class_weight='balanced'`
2. A pipeline with your chosen value of `C` and `class_weight=None`

Then, discuss your results: how do the two plots differ? Is `class_weight` changing things the way you expected? 


In [122]:
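def make_hists(model, X, y):
    """Plot overlaid histograms of the predicted probability, split by true class."""

    # Split the examples by their true label
    negative_examples = X[y == 0]
    positive_examples = X[y == 1]

    # Column 1 of predict_proba is the probability of the second class in model.classes_
    plt.hist(model.predict_proba(negative_examples)[:,1], alpha=0.5, bins=30, label="0", density=True)
    plt.hist(model.predict_proba(positive_examples)[:,1], alpha=0.5, bins=30, label="1", density=True)
    plt.legend(loc='upper right')

    plt.xlabel("predicted probability")
    plt.ylabel("normalized counts")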

## Exercise 4: Precision and recall by hand

Below is the confusion matrix of a machine learning system that predicts whether a cancer is malignant or not. Let's consider malignant to be the "positive class".

|    Actual/Predicted      | Predicted Benign | Predicted Malignant |
| :------------- | -----------------------: | -----------------------: |
| **Actual Benign**       | 6 | 238 |
| **Actual Malignant**       | 20 | 194 |

#### 4(a)
rubric={points:2}

Would you consider this an imbalanced dataset? Why or why not? Max 2 sentences.

    

#### 4(b)
rubric={points:2}

Based on the above confusion matrix, what is the recall? 

    

#### 4(c)
rubric={points:5}

Do you consider this to be a good classifier? What additional information might you need to answer this question? Briefly discuss in 1-3 sentences.

    

## Exercise 5: Very short answer questions
rubric={points:10}

Answer each of the following questions in **1-2 sentences**. Each one is worth 2 points.

1. One can think of `predict` as thresholding the output of `predict_proba` at some threshold. What is one scoring metric we talked about which is independent of this threshold? Briefly explain.
2. What is the difference between stratified cross-validation and regular cross-validation?
3. What is an advantage of ensembling multiple models as opposed to just choosing one of them?
4. What is a disadvantage of ensembling multiple models as opposed to just choosing one of them?
5. By default, `StackingClassifier` uses `LogisticRegression` as its "meta-model". Explain the significance of the coefficients learned by this `LogisticRegression` model.
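
The premise of question 1 can be checked directly. This minimal sketch, on synthetic data (an assumption), shows that for a binary `LogisticRegression` the output of `predict` matches thresholding `predict_proba` at 0.5, and that a different, hypothetical threshold changes the predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Probability of the positive class, thresholded at 0.5
proba_positive = model.predict_proba(X)[:, 1]
thresholded = (proba_positive > 0.5).astype(int)

print(np.array_equal(thresholded, model.predict(X)))

# A stricter, hypothetical threshold yields at most as many positive predictions
strict = (proba_positive > 0.8).astype(int)
print(thresholded.sum(), strict.sum())
```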

## Submission to Canvas

**IF YOU ARE WORKING WITH A PARTNER PLEASE FORM THE GROUP ON CANVAS BEFORE SUBMITTING** - see instructions [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#partners).

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.
2. Save your notebook.
3. Convert your notebook to `.html` format using the `convert_notebook()` function below **or** by `File -> Export Notebook As... -> Export Notebook to HTML`.
4. Run the code `submit()` below to go through an interactive submission process to Canvas.
>For this step, you will need a Canvas *Access Token*. If you haven't already got one, log in to Canvas, click `Account` (top-left of the screen), then `Settings`, then scroll down until you see the `+ New Access Token` button. Click that button, give your token any name you like and set the expiry date to Dec 31, 2020. Then click `Generate token`. Save this token in a safe place on your computer as you'll need it for all assignments. Treat the token with as much care as you would an important password. 

Note: for those having trouble with the Jupyter widgets and the dropdowns: if you add the argument `no_widgets=True` to your `submit` call, it should let you do a text-based entry of your key and avoid the dropdowns altogether. If this doesn't work, you probably need to upgrade to the latest version of `canvasutils` with `pip install canvasutils -U` from your terminal with your environment activated.


In [None]:
from canvasutils.submit import submit, convert_notebook

# Note: the canvasutils package should have been installed as part of your environment setup - 
# see https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md

In [None]:
# convert_notebook("hw4.ipynb", "html")  # uncomment and run when you want to try converting your notebook to HTML (or you can convert manually from the File menu)

In [None]:
# submit(course_code=53561, token=False)  # uncomment and run when ready to submit 