<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# KNN Classification and Imputation: Cell Phone Churn Data

_Authors: Kiefer Katovich (SF)_

---

In this lab you will practice using KNN for classification (and a little bit for regression as well).

The dataset is one on "churn" in cell phone plans. It has information on the usage of the phones by different account holders and whether or not they churned or not.

Our goal is to predict whether a user will churn or not based on the other features.

We will also be using the KNN model to **impute** missing data. There are a couple of columns in the dataset with missing values, and we can build KNN models to predict what those missing values will most likely be. This is a more advanced imputation method than just filling in the mean or median.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.neighbors import KNeighborsClassifier

### 1. Load the cell phone "churn" data containing some missing values.

In [47]:
churn = pd.read_csv('../data/churn_missing.csv')

### 2. Examine the data. What columns have missing values?

In [3]:
# A:
churn.isnull().sum()

state               0
account_length      0
area_code           0
intl_plan           0
vmail_plan        400
vmail_message     400
day_mins            0
day_calls           0
day_charge          0
eve_mins            0
eve_calls           0
eve_charge          0
night_mins          0
night_calls         0
night_charge        0
intl_mins           0
intl_calls          0
intl_charge         0
custserv_calls      0
churn               0
dtype: int64

### 3. Convert the `vmail_plan` and `intl_plan` colums to binary integer columns.

Make sure that if a value is missing that you don't fill it in with a new value! Preserve the missing values.

In [24]:
# A:
churn['vmail_plan'] = churn.vmail_plan.map({'yes': 1, 'no': 0})
churn['intl_plan'] = churn.intl_plan.map({'yes': 1, 'no': 0})

### 4. Create dummy coded columns for state and concatenate it to the churn dataset.

> **Remember:** You will need to leave out one of the state dummy coded columns to serve as the "reference" column since we will be using these for modeling.

In [25]:
# A:
dum_state = pd.get_dummies(churn.state)
churn = pd.concat([churn, dum_state], axis = 1)

### 5. Create a version of the churn data that has no missing values.

Calculate the shape

In [26]:
# A:
churn_nona = churn.drop(['vmail_plan', 'intl_plan'], axis = 1)
churn_nona.shape

(6666, 171)

### 6. Create a target vector and predictor matrix.

- Target should be the `churn` column.
- Predictor matrix should be all columns except `area_code`, `state`, and `churn`.

In [36]:

# A:
X = churn_nona.drop(['churn', 'state'], axis = 1)
y = churn_nona.churn

In [37]:
X.isnull().sum()

AK    3333
AL    3333
AR    3333
AZ    3333
CA    3333
CO    3333
CT    3333
DC    3333
DE    3333
FL    3333
GA    3333
HI    3333
IA    3333
ID    3333
IL    3333
IN    3333
KS    3333
KY    3333
LA    3333
MA    3333
MD    3333
ME    3333
MI    3333
MN    3333
MO    3333
MS    3333
MT    3333
NC    3333
ND    3333
NE    3333
      ... 
ME       0
MI       0
MN       0
MO       0
MS       0
MT       0
NC       0
ND       0
NE       0
NH       0
NJ       0
NM       0
NV       0
NY       0
OH       0
OK       0
OR       0
PA       0
RI       0
SC       0
SD       0
TN       0
TX       0
UT       0
VA       0
VT       0
WA       0
WI       0
WV       0
WY       0
Length: 169, dtype: int64

### 7. Calculate the baseline accuracy for `churn`.

In [38]:
# A:
y.sum()/len(y)

0.07245724572457246

In [44]:
X = churn.drop(['vmail_plan', 'vmail_message', 'churn', 'state'], axis = 1)
X = churn[['']]

In [45]:
y = churn['churn']

### 8. Cross-validate a KNN model predicting `churn`. 

- Number of neighbors should be 5.
- Make sure to standardize the predictor matrix.
- Set cross-validation folds to 10.

Report the mean cross-validated accuracy.

In [46]:
# A:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

### 9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.

Plot the cross-validated mean accuracy for each score. What is the best accuracy?

In [10]:
# A:

### 10. Imputing with KNN

K-Nearest Neighbors can be used to impute missing values in datasets. What we will do is estimate the most likely value for the missing data based on a KNN model.

We have two columns with missing data:
- `vmail_plan`
- `vmail_message`

**10.A Create two subsets of the churn dataset: one without missing values for `vmail_plan` and `vmail_message`, and one with the missing values.**

In [11]:
# A:

First we will impute values for `vmail_plan`. This is a categorical column and so we will impute using classification (predicting whether the plan is yes or no, 1 vs. 0).

**10.B Create a target that is `vmail_plan` and predictor matrix that is all columns except `state`, `area_code`, `churn`, `vmail_plan`, and `vmail_message`.**

> **Note:** We don't include the `churn` variable in the model to impute. Why? We are imputing these missing values so that we can use the rows to predict churn with more data afterwards. If we imputed with churn as a predictor then we would be cheating.

In [12]:
# A:

**10.C Standardize the predictor matrix.**

In [13]:
# A:

**10.D Find the best K for predicting `vmail_plan`.**

You may want to write a function for this. What is the accuracy for predicting `vmail_plan` at the best K? What is the baseline accuracy for `vmail_plan`?

In [14]:
# A:

**10.E Fit a `KNeighborsClassifier` with the best number of neighbors.**

In [15]:
# A:

**10.F Predict the missing `vmail_plan` values using the subset of the data where it is misssing.**

You will need to:
1. Create a new predictor matrix using the same predictors but from the missing subset of data.
- Standardize this predictor matrix *using the StandardScaler object fit on the non-missing data*. This means you will just use the `.transform()` function. It is important to standardize the new predictors the same way we standardized the original predictors if we want the predictions to make sense. Calling `.fit_transform()` will reset the standardized scale.
- Predict what the missing vmail plan values should be.
- Replace the missing values in the original with the predicted values.

> **Note:** It may predict all 0's. This is OK. If you want to see the predicted probabilities of `vmail_plan` for each row you can use the `.predict_proba()` function instead of `.predict()`.  You can use these probabilities to manually set the criteria threshold.

In [16]:
# A:

### 11. Impute the missing values for `vmail_message` using the same process.

Since `vmail_message` is essentially a continuous measure, you need to use `KNeighborsRegressor` instead of the `KNeighborsClassifier`.

KNN can do both regression and classification! Instead of "voting" on the class like in classification, the neighbors will average their value for the target in regression.

In [17]:
# A:

### 12. Given the accuracy (and $R^2$) of your best imputation models when finding the best K neighbors, do you think imputing is a good idea?

In [18]:
# A:

### 13. With the imputed dataset, cross-validate the accuracy predicting churn. Is it better? Worse? The same?

In [19]:
# A: