# Introduction to Data Science
## Homework 3

Student Name: FP2

Student Netid: 
***

### Part 1
Mutual Information, Entropy, Conditional Entropy, and Information Gain are terms that one encounters frequently in data science discussions.  These quantities are closely related.  Given discrete random variables $X$ and $Y$: 

$$\text{Mutual Information} = \sum_{y \in Y} \sum_{x \in X} p(x, y) \cdot \log\frac{p(x, y)}{p(x)p(y)}$$

$$\text{Entropy} = H(Y) = -\sum_{y \in Y} p(y) \cdot \log(p(y))$$

$$\text{Conditional Entropy} = H(Y \mid X) = \sum_{x \in X} p(x) \cdot H(Y \mid X = x)$$

Your task: show mathematically that $\text{Mutual Information} = \text{Information Gain}$, where $\text{Information Gain} = H(Y) - H(Y \mid X)$. Give the derivation below.  
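
Before writing the derivation, it can help to check the identity numerically. The sketch below uses a made-up 2x2 joint distribution (the numbers and the variable names `joint`, `mi_value`, and `ig_value` are assumptions for illustration only) and evaluates both sides from the definitions above. It is a sanity check, not a substitute for the proof.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])

p_x = joint.sum(axis=1)  # marginal p(x)
p_y = joint.sum(axis=0)  # marginal p(y)

# Mutual information, straight from the double-sum definition.
mi_value = np.sum(joint * np.log2(joint / np.outer(p_x, p_y)))

# Entropy H(Y) and conditional entropy H(Y | X).
h_y = -np.sum(p_y * np.log2(p_y))
p_y_given_x = joint / p_x[:, None]  # each row is p(y | X = x)
h_y_given_x = -np.sum(p_x * np.sum(p_y_given_x * np.log2(p_y_given_x), axis=1))

# Information gain is the drop in entropy after conditioning on X.
ig_value = h_y - h_y_given_x
print(mi_value, ig_value)
```

The two printed numbers should agree up to floating-point error. Every cell of the toy distribution is nonzero, so no special handling of $\log 0$ is needed here.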

Place your answer here! You can type in math by using $\LaTeX$ (see the question for some examples). You can also hand write this, scan it in, and insert it as an image into this notebook. Try $\LaTeX$!  

NB: data science documents are often written in $\LaTeX$, because it is especially convenient for typesetting math formulas.  For a beginner's guide, search Google for "introduction latex princeton" and skip to Section 7 for the math part.

### Part 2
1\. It is essential to be able to differentiate between the two analytics modes: data mining and use of the results of data mining.  Label each case as describing either data mining `(DM)`, or the use of the results of data mining `(USE)`.  [Replace `(ANS)` below.]

a) `(ANS)` Choose customers who are most likely to respond to an on-line ad.

b) `(ANS)` Discover rules that indicate when an account has been defrauded.

c) `(ANS)` Find patterns indicating what customer behavior is more likely to lead to response to an on-line ad.

d) `(ANS)` Estimate probability of default for a credit application.

e) `(ANS)` Predict whether a customer is pregnant.

2\. Plumbing Inc. has been selling plumbing supplies for the last 20 years. The owner, Joe, decides that next year it is finally time to diversify by adding gardening tools to his products. Having had success using customer data to build predictive models to guide direct mail campaigns for special plumbing offers, he considers that data mining could help him to identify a subset of customers who should be good prospects for his new set of products. Is Joe ready to solve this as a supervised learning problem? What would you suggest as the target variable?  Be precise. Is there anything else that you would recommend that Joe do to achieve his business goal?

Place your answer here.

### Part 3
This is a hands-on task where we will build a tree-structured predictive model as discussed in class and in the book. For this part, we will be using the data in `data/cell2cell_data_80_percent.csv`.

These historical data consist of 31,892 customers: 15,855 customers that churned (i.e., left the company) and 16,036 that did not churn (see the `"churndep"` variable). Here are the data set's 11 attributes describing the customers: 

```
Pos.  Var. Name  Var. Description
----- ---------- --------------------------------------------------------------
1     revenue    Mean monthly revenue in dollars
2     outcalls   Mean number of outbound voice calls
3     incalls    Mean number of inbound voice calls
4     months     Months in Service
5     eqpdays    Number of days the customer has had his/her current equipment
6     webcap     Handset is web capable
7     marryyes   Married (1=Yes; 0=No)
8     travel     Has traveled to non-US country (1=Yes; 0=No)
9     pcown      Owns a personal computer (1=Yes; 0=No)
10    creditcd   Possesses a credit card (1=Yes; 0=No)
11    retcalls   Number of calls previously made to retention team
```

The 12th column, the target variable `"churndep"`, equals 1 if the customer churned, and 0 otherwise. 

**`VVV` VERY IMPORTANT NOTE `VVV`**

**Don't forget to exclude the target variable `"churndep"` when fitting your models. You don't want to include the target when fitting!!!**

1\. Load the data into a pandas `DataFrame()`.

In [None]:
import pandas as pd

# Any code here!

data = None

2\. Using the following two functions for Entropy and Information Gain (don't forget to run this cell!), figure out what is the maximum information gain for each feature. Make a bar plot with feature names along the x-axis and maximum information gain on the y-axis. Which one is the largest? Don't forget some of the features are binary.

In [None]:
import numpy as np
import math

def entropy(target):
    # Get the number of users
    n = len(target)
    # Count how frequently each unique value occurs
    counts = np.bincount(target).astype(float)
    # Initialize entropy
    entropy = 0
    # If the split is perfect, return 0
    if len(counts) <= 1 or 0 in counts:
        return entropy
    # Otherwise, for each possible value, update entropy
    for count in counts:
        entropy += math.log(count/n, len(counts)) * count/n
    # Return entropy
    return -1 * entropy

def information_gain(feature, threshold, target):
    '''
    This function takes three things:
    feature - A list of all the possible values this feature has, e.g. data['revenue']
    threshold - A number to threshold a continuous variable on, e.g. 1.2
    target - A list of all the target values in the same order as feature, e.g. data['churndep']
    '''
    # Dealing with numpy arrays makes this slightly easier
    target = np.array(target)
    feature = np.array(feature)
    # Record whether each feature value is at or below the threshold
    feature = (feature <= threshold)
    # Initialize information gain with the parent entropy
    ig = entropy(target)
    # For both sides of the threshold, update information gain
    for level, count in zip([0, 1], np.bincount(feature).astype(float)):
        ig -= count/len(feature) * entropy(target[feature == level])
    # Return information gain
    return ig

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Code here
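
One way to read "maximum information gain" is: try each distinct value of a feature as a threshold and keep the largest gain. The sketch below illustrates that scan on made-up toy data. The helper names `label_entropy`, `gain_at`, and `max_gain` are hypothetical stand-ins written so the example runs on its own; they are not the functions supplied above, and they use base-2 logarithms.

```python
import numpy as np

def label_entropy(labels):
    """Shannon entropy (base 2) of an array of 0/1 labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    probs = np.bincount(labels, minlength=2) / float(labels.size)
    probs = probs[probs > 0]  # drop empty classes to avoid log(0)
    return float(-np.sum(probs * np.log2(probs)))

def gain_at(feature, threshold, labels):
    """Information gain from splitting on feature <= threshold."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    below = feature <= threshold
    gain = label_entropy(labels)
    for mask in (below, ~below):
        # Weight each side's entropy by the fraction of rows it holds.
        gain -= mask.mean() * label_entropy(labels[mask])
    return gain

def max_gain(feature, labels):
    """Largest gain over every distinct feature value used as a threshold."""
    return max(gain_at(feature, t, labels) for t in np.unique(feature))

# Hypothetical toy data: "months" separates the classes, "noise" does not.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
months = np.array([2, 3, 5, 6, 20, 24, 30, 36])
noise = np.array([1, 0, 1, 0, 1, 0, 1, 0])
gains = {"months": max_gain(months, labels), "noise": max_gain(noise, labels)}
print(gains)
```

For a binary feature the scan only has two thresholds to try, which is why binary columns need no special case. On a continuous column with many distinct values the full scan can be slow, so a coarser grid of thresholds (for example, percentiles) is a reasonable compromise.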

3\. Now build and fit a tree-structured model using `DecisionTreeClassifier()` [(manual page)](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) using the 11 attributes to predict the `"churndep"` target variable. Make sure to use `criterion='entropy'` when instantiating an instance of `DecisionTreeClassifier()`. For all other settings you should use the default options (this means you don't have to set anything).

**Remember, don't forget to exclude the target variable `"churndep"` when fitting your models. You don't want to fit on the target!!!**

In [None]:
import sklearn
from sklearn.tree import DecisionTreeClassifier

# Code here
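
The warning about the target column can be illustrated on a small synthetic frame. In this sketch the `toy` DataFrame and its values are invented for illustration; only the column name `churndep` and the `criterion='entropy'` setting come from the assignment. The point is the separation of predictors and target before calling `fit()`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the churn data: two features plus the target.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "months": rng.randint(1, 60, size=200),
    "revenue": rng.uniform(10, 100, size=200),
})
toy["churndep"] = (toy["months"] < 12).astype(int)

# Separate predictors from the target so the target never reaches fit().
X = toy.drop("churndep", axis=1)
y = toy["churndep"]

tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)
print(tree.score(X, y))
```

If the target were left in `X`, the tree could split on it directly and report a perfect but meaningless fit.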

4\. Load in another data set `data/cell2cell_data_20_percent.csv`. This data is of the same format as the other file we read in. Using the classifier built and fit in 3.3, predict `"churndep"` on the original data and the new data that you just loaded in. How well does it predict? (I.e., what is the accuracy on both data sets?)

In [None]:
import pandas as pd
from sklearn import metrics

data_new = None

# Code here!!!

tree_original_accuracy = None
tree_new_accuracy = None

# This line will be used for grading. DO NOT REMOVE IT. Make sure it prints out the correct value!!!
print("Original and new tree accuracy = %.4f and %.4f" % (tree_original_accuracy, tree_new_accuracy))
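
To see why the two accuracies can differ, here is a minimal sketch on synthetic data with label noise. The `make_toy` helper, the 20% noise rate, and the sample sizes are assumptions chosen for illustration; only `metrics.accuracy_score` and the default tree settings mirror the task.

```python
import numpy as np
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

def make_toy(n):
    """Hypothetical noisy data: the label mostly follows the first feature."""
    X = rng.normal(size=(n, 3))
    flip = rng.uniform(size=n) < 0.2  # flip roughly 20% of the labels
    y = ((X[:, 0] > 0) ^ flip).astype(int)
    return X, y

X_fit, y_fit = make_toy(800)  # plays the role of the 80% file
X_new, y_new = make_toy(200)  # plays the role of the 20% file

tree = DecisionTreeClassifier(criterion="entropy").fit(X_fit, y_fit)
acc_fit = metrics.accuracy_score(y_fit, tree.predict(X_fit))
acc_new = metrics.accuracy_score(y_new, tree.predict(X_new))
print("fit accuracy = %.4f, new accuracy = %.4f" % (acc_fit, acc_new))
```

A fully grown tree can memorize the noisy training labels, so the accuracy on the data it was fit on overstates how well it predicts unseen rows.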

### Part 4
The default options for your decision tree may not be optimal. We need to analyze whether tuning the parameters can improve the accuracy of the classifier.  For the following options `max_depth`, `min_samples_split`, and `min_samples_leaf`:

1\. Generate a range of 10 values of each that make sense to test

In [None]:
# Code here

min_samples_split_values = None
min_samples_leaf_values = None
max_depth_values = None

2\. For the values of `max_depth`, `min_samples_split`, and `min_samples_leaf` you chose in 4.1 build a new decision tree classifier (on the original data we read in, the variable `data`) and record the classifier's accuracy on both the original data (the variable `data`) and the new data we read in (the variable `data_new`). You should now generate three plots, each with 10 points for the original data and 10 points for the new data. The values you chose are on the x-axis, the accuracies you calculated are on the y-axis, and there will be two lines/curves per plot (one for `data` and the other for `data_new`).

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
for value in min_samples_split_values:
    pass  # Code here

In [None]:
for value in min_samples_leaf_values:
    pass  # Code here

In [None]:
for value in max_depth_values:
    pass  # Code here
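
The sweep described in 4.2 follows the same pattern for each parameter: loop over the candidate values, refit, and record both accuracies. The sketch below shows that pattern for `max_depth` only, on synthetic data. The data, the choice of depths 1 through 10, and the names `depth_values`, `fit_scores`, and `new_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data standing in for `data` and `data_new`.
rng = np.random.RandomState(1)
X_all = rng.normal(size=(1000, 4))
noise = rng.uniform(size=1000) < 0.2
y_all = ((X_all[:, 0] + X_all[:, 1] > 0) ^ noise).astype(int)
X_fit, y_fit = X_all[:800], y_all[:800]
X_new, y_new = X_all[800:], y_all[800:]

depth_values = list(range(1, 11))  # ten candidate depths (an assumption)
fit_scores, new_scores = [], []
for depth in depth_values:
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth,
                                  random_state=0)
    tree.fit(X_fit, y_fit)
    # score() returns mean accuracy on the given data.
    fit_scores.append(tree.score(X_fit, y_fit))
    new_scores.append(tree.score(X_new, y_new))

for depth, fit_acc, new_acc in zip(depth_values, fit_scores, new_scores):
    print("max_depth=%2d  fit=%.3f  new=%.3f" % (depth, fit_acc, new_acc))
```

Plotting `fit_scores` and `new_scores` against `depth_values` with two `plt.plot` calls gives the two-curve figure the question asks for. The same loop applies to `min_samples_split` and `min_samples_leaf` by swapping the keyword argument.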

3\. Now that you have read Chapter 4 of your textbook, let's try fitting some linear models: a logistic regression (`sklearn.linear_model.LogisticRegression()`, [manual](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)) and a linear SVM (`sklearn.svm.LinearSVC()`, [manual](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)). Fit each model on the first set of data we read in (the variable `data`) and report its accuracy on both sets of data (`data` and `data_new`). When fitting each model, keep all parameters at their defaults.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Code here

logistic_regression_original_accuracy = None
logistic_regression_new_accuracy = None
svm_original_accuracy = None
svm_new_accuracy = None

# These lines will be used for grading. DO NOT REMOVE THEM. Make sure they print out the correct values!!!
print("Original and new logistic regression accuracy = %.4f and %.4f" % (logistic_regression_original_accuracy, logistic_regression_new_accuracy))
print("Original and new SVM accuracy = %.4f and %.4f" % (svm_original_accuracy, svm_new_accuracy))
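
Both linear models share the same `fit()` and `score()` interface, so they can be handled in one loop. The following sketch uses synthetic, linearly separable data; the data and the `results` dictionary are assumptions for illustration, and both models are left at their defaults as the question requires.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Hypothetical linearly separable data standing in for `data` and `data_new`.
rng = np.random.RandomState(2)
X_all = rng.normal(size=(1000, 3))
y_all = (X_all[:, 0] - X_all[:, 1] > 0).astype(int)
X_fit, y_fit = X_all[:800], y_all[:800]
X_new, y_new = X_all[800:], y_all[800:]

results = {}
for name, model in [("logistic regression", LogisticRegression()),
                    ("linear SVM", LinearSVC())]:
    model.fit(X_fit, y_fit)
    # Store (accuracy on fit data, accuracy on new data) for each model.
    results[name] = (model.score(X_fit, y_fit), model.score(X_new, y_new))
    print("%s: fit=%.4f new=%.4f" % ((name,) + results[name]))
```

On real data whose features sit on very different scales, such as dollars versus 0/1 flags, default linear models may converge slowly or score poorly. That is worth keeping in mind when comparing them with the tree.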