# Introduction to Data Science
## Homework 2

Student Name: Jesse Swanson

Student Netid: js11133

### Part 1: Case study (5 Points)
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the formulation that you see as relevant to solving the problem.  Be precise but concise.

To solve Target's problem of identifying pregnant women as early as possible, we need to collect data on each consumer. This data must be tied to a unique user ID so that individual users can be targeted directly by marketing. The data collected likely contains numerous features, such as where the consumer lives, estimated salary, and recent purchases. The goal of the predictive model is to use these features to classify consumers as pregnant or not pregnant. This is a supervised learning problem, so each feature vector in the training data must carry a label indicating whether the consumer is pregnant or a new mother. This label can likely be inferred from purchasing habits such as buying diapers or baby clothes. Once we know a consumer is a mother, we can look back through their purchase history and identify features that correlate with pregnancy.

After collecting a data set of consumers that are both pregnant and not pregnant, we need to train a model on the data. This is a classification problem, since we want to answer the question: "Is this user pregnant?" There are many types of models to choose from, such as decision trees, linear classifiers, and support vector machines. It is best to start with a simple model such as a decision tree and add complexity as needed. Simple models are often more transparent and easier for audiences to grasp. For a decision tree, we evaluate each feature of the data set to determine which one gives the most information gain, and split on that feature first so that the most informative features drive the earliest decisions. Limiting how far the tree grows then helps control overfitting. After training the model, it is important to test it on holdout data to check whether we overfitted. An overfitted model reports a higher accuracy on the training data than it achieves on new data, because it is fitted too closely to the training data and does not generalize.
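
As a rough illustration of the information gain step described above, the sketch below computes the entropy reduction for two candidate features. The toy records and the feature names `bought_lotion` and `lives_in_city` are invented for illustration and do not come from Target's data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, feature, target):
    """Entropy reduction from splitting `rows` on the values of `feature`."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    # Weighted average entropy of the child groups after the split
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children

# Hypothetical toy data: feature names and values are assumptions for illustration.
customers = [
    {"bought_lotion": 1, "lives_in_city": 1, "pregnant": 1},
    {"bought_lotion": 1, "lives_in_city": 0, "pregnant": 1},
    {"bought_lotion": 0, "lives_in_city": 1, "pregnant": 0},
    {"bought_lotion": 0, "lives_in_city": 0, "pregnant": 0},
]

gains = {f: information_gain(customers, f, "pregnant")
         for f in ("bought_lotion", "lives_in_city")}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

In this toy set, `bought_lotion` separates the two classes completely while `lives_in_city` tells us nothing, so a tree would split on `bought_lotion` first.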

Once this model is released to production, we will need the ability to retrain it on feedback data. Concept drift makes the model less accurate over time because of unforeseen changes in the underlying data. For example, a pregnant mother in the 1930s likely had different purchasing behaviours than a modern mother, so a model trained on 1930s mothers would likely not be accurate for modern mothers.
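
One simple way to notice drift, sketched here under assumptions, is to track accuracy over consecutive windows of feedback and trigger retraining when the latest window falls below a threshold. The feedback log, window size, and threshold below are hypothetical choices, not values from the article.

```python
def windowed_accuracy(records, window_size):
    """Accuracy of (predicted, actual) pairs over consecutive windows."""
    accuracies = []
    for start in range(0, len(records), window_size):
        window = records[start:start + window_size]
        correct = sum(1 for predicted, actual in window if predicted == actual)
        accuracies.append(correct / len(window))
    return accuracies

def needs_retraining(accuracies, threshold):
    """True if the most recent window falls below the chosen threshold."""
    return bool(accuracies) and accuracies[-1] < threshold

# Hypothetical feedback log of (predicted, actual) labels, oldest first.
feedback = [
    (1, 1), (0, 0), (1, 1), (0, 0),  # oldest window: all correct
    (1, 1), (0, 1), (1, 1), (0, 0),  # middle window: one miss
    (1, 0), (0, 1), (1, 1), (0, 1),  # newest window: three misses
]

acc = windowed_accuracy(feedback, window_size=4)
retrain = needs_retraining(acc, threshold=0.7)
print(acc, retrain)
```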

### Part 2: Exploring data in the command line (4 Points - 1 each)
For this part we will be using the data file located in `"advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use ssh to use the actual bash shell in your terminal and then just paste your answers here. Recall that once you enter the "!" then filename completion should work. Also, these are standard data exploration commands that are quick and easy to use in a terminal or in the notebook. We don't cover command line operations formally in this class, but these are worth learning (and thus are part of the HW). Be resourceful. Use online cheat sheets or Stack Overflow as needed to answer the questions.]

1\. How many records (lines) are in this file (look up wc)?

In [1]:
!wc -l "advertising_events.csv"

   10341 advertising_events.csv


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [3]:
!cut -f 1 -d , advertising_events.csv | sort | uniq | wc -l

     732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [4]:
!sort advertising_events.csv | uniq | sort -t , -k3 -k1 | cut -d , -f 3 | uniq -c | sort -rn | awk '{$1=$1}1' | cut -d ' ' -f 2

google.com
facebook.com
youtube.com
yahoo.com
baidu.com
wikipedia.org
amazon.com
qq.com
twitter.com
taobao.com
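
As a cross-check on the shell pipeline above, the same ranking can be sketched in Python with `collections.Counter`. The inline sample rows are hypothetical and only mimic the `userid,timestamp,domain,action` layout; to run it on the real file, one would pass `open("advertising_events.csv", newline="")` to `csv.reader` instead. Unlike the pipeline, this sketch counts every record and does not drop exact duplicate lines first.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the same userid,timestamp,domain,action layout.
sample = io.StringIO(
    "1,100,google.com,0\n"
    "2,101,facebook.com,1\n"
    "1,102,google.com,2\n"
    "3,103,yahoo.com,0\n"
    "2,104,google.com,1\n"
    "3,105,facebook.com,0\n"
)

# Column index 2 holds the domain; count one visit per record.
domain_counts = Counter(row[2] for row in csv.reader(sample))

# most_common() returns (domain, count) pairs in descending order of count.
ranking = [domain for domain, _ in domain_counts.most_common()]
print(ranking)
```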


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [6]:
!grep "^37," advertising_events.csv 

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0


### Part 3: Dealing with data Pythonically (16 Points)

In [7]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np

1\. Load the data set `"ads_dataset.tsv"` into a Python Pandas data frame called `ads`.

In [115]:
ads = pd.read_csv('ads_dataset.tsv', sep='\t')

2\. Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` [(manual page)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method returns a useful series of values that can be used here.

In [120]:
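def getDfSummary(input_data):
    # Count missing (NaN) values per column
    d_num_nan = input_data.isna().sum().to_frame(name='number_nan')
    # Count distinct values per column (nunique ignores NaN by default)
    d_num_dist = input_data.nunique().to_frame(name='number_distinct')
    # mean, std, min, percentiles, and max; transpose so each row is a variable
    d_desc = input_data.describe().T
    del d_desc['count']
    # Combine everything into one frame indexed by variable name
    output_data = d_num_nan.join(d_num_dist).join(d_desc)
    return output_data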

3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`

In [121]:
%timeit getDfSummary(ads)

45.3 ms ± 886 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


4\. Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [122]:
#buy_freq contains NaN values
getDfSummary(ads)

| variable | number_nan | number_distinct | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0 | 2 | 0.042632 | 0.202027 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| buy_freq | 52257 | 10 | 1.240653 | 0.782228 | 1.0 | 1.0 | 1.0 | 1.0 | 15.0 |
| visit_freq | 0 | 64 | 1.852777 | 2.92182 | 0.0 | 1.0 | 1.0 | 2.0 | 84.0 |
| buy_interval | 0 | 295 | 0.210008 | 3.922016 | 0.0 | 0.0 | 0.0 | 0.0 | 174.625 |
| sv_interval | 0 | 5886 | 5.82561 | 17.595442 | 0.0 | 0.0 | 0.0 | 0.104167 | 184.9167 |
| expected_time_buy | 0 | 348 | -0.19804 | 4.997792 | -181.9238 | 0.0 | 0.0 | 0.0 | 84.28571 |
| expected_time_visit | 0 | 15135 | -10.210786 | 31.879722 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 |
| last_buy | 0 | 189 | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 |
| last_visit | 0 | 189 | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 |
| multiple_buy | 0 | 2 | 0.006357 | 0.079479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


5\. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [124]:
df1 = ads[pd.isnull(ads['buy_freq'])]
getDfSummary(df1)
# isbuyer, buy_interval, expected_time_buy, and multiple_buy predict the missing buy_freq entries: each of them is
# constant at 0 whenever buy_freq is NaN, so the data is not missing at random. Since these rows are all non-buyers,
# it is likely that the missing buy_freq should be 0.

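| variable | number_nan | number_distinct | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| buy_freq | 52257 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| visit_freq | 0 | 48 | 1.651549 | 2.147955 | 1.0 | 1.0 | 1.0 | 2.0 | 84.0 |
| buy_interval | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sv_interval | 0 | 5112 | 5.686388 | 17.623555 | 0.0 | 0.0 | 0.0 | 0.041667 | 184.9167 |
| expected_time_buy | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| expected_time_visit | 0 | 13351 | -9.669298 | 31.23903 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 |
| last_buy | 0 | 189 | 65.741317 | 53.484622 | 0.0 | 19.0 | 52.0 | 106.0 | 188.0 |
| last_visit | 0 | 189 | 65.741317 | 53.484622 | 0.0 | 19.0 | 52.0 | 106.0 | 188.0 |
| multiple_buy | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |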
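
The comparison above can also be checked directly rather than read off the summary. The sketch below, which assumes a small hypothetical frame in place of `ads`, tests whether `buy_freq` is missing exactly on the rows where `isbuyer` is 0 and then fills the gaps with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `ads`; the values are invented.
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1, 0],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0, np.nan],
})

# Boolean mask of rows where buy_freq is missing
missing = toy["buy_freq"].isna()

# True only if buy_freq is missing on exactly the non-buyer rows
perfectly_aligned = bool((missing == (toy["isbuyer"] == 0)).all())

# Replace the missing purchase counts with 0
filled = toy["buy_freq"].fillna(0)
print(perfectly_aligned, filled.tolist())
```

On the real data, substituting `ads` for `toy` would show whether the alignment holds for every row, which the summary table alone only suggests.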


6\. Which variables are binary?

In [None]:
# isbuyer, multiple_buy, multiple_visit, and y_buy are binary, since each has only 2 distinct values