# Introduction to Data Science
## Homework 2

Student Name: Zian Jiang

Student Netid: zj444
***

### Part 1: Case study (5 Points)
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the formulation that you see as relevant to solving the problem.  Be precise but concise.

Problem statement: Customers usually have fixed shopping habits, and those habits are only likely to change around a major life event, such as expecting a baby. Target wants to know when a customer is expecting a baby so that it can send that customer coupons and lure them into shopping at Target instead of elsewhere. However, the only data Target has about its customers is what they have purchased. So Target needs to predict whether or not a customer is expecting a baby, given their shopping history.

Solution: Customers who sign up for Target's baby-shower registry are known to be pregnant, so their purchase histories can serve as labeled training data. By observing how their shopping habits change as the delivery date approaches, Target can identify the products that pregnant customers mostly purchase and estimate their delivery dates as well. Target can then use those products to score other customers and start sending them coupons timed to specific stages of their pregnancy.
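
To make this formulation concrete, one option is to treat registry membership as the target variable and per-category purchase counts as the features. The sketch below builds such a labeled table from made-up records; the customer ids, product categories, and the `build_training_rows` helper are all hypothetical and not taken from the article.

```python
from collections import Counter, defaultdict

# Hypothetical purchase log: (customer_id, product_category).
purchases = [
    (1, "unscented_lotion"), (1, "vitamins"), (1, "cotton_balls"),
    (2, "beer"), (2, "chips"),
    (3, "unscented_lotion"), (3, "vitamins"),
]
# Hypothetical labels: customers who signed up for the baby registry.
registry = {1, 3}


def build_training_rows(purchases, registry, categories):
    """Return {customer: (features, label)}.

    features: purchase counts for each category, in the given order.
    label: 1 if the customer is on the registry, else 0.
    """
    counts = defaultdict(Counter)
    for customer, category in purchases:
        counts[customer][category] += 1
    rows = {}
    for customer, per_category in counts.items():
        features = [per_category[cat] for cat in categories]
        rows[customer] = (features, int(customer in registry))
    return rows


categories = ["unscented_lotion", "vitamins", "beer"]
rows = build_training_rows(purchases, registry, categories)
for customer, (features, label) in sorted(rows.items()):
    print(customer, features, label)
```

Any classifier that outputs a probability could then be trained on rows like these and used to score customers who are not on the registry.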

### Part 2: Exploring data in the command line (4 Points - 1 each)
For this part we will be using the data file located in `"advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use ssh to use the actual bash shell in your terminal and then just paste your answers here. Recall that once you enter the "!" then filename completion should work. Also, these are standard data exploration commands that are quick and easy to use in a terminal or in the notebook. We don't cover command line operations formally in this class, but these are worth learning (and thus are part of the HW). Be resourceful. Use whatever online cheat sheets or Stackoverflow to answer the question.]

1\. How many records (lines) are in this file (look up wc)?

In [1]:
# Place your code here
!wc -l advertising_events.csv

   10341 advertising_events.csv


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [3]:
# Place your code here  
!cut -d "," -f 1 advertising_events.csv | sort | uniq | wc -l
# Alternative with awk; -F, is needed so that fields are split on commas:
#!awk -F, '{print $1}' advertising_events.csv | sort | uniq | wc -l

     732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [2]:
# Place your code here
!cut -d "," -f 3 advertising_events.csv | sort | uniq -c | sort -nr

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
 513 wikipedia.org
 511 amazon.com
 382 qq.com
 321 twitter.com
 316 taobao.com
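
As a cross-check on the `sort | uniq -c | sort -nr` pipeline, the same ranking can be sketched in Python with `collections.Counter`. The list of domains below is made up and only stands in for the third column of `advertising_events.csv`.

```python
from collections import Counter

# Made-up values standing in for the domain column.
domains = [
    "google.com", "facebook.com", "google.com",
    "yahoo.com", "google.com", "facebook.com",
]

# most_common() returns (value, count) pairs sorted by count, descending.
ranked = Counter(domains).most_common()
for domain, visits in ranked:
    print(visits, domain)
```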


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [4]:
# Place your code here
!grep "^37," advertising_events.csv

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0
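
The `^` anchor and the trailing comma in the pattern matter: without them, `37` would also match other user ids and digits inside timestamps. The sketch below illustrates the difference with Python's `re` module on made-up lines.

```python
import re

# Made-up lines: only the first one belongs to user 37.
lines = [
    "37,648061658,google.com,0",
    "137,642479972,google.com,2",
    "52,644493371,facebook.com,2",
    "370,654941318,yahoo.com,1",
]

# Anchored: the line must start with "37" followed by the field separator.
anchored = [line for line in lines if re.search(r"^37,", line)]
# Unanchored: "37" anywhere in the line counts as a match.
unanchored = [line for line in lines if re.search(r"37", line)]

print(len(anchored), len(unanchored))
```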


### Part 3: Dealing with data Pythonically (16 Points)

In [5]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np

1\. Load the data set `"ads_dataset.tsv"` into a Python Pandas data frame called `ads`.

In [6]:
# Place your code here
ads = pd.read_csv("ads_dataset.tsv",sep = '\t')

2\. Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` [(manual page)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method returns a useful series of values that can be used here.

In [7]:
def getDfSummary(input_data):
    # Place your code here
    # describe() rows are count, mean, std, min, 25%, 50%, 75%, max;
    # reuse them as summary columns, replacing "count" with "number_distinct".
    columns = list(input_data.describe().index)
    columns[0] = "number_distinct"
    columns.append("number_nan")
    # One row per input variable, one column per summary feature.
    output_data = pd.DataFrame(index=input_data.columns.values, columns=columns)
    output_data["number_distinct"] = [input_data[s].dropna().nunique() for s in output_data.index.values]
    output_data["mean"] = [input_data[s].dropna().mean() for s in output_data.index.values]
    output_data["std"] = [input_data[s].dropna().std() for s in output_data.index.values]
    output_data["min"] = [input_data[s].dropna().min() for s in output_data.index.values]
    output_data["max"] = [input_data[s].dropna().max() for s in output_data.index.values]
    output_data["number_nan"] = [input_data[s].isna().sum() for s in output_data.index.values]
    # Pass a scalar to quantile() so each cell holds a number rather than a Series.
    output_data["25%"] = [input_data[s].dropna().quantile(.25) for s in output_data.index.values]
    output_data["50%"] = [input_data[s].dropna().quantile(.5) for s in output_data.index.values]
    output_data["75%"] = [input_data[s].dropna().quantile(.75) for s in output_data.index.values]
    return output_data
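
The per-column list comprehensions above can also be expressed with whole-frame pandas calls. The following is a minimal sketch of that alternative, assuming every column is numeric; `get_df_summary_fast` and the `demo` frame are hypothetical names, and the function has not been timed against `ads`.

```python
import numpy as np
import pandas as pd


def get_df_summary_fast(df):
    """Summarize numeric columns, one row per variable (hypothetical helper)."""
    # describe() ignores NaN; transpose so that variables become rows.
    summary = df.describe().T.drop(columns="count")
    # nunique() skips NaN by default, matching dropna().nunique().
    summary.insert(0, "number_distinct", df.nunique())
    summary["number_nan"] = df.isna().sum()
    return summary


# Made-up data: column "a" has one missing value.
demo = pd.DataFrame({"a": [1.0, 2.0, np.nan, 2.0], "b": [0, 1, 0, 1]})
print(get_df_summary_fast(demo))
```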

3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`

In [8]:
# Place your code here
%timeit getDfSummary(ads)

165 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


4\. Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [9]:
# Place your code here
output = getDfSummary(ads)
output["number_nan"]
# As we can see, buy_freq is the only field that has missing values

isbuyer                    0
buy_freq               52257
visit_freq                 0
buy_interval               0
sv_interval                0
expected_time_buy          0
expected_time_visit        0
last_buy                   0
last_visit                 0
multiple_buy               0
multiple_visit             0
uniq_urls                  0
num_checkins               0
y_buy                      0
Name: number_nan, dtype: int64

5\. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [10]:
# Place your code here
null_data = ads[ads.isnull().any(axis=1)] 
null_data_summary = getDfSummary(null_data)
null_data["isbuyer"]

# As we can see, every record with a missing buy_freq has isbuyer = 0,
# so the data is not missing at random: these customers never made a purchase.
# The missing buy_freq values should therefore be set to 0.

NaN    0
NaN    0
NaN    0
NaN    0
NaN    0
      ..
NaN    0
NaN    0
NaN    0
NaN    0
NaN    0
Name: isbuyer, Length: 52257, dtype: int64
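
One way to check this claim directly, rather than by reading a truncated printout, is to compare the missingness mask against `isbuyer`. The sketch below does that on a small made-up frame that borrows the column names `isbuyer` and `buy_freq`; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame using two column names from `ads`.
demo = pd.DataFrame({
    "isbuyer":  [0, 1, 0, 1, 0],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0, np.nan],
})

# Cross-tabulate missingness against isbuyer. Zeros in the
# (present, non-buyer) and (missing, buyer) cells mean the two coincide.
missing = demo["buy_freq"].isna()
print(pd.crosstab(missing, demo["isbuyer"]))

# True only if buy_freq is missing exactly when isbuyer == 0.
perfectly_predicted = missing.eq(demo["isbuyer"] == 0).all()

# Impute the missing purchase frequencies with 0.
filled = demo["buy_freq"].fillna(0)
print(perfectly_predicted, filled.tolist())
```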

6\. Which variables are binary?

In [11]:
# Place your code here
output[output["number_distinct"] == 2]
# As we can see, these 4 fields are binary

| | number_distinct | mean | std | min | 25% | 50% | 75% | max | number_nan |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 2 | 0.042632 | 0.202027 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
| multiple_buy | 2 | 0.006357 | 0.079479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
| multiple_visit | 2 | 0.277444 | 0.447742 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0 |
| y_buy | 2 | 0.004635 | 0.067924 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
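
Having exactly two distinct values does not by itself mean a field is coded as 0/1. A stricter check could also confirm that the two values are 0 and 1, as in this sketch on a made-up frame (the `rating` and `visits` columns are hypothetical).

```python
import pandas as pd

demo = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],  # two values, coded 0/1
    "rating":  [1, 5, 5, 1],  # two values, but not 0/1
    "visits":  [3, 7, 2, 9],  # more than two values
})

# Columns with exactly two distinct non-missing values.
two_valued = [c for c in demo.columns if demo[c].nunique() == 2]
# Of those, columns whose values are a subset of {0, 1}.
zero_one = [c for c in two_valued if set(demo[c].dropna().unique()) <= {0, 1}]

print(two_valued, zero_one)
```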
