# Foundations of Data Science
## Homework 2

Student Name:
Student Netid:***

### Part 1: Case study
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include aspects of the Data Science Workflow that you see as relevant to solving the problem.  Be precise but concise.

Problem Statement: Use predictive modeling to identify whether a customer is pregnant, so that Target can draw her to its stores with targeted offers.

Data Collected: Target collects data from its customers through various channels, such as surveys and advertisements. Each customer is assigned a unique ID, which lets Target trace that customer's purchase history. Because the data comes from multiple sources, it should be cleaned so that the dataset contains no duplicate records.

Machine Learning Model: This is a supervised learning problem. The target variable is binary: a customer is either pregnant or not. The dataset should be divided into training, validation, and test sets. A decision tree or a nearest-neighbor algorithm can be trained on the training set to make predictions. Features derived from the collected data can be used to improve the model's accuracy.
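
The training/validation/test division mentioned in the plan can be sketched as follows. This is a minimal illustration using only the standard library; the function name `three_way_split`, the 60/20/20 proportions, and the toy `customers` list are assumptions made for the example, not anything taken from the article or from Target's actual process.

```python
import random


def three_way_split(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle records and split them into train/validation/test lists."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


# Hypothetical labeled customers: (customer_id, is_pregnant)
customers = [(i, i % 5 == 0) for i in range(100)]
train, valid, test = three_way_split(customers)
print(len(train), len(valid), len(test))
```

The model would be fit on `train`, tuned on `valid`, and evaluated once on `test`, so that the reported accuracy is not inflated by decisions made during tuning.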

### Part 2: Exploring data in the command line
For this part we will be using the data file located in `"data/advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int` (continuous), `int` (continuous), `string`, and `int` (category) respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  Recall that once you enter the "!" then filename completion should work.]

1\. How many records (lines) are in this file?

In [83]:
!wc -l advertising_events.csv

   10341 advertising_events.csv


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [69]:
!cut -d ',' -f 1 advertising_events.csv | sort  | uniq | wc -l

     732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [70]:
!cut -d ',' -f 3 advertising_events.csv | sort | uniq -c | sort -rn

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
 513 wikipedia.org
 511 amazon.com
 382 qq.com
 321 twitter.com
 316 taobao.com
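
For comparison, the same counting can be sketched in Python with the standard library. The rows below are hypothetical sample data in the file's column order (`userid`, `timestamp`, `domain`, `action`), not records from `advertising_events.csv`.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows: userid, timestamp, domain, action
sample = io.StringIO(
    "1,100,google.com,0\n"
    "2,101,facebook.com,1\n"
    "1,102,google.com,2\n"
    "3,103,yahoo.com,0\n"
    "2,104,google.com,1\n"
    "3,105,facebook.com,0\n"
)

rows = list(csv.reader(sample))

# Number of records (like `wc -l`).
record_count = len(rows)

# Number of unique users (like `cut -f 1 | sort | uniq | wc -l`).
unique_users = {row[0] for row in rows}

# Domains ranked by visit count, highest first (like `uniq -c | sort -rn`).
ranked = Counter(row[2] for row in rows).most_common()

print(record_count, len(unique_users))
for domain, count in ranked:
    print(count, domain)
```

To run this against the real file, the `io.StringIO` object could be replaced with an opened file handle.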


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [71]:
!grep "^37," advertising_events.csv

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0
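
A whole-word match on `37` would also select any row where `37` happens to appear as a complete value in another column. Comparing only the first field avoids that. The sketch below illustrates the idea in Python; the helper name `records_for_user` and the sample rows (including the artificial timestamp of `37`) are made up for the example.

```python
import csv
import io

# Hypothetical rows; the third one has 37 in the timestamp column, not the userid column.
sample = io.StringIO(
    "37,648061658,google.com,0\n"
    "137,642479972,google.com,2\n"
    "5,37,facebook.com,1\n"
    "37,644493341,facebook.com,2\n"
)


def records_for_user(lines, user_id):
    """Return the rows whose first field equals user_id exactly."""
    return [row for row in csv.reader(lines) if row and row[0] == str(user_id)]


matches = records_for_user(sample, 37)
for row in matches:
    print(",".join(row))
```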


### Part 3: Dealing with data Pythonically

In [48]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np

1\. Load the data set `"datasets/ads_dataset.tsv"` into a Python Pandas data frame called `ads`.

In [72]:
ads = pd.read_csv("ads_dataset.tsv", sep=r"\s+")

2\. Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` [(manual page)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method returns a useful series of values that can be used here.

In [73]:
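def getDfSummary(input_data):
    # Count missing (NaN) values per column.
    nan_num = input_data.isnull().sum()
    # Count distinct non-missing values per column (nunique() skips NaN by default).
    distinct_num = input_data.nunique()
    # describe() ignores NaN; drop 'count' and transpose so each row is a variable.
    data = input_data.describe().drop('count').transpose()
    # Attach the two extra summary columns, aligned on variable name.
    output_data = data.assign(number_distinct=distinct_num, number_nan=nan_num)
    return output_data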
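
To see the shape of the result, the same approach can be tried on a tiny frame. This is a sketch only: `get_df_summary` is a stand-in with the same logic so the example runs on its own, and the `toy` data is invented.

```python
import numpy as np
import pandas as pd


def get_df_summary(df):
    """Per-variable summary: describe() statistics plus distinct and NaN counts."""
    summary = df.describe().drop("count").transpose()
    summary["number_distinct"] = df.nunique()      # NaN is not counted as a value
    summary["number_nan"] = df.isnull().sum()
    return summary


# Hypothetical toy data with two missing values in buy_freq.
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0],
})

toy_summary = get_df_summary(toy)
print(toy_summary)
```

Each row of `toy_summary` is a variable from `toy`, and the statistics for `buy_freq` are computed over its two non-missing values only.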

3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`

In [77]:
%timeit getDfSummary(ads)

102 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


4\. Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [78]:
DataFrame = getDfSummary(ads)
NaN_Fields = []
for key, value in DataFrame["number_nan"].items():
    if value > 0:
        NaN_Fields.append(key)
NaN_Fields

['buy_freq']

5\. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? What would be an appropriate method for filling in missing values?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [84]:
# It looks like the data is missing at random.
# There is a correlation between 'buy_freq' and 'isbuyer'.
# An appropriate method for filling in the missing values is to insert 0.
summary_new = ads.fillna(0)
new_Df = getDfSummary(summary_new)
new_Df.loc["buy_freq"]

mean                0.052891
std                 0.298157
min                 0.000000
25%                 0.000000
50%                 0.000000
75%                 0.000000
max                15.000000
number_distinct    11.000000
number_nan          0.000000
Name: buy_freq, dtype: float64
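
The comparison suggested in the hint (summarizing only the records that have a missing value) can be sketched as below. The `toy` frame is invented, and it is deliberately constructed so that `buy_freq` is missing exactly when `isbuyer` is 0; it says nothing about what the real `ads` data contains.

```python
import numpy as np
import pandas as pd

# Hypothetical data, built so that buy_freq is NaN exactly when isbuyer == 0.
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1, 0, 1],
    "visit_freq": [1, 3, 2, 5, 1, 4],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0, np.nan, 3.0],
})

# Split the records by whether buy_freq is missing.
missing = toy[toy["buy_freq"].isnull()]
present = toy[toy["buy_freq"].notnull()]

# Put the per-column means side by side to spot distributions that shift.
comparison = pd.DataFrame({
    "mean_when_missing": missing.mean(),
    "mean_when_present": present.mean(),
})
print(comparison)
```

If a column such as `isbuyer` takes a single value in the missing subset and a different one elsewhere, the data is probably not missing at random, and that column indicates which fill value makes sense.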

6\. Which variables are binary?

In [81]:
Bin_variables = []
for key, value in DataFrame["number_distinct"].items():
    if value == 2:
        Bin_variables.append(key)
Bin_variables

['isbuyer', 'multiple_buy', 'multiple_visit', 'y_buy']
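
As an alternative to looping, the same selections can be written with boolean indexing on the summary frame. The small `summary` frame below is a hypothetical stand-in for the output of `getDfSummary()`, with made-up counts.

```python
import pandas as pd

# Hypothetical summary frame: one row per variable.
summary = pd.DataFrame(
    {"number_nan": [0, 3, 0], "number_distinct": [2, 11, 2]},
    index=["isbuyer", "buy_freq", "y_buy"],
)

# Variables with at least one missing value.
nan_fields = summary.index[summary["number_nan"] > 0].tolist()

# Variables that take exactly two distinct values.
binary_fields = summary.index[summary["number_distinct"] == 2].tolist()

print(nan_fields)
print(binary_fields)
```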