# Introduction to Data Science
## Homework 2

Student Name: Pavel Gladkevich

Student Netid: N16902345
***

### Part 1: Case study (5 Points)
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the data mining process, and be sure to include the motivation for predictive modeling and give a sketch of a solution.  Be precise but concise.

Target wanted to predict which female customers were pregnant before the baby was born, ideally by the second trimester. Doing so would give the company an edge over its competitors, because it could send enticing coupons to expectant mothers at the precise moment when buying habits become more flexible due to a new parent's lack of time, and it could potentially win a high-value, long-term customer through habit creation. This is a binary classification problem: the target variable is whether a customer is pregnant or not, and the explanatory variables are features of her purchase history that correlate with that outcome. We can break the problem down into several steps.
* First, we must understand and explore the data. According to the article, the data was collected as follows: "If you use a credit card or a coupon, or fill out a survey, or mail in a refund, or call the customer help line, or open an e-mail we’ve sent you or visit our Web site, we’ll record it and link it to your Guest ID." The data described in the article is extensive. We would link all of the explanatory variables to the Guest ID, restrict the data to female guests, and then begin the analysis. At this stage we could look at standard descriptive statistics for the variables and at their covariance matrix. We would then compare the products bought by women we know to be pregnant (for example, through the baby registry mentioned in the article) with the products bought by women in general.
* Next, we would select a supervised learning algorithm (such as gradient boosted trees, a random forest, a decision tree, or logistic regression) and build the model. We would first split the dataset into training, validation, and test sets. Then we would determine which items are characteristic of pregnant women and perform feature selection to shorten the list to the most indicative variables. The validation set would guide these choices, and the test set would give a final estimate of how well the model performs.
* Lastly, once the model has been evaluated, we would choose how to deploy and use it. This could mean sending mixed coupons or ads to the identified expectant mothers in the weeks before they give birth, or offering discounts in some other fashion to entice them into the store. After the birth, some of the discounts could continue in order to keep the new mother as a loyal customer.
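
As a rough illustration of the exploration and feature selection steps above, the sketch below computes a simple "lift" for each product: how much more often it is bought by guests labelled as pregnant than by guests overall. The guest data, the product names, and the helper functions `purchase_rates` and `product_lift` are all made up for illustration; this is not Target's actual method.

```python
from collections import Counter

# Hypothetical toy data: guest_id -> (set of purchased products, label).
# label = 1 means the guest is known to be pregnant (assumed to come from a registry).
guests = {
    1: ({"unscented lotion", "zinc supplement", "cotton balls"}, 1),
    2: ({"unscented lotion", "bread"}, 1),
    3: ({"bread", "soda"}, 0),
    4: ({"soda", "cotton balls"}, 0),
    5: ({"bread", "zinc supplement"}, 0),
    6: ({"soda"}, 0),
}

def purchase_rates(guests, label):
    """Share of guests with the given label who bought each product."""
    baskets = [items for items, y in guests.values() if y == label]
    counts = Counter(p for items in baskets for p in items)
    return {p: counts[p] / len(baskets) for p in counts}

def product_lift(guests):
    """Purchase rate among labelled-pregnant guests divided by the overall rate."""
    all_baskets = [items for items, _ in guests.values()]
    overall = Counter(p for items in all_baskets for p in items)
    positive = purchase_rates(guests, 1)
    n = len(all_baskets)
    return {p: positive.get(p, 0.0) / (overall[p] / n) for p in overall}

lift = product_lift(guests)
for product, value in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(f"{product:18s} lift={value:.2f}")
```

Products with a lift well above 1 would be candidate features for the classifier; products with a lift near 1 carry little information about the target.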

### Part 2: Exploring data in the command line (4 Points)
For this part we will be using the data file located in `"data/advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use a bash shell (i.e., EC2 or a Mac terminal) and then just paste your answers here. Recall that once you enter the "!" then filename completion should work.]

[Here](https://opensource.com/article/17/2/command-line-tools-data-analysis-linux) is a good linux command line reference.

1\. How many records (lines) are in this file? (look up 'wc' command) <br/>
Answer: 10341 lines

In [23]:
import pandas as pd
import os

!wc -l advertising_events.csv



   10341 advertising_events.csv


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|') <br/>
Answer: 732

In [30]:
!cut -d',' -f1 advertising_events.csv | sort -u | wc -l

     732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [40]:
!cut -d',' -f3 advertising_events.csv | sort | uniq -c | sort -rn

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
 513 wikipedia.org
 511 amazon.com
 382 qq.com
 321 twitter.com
 316 taobao.com
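
One caveat when ranking counts: `uniq -c` right-aligns its counts, so sorting them as text often happens to give the right order, but text comparison and numeric comparison can disagree once the counts are not padded to the same width. The sketch below shows the difference in Python with made-up count strings.

```python
# Made-up counts, written as unpadded strings.
counts = ["513", "3114", "1036", "92"]

# Text comparison works character by character, so "92" ranks above "513".
as_text = sorted(counts, reverse=True)

# Numeric comparison converts each string to an integer first.
as_numbers = sorted(counts, key=int, reverse=True)

print(as_text)     # ['92', '513', '3114', '1036']
print(as_numbers)  # ['3114', '1036', '513', '92']
```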


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [44]:
!awk -F ',' '$1 == 37' advertising_events.csv

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0
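
The four shell answers above can be cross-checked in Python. The sketch below uses a few made-up rows in the documented column order (`userid`, `timestamp`, `domain`, `action`) rather than the real file, so the numbers it prints are not the homework answers.

```python
import csv
import io
from collections import Counter

# Made-up sample rows: userid, timestamp, domain, action.
sample = io.StringIO(
    "37,648061658,google.com,0\n"
    "37,642479972,google.com,2\n"
    "12,644493341,facebook.com,2\n"
    "370,654941318,google.com,1\n"
)
rows = list(csv.reader(sample))

n_records = len(rows)                                     # number of lines, like `wc -l`
n_users = len({r[0] for r in rows})                       # distinct values in column 1
domain_rank = Counter(r[2] for r in rows).most_common()   # domains by descending count
user_37 = [r for r in rows if r[0] == "37"]               # exact match on the first field

print(n_records, n_users, domain_rank, len(user_37))
```

Matching on the first field exactly matters: a plain substring search for `37` would also pick up user `370` and any timestamp that happens to contain those digits.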


### Part 3: Dealing with data Pythonically (16 Points)

1\. (1 Point) Download the data set `"data/ads_dataset.tsv"` and load it into a Python Pandas data frame called `ads`.

In [118]:
cwd = os.getcwd()
ads = pd.read_csv(os.path.join(cwd, "ads_dataset.tsv"), sep='\t')
ads.head()

Unnamed: 0,isbuyer,buy_freq,visit_freq,buy_interval,sv_interval,expected_time_buy,expected_time_visit,last_buy,last_visit,multiple_buy,multiple_visit,uniq_urls,num_checkins,y_buy
,0,,1,0.0,0.0,0.0,0.0,106,106,0,0,169,2130,0
,0,,1,0.0,0.0,0.0,0.0,72,72,0,0,154,1100,0
,0,,1,0.0,0.0,0.0,0.0,5,5,0,0,4,12,0
,0,,1,0.0,0.0,0.0,0.0,6,6,0,0,150,539,0
,0,,2,0.0,0.5,0.0,-101.1493,101,101,0,1,103,362,0


2\. (4 Points) Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` method returns a useful series of values that can be used here.

In [111]:
def getDfSummary(input_data):
    """Return one row of summary statistics per column of input_data."""
    # Map each column name to its list of summary values
    rows = {}

    # Iterate through the df columns (the variables)
    for name in input_data.columns:
        col = input_data[name]
        rows[name] = [
            col.isnull().sum(),   # number_nan
            col.nunique(),        # number_distinct (NaN is excluded by default)
            col.mean(), col.max(), col.min(), col.std(),
            col.quantile(.25), col.quantile(.5), col.quantile(.75),
        ]

    # Create the new DataFrame with the former columns as rows
    output_data = pd.DataFrame.from_dict(
        rows, orient='index',
        columns=["number_nan", "number_distinct", "mean", "max", "min", "std",
                 "25%", "50%", "75%"])
    return output_data

getDfSummary(ads)

Unnamed: 0,number_nan,number_distinct,mean,max,min,std,25%,50%,75%
isbuyer,0,2,0.042632,1.0,0.0,0.202027,0.0,0.0,0.0
buy_freq,52257,10,1.240653,15.0,1.0,0.782228,1.0,1.0,1.0
visit_freq,0,64,1.852777,84.0,0.0,2.92182,1.0,1.0,2.0
buy_interval,0,295,0.210008,174.625,0.0,3.922016,0.0,0.0,0.0
sv_interval,0,5886,5.82561,184.9167,0.0,17.595442,0.0,0.0,0.104167
expected_time_buy,0,348,-0.19804,84.28571,-181.9238,4.997792,0.0,0.0,0.0
expected_time_visit,0,15135,-10.210786,91.40192,-187.6156,31.879722,0.0,0.0,0.0
last_buy,0,189,64.729335,188.0,0.0,53.476658,18.0,51.0,105.0
last_visit,0,189,64.729335,188.0,0.0,53.476658,18.0,51.0,105.0
multiple_buy,0,2,0.006357,1.0,0.0,0.079479,0.0,0.0,0.0
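
The hint mentions `describe()`. As a sketch, and assuming every column of interest is numeric, a similar table could be assembled from `describe()` directly instead of looping over columns. The function name `get_df_summary_describe` and the `toy` data frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def get_df_summary_describe(df):
    """Summarise numeric columns via describe(), one row per input column."""
    # describe() yields count, mean, std, min, 25%, 50%, 75%, max per numeric column.
    stats = df.describe().transpose()
    # nunique() and isna().sum() return Series indexed by column name, so they align.
    stats["number_distinct"] = df.nunique()   # NaN is excluded by default
    stats["number_nan"] = df.isna().sum()
    columns = ["number_nan", "number_distinct", "mean", "max", "min", "std",
               "25%", "50%", "75%"]
    return stats[columns]

toy = pd.DataFrame({"a": [1.0, 2.0, np.nan, 2.0], "b": [0, 1, 0, 1]})
summary = get_df_summary_describe(toy)
print(summary)
```

Note that `describe()` skips non-numeric columns by default, so this variant would silently drop them, whereas a per-column loop would attempt to summarise every column.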


3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: use `%timeit`

In [119]:
%timeit getDfSummary(ads)

131 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


4\. (2 Points) Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [115]:
ads_sum = getDfSummary(ads)
ads_sum[ads_sum.number_nan > 0]

Unnamed: 0,number_nan,number_distinct,mean,max,min,std,25%,50%,75%
buy_freq,52257,10,1.240653,15.0,1.0,0.782228,1.0,1.0,1.0


5\. (4 Points) For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or make it more likely that the data is missing? If missing, what should the data value be? Don't just show code here. Please explain your answer.

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [109]:
miss_ads = ads[ads.buy_freq.isna()]
getDfSummary(miss_ads)

Unnamed: 0,number_nan,number_distinct,mean,max,min,std,25%,50%,75%
isbuyer,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
buy_freq,52257,0,,,,,,,
visit_freq,0,48,1.651549,84.0,1.0,2.147955,1.0,1.0,2.0
buy_interval,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sv_interval,0,5112,5.686388,184.9167,0.0,17.623555,0.0,0.0,0.041667
expected_time_buy,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
expected_time_visit,0,13351,-9.669298,91.40192,-187.6156,31.23903,0.0,0.0,0.0
last_buy,0,189,65.741317,188.0,0.0,53.484622,19.0,52.0,106.0
last_visit,0,189,65.741317,188.0,0.0,53.484622,19.0,52.0,106.0
multiple_buy,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0


It looks like the `buy_freq` column is missing whenever a visitor has not purchased anything, so the values are not missing at random. Missingness is tied directly to `isbuyer`: in the subset above, every record with a missing `buy_freq` has `isbuyer` = 0. The fields `buy_interval`, `expected_time_buy`, and `multiple_buy` are also all 0 whenever `buy_freq` is missing, which aligns with our expectation that only buyers would have nonzero values in those columns. Since a non-buyer has made no purchases, the natural replacement for a missing `buy_freq` is 0.
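
The relationship between missing `buy_freq` values and `isbuyer` can also be checked directly rather than read off the summary. The sketch below does this on a few made-up rows that mimic the pattern, not on the real `ads` data; filling with 0 is an assumption about what a missing purchase frequency means.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the pattern described above; not the real ads data.
toy = pd.DataFrame({
    "isbuyer":  [0, 0, 1, 0, 1],
    "buy_freq": [np.nan, np.nan, 2.0, np.nan, 1.0],
})

# True only if buy_freq is missing exactly on the rows where isbuyer == 0.
missing_iff_nonbuyer = bool((toy["buy_freq"].isna() == (toy["isbuyer"] == 0)).all())

# Assumed fill value: a non-buyer has made 0 purchases.
filled = toy.assign(buy_freq=toy["buy_freq"].fillna(0))

print(missing_iff_nonbuyer)
print(filled)
```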

6\. (4 Points) Which variables are binary? <br/>
Answer: The variables `isbuyer`, `multiple_buy`, `multiple_visit`, and `y_buy` are binary. Each has exactly two distinct values, with a minimum of 0 and a maximum of 1.

In [116]:
# Binary variables only have two values
ads_sum[ads_sum.number_distinct == 2]

Unnamed: 0,number_nan,number_distinct,mean,max,min,std,25%,50%,75%
isbuyer,0,2,0.042632,1.0,0.0,0.202027,0.0,0.0,0.0
multiple_buy,0,2,0.006357,1.0,0.0,0.079479,0.0,0.0,0.0
multiple_visit,0,2,0.277444,1.0,0.0,0.447742,0.0,0.0,1.0
y_buy,0,2,0.004635,1.0,0.0,0.067924,0.0,0.0,0.0
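
Filtering on `number_distinct == 2` works here because every two-valued column happens to be a 0/1 flag. A slightly stricter check, sketched below with a hypothetical helper `binary_columns` and made-up data, also confirms that the two values are 0 and 1.

```python
import pandas as pd

def binary_columns(df):
    """Return names of columns whose non-missing values are exactly {0, 1}."""
    names = []
    for name in df.columns:
        values = set(df[name].dropna().unique())
        if values == {0, 1}:
            names.append(name)
    return names

# Made-up data: visit_freq has two distinct values but is not a 0/1 flag.
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "visit_freq": [1, 2, 1, 2],
    "y_buy": [0, 0, 0, 1],
})
print(binary_columns(toy))  # ['isbuyer', 'y_buy']
```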
