# Foundations of Data Science
## Homework 1 Solutions

### Part 1: Pre-class survey (5 Points)
- Fill in [this survey](https://forms.gle/RTARKZaQmP1WDzu18) which will help our course team understand student backgrounds and interests.

### Part 2: Case study (5 Points)
Problem statement: walk us through the "Target Pregnancy Prediction" case using the framework outlined in the first class.

### Part 3: Exploring data in the command line (4 Points - 1 Point Each)
For this part we will use the data file located at `data/advertising_events.csv`. The file contains records of advertising events on a given day. It has 4 comma-separated columns in this order: `userid`, `timestamp`, `domain`, and `action`, of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. Every question can be answered in one line (sometimes with pipes), and some questions have many possible solutions. Remember that in IPython notebooks you must prefix every bash command with an exclamation point, e.g. `!command arguments`.

1\. How many records are in this file?

In [None]:
!wc -l data/advertising_events.csv  # wc -l counts lines; each line is one record

wc: data/advertising_events.csv: No such file or directory


2\. How many unique users are in this file?

In [None]:
# Extract field 1 (userid) using ',' as the delimiter, sort, drop duplicates, and count the remaining lines
!cut -f1 -d',' data/advertising_events.csv | sort | uniq | wc -l 

cut: data/advertising_events.csv: No such file or directory
0


3\. Rank all domains by the number of visits they received in descending order.

In [None]:
!cut -f3 -d',' data/advertising_events.csv | sort | uniq -c | sort -nr
#uniq -c gives counts per item

cut: data/advertising_events.csv: No such file or directory


4\. List all records for the user with user id 37.

In [None]:
!grep '^37,' data/advertising_events.csv

grep: data/advertising_events.csv: No such file or directory
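
As a cross-check on the shell answers above, the same four questions can be sketched in Python using only the standard library. The sample rows below are invented for illustration and are not taken from `data/advertising_events.csv`; the sketch also assumes the file has no header row.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the documented column order:
# userid, timestamp, domain, action (no header row assumed).
sample = io.StringIO(
    "37,1000,example.com,0\n"
    "37,1005,news.example,1\n"
    "42,1010,example.com,0\n"
    "137,1020,example.com,1\n"
)
rows = list(csv.reader(sample))

# 1. Number of records (like `wc -l`)
num_records = len(rows)

# 2. Number of unique users (like `cut -f1 -d',' | sort | uniq | wc -l`)
num_users = len({row[0] for row in rows})

# 3. Domains ranked by visits, descending (like `uniq -c | sort -nr`)
domain_counts = Counter(row[2] for row in rows).most_common()

# 4. All records for user id 37 (like `grep '^37,'`)
user_37 = [row for row in rows if row[0] == "37"]

print(num_records, num_users)
print(domain_counts)
print(user_37)
```

The invented user `137` is deliberately not matched in step 4. This is why the `grep` pattern anchors on `^37,` instead of searching for `37` anywhere in the line.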


### Part 4: Dealing with data Pythonically (16 Points)

1\. (1 Point) Download the data set `"data/ads_dataset.tsv"` and load it into a Python Pandas data frame called `ads`.

In [None]:
import pandas as pd
# The file is tab-separated, so pass sep='\t'
ads = pd.read_csv("data/ads_dataset.tsv", sep='\t')

2\. (4 Points) Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` method returns a useful series of values that can be used here.

- If the code returns the correct format:
  - if <= 30 lines -> 4 points
  - if > 30 lines -> 3 points
- If it doesn't return correct format -> 0 points

In [None]:
import numpy as np

def getDfSummary(input_data):
    # Get a whole bunch of stats
    output_data = input_data.describe().transpose()
    
    # Count NANs
    output_data['number_nan'] = input_data.shape[0] - output_data['count']
    
    # Count distinct values per column of the input frame; nunique() ignores
    # NaN by default, matching the requirement to skip missing values
    output_data['number_distinct'] = input_data.nunique()

    # Remove 'count' column since it wasn't asked for
    output_data = output_data.drop(columns='count')
    
    return output_data
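
A quick sanity check on a small, made-up data frame can confirm the output format before running on the real data. The snippet below is a sketch: it carries its own compact copy of the summary logic under the hypothetical name `summarize` so that it runs standalone, and it counts distinct values with `nunique()`, which skips `NaN` by default.

```python
import numpy as np
import pandas as pd

def summarize(df):
    """Compact stand-in for getDfSummary(), so this snippet runs on its own."""
    out = df.describe().transpose()
    out["number_nan"] = df.shape[0] - out["count"]
    out["number_distinct"] = df.nunique()  # nunique() skips NaN by default
    return out.drop(columns="count")

# Made-up data: column 'b' has one missing value.
toy = pd.DataFrame({"a": [1, 2, 2, 4], "b": [0.0, 1.0, np.nan, 1.0]})
toy_summary = summarize(toy)
print(toy_summary)
```

Each row of `toy_summary` is a variable of the input frame, and the columns are the summary features, which is the layout the problem asks for.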

3\. (1 Point) How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`


In [None]:
%timeit getDfSummary(ads)
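
`%timeit` is an IPython magic and is not available in a plain Python script. A rough equivalent with the standard `timeit` module might look like the sketch below. The data frame is random stand-in data, not the real `ads` set, and `describe().transpose()` stands in for the call to `getDfSummary()`.

```python
import timeit

import numpy as np
import pandas as pd

# Random stand-in data; the real `ads` frame is not needed to show the timing pattern.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random((1000, 5)), columns=list("abcde"))

# Run the call several times and report the average time per call.
runs = 10
total_seconds = timeit.timeit(lambda: frame.describe().transpose(), number=runs)
per_call_ms = 1000 * total_seconds / runs
print(f"{per_call_ms:.2f} ms per call (average over {runs} runs)")
```

Unlike `%timeit`, this version does not pick the number of runs automatically, so `runs` has to be chosen by hand.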

4\. (2 Points) Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [None]:
summary = getDfSummary(ads)
for column in summary.index[summary['number_nan'] > 0]:
    print(column)

5\. (4 Points) For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [None]:
ads_null = ads[ads.isnull().any(axis=1)]
print(getDfSummary(ads_null))
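
The comparison suggested in the hint can be made more explicit by placing statistics from the full frame next to those from the missing-only rows. The sketch below uses invented data and hypothetical column names (`flag`, `value`, `other`); it does not describe the real `ads` data set.

```python
import numpy as np
import pandas as pd

# Invented data: 'value' is missing exactly when 'flag' is 0.
toy = pd.DataFrame({
    "flag":  [0, 1, 0, 1, 0, 1],
    "value": [np.nan, 2.0, np.nan, 5.0, np.nan, 3.0],
    "other": [10, 12, 11, 13, 9, 14],
})

# Keep only the rows where 'value' is missing.
missing_rows = toy[toy["value"].isnull()]

# Put the column means of the full frame and of the missing-only rows side by side.
comparison = pd.DataFrame({
    "mean_all": toy.mean(),
    "mean_missing_rows": missing_rows.mean(),
})
print(comparison)

# A column that is constant within the missing-only rows is a candidate
# predictor of missingness.
constant_when_missing = [
    col for col in toy.columns.drop("value")
    if missing_rows[col].nunique() == 1
]
print(constant_when_missing)
```

If a column separates the missing rows this cleanly in the real data, the values are probably not missing at random, and the appropriate fill value depends on what the missing field measures.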


6\. (4 Points) Which variables are binary?


In [None]:
summary = getDfSummary(ads)

for column in summary.index[(summary['number_distinct'] == 2) & (summary['min'] == 0) & (summary['max'] == 1)]:
    print(column)
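
The check above relies on summary statistics. A more direct alternative, sketched here on made-up data with hypothetical column names, tests whether the set of non-missing values in each column is exactly {0, 1}.

```python
import numpy as np
import pandas as pd

# Made-up columns covering the interesting cases.
toy = pd.DataFrame({
    "is_a": [0, 1, 1, 0],               # binary
    "is_b": [1.0, np.nan, 0.0, 1.0],    # binary with a missing value
    "count": [0, 1, 2, 1],              # not binary: takes three values
    "always_one": [1, 1, 1, 1],         # not binary: only one distinct value
})

# A column is treated as binary when its non-missing values are exactly {0, 1}.
binary_columns = [
    col for col in toy.columns
    if set(toy[col].dropna().unique()) == {0, 1}
]
print(binary_columns)
```

Whether a column that only ever takes one of the two values should count as binary is a judgment call; this sketch excludes it.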