# DSC 80: Lab 05

### Due Date: Saturday November 7, Midnight (11:59 PM)

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab*.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks present the questions and give you a place to present your results for later review.
- Notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab**.py` (much like we do in the notebook).
- Always document your code!
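
As a rough illustration of how the docstring examples act as checks, the sketch below runs the doctests of a toy function programmatically. The function `double` is hypothetical and not part of the lab; from a terminal, `python -m doctest lab05.py` is the more common way to run every doctest in a file.

```python
import doctest


def double(x):
    """
    Toy function (not part of the lab) with two doctest examples.

    >>> double(2)
    4
    >>> double("ab")
    'abab'
    """
    return x * 2


# Parse the examples out of the docstring and run them.
parser = doctest.DocTestParser()
test = parser.get_doctest(double.__doc__, {"double": double}, "double", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)

# `failed` is 0 when every example's output matches the docstring.
print(results.failed, results.attempted)
```

If the body of `double` were changed so that an example no longer matched, `results.failed` would become nonzero, which is the signal the starter-code doctests are meant to give you.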

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab**.py` is compiled to bytecode (in the directory `__pycache__`) and the loaded module is cached by the running kernel. Subsequent imports of `lab**` merely return the already-loaded module.
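
A minimal sketch of the behaviour that `autoreload` works around, using a throwaway module written to a temporary directory. The module name `demo_lab_module` is hypothetical; `importlib.reload` is the manual alternative to the extension.

```python
import importlib
import os
import sys
import tempfile

# Write a tiny throwaway module (hypothetical name) into a temp directory.
tmp_dir = tempfile.mkdtemp()
module_path = os.path.join(tmp_dir, "demo_lab_module.py")
with open(module_path, "w") as f:
    f.write("VALUE = 1\n")

sys.path.insert(0, tmp_dir)
sys.dont_write_bytecode = True  # keep the demo independent of __pycache__
import demo_lab_module

# Edit the file on disk, as you would while developing.
with open(module_path, "w") as f:
    f.write("VALUE = 200  # edited\n")

# A second plain import returns the already-loaded module: no change.
import demo_lab_module
value_after_reimport = demo_lab_module.VALUE

# An explicit reload re-executes the source and picks up the edit.
importlib.reload(demo_lab_module)
value_after_reload = demo_lab_module.VALUE

print(value_after_reimport, value_after_reload)
```

With `%autoreload 2`, the notebook performs the equivalent of the reload step automatically before running each cell.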

In [197]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [198]:
import lab05 as lab

In [199]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


import requests
import bs4

## Payment data

**Question 1**

You are given a dataset that describes payment information for purchases made on 01-Jan-2019, containing the columns `id`, `credit_card_type`, `credit_card_number`, and the purchaser's `date_of_birth`.

You need to assess the missingness in the payments data. In particular, **is the missingness of the credit card number dependent on the age of the shopper?** Look at the distribution of ages by missingness of `credit_card_number` and determine whether the missingness is dependent on age.

`Hint`: use the following steps to approach this problem:

* Obtain the ages of the purchasers
* Plot the distribution of ages by missingness (density curves).
    
* Do you think the missingness of credit card number is dependent on age or not?
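
The first hint can be sketched as follows, on a few toy rows written in the same date format as the file. The reference date of 2019-01-01 (the purchase date) and the column names `age` and `cc_missing` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (invented for illustration) using the file's date format.
toy = pd.DataFrame({
    "credit_card_number": [2.018706e14, 3.737511e14, np.nan, 6.759827e17],
    "date_of_birth": ["25-Sep-1982", "08-Jan-1946", None, "20-Apr-1975"],
})

# Parse the day-month-year strings; missing birth dates become NaT.
dob = pd.to_datetime(toy["date_of_birth"], format="%d-%b-%Y")

# Age in whole years on the purchase date (assumed reference: 2019-01-01).
reference = pd.Timestamp("2019-01-01")
toy["age"] = np.floor((reference - dob).dt.days / 365.25)

# Flag used to split the age distribution by missingness.
toy["cc_missing"] = toy["credit_card_number"].isnull()
print(toy[["age", "cc_missing"]])
```

From here, one option for the density curves is `toy.groupby("cc_missing")["age"].plot(kind="kde", legend=True)` on the full dataset; a handful of rows is too few for a meaningful curve.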

Perform a permutation test for the empirical distribution of age conditional on `credit_card_number` with a 5% significance level. Use difference of means as your statistic.

Write a function `first_round` with no arguments that returns a __list__ with two values:
* the first value is the p-value from your permutation test and 
* the second value is either "R" if you reject the null hypothesis, or "NR" if you do not.

**Does the result match your guess? If no, what might be a problem?**

Perform another permutation test for the empirical distribution of age conditional on `credit_card_number` with a 5% significance level. Use KS-Statistic as your statistic.

Write a function `second_round` with no arguments that returns a __list__ with three values: 
* the first value is the p-value from your new permutation test 
* the second value is either "R" if you reject the null hypothesis or "NR" if you do not, and 
* the third value is your final conclusion: "D" (dependent on age) or "ND" (not dependent on age).
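
The two rounds can be compared with a small sketch on synthetic ages (not the lab data). The groups are constructed to share the same mean but have different shapes, which is one situation where the two statistics disagree. The helper names and the hand-rolled KS statistic are assumptions of this sketch; `scipy.stats.ks_2samp(a, b).statistic` computes the same quantity.

```python
import numpy as np


def abs_diff_of_means(a, b):
    """Absolute difference between the two group means."""
    return abs(a.mean() - b.mean())


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))


def permutation_pvalue(values, is_missing, statistic, n_repetitions=500, seed=0):
    """Shuffle the missingness labels and compare with the observed statistic."""
    rng = np.random.default_rng(seed)
    observed = statistic(values[is_missing], values[~is_missing])
    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        shuffled = rng.permutation(is_missing)
        simulated[i] = statistic(values[shuffled], values[~shuffled])
    # Proportion of shuffles at least as extreme as what was observed.
    return np.mean(simulated >= observed)


# Synthetic ages: both groups have mean 45, but very different spreads.
ages_present = np.linspace(40, 50, 200)
ages_missing = np.concatenate([np.linspace(20, 30, 100), np.linspace(60, 70, 100)])
ages = np.concatenate([ages_present, ages_missing])
is_missing = np.concatenate([np.zeros(200, dtype=bool), np.ones(200, dtype=bool)])

p_means = permutation_pvalue(ages, is_missing, abs_diff_of_means)
p_ks = permutation_pvalue(ages, is_missing, ks_statistic)
print(p_means, p_ks)
```

In this constructed example the difference of means cannot see the dependence, because the group means coincide, while the KS statistic compares the whole distributions and does.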



In [200]:
payment_fp = os.path.join('data', 'payment.csv')
payments = pd.read_csv(payment_fp)
payments.head()

Unnamed: 0,id,credit_card_type,credit_card_number,date_of_birth
0,1,diners-club-enroute,201870600000000.0,25-Sep-1982
1,2,americanexpress,373751100000000.0,08-Jan-1946
2,3,jcb,3570296000000000.0,
3,4,mastercard,5318327000000000.0,
4,5,maestro,6.759827e+17,20-Apr-1975


In [201]:
def perm4missing(payments, N):
    pay = payments.copy()
    # convert date_of_birth to an age in whole years
    pay['date_of_birth'] = pd.to_datetime(pay['date_of_birth'])
    pay['age'] = np.floor((pd.Timestamp.now() - pay['date_of_birth']).dt.days / 365.25)
    
    diff_of_means = []
    for _ in range(N):
        # shuffle the age column
        shuffled_col = pay['age'].sample(replace=False, frac=1).reset_index(drop=True)
        
        #put into table
        shuffled = pay.assign(**{'age':shuffled_col, 'is_null':pay['credit_card_number'].isnull()})
        
        #calculate difference in means
        mean = shuffled.groupby('is_null')['age'].mean().diff().abs().iloc[-1]
        diff_of_means.append(mean)
    obs = pay.assign(is_null=pay['credit_card_number'].isnull()).groupby('is_null')['age'].mean().diff().abs().iloc[-1]
    pval = np.mean(diff_of_means >= obs)
    pd.Series(diff_of_means).plot(kind='hist', density=True, alpha=.8, title='pval')
    plt.scatter(obs, 0, color='red', s=40)
    return pval

In [202]:
from scipy.stats import ks_2samp
df = payments.copy()
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])
df['age'] = np.floor((pd.Timestamp.now() - df['date_of_birth']).dt.days / 365.25)
ks_2samp(df['credit_card_number'], df['age']).pvalue


3.65394511e-315

In [203]:
def first_round():
    """
    :return: list with two values
    >>> out = first_round()
    >>> isinstance(out, list)
    True
    >>> out[0] < 1
    True
    >>> out[1] is "NR" or out[1] is "R"
    True
    """
    return [.155,'NR']


def second_round():
    """
    :return: list with three values
    >>> out = second_round()
    >>> isinstance(out, list)
    True
    >>> out[0] < 1
    True
    >>> out[1] is "NR" or out[1] is "R"
    True
    >>> out[2] is "ND" or out[2] is "D"
    True
    """
    return [3.65394511e-315, 'R','D']


### Missingness and the proportion of null values

**Question 2**

In the file `data/missing_heights.csv` are the heights of children and their fathers (`child` and `father`). The `child_X` columns have missing values in varying proportions. The missingness of these `child_X` columns was created as MAR dependent on father height. The missingness of each `child_X` column is equally dependent on father height, and each column `child_X` is `X%` non-null (verify this yourself!).
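
One way to check the "`X%` non-null" claim is sketched below on a toy frame whose missingness pattern is invented for illustration; on the real data the same expression would be applied to `heights`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
child = pd.Series(rng.normal(67, 3, size=200))
position = np.arange(200)

# Toy columns: every 4th value removed (75% kept), every 2nd removed (50% kept).
toy = pd.DataFrame({
    "child": child,
    "child_75": child.where(position % 4 != 0),
    "child_50": child.where(position % 2 == 0),
})

# Percentage of non-null entries in each `child_X` column.
pct_non_null = toy.filter(like="child_").notnull().mean() * 100
print(pct_non_null)
```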

* You will attempt to *verify* the dependence of the missingness of `child_X` on the `father` height column using a permutation test. Your permutation tests should use `N=100` simulations and use the `KS` test statistic. Write a function `verify_child` that takes in the `heights` data and returns a __series__ of p-values (from your permutation tests), indexed by the columns `child_X`. 

* Now interpret your results. In the function `missing_data_amounts`, return a __list__ of correct statements from the options below:
    1. The p-value for `child_50` is small because the *sampling distribution* of test-statistics has low variance.
    1. MAR is hardest to determine when there are very different proportions of null and non-null values.
    1. The difference between p-value for `child_5` and `child_95` is due to randomness.
    1. You would always expect the p-value of `child_X` and `child_(100-X)` to be similar.
    1. You would only expect the p-value of `child_X` and `child_(100-X)` to be similar if the columns are MCAR.


In [204]:
fp = os.path.join('data', 'missing_heights.csv')
heights = pd.read_csv(fp)
heights.head()

Unnamed: 0,child,father,child_95,child_90,child_75,child_50,child_25,child_10,child_5
0,73.2,78.5,73.2,73.2,73.2,,,,
1,69.2,78.5,69.2,69.2,69.2,,,,69.2
2,69.0,78.5,69.0,69.0,69.0,69.0,,,
3,69.0,78.5,69.0,69.0,,69.0,,,
4,73.5,75.5,73.5,73.5,,73.5,73.5,,


In [205]:
def verify_child(heights):
    """
    Returns a series of p-values assessing the missingness
    of child-height columns on father height.

    >>> fp = os.path.join('data', 'missing_heights.csv')
    >>> heights = pd.read_csv(fp)
    >>> out = verify_child(heights)
    >>> out['child_50'] < out['child_95']
    True
    >>> out['child_5'] > out['child_50']
    True
    """
    def permutation(df, col):
        ks_lst = []
        df = df.assign(null=col.isnull())
        for _ in range(100):
            sample_cols = df['father'].sample(replace=False, frac=1).reset_index(drop=True)
            sample = df.assign(**{'father':sample_cols, 'null':col.isnull()})
            null_fathers = sample.groupby('null')['father']
            ks = ks_2samp(null_fathers.get_group(True), null_fathers.get_group(False)).statistic
            ks_lst.append(ks)
        grouped = df.groupby('null')['father']
        obs_ks = ks_2samp(grouped.get_group(True), grouped.get_group(False)).statistic
        return np.count_nonzero(np.array(ks_lst) > obs_ks) / 100
    p_vals = []
    col_names = heights.columns.drop(['child','father'])
    for col in col_names:
        p_vals.append(permutation(heights,heights[col]))
    return pd.Series(data = p_vals, index = col_names)


def missing_data_amounts():
    """
    Returns a list of multiple choice answers.

    :Example:
    >>> set(missing_data_amounts()) <= set(range(1,6))
    True
    """

    return [1,2,5]

In [206]:
fp = os.path.join('data', 'missing_heights.csv')
heights = pd.read_csv(fp)
out = verify_child(heights)
out

child_95    0.80
child_90    0.74
child_75    0.38
child_50    0.00
child_25    0.10
child_10    0.18
child_5     0.14
dtype: float64

In [96]:
heights.columns.drop(['child','father'])

Index(['child_95', 'child_90', 'child_75', 'child_50', 'child_25', 'child_10',
       'child_5'],
      dtype='object')

### Imputation of Heights: quantitative columns

**Question 3**

In lecture, you learned how to do single-valued imputation conditionally on a *categorical* column: impute with the mean for each group. That is, for each distinct value of the *categorical* column, there is a single imputed value.

Here, you will do a single-valued imputation conditionally on a *quantitative* column. To do this, transform the `father` column into a categorical column by binning the values of `father` into [quartiles](https://en.wikipedia.org/wiki/Quartile). Once this is done, you can impute the column as in lecture (and described above).

* Write a function `cond_single_imputation` that takes in a dataframe with columns `father` and `child` (with missing values in `child`) and performs single-valued mean imputation of the `child` column, conditional on `father`. Your function should return a __Series__ (Hint: `pd.qcut` may be helpful!).

*Hint:* The groupby method `.transform` is useful for this question (see discussion 3), though it's also possible using `aggregate`. As a reminder, *loops are not allowed*, and functions mentioned in "Hints" are not required.
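
The bin-then-group-mean idea can be sketched on a toy table. The values are invented for illustration, and `labels=False` is used only to make the quartile labels easy to read.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "father": [60, 62, 64, 66, 68, 70, 72, 74],
    "child": [61, np.nan, 65, 66, np.nan, 70, 73, np.nan],
})

# Quartile label (0-3) for each father height; two rows fall in each bin here.
quartile = pd.qcut(toy["father"], 4, labels=False)

# Mean child height within each quartile, broadcast back to every row.
group_mean = toy.groupby(quartile)["child"].transform("mean")

# Only the missing child heights are replaced by their quartile's mean.
imputed = toy["child"].fillna(group_mean)
print(imputed.tolist())
```

Because `transform` returns a result aligned with the original rows, no explicit loop over rows or groups is needed.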



In [29]:
new_heights = heights[['father', 'child_50']].rename(columns={'child_50': 'child'}).copy()
new_heights.head()

Unnamed: 0,father,child
0,78.5,
1,78.5,
2,78.5,69.0
3,78.5,69.0
4,75.5,73.5


In [218]:
def cond_single_imputation(new_heights):
    """
    cond_single_imputation takes in a dataframe with columns 
    father and child (with missing values in child) and imputes 
    single-valued mean imputation of the child column, 
    conditional on father. Your function should return a Series.

    :Example:
    >>> fp = os.path.join('data', 'missing_heights.csv')
    >>> df = pd.read_csv(fp)
    >>> df['child'] = df['child_50']
    >>> out = cond_single_imputation(df)
    >>> out.isnull().sum() == 0
    True
    >>> (df.child.std() - out.std()) > 0.5
    True
    """
    df = new_heights[['father', 'child']].copy()
    # bin father heights into quartiles to get a categorical grouping column
    quartile = pd.qcut(df['father'], 4)
    # mean child height within each father-height quartile, aligned to every row
    group_means = df.groupby(quartile, observed=True)['child'].transform('mean')
    # fill only the missing child heights with their quartile's mean
    return df['child'].fillna(group_means)

In [219]:
fp = os.path.join('data', 'missing_heights.csv')
df = pd.read_csv(fp)
df['child'] = df['child_50']
out = cond_single_imputation(df)
out

0      68.083871
1      68.083871
2      69.000000
3      69.000000
4      73.500000
         ...    
929    64.000000
930    62.000000
931    65.481383
932    66.500000
933    65.481383
Name: child, Length: 934, dtype: float64

In [220]:
df

Unnamed: 0,child,father,child_95,child_90,child_75,child_50,child_25,child_10,child_5
0,,78.5,73.2,73.2,73.2,,,,
1,,78.5,69.2,69.2,69.2,,,,69.2
2,69.0,78.5,69.0,69.0,69.0,69.0,,,
3,69.0,78.5,69.0,69.0,,69.0,,,
4,73.5,75.5,73.5,73.5,,73.5,73.5,,
...,...,...,...,...,...,...,...,...,...
929,64.0,62.0,64.0,64.0,64.0,64.0,,,
930,62.0,62.0,62.0,62.0,62.0,62.0,62.0,,
931,,62.0,61.0,61.0,61.0,,,,
932,66.5,62.5,66.5,66.5,66.5,66.5,66.5,,


In [60]:
for i in out.index:
    temp =i
72 in temp

True

In [80]:
import math

### Probabilistic imputation of quantitative columns

**Question 4**

In lecture, you learned how to impute a categorical column by sampling from the dataframe column. One problem with this technique is that the imputation will never generate imputed values that weren't already in the dataset. When the column under consideration is quantitative, this may not be a reasonable assumption. For example, `56.0`, `57.0`, and `57.5` are in the heights dataset, yet `56.5` is not. Thus, any imputation done by sampling from the dataset will not be able to generate a height of `56.5`, even though it's clearly a reasonable value to occur in the dataset.

To keep things simple, you will impute the `child` column *unconditionally* from the distribution of `child` heights present in the dataset. This means that you will use the values present in `child` to impute missing values; that is, values that appear more often in `child` will tend to appear more often among the imputed values.

The approach to imputing from a quantitative distribution is as follows:
* Find the empirical distribution of `child` heights by creating a histogram (using 10 bins) of `child` heights.
* Use this histogram to generate a number within the observed range of `child` heights:
    - The likelihood a generated number belongs to a given bin is the proportion of the bin in the histogram. (Hint: `np.histogram` is useful for this part).
    - Any number within a fixed bin is equally likely to occur. (Hint: `np.random.choice` and `np.random.uniform` may be useful for this part).
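
The two steps can be sketched on a tiny array. The data are invented, and 3 bins are used instead of 10 so the counts are easy to check by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4], dtype=float)

# Three equal-width bins over [1, 4]: the edges are [1, 2, 3, 4].
counts, edges = np.histogram(observed, bins=3)
probs = counts / counts.sum()

# Step 1: pick a bin index with probability equal to the bin's proportion.
bin_idx = rng.choice(len(counts), size=5, p=probs)

# Step 2: draw uniformly between the chosen bin's left and right edges.
samples = rng.uniform(edges[bin_idx], edges[bin_idx + 1])
print(counts, probs, samples)
```

Here the last bin holds 7 of the 10 observations, so about 70% of generated values are expected to land between 3 and 4.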
    
Create a function `quantitative_distribution` that takes in a Series and an integer `N > 0`, and returns an array of `N` samples generated using the method described above. (For writing this function, and this function only, it is *ok* to use loops).

Create a function `impute_height_quant` that takes in a Series of `child` heights with missing values (aka `child_X`) and imputes them using the scheme above. **You should use `quantitative_distribution` to help you do this.**

In [189]:
def quantitative_distribution(child, N):
    """
    quantitative_distribution that takes in a Series and an integer 
    N > 0, and returns an array of N samples from the distribution of 
    values of the Series as described in the question.
    :Example:
    >>> fp = os.path.join('data', 'missing_heights.csv')
    >>> df = pd.read_csv(fp)
    >>> child = df['child_50']
    >>> out = quantitative_distribution(child, 100)
    >>> out.min() >= 56
    True
    >>> out.max() <= 79
    True
    >>> np.isclose(out.mean(), child.mean(), atol=1)
    True
    """
    # histogram of the observed heights: counts per bin and the 11 bin edges
    freq, bins = np.histogram(child.dropna(), bins=10)
    probs = freq / freq.sum()
    bin_width = np.diff(bins)[0]
    # choose a bin (by its left edge) with probability equal to its proportion
    left_edges = np.random.choice(bins[:-1], p=probs, size=N)
    # draw uniformly within each chosen bin
    return np.random.uniform(left_edges, left_edges + bin_width)


def impute_height_quant(child):
    """
    impute_height_quant takes in a Series of child heights 
    with missing values and imputes them using the scheme in
    the question.

    :Example:
    >>> fp = os.path.join('data', 'missing_heights.csv')
    >>> df = pd.read_csv(fp)
    >>> child = df['child_50']
    >>> out = impute_height_quant(child)
    >>> out.isnull().sum() == 0
    True
    >>> np.isclose(out.mean(), child.mean(), atol=0.5)
    True
    """
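    # align the generated values with the input's index so fillna matches rows correctly
    return child.fillna(pd.Series(quantitative_distribution(child, len(child)), index=child.index))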

In [191]:
fp = os.path.join('data', 'missing_heights.csv')
df = pd.read_csv(fp)
child = df['child_50']
out = impute_height_quant(child)
out

0      64.895981
1      70.098929
2      69.000000
3      69.000000
4      73.500000
         ...    
929    64.000000
930    62.000000
931    67.147306
932    66.500000
933    62.341136
Name: child_50, Length: 934, dtype: float64

In [169]:
fp = os.path.join('data', 'missing_heights.csv')
df = pd.read_csv(fp)
child = df['child_50']
out = quantitative_distribution(child, 100)
out

array([66.71810937, 71.76473108, 62.90776614, 63.30232443, 71.77673424,
       68.82295395, 61.12772776, 62.60565491, 60.81931376, 71.41950491,
       62.16389166, 71.75541247, 70.14322712, 70.79024465, 70.33820199,
       71.83886402, 66.68047005, 61.67990668, 69.48276814, 67.35411783,
       63.42065776, 63.84510606, 71.16706637, 71.484122  , 71.93274638,
       70.71079728, 71.30919445, 70.20489858, 65.04425294, 63.44501259,
       71.84036666, 67.01438044, 59.86986732, 63.9678808 , 71.83442965,
       62.97678398, 63.80317112, 71.78675028, 70.51990499, 71.41744028,
       69.23878467, 64.40642875, 63.93889491, 67.99769809, 65.62792906,
       65.18365947, 58.86749709, 63.57654251, 66.95901198, 66.38017718,
       64.93200208, 72.0149001 , 62.54701888, 68.766288  , 64.30437731,
       72.01266973, 62.73814879, 63.57211283, 69.04026355, 60.99540089,
       60.59391743, 60.95177803, 69.052602  , 58.47701547, 62.88210706,
       60.95992615, 69.75366075, 71.03584129, 66.73388291, 64.62

# I'm ready for scraping! But am I allowed to?

**Question 5**

We know that many sites have a published policy allowing or disallowing automatic access to their site. Often, this policy is in a text file `robots.txt`. There is a good article (`https://moz.com/learn/seo/robotstxt`) that explains what these files are, where to find them, and how to use them. After reading the article, please answer a few questions. 

**2.1: What is the purpose of `robots.txt`?**

1) To inform agents which pages to crawl.

2) To inform agents that the site is automated.

3) To inform agents that robots will chase them down if their info is stolen.

**2.2: Where do you put your `robots.txt` file?**

1) In the folder you want to disallow.

2) In the root directory of your website.

3) In a Google search.


**2.3: If a `robots.txt` is not present, does it mean you can legally scrape the site?**

1) Yes

2) No

**2.4: Each subdomain on a root domain can use a separate `robots.txt` file**

1) Yes

2) No


**2.5: Website hunt**

Next, find three websites that explicitly use a `robots.txt` file and allow scraping (by everyone) and three that do not allow generic user-agents (denoted by `*`) to scrape them.

* Note: Some websites may cause gradescope to time out. Please change a website if you encounter this issue. 
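
When hunting for sites, the rules in a `robots.txt` file can also be checked programmatically. The sketch below uses Python's standard-library `urllib.robotparser` on a made-up policy; the `example.com` URLs and the agent names `MyCrawler` and `BadBot` are placeholders, not real policies.

```python
from urllib.robotparser import RobotFileParser

# A made-up policy: everyone is kept out of /private/, BadBot out of everything.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A generic agent falls under the `*` rules.
generic_home = parser.can_fetch("MyCrawler", "https://example.com/index.html")
generic_private = parser.can_fetch("MyCrawler", "https://example.com/private/data.html")

# BadBot has its own, stricter entry.
badbot_home = parser.can_fetch("BadBot", "https://example.com/index.html")

print(generic_home, generic_private, badbot_home)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` fetches the file instead of parsing a string; a policy that disallows generic agents entirely would contain `Disallow: /` under `User-agent: *`.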




Now combine your answers to the multiple choice questions in one list and the URLs of the sites you found in another list. 
Create an argument-free function `answers` that returns both lists.


In [192]:
def answers():
    """
    Returns two lists with your answers
    :return: Two lists: one with your answers to multiple choice questions
    and the second list has 6 websites that satisfy given requirements.
    >>> list1, list2 = answers()
    >>> len(list1)
    4
    >>> len(list2)
    6
    """
    list1 = [1,2,1,2]
    list2 = ['qq.com','soundcloud.com','fc2.com', '*facebook.com','*linkedin.com','*soso.com']
    return list1, list2

In [196]:
list1, list2 = answers()
len(list2)


6

## Congratulations! You're done!

* Submit the lab on Gradescope