# DSC 80: Homework 09

### Due Date: Wednesday, March 12, 11:59PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the homework problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `hw0X.py` file (where `X` is the homework number), which will be imported into the current notebook.

Homeworks and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying Python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).


**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present the questions to you and give you a place to present your results for later review.
- The notebooks for *HW assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the HW! 
    - Development in Python usually involves larger files made up of many short functions.
    - You may write your other functions in an additional `.py` file that you import in `hw0X.py` (much like we do in the notebook).
- Always document your code!

In [126]:
%load_ext autoreload
%autoreload 2

In [127]:
import hw09 as hw

In [128]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

## Evaluation Metrics

**Question 1**

In this question you are given a few word problems that ask you to work with different measures: specificity, sensitivity, and precision.

* A new diagnostic test has 93% sensitivity and 95% specificity. You decided to try it on a group of 10,000 people. Half of them are known to have the disease and half of them do not have it. How many of the *known* positives would actually test positive?  How many of the *known* negatives would actually test negative?

* A new screening test for some disease A has 95% sensitivity and 93% specificity. You plan to screen a population in which the prevalence of the disease (<a href="https://en.wikipedia.org/wiki/Prevalence" >meaning of "prevalence" </a>) is 0.3%.  What proportion of positive identifications was actually correct?

Write a function `question1` that returns a list of your answers, in order. That is, the first two numbers are the answers to the first problem, and the third number is the answer to the second problem.
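For reference, the three measures can be written in terms of confusion-matrix counts. The sketch below uses made-up counts (not the numbers from the problems above) and hypothetical helper names; it also shows one way to recover precision from sensitivity, specificity, and prevalence using Bayes' rule.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that test positive: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that test negative: TN / (TN + FP)."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Fraction of positive test results that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


def precision_from_rates(sens, spec, prevalence):
    """Precision implied by sensitivity, specificity, and prevalence."""
    true_pos = sens * prevalence               # P(test +, disease)
    false_pos = (1 - spec) * (1 - prevalence)  # P(test +, no disease)
    return true_pos / (true_pos + false_pos)


# Hypothetical counts, chosen only for illustration.
tp, fn, tn, fp = 90, 10, 160, 40
prevalence = (tp + fn) / (tp + fn + tn + fp)

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
prec = precision(tp, fp)

# Both routes to precision should agree (up to floating-point error).
print(prec, precision_from_rates(sens, spec, prevalence))
```

The two printed values agree because the counts and the rates describe the same population.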

## Mushroom classification

**Question 2**

"Although this dataset was originally contributed to the UCI Machine Learning repository nearly 30 years ago, mushroom hunting (otherwise known as "shrooming") is enjoying new peaks in popularity. Learn which features spell certain death and which are most palatable in this dataset of mushroom characteristics. And how certain can your model be?"

Citation: https://www.kaggle.com/uciml/mushroom-classification/version/1#_=_

We will load the dataset using the pandas library. We see that each feature (column) in this dataset is comprised of a set of categories. The number of categories varies per feature.

In [None]:
dataset = pd.read_csv('data/mushrooms.csv')
dataset.head()

Most machine-learning algorithms cannot handle categorical features directly. To make more algorithms applicable, we can encode the features into a numeric representation using one-hot encoding.
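As a quick illustration of what one-hot encoding produces, here is a minimal sketch on a made-up frame. The column names and category codes are hypothetical, and `pd.get_dummies` is used only for brevity; the `OneHotEncoder` imported above plays the same role inside a `Pipeline`.

```python
import pandas as pd

# Hypothetical toy frame; the column names and category codes are made up.
toy = pd.DataFrame({
    "cap_shape": ["x", "b", "x", "f"],
    "odor": ["n", "a", "n", "p"],
})

# Each (column, category) pair becomes its own indicator column.
encoded = pd.get_dummies(toy)

print(encoded.shape)             # 4 rows, one column per (feature, category) pair
print(sorted(encoded.columns))
```

Every row has exactly one active indicator per original column, so the encoded frame carries the same information in a form that numeric algorithms accept.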

* Split this dataset into labels (`y`) and features (`X`). Reserve 1/3 of your data for testing.
* Use three classification algorithms:
    1. KNN classifier, with 1 neighbor
    2. Bayesian classifier that assumes every feature is independent of every other feature (GaussianNB())
    3. Random Forest Classifier with a single estimator, max depth 3, minimum samples split is 20, min samples leaf is 10. 
    
* Test your models by comparing their F1-scores (for label 'e').
* Write a function `order_classifiers` that returns a list of the three classifiers mentioned above, sorted from best to worst F1-score.
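To make the comparison concrete, the sketch below computes the F1-score for one label by hand and then ranks some classifiers by score. The helper name `f1_for_label`, the toy labels, and the scores in `scores` are all made up for illustration; in practice `sklearn.metrics.f1_score(y_true, y_pred, pos_label='e')` computes the same quantity.

```python
import numpy as np


def f1_for_label(y_true, y_pred, label):
    """F1-score treating `label` as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == label) & (y_true == label))
    fp = np.sum((y_pred == label) & (y_true != label))
    fn = np.sum((y_pred != label) & (y_true == label))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 'e' for edible, 'p' for poisonous.
y_true = ["e", "e", "p", "p", "e", "p"]
y_pred = ["e", "p", "p", "e", "e", "p"]
score = f1_for_label(y_true, y_pred, "e")
print(score)

# Ranking by score, best first (the scores here are invented placeholders).
scores = {"clf_a": 0.91, "clf_b": 0.97, "clf_c": 0.88}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```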

## Faulty Scooters

**Question 3**

A new electric scooter company 'Maxwell Scooters' opened a retail shop in La Jolla recently and 300 UCSD students bought new scooters for getting around campus. After 8 students start complaining their scooters are faulty, negative on-line reviews for the scooters start to spread. In response, the scooter company adamantly claims that 99% of their scooters come off the production line working properly. You think this seems unlikely and decide to investigate.

* Select a significance level for your investigation. (Not to be turned in)
* What are reasonable choices for the *Null Hypothesis* for your investigation? Select all that apply:
    1. The scooter company produces scooters that are 99% non-faulty.
    2. The scooter company produces scooters that are less than 99% non-faulty.
    3. The scooter company produces scooters that are at least 1% faulty.
    4. The scooter company produces scooters that are ~2.6% faulty.

Return your answer in a function `null_hypoth` of zero variables.

* Create a function `simulate_null` that simulates a single step of data generation under the null hypothesis. The function should return a binary array.

* Create a function `estimate_p_val` that takes in a number `N` and returns the estimated p-value of your investigation upon simulating the null hypothesis `N` times.

*Note*: Plot the Null distribution and your observed statistic to check your work.
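The general shape of such a simulation can be sketched as follows. This is only one possible setup, under stated assumptions: the null is taken to be "exactly 1% faulty", the test statistic is the number of faulty scooters, and the p-value is one-sided. The function and parameter names are hypothetical and differ from the graded ones.

```python
import numpy as np


def simulate_once(n_items, p_faulty, rng):
    """Return a binary array: 1 marks a faulty item, 0 a working one."""
    return rng.binomial(1, p_faulty, size=n_items)


def estimate_p_value(n_items, p_faulty, observed_faulty, n_sims, seed=0):
    """Share of simulations with at least `observed_faulty` faulty items."""
    rng = np.random.default_rng(seed)
    counts = np.array([
        simulate_once(n_items, p_faulty, rng).sum() for _ in range(n_sims)
    ])
    return np.mean(counts >= observed_faulty)


# Assumed setup: 300 scooters, a claimed 1% fault rate, 8 observed faulty.
p_val = estimate_p_value(n_items=300, p_faulty=0.01, observed_faulty=8,
                         n_sims=10_000)
print(p_val)
```

The estimate varies from run to run unless the seed is fixed, which is why a larger `n_sims` gives a more stable p-value.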

### Inference in the police stops data

These questions will pursue a few basic inference questions with the cleaned vehicle stops data from project 3.

In [4]:
stops = pd.read_csv('data/vehicle_stops_datasd.csv')
stops.head()

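| Unnamed: 0 | stop_id | stop_cause | service_area | subject_race | subject_sex | subject_age | sd_resident | searched | dayofweek | hour |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Equipment Violation | 530.0 | W | M | 28.0 | 1.0 | 0.0 | 4 | 0 |
| 1 | 1 | Moving Violation | 520.0 | B | M | 25.0 | 0.0 | 0.0 | 4 | 0 |
| 2 | 2 | Moving Violation | 110.0 | H | F | 31.0 | NaN | NaN | 4 | 0 |
| 3 | 3 | Moving Violation | NaN | W | F | 29.0 | 0.0 | 0.0 | 4 | 0 |
| 4 | 4 | Moving Violation | 230.0 | W | M | 52.0 | 0.0 | 0.0 | 4 | 0 |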


**Question 4**

Suppose you would like to answer the question: "Does the likelihood that a traffic stop results in a search depend on location? Or are police equally likely to search any car that's pulled over?"

To investigate this question, perform a hypothesis test with significance level 0.01, using the null hypothesis: "Any given stop is equally likely to result in a search, regardless of the `service_area` in which the stop occurred." Ignore missing values of `service_area` in this analysis.

Measure the difference between the distribution of searches under the null hypothesis and the observed distribution of searches using the total variation distance (TVD).
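As a reminder, the total variation distance between two categorical distributions is half the sum of the absolute differences of their proportions. A minimal sketch, with a hypothetical function name and made-up distributions:

```python
import numpy as np


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over the same categories."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.abs(p - q).sum() / 2


# Two made-up distributions over three categories.
example_tvd = total_variation_distance([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
print(example_tvd)  # approximately 0.1
```

The value ranges from 0 (identical distributions) to 1 (distributions with no category in common).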

* Create a function `simulate_searches` that takes in the stops data and returns a function of zero variables that simulates a single step of data generation under the null hypothesis. The function should return a (hypothetical) empirical distribution of searches by service area. That is, if `sim = simulate_searches(stops)`, then calling `sim()` generates a distribution of searches.

* Create a function `tvd_sampling_distr` that takes in the `stops` data and a number `N` and returns the distribution of TVDs generated under the null hypothesis. That is, your function should return an array of `N` floats.

* Create a function `search_results` of zero variables that returns a tuple with the following information:
    1. The value of the observed statistic.
    2. `True` if you reject the null hypothesis, and `False` if you fail to reject the null hypothesis.
    
The values in `search_results` should be hard-coded.

*Note:* [This chapter from DSC10](https://www.inferentialthinking.com/chapters/11/2/Multiple_Categories.html) should help guide you.
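The "function that returns a function" pattern described above can be sketched on toy numbers. This is an illustration under assumptions, not the graded solution: it works from made-up per-area stop counts instead of the `stops` frame, it approximates the null by drawing searches from a multinomial whose probabilities are each area's share of stops, and all names are hypothetical.

```python
import numpy as np


def tvd(p, q):
    """Total variation distance between two distributions."""
    return np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum() / 2


def make_simulator(stops_per_area, n_searches, seed=0):
    """Return a zero-argument function that simulates one null draw."""
    stops_per_area = np.asarray(stops_per_area, dtype=float)
    area_probs = stops_per_area / stops_per_area.sum()
    rng = np.random.default_rng(seed)

    def simulate():
        # Under the null, a search is as likely to come from any stop,
        # so searches fall in each area in proportion to its stops.
        counts = rng.multinomial(n_searches, area_probs)
        return counts / n_searches

    return simulate


# Made-up data: three areas with 500, 300, and 200 stops; 50 searches total.
null_distr = np.array([0.5, 0.3, 0.2])
sim = make_simulator([500, 300, 200], n_searches=50)
tvds = np.array([tvd(sim(), null_distr) for _ in range(1000)])

observed = tvd([0.7, 0.2, 0.1], null_distr)  # hypothetical observed distribution
p_value = np.mean(tvds >= observed)
print(observed, p_value)
```

Returning a closure keeps the precomputed probabilities and the random generator out of the caller's way, so each call to `sim()` is a single cheap draw.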

In [129]:
# plot the results

# obs_tvd, _ = hw.search_results()
# pd.Series(hw.tvd_sampling_distr(stops, 1000)).plot(kind='hist')
# plt.plot([obs_tvd,obs_tvd], [0,300], markersize=50);

## Null values: testing for MAR vs MCAR

**Question 5**

In this question, you will test for the missingness type of `sd_resident`. You will conclude that the missingness of this column depends on `service_area`.

Recall, the attribute `sd_resident` is *missing completely at random* if for *every* other attribute `col`, the following two distributions are "the same":
* the distribution of `col` when `sd_resident` is present, and
* the distribution of `col` when `sd_resident` is missing.

Determining whether two observed distributions come from the same process is exactly what A/B testing does. Thus, to determine if `sd_resident` is MCAR, we need to run a permutation test between these two distributions for *every (other) column* in the dataset.

Perform a permutation test for the empirical distribution of `service_area` conditional on `sd_resident=NULL` with significance level 1%. As the column `service_area` is categorical, use the total variation distance as the measure between the two distributions.
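The permutation procedure can be sketched on a toy example. This is a sketch under assumptions rather than the graded implementation: it shuffles the missingness indicator (shuffling the column values would work equally well), it runs on made-up data instead of the `stops` frame, and the function names are hypothetical.

```python
import numpy as np
import pandas as pd


def tvd_by_missingness(values, is_missing):
    """TVD between the distribution of `values` in the missing and present groups."""
    values = pd.Series(values).reset_index(drop=True)
    is_missing = np.asarray(is_missing, dtype=bool)
    present = values[~is_missing].value_counts(normalize=True)
    missing = values[is_missing].value_counts(normalize=True)
    # Align on the union of categories; absent categories count as 0.
    return present.subtract(missing, fill_value=0).abs().sum() / 2


def permutation_p_value(values, is_missing, n_perms, seed=0):
    """Observed TVD and the share of shuffled TVDs at least as large."""
    rng = np.random.default_rng(seed)
    is_missing = np.asarray(is_missing, dtype=bool)
    observed = tvd_by_missingness(values, is_missing)
    simulated = np.array([
        tvd_by_missingness(values, rng.permutation(is_missing))
        for _ in range(n_perms)
    ])
    return observed, np.mean(simulated >= observed)


# Made-up data: values are missing far more often in area "A".
area = ["A"] * 40 + ["B"] * 40
missing = [True] * 30 + [False] * 10 + [False] * 40
obs, p = permutation_p_value(area, missing, n_perms=500)
print(obs, p)
```

Because the helper takes the column values as an argument, the same two functions can be reused for any other categorical column.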

Create the following four functions:

* A function `perm_test` that takes in the stops data and runs the above permutation procedure once. The function should return a float (the tvd between the two distributions).

* A function `obs_stat` that takes in the stops data and computes the observed statistic of the permutation test.

* A function `sd_res_missing_dependent` that takes in the stops data and a number `N`, and returns the p-value that tests whether `sd_resident` is missing at random dependent on `service_area`.

* A function `sd_res_missing_cols` of zero variables that returns a list of the columns on which the missingness of `sd_resident` depends (use a 1% significance level). **Do not consider `stop_id` in your tests**.

*Note:* Writing your function to work on any column (not just `service_area`) allows you to run this permutation test on *all columns* of the stops data, which is what the last function asks you to do! This allows you to determine *which* columns the missingness of `sd_resident` depends on, either to determine if `sd_resident` is MCAR or to determine how to impute the column.

*Note:* Be sure to plot your sampling distributions and observed statistic to check your work!