## Assignment 4: Obsevational Studies and Applied ML

### Deadline
November 21st,11:59PM

### Important notes

Make sure you push on GitHub your notebook with all the cells already evaluated. Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you implemented. Back up any hypotheses and claims with data, since this is an important aspect of the course. Please write all your comments in English, and use meaningful variable names in your code. Your repo should have a single notebook (plus the data files necessary) in the master branch. If there are multiple notebooks present, we will not grade anything.

Use this legendary link to create your repository: [link](https://classroom.github.com/g/YXtsr0QK)

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats.stats import pearsonr


In [3]:
data_folder = './data/'

## Task 1: Boosting the economy by incentivizing self-employment

Assume the biggest priority of the local government in 2018 is to increase per-capita income. To do so, the officials plan to adopt a strategy for incentivizing self-employment through a series of campaigns, educational programs, and dedicated funds.

Since it is unethical and impossible in this setting to run a controlled experiment involving citizens (e.g., fire employees and force them to self-employ), the officials have asked you, the data scientist, to establish the effect of self-employment on the economy, relying on observational data.

**A)** You will be working with the full US 2015 census dataset (acs2015_county_data.csv, available at https://www.kaggle.com/muonneutrino/us-census-demographic-data#acs2015_county_data.csv). Using suitable methods, determine and quantify the dependency between the percentage of self-employed citizens and per capita income across all 3,212 US counties. Do citizens in counties that have a higher percentage of self-employed people earn more per capita?

**B)** The pilot program will involve all counties within a limited set of three US states. Set A includes Wisconsin, Tennessee, and  Minnesota. Quantify the dependency of per-capita income on self-employment rates across all the counties in set A.

**C)** In which state within set A is the observed effect of self-employment on per-capita income the strongest?

**D)** Set B includes New Jersey, Kansas, and Rhode Island. Repeat the analysis from steps B and C above, but now for set B. In which of the two sets A and B (if any) would you recommend incentivizing self-employment? Explain your reasoning.

Hint: It is useful to add a notion of confidence to your results and explore the data visually. You are allowed to use the SciPy library.

In [38]:
# Read the data frame
df = pd.read_csv(data_folder +'acs2015_county_data.csv');

In [41]:
# A

# Percentace of self-employed ['SelfEmployed']
# Per capita income in a county ['IncomePerCap']

# compute two correlat methods
spearman_corr = df['SelfEmployed'].corr(df['IncomePerCap'], method='spearman')
pearson_corr, _ = pearsonr(df['SelfEmployed'], df['IncomePerCap'])
print(corr)
print(pearson_corr)


0.056413673053189874
0.08727386609551786


#### Observation Part A:
By looking at the correlation methods used, spearman and pearson, we can see that both show an **very low correlation** between the **Self Employment percentage** in a county and its **Income Per Capita**
Therefore, counties with higher percentage of of self-employed people do not necessarily earn more per capita.

In [49]:
# B 

a = ['Wisconsin', 'Tennessee', 'Minnesota']

# Compute three states all together
set_a = df[df['State'].isin(a)]
spearman_corr = set_a['SelfEmployed'].corr(set_a['IncomePerCap'], method='spearman')
pearson_corr, _ = pearsonr(set_a['SelfEmployed'], set_a['IncomePerCap'])
print('Set A Spearman:',corr)
print('Set A Pearson',pearson_corr)

# Compute each state individually
wisconsin_df = df[df['State'] == 'Wisconsin']
tennessee_df = df[df['State'] == 'Tennessee']
minnesote_df = df[df['State'] == 'Minnesota']

# Wisconsin
spearman_corr = wisconsin_df['SelfEmployed'].corr(wisconsin_df['IncomePerCap'], method='spearman')
pearson_corr, _ = pearsonr(wisconsin_df['SelfEmployed'], wisconsin_df['IncomePerCap'])
print('Wisconsin Spearman:', spearman_corr)
print('Wisconsin Pearson', pearson_corr)

# Tennessee
spearman_corr = tennessee_df['SelfEmployed'].corr(tennessee_df['IncomePerCap'], method='spearman')
pearson_corr, _ = pearsonr(tennessee_df['SelfEmployed'], tennessee_df['IncomePerCap'])
print('Tennessee Spearman:', spearman_corr)
print('Tennessee Pearson', pearson_corr)

# Minnesota
spearman_corr = minnesote_df['SelfEmployed'].corr(minnesote_df['IncomePerCap'], method='spearman')
pearson_corr, _ = pearsonr(minnesote_df['SelfEmployed'], minnesote_df['IncomePerCap'])
print('Minnesota Spearman:', spearman_corr)
print('Minnesota Pearson', pearson_corr)

Set A Spearman: 0.056413673053189874
Set A Pearson -0.20229350736521498
Wisconsin Spearman: -0.46351291044049403 0.004768134887745234
Wisconsin Pearson -0.32905300016378525
Tennessee Spearman: -0.316991392780988
Tennessee Pearson -0.23836048684913141
Minnesota Spearman: -0.21107460598245847
Minnesota Pearson -0.2538551921654062


## Observation Part B

### Set A

**Spearman: 0.056413673053189874** <br>
**Pearson: -0.20229350736521498**

After computing the correlations using Spearman and Pearson, we can see that the former method gives us a value (0.056) closer to zero than to -1 and 1. This means that there's almost no dependency between the percentage of self-employed people and the income per capita of these three states. On the other hand, when using the Pearson correlation, we can see a slight negative correlation from the value obtained (-0.20); this means that in a slight way the more self-employed people there are, the less the income per capita will be.

Next, we see the correlations between the self-employed percentage and income per capita in each state.

###  Wisconsin
**Spearman: -0.46351291044049403** <br>
**Pearson -0.32905300016378525**

### Tennessee
**Spearman: -0.316991392780988** <br>
**Pearson -0.23836048684913141**

### Minnesota
**Spearman: -0.21107460598245847** <br>
**Pearson -0.2538551921654062**

In each state we see a negative correlation between the self-employed percentage and income per capita. As previously exaplained, this means that the more self-employed people there are, the less the income per capita will be. In Wisconsin we can see a higher negative correlation between the two features, indicating an even stronger negative correlation. The other two states are slighly lower but still inversely proportional.

In [None]:
# C

In [None]:
# D

## Task 2: All you need is love… And a dog!

Here we are going to build a classifier to predict whether an animal from an animal shelter will be adopted or not (aac_intakes_outcomes.csv, available at: https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes/version/1#aac_intakes_outcomes.csv). You will be working with the following features:

1. *animal_type:* Type of animal. May be one of 'cat', 'dog', 'bird', etc.
2. *intake_year:* Year of intake
3. *intake_condition:* The intake condition of the animal. Can be one of 'normal', 'injured', 'sick', etc.
4. *intake_number:* The intake number denoting the number of occurrences the animal has been brought into the shelter. Values higher than 1 indicate the animal has been taken into the shelter on more than one occasion.
5. *intake_type:* The type of intake, for example, 'stray', 'owner surrender', etc.
6. *sex_upon_intake:* The gender of the animal and if it has been spayed or neutered at the time of intake
7. *age_upon\_intake_(years):* The age of the animal upon intake represented in years
8. *time_in_shelter_days:* Numeric value denoting the number of days the animal remained at the shelter from intake to outcome.
9. *sex_upon_outcome:* The gender of the animal and if it has been spayed or neutered at time of outcome
10. *age_upon\_outcome_(years):* The age of the animal upon outcome represented in years
11. *outcome_type:* The outcome type. Can be one of ‘adopted’, ‘transferred’, etc.

**A)** Load the dataset and convert categorical features to a suitable numerical representation (use dummy-variable encoding). Split the data into a training set (80%) and a test set (20%). Pair each feature vector with the corresponding label, i.e., whether the outcome_type is adoption or not. Standardize the values of each feature in the data to have mean 0 and variance 1. The use of external libraries is not permitted in part A, except for numpy and pandas.

**B)** Train a logistic regression classifier on your training set. Logistic regression returns probabilities as predictions, so in order to arrive at a binary prediction, you need to put a threshold on the predicted probabilities. For the decision threshold of 0.5, present the performance of your classifier on the test set by displaying the confusion matrix. Based on the confusion matrix, manually calculate accuracy, precision, recall, and F1-score with respect to the positive and the negative class. Vary the value of the threshold in the range from 0 to 1 and visualize the value of accuracy, precision, recall, and F1-score (with respect to both classes) as a function of the threshold. The shelter has a limited capacity and has no other option but to put to sleep animals with a low probability of adoption. What metric (precision, recall, accuracy, or F1-score) and with respect to what class is the most relevant when choosing the threshold in this scenario, and why? Explain your reasoning.

**C)** Reduce the number of features by selecting the subset of the k best features. Use greedy backward selection to iteratively remove features. Evaluate performance and visualize the result using 5-fold cross-validation on the training set as a function of k, where k = 1, 5, 10, 15, 20, 25, 30. Choose the optimal k and justify your choice. Interpret the top-k features and their impact on the probability of adoption.

**D)** Train a random forest. Use 5-fold cross-validation on the training set to fine-tune the parameters of the classifier using a grid search on the number of estimators "n_estimators" and the max depth of the trees "max_depth". For the chosen parameters, estimate the performance of your classifier on the test set by presenting the confusion matrix, accuracy, precision, recall, and F1-score with respect to both classes and compare the performance with the performance of the logistic regression. Interpret the results.

You are allowed to use the scikit-learn library to implement your classifiers.