In [None]:
#imports (same as tuto ML)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import fetch_20newsgroups
from pandas.plotting import scatter_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
import seaborn as sns
%matplotlib inline

# Question 1: Propensity score matching

We perform a naive data analysis using plots.

In [None]:
#import the data set
lalonde_df = pd.read_csv('lalonde.csv')
#give a first look
lalonde_df.head()

### 1. A naive analysis

**Looking at numbers**

We use a simple summary to describe the data.

In [None]:
lalonde_df.groupby(['treat', 'black'])['re78'].describe()

In [None]:
lalonde_df.groupby(['treat', 'hispan'])['re78'].describe()

In [None]:
lalonde_df.groupby('treat')['re78'].describe()

- There are far fewer people in the treated group than in the untreated group.
- The untreated group is more homogeneous (smaller std) and has a higher mean.
- The max salary in the treated group is 3x higher! Plus, its 25% mark is much higher (2x).

==> There are more white people in the untreated group, which may be why its 50% & 75% marks are this high (the lowest earners are people from other ethnicities). In the treated group only 18 people are white, which may explain why the two groups differ so much (and the lower bounds).
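The racial make-up of each group can be checked directly with a normalized cross-tabulation. This is a minimal sketch, assuming the `black` and `hispan` indicator columns used in the `groupby` calls above; the small `toy_df` frame and the derived `race` labels are hypothetical stand-ins, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for lalonde_df, just to make the sketch runnable
toy_df = pd.DataFrame({
    'treat':  [1, 1, 1, 0, 0, 0, 0, 0],
    'black':  [1, 1, 0, 0, 0, 1, 0, 0],
    'hispan': [0, 0, 0, 1, 0, 0, 0, 0],
})

# Derive a single race label from the two indicator columns
race = np.select([toy_df['black'] == 1, toy_df['hispan'] == 1],
                 ['black', 'hispan'], default='white')

# Share of each race within each treatment group (each row sums to 1)
race_share = pd.crosstab(toy_df['treat'], race, normalize='index')
print(race_share)
```

Run on the real DataFrame, the same two steps would show how differently the treated and untreated groups are composed.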

We can visualize these numbers nicely with a boxplot.

In [None]:
#define the two categories
treat = lalonde_df.treat == 1
untreat = lalonde_df.treat == 0

treated_salary = lalonde_df[treat]['re78']
untreated_salary = lalonde_df[untreat]['re78']

In [None]:
plt.boxplot([treated_salary, untreated_salary], labels=['treated', 'untreated'])
plt.show()

**Looking at plots**

If we naively (as we did at first) plot a histogram of raw counts, we see that, once we take into account that there are far fewer people in the treated group, both groups fare similarly.

We note that the distribution is long-tailed, so a log-log plot may be more appropriate.

In [None]:
#draw the plots on top of each other with same bin size
bins = np.linspace(0, 60500, 30)
plt.hist(untreated_salary, bins, alpha=0.5, label='untreated')
plt.hist(treated_salary, bins, alpha=0.5, label='treated')
plt.legend(loc='upper right')
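Because the two groups differ in size, raw counts are hard to compare. One option is to normalize each histogram and use a log axis for the long tail. This is a minimal sketch on hypothetical log-normal samples (`untreated_toy`, `treated_toy`), which only stand in for the real salary columns.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical long-tailed samples standing in for the two salary columns
rng = np.random.default_rng(0)
untreated_toy = rng.lognormal(mean=8.5, sigma=1.0, size=400)
treated_toy = rng.lognormal(mean=8.5, sigma=1.0, size=180)

bins = np.linspace(0, 60500, 30)
# density=True rescales each histogram so that its total area is 1,
# which makes groups of different sizes directly comparable
plt.hist(untreated_toy, bins, alpha=0.5, density=True, label='untreated')
plt.hist(treated_toy, bins, alpha=0.5, density=True, label='treated')
plt.yscale('log')  # a log axis makes the long tail easier to see
plt.legend(loc='upper right')
plt.show()
```

The same two extra arguments (`density=True` and a log scale) could be applied to `untreated_salary` and `treated_salary` directly.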

**Conclusion**

A naive researcher might conclude that the treatment isn't effective, as the mean salary of the untreated population is higher, but that for some individuals it is very effective, hence the outliers.
==> Does a naive researcher care about the outliers and the fact that the groups are biased?

### 2. A closer look at the data

We look more closely at the data, building a histogram for every column. Note: rather than treating the removal of `id` as an important step, it may be better to use a new DataFrame each time and drop columns only when appropriate.

In [None]:
#distinct id is useless
del lalonde_df['id']

We only look at the interval (numeric) variables here, since it makes sense to represent them with histograms. We also include boxplots (a visual representation of the five-number summary).

In [None]:
intervals = ['age', 'educ', 're74', 're75', 're78']

for col in intervals:
    print('Histogram and 5 number summary for column : ', col)
    print('\n', lalonde_df.groupby('treat')[col].describe())
    # boxplot: visual five-number summary for each group
    plt.figure()
    plt.title(col)
    plt.boxplot([lalonde_df[untreat][col], lalonde_df[treat][col]], labels=['untreated', 'treated'])
    # histogram on shared bins so that both groups are comparable
    plt.figure()
    plt.title(col)
    bins = np.linspace(lalonde_df[col].min(), lalonde_df[col].max(), 30)
    plt.hist(lalonde_df[untreat][col], bins, alpha=0.5, label='untreated')
    plt.hist(lalonde_df[treat][col], bins, alpha=0.5, label='treated')
    plt.legend(loc='upper right')
    plt.show()

We note different distributions (each one needs to be compared on its own). Salaries are all long-tailed, education looks roughly Gaussian, but age has no obvious distribution (maybe Poisson?!).

Regarding categorical data, we should look at rates, which make much more sense than raw counts. We therefore define the rates for race, degree and marriage within each treatment group so that we can compare them (pie charts? we need something visual).

In [None]:
#Implement pie chart mentioned above
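One possible way to get these rates, sketched on a hypothetical `toy_df`: for 0/1 indicator columns the group mean is the rate. The column names `married` and `nodegree` are assumptions about the dataset and should be checked against `lalonde_df.columns`.

```python
import pandas as pd

# Hypothetical stand-in for lalonde_df with the binary columns only
toy_df = pd.DataFrame({
    'treat':    [1, 1, 1, 1, 0, 0, 0, 0],
    'black':    [1, 1, 1, 0, 0, 0, 1, 0],
    'hispan':   [0, 0, 0, 1, 0, 0, 0, 0],
    'married':  [0, 0, 1, 0, 1, 1, 0, 1],
    'nodegree': [1, 1, 0, 1, 0, 1, 0, 1],
})

# For 0/1 indicators the group mean is exactly the rate of ones
binary_cols = ['black', 'hispan', 'married', 'nodegree']
rates = toy_df.groupby('treat')[binary_cols].mean()
print(rates)
```

A grouped bar chart such as `rates.T.plot.bar()` may be easier to compare across groups than a set of pie charts.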

We also look at the correlations between the variables (is this the best way to do it? Would it be better to compute them for each treatment group separately?).
--> Interesting, but we still need to explain why we do this and, most importantly, what we see: degrees are not (linearly) linked to race, all salaries are linked to each other, plus more obvious links such as educ & degree or age & marriage.

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
corr = lalonde_df.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)

### 3. A propensity score model

In [None]:
# work on a copy without the outcome, so that lalonde_df keeps its re78 column
prop_table = lalonde_df.drop(columns='re78')

In [None]:
# `treat` is the target; every remaining column is a pre-treatment covariate
X = prop_table.drop(columns='treat')
y = prop_table['treat'].to_numpy()

In [None]:
logistic = LogisticRegression()
logistic.fit(X, y)
# score() returns the mean accuracy of the classifier on (X, y)
logistic.score(X, y)

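Is the propensity score really that simple? Cf. the formula in the course (it depends on $\pi_i = p(Z = 1 \mid x)$).
==> Not quite: `logistic.score` only gives the mean accuracy of the classifier, not the probability of being treated for each subject, so the exact computation still has to be worked out (check with Sas & Leo).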
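For a fitted scikit-learn classifier, the per-row estimate of $p(Z = 1 \mid x)$ is available through `predict_proba`. This is a minimal sketch under that reading of the course formula; `X_toy` and `z_toy` are hypothetical synthetic data, not the Lalonde covariates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical covariates and treatment indicator, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
z_toy = (X_toy[:, 0] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X_toy, z_toy)

# predict_proba returns one column per class, ordered as in model.classes_;
# the column for class 1 is the estimated P(Z = 1 | x) for each row
treated_col = list(model.classes_).index(1)
propensity = model.predict_proba(X_toy)[:, treated_col]
print(propensity[:5])
```

On the notebook's data the analogous call would presumably be `logistic.predict_proba(X)`, with the resulting vector stored as a new column for the matching step.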

### 4. Balancing the dataset via matching

### 5. Balancing the groups further


### 6. A less naive analysis

# Question 2: Applied ML

## 1.

Load the 20newsgroup dataset. It is, again, a classic dataset that can directly be loaded using sklearn ([link](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)).  
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), short for term frequency-inverse document frequency, is of great help when it comes to computing textual features. Indeed, it gives more importance to terms that are more specific to the considered articles (TF) but reduces the importance of terms that are very frequent in the entire corpus (IDF). Compute TF-IDF features for every article using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Then, split your dataset into a training, a testing and a validation set (10% for validation and 10% for testing). Each observation should be paired with its corresponding label (the article category).


In [None]:
# The test and training sets are already provided; we still need to choose how to build the validation set.
# Headers, footers and quotes are removed, as proposed by the tutorial.
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

In [None]:
# We have the data as a list, transform it to a pandas DF. Not sure if needed though.
train_DF = pd.DataFrame(newsgroups_train.data)
test_DF = pd.DataFrame(newsgroups_test.data)
train_DF.head()
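The TF-IDF and splitting steps described in the exercise could look roughly as follows. This is a sketch on a hypothetical toy corpus (`toy_docs`, `toy_labels`) so that it runs without downloading the dataset; integer hold-out sizes are used to get exactly 10% for each of the test and validation sets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the newsgroup posts and labels
toy_docs = ['the car engine is loud', 'my car needs new tires',
            'the rocket reached orbit', 'nasa launched a new rocket'] * 5
toy_labels = [0, 0, 1, 1] * 5

# One sparse TF-IDF row per document
vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(toy_docs)

# Hold out 10% for testing, then 10% of the original size for validation
n_holdout = int(round(0.1 * X_all.shape[0]))
X_rest, X_test, y_rest, y_test = train_test_split(
    X_all, toy_labels, test_size=n_holdout, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_holdout, random_state=0)
print(X_train.shape, X_val.shape, X_test.shape)
```

A stricter variant would fit the vectorizer on the training part only and call `transform` on the other two, so that no vocabulary or IDF statistics leak from the held-out articles.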

## 2.

Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimator "n_estimators" and the max depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.
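A rough outline of this step is sketched below on hypothetical synthetic data from `make_classification`, not on the TF-IDF features; the grid values are arbitrary placeholders. For brevity the confusion matrix is computed on the validation set, whereas in the real pipeline it would more naturally be computed on the untouched test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the TF-IDF features
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# Simple grid search scored on the held-out validation set
best_score, best_params, best_model = -1.0, None, None
for n_estimators in [10, 50]:
    for max_depth in [3, None]:
        forest = RandomForestClassifier(n_estimators=n_estimators,
                                        max_depth=max_depth, random_state=0)
        forest.fit(X_train, y_train)
        score = forest.score(X_val, y_val)
        if score > best_score:
            best_score, best_params, best_model = score, (n_estimators, max_depth), forest

print('best (n_estimators, max_depth):', best_params, 'accuracy:', best_score)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_val, best_model.predict(X_val))
print(cm)
# Indices of the five most important features, most important first
top_features = np.argsort(best_model.feature_importances_)[::-1][:5]
print(top_features)
```

With TF-IDF features, the indices in `top_features` could be mapped back to words through the vectorizer's vocabulary, which is what makes the importances interpretable.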
