In [None]:
#imports (same as tuto ML)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import OneHotEncoder # is this really needed ?
from pandas.plotting import scatter_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split, GridSearchCV, PredefinedSplit

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

%matplotlib inline

# Question 1: Propensity score matching

### 1. A naive analysis

We perform a naive data analysis using plots.

In [None]:
#import the data set
lalonde_df = pd.read_csv('lalonde.csv')
#give a first look
lalonde_df.head()

**Looking at numbers**

We use a simple summary to describe the data.

In [None]:
lalonde_df.groupby('treat')['re78'].describe()

- We see that there are far fewer people in the treated group than in the untreated group.
- The standard deviation for the treated group is much higher: there is more variety in the outcome.
- The mean of the untreated group is higher.
- However, the maximum salary in the treated group is 3x higher!
 - Given that the 75th percentile of the untreated group is much higher, this could indicate that the maximum of the treated group is an outlier.

We can visualize these numbers nicely with a boxplot.

In [None]:
#define the two categories
treated_salary = lalonde_df[lalonde_df.treat ==1 ]['re78']
untreated_salary = lalonde_df[lalonde_df.treat == 0 ]['re78']

In [None]:
plt.boxplot([treated_salary, untreated_salary], labels=['treated', 'untreated'])
plt.show()

**Looking at plots**

Naively (as we did at first), we try a histogram. Taking into account that there are far fewer people in the treated group, both groups appear to fare similarly.

We note that the distribution is strongly right-skewed (log-like).

In [None]:
#draw the plots on top of each other with same bin size
bins = np.linspace(0, 60500, 30)
plt.hist(untreated_salary, bins, alpha=0.5, label='untreated')
plt.hist(treated_salary, bins, alpha=0.5, label='treated')
plt.legend(loc='upper right')
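
Because the two groups differ in size, raw counts make the smaller group look flatter than it is. A minimal sketch of one way to account for this is to normalize each histogram with `density=True`. The data below is synthetic and only stands in for the two salary series; the names `treated_demo` and `untreated_demo`, the group sizes and the exponential shape are assumptions made for illustration, not values from `lalonde.csv`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the two salary series (NOT the real lalonde data)
rng = np.random.default_rng(0)
untreated_demo = rng.exponential(scale=7000, size=400)
treated_demo = rng.exponential(scale=6500, size=150)

# same bins as in the cell above
bins = np.linspace(0, 60500, 30)

# density=True rescales each histogram so that the area of its bars is 1,
# which makes groups of different sizes directly comparable
dens_untreated, _, _ = plt.hist(untreated_demo, bins, alpha=0.5, density=True, label='untreated')
dens_treated, _, _ = plt.hist(treated_demo, bins, alpha=0.5, density=True, label='treated')
plt.legend(loc='upper right')
plt.xlabel('re78')
plt.ylabel('density')
```

In the notebook itself, the same call would presumably be applied to `untreated_salary` and `treated_salary`.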

### 2. A closer look at the data

### 3. A propensity score model

### 4. Balancing the dataset via matching

### 5. Balancing the groups further


### 6. A less naive analysis

# Question 2: Applied ML

First, we need to compute the TF-IDF features of our dataset using a vectorizer. We understood the question as asking us to use all of the data rather than the predefined subsets from sklearn. Thus, we do not use the train and test subsets that sklearn provides, but create our own, adding a validation subset.

Also note that we remove the headers, footers and quotes, as proposed in the <a href="http://scikit-learn.org/stable/datasets/twenty_newsgroups.html">sklearn tutorial</a> of the dataset, so as to have something more realistic and free of metadata. Note also that we did not use the *sklearn.datasets.fetch_20newsgroups_vectorized* function, which returns the TF-IDF features directly, as it would defeat the purpose of the exercise.

In [None]:
# create the TF-IDF vectorizer
tfidf = TfidfVectorizer()

In [None]:
# Import the data we need to use the vectorizer on. Remove metadata as proposed by the scikit-learn tutorial
newsgroups_all = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

As asked in the question, we apply the vectorizer to the complete set before separating it into subsets.

In [None]:
# vectors is a sparse matrix
vectors = tfidf.fit_transform(newsgroups_all.data)
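
To make the output of the vectorizer more concrete, here is a small sketch on a toy corpus. The three sentences and the names `toy_docs`, `toy_tfidf` and `toy_vectors` are made up for illustration; only `TfidfVectorizer` and its default settings come from the cell above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, standing in for newsgroups_all.data
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

toy_tfidf = TfidfVectorizer()
toy_vectors = toy_tfidf.fit_transform(toy_docs)

# one row per document, one column per distinct term
print(toy_vectors.shape)

# vocabulary_ maps each term to its column index
print(sorted(toy_tfidf.vocabulary_))

# with the default norm='l2', every row has unit Euclidean length
row_norms = np.sqrt(toy_vectors.multiply(toy_vectors).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

The result is a sparse matrix, which is why the real `vectors` can hold the whole dataset in memory even though the vocabulary is large.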

Now we need to separate the dataset into three sets: train, test and validation.

In [None]:
# first we separate train from the rest. random_state is given to have a seed.
newsgroups_train, newsgroups_inter, vect_train, vect_inter = \
    train_test_split(newsgroups_all.target, vectors, test_size=0.2, random_state=1)

# then we split again to get validation and test separately
newsgroups_test, newsgroups_valid, vect_test, vect_valid = \
    train_test_split(newsgroups_inter, vect_inter, test_size=0.5, random_state=1)
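
The two-step split above should give roughly 80% train, 10% test and 10% validation. As a quick sanity check, the sketch below repeats the same two calls on synthetic arrays (the `*_demo` and short names are hypothetical) and verifies both the sizes and that targets and features stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-ins: 1000 samples, 20 classes, one feature equal to the row index
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 20

# same argument order as above: targets first, then features
y_tr, y_rest, X_tr, X_rest = train_test_split(y_demo, X_demo, test_size=0.2, random_state=1)
y_te, y_va, X_te, X_va = train_test_split(y_rest, X_rest, test_size=0.5, random_state=1)

# sizes of the train, test and validation sets
print(len(y_tr), len(y_te), len(y_va))

# each target still belongs to its feature row
print(np.array_equal(X_tr[:, 0] % 20, y_tr))
```

On the real data, the same check amounts to comparing `vect_train.shape[0]`, `vect_test.shape[0]` and `vect_valid.shape[0]`.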

## 2.

Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimator "n_estimators" and the max depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.


Now we need to train a random forest on our training set. For this, we will use the RandomForestClassifier, as it contains the parameters talked about in the exercise. But first, we need to ask ourselves what we want to set the parameters (*max_depth* and *n_estimators*) to.

According to the ADA course, the number of trees is typically in the tens and the depth between 20 and 30. As a first attempt, we use a somewhat larger forest and set *n_estimators* to 100, with *max_depth* at 25.

For the predictions, we can't use the training set, as we just trained on it and thus would get very good results regardless. So prediction has to be on the validation set.

In [None]:
# need to find estimators and depth first. We use random_state to have a seed again.
clf = RandomForestClassifier(n_estimators=100, max_depth=25, random_state=1)
clf.fit(vect_train, newsgroups_train)
pred = clf.predict(vect_valid)
metrics.f1_score(newsgroups_valid, pred, average='macro')

As we can see, predictions aren't that great.

We try to fine-tune on the validation set. To do this, we use the GridSearchCV implemented in sklearn. We first chose a number of estimators between 100 and 1000 and a depth between 20 and 30, as this is what we have seen during the lessons. Since the best parameters turned out to be the upper limits (1000 and 30), we decided to check whether this would still hold with larger upper limits (1500 and 35).

Also, as we already have our own training, validation and test sets, we need to use *PredefinedSplit* in the grid search.

Please note that the fit takes a long time to compute, as there is a very large number of estimators.

In [None]:
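from scipy.sparse import vstack

# candidate values: the end points of the ranges discussed above
param_grid = {
    'n_estimators': [100, 1500],
    'max_depth': [20, 35]
}

# stack train and validation rows; -1 = always in training, 0 = the single validation fold
vect_train_valid = vstack([vect_train, vect_valid], format='csr')
target_train_valid = np.concatenate([newsgroups_train, newsgroups_valid])
test_fold = np.concatenate([np.full(vect_train.shape[0], -1), np.zeros(vect_valid.shape[0], dtype=int)])

CV_rfc = GridSearchCV(estimator=clf, param_grid=param_grid, cv=PredefinedSplit(test_fold))

In [None]:
# note: with the default refit=True, the best model is refitted on train + validation
CV_rfc.fit(vect_train_valid, target_train_valid)
print(CV_rfc.best_params_)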

As we can see, the best results are obtained when *n_estimators* is set around X and *max_depth* is set to X.

Now we do a confusion matrix on the test set.
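
A minimal sketch of how such a confusion matrix could be computed and displayed is given below. It runs on synthetic 3-class data so that it is self-contained; `X_demo`, `demo_clf` and the other names are hypothetical. In the notebook, the fitted forest, `vect_test` and `newsgroups_test` would presumably take their place, with `newsgroups_all.target_names` as tick labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic 3-class data standing in for the TF-IDF vectors (illustration only)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=5,
                                     n_classes=3, random_state=1)
X_fit, X_eval, y_fit, y_eval = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

demo_clf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=1)
demo_clf.fit(X_fit, y_fit)
y_pred = demo_clf.predict(X_eval)

# rows are true classes, columns are predicted classes
labels = [0, 1, 2]
cm = metrics.confusion_matrix(y_eval, y_pred, labels=labels)
print(cm)

# simple heatmap of the matrix
plt.imshow(cm, cmap='Blues')
plt.xticks(labels)
plt.yticks(labels)
plt.xlabel('predicted class')
plt.ylabel('true class')
plt.colorbar()
```

Large off-diagonal entries point to pairs of classes that the model confuses, which for the newsgroups data would be expected between topically close groups.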

Now, let us inspect the `feature_importances_` attribute of our random forest.
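
One way to read `feature_importances_` is to map each importance back to the term of the corresponding TF-IDF column and look at the highest-ranked terms. The sketch below does this on a made-up two-class corpus; the documents and the `toy_*` names are assumptions for illustration. In the notebook, the fitted forest and the `tfidf` vectorizer would be used instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical two-class toy corpus (space vs. hockey), repeated to get a few samples
toy_docs = [
    "the rocket reached orbit",
    "orbit and rocket launch",
    "the goalie saved the puck",
    "puck and goalie on ice",
] * 5
toy_labels = [0, 0, 1, 1] * 5

toy_tfidf = TfidfVectorizer()
toy_X = toy_tfidf.fit_transform(toy_docs)
toy_clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(toy_X, toy_labels)

# vocabulary_ maps term -> column index; sort the terms by column index
vocab = toy_tfidf.vocabulary_
terms = np.array(sorted(vocab, key=vocab.get))

# importances are non-negative and sum to 1 over all features
importances = toy_clf.feature_importances_
top = np.argsort(importances)[::-1][:5]
for idx in top:
    print(terms[idx], round(float(importances[idx]), 3))
```

Terms that appear in both classes, such as "the" and "and" here, carry little class information, so one would expect class-specific terms to rank above them.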