In [None]:
#imports (same as tuto ML)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import OneHotEncoder # is this really needed ?
from pandas.plotting import scatter_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split, GridSearchCV, PredefinedSplit

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

%matplotlib inline

# Question 1: Propensity score matching

We perform a naive data analysis using plots and numbers.

In [None]:
#import the data set
lalonde_df = pd.read_csv('lalonde.csv')
#give a first look
lalonde_df.head()

### 1. A naive analysis

We assume that a naive researcher unfamiliar with observational studies would treat the data as if it was a randomized trial, not taking into consideration the hidden correlates.

We can easily imagine that the first thing he would do is split the salary (_['re78']_) data into 2 sets: treated and untreated.

In [None]:
#masks to be used a lot later
treated = lalonde_df.treat == 1
untreated = lalonde_df.treat == 0

#apply masks to get treated and untreated
treated_salary = lalonde_df[treated]['re78']
untreated_salary = lalonde_df[untreated]['re78']

**i - Visualizing the data:**

We first plot the final salary data in a histogram to find the distribution of salaries between the two groups.
We add weights so we can look at percentages instead of number of people in both bins.
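
The weighting trick can be checked on toy data. The sketch below uses made-up salaries (the group sizes, value range and the helper `normalized_counts` are assumptions for illustration, not part of the data set) and shows that with weights of `1/n` the bar heights of each group sum to 1, whatever the group size.

```python
import numpy as np

# Hypothetical salaries for two groups of very different sizes
rng = np.random.default_rng(0)
small_group = rng.uniform(0, 100, size=20)
large_group = rng.uniform(0, 100, size=200)
bins = np.linspace(0, 100, 11)

def normalized_counts(values, bins):
    # Each observation contributes 1/n, so bar heights are fractions of the group
    weights = np.ones(len(values)) / len(values)
    heights, _ = np.histogram(values, bins=bins, weights=weights)
    return heights

small_heights = normalized_counts(small_group, bins)
large_heights = normalized_counts(large_group, bins)
print(small_heights.sum(), large_heights.sum())
```

Both sums come out at 1 (up to floating point error), so the two histograms are directly comparable even though one group is ten times larger.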

In [None]:
plt.figure(figsize=(10, 4))
#define same bin size
bins = np.linspace(0, max(lalonde_df['re78']), 50)
#add weights to get percentages
plt.hist(untreated_salary, weights=np.ones(len(untreated_salary))/len(untreated_salary), alpha=.5 , bins=bins)
plt.hist(treated_salary, weights=np.ones(len(treated_salary))/len(treated_salary), alpha=.5, bins=bins)
plt.title('Histogram of salaries in the treated and untreated groups')
plt.legend(['untreated', 'treated'])
plt.xlabel('Yearly salary')
plt.ylabel('fraction of subjects')
plt.show()

##### First insights:

By looking at the graph, we see a very similar distribution for both groups, except that outliers are present in the treated group.

Another visible element is that the bars of the treated group tend to sit lower. Note that group size cannot explain this: although the treated group is smaller (cf. paragraph below), both histograms are normalized by their own group size, so any such shift must come from how the salaries are spread across the bins (for example the long tail of outliers), not from the head count.

**ii - Describing the numbers**

We assume that he would then look only at the basic descriptive statistics of the data (mean, standard deviation and five-number summary).

In [None]:
lalonde_df.groupby('treat')['re78'].describe()

From the numbers above, we assume that the researcher would extract the following information from the data:

- The untreated group has more people.
- The untreated group's salaries are higher on average (higher mean).
- However, the maximum salary in the treated group is roughly 3x higher! The 1st quartile is also about twice as high in the treated group.
- Finally, the second and third quartiles are higher in the first (untreated) group. Quartiles are more resistant to outliers, so they deserve more consideration than the mean and the maximum.
- The interquartile range is larger in the untreated group; as we have outliers in the data, this is a better measure of spread than the variance.

**iii - Boxplot:**

A boxplot illustrates the above more concisely.

In [None]:
plt.boxplot([treated_salary, untreated_salary], labels=['treated', 'untreated'])
plt.title('Distribution of salary by treatment group')
plt.ylabel('Salary')
plt.show()

**Conclusion**:

By merging all of the insights the researcher has drawn from the 3 steps of his analysis, he can conclude that **the treatment is ineffective**. Even though the salary distributions are similar in both cases, the treated group has a lower salary on average (and only a handful of rich people get lucky). This is shown by the boxplot: the whiskers extend higher in the untreated group.

### 2. A closer look at the data

The analysis above was **simplistic**: it ignored underlying factors, such as race and education, that could influence the outcome. We now look at the whole table, namely at the other features, which surely have an impact on the variable we want to understand at the end of this exercise: _['re78']_.

**i - Interval data :**

We start our analysis by visualizing the data to see if the two groups have different underlying distributions of factors. We split our analysis into **categorical** and **interval** data. In the beginning, we will focus on the latter.

In [None]:
#defining list of non binary variables
intervals = ['age', 'educ', 're74', 're75']

#for each column draw a boxplot in its own figure
for col in intervals:
    plt.figure()
    plt.boxplot([lalonde_df[untreated][col], lalonde_df[treated][col]], labels=['untreated', 'treated'])
    plt.title("Boxplot of " + col)
    plt.ylabel(col)
    plt.show()

From the graphs above, two things stand out. First, the distributions look broadly similar between the 2 groups (even though the groups differ in size). Second, 3 types of distribution seem to appear, although boxplots alone cannot confirm their shape and this reading still needs to be checked:
- Poisson: the **age** of participants (this one is doubtful; it is not clearly what the plots show)
- Gaussian: the **level of education** of participants
- Power law: the **salaries** of participants (_['re74'], ['re75'] and ['re78']_ all have the same form of distribution)

##### a. Salaries:

The main thing we need to look at is salary with respect to education, since education likely influences salary.

In [None]:
plt.figure()
sns.barplot(y="re78", x='educ', data=lalonde_df, hue='treat')

##### b. Evolution:

Even though it is useful to plot the salaries to see the difference between the years, it is much more useful to understand how the salary of each participant has changed over the years. To do so, we will visualize our data using a parallel plot. 

In [None]:
#Implement parallel plot
from pandas.plotting import parallel_coordinates
parallel_coordinates(lalonde_df[untreated][['id','re74', 're75', 're78']], 'id', color='Blue', alpha=0.5)
parplot = parallel_coordinates(lalonde_df[treated][['id','re74', 're75', 're78']], 'id', color='Orange' , alpha=0.7)
#remove legend for readability
parplot.legend_.remove()
plt.title('Salary over time for each participant')
plt.xlabel('Year')
plt.ylabel('Annual salary')

We see that:
- the treated group started out with a lower salary
- 75 was a bad year for everybody, treated or untreated.
- the outliers are partly people who were already well paid in 74, partly people who 'made it'.
- there is a lot of movement up for the treated group between 75 and 78

**Conclusion:**

By looking at the interval data, and more specifically at the salaries of participants, we can say that …

**ii - Categorical data :**

Regarding categorical data, we should look at rates, which make much more sense than raw counts because the groups differ in size. Thus we compute the rates for race, degree and marriage within each treatment group to be able to compare them.
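
Since the categorical columns are coded 0/1, the rate of a category within a group is simply the group mean of that column. A minimal sketch on a made-up frame (the `toy` data below is hypothetical; only the column names mirror the data set):

```python
import pandas as pd

# Hypothetical toy data: 'treat' and 'married' are coded 0/1 as in the data set
toy = pd.DataFrame({
    'treat':   [0, 0, 0, 0, 1, 1],
    'married': [1, 1, 1, 0, 0, 1],
})

# Mean of a 0/1 column = share of rows where the column equals 1
rates = toy.groupby('treat')['married'].mean()
print(rates)
```

Here 3 of the 4 untreated rows are married (rate 0.75) against 1 of the 2 treated rows (rate 0.5), which is the comparison a raw count would hide.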

In [None]:
#as the values are binary the mean is equal to the percentage of occurrence
#numeric_only=True skips any non-numeric column (such as a string 'id')
percentages = lalonde_df.groupby('treat').mean(numeric_only=True)
percentages

##### a. Race ratios:

We will start with race. As we do not have numbers for "White" participants, we take the rates of "Black" and "Hispanic" participants for each treatment group and subtract them from the total. We then compare the rates of each race using pie charts.

In [None]:
black_u, black_t = percentages['black']
hispan_u, hispan_t = percentages['hispan']
#there is no overlap in the hispan and black categories, 
#we assume people that are neither are white (which we checked, it is the case)
white_u, white_t = (1 - black_u - hispan_u, 1 - black_t - hispan_t)

In [None]:
u_race_rates = [black_u, hispan_u, white_u]
t_race_rates = [black_t, hispan_t, white_t]
#names for the labels
race_labels = 'Black', 'Hispanic', 'White'

In [None]:
plt.pie(u_race_rates, labels = race_labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.title("Rates of races in the untreated group")
plt.show()

plt.figure()

plt.pie(t_race_rates, labels = race_labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.title("Rates of races in the treated group")
plt.show()

We see that there are way more black subjects in the treatment group than in the untreated group.

##### b. Degree ratios:

To have a better understanding of the difference of salaries, we also need to look at the level of education of the participants of each treatment.

In [None]:
nodegree_u, nodegree_t = percentages['nodegree']
degree_u, degree_t = (1 - nodegree_u, 1 - nodegree_t)

In [None]:
#calculate rate for degree holders in treated and untreated group
u_degree_rates = [degree_u, nodegree_u]
t_degree_rates = [degree_t, nodegree_t]
degree_labels = 'Degree', 'No degree'

In [None]:
#draw pie diagram
plt.pie(u_degree_rates, labels = degree_labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.title("Rate of people with degrees in the untreated group")
plt.show()
plt.figure()
plt.pie(t_degree_rates, labels = degree_labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.title("Rate of people with degrees in the treated group")
plt.show()

We see that the treated group is less educated, by a difference of over 10 percentage points.

##### c. Marriage ratios:

Finally, we look at the rates of married people among both groups as it is our last feature. 

In [None]:
#married and unmarried by treatment
married_u, married_t = percentages['married']
not_married_u, not_married_t = (1 - married_u, 1 - married_t)

In [None]:
mariage_labels = 'Married', 'Not married'
#untreated group
plt.pie([married_u, not_married_u], labels = mariage_labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.title("Rate of married people in the untreated group")
plt.show()
#treated group
plt.figure()
plt.pie([married_t, not_married_t], labels = mariage_labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.title("Rate of married people in the treated group")
plt.show()

We again note that the treated group contains fewer married individuals.
Marriage can be an indicator of stability and may thus indicate how likely somebody is to perform consistently well on the job.

**Conclusion:**

By looking at the categorical data, we can say that the underlying factors between the two groups are not similar.
The treated group has a markedly higher share of black subjects, is less educated and is less often married. All these factors influence employment and should be taken into consideration.

**iii - Correlated data :**

After working on each variable alone, we want to understand how the variables relate to each other, pair by pair. We look at the pairplot rather than at correlation coefficients: the data are clearly not linearly dependent, so correlation by itself would not give us any insight.

We note that we can somewhat separate the two groups, indicating that they are not the same.

In [None]:
sns.pairplot(lalonde_df[['treat', 're78']+intervals], markers='+', hue='treat')

### 3. A propensity score model

To create a fair set to use in our observational study, we calculate the propensity score based on the underlying factors before treatment:

[age, educ, hispan, black, married, nodegree, re74, re75]

**note: we consider all these factors to have been recorded before treatment**

In [None]:
prop_table = lalonde_df.copy() #otherwise we modify lalonde_df when we modify prop_table

In [None]:
#create our target and training data:

X = prop_table.iloc[:, 2:-1] #columns 'age' to 're75'
y = prop_table.iloc[:, 1:2] #treated or not
y = np.ravel(y) #flatten array
print('First elements of Y : \n', y[0:5],'\nFirst elements of X\n', X[0:5])

In [None]:
#define our model
logistic = LogisticRegression()
logistic.fit(X, y)
print('Accuracy of prediction: ',logistic.score(X, y))

In [None]:
print("Example of prediction : ", logistic.predict(X[0:6]), ' reality :', y[0:6])
print('Example of prediction in percent : \n', logistic.predict_proba(X[0:6]))

In [None]:
#get propensity scores, i.e. the predicted probability of being treated
prop_table['propensity_scores'] = pd.Series(logistic.predict_proba(X)[:,1])

In [None]:
prop_table.head()

We now use the propensity scores to find a matching.

### 4. Balancing the dataset via matching

Matching the two groups is equivalent to finding a matching in a bipartite graph.
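
To make the idea concrete, here is a minimal sketch on made-up propensity scores. It deliberately swaps in a different tool, `scipy.optimize.linear_sum_assignment`, which minimizes the total score distance directly, whereas the cells below maximize `1 - distance` with networkx. The two formulations should agree as long as every treated subject ends up matched; the scores themselves are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical propensity scores (3 treated, 5 untreated subjects)
treated_scores = np.array([0.8, 0.6, 0.3])
untreated_scores = np.array([0.1, 0.35, 0.55, 0.75, 0.9])

# cost[i, j] = |score of treated i - score of untreated j|
cost = np.abs(treated_scores[:, None] - untreated_scores[None, :])

# Minimum-cost one-to-one assignment: each treated subject gets a distinct untreated one
rows, cols = linear_sum_assignment(cost)
for i, j in zip(rows, cols):
    print(f'treated {i} ({treated_scores[i]}) <-> untreated {j} ({untreated_scores[j]})')

total_distance = cost[rows, cols].sum()
print('total distance:', total_distance)
```

In this toy case each treated subject is paired with its nearest untreated neighbour (scores 0.75, 0.55 and 0.35), and the two remaining untreated subjects are left out, just as unmatched subjects are discarded below.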

In [None]:
import networkx as nx
B = nx.Graph()
#1. Create graph with the ids as nodes
B.add_nodes_from(prop_table['id'])

In [None]:
# 2. Add edges from each treated to each untreated subject
#    with the weight of each edge derived from the difference between the two propensity scores
for row_i in prop_table[treated].iterrows():
    for row_j in prop_table[untreated].iterrows():
        B.add_edge(row_i[1]['id'],row_j[1]['id'], 
                   #1 - x transforms the minimisation problem into a maximisation problem
                   weight= 1 - np.abs(row_i[1].propensity_scores - row_j[1].propensity_scores))

In [None]:
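#3. Find matching
matching = nx.max_weight_matching(B)
#recent networkx versions return a set of (u, v) pairs, older ones a dict;
#in both cases build a symmetric dict: id -> id of its match
pairs = matching.items() if isinstance(matching, dict) else matching
matching_dict = {}
for u, v in pairs:
    matching_dict[u] = v
    matching_dict[v] = u

In [None]:
print('Example matches:')
list(matching_dict.items())[:5]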

In [None]:
#keep only the subjects that appear in the matching
remaning_subjects = prop_table[prop_table['id'].isin(list(matching_dict))].copy()
print('we have : ', len(remaning_subjects) // 2, ' matched pairs') #each pair contributes two subjects

In [None]:
#compare the means of treated and untreated (numeric columns only)
remaning_subjects.groupby('treat').mean(numeric_only=True)

In [None]:
sns.boxplot(data=remaning_subjects, x='treat', y='re78') # this one is similar! good!

In [None]:
remaning_subjects[remaning_subjects['black'] == 1].groupby('treat')['id'].count() #unbalanced

In [None]:
remaning_subjects[remaning_subjects['hispan'] == 1].groupby('treat')['id'].count()

In [None]:
sns.pairplot(prop_table[intervals+['treat']], markers='+', hue='treat')

### 5. Balancing the groups further


We note that there are still far more black subjects in the treated group than in the untreated group.
Additionally, we still have outliers in the treated group.

We try to balance both groups by removing the pairs in which a white subject is matched with an outlying black subject.

In [None]:
remaning_subjects['match'] = remaning_subjects['id'].map(matching_dict)

In [None]:
remaning_subjects.head()

In [None]:
left = remaning_subjects[remaning_subjects.treat == 1]
right = remaning_subjects[remaning_subjects.treat == 0]
matches = left.merge(right, left_on='id', right_on='match')

In [None]:
matches.head()

In [None]:
matches['difference'] = abs(matches['propensity_scores_x'] - matches['propensity_scores_y'])

In [None]:
#matches that are black/white mismatched and have a large difference in propensity scores
to_drop = matches[(matches['black_x'] == 1) & (matches['black_y'] == 0) & (matches['hispan_y'] == 0) & (matches.difference > .5)].index

In [None]:
final_matches = matches.drop(to_drop)

In [None]:
final_matches.groupby('black_x')['id_x'].count() #treated

In [None]:
final_matches.groupby('black_y')['id_y'].count() #untreated

In [None]:
final_matches.mean(numeric_only=True)

In [None]:
#removing outliers
nana = final_matches.drop(final_matches[final_matches.re78_x > 30000].index)

In [None]:
plt.boxplot([nana['re78_x'], nana['re78_y']])
plt.show()

In [None]:
nana.re78_x.describe()

In [None]:
nana.re78_y.describe()

### 6. A less naive analysis

After controlling for underlying factors, we see that the treated population fares better than the untreated population.
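
One way to put a number on this comparison is to look at the salary difference within each matched pair. The sketch below assumes a merged frame shaped like `matches` above (`_x` for the treated subject, `_y` for its untreated match); the `toy_pairs` values are invented for illustration and are not results from the data set.

```python
import pandas as pd

# Hypothetical matched pairs; column names mimic the merged frame (_x treated, _y untreated)
toy_pairs = pd.DataFrame({
    're78_x': [5000.0, 0.0, 12000.0, 8000.0],
    're78_y': [4000.0, 1000.0, 9000.0, 8000.0],
})

# Per-pair treatment effect estimate: treated salary minus matched untreated salary
pair_diff = toy_pairs['re78_x'] - toy_pairs['re78_y']
print('mean difference:', pair_diff.mean())
print('median difference:', pair_diff.median())
```

On the toy numbers the mean difference is 750 and the median 500. Reporting both is a cheap guard against the outliers noted earlier, since the median is not pulled by a few very high salaries.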

# Question 2: Applied ML

First, we need to compute the TF-IDF features of our dataset, using a vectorizer. As we understood the question, what was asked was not to use any of the given datasets from sklearn, but to use all of the data. Thus, we do not use the train and test subsets given to us in sklearn, but will create our own such subsets, adding a validation subset.

Also note that we remove the headers, footers and quotes, as proposed in the <a href="http://scikit-learn.org/stable/datasets/twenty_newsgroups.html">sklearn tutorial</a> of the dataset, so as to have something more realistic and without any of the metadata. Note also that we did not use the *sklearn.datasets.fetch_20newsgroups_vectorized* function that returns the TF-IDF features directly, as it would defeat the purpose of the exercise.

In [None]:
# create the TF-IDF vectorizer
tfidf = TfidfVectorizer()

In [None]:
# Import the data we need to use the vectorizer on. Remove metadata as proposed by the scikit-learn tutorial
newsgroups_all = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

As asked in the question, before separating into subsets, we will use the vectorizer on the complete set.

In [None]:
# vectors is a sparse matrix
vectors = tfidf.fit_transform(newsgroups_all.data)

Now we need to separate the dataset into three sets: train, test and validation.
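
The two-step split used below can be sanity-checked on toy arrays. In this sketch the data (`X_toy`, `y_toy`) is made up; only the `train_test_split` calls mirror the ones in the next cell. Holding out 20% and then halving that part gives an 80/10/10 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the feature matrix and the labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 4

# Step 1: keep 80% for training, set 20% aside
y_tr, y_rest, X_tr, X_rest = train_test_split(y_toy, X_toy, test_size=0.2, random_state=1)
# Step 2: cut the held-out part in half to get test and validation sets
y_te, y_va, X_te, X_va = train_test_split(y_rest, X_rest, test_size=0.5, random_state=1)

print(len(y_tr), len(y_va), len(y_te))  # 80 10 10
```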

In [None]:
# first we separate train from the rest. random_state is given to have a seed.
newsgroups_train, newsgroups_inter, vect_train, vect_inter = \
    train_test_split(newsgroups_all.target, vectors, test_size=0.2, random_state=1)

# then we separate again to get validation and test separately
newsgroups_test, newsgroups_valid, vect_test, vect_valid = \
    train_test_split(newsgroups_inter, vect_inter, test_size=0.5, random_state=1)

## 2.

Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimator "n_estimators" and the max depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.


Now we need to train a random forest on our training set. For this, we will use the RandomForestClassifier, as it contains the parameters talked about in the exercise. But first, we need to ask ourselves what we want to set the parameters (*max_depth* and *n_estimators*) to.

According to the ADA course, we know that the number of trees will be in the 10's and the depth will be between 20 and 30. For this first attempt on the training set, we set *n_estimators* to 100 and *max_depth* to 25, as in the cell below.

For the predictions, we can't use the training set, as we just trained on it and thus would get very good results regardless. So prediction has to be on the validation set.

In [None]:
# need to find estimators and depth first. We use random_state to have a seed again.
clf = RandomForestClassifier(n_estimators=100, max_depth=25, random_state=1)
clf.fit(vect_train, newsgroups_train)
pred = clf.predict(vect_valid)
metrics.f1_score(newsgroups_valid, pred, average='macro')

As we can see, predictions aren't that great.

We try to fine-tune on the validation set. To do this, we use the GridSearchCV implemented in sklearn. We first chose the number of estimators between 100 and 1000 and a depth between 20 and 30, as that is what we saw during the lessons, but since the best parameters were at the upper limits (30 and 1000), we decided to check whether this would still be the case with larger upper limits (35 and 1500).

Also, as we have already our own training, validation and test sets, we need to use *PredefinedSplit* in the GridSearch.

Please note that the fit takes a long time to compute, as there is a very large number of estimators.
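
The *PredefinedSplit* setup mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not the exact configuration of the notebook: the data set, the small parameter grid and the names `X_both`, `y_both` and `search` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic stand-in for the TF-IDF matrix and the labels (hypothetical data)
X_syn, y_syn = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=1)
X_fit, X_val, y_fit, y_val = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

# -1 marks rows that are only ever trained on, 0 marks the single validation fold
test_fold = np.concatenate([np.full(len(y_fit), -1), np.zeros(len(y_val), dtype=int)])
split = PredefinedSplit(test_fold)

# GridSearchCV needs one stacked data set; the split tells it which rows play which role
X_both = np.vstack([X_fit, X_val])
y_both = np.concatenate([y_fit, y_val])

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={'n_estimators': [10, 50], 'max_depth': [5, 10]},
    scoring='f1_macro',
    cv=split,
)
search.fit(X_both, y_both)
print(search.best_params_)
```

With this setup every parameter combination is trained on the training rows and scored on the validation rows only. Two caveats: with the default `refit=True`, the best model is refitted on the stacked training and validation rows, and for a sparse TF-IDF matrix the stacking would need `scipy.sparse.vstack` instead of `np.vstack`.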

In [None]:
param_grid = { 
    'n_estimators': [1000],
    'max_depth': [250, 500]
}

CV_rfc = GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1)

In [None]:
CV_rfc.fit(vect_valid, newsgroups_valid)

In [None]:
print(CV_rfc.best_params_)

In [None]:
clf = RandomForestClassifier(n_estimators=1000, max_depth=250, random_state=1, n_jobs=-1)
clf.fit(vect_train, newsgroups_train)
pred = clf.predict(vect_valid)
metrics.f1_score(newsgroups_valid, pred, average='macro')

As we can see, the best results are obtained when *n_estimators* is set around X and *max_depth* is set to X. 

Now we do a confusion matrix on the test set.
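
A minimal sketch of how such a matrix is computed with `metrics.confusion_matrix`, on made-up labels (the `y_true` and `y_pred` lists below are hypothetical). For the notebook, the analogous call would presumably be `metrics.confusion_matrix(newsgroups_test, clf.predict(vect_test))`, assuming the fitted `clf` from above.

```python
from sklearn import metrics

# Hypothetical true and predicted labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
```

The diagonal counts the correct predictions; an off-diagonal entry at row `i`, column `j` counts samples of class `i` that were predicted as class `j`, which shows which newsgroups get confused with each other.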

Now, let us inspect the `feature_importances_` attribute of our random forest.