In [None]:
#imports (same as tuto ML)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import OneHotEncoder # is this really needed ?
from pandas.plotting import scatter_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split, GridSearchCV, PredefinedSplit

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

%matplotlib inline

We define custom chart drawing functions we reuse a lot:

In [None]:
#we use this a lot :D
def drawPies(u_rates, t_rates, labels, supertitle):
    """draws pretty comparative pie charts
    u_rates : rates for untreated group
    t_rates : rates for treated group
    labels: labels for values in rates
    supertitle: title of chart
    """
    fig = plt.figure(figsize=(7, 7))
    fig.suptitle(supertitle)
    
    plt.subplot(2,2,1)
    plt.pie(u_rates, labels = labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.axis('equal')
    plt.title("untreated group")

    plt.subplot(2,2,2)
    plt.pie(t_rates, labels = labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.axis('equal')
    plt.title("treated group")
    plt.show()

# Question 1: Propensity score matching

We perform a naive data analysis using plots and numbers.

In [None]:
#import the data set
lalonde_df = pd.read_csv('lalonde.csv')
#give a first look
lalonde_df.head()

### 1. A naive analysis

We assume that a naive researcher unfamiliar with observational studies would treat the data as if it was a randomized trial, not taking into consideration the hidden correlates.

We can easily imagine that the first thing he would do is split the salary (_['re78']_) data into 2 sets: treated and untreated.

In [None]:
#masks to be used a lot later
treated = (lambda x: x.treat == 1)
untreated = (lambda x: x.treat == 0)

#apply masks to get treated and untreated
treated_salary = lalonde_df[treated(lalonde_df)]['re78']
untreated_salary = lalonde_df[untreated(lalonde_df)]['re78']

**i - Describing the numbers**

We first look at the numbers to see how many subjects we have in each group and how the values are distributed.

In [None]:
lalonde_df.groupby('treat')['re78'].describe()

From the numbers above, we extract the following information:

- The untreated group has more people.
- The untreated group's salaries have a higher mean.
- However, the max salary in the treated group is 3x higher! The 1st quartile is also two times higher in the treated group.
- Finally, the second and third quartiles are higher in the untreated group. Quartiles are more resistant to outliers than the mean, so we should give these values more weight.
- The interquartile range is larger in the untreated group. As the data contains outliers, this is a better measure of spread than the variance.
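
To illustrate why we lean on quartile-based measures here, the following minimal sketch compares the standard deviation and the interquartile range before and after adding one extreme value. The numbers are made up for illustration and are not taken from `lalonde.csv`; the helper `iqr` is a hypothetical name.

```python
import numpy as np

# toy salaries, purely illustrative (not from lalonde.csv)
salaries = np.array([1000, 2000, 3000, 4000, 5000], dtype=float)
with_outlier = np.append(salaries, 60000.0)

def iqr(values):
    """interquartile range: distance between the 3rd and the 1st quartile"""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

print('std without / with outlier:', salaries.std(), with_outlier.std())
print('IQR without / with outlier:', iqr(salaries), iqr(with_outlier))
```

A single extreme salary inflates the standard deviation many times over, while the interquartile range barely moves.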

**ii - Visualizing the data:**

We now plot the final salary data in a histogram to find the distribution of salaries of the two groups. We add weights so we can look at percentages instead of number of people in both bins, as the number of people in the two groups is not equal.

In [None]:
plt.figure(figsize=(10, 4))
#define same bin size
bins = np.linspace(0, max(lalonde_df['re78']), 50)
#add weights to get percentages
plt.hist(untreated_salary, weights=np.ones(len(untreated_salary))/len(untreated_salary), alpha=.5 , bins=bins)
plt.hist(treated_salary, weights=np.ones(len(treated_salary))/len(treated_salary), alpha=.5, bins=bins)
plt.title('Histogram of salary for treated and untreated groups')
plt.legend(['untreated', 'treated'])
plt.xlabel('Yearly salary')
plt.ylabel('percentage of subjects')
plt.show()
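
The weighting trick used for the histogram can be checked in isolation. The sketch below uses two made-up groups of different sizes and `np.histogram` instead of a plot; the helper name `relative_frequencies` is an assumption introduced only for this example.

```python
import numpy as np

# two toy groups of different sizes (illustrative values only)
small_group = np.array([1.0, 2.0, 2.5, 7.0])
large_group = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0])
bins = np.linspace(0, 8, 5)  # shared bin edges for both groups

def relative_frequencies(values, bins):
    """histogram heights where each subject weighs 1/n, so the heights sum to 1"""
    weights = np.ones(len(values)) / len(values)
    heights, _ = np.histogram(values, bins=bins, weights=weights)
    return heights

small_freq = relative_frequencies(small_group, bins)
large_freq = relative_frequencies(large_group, bins)
print(small_freq, small_freq.sum())
print(large_freq, large_freq.sum())
```

Because every subject contributes 1/n, both sets of bar heights sum to 1 and can be compared directly even though the groups differ in size.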

##### First insights:

By looking at the graph, we see a very similar distribution for both groups, except that outliers are present in the treated group.

We also note that there are relatively more subjects in the untreated group with a salary between 10k and 20k, while both groups have a similar ratio of subjects in the <10k section of the graph.

**iii - Boxplot:**

A boxplot illustrates more concisely the five-number summary we presented first.

In [None]:
plt.boxplot([treated_salary, untreated_salary], labels=['treated', 'untreated'])
plt.title('Distribution of salary by treated')
plt.ylabel('Salary')
plt.show()

**Conclusion**:

If the treatment was effective, we should see that the treated group is more successful on average, as they were placed in a program, whereas the untreated group was left to fend for themselves.

By merging all of the insights the researcher has drawn from the 3 steps of his analysis, he can conclude that **the treatment shows no effect**. The salary distributions are similar in both cases, indicating that the treatment isn't effective.

Additionally, the treated group has on average a lower salary (except for the handful of people who get lucky and find a good job). This is shown by the boxplot: the whiskers extend higher in the untreated group, and the median and lower whisker are situated higher up. Since the difference is small, this may just be due to chance.


### 2. A closer look at the data

After performing a **simplistic** analysis of the data ignoring underlying factors –such as race and education– that could influence the outcome, we start looking at the whole table assuming the other features will impact _['re78']_.

We split our analysis into **categorical** and **interval** data.

**i - Categorical data :**

Regarding categorical data, we should look at rates (which makes much more sense than looking at raw counts). Thus we compute the rates for race, degree and marriage in each treatment group to be able to compare them.

In [None]:
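#as the values are binary the mean is equal to the percentage of occurrence
#numeric_only skips any non-numeric column (such as a string 'id'), which recent pandas versions refuse to average
percentages = lalonde_df.groupby('treat').mean(numeric_only=True)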
percentages
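
The claim that the mean of a 0/1 column equals the rate of ones can be verified on a toy table. The values below are made up; only the column names `treat` and `married` mirror the dataset.

```python
import pandas as pd

# toy table with the same kind of 0/1 indicator columns (values are made up)
toy = pd.DataFrame({
    'treat':   [0, 0, 0, 0, 1, 1],
    'married': [1, 1, 0, 1, 0, 1],
})

rate_by_mean = toy.groupby('treat')['married'].mean()
# same quantity computed explicitly: number of ones divided by group size
rate_by_count = toy.groupby('treat')['married'].sum() / toy.groupby('treat').size()
print(rate_by_mean)
print(rate_by_count)
```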

##### a. Race ratios:

We will start with race. As we do not have a column for "White" participants, we take the rates of "Black" and "Hispanic" participants in each treatment group and subtract them from the total. We then compare the rates of each race using pie charts.

In [None]:
black_u, black_t = percentages['black']
hispan_u, hispan_t = percentages['hispan']
#there is no overlap in the hispan and black categories, 
#we assume people that are neither are white (which we checked, it is the case)
white_u, white_t = (1 - black_u - hispan_u, 1 - black_t - hispan_t)

In [None]:
u_race_rates = [black_u, hispan_u, white_u]
t_race_rates = [black_t, hispan_t, white_t]
#name the labels
race_labels = 'Black', 'Hispanic', 'White'

In [None]:
drawPies(u_race_rates, t_race_rates, race_labels, 'Racial groups in percent by treatment')

We see that there are far more black subjects in the treated group than in the untreated group.

##### b. Degree ratios:

To have a better understanding of the difference of salaries, we also need to look at the level of education of the participants of each treatment.

In [None]:
degree_u, degree_t = percentages['nodegree']

In [None]:
#'nodegree' is 1 for subjects without a degree, so its mean is the rate of subjects with no degree
u_degree_rates = [degree_u, 1 - degree_u]
t_degree_rates = [degree_t, 1 - degree_t]
degree_labels = 'No degree', 'Degree'

In [None]:
#draw pie diagram
drawPies(u_degree_rates, t_degree_rates, degree_labels, 'Percentage of individuals with a degree, by treatment')

We see that the treated group is less educated, by a difference of over 10 percentage points.

##### c. Marriage ratios:
Finally, we look at the rates of married people among both groups as it is our last feature. 

In [None]:
#married and unmarried by treatment
married_u, married_t = percentages['married']
not_married_u, not_married_t = (1 - married_u, 1 - married_t)

In [None]:
u_marriage_rates = [married_u, not_married_u]
t_marriage_rates = [married_t, not_married_t]
mariage_labels = ['Married', 'Not married']

In [None]:
drawPies(u_marriage_rates, t_marriage_rates, mariage_labels, 'Percentage of married individuals by treatment')

We again note that the treated group contains fewer married individuals.
Marriage can be an indicator of stability and thus indicate how likely somebody is to perform consistently well on the job.

##### d. Unemployment ratios:

Even though salaries are not categories but intervals, it is important to compare unemployment rates between both groups (which we define as categories, employed and unemployed). We count a subject with zero earnings in a given year as unemployed. To get better insights, we will plot the years 1974 and 1975.

In [None]:
#yearly earnings columns to turn into employed/unemployed categories
salaries = ['re74', 're75']
cat_salaries = lalonde_df.copy()
unemployed_labels = 'Employed', 'Unemployed'
#for each year draw a pie chart
for sal in salaries:
    #zero earnings count as unemployed (0), anything else as employed (1)
    cat_salaries[sal] = cat_salaries[sal].map(lambda x : 0 if x == 0 else 1)
    #the mean of the 0/1 column is the employment rate
    u_employed, t_employed = cat_salaries.groupby('treat')[sal].mean()
    #rates are passed in the same order as the labels: employed first, unemployed second
    drawPies([u_employed, 1-u_employed],[t_employed, 1-t_employed], unemployed_labels, 'Unemployment rates by treatment in 19'+sal[-2:] )

We see that our assumption that both groups are balanced pre-treatment is wrong for unemployment as well.
There are many more unemployed people in the treated group than in the untreated group. Someone who already has a job will be able to move up the ladder more easily, skewing the results.

**Conclusion:**

By looking at the categorical data, we can say that the underlying factors between the two groups are not similar at all.
The treated group is significantly more black, less educated, less employed and less married. All these factors influence employment and should be taken into consideration.

**ii - Interval data :**

We look at the non-binary data and their distribution.

To do this, we first draw a box plot and a relative frequency histogram for each interval variable:

In [None]:
#defining list of non binary variables
intervals = ['age', 'educ', 're74', 're75']

#for each column draw a boxplot and a histogram
for col in intervals:
    plt.figure(figsize=(10, 10))
    treated_ = lalonde_df[treated(lalonde_df)][col]
    untreated_ = lalonde_df[untreated(lalonde_df)][col]
    
    #boxplot
    plt.subplot(2,2,1)
    plt.title("Boxplot of " + col)
    plt.boxplot([untreated_, treated_], 
                labels=['untreated', 'treated'])
    plt.ylabel(col)
    
    #histogram
    plt.subplot(2,2,2)
    bins = np.linspace(min(lalonde_df[col]), max(lalonde_df[col]), 50)
    plt.title("Relative frequency histogram of " + col)
    plt.ylabel('percentage')
    plt.xlabel(col)
    plt.hist(untreated_, weights=np.ones(len(untreated_))/len(untreated_), alpha=.5 , bins=bins)
    plt.hist(treated_, weights=np.ones(len(treated_))/len(treated_), alpha=.5, bins=bins)
    plt.legend(['untreated', 'treated'])
    

We see that, pre-treatment, the salaries are very unbalanced, with the treatment group earning much less than the untreated group.

We can also observe a different age distribution in the two groups, the treated group being a bit younger and containing a lot of individuals in their 20s.

##### a. Evolution:

Even though it is useful to plot the salaries to see the difference between the years, it is much more useful to understand how the salary of each participant has changed over the years. To do so, we will visualize our data using a parallel plot. 

In [None]:
#Implement parallel plot
from pandas.plotting import parallel_coordinates
parallel_coordinates(lalonde_df[untreated(lalonde_df)][['id','re74', 're75', 're78']], 'id', color='Blue', alpha=0.5)
parplot = parallel_coordinates(lalonde_df[treated(lalonde_df)][['id','re74', 're75', 're78']], 'id', color='Orange' , alpha=0.7)
#remove legend for readability
parplot.legend_.remove()
plt.title('Salary over time for each participant')
plt.xlabel('Year')
plt.ylabel('Annual salary')

We see that:
- the treated group started out with a lower salary
- 1975 was a bad year for everybody, treated or untreated.
- the outliers are partly people who were already well paid in 1974, partly people who 'made it'.
- there is a lot of upward movement for the treated group between 1975 and 1978

##### b. Salary by race, education and marital status

By race:

In [None]:
#defining mask for white
white = (lalonde_df['black'] == 0) & (lalonde_df['hispan'] == 0)

In [None]:
years = ['re74', 're75', 're78']
for year in years:
    plt.boxplot([lalonde_df[white][year], lalonde_df[lalonde_df['black'] == 1][year],
                 lalonde_df[lalonde_df['hispan'] == 1][year]], 
                labels=['white', 'black', 'hispanic'])
    plt.title('salary distribution by race in 19'+year[-2:])
    plt.ylabel('annual earnings')
    plt.show()

We see that 
- there is a racial discrepancy in salary in our dataset
- the outliers are all black individuals

By marital status and degree:

In [None]:
for year in years:
    #for every year we plot marriage and degree
    plt.figure(figsize=(15,10))
    plt.subplot(2,2,1)
    plt.title('unmarried vs married salary in 19'+year[-2:])
    sns.boxplot(data=lalonde_df, x='married', y=year, hue='treat')
    plt.subplot(2,2,2)
    plt.title('degree vs nodegree salary in 19'+year[-2:])
    sns.boxplot(data=lalonde_df, x='nodegree', y=year, hue='treat')

In [None]:
for year in years:
    #factorplot was renamed catplot (and size became height) in newer seaborn versions
    sns.catplot(data=lalonde_df, x='age', y=year, hue='treat', kind='point', aspect=4, height=3)
    plt.title('salary by age and treatment in 19'+year[-2:])
    plt.yticks(np.linspace(0, 40000, 5))
    plt.show()

- it's easier to find a job when you're young: young people easily catch up over the years.
- untreated subjects over 40 are largely unemployed
- the treatment group catches up, so the treatment seems to have an effect
- 1975 was a bad year, with low salaries in general (look at the y axis)

**Conclusion:**

By looking at the interval data, and more specifically at the salaries of participants, we can say that underlying conditions such as race, education and marital status influence the salary.
As our two groups are not balanced, this interferes with our analysis.

**iii - Correlation data :**

After working on each value alone, we want to understand how each value is linked to the others, for each pair of features. We look at the pairplot rather than at correlation coefficients, because the data is clearly not linearly dependent and correlation by itself would not give us any insights.

We note that we can somewhat separate the two groups, indicating that they are not the same.

In [None]:
sns.pairplot(lalonde_df[['treat', 're78']+intervals], markers='+', hue='treat')

### 3. A propensity score model

To create a fair set to use in our observational study, we calculate the propensity score based on the underlying factors before treatment:

[age, educ, black, hispan, married, nodegree, re74, re75]

In [None]:
prop_table = lalonde_df.copy() #otherwise we modify lalonde_df when we modify prop_table

In [None]:
#create our target and training data:

X = prop_table.iloc[:, 2:-1] #columns 'age' to 're75'
y = prop_table.iloc[:, 1:2] #treated or not
y = np.ravel(y) #flatten array
print('First elements of Y : \n', y[0:5],'\nFirst elements of X\n', X[0:5])

In [None]:
#define our model
logistic = LogisticRegression()
logistic.fit(X, y)
print('Accuracy of prediction: ',logistic.score(X, y))

In [None]:
print("Example of prediction : ", logistic.predict(X[0:6]), ' reality :', y[0:6])
print('Example of prediction in percent : \n', logistic.predict_proba(X[0:6]))

In [None]:
#get propensity scores, i.e. the predicted probability of being in the treated group
prop_table['propensity_scores'] = pd.Series(logistic.predict_proba(X)[:,1])

In [None]:
prop_table.head()
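
As a side note, earnings columns are orders of magnitude larger than the binary indicators, which can slow down the solver of a regularised logistic regression. The sketch below shows one possible variant that standardises the features first. It runs on synthetic data (not the Lalonde table), and the pipeline with `StandardScaler` and `max_iter=1000` is our assumption, not what the cells above do.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the covariates (NOT the lalonde data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'age': rng.integers(18, 55, size=n),
    're75': rng.exponential(5000, size=n),
})
# treatment assignment depends on the covariates, as in an observational study
logit = 1.0 - 0.03 * toy['age'] - 0.0002 * toy['re75']
toy['treat'] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# scaling first helps the solver converge and puts all columns on a comparable scale
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(toy[['age', 're75']], toy['treat'])

# column 1 of predict_proba is the estimated probability of treat == 1
toy['propensity_scores'] = model.predict_proba(toy[['age', 're75']])[:, 1]
print(toy.head())
```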

We now use the propensity scores to find a matching.

### 4. Balancing the dataset via matching

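Matching the two groups is equivalent to finding a matching in a bipartite graph, with treated subjects on one side, untreated subjects on the other, and edges weighted by how close the propensity scores are.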

In [None]:
import networkx as nx
B = nx.Graph()
#1. Create graph with the subject ids as nodes
B.add_nodes_from(prop_table['id'])

In [None]:
# 2. Add edges from each treated to each untreated subject
#    with the weight of each edge based on the difference between the two propensity scores
for row_i in prop_table[treated(prop_table)].iterrows():
    for row_j in prop_table[untreated(prop_table)].iterrows():
        B.add_edge(row_i[1]['id'],row_j[1]['id'], 
                   #1 - x turns the minimisation of differences into a maximisation problem
                   weight= 1 - np.abs(row_i[1].propensity_scores - row_j[1].propensity_scores))

In [None]:
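#3. Find matching
matching = nx.max_weight_matching(B)
#newer networkx versions return a set of (u, v) pairs instead of a dict, so we build a symmetric id -> partner dict
matching_dict = dict(matching) if isinstance(matching, dict) else {a: b for u, v in matching for a, b in ((u, v), (v, u))}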

In [None]:
print('Example matches:')
list(matching_dict.items())[:5]
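
For comparison, the same one-to-one matching idea can be sketched with SciPy's `linear_sum_assignment`, which minimises the total cost directly instead of maximising `1 - difference`. This is a swapped-in technique shown only for illustration, with made-up propensity scores; it is not what the cells above run.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# made-up propensity scores for 2 treated and 4 untreated subjects
treated_scores = np.array([0.80, 0.30])
untreated_scores = np.array([0.25, 0.50, 0.85, 0.10])

# cost[i, j] = absolute score difference between treated i and untreated j
cost = np.abs(treated_scores[:, None] - untreated_scores[None, :])

# minimises the total cost; each treated subject gets a distinct untreated partner
treated_idx, untreated_idx = linear_sum_assignment(cost)
for i, j in zip(treated_idx, untreated_idx):
    print(f'treated {i} <-> untreated {j}, difference {cost[i, j]:.2f}')
```

Working on a dense cost matrix avoids the nested `iterrows` loops, at the price of leaving the graph formulation.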

In [None]:
#get matching
matched = prop_table.copy()[prop_table['id'].isin(matching_dict)]
print('we have : ',len(matched)/2, ' matched subjects') #each pair contributes two rows (one treated, one untreated), so we halve the count

In [None]:
#compare the means of the treated and untreated groups after matching
matched.groupby('treat').mean(numeric_only=True)

In [None]:
sns.boxplot(data=matched, x='treat', y='re78') # this one is similar! good!

In [None]:
matched[matched['black'] == 1].groupby('treat')['id'].count() #unbalanced

In [None]:
matched[matched['hispan'] == 1].groupby('treat')['id'].count()

In [None]:
sns.pairplot(prop_table[intervals+['treat']], markers='+', hue='treat')

### 5. Balancing the groups further


In [None]:
def compare_groups(table) :
    """mean of each covariate in the matched pairs (_y columns: untreated, _x columns: treated)"""
    #row label -> column name in the merged table (without suffix)
    cols = {'age': 'age', 'educ': 'educ', 'black': 'black', 'hisp': 'hispan',
            'married': 'married', 'no_degree': 'nodegree'}
    #build the table in one go instead of chained assignment, which pandas may silently ignore
    return pd.DataFrame({
        'untreated': {name: table[col + '_y'].mean() for name, col in cols.items()},
        'treated': {name: table[col + '_x'].mean() for name, col in cols.items()},
    }, index=list(cols))

We note that there are still far more black subjects in the treated group than in the untreated group.
Additionally, we still have outliers in the treated group.

We try to balance both groups by removing the pairs in which a black treated subject is matched with a non-black untreated subject.

In [None]:
matched['match'] = matched['id'].map(matching_dict)
balanced_match = matched[treated(matched)].merge(matched[untreated(matched)], left_on='id', right_on='match')
balanced_match['difference'] = abs(balanced_match['propensity_scores_x'] - balanced_match['propensity_scores_y'])
print('we have : ', len(balanced_match), ' matched subjects')
balanced_match.head()

In [None]:
#pairs where the treated subject is black and the untreated one is white (race_bool_1) or hispanic (race_bool_2)
race_bool_1 = (balanced_match['black_x'] == 1) & (balanced_match['black_y'] == 0) & (balanced_match['hispan_y'] == 0)
race_bool_2 = (balanced_match['black_x'] == 1) & (balanced_match['black_y'] == 0) & (balanced_match['hispan_y'] == 1)
balanced_match = balanced_match.drop(balanced_match[race_bool_1 | race_bool_2].index)
print('we have : ', len(balanced_match), ' matched subjects')

In [None]:
race_u = balanced_match.groupby('black_y')['id_y'].count()
u_race_rates = (race_u[0]/race_u.sum(), race_u[1]/race_u.sum())
race_t = balanced_match.groupby('black_x')['id_x'].count()
t_race_rates = (race_t[0]/race_t.sum(), race_t[1]/race_t.sum())
race_labels = ('other race', 'black') #same order as the rates: index 0 is non-black, index 1 is black

In [None]:
compare_groups(balanced_match)

In [None]:
#pairs where exactly one of the two subjects is married
marriage_bool_1 = (balanced_match['married_x'] == 1) & (balanced_match['married_y'] == 0)
marriage_bool_2 = (balanced_match['married_y'] == 1) & (balanced_match['married_x'] == 0)
balanced_match = balanced_match.drop(balanced_match[marriage_bool_1 | marriage_bool_2].index)
print('we have : ', len(balanced_match), ' matched subjects')

In [None]:
marriage_u = balanced_match.groupby('married_y')['id_y'].count()
u_marriage_rates = (marriage_u[0]/marriage_u.sum(), marriage_u[1]/marriage_u.sum())
marriage_t = balanced_match.groupby('married_x')['id_x'].count()
t_marriage_rates = (marriage_t[0]/marriage_t.sum(), marriage_t[1]/marriage_t.sum())
marriage_labels = ('not married', 'married') #same order as the rates: index 0 is not married, index 1 is married

In [None]:
compare_groups(balanced_match)

In [None]:
#removing outliers
balanced_match = balanced_match.drop(balanced_match[balanced_match.re78_x > 30000].index)

In [None]:
compare_groups(balanced_match)

In [None]:
plt.boxplot([balanced_match['re78_x'], balanced_match['re78_y']], labels=['treated', 'untreated'])
plt.show()

In [None]:
stats = pd.concat([balanced_match.re78_y.describe(),balanced_match.re78_x.describe()], axis=1)
stats.columns = ['untreated', 'treated']
stats

### 6. A less naive analysis

After controlling for underlying factors, we see that the treated population fares better than the untreated population.
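
One way to put a number on such a statement is to look at the per-pair difference in 1978 earnings and at its uncertainty. The sketch below does this with a simple bootstrap on made-up outcomes for six pairs; these are not results from the notebook, and the bootstrap is our suggestion rather than part of the analysis above.

```python
import numpy as np

# made-up outcomes for 6 matched pairs (treated, untreated); not real results
re78_treated = np.array([9000., 4000., 12000., 0., 7000., 5000.])
re78_untreated = np.array([6000., 4500., 8000., 1000., 7000., 2000.])

# one difference per matched pair
differences = re78_treated - re78_untreated
mean_difference = differences.mean()

# bootstrap: resample pairs with replacement to gauge the uncertainty of the mean
rng = np.random.default_rng(0)
boot_means = np.array([
    rng.choice(differences, size=len(differences), replace=True).mean()
    for _ in range(2000)
])
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f'mean difference: {mean_difference:.0f}, 95% bootstrap interval: [{low:.0f}, {high:.0f}]')
```

With the table built above, the differences would be `balanced_match['re78_x'] - balanced_match['re78_y']`. An interval that includes zero would suggest that the observed advantage could still be due to chance.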

# Question 2: Applied ML

First, we need to compute the TF-IDF features of our dataset, using a vectorizer. As we understood the question, what was asked was not to use any of the given datasets from sklearn, but to use all of the data. Thus, we do not use the train and test subsets given to us in sklearn, but will create our own such subsets, adding a validation subset.

Also note that we remove the headers, footers and quotes, as proposed in the <a href="http://scikit-learn.org/stable/datasets/twenty_newsgroups.html">sklearn tutorial</a> of the dataset, so as to have something more realistic and without any of the metadata. Note also that we did not use the *sklearn.datasets.fetch_20newsgroups_vectorized* function that returns the TF-IDF features directly, as it would defeat the purpose of the exercise.

In [None]:
# create the TF-IDF vectorizer
tfidf = TfidfVectorizer()

In [None]:
# Import the data we need to use the vectorizer on. Remove metadata as proposed by the scikit-learn tutorial
newsgroups_all = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

As asked in the question, before separating the data into subsets, we will use the vectorizer on the complete set.

In [None]:
# vectors is a sparse matrix
vectors = tfidf.fit_transform(newsgroups_all.data)

Now we need to separate the dataset into three sets: train, test and validation.

In [None]:
# first we separate train from the rest. random_state is given to have a seed.
newsgroups_train, newsgroups_inter, vect_train, vect_inter = \
    train_test_split(newsgroups_all.target, vectors, test_size=0.2, random_state=1)

# then we split again to get validation and test separately
newsgroups_test, newsgroups_valid, vect_test, vect_valid = \
    train_test_split(newsgroups_inter, vect_inter, test_size=0.5, random_state=1)
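
The two-step split can be sanity-checked on dummy data. The sketch below uses 100 placeholder samples instead of the TF-IDF matrix and confirms that chaining `test_size=0.2` and `test_size=0.5` yields an 80/10/10 partition with no overlap; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy samples standing in for the vectorized documents
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 4

# 80% train, 20% held out
X_train, X_inter, y_train, y_inter = train_test_split(X, y, test_size=0.2, random_state=1)
# the held-out 20% is halved into test and validation (10% each)
X_test, X_valid, y_test, y_valid = train_test_split(X_inter, y_inter, test_size=0.5, random_state=1)

print(len(X_train), len(X_test), len(X_valid))
```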

## 2.

Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimator "n_estimators" and the max depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.


Now we need to train a random forest on our training set. For this, we will use the RandomForestClassifier, as it exposes the parameters mentioned in the exercise. But first, we need to decide what to set the parameters (*max_depth* and *n_estimators*) to.

According to the ADA course, the number of trees is typically in the tens and the depth between 20 and 30. For this first model, we set *n_estimators* to 100 and *max_depth* to 25.

We can't evaluate predictions on the training set, as we just trained on it and would get very good results regardless. So the prediction has to be made on the validation set.

In [None]:
# need to find estimators and depth first. We use random_state to have a seed again.
clf = RandomForestClassifier(n_estimators=100, max_depth=25, random_state=1)
clf.fit(vect_train, newsgroups_train)
pred = clf.predict(vect_valid)
metrics.f1_score(newsgroups_valid, pred, average='macro')

As we can see, predictions aren't that great.

We try to fine-tune on the validation set. To do this, we use the GridSearchCV implemented in sklearn. We first chose the number of estimators between 100 and 1000 and a depth between 20 and 30, as this is what we have seen during the lessons. Seeing that the best parameters were at the upper limit (30 and 1000), we decided to check whether this would still hold with a larger upper limit (35 and 1500).

Also, as we already have our own training, validation and test sets, we need to use *PredefinedSplit* in the grid search.

Please note that the fit takes a long time to compute, as there is a very large number of estimators.
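
A minimal sketch of how *PredefinedSplit* can be wired into a grid search is given below. It uses a small synthetic dataset and a deliberately tiny grid so that it runs quickly; the data, grid values and variable names are assumptions made for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# small synthetic problem standing in for the TF-IDF vectors
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)

# stack train and validation; -1 means "always train", 0 means "validation fold 0"
X_both = np.vstack([X_train, X_valid])  # for sparse matrices, scipy.sparse.vstack would be needed
y_both = np.concatenate([y_train, y_valid])
test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_valid), dtype=int)])
split = PredefinedSplit(test_fold)

# illustrative grid, much smaller than a realistic one
param_grid = {'n_estimators': [5, 10], 'max_depth': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=split)
search.fit(X_both, y_both)
print(search.best_params_)
```

With this setup every parameter combination is trained on the training rows and scored on the validation rows exactly once. Note that, with the default `refit=True`, the best model is afterwards refitted on the stacked training and validation data.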

In [None]:
param_grid = { 
    'n_estimators': [1000],
    'max_depth': [250, 500]
}

CV_rfc = GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1)

In [None]:
CV_rfc.fit(vect_valid, newsgroups_valid)

In [None]:
print(CV_rfc.best_params_)

In [None]:
clf = RandomForestClassifier(n_estimators=1000, max_depth=250, random_state=1, n_jobs=-1)
clf.fit(vect_train, newsgroups_train)
pred = clf.predict(vect_valid)
metrics.f1_score(newsgroups_valid, pred, average='macro')

As we can see, the best results are obtained when *n_estimators* is set around X and *max_depth* is set to X.

Now we do a confusion matrix on the test set.
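
As a reminder of how to read such a matrix, here is a minimal sketch on made-up labels for three classes (not the newsgroups predictions), using `metrics.confusion_matrix`.

```python
import numpy as np
from sklearn import metrics

# toy labels for 3 classes (made up, not the newsgroups predictions)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# rows are the true classes, columns the predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# the diagonal holds the correctly classified samples
accuracy = np.trace(cm) / cm.sum()
print('accuracy:', accuracy)
```

For the 20 newsgroups, the resulting 20x20 array could be passed to `sns.heatmap(cm, annot=True)` to make the most confused pairs of classes easier to spot.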

Now, let us inspect the `feature_importances_` attribute of our random forest.