In [143]:
#Importing relevant packages
import numpy as np 
import pandas as pd
from tqdm import tqdm #to create a progress bar

#Packages for NLP
import nltk
from nltk.tokenize import TweetTokenizer

#Machine learning packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

#Packages to create DFM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

#Packages for cross-validation and parameter tuning
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

#Packages for getting model performance metrics
from sklearn.metrics import ConfusionMatrixDisplay #plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score

#Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Supervised learning 

For this exercise lab, the goal is to apply supervised machine learning to predict whether tweets were written by Republicans or by Democrats. 

The dataset is the same as the one we used Monday and last week, containing tweets from US Members of Congress. The preprocessed version is also the same as the one used on Monday. 

We will be applying two supervised machine learning algorithms - random forest and lasso - and comparing which method performs best when predicting party affiliation from tweet text. 


### 1.1: Preparing the data for analysis 

1. Import the dataframe
2. Similar to last week, replace NaN values with an empty string in the stemmed text. Then use `groupby` and `agg` to group the data by politician (nominate_name), and aggregate the stemmed tweet text for each politician into one long string. 
3. Create a column with a binary label to show party affiliation. 0 if Republican, 1 if Democrat. 
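
Steps 2 and 3 could look roughly like the sketch below, shown on a toy dataframe. Only `nominate_name` comes from the exercise data; the column names `stemmed_text`, `party` and `label`, and the party codes `'R'`/`'D'`, are assumptions made for illustration and may differ in the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tweet-level dataframe
# (all column names except nominate_name are assumed for illustration)
tweets = pd.DataFrame({
    "nominate_name": ["SMITH", "SMITH", "JONES", "JONES"],
    "party": ["R", "R", "D", "D"],
    "stemmed_text": ["tax cut", np.nan, "health care", "climat chang"],
})

# Step 2: replace NaN with an empty string, then build one long string per politician
tweets["stemmed_text"] = tweets["stemmed_text"].fillna("")
politicians = (
    tweets.groupby("nominate_name")
    .agg({"party": "first", "stemmed_text": " ".join})
    .reset_index()
)

# Step 3: binary label, 0 if Republican, 1 if Democrat
politicians["label"] = (politicians["party"] == "D").astype(int)
print(politicians)
```

Filling NaN values first matters because `" ".join` fails on non-string values such as NaN.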


### 2: Creating a validation set and splitting features (X) and labels (y)

We'll pretend that we only know the party affiliation of 218 of the politicians. For the remaining 300 politicians, we pretend not to know the author's partisanship. Our goal is to train machine learning models on the 218 labeled observations in our training set to predict whether a politician is a Republican or a Democrat. Then we'll use a model fit to the training data to predict the label for the 300 unlabeled politicians.

1. Split the dataset into two: one labeled and one unlabeled. You can use `sample` on the aggregated dataframe to get a random sample of 300 politicians for the unlabeled dataset. The labeled dataset should be the remaining 218 politicians. 
2. Create a training set by splitting the labeled data into: X (the stemmed text) and y (the newly created binary label column). 

Shape of labeled dataset: (218, 6) 
Shape of unlabeled dataset: (300, 6)
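
One possible way to do the split is sketched below on a toy dataframe of 10 politicians, 6 of which are sampled as "unlabeled". The column names `stemmed_text` and `label` and the choice of `random_state=42` are assumptions for illustration.

```python
import pandas as pd

# Toy aggregated dataframe: one row per politician (column names are assumed)
df = pd.DataFrame({
    "nominate_name": [f"POL_{i}" for i in range(10)],
    "stemmed_text": [f"text {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

# Draw a random sample to play the role of the unlabeled data
unlabeled = df.sample(n=6, random_state=42)

# Everything that was not sampled becomes the labeled data
labeled = df.drop(unlabeled.index)

# Features and labels for training
X = labeled["stemmed_text"]
y = labeled["label"]
print(labeled.shape, unlabeled.shape)
```

Dropping by index guarantees that no politician ends up in both datasets. Fixing `random_state` makes the split reproducible between runs.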


### 4.1: RandomForest: Hyperparameter tuning

Now we begin with the supervised learning. First, we will train and tune a RandomForest classifier. Find the documentation for RandomForest here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html 

To use text data for prediction, we need to convert the data into a document-feature matrix. We will be comparing two methods of vectorization: a vectorizer using term frequencies and a transformation of those into tf-idf frequencies. 


Create a **pipeline** containing: 

1. `CountVectorizer`: 
    - Like Monday, we want to include both unigrams and bigrams. `CountVectorizer` can do this for us with the parameter `ngram_range`. 
    - Use the parameters `max_df` and `min_df` to remove very frequent words (those that appear in more than 99.9% of the documents) and very infrequent words (those that appear in fewer than 10% of the documents).
    - CountVectorizer has a built-in tokenizer. However, if you want to use the `TweetTokenizer` we used on Monday, you can override the default tokenization with your own function, like so: `CountVectorizer(tokenizer=your_tokenizer.tokenize)`. <br>


2. `TfidfTransformer()` <br>


3. `RandomForestClassifier()`


Create a **parameter-grid** containing: 

1. To easily test the use of either term frequencies or tf-idf frequencies as part of your hyperparameter-tuning, use the parameter `use_idf` in the `TfidfTransformer()` in the pipeline.
2. Experiment with the `max_features` parameter (the number of features to consider when splitting branches). To get a model that runs (fairly) quickly, try just three: [260,300,340]. Other parameters that could also be experimented with, but only if you have time, are `n_estimators` (number of trees) and `max_depth` (size of the trees).
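
Keys in a parameter grid for a pipeline follow the pattern `<step name>__<parameter name>` (two underscores). The sketch below illustrates the naming convention only. The step names `vect`, `tfidf` and `clf` and the tiny `max_features` values are arbitrary choices for this illustration, not the values to use in the exercise.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Step names ("vect", "tfidf", "clf") are arbitrary labels chosen for this sketch
toy_pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Grid keys combine the step name and the parameter name with a double underscore
toy_grid = {
    "tfidf__use_idf": [True, False],
    "clf__max_features": [2, 4],
}

# Every grid key must be a valid parameter name of the pipeline
valid_names = toy_pipeline.get_params().keys()
print(all(key in valid_names for key in toy_grid))
```

Note that with `use_idf=False`, `TfidfTransformer` still L2-normalizes each row by default, so the result is normalized term frequencies rather than raw counts.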


Use `StratifiedKFold` with 5 folds for **cross-validation**. This creates balanced distributions across the folds. 
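
To see what "balanced distributions across the folds" means, here is a small sketch with made-up labels (10 of class 0 and 5 of class 1). It counts how many observations of each class land in the held-out part of each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 10 observations of class 0 and 5 of class 1
y_toy = np.array([0] * 10 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; only y drives the stratification

cv_toy = StratifiedKFold(n_splits=5)

# Count the classes in the held-out part of each fold
fold_counts = [
    np.bincount(y_toy[test_idx], minlength=2).tolist()
    for _, test_idx in cv_toy.split(X_toy, y_toy)
]
print(fold_counts)
```

Each held-out fold keeps the 2:1 class ratio of the full toy dataset.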

Use `GridSearchCV` to **find the best RandomForest classifier**. Save the best performing model and compute the accuracy.

Investigate the results. Does the count vectorized data or the tf-idf vectorized data perform better? 


Note: `TfidfVectorizer` is equivalent to using `CountVectorizer` followed by `TfidfTransformer`. If you later know (e.g. for your exam) that you want tf-idf frequencies rather than plain term frequencies, you can use `TfidfVectorizer` directly. 
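
The equivalence can be checked on a small made-up corpus. This is a sketch using default settings for all three classes; the three documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up documents for illustration
docs = ["tax cut job", "health care job", "tax health"]

# Two-step route: counts first, then the tf-idf transformation
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two document-feature matrices
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```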


In [11]:
#Initializing the tokenizer I want to use
tweet_tokenizer = TweetTokenizer()

#Fill in the three pipeline steps
pipeline = Pipeline([ 
    ('#fill in here',#and here) , 
    ('#fill in here',#and here) ,
    ('#fill in here',#and here)
])

#Fill in the parameter values in the grid 
parameter_grid = {
    '#fill in here': #and here,
    '#fill in here': #and here
}

#Initializing a kfold with 5 folds
cv = StratifiedKFold(n_splits=5)

#Initializing the GridSearchCV
search = GridSearchCV(pipeline, parameter_grid, cv=cv, n_jobs = -1, verbose=10)

     

In [1]:
#Running the GridSearchCV

forest_result = search.fit( ) #Input your X and y

In [2]:
#Finding the best performing model and saving it 

#Viewing the parameters and accuracy of the best performing model 


### 4.2: Lasso: Supervised learning and hyperparameter tuning

Repeat the above steps to find the best performing lasso regression model. 

To implement lasso regression, we will use scikit-learn's `LogisticRegression` with `penalty = 'l1'` (the lasso penalty), `solver = 'saga'` (a solver that supports the L1 penalty), and `max_iter` increased to 1000 to give the solver more iterations to converge. 

The parameter to cross-validate will be `C`, the inverse of regularization strength (1/λ). Experiment with the parameter values [0.5, 1, 5].
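
Because `C` is the inverse of the regularization strength, a smaller `C` means a stronger penalty and more coefficients shrunk to exactly zero. The sketch below illustrates this on synthetic data from `make_classification`, not on the tweet data. The values `C=0.01` and `C=10`, the scaling step and `random_state=0` are assumptions chosen to make the contrast visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a document-feature matrix
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

# Fit the L1-penalized model with a strong and a weak penalty
nonzero = {}
for C in [0.01, 10]:
    model = LogisticRegression(
        penalty="l1", solver="saga", C=C, max_iter=1000, random_state=0
    )
    model.fit(X_toy, y_toy)
    nonzero[C] = int(np.count_nonzero(model.coef_))

# Number of non-zero coefficients for each value of C
print(nonzero)
```

The stronger penalty (small `C`) should leave fewer non-zero coefficients, which is what makes lasso useful for feature selection.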


In [20]:
#Initializing the tokenizer I want to use
tweet_tokenizer = TweetTokenizer()

#Initializing the pipeline
pipeline = Pipeline([ 
    ('#fill in here',#and here) , 
    ('#fill in here',#and here) ,
    ('#fill in here',#and here)
])

#Setting the parameter grid 
parameter_grid = {
    '#fill in here': #and here,
    '#fill in here': #and here
}

#Initializing a kfold with 5 folds
cv = StratifiedKFold(n_splits=5)

lasso_search = GridSearchCV(pipeline, parameter_grid, cv=cv, n_jobs = -1, verbose=10)


In [3]:
lasso_result = lasso_search.fit( ) #Input your X and y

In [4]:
#Finding the best performing model and saving it 

#Viewing the parameters and accuracy of the best performing model 


### 5. Performance evaluation

Now that we have our two best performing models, let's try them on data the models have never seen. Normally we couldn't "check" to see how well we labeled our unlabeled data, because they are... unlabeled. But for this example we actually can do that because in all of our data we know who actually is a Democrat or Republican. Split the unlabeled dataset into X_test and y_test. 

Fit the best models with the labeled dataset. 

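1. Plot a confusion matrix for the best performing RandomForest and the best performing Lasso. Scikit-learn's `plot_confusion_matrix` can do this for you; it was removed in scikit-learn 1.2, so on newer versions use `ConfusionMatrixDisplay.from_estimator` from `sklearn.metrics` instead. 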
2. Compute accuracy, precision, recall, and f1 for each best performing model. This can be done by computing predicted y and then using `classification_report` to get the performance scores. 
3. Compute the AUC-score (area-under-the-curve) and plot the ROC-curve for each model. Code is provided. 

Which method performs best?
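
As a reminder of how the confusion matrix and the scores relate, here is a sketch with made-up true and predicted labels (not model output). Rows of the confusion matrix are true classes and columns are predicted classes.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration: 0 = Republican, 1 = Democrat
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall and f1 per class, plus overall accuracy
print(classification_report(y_true, y_pred, target_names=["Republican", "Democrat"]))

# The same scores as a dictionary, convenient for comparing models
report = classification_report(
    y_true, y_pred, target_names=["Republican", "Democrat"], output_dict=True
)
print(report["accuracy"])
```

In this toy example 4 of the 6 predictions sit on the diagonal of the matrix, so accuracy is 4/6.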


In [112]:
#Splitting unlabeled features into a new X and y, to use as our testing data


#### Confusion matrices

#### Accuracy, precision, recall, and f1-score

#### AUC-score and ROC curve

For this, we are borrowing code from the documentation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html 

Just fill in the blank and attempt to understand the code. 
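
To build intuition before filling in the blank, the sketch below uses four made-up observations. `p_class1` stands for the predicted probability of class 1, i.e. the second column of what `predict_proba` returns. It shows that `auc` applied to the ROC curve gives the same number as `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
p_class1 = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns the false and true positive rates at each threshold
fpr_toy, tpr_toy, thresholds = roc_curve(y_true, p_class1)

# Area under the curve, computed two ways
auc_from_curve = auc(fpr_toy, tpr_toy)
auc_direct = roc_auc_score(y_true, p_class1)
print(auc_from_curve, auc_direct)
```

Both values are 0.75 here: of the four (Democrat, Republican) pairs, the Democrat gets the higher predicted probability in three.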

In [136]:
#We have a binary classification - therefore, 2 classes
n_classes = 2

#Defining dictionaries to save scores for each class
fpr = dict() #Increasing false positive rates
tpr = dict() #Increasing true positive rates
roc_auc = dict() #AUC-score

#The predicted probabilities based on the unseen data - fill in here:
probs =   

#For each class, the dictionaries are filled in with values
for i in range(n_classes):
    fpr[i], tpr[i], threshold = roc_curve(y_test, probs[:, i], pos_label=i) #Using scikit-learn's roc_curve function, treating class i as the positive class
    roc_auc[i] = auc(fpr[i], tpr[i]) #Using scikit-learn's auc function to get the auc-score


In [None]:
plt.figure()

#Setting linewidth
lw = 2

#Plotting the ROC-curve for class 1 (aka prediction of Democrat) with AUC-score as the legend
plt.plot(fpr[1], tpr[1], color='darkorange', lw=lw, label='ROC curve (area = %0.5f)' % roc_auc[1])

#Plotting the 'no skill' line, i.e. the expected ROC-curve of a classifier that guesses at random
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

#Adding embellishments :) 
plt.title('ROC-curve for RandomForestClassifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

#Plotting AUC-score as legend
plt.legend(loc="lower right")

plt.show()


In [133]:
#Repeating above steps for Lasso 

n_classes = 2

fpr = dict()
tpr = dict()
roc_auc = dict()

#Fill in the predicted probabilities on the unseen data here:
probs = 

for i in range(n_classes):
    fpr[i], tpr[i], threshold = roc_curve(y_test, probs[:, i], pos_label=i)
    roc_auc[i] = auc(fpr[i], tpr[i])
    

In [5]:
#Plotting the ROC-curve and AUC-score for Lasso

plt.figure()

lw = 2

plt.plot(fpr[1], tpr[1], color='darkorange', lw=lw, label='ROC curve (area = %0.5f)' % roc_auc[1])

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

plt.title('ROC-curve for Lasso')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.legend(loc="lower right")

plt.show()

### Interpreting features important for prediction

Hooray, RandomForestClassifier is the best model! 

To see which features were important for prediction, extract the feature importances from the best performing model using `model.named_steps['insert pipeline step'].feature_importances_`. 

Combining these with `model.named_steps['insert pipeline step'].get_feature_names_out()` from the vectorizer (called `get_feature_names()` in scikit-learn versions before 1.0), find the 20 most important words for prediction. 

Does it qualitatively make sense to you that these are the most important words to predict Democrats vs. Republicans from tweet text? 

If you have time, plot the 10 largest feature importances with `sns.barplot`. 

Hint: This can be solved similarly to the way the largest beta values and associated words were extracted in the topic modelling exercise. 

Optional: The most impactful coefficients can also be extracted from Lasso, in a similar way. Just remember to extract the 20 largest coefficients in *absolute* value. 

#### Extracting feature importances from RandomForest classifier

#### Optional: Extracting coefficients from Lasso