# Exercise 8: Classification Pipeline

**Libraries needed: pandas, sklearn>=0.19, numpy, nltk, matplotlib, seaborn, graphviz (python-graphviz), e.g.**
```
conda install graphviz
conda install python-graphviz
```

First, make sure your environment is set up with the right libraries. In this exercise, you should fill in the empty code sections marked `TODO:`.

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')


## Analyzing Weather Sentiments in Tweets

In this exercise, we consider the [Weather sentiment](https://data.world/crowdflower/weather-sentiment) dataset from [Crowdflower](https://www.crowdflower.com/).

To create this dataset, contributors were asked to grade the sentiment of a particular tweet relating to the weather. Contributors could choose among the following categories:
1. Positive
2. Negative
3. I can't tell
4. Neutral / author is just sharing information
5. Tweet not related to weather condition

The catch is that 20 contributors graded each tweet. Thus, in many cases contributors assigned conflicting sentiment labels to the same tweet. 

In the `data` directory, you will find the file [weather-non-agg-DFE.csv](data/weather-non-agg-DFE.csv) containing the raw contributor answers for each of the 1,000 tweets.


The fields of the csv file are as follows:
1. **_unit\_id_**: CrowdFlower’s numeric ID for the unit,
2. **channel**: channel via which the contributor entered the job,
3. **trust**: the contributor's accuracy level in the current job, determined by their accuracy on the Test Questions they’ve seen in the job,
4. **worker_id**: CrowdFlower Contributor ID,
5. **country**: worker's country code
6. **region**: worker's region
7. **city**: worker's city
8. **emotion**: worker's assigned emotion to the tweet
9. **tweet_id**: id of the tweet
10. **tweet_body**: body text of the tweet


Our goal in this exercise is to build a classifier that predicts a tweet's emotion according to the categories above. To that end, we first aggregate the results of the CrowdFlower task to get a clean dataset. Then we prepare the features by tokenizing the text. Finally, we feed these features into a text classifier.


### Task 1: Data Formatting

To begin with, let's load the data into a format we can work with.



In [6]:
data = pd.read_csv('data/weather-non-agg-DFE.csv')
# print the shape of our data frame
print(data.shape)
data.head()

(20000, 10)


|   | _unit_id_ | channel | trust | worker_id | country | region | city | emotion | tweet_id | tweet_body |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 314960382 | clixsense | 0.4541 | 18034918 | IND | 7 | Delhi | Neutral / author is just sharing information | 82846118 | Fire Weather Watch issued May 17 at 4:21PM CDT... |
| 1 | 314960385 | clixsense | 0.4541 | 18034918 | IND | 7 | Delhi | Positive | 82510997 | Passing out now. working tonight. Storms toda... |
| 2 | 314960391 | clixsense | 0.4541 | 18034918 | IND | 7 | Delhi | Negative | 83271279 | RT @mention: "The storm is only that which aut... |
| 3 | 314960396 | clixsense | 0.4541 | 18034918 | IND | 7 | Delhi | Positive | 80058872 | It is hot out here but it feels great |
| 4 | 314960400 | clixsense | 0.4541 | 18034918 | IND | 7 | Delhi | Neutral / author is just sharing information | 80058809 | I can't find a way to delete my iWitness Weath... |


Let's see how our labels are distributed:

In [7]:
print("For all tweets:\n" + str(data.emotion.value_counts()))
print("For tweet_id=82846118:\n" + str(data[data.tweet_id==82846118].emotion.value_counts()))

For all tweets:
Neutral / author is just sharing information    5371
Negative                                        4986
Positive                                        4953
Tweet not related to weather condition          3553
I can't tell                                    1137
Name: emotion, dtype: int64
For tweet_id=82846118:
Neutral / author is just sharing information    16
Positive                                         1
Tweet not related to weather condition           1
I can't tell                                     1
Negative                                         1
Name: emotion, dtype: int64


### Task 2: Aggregating the Annotations

Now we aggregate the workers' annotations to obtain one label per tweet. Your input is the pandas data frame `data`. Your output should be a data frame of 1000 rows, with one `emotion` value for each `tweet_id`. You should use the Majority Decision algorithm, where the value of the `emotion` field is simply the label occurring most frequently for that `tweet_id`.




In [None]:
# We're only interested in the columns ['tweet_id', 'emotion', 'tweet_body'] for now,
# so we keep just these columns.
data = data[['tweet_id', 'emotion', 'tweet_body']]

# TODO: group the data by tweet_id and keep the most frequent emotion per tweet
agg_data = data...
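
As an illustration of the Majority Decision idea on its own, here is a minimal sketch on a tiny made-up data frame. The names `toy`, `majority`, and `toy_agg` are hypothetical and not part of the exercise, and this is only one of several ways to express the aggregation in pandas.

```python
import pandas as pd

# Toy annotations (made-up data): three workers label each of two tweets.
toy = pd.DataFrame({
    "tweet_id": [1, 1, 1, 2, 2, 2],
    "emotion": ["Positive", "Positive", "Negative",
                "Neutral", "Negative", "Negative"],
    "tweet_body": ["sunny day"] * 3 + ["storm again"] * 3,
})


def majority(labels):
    # value_counts() counts each label; idxmax() returns the label with the highest count.
    return labels.value_counts().idxmax()


# One row per tweet_id: the majority emotion plus the (identical) tweet text.
toy_agg = toy.groupby("tweet_id").agg({"emotion": majority, "tweet_body": "first"})
print(toy_agg)
```

Ties are broken arbitrarily in this sketch. With 20 annotations per tweet they are less likely than here, but a real solution may still want an explicit tie-breaking rule.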

By now, your data should be aggregated. You should get 1000 rows in total, since the 20,000 annotations contain 20 labels per tweet. The index of your data frame should be `tweet_id`. Let's preview what the data frame looks like now.

In [None]:
# We can verify the shape of the resulting data (should be (1000,2))
print('data shape:',agg_data.shape)
# We can also check the columns and the index
print('data columns:',agg_data.columns)
print('data index:',agg_data.index.name)
agg_data.head()

We will now split the dataset into two parts: the training data and the testing data. The test data should be 20% of the original data.

In [None]:
from sklearn.model_selection import train_test_split

# TODO: fill here
train_data, test_data = ...
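
For reference, this is roughly how `train_test_split` behaves on a small made-up data frame. The frame `toy_df` and the choice of `random_state=0` are assumptions made for this sketch only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows standing in for aggregated tweets.
toy_df = pd.DataFrame({
    "tweet_body": ["tweet %d" % i for i in range(10)],
    "emotion": ["Positive", "Negative"] * 5,
})

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=0)
print(len(toy_train), len(toy_test))
```

The function also accepts a `stratify` argument, which keeps the class proportions similar in both parts. That can matter for imbalanced labels.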

In [None]:
# check a sample of the training data
train_data.head()

In [None]:
# check a sample of the test data
test_data.head()

You can see the number of samples from each class.

In [None]:
print('training') 
print(train_data.emotion.value_counts())
print('testing')
print(test_data.emotion.value_counts())

### Task 3: Feature Creation

Now that we have aggregated the data, we will create the features to use in our classifier. The input to the classifier is a tweet. However, for this task, we will use the tokens (e.g., words) in the tweet as the features.
Specifically, each tweet will be represented by a vector that indicates which words are in the tweet and how frequently each word occurs. This representation ignores the order in which the words occur. This kind of feature representation is also called the **bag of words** technique.

Let's first prepare the data and take a look at it:

In [None]:
x_train = train_data['tweet_body'].values
y_train = train_data['emotion'].values
x_test = test_data['tweet_body'].values
y_test = test_data['emotion'].values

print(list(zip(x_train, y_train)))

Now let's see how we can transform our tweets into vectors:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
# we take two samples
samples = x_train[:2]
# TODO: fit and transform these samples to get the count vectors
x_train_counts = ...

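# You can see how these vectors look with the following commands.
# Note: on scikit-learn >= 1.0, use count_vect.get_feature_names_out() instead of get_feature_names().
print(pd.DataFrame(x_train_counts.toarray(), columns=count_vect.get_feature_names()).to_string())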
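
To make the bag-of-words representation concrete, here is a small sketch on two made-up sentences. The variable names are hypothetical. It reads the column layout from `vocabulary_`, which is available across scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["rain rain go away", "sunny day today"]  # made-up examples

vec = CountVectorizer()
counts = vec.fit_transform(docs)   # sparse matrix: one row per document
vocab = vec.vocabulary_            # dict mapping token -> column index
dense = counts.toarray()           # dense copy, fine for tiny examples

print(sorted(vocab))
print(dense)
```

Each row has one entry per vocabulary word. The word "rain" appears twice in the first sentence, so its column holds 2 in the first row, and word order is not recorded anywhere.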

### Task 4: Creating the Classifiers

In practice, it's cleaner to integrate all of our preprocessing and classifier into a single [pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). So let's do that now. Add the missing code to build a simple pipeline with CountVectorizer and a Multinomial Naive Bayes classifier.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB


def build_pipeline(classifier_fn,x_train,y_train):
    pipeline = Pipeline([
        # TODO: add the missing code here
        ('count_vectorizer',  ...),
        ('classifier',          ...)
        ])
    pipeline.fit(x_train,y_train)
    return pipeline

pipeline = build_pipeline(MultinomialNB(),x_train,y_train)
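
As a rough illustration of how a scikit-learn `Pipeline` chains a vectorizer and a classifier, consider the sketch below. It uses four made-up training sentences, and `toy_pipeline` is a hypothetical name, so treat it as a demonstration of the mechanics and not as the exercise solution.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data for illustration only.
texts = ["love this sunny day", "great warm weather",
         "hate this cold rain", "awful storm again"]
labels = ["Positive", "Positive", "Negative", "Negative"]

# Each step is a (name, estimator) tuple; the last step is the classifier.
toy_pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# fit() vectorizes the raw strings and then trains the classifier on the counts.
toy_pipeline.fit(texts, labels)

# predict() applies the same vectorizer before classifying new text.
pred = toy_pipeline.predict(["sunny and warm"])
print(pred)
```

The benefit of the pipeline is that raw strings go in at both training and prediction time, so the vectorizer fitted on the training data is reused automatically.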

To evaluate the classifiers, we will be using the precision, recall, and F1 scores. 

In [None]:
from sklearn.metrics import precision_recall_fscore_support,classification_report, confusion_matrix

# TODO: get the class names from the data
class_names = ...

def plot_confusion_matrix(y_test,y_predicted,labels):
    cm = confusion_matrix(y_test, y_predicted,labels =labels)

    figsize = (10,7)
    df_cm = pd.DataFrame(
        cm, index=labels, columns=labels,
    )
    fig = plt.figure(figsize=figsize)
    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    return
    
def evaluate_classifier(pipeline, x_test, y_test):
    # TODO: get the predictions
    y_predicted = ...
    # TODO: generate the report
    report  = ...
    print(report)
    # TODO: plot the confusion matrix
    ...
    return

evaluate_classifier(pipeline, x_test,y_test)
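
The two scikit-learn metrics helpers used in this task can be tried in isolation. The sketch below uses made-up true and predicted labels, and all variable names are hypothetical.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up ground truth and predictions.
y_true = ["Positive", "Negative", "Positive", "Negative", "Positive"]
y_pred = ["Positive", "Negative", "Negative", "Negative", "Positive"]
labels = ["Negative", "Positive"]

# Per-class precision, recall, F1 and support as a formatted string.
print(classification_report(y_true, y_pred, labels=labels))

# Rows are true labels, columns are predicted labels, both in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Here one "Positive" tweet is misclassified as "Negative", which shows up as the off-diagonal 1 in the second row of the matrix.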

We can experiment with other classifiers as we please at this point. Try using an SVM classifier and a random forest classifier instead. You might notice that Multinomial Naive Bayes is already a good baseline for short text classification.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

# TODO: build the pipeline with an SGDClassifier (SVM) and evaluate it
...
...


# TODO: build the pipeline with a Random Forest Classifier and evaluate it
...
...
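
Swapping the classifier only changes the last pipeline step. The sketch below assumes made-up training sentences and a hypothetical helper `make_toy_pipeline`. `SGDClassifier` with `loss="hinge"` trains a linear SVM, and the hyperparameters shown are arbitrary choices for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

texts = ["love this sunny day", "great warm weather",
         "hate this cold rain", "awful storm again"]
labels = ["Positive", "Positive", "Negative", "Negative"]


def make_toy_pipeline(classifier):
    # Hypothetical helper: same vectorizer, interchangeable classifier.
    return Pipeline([
        ("count_vectorizer", CountVectorizer()),
        ("classifier", classifier),
    ])


candidates = {
    "linear SVM (SGD, hinge loss)": SGDClassifier(loss="hinge", random_state=0),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=0),
}

fitted = {}
for name, clf in candidates.items():
    fitted[name] = make_toy_pipeline(clf).fit(texts, labels)
    print(name, fitted[name].predict(["cold rain"]))
```

With so little data the predictions are not meaningful. The point is only that every classifier plugs into the same pipeline interface.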

### Task 5: Predicting Emotions for New Tweets

Now that we have trained classifiers, we can try them on a couple of new tweets.

In [None]:
tweets = ["love the weather","#WEATHER: 7:51 am E: 55.0F. Feels F. 30.01% Humidity. 3.5MPH Variable Wind."]
# TODO: get the predictions
predictions = ...
predictions

### Task 6: Visualizing a Decision Tree Classifier

Side note: this requires **graphviz** to be installed.

One way to see how our features affect the classification is to use interpretable classifiers like Decision Trees. To begin with, let's add another step to our pipeline, where each vector element is the TF-IDF measure of a word instead of its raw term frequency. We'll use the [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

# TODO: build a new pipeline with the TfidfTransformer
def build_pipeline(classifier_fn,x_train,y_train):
    pipeline = Pipeline([
        ...
        ...
        ...
        ])
    pipeline.fit(x_train,y_train)
    return pipeline
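
To see what the TF-IDF step does to the counts, here is a small sketch on two made-up sentences. The step name `tfidf` and the variable names are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["cold rain today", "warm sun today"]  # made-up examples
labels = ["Negative", "Positive"]

toy_pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),   # reweights the raw counts
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(texts, labels)

# Re-run the first two steps by hand to inspect the TF-IDF weights.
vec = toy_pipeline.named_steps["count_vectorizer"]
tfidf = toy_pipeline.named_steps["tfidf"]
weights = tfidf.transform(vec.transform(texts)).toarray()

rain_w = weights[0, vec.vocabulary_["rain"]]
today_w = weights[0, vec.vocabulary_["today"]]
print("rain:", rain_w, "today:", today_w)
```

The word "today" occurs in both sentences, so it receives a lower weight than "rain", which occurs in only one. This is the effect TF-IDF is meant to have on uninformative, frequent words.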

Now we're ready to build and visualize the decision tree.

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

# TODO: build the pipeline with the Decision Tree Classifier
pipeline =...
# extract the classifier and the count_vectorizer from the pipeline
classifier= pipeline.get_params()['classifier']
count_vectorizer = pipeline.get_params()['count_vectorizer']
# TODO: Use export_graphviz to visualize the classifiers. You already have all the needed parameters.
dot_data = ...

graph = graphviz.Source(dot_data) 
graph
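
The call to `export_graphviz` can be tried without rendering anything, because it returns the DOT source as a string when `out_file=None`. The sketch below uses made-up sentences and hypothetical variable names, and it derives the feature names from `vocabulary_`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz

texts = ["love this sunny day", "great warm weather",
         "hate this cold rain", "awful storm again"]
labels = ["Positive", "Positive", "Negative", "Negative"]

vec = CountVectorizer()
X_counts = vec.fit_transform(texts)
tree = DecisionTreeClassifier(random_state=0).fit(X_counts, labels)

# Order the tokens by their column index so they line up with the feature matrix.
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

dot_data = export_graphviz(
    tree,
    out_file=None,                                # return the DOT text
    feature_names=feature_names,                  # show words instead of column numbers
    class_names=[str(c) for c in tree.classes_],  # show labels at the leaves
    filled=True,
)
print(dot_data[:200])
```

Passing the resulting string to `graphviz.Source` renders the tree, which requires the graphviz installation mentioned above.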

### Task 7: Get Feature Importance

One way to debug our classifiers and see how they are working is to investigate the features and print the most important ones. We'll do that next. But to start, let's do one more preprocessing step that further removes some of the noise: removing stop words.

In [None]:
from nltk.corpus import stopwords

# remove stopwords in order to improve interpretability
stop = stopwords.words('english')
stop += ['rt','@mention:','@mention','link']

# TODO: write a function that removes the stop words from each string in a numpy array of strings 
def remove_stopwords(x_train):
    ...


x_train_nostop = remove_stopwords(x_train)
x_test_no_stop = remove_stopwords(x_test)

print(x_train_nostop)
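
One possible shape for such a function is sketched below. It uses a tiny hand-written stop list instead of the NLTK list, so that it runs without downloading any corpus. The names `toy_stop` and `drop_stopwords` are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
import numpy as np

# Tiny hand-written stop list for illustration (the exercise uses the NLTK list).
toy_stop = {"the", "is", "rt", "@mention"}


def drop_stopwords(texts, stop_words):
    cleaned = []
    for text in texts:
        # Split on whitespace and compare case-insensitively against the stop list.
        kept = [tok for tok in text.split() if tok.lower() not in stop_words]
        cleaned.append(" ".join(kept))
    return np.array(cleaned)


toy_texts = np.array(["RT @mention the storm is coming", "The sun is out"])
toy_clean = drop_stopwords(toy_texts, toy_stop)
print(toy_clean)
```

Whitespace splitting leaves punctuation attached to tokens, which is why the stop list in the exercise contains both `@mention` and `@mention:`.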

Now we're ready to get the most important features for each class. Complete the function below with the missing lines. An easy sanity check is to see if the words that are important for the "Positive" class are actually positive and vice-versa!

In [None]:
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=20):
    labelid = list(classifier.classes_).index(classlabel)
    # TODO: get the feature names from the vectorizer
    feature_names = ...
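    # feature_log_prob_ holds the per-class log-probability of each feature for Naive Bayes
    # (the older coef_ attribute of MultinomialNB was removed in recent scikit-learn versions)
    topn = sorted(zip(classifier.feature_log_prob_[labelid], feature_names))[-n:]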

    for coef, feat in topn:
        print(classlabel, feat, coef)

        
pipeline = build_pipeline(MultinomialNB(),x_train_nostop,y_train)
# TODO: get the most informative features for a Multinomial NB classifier 
# by using the function above for the classes "Positive" and "Negative"
classifier= ...
count_vectorizer = ...
most_informative_feature_for_class(count_vectorizer,classifier,"Positive")

In [None]:
most_informative_feature_for_class(count_vectorizer, classifier, "Negative")

In [None]:
most_informative_feature_for_class(count_vectorizer, classifier, "Neutral / author is just sharing information")
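
The ranking idea can be checked on a toy corpus where the answer is obvious. The sketch below assumes made-up sentences and a hypothetical helper `top_features`. It ranks words by the per-class log-probabilities that `MultinomialNB` stores in `feature_log_prob_`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["sunny sunny lovely day", "lovely sunny morning",
         "cold cold rain", "cold grey rain"]
labels = ["Positive", "Positive", "Negative", "Negative"]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(texts), labels)

# Tokens ordered by column index, matching the columns of feature_log_prob_.
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)


def top_features(classifier, feature_names, classlabel, n=3):
    # Hypothetical helper: the n words with the highest log-probability for one class.
    label_id = list(classifier.classes_).index(classlabel)
    scored = sorted(zip(classifier.feature_log_prob_[label_id], feature_names))
    return [feat for _, feat in scored[-n:]]


print(top_features(nb, feature_names, "Positive", n=2))
print(top_features(nb, feature_names, "Negative", n=2))
```

The most frequent word of each class ends up last in the returned list, which matches the sanity check suggested above: positive words for "Positive" and negative ones for "Negative".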

### Task 8: Cross validation

Above, we used a single training set and a single testing set to evaluate our models. A more robust way to evaluate a model is cross validation. In the code below, we will do such an evaluation. At the same time, we will use the cross-validation results to select better parameters for our model. More specifically, we will look for a good `alpha` parameter for the `MultinomialNB` model. To do so, we will plot how the F1 score for the "Positive" and "Negative" labels varies with `alpha`.

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold


# construct a new X and y 
X = np.concatenate((x_train_nostop, x_test_no_stop),axis=0)
y = np.concatenate((y_train, y_test),axis=0)  

# prepare for the cross validation 
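# shuffle=True is needed for random_state to take effect (newer scikit-learn versions raise an error otherwise)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)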

alphas = [(i*0.1) for i in range(0,10)]


total_f1_pos = []
total_f1_neg = []

for a in alphas:
    f_score_pos = []
    f_score_neg = []
    for train_index, test_index in kf.split(X,y):
        
        # TODO: build the pipeline for the current alpha and the current training and testing set 
        pipeline = ...
       
        y_predicted = pipeline.predict(X[test_index])
        report  = precision_recall_fscore_support(y[test_index], y_predicted)
        
        # let's get the f1_score value for the "Positive" and the "Negative" labels to plot them
        
        ind_pos = list(pipeline.classes_).index('Positive')
        ind_neg = list(pipeline.classes_).index('Negative')

        # report[0] is precision, report[1] is recall, report[2] is f-score
        fscore_positive = report[2][ind_pos]
        fscore_negative = report[2][ind_neg]
        
        f_score_pos += [fscore_positive]
        f_score_neg += [fscore_negative]
    # TODO: extend the lists with the average fscore_positive and fscore_negative, respectively.
    total_f1_pos += ...
    total_f1_neg += ...
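
The overall structure of this loop (an outer loop over `alpha` values, an inner loop over folds, and an average per `alpha`) can be sketched on toy data as follows. The sentences, the number of folds, the `alpha` grid, and all variable names are assumptions for illustration. The sketch uses `f1_score` directly, whereas the cell above uses `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up data: four tweets per class.
X_toy = np.array(["love sunny day", "great warm sun", "lovely sunny walk", "warm great day",
                  "hate cold rain", "awful cold storm", "terrible rain again", "awful storm rain"])
y_toy = np.array(["Positive"] * 4 + ["Negative"] * 4)

kf = StratifiedKFold(n_splits=2, shuffle=True, random_state=4)
mean_f1 = {}

for alpha in [0.1, 0.5, 1.0]:
    fold_scores = []
    for train_idx, test_idx in kf.split(X_toy, y_toy):
        # Fit a fresh model on the training folds only.
        model = Pipeline([
            ("count_vectorizer", CountVectorizer()),
            ("classifier", MultinomialNB(alpha=alpha)),
        ])
        model.fit(X_toy[train_idx], y_toy[train_idx])
        y_hat = model.predict(X_toy[test_idx])
        # F1 for the "Positive" class only, measured on the held-out fold.
        score = f1_score(y_toy[test_idx], y_hat, labels=["Positive"], average=None)[0]
        fold_scores.append(score)
    # Average over folds gives one number per alpha.
    mean_f1[alpha] = float(np.mean(fold_scores))

print(mean_f1)
```

Averaging over folds makes the score for each `alpha` less dependent on one particular split, which is the motivation for cross validation given above.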
    

We now plot the variation of the F1 score with different `alpha` values.

In [None]:
plt.plot(alphas,total_f1_pos,label='Positive')
plt.plot(alphas,total_f1_neg,label='Negative')

plt.xlabel('alpha')
plt.ylabel('F1 score')
plt.legend()
plt.show()

### Bonus Task 9: Food for thought

There is a lot of room for improvement in the above problem. Here are some issues to think about:
- You can see a lot of numbers and measures of wind speed or humidity in the dataset. Can we do some custom tokenization to group these similar features into a single one?
- Alternatively, can we discretize these measures and turn them into discrete features like low/high wind speed? If people express sentiment based on temperature, humidity, etc., this could make for a classifier that works well in practice.
- You have seen that some of the labels have no predictions in the testing set. This is due to the imbalanced dataset. What can we do to improve this?
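
As one possible starting point for the first idea, the sketch below maps every number in a tweet to a single placeholder token, so that different readings share one feature. The token name `NUM` and the function `normalize_numbers` are hypothetical choices, and the regular expression only covers plain integers and decimals.

```python
import re

# Matches integers and simple decimals such as 55 or 3.5.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def normalize_numbers(text):
    # Replace each number with a placeholder, then collapse repeated whitespace.
    replaced = NUMBER_PATTERN.sub(" NUM ", text)
    return " ".join(replaced.split())


example = normalize_numbers("Wind 3.5MPH, 55.0F")
print(example)
```

A function like this could be applied to the tweets before vectorizing. Extending it to emit separate tokens per unit (for example wind speed versus temperature) would be a step towards the discretized features mentioned in the second point.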