## 1. World Leaders

<p>Are you up for the challenge of classifying text, specifically tweets? In this notebook, we'll explore the realm of social media text classification and delve into the task of properly categorizing tweets from two prominent North American politicians: Donald Trump and Justin Trudeau.</p>
<p>Classifying tweets presents unique challenges in natural language processing due to their brevity. Additionally, various platform-specific conventions such as @mentions, #hashtags, emojis, links, and short-hand phrases (ikr?) can make the task more complex. But fear not! We will overcome these hurdles and construct a valuable classifier that can distinguish between these two prolific tweeters. Let's get started!</p>
<p>To begin, we'll import all the necessary tools from scikit-learn. This includes data vectorization techniques like <code>CountVectorizer</code> and <code>TfidfVectorizer</code>. We'll also import essential models, such as <code>MultinomialNB</code> from the <code>naive_bayes</code> module, <code>LinearSVC</code> from the <code>svm</code> module, and <code>PassiveAggressiveClassifier</code> from the <code>linear_model</code> module. Lastly, to evaluate and optimize our model, we'll need <code>sklearn.metrics</code> and the utility functions <code>train_test_split</code> and <code>GridSearchCV</code> from the <code>model_selection</code> module.</p>

In [None]:
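# Set seeds for reproducibility
# (scikit-learn falls back to NumPy's global random state when random_state is not set)
import random; random.seed(53)
import numpy as np; np.random.seed(53)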

# Import all we need from sklearn
# ... YOUR CODE FOR TASK 1 ...
# ... YOUR CODE FOR TASK 1 ...
# ... YOUR CODE FOR TASK 1 ...
# ... YOUR CODE FOR TASK 1 ...
# ... YOUR CODE FOR TASK 1 ...

## 2. Transforming our collected data
<p>To kick off, we will use a collection of tweets gathered in November 2017 and stored in CSV format. We will load the data into a pandas DataFrame and then process it with scikit-learn.</p>
<p>Because the Twitter API returned the data without separating it into training and test sets, we will make this split ourselves. Using <code>train_test_split()</code> with <code>random_state=53</code> and a test size of 0.33 gives us enough test data and produces the same split no matter where or when the code runs, just as we did in the DataCamp course.</p>
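<p>As a rough illustration of why a fixed <code>random_state</code> matters, the sketch below splits a small made-up list twice and checks that both splits agree. The <code>toy_texts</code> and <code>toy_labels</code> names and values are assumptions for demonstration only; they are not the project's CSV columns.</p>

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for tweet text and author labels
toy_texts = [f"tweet number {i}" for i in range(12)]
toy_labels = ["Trump", "Trudeau"] * 6

# First split with a fixed seed
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=53
)

# Second split with the same seed: the partition should be identical
again_X_train, again_X_test, _, _ = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=53
)

print(len(toy_X_train), len(toy_X_test))
print(toy_X_test == again_X_test)
```

<p>Changing or omitting <code>random_state</code> would generally give a different partition on each run, which makes accuracy scores harder to compare.</p>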

In [None]:
import pandas as pd

# Load data
tweet_df = ...

# Create target
y = ...

# Split training and testing data
X_train, X_test, y_train, y_test = ...

## 3. Vectorize the tweets
<p>We have the training and testing data all set up, but we need to create vectorized representations of the tweets in order to apply machine learning.</p>
<p>To do so, we will use the <code>CountVectorizer</code> and <code>TfidfVectorizer</code> classes, which we first need to fit to the training data.</p>
<p>Once this is complete, we can start modeling with the new vectorized tweets!</p>
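<p>The fit-then-transform pattern can be sketched on a tiny made-up corpus. The documents and the <code>demo_</code> variable names below are assumptions; the point is only that the vocabulary is learned from the training text and then reused, unchanged, on the test text.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus, not taken from the tweet dataset
demo_train_docs = ["make america great again", "great meeting today", "merci tous"]
demo_test_docs = ["great again", "merci beaucoup"]

# Learn the vocabulary on the training documents, then reuse it on the test documents
demo_count_vectorizer = CountVectorizer()
demo_count_train = demo_count_vectorizer.fit_transform(demo_train_docs)
demo_count_test = demo_count_vectorizer.transform(demo_test_docs)

# Same pattern for TF-IDF weights
demo_tfidf_vectorizer = TfidfVectorizer()
demo_tfidf_train = demo_tfidf_vectorizer.fit_transform(demo_train_docs)
demo_tfidf_test = demo_tfidf_vectorizer.transform(demo_test_docs)

print(demo_count_train.shape, demo_count_test.shape)
print(sorted(demo_count_vectorizer.vocabulary_))
```

<p>Words that appear only in the test documents (here, "beaucoup") are not in the learned vocabulary and are simply ignored by <code>transform</code>, so train and test matrices always have the same number of columns.</p>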

In [None]:
# Initialize count vectorizer
count_vectorizer = ...

# Create count train and test variables
count_train = ...
count_test = ...

# Initialize tfidf vectorizer
tfidf_vectorizer = ...

# Create tfidf train and test variables
tfidf_train = ...
tfidf_test = ...

## 4. Training a multinomial naive Bayes model
<p>Now that we have the data in vectorized form, we can train the first model. Investigate using the Multinomial Naive Bayes model with both the <code>CountVectorizer</code> and <code>TfidfVectorizer</code> data. Which do you think will perform better? How come?</p>
<p>To assess the accuracies, we will print the test set accuracy scores for both models.</p>
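<p>A minimal sketch of the train, predict, and score loop follows, assuming a made-up corpus with placeholder labels (<code>"Trump"</code> and <code>"Trudeau"</code> here are illustrative strings, not necessarily the label values in the CSV). It runs the same <code>MultinomialNB</code> workflow once per vectorizer.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus with placeholder labels
nb_train_docs = ["great america jobs", "america great again",
                 "merci canada amis", "canada merci ensemble"]
nb_train_labels = ["Trump", "Trump", "Trudeau", "Trudeau"]
nb_test_docs = ["great jobs america", "merci canada"]
nb_test_labels = ["Trump", "Trudeau"]

nb_scores = {}
for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    # Fit the vectorizer on training text only, then transform the test text
    train_vectors = vectorizer.fit_transform(nb_train_docs)
    test_vectors = vectorizer.transform(nb_test_docs)

    # Train naive Bayes and score its predictions on the held-out documents
    model = MultinomialNB()
    model.fit(train_vectors, nb_train_labels)
    predictions = model.predict(test_vectors)
    nb_scores[name] = accuracy_score(nb_test_labels, predictions)

print(nb_scores)
```

<p>This toy corpus is far too small and too cleanly separated to show any difference between the two vectorizers; it only demonstrates the sequence of calls.</p>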

In [None]:
# Create a MultinomialNB model
tfidf_nb = ...

# ... Train your model here ...

# Run predict on your TF-IDF test data to get your predictions
tfidf_nb_pred = ...

# Calculate the accuracy of your predictions
tfidf_nb_score = ...

# Create a MultinomialNB model
count_nb = ...
# ... Train your model here ...

# Run predict on your count test data to get your predictions
count_nb_pred = ...

# Calculate the accuracy of your predictions
count_nb_score = ...

print('NaiveBayes Tfidf Score: ', tfidf_nb_score)
print('NaiveBayes Count Score: ', count_nb_score)

## 5. Evaluating our model using a confusion matrix
<p>We see that the TF-IDF model performs better than the count-based approach. Based on what we know from the NLP fundamentals course, why might that be? TF-IDF gives greater weight to tokens that are frequent within a tweet but rare across the whole collection, so perhaps each tweeter uses specific, distinctive words that identify them! Let's continue the investigation.</p>
<p>For classification tasks, an accuracy score doesn't tell the whole story. We can evaluate the model better by looking at the confusion matrix, which shows the number of correct and incorrect classifications for each class. We can use the counts of True Positives, False Positives, False Negatives, and True Negatives to determine how well the model performed on a given class. How many times was Trump misclassified as Trudeau?</p>
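<p>To make the layout of a confusion matrix concrete, here is a small sketch using hand-written labels (the <code>cm_true</code> and <code>cm_pred</code> lists are invented for illustration). With scikit-learn's <code>confusion_matrix</code>, rows correspond to the actual class and columns to the predicted class, in the order given by <code>labels</code>.</p>

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted authors for six tweets
cm_true = ["Trump", "Trump", "Trump", "Trudeau", "Trudeau", "Trudeau"]
cm_pred = ["Trump", "Trump", "Trudeau", "Trudeau", "Trump", "Trump"]

# Fix the label order so we know which row and column is which
cm_labels = ["Trump", "Trudeau"]
demo_cm = confusion_matrix(cm_true, cm_pred, labels=cm_labels)

# Row 0 = actual Trump, column 1 = predicted Trudeau
trump_as_trudeau = demo_cm[0, 1]
# Row 1 = actual Trudeau, column 0 = predicted Trump
trudeau_as_trump = demo_cm[1, 0]

print(demo_cm)
print("Trump misclassified as Trudeau:", trump_as_trudeau)
print("Trudeau misclassified as Trump:", trudeau_as_trump)
```

<p>Passing <code>labels</code> explicitly is a design choice that avoids relying on the default sorted order of the class names.</p>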

In [None]:
%matplotlib inline

from datasets.helper_functions import plot_confusion_matrix

# Calculate the confusion matrices for the tfidf_nb model and count_nb models
tfidf_nb_cm = ...
count_nb_cm = ...

# Plot the tfidf_nb_cm confusion matrix
plot_confusion_matrix(tfidf_nb_cm, classes=..., title="TF-IDF NB Confusion Matrix")

# Plot the count_nb_cm confusion matrix without overwriting the first plot 
plot_confusion_matrix(..., classes=..., title=..., figure=1)

## 6. Trying out another classifier: Linear SVC
<p>So the Bayesian model's predictions differ by only one between the TF-IDF and count vectorizers, which is fairly impressive! Interestingly, there is some confusion when the predicted label is Trump but the actual tweeter is Trudeau. If we were going to use this model, we would want to investigate which tokens are causing the confusion in order to improve the model.</p>
<p>Now that we've seen what the Bayesian model can do, how about trying a different approach? <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html">LinearSVC</a> is another popular choice for text classification. Let's see if using it with the TF-IDF vectors improves the accuracy of the classifier!</p>
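<p>The same workflow carries over to <code>LinearSVC</code> almost unchanged. The sketch below assumes a made-up corpus and placeholder labels, and fixes <code>random_state</code> only so that repeated runs behave the same way.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import LinearSVC

# Hypothetical toy corpus with placeholder labels
svc_train_docs = ["great america jobs", "america great again",
                  "merci canada amis", "canada merci ensemble"]
svc_train_labels = ["Trump", "Trump", "Trudeau", "Trudeau"]
svc_test_docs = ["great america", "merci canada"]
svc_test_labels = ["Trump", "Trudeau"]

# Vectorize with TF-IDF, fitting on the training text only
svc_vectorizer = TfidfVectorizer()
svc_train_vectors = svc_vectorizer.fit_transform(svc_train_docs)
svc_test_vectors = svc_vectorizer.transform(svc_test_docs)

# Train the linear support vector classifier and predict on the test vectors
demo_svc = LinearSVC(random_state=53)
demo_svc.fit(svc_train_vectors, svc_train_labels)
demo_svc_pred = demo_svc.predict(svc_test_vectors)

print(accuracy_score(svc_test_labels, demo_svc_pred))
print(confusion_matrix(svc_test_labels, demo_svc_pred, labels=["Trump", "Trudeau"]))
```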

In [None]:
# Create a LinearSVC model
tfidf_svc = ...

# ... Train your model here ...

# Run predict on your tfidf test data to get your predictions
tfidf_svc_pred = ...

# Calculate your accuracy using the metrics module
tfidf_svc_score = ...

print("LinearSVC Score:   %0.3f" % tfidf_svc_score)

# Calculate the confusion matrix for the tfidf_svc model
svc_cm = ...

# Plot the confusion matrix using the plot_confusion_matrix function
plot_confusion_matrix(svc_cm, classes=..., title="TF-IDF LinearSVC Confusion Matrix")

## 7. Introspecting our top model
<p>Wow, the LinearSVC model is even better than the multinomial naive Bayes one. Nice work! Via the confusion matrix we can see that, although there is still some confusion where Trudeau's tweets are classified as Trump's, the False Positive rate is lower than that of the previous model. So, we have a performant model, right?</p>
<p>We might be able to continue tweaking and improving all of the previous models by learning more about parameter optimization or applying some better preprocessing of the tweets.</p>
<p>Now let's see what the model has learned. Using the LinearSVC classifier with two classes (Trump and Trudeau), we can sort the features (tokens) by their weight and see the most important tokens for both Trump and Trudeau. What are the most Trump-like or Trudeau-like words? Did the model learn something useful to distinguish between these two men?</p>

In [None]:
from datasets.helper_functions import plot_and_return_top_features

# Import pprint from pprint
from pprint ...

# Get the top features using the plot_and_return_top_features function and your top model and tfidf vectorizer
top_features = ...

# pprint the top features
pprint(...)

## 8. Bonus: can you write a Trump or Trudeau tweet?

<p>Upon analyzing the results, it appears that our model has successfully identified that Trudeau tends to tweet in French!</p>
<p>Now, it's your turn to craft a tweet using this newfound knowledge to challenge the model! Utilize the displayed list or plot above to make educated guesses on the words that will categorize your text as either Trump or Trudeau. Can you cleverly deceive the model into believing that you are either Trump or Trudeau?</p>
<p>If you happen to be fluent in French, don't hesitate to compose your Trudeau-style tweet in French! As observed, these French words are common and often known as "stop words." Although you have the option to preprocess the tweets by removing both English and French stop words, it may reduce the model's accuracy since Trudeau is the sole French-speaker in the dataset. If the dataset included multiple French speakers, this preprocessing step would prove valuable.</p>
<p>For future research on this dataset, some potential avenues to explore are:</p>
<ul>
<li>Incorporating additional preprocessing techniques (e.g., removing URLs or French stop words) and studying their impact on the model</li>
<li>Enhancing both the Bayesian and LinearSVC models by using GridSearchCV to identify optimal parameters</li>
<li>Conducting introspection on the Bayesian model to identify words that lean towards either Trump's or Trudeau's writing style</li>
<li>Enriching the dataset with more recent tweets using tweepy and subsequently retraining the model</li>
</ul>
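<p>As a starting point for the <code>GridSearchCV</code> idea in the list above, here is a minimal sketch. The corpus, the labels, the use of a <code>Pipeline</code>, and the candidate <code>alpha</code> values are all assumptions chosen for illustration, not tuned recommendations.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with placeholder labels (four documents per class)
gs_docs = ["great america jobs", "america great again",
           "jobs again great", "america jobs again",
           "merci canada amis", "canada merci ensemble",
           "amis ensemble merci", "canada amis ensemble"]
gs_labels = ["Trump"] * 4 + ["Trudeau"] * 4

# Chain vectorizer and classifier so each CV fold refits the vocabulary
gs_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Candidate smoothing values for naive Bayes (illustrative only)
gs_param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}

gs_search = GridSearchCV(gs_pipeline, gs_param_grid, cv=2, scoring="accuracy")
gs_search.fit(gs_docs, gs_labels)

print(gs_search.best_params_)
print(gs_search.best_score_)
```

<p>Wrapping the vectorizer in the pipeline keeps test-fold vocabulary out of the training step during cross-validation, which is why the sketch searches over the pipeline rather than over the classifier alone.</p>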
<p>Best of luck as you compose your impersonation tweets – don't forget to share them on Twitter if you'd like!</p>

In [None]:
# Write two tweets as strings, one which you want to classify as Trump and one as Trudeau
trump_tweet = ...
trudeau_tweet = ...

# Vectorize each tweet using the TF-IDF vectorizer's transform method
# Note: `transform` needs the string in a list object (i.e. [trump_tweet])
trump_tweet_vectorized = ...
trudeau_tweet_vectorized = ...

# Call the predict method on your vectorized tweets
trump_tweet_pred = ...
trudeau_tweet_pred = ...

print("Predicted Trump tweet", trump_tweet_pred)
print("Predicted Trudeau tweet", trudeau_tweet_pred)