# Classifying Tweets

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

# Investigate the Data

To begin, let's take a look at the data. I've imported `new_york.json` and printed the following information:
* The number of tweets.
* The columns, or features, of a tweet.
* The text of the tweet at index 12 (the 13th tweet) in the New York dataset.



In [1]:
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
print(len(new_york_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[12]["text"])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts


Let's load the London tweets and the Paris tweets in the same way:


In [2]:
london_tweets = pd.read_json("london.json", lines=True)
print(len(london_tweets))
print(london_tweets.columns)
print(london_tweets.loc[12]["text"])

5341
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'extended_tweet', 'quote_count',
       'reply_count', 'retweet_count', 'favorite_count', 'entities',
       'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities'],
      dtype='object')
I saw this on the BBC and thought you should see it:

The precious metal sparking a new gold rush - https://t.co/ScW4MOSobZ


In [4]:
paris_tweets = pd.read_json("paris.json", lines=True)
print(len(paris_tweets))
print(paris_tweets.columns)
print(paris_tweets.loc[12]["text"])

2510
Index(['created_at', 'id', 'id_str', 'text', 'source', 'truncated',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'display_text_range',
       'extended_entities', 'possibly_sensitive', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status', 'quoted_status_permalink',
       'extended_tweet'],
      dtype='object')
Hauts-de-Seine : l’incendie d’Issy-les-Moulineaux prive aussi 16 000 foyers de courant https://t.co/Hlb02Fpliy


There are 4723 New York tweets, 5341 London tweets, and 2510 Paris tweets.
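
As a small illustration of what `lines=True` does in the cells above: it tells pandas to treat every line of the input as a separate JSON object. The sketch below uses two made-up records held in memory instead of the real tweet files, so the field values and the name `toy_tweets` are hypothetical.

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines format: one JSON object per line.
raw = (
    '{"text": "Hello from NYC", "lang": "en"}\n'
    '{"text": "Bonjour de Paris", "lang": "fr"}\n'
)

toy_tweets = pd.read_json(io.StringIO(raw), lines=True)

print(len(toy_tweets))            # number of records
print(list(toy_tweets.columns))   # column names taken from the JSON keys
print(toy_tweets.loc[1]["text"])  # .loc uses the index label, so label 1 is the second row
```

Because `.loc` is label-based and the default index starts at 0, `loc[12]` in the cells above selects the 13th row of each dataset.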

# Classifying using language: Naive Bayes Classifier

Begin by collecting the text of all the tweets into one big list. I also need a label for each tweet that records where it came from: `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet.

In [5]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

# Making a Training and Test Set

Now let's divide the data into a training set and a test set: the training set is used to build the model, and the test set is used to validate its accuracy. To do this, I call the `train_test_split` function from the scikit-learn library. The `test_size` parameter is set to 0.2, meaning that 20% of the data goes into the test set; `random_state` is set to 1, so that the split is the same on every run.


In [51]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(
    all_tweets, labels, test_size=0.2, train_size=0.8, random_state=1
)

print(len(train_data), len(test_data))


10059 2515
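
To make the role of `random_state` concrete, here is a minimal sketch on made-up data. The names `toy_texts` and `toy_labels` are hypothetical and stand in for `all_tweets` and `labels`; the point is only that two calls with the same seed return the same split.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten short "tweets" with made-up labels.
toy_texts = [f"tweet {i}" for i in range(10)]
toy_labels = [0] * 5 + [1] * 5

split_a = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=1)
split_b = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=1)

train_a, test_a, train_labels_a, test_labels_a = split_a

print(len(train_a), len(test_a))  # 8 training items and 2 test items
print(split_a == split_b)         # the same seed reproduces the same split
```

Changing or omitting `random_state` would generally give a different shuffle, which makes results harder to compare between runs.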


# Making the Count Vectors

A Naive Bayes classifier works on word counts, so the tweets must first be transformed into count vectors. To do this, I use `CountVectorizer` and fit it on `train_data` only. Then I transform both `train_data` and `test_data` with the fitted vectorizer.

In [52]:
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()
counter.fit(train_data)

# Show the column index that each word corresponds to
print(counter.vocabulary_)





In [21]:
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)
print("train_counts example:", train_counts[3].toarray())
# Each entry is the number of times one vocabulary word (learned by counter.fit(train_data)) appears in the tweet at index 3.

print("test_counts example:", test_counts[3].toarray())
# The printed array is abbreviated and shows only zeros, but the entries for vocabulary words that occur in the tweet are greater than 0.

train_counts example: [[0 0 0 ... 0 0 0]]
test_counts example: [[0 0 0 ... 0 0 0]]
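
Since the real count vectors are too wide to read, a tiny example may help show what `CountVectorizer` produces. This is a sketch on a made-up two-sentence corpus (`toy_docs` is hypothetical, not taken from the tweet data), using the default settings as in the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, not taken from the tweet data.
toy_docs = ["I love New York", "London calling London"]

toy_counter = CountVectorizer()
toy_counter.fit(toy_docs)

# Map each vocabulary word to its column index, ordered by index.
vocab = {word: int(idx) for word, idx in toy_counter.vocabulary_.items()}
print(sorted(vocab.items(), key=lambda item: item[1]))
# [('calling', 0), ('london', 1), ('love', 2), ('new', 3), ('york', 4)]

toy_counts = toy_counter.transform(toy_docs)
print(toy_counts.toarray())
# [[0 0 1 1 1]
#  [1 2 0 0 0]]

# Count the nonzero entries per row without building the dense array.
print(toy_counts.getnnz(axis=1))  # [3 2]
```

Note that "I" does not appear in the vocabulary: with default settings the vectorizer lowercases the text and ignores single-character tokens.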


# Train and Test the Naive Bayes Classifier

Now I can build the Naive Bayes classifier with `MultinomialNB`, predict a label for each test tweet, and get the predicted probability of each label for each tweet. The classifier returns the label with the highest predicted probability.

In [26]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)
predictions_probability = classifier.predict_proba(test_counts)
print("The predicted label:", predictions)
print("The predicted probability of each label given a point:", predictions_probability)


The predicted label: [0 2 1 ... 1 0 1]
The predicted probability of each label given a point: [[5.60874743e-01 4.38700750e-01 4.24507229e-04]
 [9.98658102e-02 3.39169867e-02 8.66217203e-01]
 [3.72800477e-01 4.25489611e-01 2.01709912e-01]
 ...
 [4.94619438e-01 5.05378554e-01 2.00847546e-06]
 [9.87982751e-01 1.20172490e-02 3.08725885e-10]
 [3.07114355e-01 6.37773425e-01 5.51122201e-02]]
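
The link between the two outputs above can be checked directly: each row of probabilities sums to 1, and the predicted label is the class whose column holds the largest value. The sketch below shows this on a made-up four-sentence training set; all `toy_` names and the sentences themselves are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: label 0 = English, label 1 = French.
toy_docs = [
    "good morning city",
    "morning coffee city",
    "bonjour la ville",
    "bonjour le café",
]
toy_labels = [0, 0, 1, 1]

toy_counter = CountVectorizer()
toy_train_counts = toy_counter.fit_transform(toy_docs)

toy_classifier = MultinomialNB()
toy_classifier.fit(toy_train_counts, toy_labels)

new_counts = toy_counter.transform(["bonjour la city", "morning coffee"])
toy_probabilities = toy_classifier.predict_proba(new_counts)
toy_predictions = toy_classifier.predict(new_counts)

# Each row sums to 1; column j holds the probability of classes_[j].
print(toy_probabilities.sum(axis=1))

# Picking the most probable column reproduces predict().
argmax_labels = toy_classifier.classes_[np.argmax(toy_probabilities, axis=1)]
print(argmax_labels, toy_predictions)
```

The same reading applies to the real output: in the first row shown above, the largest probability is in column 0, and the first predicted label is `0`.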


# Evaluating the Model

Evaluate the model using the accuracy score, precision score, recall score, F1 score, and confusion matrix. The precision, recall, and F1 scores are micro-averaged across the three classes.

In [42]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix 

accuracy = accuracy_score(test_labels, predictions)
print("Accuracy score:", accuracy)

# Store each result under its own name so the imported functions are not overwritten.
precision = precision_score(test_labels, predictions, average='micro')
print("Precision Score:", precision)

recall = recall_score(test_labels, predictions, average='micro')
print("Recall Score:", recall)

f1 = f1_score(test_labels, predictions, average='micro')
print("F1 Score:", f1)

matrix = confusion_matrix(test_labels, predictions)
print("Confusion matrix:", matrix)

Accuracy score: 0.6779324055666004
Precision Score: 0.6779324055666004
Recall Score: 0.6779324055666004
F1 Score: 0.6779324055666004
Confusion matrix: [[541 404  28]
 [203 824  34]
 [ 38 103 340]]


The accuracy of 0.6779 means that the classifier labels about 67.8% of the test tweets correctly. Because every tweet has exactly one label and the scores are computed with `average='micro'`, the micro-averaged precision, recall, and F1 score are necessarily equal to the accuracy, so they add no information beyond it here. The diagonal of the confusion matrix holds the correctly classified tweets (541, 824, and 340), and it is the largest entry in every row. Overall, the classifier does considerably better than always guessing London, the largest class, which would be right for only 1061 of the 2515 test tweets (about 42%).
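
To illustrate how micro averaging behaves, the sketch below compares it with accuracy and with macro averaging on a handful of made-up labels (`y_true` and `y_pred` are hypothetical, not the tweet predictions). For a single-label multiclass problem the micro-averaged scores coincide with accuracy, while the macro average, which weights every class equally, can differ.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true and predicted labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
micro_precision = precision_score(y_true, y_pred, average="micro")
micro_recall = recall_score(y_true, y_pred, average="micro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_precision = precision_score(y_true, y_pred, average="macro")

# All four values are 4/6: four of the six predictions are correct.
print(accuracy, micro_precision, micro_recall, micro_f1)

# Per-class precisions are 1, 1/2, and 2/3, so the macro average is 13/18.
print(macro_precision)
```

If the goal is to see how individual cities fare, `average="macro"` or `average=None` would be more informative choices than `average="micro"`.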

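Specifically, reading each row of the confusion matrix as the true city and each column as the predicted city, tweets that were actually from New York are predicted as either New York (541) or London (404), but almost never Paris (28). Similarly, London tweets are rarely labeled as Paris (34). Predictions of Paris are the most reliable: only 62 of the 402 tweets labeled Paris came from another city, although 141 of the 481 true Paris tweets are still assigned to New York or London. Tweets from the two English-speaking cities are harder to tell apart than tweets written in different languages.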
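
The per-city picture can be read off the confusion matrix with a few lines of NumPy. This sketch copies the matrix from the output above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels
# (0 = New York, 1 = London, 2 = Paris).
matrix = np.array([
    [541, 404, 28],
    [203, 824, 34],
    [38, 103, 340],
])

correct = np.diag(matrix)
recall_per_class = correct / matrix.sum(axis=1)     # share of each true city that was found
precision_per_class = correct / matrix.sum(axis=0)  # share of each predicted city that was right

for city, p, r in zip(["New York", "London", "Paris"], precision_per_class, recall_per_class):
    print(f"{city}: precision={p:.3f}, recall={r:.3f}")

print("accuracy:", correct.sum() / matrix.sum())
```

New York has the lowest recall because many of its tweets are labeled London, while Paris has the highest precision because few English-language tweets are labeled Paris.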

# Test Tweet

Now, let's test the tweet "I go to UCLA!" in English and in French.

In [47]:
tweet = ["I go to UCLA!"]
tweet_counts=counter.transform(tweet)
result=classifier.predict(tweet_counts)

In [49]:
if result[0] == 0: 
    print("The tweet comes from New York.")
elif result[0] == 1: 
    print("The tweet comes from London.")
else: 
    print("The tweet comes from Paris.")

The tweet comes from New York.


In [50]:

tweet = ["je vais à UCLA!"]
tweet_counts=counter.transform(tweet)
result=classifier.predict(tweet_counts)
if result[0] == 0: 
    print("The tweet comes from New York.")
elif result[0] == 1: 
    print("The tweet comes from London.")
else: 
    print("The tweet comes from Paris.")

The tweet comes from Paris.
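
To see which words the classifier actually receives for these two test tweets, the default analyzer of `CountVectorizer` can be applied on its own. The sketch below also shows a dictionary lookup as a compact alternative to the `if`/`elif` chain; `CITY_NAMES` and `describe` are hypothetical helpers, not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default analyzer lowercases text and keeps only tokens of two or more
# word characters, so single-character words such as "I" and "à" are dropped.
analyzer = CountVectorizer().build_analyzer()

english_tokens = analyzer("I go to UCLA!")
french_tokens = analyzer("je vais à UCLA!")
print(english_tokens)  # ['go', 'to', 'ucla']
print(french_tokens)   # ['je', 'vais', 'ucla']

# Hypothetical helper: map a numeric label to the sentence printed above.
CITY_NAMES = {0: "New York", 1: "London", 2: "Paris"}


def describe(label):
    """Return the output sentence for a numeric city label."""
    return f"The tweet comes from {CITY_NAMES[int(label)]}."


print(describe(2))  # The tweet comes from Paris.
```

With the fitted objects from the notebook, `describe(result[0])` would print the same sentences as the cells above. Any token that is not in the fitted vocabulary is ignored by `counter.transform`, so the prediction depends only on the remaining words.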
