# Project: Classifying Tweets

In this off-platform project, I will use a Naive Bayes Classifier to find patterns in real tweets. I have been given three files: `new_york.json`, `london.json`, and `paris.json`. These three files contain tweets that have been gathered from those locations.

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

# Investigate the Data

To begin, let's take a look at the data. I've imported `new_york.json`, and printed the following information:
* The number of tweets.
* The columns, or features, of a tweet.
* The text of the tweet at index 12 (the 13th tweet) in the New York dataset.

In [1]:
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
print(len(new_york_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[12]["text"])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts


The above output gives a good overview of the data contained in the JSON file: the number of tweets, the available columns, and an example tweet.

In the code block below, I will also load the London and Paris tweets into DataFrames named `london_tweets` and `paris_tweets`.

In [2]:
london_tweets = pd.read_json("london.json", lines=True)
print(len(london_tweets))
paris_tweets = pd.read_json("paris.json", lines=True)
print(len(paris_tweets))

5341
2510


# Classifying using language: Naive Bayes Classifier

I will grab the text of all of the tweets and make it one big list. First I will grab the text from each location, then combine these into one variable named `all_tweets`.

Next I will make the labels associated with those tweets. `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet.

In [6]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

print(len(all_tweets))
print(len(labels))

12574
12574


# Making a Training and Test Set

In [8]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, test_size=0.2, random_state=1)
print(len(train_data))
print(len(test_data))

10059
2515


# Making the Count Vectors

Next I will transform the lists of tweets into count vectors.

First, I will create a `CountVectorizer` and call its `.fit()` method using `train_data` as a parameter. This teaches the counter the vocabulary of the training set only.

Finally, I will transform `train_data` and `test_data` into count vectors.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()
counter.fit(train_data)

train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

print(train_data[3])
print(train_counts[3])

saying bye is hard. Especially when youre saying bye to comfort.
  (0, 5022)	2
  (0, 6371)	1
  (0, 9552)	1
  (0, 12314)	1
  (0, 13903)	1
  (0, 23994)	2
  (0, 27146)	1
  (0, 29397)	1
  (0, 30274)	1
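
Each line of the output above pairs a vocabulary index with how often that word appears in the tweet. The index numbers depend on the fitted vocabulary, so they are hard to read on their own. The sketch below recounts the same tweet using only the standard library. It assumes a tokenizer that lowercases the text and keeps tokens of two or more word characters, which is meant to approximate the `CountVectorizer` defaults; `tokenize` is a hypothetical helper, not part of scikit-learn.

```python
import re
from collections import Counter

# Assumed approximation of CountVectorizer's default tokenization:
# lowercase the text, keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Hypothetical helper: split a tweet into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tweet = "saying bye is hard. Especially when youre saying bye to comfort."
word_counts = Counter(tokenize(tweet))

# Print the words alphabetically, followed by their counts.
for word in sorted(word_counts):
    print(word, word_counts[word])
```

The tweet has nine distinct words, two of which ("bye" and "saying") appear twice, which lines up with the nine entries and the two counts of `2` in the sparse output.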


# Train and Test the Naive Bayes Classifier

I now have the inputs to the classifier. Let's use the count vectors to train and test the Naive Bayes Classifier!

In [10]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)
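
To make the idea behind the classifier more concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one smoothing. It is not scikit-learn's implementation. The toy tweets, the `tokenize`, `train` and `predict` helpers, and the `alpha` parameter are all made up for illustration. Each class gets a log prior from its share of the documents, and each word gets a smoothed log likelihood per class.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Hypothetical tokenizer: lowercase, tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def train(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))
    vocab = {word for counts in word_counts.values() for word in counts}

    model = {}
    for label, n_docs in docs_per_class.items():
        # Smoothing adds alpha to every word count in the vocabulary.
        denominator = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denominator)
                for word in vocab
            },
        }
    return model


def predict(model, text):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in tokenize(text):
            # Words never seen in training are skipped.
            if word in params["log_likelihood"]:
                score += params["log_likelihood"][word]
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data invented for this sketch: 0 = New York, 1 = London, 2 = Paris.
toy_tweets = [
    "the subway was packed today",
    "bagels in brooklyn",
    "the tube was delayed again",
    "tea by the thames",
    "bonjour le café est ouvert",
    "le métro est en retard",
]
toy_labels = [0, 0, 1, 1, 2, 2]

toy_model = train(toy_tweets, toy_labels)
print(predict(toy_model, "delayed on the tube"))
print(predict(toy_model, "le café"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.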

# Evaluating The Model

Now that the classifier has made its predictions, let's see how well it did.

In [11]:
from sklearn.metrics import accuracy_score

print(accuracy_score(test_labels, predictions))

0.6779324055666004


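The above result shows that the model was just under 68% accurate. For a three-class problem that uses nothing but word counts this is a reasonable result, although a single accuracy figure does not show which cities get confused with each other.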
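
One rough way to put that accuracy in context is to compare it with a classifier that always guesses the most common city. The sketch below computes that baseline from the three dataset sizes printed earlier. It uses the full dataset rather than the test split, so the exact figure for the test set may differ slightly; `dataset_sizes` is a name introduced here for illustration.

```python
# Tweet counts printed when the three files were loaded.
dataset_sizes = {"new_york": 4723, "london": 5341, "paris": 2510}

total_tweets = sum(dataset_sizes.values())
majority_city = max(dataset_sizes, key=dataset_sizes.get)

# Accuracy of a hypothetical classifier that always predicts the largest class.
baseline_accuracy = dataset_sizes[majority_city] / total_tweets

print(majority_city, round(baseline_accuracy, 3))
```

Always guessing London would be right a little over 42% of the time, so the Naive Bayes model is clearly picking up on more than class frequency.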

Another way I can evaluate my model is by looking at the **confusion matrix**.

In [12]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(test_labels, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]


The above shows how the model's predictions compare with the true labels. Each row corresponds to a true label and each column to a predicted label, in the order `new_york`, `london`, `paris`.

The first row covers the New York tweets: the model correctly predicted 541 of them but classified 404 as London and 28 as Paris. The second row, for London, shows a similar pattern, with 824 correct, 203 classified as New York and 34 as Paris. This is as expected: it is harder to distinguish between two English-speaking locations, while very few tweets from either city are classified as Paris, a predominantly French-speaking location.
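
The same matrix can be turned into per-city numbers. The sketch below copies the matrix from the output above and computes overall accuracy plus recall (the share of a city's tweets that were found) and precision (the share of predictions for a city that were right). The variable names are chosen here for illustration.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels.
matrix = [
    [541, 404, 28],
    [203, 824, 34],
    [38, 103, 340],
]
cities = ["new_york", "london", "paris"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total
print("accuracy", round(accuracy, 4))

recalls = {}
precisions = {}
for i, city in enumerate(cities):
    row_total = sum(matrix[i])                  # tweets truly from this city
    col_total = sum(row[i] for row in matrix)   # tweets predicted as this city
    recalls[city] = matrix[i][i] / row_total
    precisions[city] = matrix[i][i] / col_total
    print(city, "recall", round(recalls[city], 3),
          "precision", round(precisions[city], 3))
```

The diagonal sums to 1705 of 2515 test tweets, which reproduces the accuracy reported earlier. Recall is lowest for New York (roughly 56%), reflecting how often New York tweets are mistaken for London ones.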

# Test My Own Tweet

Now to write some of my own tweets to see how they are classified. 

In [13]:
tweet = "Man, I love the tubes, the parks and the palaces!"

tweet_counts = counter.transform([tweet])
print(classifier.predict(tweet_counts))

[1]


In [14]:
tweet_2 = "Bagels, pizza and hotdogs! Man I love this place!"

tweet_counts_2 = counter.transform([tweet_2])
print(classifier.predict(tweet_counts_2))

[0]


In [15]:
tweet_3 = "Bonjour!! Le premier café est égal à zéro tasse."

tweet_counts_3 = counter.transform([tweet_3])
print(classifier.predict(tweet_counts_3))

[2]
