# Naive Bayes Project: Classify Tweets Based on Location

This project is provided by Codecademy as an off-site project. The goal is to classify tweets by the location they were sent from. The provided datasets contain tweets from three cities: New York, London, and Paris.

We will train our model using the Naive Bayes algorithm, which learns from a labeled data set how often each vocabulary word appears under each label. When a new tweet is introduced, the algorithm uses Bayes' Theorem to estimate the probability that the tweet came from each city, treating the words as independent of one another given the city (the "naive" assumption). It then assigns the unlabeled tweet to the city with the highest probability.
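
To make the description above concrete, here is a minimal from-scratch sketch of the idea. The training sentences are made up for illustration and are not taken from the project datasets, and the helper names (`log_posterior`, `predict`) are hypothetical. It uses add-one (Laplace) smoothing so that unseen words do not zero out a probability. A library implementation may differ in its details.

```python
import math
from collections import Counter

# Toy training data (invented for illustration only)
train = [
    ("subway delayed again", "New York"),
    ("love the subway pizza", "New York"),
    ("tube delayed again", "London"),
    ("rainy day on the tube", "London"),
    ("le métro est en retard", "Paris"),
    ("bonjour paris", "Paris"),
]

# Count how many tweets each label has and how often each word appears per label
doc_counts = Counter()
word_counts = {}
for text, label in train:
    doc_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(text.split())

# Vocabulary: every distinct word seen during training
vocab = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label):
    """Unnormalized log P(label | text) with add-one smoothing."""
    # Prior: share of training tweets carrying this label
    log_prob = math.log(doc_counts[label] / sum(doc_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in text.split():
        # Likelihood of each word given the label, smoothed
        likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
        log_prob += math.log(likelihood)
    return log_prob

def predict(text):
    """Return the label with the highest posterior score."""
    return max(doc_counts, key=lambda label: log_posterior(text, label))

print(predict("the tube is delayed"))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer tweets.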

To begin, we import the pandas library and our datasets, and make initial observations about the data.

In [1]:
import pandas as pd

# Read in dataframe
new_york_tweets = pd.read_json('new_york.json', lines=True)

# Print out some important information to get a feel for the data
print(len(new_york_tweets))
print(new_york_tweets.columns)
new_york_tweets.head()

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')


Unnamed: 0,created_at,id,id_str,text,display_text_range,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,...,lang,timestamp_ms,extended_tweet,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_entities,withheld_in_countries
0,2018-07-26 13:32:33+00:00,1022474755625164800,1022474755625164800,@DelgadoforNY19 Calendar marked.,"[16, 32]","<a href=""http://twitter.com/download/android"" ...",False,1.022208e+18,1.022208e+18,8.290618e+17,...,en,2018-07-26 13:32:33.060,,,,,,,,
1,2018-07-26 13:32:34+00:00,1022474762491183104,1022474762491183104,petition to ban more than one spritz of cologne,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,en,2018-07-26 13:32:34.697,,,,,,,,
2,2018-07-26 13:32:35+00:00,1022474765750226945,1022474765750226944,People really be making up beef with you in th...,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,en,2018-07-26 13:32:35.474,,,,,,,,
3,2018-07-26 13:32:36+00:00,1022474768736546816,1022474768736546816,30 years old.. wow what a journey... I moved t...,,"<a href=""http://instagram.com"" rel=""nofollow"">...",True,,,,...,en,2018-07-26 13:32:36.186,{'full_text': '30 years old.. wow what a journ...,0.0,,,,,,
4,2018-07-26 13:32:36+00:00,1022474769260838913,1022474769260838912,At first glance it looked like asparagus with ...,,"<a href=""http://twitter.com/download/iphone"" r...",True,,,,...,en,2018-07-26 13:32:36.311,{'full_text': 'At first glance it looked like ...,,,,,,,


In [2]:
# Repeat for the London data
london_tweets = pd.read_json('london.json', lines=True)
print(len(london_tweets))
print(london_tweets.columns)
london_tweets.head()

5341
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'extended_tweet', 'quote_count',
       'reply_count', 'retweet_count', 'favorite_count', 'entities',
       'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities'],
      dtype='object')


Unnamed: 0,created_at,id,id_str,text,display_text_range,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,...,retweeted,filter_level,lang,timestamp_ms,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_entities
0,2018-07-26 13:39:30+00:00,1022476504855400449,1022476504855400448,@bbclaurak i agree Laura but the Party you see...,"[11, 140]","<a href=""http://twitter.com/download/iphone"" r...",True,1.022447e+18,1.022447e+18,61183570.0,...,False,low,en,2018-07-26 13:39:30.109,,,,,,
1,2018-07-26 13:39:30+00:00,1022476506075942912,1022476506075942912,@masturbacaolove Why?,"[17, 21]","<a href=""http://twitter.com/download/iphone"" r...",False,1.021997e+18,1.021997e+18,9.003777e+17,...,False,low,und,2018-07-26 13:39:30.400,,,,,,
2,2018-07-26 13:39:31+00:00,1022476510089949190,1022476510089949184,@JackRobinson80 @pgroresearch Yeah not great b...,"[30, 65]","<a href=""http://twitter.com/download/iphone"" r...",False,1.022444e+18,1.022444e+18,735563300.0,...,False,low,en,2018-07-26 13:39:31.357,,,,,,
3,2018-07-26 13:39:33+00:00,1022476519845883905,1022476519845883904,Penalty shit out Arsenal,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,False,low,en,2018-07-26 13:39:33.683,,,,,,
4,2018-07-26 13:39:36+00:00,1022476532684648448,1022476532684648448,Obviously need some pen practice 🙈,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,False,low,en,2018-07-26 13:39:36.744,,,,,,


In [3]:
# And again for the Paris data
paris_tweets = pd.read_json('paris.json', lines=True)
print(len(paris_tweets))
print(paris_tweets.columns)
paris_tweets.head()

2510
Index(['created_at', 'id', 'id_str', 'text', 'source', 'truncated',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'display_text_range',
       'extended_entities', 'possibly_sensitive', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status', 'quoted_status_permalink',
       'extended_tweet'],
      dtype='object')


Unnamed: 0,created_at,id,id_str,text,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,...,lang,timestamp_ms,display_text_range,extended_entities,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_tweet
0,2018-07-27 17:40:45+00:00,1022899608396156928,1022899608396156928,Bulletin météo parisien : des grêlons énormes ...,"<a href=""http://twitter.com/download/android"" ...",False,,,,,...,fr,2018-07-27 17:40:45.854,,,,,,,,
1,2018-07-27 17:40:47+00:00,1022899613550956544,1022899613550956544,Prêt pour le match #USORCL https://t.co/V5jw0S...,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,fr,2018-07-27 17:40:47.083,"[0, 26]","{'media': [{'id': 1022899599336525825, 'id_str...",0.0,,,,,
2,2018-07-27 17:40:50+00:00,1022899626041651200,1022899626041651200,MAIS QOIDBDNND'SLS'SLSLLSLS''D DBDODNDNODJDBKD...,"<a href=""http://twitter.com/download/android"" ...",False,,,,,...,in,2018-07-27 17:40:50.061,"[0, 111]","{'media': [{'id': 1022899571884744706, 'id_str...",0.0,,,,,
3,2018-07-27 17:40:57+00:00,1022899655347249152,1022899655347249152,@ToursFC Où peut on le championnat de National...,"<a href=""http://twitter.com/download/android"" ...",False,1.022888e+18,1.022888e+18,978599220.0,978599220.0,...,fr,2018-07-27 17:40:57.048,"[9, 50]",,,,,,,
4,2018-07-27 17:40:57+00:00,1022899656685223936,1022899656685223936,Les tismey ils sont bas qu’a tromper leur go e...,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,fr,2018-07-27 17:40:57.367,,,,,,,,


Two quick takeaways from these initial observations. First, all we really need is the `text` column; columns such as the timestamp and id aren't needed for this task, and we will create our own target column with the locations. Second, the New York and London tweets shown are mostly in English, while the Paris tweets are mostly in French (makes sense, right?), although the samples above also include a London tweet tagged `und` and a Paris tweet tagged `in`. Our initial hypothesis, then, is that Paris tweets should be easy to distinguish from New York and London tweets, while New York and London tweets might be quite similar to each other.
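
The language observation rests on only five rows per city, so it is worth checking against the `lang` column. The sketch below tabulates language shares. To stay runnable without the JSON files, it uses a stand-in list holding the five `lang` values shown by `paris_tweets.head()`, and `language_shares` is a hypothetical helper name. In the notebook, the same function could be given `paris_tweets['lang'].tolist()`.

```python
from collections import Counter

# Stand-in for paris_tweets['lang'].tolist(): the five values visible in the head() output
paris_lang_sample = ["fr", "fr", "in", "fr", "fr"]

def language_shares(langs):
    """Map each language code to its share of the list, most common first."""
    counts = Counter(langs)
    total = len(langs)
    return {lang: n / total for lang, n in counts.most_common()}

shares = language_shares(paris_lang_sample)
print(shares)
```

On the full data, a large share of non-French codes for Paris would weaken the hypothesis that Paris tweets are easy to separate.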

Let's set up our data in a form that the Naive Bayes algorithm can work with.

In [5]:
# Create lists of each 'text' column
new_york_text = new_york_tweets['text'].tolist()
london_text = london_tweets['text'].tolist()
paris_text = paris_tweets['text'].tolist()

# Combine lists together to get all text
all_text = new_york_text + london_text + paris_text

# Build the label for each tweet: 0 = New York, 1 = London, 2 = Paris
target = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

# Print out length of each list to make sure they match
print(len(all_text))
print(len(target))

12574
12574
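
Matching lengths confirm that every tweet has a label, but they say nothing about how the labels are distributed. As a further sanity check, the sketch below rebuilds `target` from the three dataset sizes printed earlier (so it runs on its own) and reports each class's share. The `sizes` dictionary is introduced here for illustration only.

```python
from collections import Counter

# Dataset sizes printed earlier in the notebook
sizes = {"New York": 4723, "London": 5341, "Paris": 2510}

# Rebuild the label list the same way as the cell above
# (0 = New York, 1 = London, 2 = Paris)
target = [0] * sizes["New York"] + [1] * sizes["London"] + [2] * sizes["Paris"]

# Count how many tweets carry each label and print each class's share
label_counts = Counter(target)
for label, count in sorted(label_counts.items()):
    print(label, count, f"{count / len(target):.1%}")
```

Paris accounts for roughly a fifth of the tweets, so the classes are not balanced. That is worth keeping in mind when reading accuracy figures later on.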
