# Predicting Where Tweets Come From with Naive Bayes

I will use a Naive Bayes Classifier to find patterns in real tweets. I have three files: `new_york.json`, `london.json`, and `paris.json`. These files contain tweets that I gathered from those locations.

- The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

### Investigate the Data

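To begin, let's take a look at the data. The cell below loads the New York tweets and prints the number of tweets, the available columns, and the text of one example tweet.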

In [1]:
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
print(len(new_york_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[12]["text"])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts


The block below loads the London and Paris tweets into DataFrames named `london_tweets` and `paris_tweets`.

In [2]:
london_tweets = pd.read_json('london.json', lines=True)
paris_tweets = pd.read_json('paris.json', lines=True)

print('There are {} London tweets'.format(len(london_tweets)))
print('There are {} Paris tweets'.format(len(paris_tweets)))

There are 5341 London tweets
There are 2510 Paris tweets


# Classifying using language: Naive Bayes Classifier

We're going to create a Naive Bayes Classifier! Let's begin by looking at the way language is used differently in these three locations. First, we grab the text of all of the tweets from each city and put it into a list: `new_york_text`, `london_text`, and `paris_text`.

Then we combine all three lists into a variable named `all_tweets` by using the `+` operator.

We also need the labels associated with those tweets. `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet. The `labels` list repeats each label once per tweet, in the same order as `all_tweets`.

In [4]:
# Convert each DataFrame's tweet text into a plain Python list
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets['text'].tolist()
paris_text = paris_tweets['text'].tolist()

# Build the data and labels for the Naive Bayes classifier
# (0 = New York, 1 = London, 2 = Paris)
all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

### Making a Training and Test Set

We can now break our data into a training set and a test set. We'll use scikit-learn's `train_test_split` function to do this split.

In [5]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, random_state=1)

print('The length of train data is:', len(train_data))
print('The length of test data is:', len(test_data))

The length of train data is: 9430
The length of test data is: 3144
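
The two lengths printed above can be reproduced by hand. The sketch below assumes that `train_test_split` was called with its default settings, where 25% of the samples (rounded up) go to the test set, and it uses the tweet counts printed earlier in this notebook.

```python
import math

# Tweet counts printed earlier in this notebook.
n_new_york, n_london, n_paris = 4723, 5341, 2510
n_total = n_new_york + n_london + n_paris

# Assumption: no test_size or train_size was passed, so the default
# test fraction of 0.25 applies and the test size is rounded up.
test_fraction = 0.25
n_test = math.ceil(test_fraction * n_total)
n_train = n_total - n_test

print(n_total, n_train, n_test)  # 12574 9430 3144
```

Passing an explicit `test_size` to `train_test_split` would change these numbers; `random_state=1` only fixes which tweets land in each set, not how many.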


### Making the Count Vectors

To use a Naive Bayes Classifier, we need to transform our lists of tweets into count vectors. Conceptually, this changes the sentence `"I love New York, New York"` into a list that contains:

* Two `1`s because the words `"I"` and `"love"` each appear once.
* Two `2`s because the words `"New"` and `"York"` each appear twice.
* Many `0`s because every other word in the training set didn't appear at all.

Note that with `CountVectorizer`'s default settings the text is lowercased and single-character tokens are dropped, so in practice `"I"` is not counted.
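
To make the idea concrete, here is a minimal pure-Python sketch of a count vector. It is not the library's implementation: the `tokenize` and `count_vector` helpers and the three training sentences are hypothetical, chosen only for illustration. The tokenizer imitates `CountVectorizer`'s default behavior of lowercasing and keeping tokens of two or more characters.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Hypothetical training sentences, used only to build a small vocabulary.
training_sentences = [
    "I love New York, New York",
    "London calling",
    "Bonjour Paris",
]
vocabulary = sorted({tok for s in training_sentences for tok in tokenize(s)})

def count_vector(text):
    # One entry per vocabulary word: how often it occurs in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

print(vocabulary)
# ['bonjour', 'calling', 'london', 'love', 'new', 'paris', 'york']
print(count_vector("I love New York, New York"))
# [0, 0, 0, 1, 2, 0, 2]
```

With a real training set the vocabulary has tens of thousands of entries, which is why the vectorizer in the next cell stores its result as a sparse matrix that lists only the non-zero counts.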

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()

counter.fit(train_data)
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

print(train_data[3])
print(train_counts[3])

At Beckenham Hill for the 12.01 today. I had booked assistance &amp; had it confirmed. https://t.co/x1mq7G81rW
  (0, 6)	1
  (0, 165)	1
  (0, 2300)	1
  (0, 2883)	1
  (0, 2920)	1
  (0, 3596)	1
  (0, 4238)	1
  (0, 5896)	1
  (0, 6202)	1
  (0, 10145)	1
  (0, 11580)	2
  (0, 12095)	1
  (0, 12417)	1
  (0, 13299)	1
  (0, 25504)	1
  (0, 25947)	1
  (0, 28502)	1


### Train and Test the Naive Bayes Classifier

We now have the inputs to our classifier. Let's use the count vectors to train and test the Naive Bayes Classifier!

In [7]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

classifier.fit(train_counts, train_labels)

predictions = classifier.predict(test_counts)
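
For intuition about what `fit` and `predict` do, the following is a simplified pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's code: `train_nb`, `predict_nb`, and the toy tweets are hypothetical names and data, and words that were never seen in training are simply skipped, much as a fitted vectorizer ignores out-of-vocabulary words.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in docs_per_class.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in doc:
            # Words never seen during training carry no information.
            if tok in params["log_likelihood"]:
                score += params["log_likelihood"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy tweets: 0 = New York, 1 = London, 2 = Paris.
toy_docs = [["subway", "bagel"], ["tube", "pub"], ["metro", "baguette"]]
toy_labels = [0, 1, 2]
model = train_nb(toy_docs, toy_labels)

print(predict_nb(model, ["tube", "tonight"]))  # 1
```

The smoothing term `alpha` keeps a word that never appeared in one city's tweets from forcing that city's probability to zero.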

### Evaluating Your Model

Now that the classifier has made its predictions, let's see how well it did. There are two common ways to do this: overall accuracy, and a confusion matrix that breaks the errors down by class. We start with scikit-learn's `accuracy_score` function.

In [9]:
from sklearn.metrics import accuracy_score

print('The accuracy of the model is:', accuracy_score(test_labels, predictions))

The accuracy of the model is: 0.6835241730279898
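
To put this number in context, it helps to compare it with a trivial baseline that always guesses the most common city. The sketch below computes that baseline from the tweet counts printed earlier. It is computed on the full dataset as an approximation; the exact class shares in the test split may differ slightly.

```python
# Tweet counts printed earlier in this notebook.
class_counts = {"New York": 4723, "London": 5341, "Paris": 2510}
n_total = sum(class_counts.values())

# A baseline "classifier" that always predicts the most frequent city.
majority_city = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_city] / n_total

# Accuracy reported by accuracy_score above.
model_accuracy = 0.6835241730279898

print(majority_city, round(baseline_accuracy, 3))  # London 0.425
print(model_accuracy > baseline_accuracy)          # True
```

Under this rough comparison, the classifier does noticeably better than always guessing London.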


### Testing My Own Tweet

Looking at the errors by class, the classifier predicts tweets that were actually from New York as either New York tweets or London tweets, but almost never as Paris tweets. Similarly, the classifier rarely misclassifies the tweets that were actually from Paris. Tweets coming from two English-speaking cities are harder to distinguish than tweets written in different languages.
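
The per-class pattern described above is the kind of thing a confusion matrix makes visible. The sketch below builds one in pure Python from a handful of hypothetical labels (not the notebook's actual predictions); `toy_confusion_matrix` is an illustrative helper. Rows are the true city and columns are the predicted city, the same layout that scikit-learn's `confusion_matrix(test_labels, predictions)` returns.

```python
def toy_confusion_matrix(true_labels, predicted_labels, n_classes):
    # matrix[i][j] counts tweets whose true class is i and predicted class is j.
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(true_labels, predicted_labels):
        matrix[actual][predicted] += 1
    return matrix

# Hypothetical labels for eight tweets: 0 = New York, 1 = London, 2 = Paris.
toy_true = [0, 0, 0, 1, 1, 1, 2, 2]
toy_pred = [0, 1, 0, 1, 0, 1, 2, 2]

for row in toy_confusion_matrix(toy_true, toy_pred, n_classes=3):
    print(row)
# [2, 1, 0]
# [1, 2, 0]
# [0, 0, 2]
```

In this toy example the off-diagonal counts sit only in the New York and London cells, which mirrors the kind of confusion between the two English-speaking cities described above.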

In [10]:
tweet = ['This is how the brazillians are made: self motivation and strength']

tweet_counts = counter.transform(tweet)

print('0 = New York, 1 = London, 2 = Paris')
print('Your tweet prediction is:', classifier.predict(tweet_counts))

0 = New York, 1 = London, 2 = Paris
Your tweet prediction is: [1]
