In this lab we're going to classify cities by content, based on a sample of tweets. This is the same case study we did in lab 2.2, where we used a Logistic Regression classifier from *scikit-learn*. This time we're going to solve the same problem with an MLP (multi-layer perceptron) classifier built with *tensorflow*.

As before, this lab shows the simpler version. If you want to dig into the details, have a look at the examples in the *text_analytics* package.

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd

ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


Let's load the tweets that we need for each city. We're only reading the first 65,000 rows of the corpus to avoid memory issues.

In [2]:
file = "sociolinguistics.english_cities.gz"
file = os.path.join(ai.data_dir, file)
df = pd.read_csv(file, nrows=65000)
print(df)
print("Done!")

             City                                               Text
0      washington   you really need to go back to bar tending or ...
1          london   jay finley christ in explains why today is co...
2           lagos   forget if this happened truly it s definitely...
3         toronto   yall i love this skin big thanks to for makin...
4         nairobi   the late brilliant prof ali mazrui explains h...
...           ...                                                ...
64995     toronto   taxing issues federal government to charge gs...
64996   bengaluru   this is revealing of americans are held hosta...
64997      mumbai   pumps in pumping station did not work in my s...
64998     atlanta   why is it that some ppl can speak amp act w i...
64999     phoenix   you netflix in already ready for more seasons...

[65000 rows x 2 columns]
Done!


And now our data is loaded. Let's look again at how many tweets we have for each city.

In [3]:
freqs = ai.print_labels(df, "City")
for city in freqs:
    print(city, freqs[city])

washington 3263
london 4154
lagos 3057
toronto 3268
nairobi 2505
melbourne 1125
perth 946
chicago 3614
dallas 3013
chennai 1230
delhi 2506
mumbai 2608
atlanta 2787
bengaluru 2528
seattle 1046
johannesburg 2772
adelaide 968
miami 1990
calgary 1177
boston 3672
phoenix 1096
auckland 899
brisbane 1054
karachi 1282
new_york 3042
los_angeles 2500
san_francisco 3113
singapore 1743
kolkata 1273
syndney 769
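
The cities are not equally represented, which matters when we judge a classifier later. As a rough sketch, here is one way to turn the counts above into a baseline: how often would we be right if we always guessed the most common city? The dictionary is copied from the output above (labels exactly as they appear in the data); the baseline idea is our own illustration, not something the *text_analytics* package computes.

```python
# Tweet counts per city, copied from the output above.
city_counts = {
    "washington": 3263, "london": 4154, "lagos": 3057, "toronto": 3268,
    "nairobi": 2505, "melbourne": 1125, "perth": 946, "chicago": 3614,
    "dallas": 3013, "chennai": 1230, "delhi": 2506, "mumbai": 2608,
    "atlanta": 2787, "bengaluru": 2528, "seattle": 1046, "johannesburg": 2772,
    "adelaide": 968, "miami": 1990, "calgary": 1177, "boston": 3672,
    "phoenix": 1096, "auckland": 899, "brisbane": 1054, "karachi": 1282,
    "new_york": 3042, "los_angeles": 2500, "san_francisco": 3113,
    "singapore": 1743, "kolkata": 1273, "syndney": 769,
}

total = sum(city_counts.values())
top_city = max(city_counts, key=city_counts.get)

# Accuracy of a hypothetical classifier that always guesses the biggest city.
baseline = city_counts[top_city] / total

print(f"{len(city_counts)} cities, {total} tweets")
print(f"Always guessing '{top_city}' is right {baseline:.1%} of the time")
```

With the DataFrame itself, `df["City"].value_counts()` gives the same counts directly in pandas. Any classifier worth keeping should do much better than this baseline.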


So now we have (1) our data from each city and (2) our content vectorizer (phrases and TF-IDF weights from the larger digital corpora). We're going to classify these samples by city. The basic code is below; it just calls our *text_analytics* package. That package splits the data into training and testing sets, learns a classifier, and then evaluates that classifier. We're telling the package to use "City" as the ground-truth class, with content features.

Let's look at the vocab features.

In [4]:
# Alternative (disabled here): fit the phrase and TF-IDF models on this data
#ai.fit_phrases(df, min_count=1, language="en")
#ai.fit_tfidf(df, n_features = 10000)

# Load the serialized phrase and TF-IDF models from the data directory
file_phrases = os.path.join(ai.data_dir, "sociolinguistics.english_all.gz")
ai.phrases = ai.deserialize("phrases", file_phrases + ".phrases.json")
ai.tfidf_vectorizer = ai.deserialize("tfidf_model", file + ".tfidf.json")

print("PHRASES:")
example_phrases = list(ai.phrases.phrasegrams.keys())
print(example_phrases[:12])

print("\nVOCAB")
example_vocab = list(ai.tfidf_vectorizer.vocabulary_.keys())
print(example_vocab[:12])

PHRASES:
['united_states', 'boris_johnson', 'chuck_schumer', 'ricky_gervais', 'rashida_tlaib', 'hillary_clinton', 'devin_nunes', 'sherlock_holmes', 'edinson_cavani', 'climate_change', 'harry_potter', 'happy_birthday']

VOCAB
['people', 'new', 'today', 'need', 'want', 'video', 'life', 'world', 'year', 'please', 'watch', 'thanks']
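
To make the two feature types above more concrete, here is a minimal sketch of what phrases and TF-IDF weights do. It is not the *text_analytics* implementation: the toy documents, the phrase list, and the helper names `merge_phrases` and `tfidf` are all made up for illustration, and it uses the plain textbook TF-IDF formula. Real libraries usually apply smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy inputs, not taken from the lab corpus.
phrases = {("climate", "change"), ("happy", "birthday")}
docs = [
    "people need to talk about climate change today",
    "happy birthday to the best people",
    "watch this new video about climate change",
]

def merge_phrases(tokens, phrases):
    """Join adjacent words that form a known phrase, e.g. climate_change."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokenized = [merge_phrases(doc.split(), phrases) for doc in docs]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: (count / length) * log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

# Weights for the second document, highest first.
weights = tfidf(tokenized[1])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:15s} {weight:.3f}")
```

Terms that appear in only one document, like `happy_birthday`, get a higher weight than terms like `people` that show up in several. That is the point of TF-IDF: words that are everywhere tell us less about where a tweet comes from.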


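What's different here is the classifier: last time we used a *scikit-learn* model, and this time we're going to use an MLP. The *scikit-learn* model trains all at once. But an MLP trains over time, in repeated passes over the training data called epochs, so we'll see the model's progress as it learns.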
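
If you're curious what an MLP looks like in *tensorflow* itself, here is a minimal sketch using the Keras API. It is not the code inside *text_analytics*; it only illustrates the idea of stacked dense layers trained over several epochs. The data is random toy data, and the two hidden layers of 500 units are our guess at what `layers = [500, 500]` in the next cell stands for.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in data (an assumption for illustration): 300 random "documents"
# with 100 features each, spread over 3 made-up classes.
rng = np.random.default_rng(0)
X = rng.random((300, 100)).astype("float32")
y = rng.integers(0, 3, size=300)

# Two hidden layers of 500 units, then one output unit per class.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Each epoch is one full pass over the training data.
history = model.fit(X, y, epochs=5, batch_size=32, verbose=2)
print(history.history["loss"])
```

Because the toy labels are random, the network has nothing real to learn here. The point is only to see the epoch-by-epoch loop that produces the progress lines.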

In [5]:
report = ai.mlp(df, labels = "City", features = "content", layers = [500, 500])
print(report)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
               precision    recall  f1-score   support

     adelaide       0.76      0.91      0.83       100
      atlanta       0.94      0.91      0.92       263
     auckland       0.94      0.95      0.95        85
    bengaluru       0.90      0.96      0.93       272
       boston       0.94      0.93      0.93       374
     brisbane       0.89      0.75      0.81       115
      calgary       0.94      0.97      0.95       109
      chennai       0.98      0.95      0.96       124
      chicago       0.96      0.83      0.89       371
       dallas       0.90      0.94      0.91       293
        delhi       0.91      0.95      0.93       241
 johannesburg       1.00      1.00      1.00    

**Be patient**

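And there we go! We're looking at the classifier's precision, recall, and F1 score for each city. The support column is the number of test tweets that actually come from that city.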
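
If the columns in the report are unfamiliar, the sketch below shows how they can be computed by hand. The helper names and the tiny set of predictions are made up for illustration. The last line checks the formula against the adelaide row from the report above.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # missed instances of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall), tp + fn

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions for six test tweets.
y_true = ["perth", "perth", "perth", "miami", "miami", "delhi"]
y_pred = ["perth", "perth", "miami", "miami", "perth", "delhi"]

print(per_class_scores(y_true, y_pred, "perth"))

# The adelaide row in the report: precision 0.76, recall 0.91.
print(round(f1_score(0.76, 0.91), 2))
```

Precision asks "when the model says perth, how often is it right?", while recall asks "of the real perth tweets, how many did the model find?". The F1 score balances the two.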

Your numbers will differ a bit from the lecture, because we're using random train/test splits. That means the classifier sees different data each time. If you want more advanced examples for how to solve this city classification problem, take a look at the *text_analytics.mlp()* function.
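
To see why random splits change the results, here is a small sketch of a train/test split with an optional seed. The function `random_split` and its `test_fraction` default are hypothetical. We haven't checked whether *text_analytics* lets you set a seed, so treat this as an illustration of the idea rather than of the package.

```python
import random

def random_split(rows, test_fraction=0.1, seed=None):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)  # seed=None gives a different shuffle each run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-ins for 20 tweets

train_a, test_a = random_split(rows, seed=1)
train_b, test_b = random_split(rows, seed=1)
train_c, test_c = random_split(rows)

print("seed=1:        ", sorted(test_a))
print("seed=1 again:  ", sorted(test_b))
print("no seed:       ", sorted(test_c))
```

With the same seed, the same tweets land in the test set every time, so the scores are repeatable. Without a seed, the test set changes from run to run, and so do the numbers in the report.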