In this lab we're going to classify tweets by city, using content-based features. This is the same case study we saw in the section. We start by loading our dependencies.

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd

ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


Let's load the tweets that we need for each city.

In [2]:
file = "sociolinguistics.english_cities.gz"
file = os.path.join(ai.data_dir, file)
df = pd.read_csv(file)
print(df)
print("Done!")

              City                                               Text
0       washington   you really need to go back to bar tending or ...
1           london   jay finley christ in explains why today is co...
2            lagos   forget if this happened truly it s definitely...
3          toronto   yall i love this skin big thanks to for makin...
4          nairobi   the late brilliant prof ali mazrui explains h...
...            ...                                                ...
150025     chennai   finally indian women team make victory in las...
150026     chennai   syngene international ltd calls for board mee...
150027     chennai   true no one takes you seriously such a senile...
150028     chennai   do you really need to test your site to impro...
150029     chennai   one becomes father of the nation and another ...

[150030 rows x 2 columns]
Done!


Our content features are fit to the data in two steps: first, we learn phrases using pointwise mutual information (PMI); second, we learn TF-IDF weights by counting words across documents. Both steps can take some time, so we have pre-loaded the models we need. You can still run the fitting code below if you want to do it yourself (remove the comment markers first). Check out the *text_analytics* package to see this in more detail.

In [3]:
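# Uncomment the next two lines to fit the models yourself (this can be slow)
#ai.fit_phrases(df, min_count=1, language="en")
#ai.fit_tfidf(df, n_features = 10000)

# Otherwise, load the pre-fit phrase and TF-IDF models from disk
# (`file` is the tweet file path defined in the previous cell)
file_phrases = os.path.join(ai.data_dir, "sociolinguistics.english_all.gz")
ai.phrases = ai.deserialize("phrases", file_phrases + ".phrases.json")
ai.tfidf_vectorizer = ai.deserialize("tfidf_model", file + ".tfidf.json")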

print("PHRASES:")
example_phrases = list(ai.phrases.phrasegrams.keys())
print(example_phrases[:12])

print("\nVOCAB")
example_vocab = list(ai.tfidf_vectorizer.vocabulary_.keys())
print(example_vocab[:12])

PHRASES:
['united_states', 'boris_johnson', 'chuck_schumer', 'ricky_gervais', 'rashida_tlaib', 'hillary_clinton', 'devin_nunes', 'sherlock_holmes', 'edinson_cavani', 'climate_change', 'harry_potter', 'happy_birthday']

VOCAB
['people', 'new', 'today', 'need', 'want', 'video', 'life', 'world', 'year', 'please', 'watch', 'thanks']
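
As a rough illustration of the two fitting steps described above, here is a minimal standard-library sketch on a toy corpus. It is not how *text_analytics* works internally: the toy sentences, the `pmi` and `tfidf` helpers, and the plain textbook formulas (a log2 ratio for PMI, term frequency times log inverse document frequency for TF-IDF) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; stands in for the tweets in df["Text"])
docs = [
    "climate change is real",
    "climate change affects the world",
    "the world is big",
    "people want change",
]
tokenized = [d.split() for d in docs]

# --- Step 1: PMI for adjacent word pairs ---
unigrams = Counter(w for doc in tokenized for w in doc)
bigrams = Counter(pair for doc in tokenized for pair in zip(doc, doc[1:]))
n_unigrams = sum(unigrams.values())
n_bigrams = sum(bigrams.values())

def pmi(w1, w2):
    """log2 of how much more often (w1, w2) occurs than chance would predict.

    Only call this for pairs that were actually observed in the corpus.
    """
    p_pair = bigrams[(w1, w2)] / n_bigrams
    p_w1 = unigrams[w1] / n_unigrams
    p_w2 = unigrams[w2] / n_unigrams
    return math.log2(p_pair / (p_w1 * p_w2))

# --- Step 2: TF-IDF weights for one document ---
n_docs = len(tokenized)
doc_freq = Counter(w for doc in tokenized for w in set(doc))

def tfidf(doc):
    """Term frequency in this document times log inverse document frequency."""
    counts = Counter(doc)
    return {
        w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
        for w, c in counts.items()
    }

weights = tfidf(tokenized[0])

print("PMI(climate, change):", round(pmi("climate", "change"), 3))
print("PMI(world, is):      ", round(pmi("world", "is"), 3))
print("TF-IDF for doc 0:    ", {w: round(v, 3) for w, v in weights.items()})
```

Pairs with a high PMI are candidates for merging into a single phrase token such as `climate_change`. Words that appear in few documents (here `real`) get a higher TF-IDF weight than words that appear in many (here `change`).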


So now we have (1) our data from Twitter and (2) our full content vectorizer (TF-IDF + PMI phrases). We're going to classify the tweets by city. The basic code is below; it just calls our *text_analytics* package. We first extract the feature matrix to see what it looks like, and then run the classifier. The package splits the data into training and testing sets, learns a classifier, and then evaluates the classifier. We're telling the package to use "City" as the ground-truth class, with content features.

In [4]:
x, vocab_size = ai.get_features(df, features = "content")
print(x)
print(vocab_size)

  (0, 9726)	0.0749937420567999
  (0, 9658)	0.07497708911780718
  (0, 9090)	0.07364115740517338
  (0, 8973)	0.0742041893027956
  (0, 8645)	0.07289270227746461
  (0, 8327)	0.07222838937187062
  (0, 8291)	0.0724010225801125
  (0, 8092)	0.0721626588101253
  (0, 8043)	0.0721103367495788
  (0, 7998)	0.07186492724429036
  (0, 7980)	0.07187771706160526
  (0, 7853)	0.07153716713149491
  (0, 7660)	0.07067434669250264
  (0, 7623)	0.07080215164329921
  (0, 7439)	0.07051366281361417
  (0, 7078)	0.06972018033040071
  (0, 6990)	0.06974151596970049
  (0, 6941)	0.1399358018127286
  (0, 6928)	0.06908731221807107
  (0, 6552)	0.06875858350367162
  (0, 6512)	0.06856372184379768
  (0, 6508)	0.06797932641569877
  (0, 6464)	0.06798852610842035
  (0, 6230)	0.06718801231956584
  (0, 6206)	0.0675637123540012
  :	:
  (150029, 48)	0.023706245439530375
  (150029, 47)	0.02252345595296407
  (150029, 46)	0.04476143061578029
  (150029, 45)	0.02245941325645736
  (150029, 40)	0.06812059389201963
  (150029, 37)	0.02148327
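
The printout above appears to list only the non-zero entries of the feature matrix, one per line, as `(row, column)` followed by a weight: the row is the tweet's index in `df` and the column is a feature index. The sketch below reproduces that layout for a tiny hand-made matrix using only the standard library. The `dense` values and the `nonzero_entries` helper are hypothetical and not part of *text_analytics*.

```python
# A tiny dense matrix: 3 "tweets" (rows) by 4 "features" (columns).
# The values are made up for illustration.
dense = [
    [0.0, 0.5, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.1],
]

def nonzero_entries(matrix):
    """Return ((row, col), value) for every non-zero cell, in row order."""
    return [
        ((r, c), v)
        for r, row in enumerate(matrix)
        for c, v in enumerate(row)
        if v != 0.0
    ]

entries = nonzero_entries(dense)
for (r, c), v in entries:
    print(f"  ({r}, {c})\t{v}")
```

Most tweets use only a handful of the vocabulary items, so storing just the non-zero cells is far more compact than storing every cell of the full matrix.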

In [5]:
report = ai.shallow_classification(df, labels = "City", features = "content", classifier = "lm")
print(report)

               precision    recall  f1-score   support

     adelaide       0.92      0.92      0.92       512
      atlanta       0.92      0.92      0.92       502
     auckland       0.99      0.99      0.99       528
    bengaluru       0.96      0.93      0.95       503
       boston       0.94      0.95      0.94       514
     brisbane       0.88      0.87      0.87       515
      calgary       1.00      0.98      0.99       529
      chennai       0.99      1.00      0.99       491
      chicago       0.92      0.91      0.91       498
       dallas       0.93      0.93      0.93       501
        delhi       0.93      0.95      0.94       487
 johannesburg       1.00      1.00      1.00       490
      karachi       1.00      1.00      1.00       505
      kolkata       0.96      0.95      0.95       481
        lagos       1.00      1.00      1.00       501
       london       0.97      0.97      0.97       509
  los_angeles       0.86      0.92      0.89       490
    melbo

**Be patient**

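And there we go! The report shows the classifier's precision, recall, and F1-score for each city, along with the number of test samples for that city (support).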
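
To make the columns of the report concrete, here is a minimal sketch that computes precision, recall, F1-score, and support per class from a handful of made-up predictions. The labels and the `per_class_scores` helper are hypothetical; the package's own report may be produced differently.

```python
from collections import Counter

# Made-up ground truth and predictions for six tweets
y_true = ["london", "london", "lagos", "lagos", "lagos", "delhi"]
y_pred = ["london", "lagos", "lagos", "lagos", "delhi", "delhi"]

def per_class_scores(true_labels, pred_labels):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    support = Counter(true_labels)      # how many samples truly belong to each class
    predicted = Counter(pred_labels)    # how many samples were assigned to each class
    correct = Counter(t for t, p in zip(true_labels, pred_labels) if t == p)

    scores = {}
    for label in sorted(support):
        # Precision: of the samples predicted as this class, how many were right?
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        # Recall: of the samples truly in this class, how many did we find?
        recall = correct[label] / support[label]
        # F1: harmonic mean of precision and recall
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1, support[label])
    return scores

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1, n) in scores.items():
    print(f"{label:>10}  {p:.2f}  {r:.2f}  {f1:.2f}  {n}")
```

A city with high precision but lower recall (like `london` in this toy data) is rarely predicted by mistake, but some of its tweets are being assigned to other cities.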

These numbers will differ slightly from the lecture because we're using random train/test splits, so the classifier sees different data each time. If you want more advanced examples of how to solve this city classification problem, take a look at the *text_analytics.shallow_classification()* function.
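
To see why a random split changes the results from run to run, and how a fixed seed makes a split repeatable, here is a minimal standard-library sketch. The `train_test_split_indices` helper, the 10% test fraction, and the seed values are assumptions for illustration; whether *text_analytics* exposes a seed is not shown in this lab.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.1, seed=None):
    """Shuffle row indices and split them into (train, test) lists.

    With seed=None the shuffle differs on every call; with a fixed seed
    the same split is produced each time.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

# Same seed: identical splits, so evaluation results are reproducible
train_a, test_a = train_test_split_indices(100, seed=42)
train_b, test_b = train_test_split_indices(100, seed=42)
print("Same seed, same test set:", test_a == test_b)

# No seed: the test set will generally differ between runs
train_c, test_c = train_test_split_indices(100)
print("Test rows in an unseeded run:", sorted(test_c))
```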