# Topic Modeling and Natural Language Processing with Twitter Data

##  Jason Anastasopoulos
##  December 4, 2018
### Email: ljanastas@uga.edu

The code below provides a brief introduction to acquiring Twitter data with the Twitter API via Python. For this exercise I will acquire Donald Trump's tweets and try to figure out what the topics of his tweets are, using the Latent Dirichlet Allocation (LDA) topic model.

In [2]:
import os, re, csv
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim import corpora, models
import gensim
import numpy as np
import scipy
import math
import matplotlib.pyplot as plt
import plotly.plotly as py
import twitter
import json

Here we enter our Twitter credentials. These can be acquired through a Twitter developer account.

In [3]:
api = twitter.Api(consumer_key='',
                      consumer_secret='',
                      access_token_key='',
                      access_token_secret='')

print(api.VerifyCredentials())

{"created_at": "Fri Dec 19 19:38:51 +0000 2008", "description": "Microsoft Visiting Professor @PrincetonCITP\u25aa\ufe0fAssistant Professor @UGA_PA_Policy & Political Science \u25aa\ufe0f political economy\u25aa\ufe0f machine learning\u25aa\ufe0f causal inference", "favourites_count": 1108, "followers_count": 581, "friends_count": 502, "geo_enabled": true, "id": 18249358, "id_str": "18249358", "lang": "en", "listed_count": 33, "location": "Athens, GA", "name": "Jason Anastasopoulos", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_banner_url": "https://pbs.twimg.com/profile_banners/18249358/1535050189", "profile_image_url": "http://pbs.twimg.com/profile_images/1065001203887570944/5hA3pVIK_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1065001203887570944/5hA3pVIK_normal.jpg", "profile
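Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the four values are stored in environment variables. The variable names and the helper `load_twitter_credentials` are hypothetical and not part of the `twitter` library:

```python
import os

# Hypothetical environment variable names. The keys of this mapping match the
# keyword arguments passed to twitter.Api in the cell above.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token_key": "TWITTER_ACCESS_TOKEN_KEY",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_twitter_credentials(environ=None):
    """Return a dict of credentials read from the environment."""
    environ = os.environ if environ is None else environ
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(sorted(missing)))
    return {arg: environ[name] for arg, name in ENV_NAMES.items()}
```

With the variables set, the client could then be created with `twitter.Api(**load_twitter_credentials())`.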

Search the Twitter API using keywords.

In [5]:
search = api.GetSearch("puppies") # Replace "puppies" with your search term
for tweet in search:
    print(tweet.id, tweet.text)
    
len(search)

(1070149834643070976, u'Giuliani Releases Statement on Flynn Sentencing: It\'s Like "Spitting on a Sidewalk" ...and Mueller Team "Are Sick P\u2026 https://t.co/iJXHDoCTn4')
(1070012903544250373, u'My friend in Germany sent me a picture of his latest litter of German shepherd puppies to see if I would like one f\u2026 https://t.co/MeM7naOII0')
(1070075275184984064, u'RT if you think puppies are cute.\n\n#ProBowlVote\n\n@LarryFitzgerald\n@P2\n@chanjones55\n@DavidJohnson31\n@AndyLee4\u2026 https://t.co/jiYM7CTpYw')
(1070483211778703360, u'Free Dog \u276f\u276f https://t.co/gkGrlupOI9 \u276e\u276e #Dogs #Puppies #DogFinder #AdoptADog https://t.co/FxOMeZBttK')
(1070483185975185408, u'RT @jikooksexts: jungkook: *accidentally sends jimin a picture of 7 puppies*\n\njungkook: oops wrong number\n\njimin: wait i want one\n\njungkook\u2026')
(1070483179843153920, u'RT @Drebae_: Ursula scammed a mermaid out of her voice and soul &amp; then stole sis man.\n\nCruella De Vill wanted to skin 100 puppie

15

The Python twitter library has many useful functions, which you can learn about through the help() function.

In [6]:
help(api.GetUserTimeline)


Help on method GetUserTimeline in module twitter.api:

GetUserTimeline(self, user_id=None, screen_name=None, since_id=None, max_id=None, count=None, include_rts=True, trim_user=False, exclude_replies=False) method of twitter.api.Api instance
    Fetch the sequence of public Status messages for a single user.
    
    The twitter.Api instance must be authenticated if the user is private.
    
    Args:
      user_id (int, optional):
        Specifies the ID of the user for whom to return the
        user_timeline. Helpful for disambiguating when a valid user ID
        is also a valid screen name.
      screen_name (str, optional):
        Specifies the screen name of the user for whom to return the
        user_timeline. Helpful for disambiguating when a valid screen
        name is also a user ID.
      since_id (int, optional):
        Returns results with an ID greater than (that is, more recent
        than) the specified ID. There are limits to the number of
        Tweets which c
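The `max_id` argument listed in the help text can be used to page backwards through a timeline. Twitter returns at most 200 tweets per timeline request, so collecting more requires several calls. The sketch below shows the paging logic only. `collect_timeline` and the `fetch_page` callable are hypothetical names introduced for illustration, and the default of 16 pages assumes the commonly documented limit of roughly 3,200 retrievable tweets per timeline:

```python
def collect_timeline(fetch_page, max_pages=16):
    """Page backwards through a timeline using max_id.

    fetch_page(max_id) must return a list of dicts with an "id" key;
    max_id=None requests the most recent page.
    """
    collected = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break  # no older tweets are available
        collected.extend(page)
        # max_id is inclusive, so subtract 1 to ask only for older tweets.
        max_id = min(status["id"] for status in page) - 1
    return collected


# Possible use with the authenticated `api` object from earlier (not run here):
# fetch = lambda max_id: [s.AsDict() for s in api.GetUserTimeline(
#     screen_name="BerkeleyISchool", count=200, max_id=max_id)]
# tweets = collect_timeline(fetch)
```

Passing the fetch step in as a function keeps the paging logic separate from the network call, which makes it easy to try out with fake data.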

In [24]:
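# Note: the timeline endpoint returns at most 200 tweets per request,
# so asking for 5000 does not return 5000 tweets.
t = api.GetUserTimeline(screen_name="BerkeleyISchool", count=5000)
# Convert each Status object to a plain dictionary
tweets = [i.AsDict() for i in t]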

tweettext = [i["text"] for i in tweets]

len(tweettext)

tweettext[0:10]

[u'RT @CLTCBerkeley: Check out this incredible research news from @BerkeleyISchool! I Schoolers developed a custom-fit earpiece that can captu\u2026',
 u"\U0001f44b\U0001f3fd Hey I School friends! Do you have a @kaggle account? Go vote for @BerkeleyData student Rob Mulla's kernal! It wo\u2026 https://t.co/a7u2GkwYoi",
 u'Yes! Buckland\'s "Information and Society" is a concise and easy to understand intro to Information Science, and rec\u2026 https://t.co/cmVTyPlxgq',
 u'RT @UCBStartup: The @hultprize @UCBerkeley campus round is underway! Joint @BerkeleyISchool &amp; @BerkeleyHaas   team Roadmap, pitching their\u2026',
 u"Did you miss the TUI Showcase Monday? It's on again today! \n\nStop by 202 South Hall before 11am to see and try out\u2026 https://t.co/qLs086q8rU",
 u'Prof Steve Weber: Trump Administration Has Elevated China\u2019s Role in International Property Theft to a National Leve\u2026 https://t.co/iNVbvNo8bi',
 u"Mark you calendars! Next week alumna @kingjen will return to So

## Cleaning the text

In [25]:
######## So far so good, now let's clean this up ###
tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
texts = []

Here we pre-process the text by creating a tokenizer that splits the documents into tokens (words), a list of stop words, and a "stemmer" that reduces words to their stems (e.g. by removing "-ing" endings). In the loop below we also remove numbers, single-character tokens, URL fragments and the retweet marker "rt".
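To make the individual steps concrete, here is a minimal sketch of the same filtering applied to a single made-up tweet, using only the standard library. `TOY_STOP_WORDS` and `preprocess` are hypothetical names for this illustration. The stop word list is a tiny stand-in for the one returned by `get_stop_words('en')`, and stemming is left out because it requires the Porter stemmer from nltk:

```python
import re

# Tiny illustrative stop word list (an assumption for this sketch); the
# notebook itself uses the much longer list from get_stop_words('en').
TOY_STOP_WORDS = {"a", "an", "and", "are", "is", "of", "that", "the", "this", "to"}


def preprocess(text, stop_words=TOY_STOP_WORDS):
    """Tokenize, lowercase and filter one document; returns a list of tokens."""
    tokens = re.findall(r"\w+", text)                  # same pattern as the tokenizer
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if not t.isdigit()]    # drop pure numbers
    tokens = [t for t in tokens if len(t) > 1]         # drop single characters
    tokens = [t for t in tokens if "http" not in t]    # drop URL scheme fragments
    tokens = [t for t in tokens if t not in {"rt", "co"}]
    return [t for t in tokens if t not in stop_words]


# Made-up example text, not a real tweet.
print(preprocess("RT @UCBStartup: The campus round is underway! https://t.co/abc123"))
# ['ucbstartup', 'campus', 'round', 'underway', 'abc123']
```

Note that the shortened-link identifier (`abc123` here) survives these filters, so such tokens can end up in the vocabulary unless they are removed separately.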

In [26]:
for tweet in tweettext:
    # clean and tokenize document string
    tokens = tokenizer.tokenize(tweet)
    # remove all numbers (\w+ tokens never carry a sign, so isdigit() suffices)
    tokens = [x for x in tokens if not x.isdigit()]
    # remove single-character tokens and lowercase the rest
    tokens = [x.lower() for x in tokens if len(x) > 1]
    # remove URL fragments ("https", "co") and the retweet marker "rt"
    tokens = [x for x in tokens if 'http' not in x]
    tokens = [x for x in tokens if x not in ('rt', 'co')]
    # remove stop words from tokens
    stopped_tokens = [x for x in tokens if x not in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(x) for x in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)

dictionaryall = corpora.Dictionary(texts)

corpusall = [dictionaryall.doc2bow(text) for text in texts]

texts[0]



[u'cltcberkeley',
 u'check',
 u'incred',
 u'research',
 u'news',
 u'berkeleyischool',
 u'schooler',
 u'develop',
 u'custom',
 u'fit',
 u'earpiec',
 u'can',
 u'captu']

In [27]:
tweettext[0:10]

[u'RT @CLTCBerkeley: Check out this incredible research news from @BerkeleyISchool! I Schoolers developed a custom-fit earpiece that can captu\u2026',
 u"\U0001f44b\U0001f3fd Hey I School friends! Do you have a @kaggle account? Go vote for @BerkeleyData student Rob Mulla's kernal! It wo\u2026 https://t.co/a7u2GkwYoi",
 u'Yes! Buckland\'s "Information and Society" is a concise and easy to understand intro to Information Science, and rec\u2026 https://t.co/cmVTyPlxgq',
 u'RT @UCBStartup: The @hultprize @UCBerkeley campus round is underway! Joint @BerkeleyISchool &amp; @BerkeleyHaas   team Roadmap, pitching their\u2026',
 u"Did you miss the TUI Showcase Monday? It's on again today! \n\nStop by 202 South Hall before 11am to see and try out\u2026 https://t.co/qLs086q8rU",
 u'Prof Steve Weber: Trump Administration Has Elevated China\u2019s Role in International Property Theft to a National Leve\u2026 https://t.co/iNVbvNo8bi',
 u"Mark you calendars! Next week alumna @kingjen will return to So

The preprocessing loop above tokenizes each tweet, removes numbers and stop words, and stems the remaining tokens. The cleaned token lists are then turned into a gensim dictionary and a bag-of-words corpus, ready for analysis with Latent Dirichlet Allocation.
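The `doc2bow` call used earlier turns each token list into a list of (token id, count) pairs. The sketch below mimics that output format in plain Python to show what the corpus looks like. `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign need not match the ones gensim would choose:

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Made-up token lists for illustration.
toy_texts = [["student", "research", "student"], ["research", "news"]]
vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(tokens, vocab) for tokens in toy_texts]
print(vocab)       # {'student': 0, 'research': 1, 'news': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only the counts per document remain, which is all the topic model uses.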

## Estimating the Latent Dirichlet Allocation model

In [30]:
# generate LDA model
ldamodelall = gensim.models.ldamodel.LdaModel(corpusall, num_topics=3, 
                                              id2word = dictionaryall, passes=20,
                                              minimum_probability=0)



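The code above estimates a topic model on the tweets collected earlier. Note that `num_topics=3` here, so this run fits 3 topics; set `num_topics=5` to estimate a 5-topic model.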
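It can help to see what LDA assumes about the data. Each document gets its own topic proportions drawn from a Dirichlet distribution; each word is then produced by first picking a topic from those proportions and then picking a word from that topic's word distribution. The sketch below simulates this generative story with the standard library. The topics and probabilities are made up, and this is not how gensim fits the model, since fitting works in the opposite direction, from observed words back to topics:

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document under the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)           # document's topic proportions
    topics = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]  # topic for this word
        vocab = list(topic_word[z])
        probs = [topic_word[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Made-up topics for illustration only.
toy_topics = [
    {"student": 0.5, "school": 0.3, "research": 0.2},
    {"join": 0.6, "today": 0.4},
]
rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, words)
```

The `theta` vector plays the same role as the per-tweet topic distribution that the fitted model reports for a document.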

In [31]:
print(ldamodelall.print_topics(num_topics=5, num_words=5))

[(0, u'0.009*"ucberkeley" + 0.008*"thank" + 0.007*"school" + 0.007*"student" + 0.007*"work"'), (1, u'0.012*"prof" + 0.010*"berkeleyischool" + 0.010*"ucberkeley" + 0.010*"student" + 0.010*"today"'), (2, u'0.015*"us" + 0.015*"join" + 0.011*"student" + 0.011*"amp" + 0.010*"today"')]


This prints the top 5 words for each topic. Because the model above was fit with 3 topics, only 3 topics appear in the output.

## What are the topics here?

The labels below appear to come from a 5-topic run on Donald Trump's tweets rather than from the 3-topic @BerkeleyISchool output printed above.

### Topic 1: will, great, state, want, peopl -- Label: Immigration
### Topic 2: presid, will, trump, two, mueller -- Label: Russia Investigations I
### Topic 3: presid, look, bush, argentina, first -- Label: Events
### Topic 4: will, china, border, can, start -- Label: International Relations/Policies
### Topic 5: great, year, america, make, thank -- Label: Self Congratulation

## Print out the distribution over topics for a tweet

In [32]:
tweettext[20]

u'RT @Blum_Center: Read about @BigIdeasContest winner @Dosteducation - breaking the cycle of #illiteracy by empowering parents https://t.co/W\u2026'

In [33]:
ldamodelall[corpusall[20]]

[(0, 0.030524886398311704),
 (1, 0.93839353996770924),
 (2, 0.031081573633978985)]
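The output is a list of (topic id, probability) pairs, so labeling a tweet amounts to picking the pair with the largest probability. A minimal sketch, where `dominant_topic` is a hypothetical helper and the numbers are copied from the output above:

```python
def dominant_topic(distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])


# Distribution copied from the output above.
tweet_topics = [
    (0, 0.030524886398311704),
    (1, 0.93839353996770924),
    (2, 0.031081573633978985),
]
topic_id, prob = dominant_topic(tweet_topics)
print(topic_id, round(prob, 3))  # 1 0.938
```

Here the tweet is assigned to topic 1 with about 94% of the probability mass; a tweet with a flatter distribution would be harder to label with a single topic.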

## Breakout Session Exercise

1. Collect and process all tweets from the Berkeley School of Information account: @BerkeleyISchool.

2. Estimate a 5-topic topic model and label each of the topics.

3. Label one of the tweets using the topic distribution. 