# Twitter Analytics

+ Capture Twitter Data
+ Analyze Twitter Data by Applying Descriptive, Content, & Network Analytics Techniques
+ Present & Communicate The Results

<img src = "images\twitteranalytics.gif">

# 1. Twitter Data Collection

## Twitter App and Obtain OAuth Data

First, make sure you have a Twitter application (https://dev.twitter.com/). In the network analytics module, we created a Twitter app to use NodeXL (you can watch my video on this). You can reuse that app for this task. You need the following information:

- consumer_key
- consumer_secret 
- access_token 
- access_token_secret 

Second, the tweepy Python package is needed for this tutorial.

- **pip install tweepy**

## Stream Twitter data and save it in a JSON file

- adapted from http://adilmoujahid.com/posts/2014/07/twitter-analytics/

- Create a Python file (you may use **Notepad++** or PyCharm for this) and name it **twitter_streaming.py**.
- If you run the program from your terminal with the command **python twitter_streaming.py**, you will see data flowing like the picture below and also find a JSON file named **politics.json**.
- You can stop the program by pressing **Ctrl-C** or simply **closing the command prompt window**.
- We want to capture this data in a file that we will use later for the analysis. You can also do so by piping the output to a file with the following command: **python twitter_streaming.py > twitter_data.txt**

### If you receive an error "data must be a byte string"

Install pyOpenSSL 0.15.1

- Open a terminal in Mac or a command prompt in Windows
- then, type **pip install pyopenssl**

https://pypi.python.org/pypi/pyOpenSSL

This will fix the error.

# 2. Retrieve and Process Twitter Data

In [1]:
import json
 
with open('data/politics.json', 'r') as f:
    line = f.readline() # read only the first tweet/line
    tweet = json.loads(line) # load it as Python dictionary
    print(json.dumps(tweet, indent=4)) 
    
# the original data from Twitter looks like below.

IOError: [Errno 2] No such file or directory: 'data/politics.json'

## The key attributes are the following:

text: the text of the tweet itself

created_at: the date of creation

favorite_count, retweet_count: the number of favourites and retweets

favorited, retweeted: boolean stating whether the authenticated user (you) have favourited or retweeted this tweet

lang: acronym for the language (e.g. “en” for English)

id: the tweet identifier

place, coordinates, geo: geo-location information if available

user: the author’s full profile

entities: list of entities like URLs, @-mentions, hashtags and symbols

in_reply_to_user_id: user identifier if the tweet is a reply to a specific user

in_reply_to_status_id: status identifier if the tweet is a reply to a specific status

http://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/
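
To make the attribute list above concrete, here is a minimal sketch that reads a few of those fields from one tweet. The `sample_line` record is made up and heavily trimmed (it is not real data), and the `created_at` format string assumes the `Wed Aug 27 13:08:45 +0000 2008` layout; check both against your own file. The snippet is written to run on Python 2.7 and Python 3.

```python
import json
from datetime import datetime

# hypothetical, heavily trimmed tweet record (real tweets carry many more fields)
sample_line = ('{"id": 1, "created_at": "Wed Aug 27 13:08:45 +0000 2008", '
               '"text": "Hello #python via @amy", "lang": "en", '
               '"user": {"screen_name": "kevin"}, '
               '"entities": {"hashtags": [{"text": "python"}], '
               '"user_mentions": [{"screen_name": "amy"}], "urls": []}, '
               '"in_reply_to_user_id": null}')

tweet = json.loads(sample_line)

# created_at is a plain string; parse it if you need real date arithmetic
created = datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')

# entities are already parsed by Twitter, so no regular expression is needed here
hashtags = [h['text'] for h in tweet['entities']['hashtags']]
mentions = [m['screen_name'] for m in tweet['entities']['user_mentions']]

# .get() returns None instead of raising KeyError when a field is absent
reply_to = tweet.get('in_reply_to_user_id')

print(created.year)
print(hashtags)
print(mentions)
```

Reading hashtags and mentions from `entities` is an alternative to the regular expressions used later in this notebook; both approaches should give similar results on well-formed tweets.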

In [None]:
import json

# create an empty list to store our tweets in
data = []

# append each line of the data to our tweets list using the json module
with open('data/politics.json') as f:
    for line in f:
        try:
            data.append(json.loads(line))
        except ValueError:
            # skip blank or truncated lines that are not valid JSON
            pass

# lets see how many we got
print len(data)

In [None]:
# read the first five tweets and other meta data only
for i in data[:5]:
    print i

In [None]:
# read first five tweets only 
for i in data[:5]:
    print i['text']

In [None]:
# save the text of every tweet (without the other metadata) in a variable and count them

texts = [ T['text'] for T in data if 'text' in T ]
len(texts)

In [None]:
for T in data:
    if 'text' not in T:
        print T    

This is what happened: Twitter sets a limit on how many requests your Twitter app can make. The records printed above are the rate-limiting messages from Twitter. We searched for two terms, "hillary" and "trump", and so many tweets contain either name that we hit the rate limit set by Twitter. In the future, use a single-keyword search if the term is very popular on Twitter.

The solution is to remove those rate-limiting messages from your original data.

If your search term is not popular (e.g., supplychain, informationsystems, HR), you won't have this issue at all.
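
As a minimal sketch of that clean-up step, the snippet below splits parsed stream records into real tweets and rate-limit notices. The three `sample_lines` are made up for illustration, and the `{"limit": {"track": ...}}` shape is an assumption about how the notices look in your file, so compare it with the records printed above before relying on it.

```python
import json

# hypothetical stream output: two tweets and one rate-limit notice
sample_lines = [
    '{"id": 1, "text": "first tweet"}',
    '{"limit": {"track": 22, "timestamp_ms": "1470000000000"}}',
    '{"id": 2, "text": "second tweet"}',
]

records = [json.loads(line) for line in sample_lines]

# real tweets carry a "text" field; limit notices carry a "limit" field instead
sample_tweets = [r for r in records if 'text' in r]
limit_notices = [r for r in records if 'limit' in r]

# assumption: "track" is a running count of matching tweets that were not delivered,
# so the last notice gives the total number of missed tweets
missed = limit_notices[-1]['limit']['track'] if limit_notices else 0

print(len(sample_tweets))
print(missed)
```

Keeping the notices in their own list, rather than silently dropping them, lets you report how much of the stream was lost.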

In [None]:
for i in data[:5]:
    print i

In [None]:
# removing those 22 error messages
tweets = []
for T in data:
    if 'text' in T:
        tweets.append(T)
len(tweets)       
#now we have 8902 ... good!!!

In [None]:
# save screen_names

screen_names = [T['user']['screen_name'] for T in tweets]
len(screen_names)

In [None]:
# display screen_name, tweets

for i in tweets[:5]:
    print i['user']['screen_name'], i['text']

In [None]:
# More code for extracting information from tweets

ids = [T['id_str'] for T in tweets]
times = [T['created_at'] for T in tweets]
texts = [T['text'] for T in tweets]
screen_names = [T['user']['screen_name'] for T in tweets]
names = [T['user']['name'] for T in tweets]
lats = [(T['geo']['coordinates'][0] if T['geo'] else None) for T in tweets]
lons = [(T['geo']['coordinates'][1] if T['geo'] else None) for T in tweets]
place_names = [(T['place']['full_name'] if T['place'] else None) for T in tweets]
place_types = [(T['place']['place_type'] if T['place'] else None) for T in tweets]

# open an output csv file to write to
out = open('tweets_food.csv', 'w')

# write the header of our CSV as its first line
out.write('id,created at,text,screen name,name,lat,lon,place name,place type\n')

# merge each individual list into a single list using the zip function
rows = zip(ids, times, texts, screen_names, names, lats, lons, place_names, place_types)

# use the writer class from the csv module on our output file
from csv import writer
csv_writer = writer(out)

# take one value from each list in rows and write them to the csv as a new row
for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
    csv_writer.writerow(values)

# close our csv file when done
out.close()

# 3. Descriptive Analytics

In [None]:
# import popular packages
import csv
import pandas as pd

### Tweets per user

In [None]:
from collections import Counter

c = Counter(screen_names)
print c

In [None]:
# how many unique users in the data?
len(c)

In [None]:
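# how many tweets per user? (total tweets divided by unique users)
# convert to float before dividing; integer division would truncate the result in Python 2

float(len(screen_names)) / len(c)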

### Most active users

In [None]:
from collections import Counter

c = Counter(screen_names)
print c

In [None]:
# five most active tweeters
c.most_common(5)

In [None]:
# make it pretty
activetweeters = c.most_common(5)
activetweeters_df = pd.DataFrame(activetweeters)
activetweeters_df

### Popular languages

In [None]:
lang = [T['user']['lang'] for T in tweets if 'user' in T]

c = Counter(lang)
print c

In [None]:
# only English tweets & meta data
english = []
for i in tweets:
    if i['user']['lang'] == "en":
        english.append(i)
len(english)

In [None]:
# read first five English tweets only 
for i in english[:5]:
    print i['text']
    
# so now all English tweets are saved in english. 

### Who is sharing location information?

In [None]:
# how many tweets come from users who enabled geo-location in their profile?

geo = [T['user']['geo_enabled'] for T in tweets if 'user' in T]

c = Counter(geo)
print c

### Original tweets & Retweets

In [None]:
#remove retweets

originaltweets = []

for tweet in texts:
    if 'rt @' not in tweet.lower():
        originaltweets.append(tweet)
        
len(originaltweets)

In [None]:
# get retweets only

retweets_only = []

for tweet in texts:
    if 'rt @' in tweet.lower():
        retweets_only.append(tweet)
        
len(retweets_only)

In [None]:
for i in retweets_only[:5]:
    print i
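
The text rule above also flags tweets that merely quote "rt @" by hand. As a hedged alternative, and assuming your data uses the tweet format in which a native retweet embeds the original tweet under a `retweeted_status` key, the split can be done on the parsed tweets instead of on the text. The `sample` records below are made up for illustration.

```python
# hypothetical parsed tweets; only the fields needed here are included
sample = [
    {'text': 'RT @amy: big news', 'retweeted_status': {'text': 'big news'}},
    {'text': 'my own thoughts on this'},
    {'text': 'as they say, "rt @amy big news"'},  # manual quote, not a native retweet
]

# native retweets carry the original tweet under 'retweeted_status'
native_retweets = [t for t in sample if 'retweeted_status' in t]
originals = [t for t in sample if 'retweeted_status' not in t]

# for comparison: the text-based rule used in the cells above
text_rule_retweets = [t for t in sample if 'rt @' in t['text'].lower()]

print(len(native_retweets))
print(len(originals))
print(len(text_rule_retweets))
```

The two rules disagree on the third record, which is the kind of case worth inspecting by hand in your own data.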

### Most visible users 

In [None]:
for tweet in texts[:5]:
    print tweet

In [None]:
# first extract all users from tweets

#let's use regular expression ... how to use re ... https://docs.python.org/2/howto/regex.html
    
import re

for tweet in texts[:5]:
    print re.findall(r"(?<=@)\w+", tweet)

In [None]:
for tweet in texts[:5]:
    a = re.findall(r"(?<=@)\w+", tweet)
    for i in a:
        print '@'+i

In [None]:
visible_users = []

for tweet in texts:
    a = re.findall(r"(?<=@)\w+", tweet)
    for i in a:
        visible_users.append(['@'+i])

In [None]:
#compute frequency distribution for visible users in the tweets
import nltk

# each entry of visible_users is a one-element list such as ['@user'];
# pull out the name and lowercase it so @User and @user are counted together
visible_user_names = [user[0].lower() for user in visible_users]

fdist = nltk.FreqDist(visible_user_names)

fdist.most_common(10)

### URL metrics

In [None]:
for T in tweets[:10]:
    print T['entities']['urls']

In [None]:
for T in tweets:
    for i in T['entities']['urls']:
        print i['url']

In [None]:
urls = []

for T in tweets:
    for i in T['entities']['urls']:
        urls.append(i['url'])
        
urls

In [None]:
#top 10 urls ... visit some of them and find out what the articles are about

c = Counter(urls)
c.most_common(10)
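
The `url` field usually holds a shortened t.co link, which hides the site being shared. The sketch below assumes each entry in `entities['urls']` also carries an `expanded_url` field and counts host names instead of full links. The `sample_url_entities` list is made up, and the example.com addresses are placeholders.

```python
from collections import Counter

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# hypothetical entities['urls'] entries collected from three tweets
sample_url_entities = [
    {'url': 'https://t.co/aaa', 'expanded_url': 'https://www.example.com/story-1'},
    {'url': 'https://t.co/bbb', 'expanded_url': 'https://www.example.com/story-2'},
    {'url': 'https://t.co/ccc', 'expanded_url': 'http://news.example.org/a'},
]

# fall back to the shortened link when no expanded form is present
expanded = [u.get('expanded_url') or u['url'] for u in sample_url_entities]

# count host names rather than full links to see which sites dominate
domains = Counter(urlparse(link).netloc for link in expanded)

print(domains.most_common(2))
```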

### More Data Preprocessing
- remove urls
- remove user names
- extract only urls (for url frequency analysis)

In [None]:
texts[:5]

In [None]:
# remove urls

texts_wo_urls = []

for i in texts:
    result = re.sub(r"http\S+", "", i)
    texts_wo_urls.append(result)

texts_wo_urls[:5]

In [None]:
# remove user names (\w also covers the underscores allowed in screen names)

texts_wo_urls_usernames = []

for i in texts_wo_urls:
    result = re.sub(r"@\w+", "", i)
    texts_wo_urls_usernames.append(result)

texts_wo_urls_usernames[:5]

In [None]:
texts_clean_completely = []

for i in texts:
    result = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", i)
    texts_clean_completely.append(result)

texts_clean_completely[:5]

In [None]:
# same cleaning as above, but also collapse the leftover runs of whitespace
texts_clean_completely2 = []

for i in texts:
    result = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", i).split())
    texts_clean_completely2.append(result)

texts_clean_completely2[:5]

The above data would be good for content analytics (e.g., word frequency, clustering analysis). For content analytics, you need to go through text preprocessing steps (e.g., tokenization, remove stopwords, remove short words).
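
The preprocessing steps named above can be sketched with the standard library alone. The tiny `stop_words` set and the `preprocess` helper are illustrative assumptions; in practice you would use NLTK's `stopwords.words('english')`, as in the content analytics section.

```python
import re

# tiny illustrative stopword list; in practice use nltk's stopwords.words('english')
stop_words = set(['the', 'is', 'a', 'to', 'and', 'of', 'in', 'for'])

def preprocess(text, min_length=3):
    """Lowercase, tokenize on whitespace, drop stopwords and short words."""
    # keep letters and spaces only, so punctuation does not stick to tokens
    letters_only = re.sub(r'[^a-z ]+', ' ', text.lower())
    tokens = letters_only.split()
    return [t for t in tokens
            if t not in stop_words and len(t) >= min_length]

cleaned_sample = 'The debate is tonight and it is going to be BIG'
print(preprocess(cleaned_sample))
```

Applying such a function to every entry of `texts_clean_completely2` would give one token list per tweet, which is the input shape that frequency counts and topic models expect.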

In [None]:
texts[:10]

In [None]:
# extract urls from texts (rather than the original JSON data)

only_urls = []

for i in texts:
    result = re.findall(r'(https?://\S+)', i)
    for url in result: 
        only_urls.append(url)

only_urls[:10]

In [None]:
c = Counter(only_urls)
c.most_common(10)

In [None]:
# the total number of urls in the tweets
len(only_urls)

In [None]:
# the number of unique urls in the tweets
len(c)

### Data for more descriptive analytics

In [None]:
screen_names = [T['user']['screen_name'] for T in tweets if 'user' in T]
screen_names_description = [status['user']['description'] for status in tweets if 'user' in status]
followers_count = [status['user']['followers_count'] for status in tweets if 'user' in status]
friends_count = [status['user']['friends_count'] for status in tweets if 'user' in status]
screen_names_created = [status['user']['created_at'] for status in tweets if 'user' in status]
location = [status['user']['location'] for status in tweets if 'user' in status]

In [None]:
followers_friends = zip(screen_names, followers_count, friends_count)
for i in followers_friends:
    print i

In [None]:
# this is another way to find out screen name, follower count, and friends count

for tweet in tweets:
    print tweet['user']['screen_name'], tweet['user']['followers_count'], tweet['user']['friends_count']

In [None]:
# saving every user and his/her follower counts

user_followerscount = []

for tweet in tweets:
    user_followerscount.append([tweet['user']['screen_name'], tweet['user']['followers_count']]) 
    
user_followerscount[:10]

In [None]:
from operator import itemgetter

sorted(user_followerscount,key=itemgetter(1), reverse=True)
#sorted(user_followerscount,key=itemgetter(1))

### Where do people live?

In [None]:
for i in location:
    print i

# 4. Content Analytics (Text Mining)

## Word Frequencies

In [None]:
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

#join the tweets into one lowercase string (the originaltweets list itself stays intact)
tokens = ' '.join(originaltweets).lower()

#tokenize
tokens = tokens.split()

#Remove stopwords (build the set once; calling stopwords.words() for every token is slow)
stop_words = set(stopwords.words('english'))
tokens = (word for word in tokens if word not in stop_words)

# Keep only purely alphabetic tokens
tokens = (word for word in tokens if word.isalpha())

#Create your bigrams
#bgs = nltk.bigrams(tokens)

#compute frequency distribution for all the words in the text
fdist = nltk.FreqDist(tokens)

#k refers to keys (or tokens); v refers to values (or counts)
for k,v in list(fdist.items())[:10]:
    print k,v

In [None]:
fdist.most_common(100)

## Hashtag Frequencies

In [None]:
for tweet in texts[:5]:
    print tweet

In [None]:
# first extract all hashtags from tweets

import re

hashtags = []

for tweet in texts[:50]:
    print re.findall(r"(?<=#)\w+", tweet)

In [None]:
# list one hashtag in a row and save them
for tweet in texts[:50]:
    a = re.findall(r"(?<=#)\w+", tweet)
    for i in a:
        hashtags.append(['#'+i])

In [None]:
for i in hashtags[:15]:
    print i

In [None]:
#compute frequency distribution for all the hashtags in the tweets

# each entry of hashtags is a one-element list such as ['#tag'];
# pull out the tag and lowercase it so #Tag and #tag are counted together
hashtag_names = [tag[0].lower() for tag in hashtags]

fdist = nltk.FreqDist(hashtag_names)

fdist.most_common(10)

## Topic Modeling

+ Below is a simple demonstration of topic modeling on sample data from Twitter. For a serious business analysis, you need a **comprehensive** dataset, and the dataset needs to be properly processed by removing stopwords, stemming (or lemmatizing), etc.

In [None]:
# you should be able to do document clustering and/or topic modeling here
# select the English tweets only for topic modeling or document clustering

In [None]:
import csv
import pandas as pd

# import packages for text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re

from gensim.corpora import Dictionary
from gensim.models import ldamodel

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.wrappers import LdaVowpalWabbit, LdaMallet

import matplotlib.pyplot as plt
%matplotlib inline

import numpy

import warnings
warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity

In [None]:
# read first five English tweets only 
for i in english[:5]:
    print i['text']

In [None]:
len(english)

In [None]:
# select original English tweets (excluding retweets from the clustering analysis)

english_originaltweets = []

for tweet in english:
    if 'rt @' not in tweet['text'].lower():
        english_originaltweets.append(tweet['text'])
        
len(english_originaltweets)

In [None]:
english_originaltweets[:5]

In [None]:
# Remove @-mentions, URLs, and every character that is not a letter, digit, or whitespace
documents = [re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text) for text in english_originaltweets]
# tokenize
texts = [[word for word in text.lower().split() ] for text in documents]
# lemmatize words (treated as nouns by default): friends --> friend
lmtzr = WordNetLemmatizer()
texts = [[lmtzr.lemmatize(word) for word in text ] for text in texts]
# remove common words 
stoplist = stopwords.words('english')
texts = [[word for word in text if word not in stoplist] for text in texts]
#remove short words
english_originaltweets_clean = [[ word for word in tokens if len(word) >= 3 ] for tokens in texts]

In [None]:
# A list of extra stopwords specific to the debates transcripts (if you want to remove more stopwords)
extra_stopwords = ['amp','get','got','hey','hmm','hoo','hop','iep','let','ooo','par',
            'pdt','pln','pst','wha','yep','yer','aest','didn','nzdt','via',
            'one','com','new','like','great','make','top','awesome','best',
            'good','wow','yes','say','yay','would','thanks','thank','going',
            'new','use','should','could','really','see','want','nice',
            'while','know','free','today','day','always','last','put','live',
            'week','went','wasn','was','used','ugh','try','kind', 'http','much',
            'need', 'next','app','ibm','appleevent','using']

extra_stoplist = extra_stopwords
english_originaltweets_clean = [[word for word in text if word not in extra_stoplist] for text in english_originaltweets_clean]
#https://github.com/alexperrier/datatalks/blob/master/debates/R/stm.R

In [None]:
english_originaltweets_clean[:5]

In [None]:
# after processing each tweet, some tweets could be empty. These empty rows should be removed from further analysis.

english_originaltweets_clean = [x for x in english_originaltweets_clean if x]
english_originaltweets_clean[:5]

In [None]:
len(english_originaltweets_clean)

In [None]:
# this is text processing required for topic modeling with Gensim
dictionary = Dictionary(english_originaltweets_clean)
corpus = [dictionary.doc2bow(text) for text in english_originaltweets_clean]

In [None]:
numpy.random.seed(1) # setting random seed to get the same results each time.
k_range = range(2,20)
scores = []
for k in k_range:
    goodLdaModel = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=75)
    goodcm = CoherenceModel(model=goodLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')
    scores.append(goodcm.get_coherence())
    
plt.figure(figsize=(14, 8))
plt.plot(k_range, scores)

In [None]:
scores

In [None]:
numpy.random.seed(1) # setting random seed to get the same results each time. For a large dataset, high passes (75) would be desirable.
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=4, passes=75)

In [None]:
model.show_topics()

There could be a lot of extra stopwords (e.g., trump, donald, said, come), which should be removed prior to topic modeling. 

In [None]:
# print words without probability
for i in range(0,4):
    topics = model.show_topic(i, 10)
    print ', '.join([str(word[0]) for word in topics])

In [None]:
lda_corpus = model[corpus]

results = []
for i in lda_corpus:
    #print i
    results.append(i)

results

In [None]:
documents = []

for i in english_originaltweets_clean:
    documents.append(' '.join(i))  # join the tokens back into one readable string

documents[:5]

In [None]:
# finding highest value from each row
toptopic = [max(collection, key=lambda x: x[1])[0] for collection in results]

toptopic = pd.DataFrame(toptopic)
documents = pd.DataFrame(documents)
documents = documents.rename(columns = {0: 'documents'})
summary = documents.join(toptopic)
summary.head()

In [None]:
summary.groupby(0).count()

In [None]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(model, corpus, dictionary)

## Sentiment Analysis

In [None]:
# you should be able to do unsupervised sentiment analysis here ...
# select the English tweets only for sentiment analysis
# you may use "Pattern python package"

In [None]:
from pattern.en import sentiment

In [None]:
for tweet in english_originaltweets_clean:
    score = sentiment(tweet)
    print score[0], score[1] 

Or you could do text preprocessing before sentiment analysis. The results look almost the same.

In [None]:
# Keep letters only: digits, punctuation, and symbols become spaces
documents = [re.sub(r"[^a-zA-Z]+", " ", document) for document in english_originaltweets]
# tokenize
texts = [[word for word in document.lower().split() ] for document in documents]
# remove common words 
stoplist = stopwords.words('english')
texts = [[word for word in text if word not in stoplist] for text in texts]
#remove short words
texts = [[ word for word in tokens if len(word) >= 3 ] for tokens in texts]

for row in texts:
    score = sentiment(row)
    print score[0], score[1]  

### Collecting and analyzing user profiles (Appendix)

In [None]:
for i in screen_names_description:
    print i

# 5. Network Analytics 

## Mention Network

Consider a tweet by **@kevin**: **"@amy, are you available today? how about coffee this afternoon? #friday"**

The above tweet creates a relationship between @kevin and @amy. This relationship is created by **mention**.
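
As a minimal sketch of that idea, the hypothetical helper `mention_edges` below turns one tweet into (source, target) pairs, using the same regular expression as the rest of this notebook. Lowercasing both ends is an added assumption so that @Amy and @amy end up as one node.

```python
import re

def mention_edges(screen_name, text):
    """Return (source, target) pairs for every @-mention in one tweet."""
    source = '@' + screen_name.lower()
    targets = re.findall(r'(?<=@)\w+', text)
    # lowercase both ends so @Amy and @amy become the same node
    return [(source, '@' + t.lower()) for t in targets]

edges = mention_edges(
    'kevin',
    '@amy, are you available today? how about coffee this afternoon? #friday')
print(edges)
```

A tweet that mentions several accounts yields one edge per mention, and a tweet without mentions yields an empty list.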

In [None]:
for tweet in tweets[:5]:
    print tweet['user']['screen_name']+',', tweet['text']

In [None]:
mention = []

for tweet in tweets:
    mention.append([tweet['user']['screen_name']+',', tweet['text']])
    
mention[:5]

In [None]:
# computationally intensive ... very slow if you have a lot of data
for i in mention:
    print i[0], i[1]

In [None]:
for tweet in mention:
    print tweet[0], re.findall(r"(?<=@)\w+", tweet[1])

In [None]:
for tweet in mention:
    a = re.findall(r"(?<=@)\w+", tweet[1])
    for i in a:
        print tweet[0], '@'+i

In [None]:
#putting everything together

import csv

# "wb" is the right mode for csv files in Python 2
with open("data/mentionnetwork.csv", "wb") as openfile:
    w = csv.writer(openfile)
    for tweet in tweets:
        # write the author like the targets ('@name', no trailing comma)
        # so the same account is a single node in the network
        source = '@' + tweet['user']['screen_name']
        for i in re.findall(r"(?<=@)\w+", tweet['text']):
            w.writerow([source, '@' + i])

mentionnetwork.csv contains two columns representing relationships. Now you can use **NodeXL** or **Gephi** for **network analytics**.

You should perform various statistical analyses (e.g., centrality, degree) and modularity analysis using Gephi.
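
Before moving to Gephi, a quick degree count in Python can serve as a sanity check. The sketch below uses made-up `sample_edges` and counts every edge, so repeated mentions add to the degree; this may differ from tools that count distinct neighbours only.

```python
from collections import Counter

# hypothetical mention edges: (who tweeted, who was mentioned)
sample_edges = [
    ('@kevin', '@amy'),
    ('@bob', '@amy'),
    ('@amy', '@kevin'),
    ('@carol', '@amy'),
]

# in-degree: how often each account is mentioned by others
in_degree = Counter(target for source, target in sample_edges)

# out-degree: how many mentions each account sends
out_degree = Counter(source for source, target in sample_edges)

print(in_degree.most_common(1))
print(out_degree['@amy'])
```

Accounts with a high in-degree are the "most visible users" from the descriptive analytics section, seen from the network side.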


## Topic Modeling using Co-Hashtag Network Analysis

- Identify co-appearing hashtags and build a network of co-appearing hashtags
- Apply modularity (or network clustering) analysis and identify topics or themes
- This is similar to topic modeling: the difference is that topic modeling uses texts, and this proposed approach uses hashtags and network analysis
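
The first step in the list above can be sketched as follows. The `sample_tags` data is made up, and sorting plus de-duplicating the hashtags of each tweet is an added assumption so that (a, b) and (b, a) count as the same edge; the counts can then serve as edge weights.

```python
from collections import Counter
from itertools import combinations

# hypothetical hashtag lists, one per tweet
sample_tags = [
    ['election', 'debate', 'vote'],
    ['debate', 'election'],
    ['vote'],  # a single hashtag produces no pair
]

pair_counts = Counter()
for tags in sample_tags:
    # sort and de-duplicate so (a, b) and (b, a) count as the same edge
    unique_tags = sorted(set(tags))
    pair_counts.update(combinations(unique_tags, 2))

print(pair_counts.most_common(1))
```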

In [None]:
for tweet in tweets[:2]:
    print tweet['text']

In [None]:
english_originaltweets[:2]

In [None]:
for i in english_originaltweets[:2]:
    print re.sub("[^a-zA-Z0-9#]+", " ", i)

In [None]:
for i in english_originaltweets[:10]:
    data = re.sub("[^a-zA-Z0-9#]+", " ", i)
    hashtag = re.findall(r"(?<=#)\w+", str(data).lower())
    print hashtag

In [None]:
datas = []
for i in english_originaltweets:
    data = re.sub("[^a-zA-Z0-9#]+", " ", i)
    hashtag = re.findall(r"(?<=#)\w+", str(data).lower())
    datas.append(hashtag)

In [None]:
datas[:20]

In [None]:
from itertools import combinations
cohashtags = [x for d in datas for x in combinations(d, 2)]
cohashtags[:10]

In [None]:
#for cohashtag analysis
outfile = open("data/cohashtag_network.csv", "wb")
w = csv.writer(outfile)
for i in cohashtags:
    w.writerow(i)    
outfile.close()

# 6. Spatial Analytics (Appendix) 

In [None]:
lats = [(T['geo']['coordinates'][0] if T['geo'] else None) for T in tweets]
lons = [(T['geo']['coordinates'][1] if T['geo'] else None) for T in tweets]

In [None]:
geo = zip(lats, lons)
geo

In [None]:
import csv
openfile = open("data/geo.csv", "wb")
w = csv.writer(openfile)
for i in geo:
    w.writerow(i)  # i is a (lat, lon) tuple: write it as two columns
openfile.close()

Open Excel and import the CSV data (using **import texts** in the Data tab). Make sure that the result is an Excel file with two columns (latitude, longitude).

Remove the rows with NONE or empty cells (tweets without coordinates).

Import the Excel file into **Tableau** and visualize the location data on a map. That's it!
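
The rows without coordinates can also be dropped in Python before the export, which removes the manual clean-up step in Excel. This is a minimal sketch on made-up `sample_geo_tweets`; it assumes, as the code above does, that `geo['coordinates']` is ordered latitude first, then longitude.

```python
# hypothetical tweets: only the 'geo' field matters here
sample_geo_tweets = [
    {'geo': {'type': 'Point', 'coordinates': [40.7128, -74.0060]}},
    {'geo': None},
    {'geo': {'type': 'Point', 'coordinates': [34.0522, -118.2437]}},
]

# keep (latitude, longitude) only for tweets that carry coordinates
points = [tuple(t['geo']['coordinates']) for t in sample_geo_tweets if t.get('geo')]

# one "lat,lon" line per tweet, with a header row, ready to be written to a file
csv_lines = ['latitude,longitude'] + ['%s,%s' % p for p in points]

print(len(points))
print(csv_lines[0])
```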