# Analysing Dr Who tweets

During the 50th anniversary episode of *Dr Who* I scraped over 35,000 tweets. This notebook details some ways to analyse them in Python.

In [None]:
#import pandas to read the CSV file
import pandas as pd
#and numpy to deal with maths
import numpy as np

# plotting packages
import matplotlib.pyplot as plt
import seaborn as sns

# model building package
import sklearn

# package to clean text
import re

In [None]:
#I've published the Google spreadsheet as a CSV - store the url...
tweeturl = "https://docs.google.com/spreadsheets/d/e/2PACX-1vTIpXKoHJGy-vA1iX2nuLYLrwog4IAHeufTrUaB3iGdF6yABBgW6ng6puehVkuLDN2kJHbnYEJ1_p9s/pub?gid=1257121167&single=true&output=csv"
#and then read the CSV at that url
tweets = pd.read_csv(tweeturl)

In [None]:
#show the first few rows
tweets.head()

Unnamed: 0,id_str,tweet_url,created_at,text,lang,REGEX,DALEK,retweet_count,screen_name,hashtags,query,url,user_mention,media,in_reply_to_screen_name,in_reply_to_status_id,lat,lng
0,4.04363e+17,https://twitter.com/HoptonChris/status/4043628...,2013-11-23 21:36:29+00:00,Question: Will Peter Carpaldi be known as 12 o...,en,False,,0,HoptonChris,DrWho,#drwho,,,,,,,
1,4.04363e+17,https://twitter.com/EasyStreetD/status/4043628...,2013-11-23 21:36:28+00:00,RT @DenverComicCon: Which Dr. are you? Quick p...,en,False,,1,EasyStreetD,DrWho,#drwho,http://bbc.in/1el7qo6,DenverComicCon,,,,,
2,4.04363e+17,https://twitter.com/EastonNicky/status/4043628...,2013-11-23 21:36:27+00:00,RT @huxley06: Help us doctors ..you are our on...,en,False,,2,EastonNicky,drwho,#drwho,,huxley06,https://pbs.twimg.com/media/BZySuHPCMAAmpSF.jpg,,,,
3,4.04363e+17,https://twitter.com/jonnybeardo/status/4043628...,2013-11-23 21:36:27+00:00,Got to love #drwho #DayoftheDoctor,en,False,,0,jonnybeardo,drwho,#drwho,,,,,,,
4,4.04363e+17,https://twitter.com/juliaargy/status/404362836...,2013-11-23 21:36:27+00:00,@rod_ster #yougotme #hadtobail #DrWho,en,False,,0,juliaargy,yougotme,#drwho,,rod_ster,,rod_ster,4.04347e+17,,


In [None]:
#check what types the columns are
tweets.dtypes

id_str                     float64
tweet_url                   object
created_at                  object
text                        object
lang                        object
retweet_count                int64
screen_name                 object
hashtags                    object
query                       object
url                         object
user_mention                object
media                       object
in_reply_to_screen_name     object
in_reply_to_status_id      float64
lat                        float64
lng                        float64
dtype: object
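
The `float64` type shown for `id_str` means tweet IDs are stored as floating-point numbers, which cannot represent every 18-digit integer exactly. A minimal sketch of one way around that, assuming you control the `read_csv` call: ask pandas to read the ID column as text and parse `created_at` as dates. The two-row CSV below is made-up sample data, not rows from the dataset.

```python
import io
import pandas as pd

# Made-up sample data with the same column names as the tweets CSV
sample_csv = io.StringIO(
    "id_str,created_at,text\n"
    "404362836123456789,2013-11-23 21:36:29+00:00,Got to love #drwho\n"
    "404362836123456790,2013-11-23 21:36:28+00:00,#drwho epic\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={"id_str": str},       # keep IDs as text so no digits are lost
    parse_dates=["created_at"],  # turn timestamps into datetime values
)

print(sample.dtypes)
print(sample["id_str"].tolist())
```

The same two arguments could be passed to the `pd.read_csv(tweeturl)` call if exact IDs or date arithmetic matter for your analysis.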

## Check how many tweets are unique

*The rest of this code is from [this tutorial](https://ourcodingclub.github.io/tutorials/topic-modelling-python/)...*

The `.unique()` method returns the distinct values in a column, so the shape of its result tells us how many tweets are unique.

In [None]:
#show how many tweets
tweets['text'].shape

(38586,)

In [None]:
#show how many are unique
tweets['text'].unique().shape

(31571,)

In [None]:
# make a new column to highlight retweets
tweets['is_retweet'] = tweets['text'].apply(lambda x: x[:2]=='RT')
tweets['is_retweet'].sum()  # number of retweets

10842

In [None]:
# number of unique retweets
tweets.loc[tweets['is_retweet']].text.unique().size

4132
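
The check above assumes every value in `text` is a string; a missing value would raise an error when sliced. As a hedged alternative, the vectorised `.str.startswith()` accessor does the same test and lets you decide how missing values count. The small series below is invented for illustration.

```python
import pandas as pd

# Invented examples, including one missing value and one repeated retweet
texts = pd.Series([
    "RT @huxley06: Help us doctors",
    "Got to love #drwho",
    None,
    "RT @huxley06: Help us doctors",
])

# na=False treats missing text as "not a retweet"
is_retweet = texts.str.startswith("RT", na=False)

n_retweets = int(is_retweet.sum())               # all retweets
n_unique_retweets = texts[is_retweet].nunique()  # distinct retweet texts
print(n_retweets, n_unique_retweets)
```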

## Enter NLTK

From section 6 of the tutorial at https://ourcodingclub.github.io/tutorials/topic-modelling-python/

We need the `nltk` library to prepare the text for topic modelling. Below we import it, along with some specific tools from it such as `RegexpTokenizer`, [which is described like this:](https://www.kite.com/python/docs/nltk.tokenize.regexp)

> "A RegexpTokenizer splits a string into substrings using a regular expression."

`stopwords`, meanwhile, is a simple list of words like 'the', 'to', etc. which we are likely to want to remove from our analysis because of their high frequency and low significance.
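
To make the two ideas concrete, here is a minimal sketch that uses only the standard library: `re.findall` stands in for what `RegexpTokenizer` does, and a tiny hand-written set stands in for the NLTK stopword list. The function name, the pattern and the stopword set are illustrative assumptions, not the ones NLTK ships.

```python
import re

# Illustrative stand-in for NLTK's English stopword list (much shorter)
toy_stopwords = {"the", "to", "a", "of", "is"}

def tokenize(text, pattern=r"#?\w+"):
    """Split lower-cased text into tokens matching the pattern, keeping hashtags whole."""
    return re.findall(pattern, text.lower())

tokens = tokenize("Got to love the #DrWho special")
kept = [t for t in tokens if t not in toy_stopwords]
print(tokens)  # ['got', 'to', 'love', 'the', '#drwho', 'special']
print(kept)    # ['got', 'love', '#drwho', 'special']
```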

In [None]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Defining some functions to remove links and users

Two functions are created below to remove more elements which are unlikely to be relevant to analysis: links and users.

In [None]:
def remove_links(tweet):
    '''Takes a string and removes web links from it'''
    tweet = re.sub(r'http\S+', '', tweet) # remove http links
    tweet = re.sub(r'bit\.ly/\S+', '', tweet) # remove bitly links
    # remove [link] placeholders; str.strip('[link]') would instead trim
    # any of the characters [, l, i, n, k, ] from both ends of the tweet
    tweet = tweet.replace('[link]', '')
    return tweet

def remove_users(tweet):
    '''Takes a string and removes retweet and @user information'''
    tweet = re.sub(r'(RT\s@[A-Za-z]+[A-Za-z0-9_-]+)', '', tweet) # remove retweet
    tweet = re.sub(r'(@[A-Za-z]+[A-Za-z0-9_-]+)', '', tweet) # remove tweeted at
    return tweet
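
As a quick check of what patterns like these match, the sketch below applies similar regular expressions to an invented tweet. The helper name `strip_links_and_users` and the sample text are assumptions for illustration, and the patterns are simplified versions of the ones above.

```python
import re

def strip_links_and_users(tweet):
    """Remove http(s) links, a retweet marker, and @mentions from a tweet."""
    tweet = re.sub(r"http\S+", "", tweet)       # drop web links
    tweet = re.sub(r"\bRT\s+(?=@)", "", tweet)  # drop 'RT' when it precedes a mention
    tweet = re.sub(r"@\w+", "", tweet)          # drop @user mentions
    return re.sub(r"\s+", " ", tweet).strip()   # tidy leftover whitespace

sample = "RT @DenverComicCon: Which Dr. are you? http://bbc.in/1el7qo6 #DrWho"
cleaned = strip_links_and_users(sample)
print(cleaned)
```

Note that the colon after the removed username survives at this stage; in the notebook it is the later punctuation step that takes care of it.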

Those two functions are used inside another function, defined below, which cleans tweets and then creates a 'token list'.

In [None]:
my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

# cleaning master function
def clean_tweet(tweet, bigrams=False):
    tweet = remove_users(tweet)
    tweet = remove_links(tweet)
    tweet = tweet.lower() # lower case
    tweet = re.sub('[' + re.escape(my_punctuation) + ']+', ' ', tweet) # strip punctuation
    tweet = re.sub(r'\s+', ' ', tweet) # collapse repeated whitespace
    tweet = re.sub(r'([0-9]+)', '', tweet) # remove numbers
    tweet_token_list = [word for word in tweet.split(' ')
                            if word not in my_stopwords] # remove stopwords

    tweet_token_list = [word_rooter(word) if '#' not in word else word
                        for word in tweet_token_list] # apply word rooter
    if bigrams:
        tweet_token_list = tweet_token_list+[tweet_token_list[i]+'_'+tweet_token_list[i+1]
                                            for i in range(len(tweet_token_list)-1)]
    tweet = ' '.join(tweet_token_list)
    return tweet
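
The `bigrams=True` option is easier to follow on a short token list. This standalone sketch repeats only that step: each adjacent pair of tokens is joined with an underscore and appended to the original tokens. The function name `add_bigrams` and the sample tokens are made up for illustration.

```python
def add_bigrams(tokens):
    """Return the tokens followed by every adjacent pair joined with '_'."""
    pairs = [tokens[i] + "_" + tokens[i + 1] for i in range(len(tokens) - 1)]
    return tokens + pairs

with_bigrams = add_bigrams(["great", "men", "forg", "fire"])
print(with_bigrams)
# ['great', 'men', 'forg', 'fire', 'great_men', 'men_forg', 'forg_fire']
```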

We then use that function to add another column to our dataframe containing the cleaned version of each tweet.

In [None]:
#apply the function 'clean_tweet' to the 'text' column of the tweets dataframe
#and create a new column with the results
tweets['clean_tweet'] = tweets['text'].apply(clean_tweet)


In [None]:
#show the first 10
tweets['clean_tweet'][:10]

0    question peter carpaldi known #drwho #drwhoth ...
1     dr quick person quiz bbc honor #drwho th anni...
2                          help us doctor hope #drwho 
3                      got love #drwho #dayofthedoctor
4                          #yougotme #hadtobail #drwho
5     total brilliant best cinema crowd ever #drwho...
6                                     awesom dr #drwho
7                      great men forg fire epic #drwho
8                         ahhh #drwho amaz #savetheday
9                                          #drwho epic
Name: clean_tweet, dtype: object

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=0.9, min_df=25, token_pattern=r'\w+|\$[\d\.]+|\S+')

# apply transformation
tf = vectorizer.fit_transform(tweets['clean_tweet']).toarray()

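# tf_feature_names tells us what word each column in the matrix represents
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
tf_feature_names = vectorizer.get_feature_names_out()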
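
The `max_df` and `min_df` arguments control which tokens survive: a float is read as a proportion of documents and an integer as an absolute document count. Here is a small sketch on an invented four-document corpus, with thresholds chosen to suit its size rather than the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini-corpus of already-cleaned tweets
docs = [
    "#drwho epic",
    "#drwho amaz doctor",
    "#drwho doctor dalek",
    "#drwho dalek epic doctor",
]

# Drop tokens found in more than 90% of documents (max_df=0.9)
# and tokens found in fewer than 2 documents (min_df=2)
toy_vectorizer = CountVectorizer(max_df=0.9, min_df=2, token_pattern=r"\S+")
counts = toy_vectorizer.fit_transform(docs)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)             # tokens that passed both filters
print(counts.toarray())  # one row per document, one column per token
```

In this toy corpus `#drwho` appears in every document and `amaz` in only one, so both are filtered out. The same logic explains why a hashtag used in nearly every tweet can vanish from the real vocabulary.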



In [None]:
from sklearn.decomposition import LatentDirichletAllocation

number_of_topics = 10

model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)

In [None]:
model.fit(tf)

LatentDirichletAllocation(random_state=0)

In [None]:
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)]= ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)]= ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)
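
The slice `topic.argsort()[:-no_top_words - 1:-1]` is the densest part of this function: `argsort()` returns indices ordered from smallest to largest weight, and the negative-step slice walks backwards from the end to pick out the top few. A toy example with invented weights:

```python
import numpy as np

# Invented weights for a five-word vocabulary
feature_names = ["dalek", "doctor", "epic", "tardi", "watch"]
topic = np.array([12.0, 90.5, 3.2, 47.1, 20.0])
no_top_words = 3

# Indices of the three largest weights, biggest first
top_idx = topic.argsort()[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_idx.tolist())  # [1, 3, 4]
print(top_words)         # ['doctor', 'tardi', 'watch']
```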

## Show the 10 'topics' extracted

Below we display a table of the 10 topics and the words that appear most in each. 

In [None]:
no_top_words = 10
display_topics(model, tf_feature_names, no_top_words)

Unnamed: 0,Topic 0 words,Topic 0 weights,Topic 1 words,Topic 1 weights,Topic 2 words,Topic 2 weights,Topic 3 words,Topic 3 weights,Topic 4 words,Topic 4 weights,Topic 5 words,Topic 5 weights,Topic 6 words,Topic 6 weights,Topic 7 words,Topic 7 weights,Topic 8 words,Topic 8 weights,Topic 9 words,Topic 9 weights
0,â,2245.3,#doctorwho,1185.8,th,3455.6,tardi,660.2,watch,1561.1,googl,1605.0,time,690.4,dalek,1242.5,doctor,4790.4,time,1747.0
1,de,2129.1,#savetheday,770.3,anniversari,2377.1,ever,630.5,love,922.3,today,1218.4,go,682.9,david,780.9,#doctorwho,2904.4,year,1678.5
2,el,1013.1,amp,537.0,happi,1218.4,#tardis,544.5,good,741.4,baker,905.8,like,649.8,tennant,750.5,day,2476.2,celebr,1110.9
3,la,560.1,#thedayofthedoctor,418.7,birthday,643.8,best,491.2,get,724.9,tom,873.5,look,565.6,dr,739.7,#savetheday,2296.8,special,1060.9
4,que,538.1,#dayofthedoctor,372.8,tomorrow,619.8,one,312.4,dr,589.9,doodl,862.4,dr,527.2,year,456.3,#dayofthedoctor,2088.7,rt,769.1
5,€“,440.9,make,341.4,doctor,586.6,theme,305.1,awesom,507.5,dr,807.1,€¦,492.0,bow,455.1,#doctorwhoth,958.3,space,687.1
6,en,428.1,today,326.4,pm,527.1,ã,190.2,see,463.9,smith,663.7,back,405.0,mark,447.3,wait,491.9,anniversari,687.1
7,doodl,339.6,us,309.0,#savetheday,520.2,photo,159.4,episod,453.4,matt,636.3,think,374.0,room,443.1,new,484.5,adventur,638.1
8,un,281.6,fan,296.4,weekend,498.3,made,153.6,oh,430.9,game,597.3,know,367.4,stand,431.1,bbc,456.7,#win,558.1
9,hoy,277.1,€¦,272.1,dr,456.7,box,141.7,tonight,404.0,#drwhoth,537.8,amp,352.3,recept,398.1,excit,434.9,enter,557.1


## Seeing the topics

Look down each column and you can see how the words in each topic share a theme: topic 2 is all about the anniversary/birthday; topic 3 is all about the TARDIS/box; topic 5 is all about Tom Baker and Matt Smith.

In [None]:
topictable = display_topics(model, tf_feature_names, no_top_words)
topictable['Topic 5 words']

0       googl
1       today
2       baker
3         tom
4       doodl
5          dr
6       smith
7        matt
8        game
9    #drwhoth
Name: Topic 5 words, dtype: object
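
Beyond the top words, it can help to see which topic each tweet leans toward. The sketch below uses the `transform`/`fit_transform` methods of scikit-learn's LDA model, which return one row of topic proportions per document. The corpus and the choice of two topics are invented for illustration, so the resulting assignments carry no meaning.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini-corpus of already-cleaned tweets
docs = [
    "#drwho epic doctor",
    "doctor dalek tardi",
    "googl doodl today",
    "googl doodl game",
]

toy_counts = CountVectorizer(token_pattern=r"\S+").fit_transform(docs)
toy_model = LatentDirichletAllocation(n_components=2, random_state=0)

# One row per document, one column per topic; each row sums to 1
doc_topics = toy_model.fit_transform(toy_counts)

# Index of the highest-weighted topic for each document
dominant = doc_topics.argmax(axis=1)
print(doc_topics.round(2))
print(dominant)
```

With the notebook's own objects, the equivalent call would be `model.transform(tf)`, and the `argmax` result could be stored as a new column on `tweets`.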