# Westminster Tweet Database Initial Analysis

* A quick investigation of the database of MPs' tweets, which has been exported as a CSV file.

* Note that we are only looking at the actual tweets and not the followings or profile data at this point.

* Data collected approximately 23-29th December 2017. 

* Save to 'process

In [1]:
import pandas as pd
%matplotlib inline

In [2]:
df = pd.read_csv('tweets.csv')

In [3]:
df.head()

Unnamed: 0,user_name,constituency,party,gender,tweet_id,permalink,text,date,retweets,favourites,geo,replies
0,skinnock,Aberavon,Labour,Male,947017416047058944,https://twitter.com/SKinnock/status/9470174160...,Devastating resignation letter from Andrew Ado...,2017-12-30 08:12:01,17,43,,11
1,skinnock,Aberavon,Labour,Male,944285195372548097,https://twitter.com/SKinnock/status/9442851953...,The gov need to act to allow more flexibility ...,2017-12-22 19:15:09,2,5,,2
2,skinnock,Aberavon,Labour,Male,943841995390423040,https://twitter.com/SKinnock/status/9438419953...,Here are the fabled sector analyses. Damp squi...,2017-12-21 13:54:02,21,29,,8
3,skinnock,Aberavon,Labour,Male,943595206225559552,https://twitter.com/SKinnock/status/9435952062...,"Fallon, Patel and now Green. Three strikes and...",2017-12-20 21:33:22,7,31,,5
4,skinnock,Aberavon,Labour,Male,943552279189258245,https://twitter.com/SKinnock/status/9435522791...,Waiting to hear from @DavidGauke re my concern...,2017-12-20 18:42:48,3,1,,1


## Initial Data Preprocessing

As well as taking a quick peek at the data, there are a few preprocessing tasks that need to be completed:

* The Twitter search API that we used to collect the tweets occasionally includes retweets as well as tweets. These need to be removed.

* Check whether any MPs did not tweet at all or have posted only a few tweets. It is possible that something went wrong whilst collecting their tweets.

* Drop the geo column, as all of its values are null.

* Combine the Labour and the Labour and Co-operative parties.

### Initial Look

In [4]:
# First let's have a look at the data

df.describe()

Unnamed: 0,tweet_id,retweets,favourites,geo,replies
count,2718145.0,2718145.0,2718145.0,0.0,2718145.0
mean,5.05739e+17,10.33543,14.62614,,2.556086
std,2.725769e+17,160.1534,341.1777,,27.19414
min,479780700.0,0.0,0.0,,0.0
25%,2.937094e+17,0.0,0.0,,0.0
50%,5.315462e+17,0.0,0.0,,0.0
75%,7.395344e+17,3.0,3.0,,1.0
max,9.477393e+17,62405.0,158004.0,,14784.0


In [5]:
# How many tweets do we have?

df.count()

user_name       2718145
constituency    2718145
party           2718145
gender          2718145
tweet_id        2718145
permalink       2718145
text            2716935
date            2718145
retweets        2718145
favourites      2718145
geo                   0
replies         2718145
dtype: int64

In [6]:
# Do we have much null data? Every geo value is null, so we can get rid of that column.

df.isnull().sum()

user_name             0
constituency          0
party                 0
gender                0
tweet_id              0
permalink             0
text               1210
date                  0
retweets              0
favourites            0
geo             2718145
replies               0
dtype: int64
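
The null counts above suggest a simple rule: drop any column in which every value is null, and leave partially null columns such as text alone. A minimal sketch of that rule on a toy frame follows. The frame `toy` and its values are made up for illustration and are not taken from the tweet database.

```python
import pandas as pd

# Toy frame standing in for the tweet data; the values are invented.
toy = pd.DataFrame({
    "user_name": ["a", "b", "c"],
    "text": ["hello", None, "world"],
    "geo": [None, None, None],
})

# Columns in which every single value is null.
all_null_cols = [col for col in toy.columns if toy[col].isnull().all()]

# Drop only those columns; 'text' has one null but is kept.
cleaned = toy.drop(columns=all_null_cols)

# Number of rows with missing text, for reference.
missing_text = int(toy["text"].isnull().sum())

print(all_null_cols, list(cleaned.columns), missing_text)
```

Checking `isnull().all()` before dropping guards against silently discarding a column that later turns out to hold a few real values.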

### Remove retweets

* We can do this by looking at the permalink column. If the user name does not appear in the permalink, the tweet must be a retweet. An MP retweeting one of their own tweets would slip through this check, but that is likely to affect only a small number of the tweets collected.

In [7]:
def is_retweet(row):
    """
    Look at the permalink
    If the user name is in this link then it is not a retweet.

    """
    
    permalink = row['permalink'].lower()
    
    username = row['user_name'].lower()
    
    if username in permalink:
        
        return False
    
    else:
        
        return True

In [8]:
#Apply is_retweet function

df['is_retweet'] = df[['user_name', 'permalink']].apply(is_retweet, axis=1)

In [9]:
# How many are there?

df['tweet_id'][df['is_retweet']==True].count()

142

In [10]:
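# Remove them. Take a copy so that the later in-place drops and column
# assignments act on an independent frame rather than a slice of the original.

df = df[df['is_retweet']==False].copy()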
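
Applying a Python function row by row is slow on a frame with millions of rows. As a rough sketch, the same substring test can be run over the two columns directly. The helper name `flag_retweets`, the toy frame and its permalinks (including the status ids) are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy rows; the permalinks and status ids are invented for illustration.
toy = pd.DataFrame({
    "user_name": ["skinnock", "skinnock"],
    "permalink": [
        "https://twitter.com/SKinnock/status/1",
        "https://twitter.com/someoneelse/status/2",
    ],
})

def flag_retweets(frame):
    """Return a boolean Series: True where user_name is absent from permalink."""
    names = frame["user_name"].str.lower()
    links = frame["permalink"].str.lower()
    # Same case-insensitive substring test as is_retweet, without DataFrame.apply.
    flags = [name not in link for name, link in zip(names, links)]
    return pd.Series(flags, index=frame.index)

toy["is_retweet"] = flag_retweets(toy)

# Keep only the original tweets.
originals = toy[~toy["is_retweet"]].copy()
print(toy["is_retweet"].tolist(), len(originals))
```

Because this is a plain substring test, a handle that happens to be contained in another account's handle would still count as a match, which is the same limitation the row-wise version has.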

### Tweet Collection Errors

* Rank the MPs by number of tweets, then look at the bottom of the ranking to see whether any MPs have no tweets or very few. This may indicate that something went wrong during the Westminster-Tweet-Database build process.


* It is possible that we have the wrong or old accounts for some MPs. Alternatively, the merge of the 2017 election results with the Twitter account CSV may have gone wrong when the database was created. I should check this manually at some point, but for now we are ok.


* Some MPs seem to have made a substantial number of tweets that do not show up when using the Twitter search functionality. For example, drlisacameronmp only has 35 tweets, but a browse through her timeline shows that she has made many more. Hopefully this will not bias the data too much.

In [11]:
grouped = df[['tweet_id', 'user_name']].groupby(by='user_name')

In [12]:
# Looks like we have a small number of MPs who have made very few tweets.
# Most of these seem to have actually made very few tweets although there are some exceptions e.g. drlisacameronmp

grouped.count().sort_values('tweet_id').head(15)

user_name,tweet_id
jonathanlord,1
jamiehwstone,2
adamhollowaymp,4
johnstevensonmp,5
amcarmichaelmp,28
adrianbaileymp,30
drlisacameronmp,35
electnigel,43
johnmcnallymp,47
damienmooremp,50
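
One caveat with ranking by group counts is that an MP with no collected tweets has no rows at all, so they never appear at the bottom of the table. A possible way round this, sketched below, is to compare the counts against a full list of expected handles. The list `expected_handles`, the threshold `MIN_TWEETS` and the toy data are all assumptions made for illustration; the real list would have to come from the MP account file used to build the database.

```python
import pandas as pd

# Toy tweet table: mp_a has 3 tweets, mp_b has 60, mp_c has none.
toy = pd.DataFrame({"user_name": ["mp_a"] * 3 + ["mp_b"] * 60})

# Hypothetical full list of handles we expected to collect.
expected_handles = ["mp_a", "mp_b", "mp_c"]

# Arbitrary cut-off below which an account is worth a manual look.
MIN_TWEETS = 50

# Count tweets per handle, filling in 0 for handles with no rows.
counts = toy["user_name"].value_counts().reindex(expected_handles, fill_value=0)

# Accounts with suspiciously few tweets, lowest first.
suspicious = counts[counts < MIN_TWEETS].sort_values()
print(suspicious)
```

Accounts flagged this way still need a manual check, since a low count can be genuine rather than a collection error.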


### Drop Geo Column

In [13]:
# Drop geo (entirely null) and the is_retweet helper column, which is no longer needed.
df.drop(['geo', 'is_retweet'], inplace=True, axis=1)

### Create new merged column for Labour and Labour and Co-operative

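* Labour and Co-operative MPs are jointly backed by the Labour Party and the Co-operative Party, which have a long-standing electoral agreement, so for this analysis the two are treated as one party. Create a merged party column to take this into account.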

In [14]:
df['party'].unique()

array(['Labour', 'Conservative', 'Scottish National Party', 'Plaid Cymru',
       'Labour and Co-operative', 'Liberal Democrat',
       'Democratic Unionist Party', 'Sinn Fein', 'Green'], dtype=object)

In [15]:
def merge_labour(row):
    
    if row['party'] == 'Labour and Co-operative':
        
        party = 'Labour'
        
    else:
        
        party = row['party']
        
    return party

In [16]:
df['party_new'] = df[['party']].apply(merge_labour, axis=1)
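
The same merge can also be written without a row-wise function by using the pandas `Series.replace` method with a mapping. This is an alternative sketch on a toy frame, not the method used above, and it should give the same party_new values.

```python
import pandas as pd

# Toy frame with the three kinds of party value that matter here.
toy = pd.DataFrame({
    "party": ["Labour", "Labour and Co-operative", "Conservative"],
})

# Map the joint label onto plain 'Labour'; all other values pass through unchanged.
toy["party_new"] = toy["party"].replace({"Labour and Co-operative": "Labour"})

print(toy["party_new"].tolist())
```

The original party column is left untouched, so the joint label is still available if it is needed later.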

In [17]:
df.head()

Unnamed: 0,user_name,constituency,party,gender,tweet_id,permalink,text,date,retweets,favourites,replies,party_new
0,skinnock,Aberavon,Labour,Male,947017416047058944,https://twitter.com/SKinnock/status/9470174160...,Devastating resignation letter from Andrew Ado...,2017-12-30 08:12:01,17,43,11,Labour
1,skinnock,Aberavon,Labour,Male,944285195372548097,https://twitter.com/SKinnock/status/9442851953...,The gov need to act to allow more flexibility ...,2017-12-22 19:15:09,2,5,2,Labour
2,skinnock,Aberavon,Labour,Male,943841995390423040,https://twitter.com/SKinnock/status/9438419953...,Here are the fabled sector analyses. Damp squi...,2017-12-21 13:54:02,21,29,8,Labour
3,skinnock,Aberavon,Labour,Male,943595206225559552,https://twitter.com/SKinnock/status/9435952062...,"Fallon, Patel and now Green. Three strikes and...",2017-12-20 21:33:22,7,31,5,Labour
4,skinnock,Aberavon,Labour,Male,943552279189258245,https://twitter.com/SKinnock/status/9435522791...,Waiting to hear from @DavidGauke re my concern...,2017-12-20 18:42:48,3,1,1,Labour


In [18]:
df.to_csv('processed_tweets.csv')
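
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again with `read_csv`. If the index is not needed, passing `index=False` avoids this. The sketch below shows the difference on a toy frame, using in-memory buffers rather than real files; whether the index should be kept for the processed tweets is a judgment call.

```python
import io

import pandas as pd

# Toy frame with a gappy index, like a frame left after filtering rows out.
toy = pd.DataFrame(
    {"user_name": ["mp_a", "mp_b"], "retweets": [3, 0]},
    index=[5, 9],
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```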