# Twitter Brexit Analysis

## General Dataset Description

The data used in this project are tweets exported from Twitter in the context of Brexit, the UK's withdrawal from the EU.

The tweets were collected using the following hashtags: #eureferendum, #euref, #brexit, #no2eu, #yes2eu, #notoeu, #yestoeu, #betteroffout, #betteroffin, #voteout, #votein, #eureform, #ukineu, #Bremain, #EUpoll, #UKreferendum, #UKandEU, #EUpol, #ImagineEurope, #EdEUref, #MyImageOfTheEU, #eu, #referendum, #europe, #UKRef, #ref, #migrant, #refugee, #strongerin, #leadnotleave, #voteremain, #britainout, #leaveeu, #voteleave, #beleave, #loveeuropeleaveeu, #greenerin, #britin, #eunegotiation, #eurenegotiation, #grassrootsout, #projectfear, #projectfact, #remaineu, #europeanunion, #brexitfears, #remain, #leave, #takecontrol, #euinorout, #leavechaos, #labourin, #conservatives, #bregret, #brexitvote, #brexitin5words, #labourcoup, #eurefresults, #VoteLeaveLoseControl, #regrexit, #wearethe48, #scexit, #niineurope, #scotlandineurope, #article50, #scotlandineu
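As a rough illustration of how a tweet relates to this hashtag list, the sketch below checks whether a parsed tweet carries at least one tracked hashtag. The `TRACKED` set is a small hypothetical subset of the list, and the helper names `tweet_hashtags` and `matches_tracked` are made up for this example. It assumes hashtags live under `entities.hashtags`, with long tweets keeping the full list under `extended_tweet`, as in the sample tweet printed later in this document.

```python
# Hypothetical subset of the tracked hashtags, lowercased and without '#'.
TRACKED = {"brexit", "euref", "voteleave", "voteremain", "eu"}


def tweet_hashtags(tweet):
    """Return the lowercased hashtag texts found in a parsed tweet dict."""
    # Long tweets keep their full entity list under "extended_tweet".
    entities = tweet.get("extended_tweet", {}).get("entities") or tweet.get("entities", {})
    return {tag["text"].lower() for tag in entities.get("hashtags", [])}


def matches_tracked(tweet, tracked=TRACKED):
    """Return True if the tweet carries at least one tracked hashtag."""
    return bool(tweet_hashtags(tweet) & tracked)


# Minimal made-up tweet containing only the fields used here.
sample = {"entities": {"hashtags": [{"indices": [8, 15], "text": "Brexit"}]}}
print(matches_tracked(sample))
```

Lowercasing both sides makes the comparison case-insensitive, so `#Brexit` and `#brexit` are treated as the same hashtag.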

The collected data are stored as JSON files. In total they amount to approximately 18 GB of compressed tar archives.

The dataset was collected on five days around Brexit:

- 21 June 2016
- 22 June 2016
- 23 June 2016
- 22 September 2017
- 23 September 2017

For each day, JSON files of Brexit-related tweets were collected hourly over the full 24 hours.

The JSON files vary in size, ranging from 4 MB to 500 MB.

## Reading File

In this project I analyze the smallest JSON file in the dataset, which is around 4 MB. The tweets are stored one per line: each line is a complete JSON object holding the information for a single tweet.
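One way to locate the smallest file is to compare file sizes on disk. The sketch below is only an illustration: `smallest_json_file` is a hypothetical helper, and the demo runs on a throwaway temporary directory with made-up files rather than on the real dataset.

```python
import tempfile
from pathlib import Path


def smallest_json_file(data_dir):
    """Return the smallest *.json file in data_dir, or None if there is none."""
    files = list(Path(data_dir).glob("*.json"))
    return min(files, key=lambda p: p.stat().st_size, default=None)


# Demo on a throwaway directory; point data_dir at the extracted dataset instead.
with tempfile.TemporaryDirectory() as data_dir:
    Path(data_dir, "big.json").write_text('{"id": 1}\n' * 100)
    Path(data_dir, "small.json").write_text('{"id": 1}\n')
    smallest = smallest_json_file(data_dir)
    print(smallest.name, smallest.stat().st_size, "bytes")
```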

In [14]:
twitter_filename = "2017-09-22:03:05:01.json"

# Read only the first line; the with block closes the file afterwards.
with open(twitter_filename, 'r') as twitter_file:
    first_line = twitter_file.readline()
    print(first_line)

{"extended_tweet":{"extended_entities":{"media":[{"display_url":"pic.twitter.com/p0lZvp4Uqv","indices":[133,156],"sizes":{"small":{"w":680,"h":179,"resize":"fit"},"large":{"w":1024,"h":270,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":1024,"h":270,"resize":"fit"}},"id_str":"911048653409832965","expanded_url":"https://twitter.com/nathymora/status/911048657474113536/photo/1","media_url_https":"https://pbs.twimg.com/media/DKSx7F-XUAU6G5T.jpg","id":911048653409832965,"type":"photo","media_url":"http://pbs.twimg.com/media/DKSx7F-XUAU6G5T.jpg","url":"https://t.co/p0lZvp4Uqv"}]},"entities":{"urls":[{"display_url":"europeunion.press/?p=24973","indices":[26,49],"expanded_url":"http://europeunion.press/?p=24973","url":"https://t.co/iwDoMV68ZM"}],"hashtags":[{"indices":[50,53],"text":"EU"},{"indices":[54,59],"text":"News"},{"indices":[60,72],"text":"EuropeUnion"}],"media":[{"display_url":"pic.twitter.com/p0lZvp4Uqv","indices":[133,156],"sizes":{"small":{"w":680,"h":179,"

## Parse JSON Formatting

To analyze the data, we need to parse the string read from each line into a Python object. We can use the json module from the Python standard library.

In [21]:
import json
first_line_json = json.loads(first_line)
print(json.dumps(first_line_json, indent=4))

{
    "extended_tweet": {
        "extended_entities": {
            "media": [
                {
                    "display_url": "pic.twitter.com/p0lZvp4Uqv",
                    "indices": [
                        133,
                        156
                    ],
                    "sizes": {
                        "small": {
                            "w": 680,
                            "h": 179,
                            "resize": "fit"
                        },
                        "large": {
                            "w": 1024,
                            "h": 270,
                            "resize": "fit"
                        },
                        "thumb": {
                            "w": 150,
                            "h": 150,
                            "resize": "crop"
                        },
                        "medium": {
                            "w": 1024,
                            "h": 270,
                            "res
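For the larger files in the dataset, it may be preferable to parse tweets lazily, one line at a time, instead of holding every line in memory. The sketch below is a minimal illustration: `iter_tweets` is a hypothetical helper, and skipping blank or malformed lines is an assumption about how damaged input should be handled, not something the dataset requires. An in-memory `io.StringIO` stands in for an open file.

```python
import io
import json


def iter_tweets(lines):
    """Yield one parsed tweet per non-empty line, skipping malformed lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Assumption: a line that fails to parse is reported and skipped.
            print(f"Skipping line {line_number}: {err}")


# In-memory stand-in for an open file; the second line is deliberately cut off.
fake_file = io.StringIO(
    '{"id": 1, "text": "a"}\n{"id": 2, "te\n\n{"id": 3, "text": "c"}\n'
)
ids = [tweet["id"] for tweet in iter_tweets(fake_file)]
print(ids)
```

Because `iter_tweets` accepts any iterable of lines, it can be passed an open file object directly.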

In [1]:
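import json
import pandas as pd

# By default pandas limits how many columns it displays; this removes the limit.
pd.set_option('display.max_columns', None)

# Flatten each tweet into a one-row DataFrame. pd.json_normalize replaces the
# deprecated pandas.io.json.json_normalize import.
frames = []

# Reopen the file: the handle from the earlier cell is already closed.
with open(twitter_filename, 'r') as f:
    for line in f:
        tweet = json.loads(line)
        frames.append(pd.json_normalize(tweet))

# DataFrame.append was removed in pandas 2.0; concatenate once instead.
# Each row keeps index 0; pass ignore_index=True for a unique index.
df_tweets = pd.concat(frames)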
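To make the effect of flattening concrete, the sketch below runs `pd.json_normalize` on two made-up records shaped like a small slice of a tweet. The screen names and values are invented for illustration. Nested dictionaries become dotted column names such as `user.screen_name`, lists are left as they are, and keys missing from a record become missing values.

```python
import pandas as pd

# Made-up records shaped like a small slice of a tweet.
tweets = [
    {
        "created_at": "Fri Sep 22 02:05:01 +0000 2017",
        "user": {"screen_name": "alice", "verified": False},
        "entities": {"hashtags": [{"text": "EU"}]},
    },
    {
        "created_at": "Fri Sep 22 02:05:05 +0000 2017",
        "user": {"screen_name": "bob", "verified": True, "time_zone": "London"},
        "entities": {"hashtags": []},
    },
]

# Nested dict keys are joined with "." to form the column names.
flat = pd.json_normalize(tweets)
print(sorted(flat.columns))
```

This explains why the real DataFrame has several hundred columns: every nested key that appears in at least one tweet gets its own column.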


In [10]:
df_tweets.columns

Index(['contributors', 'coordinates', 'coordinates.coordinates',
       'coordinates.type', 'created_at', 'display_text_range',
       'entities.hashtags', 'entities.media', 'entities.symbols',
       'entities.urls',
       ...
       'user.profile_use_background_image', 'user.protected',
       'user.screen_name', 'user.statuses_count', 'user.time_zone',
       'user.translator_type', 'user.url', 'user.utc_offset', 'user.verified',
       'withheld_in_countries'],
      dtype='object', length=371)

In [4]:
df_tweets.head()

| | contributors | coordinates | coordinates.coordinates | coordinates.type | created_at | display_text_range | entities.hashtags | entities.media | entities.symbols | entities.urls | ... | user.profile_use_background_image | user.protected | user.screen_name | user.statuses_count | user.time_zone | user.translator_type | user.url | user.utc_offset | user.verified | withheld_in_countries |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | | Fri Sep 22 02:05:01 +0000 2017 | [0, 140] | [{'indices': [50, 53], 'text': 'EU'}, {'indice... | | [] | [{'display_url': 'europeunion.press/?p=24973',... | ... | True | False | nathymora | 26297 | Greenland | none | | -7200.0 | False | |
| 0 | | | | | Fri Sep 22 02:05:05 +0000 2017 | [0, 140] | [{'indices': [36, 39], 'text': 'EU'}, {'indice... | | [] | [{'display_url': 'europeunion.press/?p=24975',... | ... | False | False | EuropeUnion | 38358 | London | none | http://EuropeUnion.press | 3600.0 | False | |
| 0 | | | | | Fri Sep 22 02:05:08 +0000 2017 | [0, 140] | [{'indices': [50, 53], 'text': 'EU'}, {'indice... | | [] | [{'display_url': 'europeunion.press/?p=24974',... | ... | True | False | nathymora | 26298 | Greenland | none | | -7200.0 | False | |
| 0 | | | | | Fri Sep 22 02:05:09 +0000 2017 | | [{'indices': [65, 74], 'text': 'buddhism'}, {'... | | [] | [{'display_url': 'tsemrinpoche.com/tsem-tulku-... | ... | True | False | KumarTa69783205 | 420 | | none | | | False | |
| 0 | | | | | Fri Sep 22 02:05:14 +0000 2017 | [0, 15] | [{'indices': [8, 15], 'text': 'Brexit'}] | | [] | [{'display_url': 'twitter.com/GuitarMoog/sta…'... | ... | True | False | SergioTavaresUK | 8616 | Europe/London | none | https://www.facebook.com/lusoscot/ | 3600.0 | False | |
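The `created_at` values above are plain strings. For any time-based analysis they would likely need converting to timestamps first. The sketch below shows one possible conversion with `pd.to_datetime`, using two values copied from the table. The format string is an assumption inferred from those sample values.

```python
import pandas as pd

# Sample values copied from the created_at column shown above.
created_at = pd.Series([
    "Fri Sep 22 02:05:01 +0000 2017",
    "Fri Sep 22 02:05:14 +0000 2017",
])

# %z consumes the "+0000" offset; utc=True returns timezone-aware UTC timestamps.
timestamps = pd.to_datetime(created_at, format="%a %b %d %H:%M:%S %z %Y", utc=True)
print(timestamps.dt.hour.tolist())
```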


In [8]:
df_tweets['user.time_zone'].unique()

array(['Greenland', 'London', None, 'Europe/London', 'Casablanca',
       'Copenhagen', 'Eastern Time (US & Canada)',
       'Pacific Time (US & Canada)', 'Athens',
       'Central Time (US & Canada)', 'Central America', 'Bern', 'Dublin',
       'Belgrade', 'Brasilia', 'Stockholm', 'Pretoria', 'Perth',
       'Ljubljana', 'Madrid', 'Paris', 'Quito', 'Melbourne', 'Vienna',
       'International Date Line West', 'Edinburgh', 'UTC',
       'Atlantic Time (Canada)', 'Mountain Time (US & Canada)', 'Tokyo',
       'Auckland', 'Amsterdam', 'Sri Jayawardenepura', 'Singapore',
       'New Delhi', 'Mexico City', 'Hong Kong', 'America/Los_Angeles',
       'Abu Dhabi', 'Lisbon', 'Mumbai', 'Sydney', 'Volgograd', 'Kyiv',
       'Fiji', 'Hawaii', 'Brussels', 'Tehran', 'Alaska', 'Arizona',
       'Berlin', 'Caracas', 'Beijing', 'Baghdad', 'Cairo', 'Jerusalem',
       "Nuku'alofa", 'Helsinki', 'America/Phoenix', 'America/Toronto'], dtype=object)
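The list of unique values shows which time zones occur but not how often. A natural next step could be to count tweets per time zone, as in the sketch below. It uses a small made-up Series standing in for `df_tweets['user.time_zone']`, so the counts are illustrative only.

```python
import pandas as pd

# Made-up values standing in for df_tweets['user.time_zone'].
time_zones = pd.Series(["London", None, "Europe/London", "London", "Paris", None])

# dropna=False keeps the tweets whose user has no time zone set.
counts = time_zones.value_counts(dropna=False)
print(counts)
```

Note that the real column mixes naming styles, for example 'London' and 'Europe/London', so equivalent zones may need to be merged before the counts are compared.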