# Twitter Brexit Analysis

## General Dataset Description

The data used in this project are tweets about Brexit in the UK, exported from Twitter.

The tweets were collected using the following hashtags: #eureferendum, #euref, #brexit, #no2eu, #yes2eu, #notoeu, #yestoeu, #betteroffout, #betteroffin, #voteout, #votein, #eureform, #ukineu, #Bremain, #EUpoll, #UKreferendum, #UKandEU, #EUpol, #ImagineEurope, #EdEUref, #MyImageOfTheEU, #eu, #referendum, #europe, #UKRef, #ref, #migrant, #refugee, #strongerin, #leadnotleave, #voteremain, #britainout, #leaveeu, #voteleave, #beleave, #loveeuropeleaveeu, #greenerin, #britin, #eunegotiation, #eurenegotiation, #grassrootsout, #projectfear, #projectfact, #remaineu, #europeanunion, #brexitfears, #remain, #leave, #takecontrol, #euinorout, #leavechaos, #labourin, #conservatives, #bregret, #brexitvote, #brexitin5words, #labourcoup, #eurefresults, #VoteLeaveLoseControl, #regrexit, #wearethe48, #scexit, #niineurope, #scotlandineurope, #article50, #scotlandineu

The collected tweets are stored as JSON files. In total they amount to approximately 18GB of compressed tar files.

The dataset was collected on five days around Brexit:

- 21 June 2016
- 22 June 2016
- 23 June 2016
- 22 September 2017
- 23 September 2017

For each day there are JSON files of tweets about Brexit, collected every hour over the 24 hours.

The JSON files vary in size, ranging from 4MB to 500MB.
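
Because the files differ so much in size, it can help to list what an archive contains, smallest file first, before extracting anything. The following is a minimal sketch using the standard `tarfile` module. It assumes the hourly `.json` files sit directly inside each tar archive; `json_members_by_size` is a hypothetical helper name, and the archive name in the usage comment is invented for illustration.

```python
import io
import tarfile


def json_members_by_size(archive):
    """Return (name, size_in_bytes) for every .json member, smallest first.

    `archive` is an open tarfile.TarFile. Listing a compressed archive
    still has to read through it once, so this can be slow on large files.
    """
    members = [
        (member.name, member.size)
        for member in archive.getmembers()
        if member.isfile() and member.name.endswith(".json")
    ]
    return sorted(members, key=lambda pair: pair[1])


# Usage (the archive name below is hypothetical):
# with tarfile.open("brexit-tweets.tar.gz", "r:*") as archive:
#     for name, size in json_members_by_size(archive):
#         print(name, size)
```

Opening with mode `"r:*"` lets `tarfile` detect the compression on its own, so the same call should work for gzip, bzip2, or xz archives.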

One question for this data is how to analyze the users based on their physical locations. Since Brexit is a political event revolving around the UK and the EU, the most active users should be Twitter users in the UK and the EU. A second question is how the users felt about Brexit.

This project will try to analyze the tweets based on the user locations and their sentiments.

## Reading File

In this project I analyze the smallest JSON file in the dataset, which is around 4MB. In the JSON file the tweets are stored line by line: each line is a complete JSON object holding the information for one tweet.

In [14]:
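twitter_filename = "2017-09-22:03:05:01.json"

# Open the file in a context manager so it is closed automatically,
# then read only the first line, which holds one tweet.
with open(twitter_filename, 'r', encoding='utf-8') as f:
    first_line = f.readline()
    print(first_line)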

{"extended_tweet":{"extended_entities":{"media":[{"display_url":"pic.twitter.com/p0lZvp4Uqv","indices":[133,156],"sizes":{"small":{"w":680,"h":179,"resize":"fit"},"large":{"w":1024,"h":270,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":1024,"h":270,"resize":"fit"}},"id_str":"911048653409832965","expanded_url":"https://twitter.com/nathymora/status/911048657474113536/photo/1","media_url_https":"https://pbs.twimg.com/media/DKSx7F-XUAU6G5T.jpg","id":911048653409832965,"type":"photo","media_url":"http://pbs.twimg.com/media/DKSx7F-XUAU6G5T.jpg","url":"https://t.co/p0lZvp4Uqv"}]},"entities":{"urls":[{"display_url":"europeunion.press/?p=24973","indices":[26,49],"expanded_url":"http://europeunion.press/?p=24973","url":"https://t.co/iwDoMV68ZM"}],"hashtags":[{"indices":[50,53],"text":"EU"},{"indices":[54,59],"text":"News"},{"indices":[60,72],"text":"EuropeUnion"}],"media":[{"display_url":"pic.twitter.com/p0lZvp4Uqv","indices":[133,156],"sizes":{"small":{"w":680,"h":179,"

## Parse JSON Formatting

To analyze the data more easily we need to parse each line from a string into a JSON object. We can use Python's built-in json module to parse the string and print it in a more readable, indented format.

In [33]:
import json
first_line_json = json.loads(first_line)
print(json.dumps(first_line_json, indent=2))

{
  "extended_tweet": {
    "extended_entities": {
      "media": [
        {
          "display_url": "pic.twitter.com/p0lZvp4Uqv",
          "indices": [
            133,
            156
          ],
          "sizes": {
            "small": {
              "w": 680,
              "h": 179,
              "resize": "fit"
            },
            "large": {
              "w": 1024,
              "h": 270,
              "resize": "fit"
            },
            "thumb": {
              "w": 150,
              "h": 150,
              "resize": "crop"
            },
            "medium": {
              "w": 1024,
              "h": 270,
              "resize": "fit"
            }
          },
          "id_str": "911048653409832965",
          "expanded_url": "https://twitter.com/nathymora/status/911048657474113536/photo/1",
          "media_url_https": "https://pbs.twimg.com/media/DKSx7F-XUAU6G5T.jpg",
          "id": 911048653409832965,
          "type": "photo",
          "media_ur

To analyze the tweet we can use json_normalize to flatten the nested JSON into a flat table.

In [40]:
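import pandas as pd

# By default pandas limits the number of columns it displays. This removes the limit.
pd.set_option('display.max_columns', None)

# Flatten the nested tweet into a one-row DataFrame with dotted column names.
# pd.json_normalize replaces the older pandas.io.json.json_normalize import, and
# building the frame directly avoids DataFrame.append, which pandas 2.0 removed.
df_tweets = pd.json_normalize(first_line_json)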

In [41]:
df_tweets

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities.hashtags,entities.symbols,entities.urls,entities.user_mentions,extended_tweet.display_text_range,extended_tweet.entities.hashtags,extended_tweet.entities.media,extended_tweet.entities.symbols,extended_tweet.entities.urls,extended_tweet.entities.user_mentions,extended_tweet.extended_entities.media,extended_tweet.full_text,favorite_count,favorited,filter_level,geo,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,quote_count,reply_count,retweet_count,retweeted,source,text,timestamp_ms,truncated,user.contributors_enabled,user.created_at,user.default_profile,user.default_profile_image,user.description,user.favourites_count,user.follow_request_sent,user.followers_count,user.following,user.friends_count,user.geo_enabled,user.id,user.id_str,user.is_translator,user.lang,user.listed_count,user.location,user.name,user.notifications,user.profile_background_color,user.profile_background_image_url,user.profile_background_image_url_https,user.profile_background_tile,user.profile_banner_url,user.profile_image_url,user.profile_image_url_https,user.profile_link_color,user.profile_sidebar_border_color,user.profile_sidebar_fill_color,user.profile_text_color,user.profile_use_background_image,user.protected,user.screen_name,user.statuses_count,user.time_zone,user.translator_type,user.url,user.utc_offset,user.verified
0,,,Fri Sep 22 02:05:01 +0000 2017,"[0, 140]","[{'indices': [50, 53], 'text': 'EU'}, {'indice...",[],"[{'display_url': 'europeunion.press/?p=24973',...",[],"[0, 132]","[{'indices': [50, 53], 'text': 'EU'}, {'indice...","[{'display_url': 'pic.twitter.com/p0lZvp4Uqv',...",[],"[{'display_url': 'europeunion.press/?p=24973',...",[],"[{'display_url': 'pic.twitter.com/p0lZvp4Uqv',...",EuropeUnion : Click HERE▶️https://t.co/iwDoMV6...,0,False,low,,911048657474113536,911048657474113536,,,,,,False,fr,,False,0,0,0,False,"<a href=""https://ifttt.com"" rel=""nofollow"">IFT...",EuropeUnion : Click HERE▶️https://t.co/iwDoMV6...,1506045901064,True,False,Mon May 30 18:54:26 +0000 2011,False,False,"Inteligente, simpática, interesante, preparada...",25,,611,,404,False,308046422,308046422,False,es,6,,Nathy Mora,,F34970,http://pbs.twimg.com/profile_background_images...,https://pbs.twimg.com/profile_background_image...,True,https://pbs.twimg.com/profile_banners/30804642...,http://pbs.twimg.com/profile_images/3788000003...,https://pbs.twimg.com/profile_images/378800000...,AD1B64,32003E,46094C,70135A,True,False,nathymora,26297,Greenland,none,,-7200,False
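
To see what the flattening does on a smaller scale, here is a minimal sketch with a hand-made nested record. The record is invented for illustration and is not taken from the dataset. It assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# A small invented record with the same kind of nesting as a tweet.
nested = {
    "id": 1,
    "user": {"id": 42, "location": None},
    "entities": {"hashtags": [{"text": "EU"}]},
}

flat = pd.json_normalize(nested)

# Nested dicts become dotted column names such as "user.id";
# lists are not expanded and stay as a single cell.
print(sorted(flat.columns))
print(flat.loc[0, "entities.hashtags"])
```

This matches the table above, where `user.*` fields are separate columns while `entities.hashtags` still holds a list of dicts.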


We can list the columns and count how many columns the tweet has.

In [52]:
print(df_tweets.columns)
print('There are {} columns for the tweet'.format(len(df_tweets.columns)))

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities.hashtags', 'entities.symbols', 'entities.urls',
       'entities.user_mentions', 'extended_tweet.display_text_range',
       'extended_tweet.entities.hashtags', 'extended_tweet.entities.media',
       'extended_tweet.entities.symbols', 'extended_tweet.entities.urls',
       'extended_tweet.entities.user_mentions',
       'extended_tweet.extended_entities.media', 'extended_tweet.full_text',
       'favorite_count', 'favorited', 'filter_level', 'geo', 'id', 'id_str',
       'in_reply_to_screen_name', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place',
       'possibly_sensitive', 'quote_count', 'reply_count', 'retweet_count',
       'retweeted', 'source', 'text', 'timestamp_ms', 'truncated',
       'user.contributors_enabled', 'user.created_at', 'user.default_profile',
       'user.default_profile_

The type of each column can be seen here.

In [51]:
df_tweets.dtypes

contributors                               object
coordinates                                object
created_at                                 object
display_text_range                         object
entities.hashtags                          object
entities.symbols                           object
entities.urls                              object
entities.user_mentions                     object
extended_tweet.display_text_range          object
extended_tweet.entities.hashtags           object
extended_tweet.entities.media              object
extended_tweet.entities.symbols            object
extended_tweet.entities.urls               object
extended_tweet.entities.user_mentions      object
extended_tweet.extended_entities.media     object
extended_tweet.full_text                   object
favorite_count                              int64
favorited                                    bool
filter_level                               object
geo                                        object
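
The listing shows `created_at` stored as a generic `object` column, which here means a string. If the analysis later needs to work with time, the column can be parsed into a proper datetime. The following is a minimal sketch using the `created_at` and `timestamp_ms` values of the first tweet shown earlier. The format string is my reading of the `created_at` layout, and `df_time` is a hypothetical name used only for this example.

```python
import pandas as pd

# One-row stand-in using the values shown for the first tweet.
df_time = pd.DataFrame({
    "created_at": ["Fri Sep 22 02:05:01 +0000 2017"],
    "timestamp_ms": ["1506045901064"],
})

# Parse the Twitter-style timestamp string into a timezone-aware datetime.
df_time["created_at"] = pd.to_datetime(
    df_time["created_at"], format="%a %b %d %H:%M:%S %z %Y", utc=True
)

# timestamp_ms holds milliseconds since the Unix epoch; cast to int first
# in case it was read as a string.
df_time["timestamp"] = pd.to_datetime(
    df_time["timestamp_ms"].astype("int64"), unit="ms", utc=True
)

print(df_time.dtypes)
```

Both columns describe the same moment, so comparing them is a quick sanity check on the chosen format string.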


Here we can see that not all of the columns are useful. Columns such as user.profile_background_color and similar ones do not help with this analysis. To answer the questions posed in the dataset description above, I will use the following columns to explore the data:
- id
- text
- user.id
- user.location
- user.time_zone

Choosing these columns is also an optimization. Because the data and the accompanying information are quite large, analysis can take a long time. Keeping only the columns we need reduces the amount of data to process and speeds up the analysis.
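
The column selection can also be applied before the data ever reaches pandas, so the unused fields are never kept in memory. The following is a minimal sketch, assuming each tweet has already been parsed into a Python dict. `WANTED` and `pick_fields` are hypothetical names, and the `text` value in the sample is a placeholder rather than the real tweet text.

```python
# Dotted column names chosen in the list above.
WANTED = ["id", "text", "user.id", "user.location", "user.time_zone"]


def pick_fields(tweet, dotted_keys=WANTED):
    """Return a flat dict with only the requested keys (None if a key is absent)."""
    row = {}
    for dotted in dotted_keys:
        value = tweet
        # Walk down the nested dicts one path segment at a time.
        for part in dotted.split("."):
            value = value.get(part) if isinstance(value, dict) else None
            if value is None:
                break
        row[dotted] = value
    return row


# Sample shaped like the first tweet; "text" is a placeholder value.
sample_tweet = {
    "id": 911048657474113536,
    "text": "placeholder text",
    "user": {"id": 308046422, "location": None, "time_zone": "Greenland"},
    "source": "ignored",
}
row = pick_fields(sample_tweet)
print(row)
```

A list of such rows can be passed straight to `pd.DataFrame`, which gives the same dotted column names that `json_normalize` produces.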

## Parsing and Reading All Tweets

Now we will parse all of the tweets in the file and keep only the chosen columns.

In [None]:
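# Reopen the file by name (a handle closed by an earlier `with` block cannot
# be reused), parse every line, and flatten all tweets in one call.
# rows = []
# with open(twitter_filename, 'r', encoding='utf-8') as f:
#     for line in f:
#         rows.append(json.loads(line))
# df_tweets = pd.json_normalize(rows)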

In [56]:
df_tweets['user.location'].unique()

array([None], dtype=object)
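
The single `None` is what one would expect if `df_tweets` still holds only the first tweet, whose `user.location` is empty. The following is a minimal sketch of loading every line, keeping only the chosen columns, and counting locations. The three sample lines are invented for illustration, `load_tweets` is a hypothetical helper, and pandas 1.0 or later is assumed for `pd.json_normalize`.

```python
import io
import json

import pandas as pd

WANTED = ["id", "text", "user.id", "user.location", "user.time_zone"]


def load_tweets(lines, columns=WANTED):
    """Parse line-delimited JSON tweets and keep only the given columns."""
    records = [json.loads(line) for line in lines if line.strip()]
    flat = pd.json_normalize(records)
    # reindex keeps the column order and fills columns missing from the data.
    return flat.reindex(columns=columns)


# Invented stand-in for the real file; with real data, pass an open file.
sample = io.StringIO(
    '{"id": 1, "text": "a", "user": {"id": 10, "location": "London", "time_zone": "London"}}\n'
    '{"id": 2, "text": "b", "user": {"id": 11, "location": null, "time_zone": null}}\n'
    '{"id": 3, "text": "c", "user": {"id": 12, "location": "London", "time_zone": null}}\n'
)
df_sample = load_tweets(sample)

# dropna=False keeps the tweets without a location in the count.
location_counts = df_sample["user.location"].value_counts(dropna=False)
print(location_counts)
```

Collecting the parsed tweets in a list and normalizing once avoids rebuilding the DataFrame for every line, which matters for the larger files in the dataset.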