# Twitter Brexit Analysis

## General Dataset Description

The data used in this project are tweets exported from Twitter in the context of the UK's Brexit referendum.

The tweets were collected using the following hashtags: #eureferendum, #euref, #brexit, #no2eu, #yes2eu, #notoeu, #yestoeu, #betteroffout, #betteroffin, #voteout, #votein, #eureform, #ukineu, #Bremain, #EUpoll, #UKreferendum, #UKandEU, #EUpol, #ImagineEurope, #EdEUref, #MyImageOfTheEU, #eu, #referendum, #europe, #UKRef, #ref, #migrant, #refugee, #strongerin, #leadnotleave, #voteremain, #britainout, #leaveeu, #voteleave, #beleave, #loveeuropeleaveeu, #greenerin, #britin, #eunegotiation, #eurenegotiation, #grassrootsout, #projectfear, #projectfact, #remaineu, #europeanunion, #brexitfears, #remain, #leave, #takecontrol, #euinorout, #leavechaos, #labourin, #conservatives, #bregret, #brexitvote, #brexitin5words, #labourcoup, #eurefresults, #VoteLeaveLoseControl, #regrexit, #wearethe48, #scexit, #niineurope, #scotlandineurope, #article50, #scotlandineu.

The collected data are stored as JSON files. In total they amount to approximately 18 GB of compressed tar archives.

The dataset was collected on five days around Brexit:

- 21 June 2016
- 22 June 2016
- 23 June 2016
- 22 September 2017
- 23 September 2017

For each day there are JSON files of Brexit-related tweets, collected every hour over the 24 hours.

The JSON files vary in size, ranging from 4 MB to 500 MB.
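
Given the size of the archives, one option is to read tweets straight out of a tar file instead of unpacking everything first. The sketch below is only an illustration: the function name `iter_tweets_from_tar` and the archive path are hypothetical, and it assumes each archive member ending in `.json` holds one tweet per line, as described later for the individual files.

```python
import io
import json
import tarfile


def iter_tweets_from_tar(archive_path):
    """Yield one parsed tweet (a dict) per JSON line in every .json member.

    `archive_path` is a hypothetical path to one of the dataset archives.
    Mode "r:*" lets tarfile detect the compression (gzip, bz2, xz) itself.
    """
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            raw = archive.extractfile(member)
            # Decode line by line so a large member is never loaded whole.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)
```

Because this is a generator, a caller can stop after the first few tweets without reading the rest of the archive.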

One question for this dataset is how to analyze users by their physical location. Because Brexit is a political event centered on the UK and the EU, the most active Twitter users are expected to be in the UK and the EU. A second question is how users felt about Brexit.

This project will try to analyze the tweets based on the user locations and their sentiments.

## Reading File

In this project I analyze the smallest JSON file in the dataset, which is around 4 MB. The file stores tweets line by line: each line is a complete JSON object holding the information for one tweet.

In [33]:
twitter_filename = "2017-09-23:11:05:01.json"

# Read only the first line (one tweet) to inspect the raw format.
with open(twitter_filename, 'r') as f:
    first_line = f.readline()
    print(first_line)

{"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"created_at":"Sat Sep 23 10:05:04 +0000 2017","in_reply_to_user_id_str":null,"source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client<\/a>","retweeted_status":{"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"created_at":"Sat Sep 23 09:59:13 +0000 2017","in_reply_to_user_id_str":null,"source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone<\/a>","retweet_count":2,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":null,"is_quote_status":false,"id_str":"911530384046870528","in_reply_to_user_id":null,"favorite_count":0,"id":911530384046870528,"text":"#Brexit will turn EU from equal partner to competitor with better trade deals worldwide &amp; a more attractive market 500 vs 65 Mln! #ScotRef","place":null,"lang":"en","quote_count":0,"favorited":false,"coordinates":null,"truncated":false,"reply_count":0,"entities":{"urls":[],"hashtags"

## Parse JSON Formatting

To analyze the data, each line must first be parsed from a plain string into a JSON structure. Python's built-in json library can parse the string (json.loads) and print it back in a more readable, indented form (json.dumps).

In [20]:
import json
first_line_json = json.loads(first_line)
print(json.dumps(first_line_json, indent=2))

{
  "extended_tweet": {
    "extended_entities": {
      "media": [
        {
          "display_url": "pic.twitter.com/p0lZvp4Uqv",
          "indices": [
            133,
            156
          ],
          "sizes": {
            "small": {
              "w": 680,
              "h": 179,
              "resize": "fit"
            },
            "large": {
              "w": 1024,
              "h": 270,
              "resize": "fit"
            },
            "thumb": {
              "w": 150,
              "h": 150,
              "resize": "crop"
            },
            "medium": {
              "w": 1024,
              "h": 270,
              "resize": "fit"
            }
          },
          "id_str": "911048653409832965",
          "expanded_url": "https://twitter.com/nathymora/status/911048657474113536/photo/1",
          "media_url_https": "https://pbs.twimg.com/media/DKSx7F-XUAU6G5T.jpg",
          "id": 911048653409832965,
          "type": "photo",
          "media_ur

To analyze the tweet we can use json_normalize to flatten the nested JSON into a flat table.

In [34]:
import pandas as pd
# Since pandas 1.0 json_normalize lives in the top-level namespace;
# the pandas.io.json import path is deprecated.
from pandas import json_normalize

# Pandas limits the number of displayed columns by default. This removes the limit.
pd.set_option('display.max_columns', None)

# Flatten the nested tweet into a one-row DataFrame.
df_tweet = json_normalize(first_line_json)

In [35]:
df_tweet

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities.hashtags,entities.symbols,entities.urls,entities.user_mentions,extended_tweet.display_text_range,extended_tweet.entities.hashtags,extended_tweet.entities.media,extended_tweet.entities.symbols,extended_tweet.entities.urls,extended_tweet.entities.user_mentions,extended_tweet.extended_entities.media,extended_tweet.full_text,favorite_count,favorited,filter_level,geo,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,quote_count,reply_count,retweet_count,retweeted,source,text,timestamp_ms,truncated,user.contributors_enabled,user.created_at,user.default_profile,user.default_profile_image,user.description,user.favourites_count,user.follow_request_sent,user.followers_count,user.following,user.friends_count,user.geo_enabled,user.id,user.id_str,user.is_translator,user.lang,user.listed_count,user.location,user.name,user.notifications,user.profile_background_color,user.profile_background_image_url,user.profile_background_image_url_https,user.profile_background_tile,user.profile_banner_url,user.profile_image_url,user.profile_image_url_https,user.profile_link_color,user.profile_sidebar_border_color,user.profile_sidebar_fill_color,user.profile_text_color,user.profile_use_background_image,user.protected,user.screen_name,user.statuses_count,user.time_zone,user.translator_type,user.url,user.utc_offset,user.verified
0,,,Fri Sep 22 02:05:01 +0000 2017,"[0, 140]","[{'indices': [50, 53], 'text': 'EU'}, {'indice...",[],"[{'display_url': 'europeunion.press/?p=24973',...",[],"[0, 132]","[{'indices': [50, 53], 'text': 'EU'}, {'indice...","[{'display_url': 'pic.twitter.com/p0lZvp4Uqv',...",[],"[{'display_url': 'europeunion.press/?p=24973',...",[],"[{'display_url': 'pic.twitter.com/p0lZvp4Uqv',...",EuropeUnion : Click HERE▶️https://t.co/iwDoMV6...,0,False,low,,911048657474113536,911048657474113536,,,,,,False,fr,,False,0,0,0,False,"<a href=""https://ifttt.com"" rel=""nofollow"">IFT...",EuropeUnion : Click HERE▶️https://t.co/iwDoMV6...,1506045901064,True,False,Mon May 30 18:54:26 +0000 2011,False,False,"Inteligente, simpática, interesante, preparada...",25,,611,,404,False,308046422,308046422,False,es,6,,Nathy Mora,,F34970,http://pbs.twimg.com/profile_background_images...,https://pbs.twimg.com/profile_background_image...,True,https://pbs.twimg.com/profile_banners/30804642...,http://pbs.twimg.com/profile_images/3788000003...,https://pbs.twimg.com/profile_images/378800000...,AD1B64,32003E,46094C,70135A,True,False,nathymora,26297,Greenland,none,,-7200,False
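
To make the flattening easier to follow than in the wide output above, here is a minimal sketch on a hand-made record. The `toy_tweet` dictionary is invented for illustration and is not taken from the dataset; it assumes pandas 1.0 or later, where `pd.json_normalize` is available.

```python
import pandas as pd

# A hand-made stand-in for a tweet, not taken from the dataset.
toy_tweet = {
    "id": 1,
    "text": "hello",
    "user": {"id": 42, "location": "London", "time_zone": None},
    "entities": {"hashtags": [{"text": "brexit", "indices": [0, 7]}]},
}

flat = pd.json_normalize(toy_tweet)

# Nested dicts become dotted column names; lists are kept as single cells.
print(sorted(flat.columns))
```

This is why the real tweet ends up with column names such as user.location and entities.hashtags.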


We can list the columns and count how many the tweet has.

In [36]:
print(df_tweet.columns)
print('There are {} columns for the tweet'.format(len(df_tweet.columns)))

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities.hashtags', 'entities.symbols', 'entities.urls',
       'entities.user_mentions', 'extended_tweet.display_text_range',
       'extended_tweet.entities.hashtags', 'extended_tweet.entities.media',
       'extended_tweet.entities.symbols', 'extended_tweet.entities.urls',
       'extended_tweet.entities.user_mentions',
       'extended_tweet.extended_entities.media', 'extended_tweet.full_text',
       'favorite_count', 'favorited', 'filter_level', 'geo', 'id', 'id_str',
       'in_reply_to_screen_name', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place',
       'possibly_sensitive', 'quote_count', 'reply_count', 'retweet_count',
       'retweeted', 'source', 'text', 'timestamp_ms', 'truncated',
       'user.contributors_enabled', 'user.created_at', 'user.default_profile',
       'user.default_profile_

The column types are shown below.

In [24]:
df_tweet.dtypes

contributors                               object
coordinates                                object
created_at                                 object
display_text_range                         object
entities.hashtags                          object
entities.symbols                           object
entities.urls                              object
entities.user_mentions                     object
extended_tweet.display_text_range          object
extended_tweet.entities.hashtags           object
extended_tweet.entities.media              object
extended_tweet.entities.symbols            object
extended_tweet.entities.urls               object
extended_tweet.entities.user_mentions      object
extended_tweet.extended_entities.media     object
extended_tweet.full_text                   object
favorite_count                              int64
favorited                                    bool
filter_level                               object
geo                                        object


Not all of these columns are useful here. Columns such as user.profile_background_color, and others like it, carry nothing relevant to the analysis. To answer the questions posed in the description at the top, I will use the following columns to explore the data:
- id
- text
- user.id
- user.location
- user.time_zone

Choosing these columns is also an optimization. Because the data and the accompanying metadata are large, analysis can take a long time; keeping only the columns that are actually needed speeds it up.
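
One way to act on this, sketched under assumptions, is to pull the chosen fields out of each parsed tweet directly instead of flattening every column first. The helper `pick_fields` and the `WANTED` list are hypothetical names introduced for this example, and the sample line is made up.

```python
import json

# Dotted names match the columns listed above.
WANTED = ["id", "text", "user.id", "user.location", "user.time_zone"]


def pick_fields(tweet, dotted_names=WANTED):
    """Return only the requested fields from one parsed tweet dict.

    A dotted name such as "user.location" is followed through nested dicts;
    a missing or null intermediate value yields None instead of an error.
    """
    row = {}
    for name in dotted_names:
        value = tweet
        for key in name.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        row[name] = value
    return row


sample_line = '{"id": 1, "text": "hi", "user": {"id": 7, "location": null}}'
print(pick_fields(json.loads(sample_line)))
```

A list of such rows can be passed to `pd.DataFrame` in one call, which avoids building a wide intermediate table per tweet.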

The dataset does contain columns intended for location: 'coordinates' and 'geo'. I do not use them in this analysis because, at a glance, almost none of the tweets in this particular file have values in these columns. Instead, I use 'user.location' and 'user.time_zone' as the basis for the user's location.
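
The "at a glance" impression could be checked with a small count of how often each field is actually filled. The following is a sketch only: `non_null_share` is a hypothetical helper, and the two sample lines are made up rather than taken from the file.

```python
import json


def non_null_share(lines, fields=("coordinates", "geo")):
    """Return, per top-level field, the share of tweets where it is not null."""
    counts = {field: 0 for field in fields}
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        total += 1
        for field in fields:
            if tweet.get(field) is not None:
                counts[field] += 1
    return {field: (counts[field] / total if total else 0.0) for field in fields}


# Two made-up lines standing in for a real file.
sample = [
    '{"id": 1, "coordinates": null, "geo": null}',
    '{"id": 2, "coordinates": {"type": "Point"}, "geo": null}',
]
print(non_null_share(sample))
```

Since the function accepts any iterable of lines, an open file handle for one of the hourly JSON files could be passed in place of `sample`.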

## Parsing and reading all tweets data

Now we parse every tweet in the file and keep only the columns chosen for the analysis.

In [37]:
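columns = ['id', 'text', 'user.id', 'user.location', 'user.time_zone']
rows = []
with open(twitter_filename, 'r') as f:
    for line in f:
        tweet = json.loads(line)
        # Flatten the nested tweet and keep only the columns of interest.
        rows.append(json_normalize(tweet).loc[:, columns])

# Concatenate once at the end: appending to a DataFrame inside the loop is slow,
# and DataFrame.append was removed in pandas 2.0.
df_tweets = pd.concat(rows, ignore_index=True)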

In [40]:
df_tweets.head()

| | id | text | user.id | user.location | user.time_zone |
|---|---|---|---|---|---|
| 0 | 911531854305652738 | RT @BigMarkyB: #Brexit will turn EU from equal... | 880874466393411585 | Scotland, United Kingdom | UTC |
| 1 | 911531855555506177 | RT @2010LeeHurst: This will be my final commen... | 707546304784769024 | North West, England | |
| 2 | 911531855878295552 | RT @Oceaanfietser: Any deal with Islamists ter... | 888726432221773824 | | |
| 3 | 911531857266610176 | RT @Oceaanfietser: Any deal with Islamists ter... | 883617751666835458 | | |
| 4 | 911531866280341504 | RT @Oceaanfietser: Any deal with Islamists ter... | 889420557590028288 | | |


In [41]:
df_tweets.tail()

| | id | text | user.id | user.location | user.time_zone |
|---|---|---|---|---|---|
| 5730 | 911546921218699265 | #democracia #referendum #dublin #1Oct #ireland... | 158130595 | Dublin City, Ireland | Dublin |
| 5731 | 911546932002197504 | RT @MarcusAgrippa4: The #Tories just lost all ... | 21178239 | Hampshire Cornwall Cuba | London |
| 5732 | 911546933797408768 | RT @petertimmins3: If #Brexit is such a good i... | 25040839 | Stoke; Proudly still in the EU | London |
| 5733 | 911546937215787008 | RT @jurygroup: Are you EU remainers feeling mo... | 55952846 | wales | London |
| 5734 | 911546940302790656 | Can only assume Macron was asleep during TM's ... | 1888241689 | Earth | |


## User location

After loading the relevant columns into Pandas we can see the user location through the 'user.location' and 'user.time_zone' columns.

In [50]:
df_tweets['user.location'].value_counts()

London                              168
London, England                     112
United Kingdom                      107
UK                                  105
Scotland, United Kingdom             69
England, United Kingdom              51
Scotland                             44
England                              43
Little England                       42
France                               36
Europe                               29
Barcelona                            28
London, UK                           27
Brussels, Belgium                    25
North West, England                  22
Catalunya                            21
Leicester, England.                  19
日本 東京                                18
South East, England                  17
Glasgow                              16
Global                               16
United States                        16
Perivale, West London                16
Manchester, England                  16
Winchester UK, Houston USA           14


In [52]:
df_tweets['user.time_zone'].value_counts()

London                        1002
Pacific Time (US & Canada)     535
Amsterdam                      196
Madrid                         117
Casablanca                     100
Paris                           89
Edinburgh                       76
Europe/London                   72
Athens                          67
Dublin                          66
Greenland                       55
Eastern Time (US & Canada)      52
Berlin                          45
Hawaii                          34
Brussels                        33
Rome                            30
UTC                             29
Central Time (US & Canada)      20
Bern                            20
Tehran                          19
Lisbon                          15
Ljubljana                       14
Stockholm                       14
Tokyo                           13
Copenhagen                      10
Belgrade                         9
Atlantic Time (Canada)           8
Sydney                           8
Melbourne           

The information in the 'user.location' and 'user.time_zone' columns is not entirely reliable. Some values do not correspond to real places, and the same place appears in different formats (for example 'London', 'London, England' and 'London, UK'). These data therefore need to be cleaned.

In [53]:
from geopy import geocoders

ModuleNotFoundError: No module named 'geopy'
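
As a possible first pass at the cleaning step that needs no geocoding package, free-text locations can be matched against a small keyword table. This is a rough sketch, not the method used in this project: `KEYWORD_TO_COUNTRY` and `guess_country` are hypothetical names, and the table covers only a handful of the values seen in the counts above.

```python
import re

# Hypothetical, hand-written lookup; a real analysis would need a far larger
# table or a geocoding service.
KEYWORD_TO_COUNTRY = {
    "london": "United Kingdom",
    "england": "United Kingdom",
    "scotland": "United Kingdom",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "france": "France",
    "brussels": "Belgium",
    "belgium": "Belgium",
}


def guess_country(location):
    """Map a free-text profile location to a country, or None if unknown."""
    if not isinstance(location, str):
        return None  # missing values (None/NaN) stay unresolved
    # Lowercase and replace punctuation other than commas with spaces.
    text = re.sub(r"[^\w\s,]", " ", location.lower())
    # Check comma-separated parts from the last (usually most general) backwards.
    parts = [p.strip() for p in text.split(",") if p.strip()]
    for part in reversed(parts):
        if part in KEYWORD_TO_COUNTRY:
            return KEYWORD_TO_COUNTRY[part]
        for word in part.split():
            if word in KEYWORD_TO_COUNTRY:
                return KEYWORD_TO_COUNTRY[word]
    return None


for raw in ["London, England", "Leicester, England.", "Brussels, Belgium", "Earth"]:
    print(raw, "->", guess_country(raw))
```

Applied to the table built earlier, this would look like `df_tweets['user.location'].map(guess_country)`. Keyword matching is crude: a joke location such as 'Little England' is still counted as the United Kingdom, so the results would need spot checks.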