# Tweets processing
---

Here we load the tweets previously gathered and process them. Along the way, we flatten the Twitter JSON, select the texts, clean them, compute the sentiment and assign the location of the tweets.

In [1]:
import glob
import json

# list all files containing tweets
files = glob.glob('Twitter/Tweets/*.json')

tweets_data = []
for file in files:
    # each file holds one JSON-encoded tweet per line
    with open(file, 'r', encoding='utf-8') as tweets_file:
        # Read in tweets and store in list: tweets_data
        for line in tweets_file:
            tweet = json.loads(line)
            tweets_data.append(tweet)

In [2]:
print('There are', len(tweets_data), 'tweets in the dataset.') 

There are 52830 tweets in the dataset.
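
The loading loop above assumes that every line of every file is a valid JSON object. As a minimal sketch, assuming the files might also contain blank lines or truncated records (common with streamed data, though not confirmed for this dataset), a reader could skip such lines instead of failing. The helper name `read_tweets` is hypothetical.

```python
import json


def read_tweets(path):
    """Read one JSON object per line, skipping blank or malformed lines."""
    tweets = []
    skipped = 0
    with open(path, 'r', encoding='utf-8') as tweets_file:
        for line in tweets_file:
            line = line.strip()
            if not line:
                # blank line: nothing to parse
                continue
            try:
                tweets.append(json.loads(line))
            except json.JSONDecodeError:
                # truncated or otherwise invalid record
                skipped += 1
    return tweets, skipped
```

Returning the number of skipped lines makes it easy to check whether any data was silently lost.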


## Processing JSON
---

Several fields in the Twitter JSON contain textual data. A typical tweet has the tweet text, the user description, and the user location. A tweet longer than 140 characters also carries an ```extended_tweet``` child object with the full text. A quoted tweet contains both the original tweet text and the commentary added to it.

To analyze tweets at scale, we will want to __flatten__ the tweet JSON into a single level. This will allow us to store the tweets in a DataFrame format.

It makes sense to define a function that flattens a list of tweets loaded from JSON. Let's call this function ```flatten_tweets()```.

In [3]:
def flatten_tweets(tweets):
    ''' Flattens out tweet dictionaries so relevant JSON is 
        in a top-level dictionary. '''
    
    tweets_list = []
    
    # Iterate through each tweet
    for tweet_obj in tweets:
    
        # Store the user screen name in 'user-screen_name'
        tweet_obj['user-screen_name'] = tweet_obj['user']['screen_name']
        
        # Store the user location
        tweet_obj['user-location'] = tweet_obj['user']['location']
    
        # Check if this is a 140+ character tweet
        if 'extended_tweet' in tweet_obj:
            # Store the extended tweet text in 'extended_tweet-full_text'
            tweet_obj['extended_tweet-full_text'] = \
                                    tweet_obj['extended_tweet']['full_text']
    
        if 'retweeted_status' in tweet_obj:
            # Store the retweet user screen name in 
            # 'retweeted_status-user-screen_name'
            tweet_obj['retweeted_status-user-screen_name'] = \
                        tweet_obj['retweeted_status']['user']['screen_name']

            # Store the retweet text in 'retweeted_status-text'
            tweet_obj['retweeted_status-text'] = \
                                        tweet_obj['retweeted_status']['text']
    
            if 'extended_tweet' in tweet_obj['retweeted_status']:
                # Store the extended retweet text in 
                #'retweeted_status-extended_tweet-full_text'
                tweet_obj['retweeted_status-extended_tweet-full_text'] = \
                tweet_obj['retweeted_status']['extended_tweet']['full_text']
                
        if 'quoted_status' in tweet_obj:
            # Store the quoted tweet user screen name in 
            # 'quoted_status-user-screen_name'
            tweet_obj['quoted_status-user-screen_name'] = \
                            tweet_obj['quoted_status']['user']['screen_name']

            # Store the quoted tweet text in 'quoted_status-text'
            tweet_obj['quoted_status-text'] = \
                                            tweet_obj['quoted_status']['text']
    
            if 'extended_tweet' in tweet_obj['quoted_status']:
                # Store the extended quoted tweet text in 
                # 'quoted_status-extended_tweet-full_text'
                tweet_obj['quoted_status-extended_tweet-full_text'] = \
                    tweet_obj['quoted_status']['extended_tweet']['full_text']
        
        # 'place' is None when the tweet is not geo-tagged
        if tweet_obj.get('place') is not None:
            # Store the country code in 'place-country_code'
            tweet_obj['place-country_code'] = \
                                        tweet_obj['place'].get('country_code')
        
        tweets_list.append(tweet_obj)
        
    return tweets_list

Here, though, we are interested in a single text field. We therefore define a function that selects the full text, whether the tweet is an original tweet or a retweet.

We decide to drop the quoted text as it usually repeats itself.

In [4]:
def select_text(tweets):
    ''' Assigns the main text to a single 'text' field, preferring
        the full text of a retweet or of an extended tweet. '''
    
    tweets_list = []
    
    # Iterate through each tweet
    for tweet_obj in tweets:
        
        if 'retweeted_status-extended_tweet-full_text' in tweet_obj:
            tweet_obj['text'] = \
                        tweet_obj['retweeted_status-extended_tweet-full_text']
        
        elif 'retweeted_status-text' in tweet_obj:
            tweet_obj['text'] = tweet_obj['retweeted_status-text']
            
        elif 'extended_tweet-full_text' in tweet_obj:
            tweet_obj['text'] = tweet_obj['extended_tweet-full_text']
                
        tweets_list.append(tweet_obj)
        
    return tweets_list
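
The order of the branches in ```select_text()``` encodes a priority: the extended text of a retweet first, then the plain retweet text, then the extended text of the tweet itself, and finally the original ```text``` field. The same rule can be sketched as a lookup over an ordered list of keys. `TEXT_PRIORITY` and `pick_text` are illustrative names, not part of the notebook.

```python
# keys are tried from most to least preferred
TEXT_PRIORITY = [
    'retweeted_status-extended_tweet-full_text',
    'retweeted_status-text',
    'extended_tweet-full_text',
    'text',
]


def pick_text(tweet):
    """Return the value of the first priority key present in the tweet."""
    for key in TEXT_PRIORITY:
        if key in tweet:
            return tweet[key]
    return None


# hypothetical flattened retweet
retweet = {'text': 'RT @a: truncated', 'retweeted_status-text': 'full retweet'}
print(pick_text(retweet))  # full retweet
```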

We now build the data frame.

Notice that we choose the columns relevant for our analysis. This includes the language of the tweet, ```lang```, which we will retain although we will later translate the tweets to English.

We also keep ```user-location```, which the user sets manually, and ```place-country_code```, which is present only when the tweet is geo-tagged (we keep the country code rather than the coordinates, since we need the country and not the exact location).

In [5]:
import pandas as pd

# flatten tweets
tweets = flatten_tweets(tweets_data)
columns_all_text = ['text', 'extended_tweet-full_text', 'retweeted_status-text', 
           'retweeted_status-extended_tweet-full_text', 'quoted_status-text', 
           'quoted_status-extended_tweet-full_text', 'lang', 'user-location', 
           'place-country_code']

# select text
tweets = select_text(tweets)
columns = ['text', 'lang', 'user-location', 'place-country_code']

# Create a DataFrame from `tweets`
df_tweets = pd.DataFrame(tweets, columns=columns)
# replaces NaNs by Nones
df_tweets.where(pd.notnull(df_tweets), None, inplace=True)
#
df_tweets.head()

Unnamed: 0,text,lang,user-location,place-country_code
0,PLEASE IF Y'ALL COULD SHARE I'D REALLY APPRECI...,en,,
1,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",nl,USA,
2,FIFA 21 стала самой дорогой игрой в PSN. Она с...,ru,Moscow,
3,➸ New Montage #FIFA20\n➸ Position : R\LB\n➸ ¦ ...,en,,
4,سحب على FIFA21 او قيمتها 60$ 🔥\nالشروط بسيطه:\...,ar,,


In [6]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52830 entries, 0 to 52829
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   text                52830 non-null  object
 1   lang                52830 non-null  object
 2   user-location       29888 non-null  object
 3   place-country_code  485 non-null    object
dtypes: object(4)
memory usage: 1.6+ MB


__++++++++++++++++++++++++++++++++++++++++  Take just a sample:__

In [7]:
df_tweets_sample = df_tweets[:50].copy()

__++++++++++++++++++++++++++++++++++++++++__

## Languages
---

Next, we replace the language codes in ```lang``` with the actual language names. We do this with the auxiliary ```languages.json``` file.

In [8]:
import json

with open('Countries/languages.json', 'r', encoding='utf-8') as json_file:
    languages_dict = json.load(json_file)

{k: languages_dict[k] for k in list(languages_dict)[:5]}

{'aa': {'name': 'Afar', 'native': 'Afar'},
 'ab': {'name': 'Abkhazian', 'native': 'Аҧсуа'},
 'af': {'name': 'Afrikaans', 'native': 'Afrikaans'},
 'ak': {'name': 'Akan', 'native': 'Akana'},
 'am': {'name': 'Amharic', 'native': 'አማርኛ'}}

In [9]:
names = []
for idx, row in df_tweets_sample.iterrows():
    lang = row['lang']
    if lang == 'und':
        # 'und': the language could not be determined
        names.append(None)
    elif lang == 'in':
        # 'in' is the legacy code for Indonesian ('id')
        name = languages_dict['id']['name']
        names.append(name)
    elif lang == 'iw':
        # 'iw' is the legacy code for Hebrew ('he')
        name = languages_dict['he']['name']
        names.append(name)
    else:
        name = languages_dict[lang]['name']
        names.append(name)

df_tweets_sample['language'] = names
df_tweets_sample.drop(['lang'], axis=1, inplace=True)
#
df_tweets_sample.head(10)

Unnamed: 0,text,user-location,place-country_code,language
0,PLEASE IF Y'ALL COULD SHARE I'D REALLY APPRECI...,,,English
1,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",USA,,Dutch
2,FIFA 21 стала самой дорогой игрой в PSN. Она с...,Moscow,,Russian
3,➸ New Montage #FIFA20\n➸ Position : R\LB\n➸ ¦ ...,,,English
4,سحب على FIFA21 او قيمتها 60$ 🔥\nالشروط بسيطه:\...,,,Arabic
5,#FIFA20\n#プロクラブ\n本日21:30〜活動を予定してます。\nST（ポストプレイ...,滋賀,,Japanese
6,- سحب على #FIFA21 او #TLOU2 حسب اختيار الفائز ...,"Amman , Jordan",,Arabic
7,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",USA,,Dutch
8,La de partidos que tiro en #FIFA20.,"Barcelona, España",,Spanish
9,今考えてるのは\nFIFA21(9月頃)、パワプロ、AOテニス\n(スポーツばっかやん),Osaka,,Japanese
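
The language lookup special-cases three values of ```lang```: ```und``` (undetermined) and the codes ```in``` and ```iw```, which are stored in the dictionary under ```id``` and ```he```. A compact variant is sketched below, under the assumption that any code missing from the dictionary should map to ```None``` rather than raise a ```KeyError```. `LEGACY_CODES`, `language_name` and the three-entry `languages_sample` are illustrative stand-ins for the real ```languages_dict```.

```python
# codes that appear in the tweets but are stored under another key
LEGACY_CODES = {'in': 'id', 'iw': 'he'}

# tiny stand-in for the dictionary loaded from languages.json
languages_sample = {
    'en': {'name': 'English', 'native': 'English'},
    'id': {'name': 'Indonesian', 'native': 'Bahasa Indonesia'},
    'he': {'name': 'Hebrew', 'native': 'עברית'},
}


def language_name(code, languages):
    """Map a language code to its name; unknown codes (e.g. 'und') give None."""
    code = LEGACY_CODES.get(code, code)
    entry = languages.get(code)
    return entry['name'] if entry is not None else None


print([language_name(c, languages_sample) for c in ['en', 'in', 'iw', 'und']])
# ['English', 'Indonesian', 'Hebrew', None]
```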


## Locations
---

### place-country_code

The data in the ```place``` object is more reliable than ```user-location```, since it comes from the tweet's geo-tag rather than from free text set by the user.

The country code in ```place-country_code``` comes in ISO 2 form, which we convert to ISO 3 form with ```country_converter```.

In [10]:
import country_converter as coco

# change codes to iso3 
to_iso3_func = lambda x: coco.convert(names=x, to='iso3', not_found=None) \
                    if x is not None else x

df_tweets_sample['place-country_code'] = \
                   df_tweets_sample['place-country_code'].apply(to_iso3_func)

### user-location

Here we take the manually set ```user-location``` values and translate them to country codes. We do this using ```geopy.geocoders``` and ```country_converter``` to find the country codes in ISO 3 form.

In [11]:
from geopy.geocoders import Nominatim
from tqdm import tqdm

tqdm.pandas()

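def geo_locator(user_location):
    ''' Returns the ISO 2 country code of a free-text location,
        or None when the location cannot be resolved. '''
    
    # initialize geolocator
    geolocator = Nominatim(user_agent='Tweet_locator')

    if user_location is not None:
        try:
            # geocode the free-text location
            location = geolocator.geocode(user_location, language='en')
            # reverse-geocode the coordinates to get a structured address
            location_exact = geolocator.reverse(
                        [location.latitude, location.longitude], language='en')
            # get the country code
            c_code = location_exact.raw['address']['country_code']

            return c_code

        except Exception:
            # no match, no country in the address, or a failed request
            return None

    else:
        return None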

# apply geo locator to user-location
loc = df_tweets_sample['user-location'].progress_apply(geo_locator)
df_tweets_sample['user_location'] = loc

# change codes to iso3 
df_tweets_sample['user_location'] = \
                        df_tweets_sample['user_location'].apply(to_iso3_func)

# drop old column
df_tweets_sample.drop(['user-location'], axis=1, inplace=True)

#
df_tweets_sample.head(10)

100%|██████████| 50/50 [00:20<00:00,  2.46it/s]


Unnamed: 0,text,place-country_code,language,user_location
0,PLEASE IF Y'ALL COULD SHARE I'D REALLY APPRECI...,,English,
1,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",,Dutch,USA
2,FIFA 21 стала самой дорогой игрой в PSN. Она с...,,Russian,RUS
3,➸ New Montage #FIFA20\n➸ Position : R\LB\n➸ ¦ ...,,English,
4,سحب على FIFA21 او قيمتها 60$ 🔥\nالشروط بسيطه:\...,,Arabic,
5,#FIFA20\n#プロクラブ\n本日21:30〜活動を予定してます。\nST（ポストプレイ...,,Japanese,CHN
6,- سحب على #FIFA21 او #TLOU2 حسب اختيار الفائز ...,,Arabic,JOR
7,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",,Dutch,USA
8,La de partidos que tiro en #FIFA20.,,Spanish,ESP
9,今考えてるのは\nFIFA21(9月頃)、パワプロ、AOテニス\n(スポーツばっかやん),,Japanese,JPN
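
Each call to ```geo_locator()``` issues two requests to the geocoding service, and the same location string can occur many times (rows 1 and 7 above both read ```USA```). A minimal sketch of caching the results with ```functools.lru_cache``` follows. The function `slow_lookup` is a hypothetical stand-in for ```geo_locator()```, with made-up results, so that the example runs without network access.

```python
from functools import lru_cache

calls = []  # records every lookup that actually reaches the "service"


def slow_lookup(user_location):
    """Hypothetical stand-in for geo_locator()."""
    calls.append(user_location)
    fake_results = {'USA': 'us', 'Moscow': 'ru'}
    return fake_results.get(user_location)


@lru_cache(maxsize=None)
def cached_lookup(user_location):
    # identical strings are resolved once, then served from the cache
    return slow_lookup(user_location)


locations = ['USA', 'Moscow', 'USA', None, 'USA']
codes = [cached_lookup(loc) for loc in locations]
print(codes)       # ['us', 'ru', 'us', None, 'us']
print(len(calls))  # 3
```

One caveat of this design: the cache also stores ```None``` results, so a lookup that failed because of a temporary network error would not be retried within the same session.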


Finally, we merge the ```place-country_code``` and ```user_location``` columns into one, keeping the former when it exists and falling back to the latter otherwise.

In [12]:
codes = []
for idx, row in df_tweets_sample.iterrows():
    if row['place-country_code'] is None:
        code = row['user_location']
        codes.append(code)
    else :
        codes.append(row['place-country_code'])

df_tweets_sample['location'] = codes
df_tweets_sample.drop(columns=['place-country_code', 'user_location'], 
                      inplace=True)
df_tweets_sample.head(10)

Unnamed: 0,text,language,location
0,PLEASE IF Y'ALL COULD SHARE I'D REALLY APPRECI...,English,
1,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",Dutch,USA
2,FIFA 21 стала самой дорогой игрой в PSN. Она с...,Russian,RUS
3,➸ New Montage #FIFA20\n➸ Position : R\LB\n➸ ¦ ...,English,
4,سحب على FIFA21 او قيمتها 60$ 🔥\nالشروط بسيطه:\...,Arabic,
5,#FIFA20\n#プロクラブ\n本日21:30〜活動を予定してます。\nST（ポストプレイ...,Japanese,CHN
6,- سحب على #FIFA21 او #TLOU2 حسب اختيار الفائز ...,Arabic,JOR
7,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",Dutch,USA
8,La de partidos que tiro en #FIFA20.,Spanish,ESP
9,今考えてるのは\nFIFA21(9月頃)、パワプロ、AOテニス\n(スポーツばっかやん),Japanese,JPN
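
The merge rule is a "first non-missing value" choice between two columns. Stripped of the DataFrame, it can be sketched as follows; `first_not_none` and the three sample rows are illustrative only.

```python
def first_not_none(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


# hypothetical rows: geo-tag missing, geo-tag present, both missing
rows = [
    {'place-country_code': None, 'user_location': 'USA'},
    {'place-country_code': 'ESP', 'user_location': 'FRA'},
    {'place-country_code': None, 'user_location': None},
]
locations = [first_not_none(r['place-country_code'], r['user_location'])
             for r in rows]
print(locations)  # ['USA', 'ESP', None]
```

pandas also provides ```Series.combine_first```, which fills the missing values of one column from another and could replace the explicit loop.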


## Sentiment
---

It is now time to process the tweets' text. 

First we lemmatize the text and remove non-alphabetic tokens using [spaCy](https://spacy.io). This should improve the translation of the tweets and the accuracy of their sentiment scores.

In [13]:
import spacy

nlp = spacy.load('en_core_web_sm')

def cleaner(string):
    
    # Generate list of tokens
    doc = nlp(string)
    lemmas = [token.lemma_ for token in doc]
    # Remove tokens that are not alphabetic 
    a_lemmas = [lemma for lemma in lemmas 
                                    if lemma.isalpha() or lemma == '-PRON-'] 
    # Return the cleaned string
    return ' '.join(a_lemmas)

df_tweets_sample['text-cleaned'] = \
                            df_tweets_sample['text'].progress_apply(cleaner)
#
df_tweets_sample.head(10)

100%|██████████| 50/50 [00:00<00:00, 96.46it/s] 


Unnamed: 0,text,language,location,text-cleaned
0,PLEASE IF Y'ALL COULD SHARE I'D REALLY APPRECI...,English,,PLEASE if COULD share really APPRECIATE it pla...
1,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",Dutch,USA,FIFA TOTW prediction De Bruyne Lewandowski amp...
2,FIFA 21 стала самой дорогой игрой в PSN. Она с...,Russian,RUS,FIFA стала самой дорогой игрой в PSN Она стоит...
3,➸ New Montage #FIFA20\n➸ Position : R\LB\n➸ ¦ ...,English,,New Montage position designer ME Enjoy to watch
4,سحب على FIFA21 او قيمتها 60$ 🔥\nالشروط بسيطه:\...,Arabic,,سحب على او قيمتها الشروط بسيطه تابعني تابع رتو...
5,#FIFA20\n#プロクラブ\n本日21:30〜活動を予定してます。\nST（ポストプレイ...,Japanese,CHN,プロクラブ その他にも可能なポジションがありましたら併せてご連絡ください よろしくお願い致します
6,- سحب على #FIFA21 او #TLOU2 حسب اختيار الفائز ...,Arabic,JOR,سحب على او حسب اختيار الفائز تابعني و تابع ورت...
7,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",Dutch,USA,FIFA TOTW prediction De Bruyne Lewandowski amp...
8,La de partidos que tiro en #FIFA20.,Spanish,ESP,La de partidos que tiro en
9,今考えてるのは\nFIFA21(9月頃)、パワプロ、AOテニス\n(スポーツばっかやん),Japanese,JPN,今考えてるのは スポーツばっかやん
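
The filter in ```cleaner()``` relies on ```str.isalpha()```, which accepts letters from any script and rejects tokens that contain digits, punctuation or symbols. This is why the Russian, Arabic and Japanese words survive while tokens such as ```FIFA20``` disappear. The sketch below shows the filter alone, on whitespace-separated tokens; it does not reproduce spaCy's tokenization or lemmatization, and `keep_alphabetic` is an illustrative name.

```python
def keep_alphabetic(text):
    """Keep only whitespace-separated tokens made entirely of letters."""
    tokens = text.split()
    return ' '.join(token for token in tokens if token.isalpha())


print(keep_alphabetic('FIFA 21 стала самой дорогой игрой в PSN'))
# FIFA стала самой дорогой игрой в PSN
print(keep_alphabetic('La de partidos que tiro en #FIFA20.'))
# La de partidos que tiro en
```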


We now use ```googletrans``` to translate the cleaned tweets.

In [14]:
from googletrans import Translator

translator = Translator()

trans = df_tweets_sample['text-cleaned'].progress_apply(
                                            translator.translate, dest='en')
df_tweets_sample['text_english'] = trans.apply(lambda x: x.text)
#
df_tweets_sample.head(10)

100%|██████████| 50/50 [00:05<00:00,  9.84it/s]


Unnamed: 0,text,language,location,text-cleaned,text_english
0,PLEASE IF Y'ALL COULD SHARE I'D REALLY APPRECI...,English,,PLEASE if COULD share really APPRECIATE it pla...,PLEASE if COULD share really APPRECIATE it pla...
1,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",Dutch,USA,FIFA TOTW prediction De Bruyne Lewandowski amp...,FIFA TOTW prediction De Bruyne Lewandowski amp...
2,FIFA 21 стала самой дорогой игрой в PSN. Она с...,Russian,RUS,FIFA стала самой дорогой игрой в PSN Она стоит...,FIFA has become the most expensive game on PSN...
3,➸ New Montage #FIFA20\n➸ Position : R\LB\n➸ ¦ ...,English,,New Montage position designer ME Enjoy to watch,New Montage position designer ME Enjoy to watch
4,سحب على FIFA21 او قيمتها 60$ 🔥\nالشروط بسيطه:\...,Arabic,,سحب على او قيمتها الشروط بسيطه تابعني تابع رتو...,Pull on the value or conditions continued Rthu...
5,#FIFA20\n#プロクラブ\n本日21:30〜活動を予定してます。\nST（ポストプレイ...,Japanese,CHN,プロクラブ その他にも可能なポジションがありましたら併せてご連絡ください よろしくお願い致します,Together If you have any professional club Oth...
6,- سحب على #FIFA21 او #TLOU2 حسب اختيار الفائز ...,Arabic,JOR,سحب على او حسب اختيار الفائز تابعني و تابع ورت...,Pull on or by choosing the winner and continue...
7,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",Dutch,USA,FIFA TOTW prediction De Bruyne Lewandowski amp...,FIFA TOTW prediction De Bruyne Lewandowski amp...
8,La de partidos que tiro en #FIFA20.,Spanish,ESP,La de partidos que tiro en,The party that shot
9,今考えてるのは\nFIFA21(9月頃)、パワプロ、AOテニス\n(スポーツばっかやん),Japanese,JPN,今考えてるのは スポーツばっかやん,Yan Bakka sports are thinking now
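
Tweets that are already in English do not need to be sent to the translation service, and a remote call can fail. The sketch below wraps an arbitrary translation function so that English text is passed through unchanged and failures fall back to ```None```. `translate_fn` is a hypothetical callable standing in for something like ```lambda t: translator.translate(t, dest='en').text```; `fake_translate` exists only so that the example runs offline.

```python
def to_english(text, language, translate_fn):
    """Translate text unless it is empty or already labelled English."""
    if language == 'English' or not text:
        return text
    try:
        return translate_fn(text)
    except Exception:
        # the translation call failed: keep the row, mark the text missing
        return None


def fake_translate(text):
    """Offline stand-in for a translation call; fails on unknown input."""
    lookup = {'La de partidos que tiro en': 'The party that shot'}
    return lookup[text]


print(to_english('New Montage', 'English', fake_translate))
# New Montage
print(to_english('La de partidos que tiro en', 'Spanish', fake_translate))
# The party that shot
print(to_english('unknown input', 'Spanish', fake_translate))
# None
```

Skipping on the language label is only as good as the label itself: rows 1 and 7 above are tagged as Dutch although their text is English, so they would still be sent for translation.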


We finally apply the ```SentimentIntensityAnalyzer``` object from ```nltk.sentiment.vader``` to the translated tweets. We keep the ```compound``` score, which ranges from -1 (most negative) to 1 (most positive).

In [15]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# instantiate new SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

sentiment_scores = df_tweets_sample['text_english'].progress_apply(
                                                            sid.polarity_scores)
sentiment = sentiment_scores.apply(lambda x: x['compound'])
df_tweets_sample['sentiment'] = sentiment
#
df_tweets_sample.head(10)

100%|██████████| 50/50 [00:00<00:00, 3644.31it/s]


Unnamed: 0,text,language,location,text-cleaned,text_english,sentiment
0,PLEASE IF Y'ALL COULD SHARE I'D REALLY APPRECI...,English,,PLEASE if COULD share really APPRECIATE it pla...,PLEASE if COULD share really APPRECIATE it pla...,0.9263
1,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",Dutch,USA,FIFA TOTW prediction De Bruyne Lewandowski amp...,FIFA TOTW prediction De Bruyne Lewandowski amp...,0.0
2,FIFA 21 стала самой дорогой игрой в PSN. Она с...,Russian,RUS,FIFA стала самой дорогой игрой в PSN Она стоит...,FIFA has become the most expensive game on PSN...,0.5267
3,➸ New Montage #FIFA20\n➸ Position : R\LB\n➸ ¦ ...,English,,New Montage position designer ME Enjoy to watch,New Montage position designer ME Enjoy to watch,0.4939
4,سحب على FIFA21 او قيمتها 60$ 🔥\nالشروط بسيطه:\...,Arabic,,سحب على او قيمتها الشروط بسيطه تابعني تابع رتو...,Pull on the value or conditions continued Rthu...,0.6597
5,#FIFA20\n#プロクラブ\n本日21:30〜活動を予定してます。\nST（ポストプレイ...,Japanese,CHN,プロクラブ その他にも可能なポジションがありましたら併せてご連絡ください よろしくお願い致します,Together If you have any professional club Oth...,0.5859
6,- سحب على #FIFA21 او #TLOU2 حسب اختيار الفائز ...,Arabic,JOR,سحب على او حسب اختيار الفائز تابعني و تابع ورت...,Pull on or by choosing the winner and continue...,0.5859
7,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",Dutch,USA,FIFA TOTW prediction De Bruyne Lewandowski amp...,FIFA TOTW prediction De Bruyne Lewandowski amp...,0.0
8,La de partidos que tiro en #FIFA20.,Spanish,ESP,La de partidos que tiro en,The party that shot,0.4019
9,今考えてるのは\nFIFA21(9月頃)、パワプロ、AOテニス\n(スポーツばっかやん),Japanese,JPN,今考えてるのは スポーツばっかやん,Yan Bakka sports are thinking now,0.0
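
The scores in the ```sentiment``` column can be turned into discrete labels. A commonly used convention for VADER, which this notebook does not apply itself, treats a compound score of at least 0.05 as positive, at most -0.05 as negative, and anything in between as neutral. `label_sentiment` is an illustrative helper.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


scores = [0.9263, 0.0, 0.5267, 0.4019]  # taken from the table above
print([label_sentiment(s) for s in scores])
# ['positive', 'neutral', 'positive', 'positive']
```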


In [16]:
df_tweets_sample.drop(columns=['text-cleaned'], inplace=True)
#
cols_order = ['text', 'text_english', 'sentiment', 'language', 'location']
df_tweets_sample = df_tweets_sample[cols_order]
#
df_tweets_sample.head(10)

Unnamed: 0,text,text_english,sentiment,language,location
0,PLEASE IF Y'ALL COULD SHARE I'D REALLY APPRECI...,PLEASE if COULD share really APPRECIATE it pla...,0.9263,English,
1,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",FIFA TOTW prediction De Bruyne Lewandowski amp...,0.0,Dutch,USA
2,FIFA 21 стала самой дорогой игрой в PSN. Она с...,FIFA has become the most expensive game on PSN...,0.5267,Russian,RUS
3,➸ New Montage #FIFA20\n➸ Position : R\LB\n➸ ¦ ...,New Montage position designer ME Enjoy to watch,0.4939,English,
4,سحب على FIFA21 او قيمتها 60$ 🔥\nالشروط بسيطه:\...,Pull on the value or conditions continued Rthu...,0.6597,Arabic,
5,#FIFA20\n#プロクラブ\n本日21:30〜活動を予定してます。\nST（ポストプレイ...,Together If you have any professional club Oth...,0.5859,Japanese,CHN
6,- سحب على #FIFA21 او #TLOU2 حسب اختيار الفائز ...,Pull on or by choosing the winner and continue...,0.5859,Arabic,JOR
7,"FIFA 20 TOTW 27 Prediction – De Bruyne, Lewand...",FIFA TOTW prediction De Bruyne Lewandowski amp...,0.0,Dutch,USA
8,La de partidos que tiro en #FIFA20.,The party that shot,0.4019,Spanish,ESP
9,今考えてるのは\nFIFA21(9月頃)、パワプロ、AOテニス\n(スポーツばっかやん),Yan Bakka sports are thinking now,0.0,Japanese,JPN


In [17]:
df_tweets_sample.to_csv('Twitter/Tweets_cleaned.csv')