# Concatenate tweets

In this notebook, we consolidate the raw tweet files into compressed pickle files for further analysis. We have three main sets of data to store:

1. Data from January 2021.
2. Data from October 2019.
3. Data from April 28 to June 30, 2021.

Each sample corresponds to a moment relevant to our analysis. The October 2019 data lets us analyze our community during an election period, specifically the Colombian regional elections held that month. The January 2021 data covers a period three months before the "Paro Nacional," allowing us to track our community before the social outbreak. Finally, the data from the "Paro Nacional" itself is the focal point of our analysis.

In [1]:
import os
import pandas as pd
import numpy as np
from glob import glob
from tqdm import tqdm

In [2]:
path = r"/mnt/disk2/Data"
pd.set_option("display.max_columns", None)

## Regional elections: October 2019

In [None]:
# We create an empty aux list that will store the tweets.
tweets_aux = []
files_oct = glob(os.path.join(path, 'RawData','users_oct_19/*.csv'))
empties = []
for file in tqdm(files_oct):
    df = pd.read_csv(file, dtype = {'Author ID': int, 'Referenced Tweet Author ID': int})
    if df.empty:
        empties.append(file)
    else:
        tweets_aux.append(df)

# Finally, the tweet dataframe is established and tweets_aux is deleted.
tweets = pd.concat(tweets_aux)
del tweets_aux
tweets = tweets.sort_values('ID').reset_index(drop = True)

# Store results
tweets.to_pickle(os.path.join(path, "Tweets_DataFrames/tweets_oct19.gzip"), compression = "gzip")

## Before Paro Nacional: January 2021

We identified two users whose files are corrupted: Usuario_82383620 and Usuario_2526574133.
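
One way to spot files like these is to try parsing every CSV and record the ones that raise. The sketch below is only an illustration of that idea: the helper name `find_unreadable_csvs` and the throwaway files are hypothetical, and it is not necessarily how the two users above were found.

```python
import os
import tempfile
from glob import glob

import pandas as pd


def find_unreadable_csvs(folder):
    """Return (readable, unreadable) lists of CSV paths found in `folder`."""
    readable, unreadable = [], []
    for file in sorted(glob(os.path.join(folder, "*.csv"))):
        try:
            pd.read_csv(file)
        except (pd.errors.ParserError, pd.errors.EmptyDataError, UnicodeDecodeError):
            # Malformed rows, zero-byte files or undecodable bytes.
            unreadable.append(file)
        else:
            readable.append(file)
    return readable, unreadable


# Small demonstration on throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.csv"), "w") as f:
        f.write("ID,Text\n1,hola\n")
    with open(os.path.join(tmp, "ragged.csv"), "w") as f:
        f.write("ID,Text\n1,hola\n2,chao,extra\n")  # third row has an extra field
    with open(os.path.join(tmp, "zero.csv"), "w") as f:
        pass  # zero-byte file
    readable, unreadable = find_unreadable_csvs(tmp)
    readable = [os.path.basename(p) for p in readable]
    unreadable = [os.path.basename(p) for p in unreadable]

print(readable, unreadable)
```

A file that parses without error can still contain wrong values, so this check only catches structural problems.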

In [None]:
# We create an empty aux list that will store the tweets.
tweets_aux = []
files_jan = glob(os.path.join(path, "RawData", "users_jan/*.csv"))

for file in tqdm(files_jan):
    df = pd.read_csv(file)
    if df.empty:
        empties.append(file)
    else:
        tweets_aux.append(df)

# Finally, the tweet dataframe is established and tweets_aux is deleted.  
tweets_jan = pd.concat(tweets_aux)
del tweets_aux
tweets_jan = tweets_jan.sort_values('ID').reset_index(drop = True)

# Store results
# run sudo chmod 777 /mnt/disk2/Data/Tweets_DataFrames in bash if it is needed
tweets_jan.to_pickle(os.path.join(path, "Tweets_DataFrames/tweets_jan21.gzip"), compression = "gzip")

## Paro Nacional: April 28 - June 30 2021

In [None]:
files_v1 = glob(os.path.join(path, 'RawData/Usuarios_V1/*.csv'))
len(files_v1)

In [None]:
def unique_to_string(x):
    unique_values = x.unique()
    return ', '.join(map(str, unique_values))

user_information = df.groupby(['Author ID', 'Author Name']).agg({
                'Author Location': unique_to_string,
                'Author Description': unique_to_string,
                'Author Followers': lambda x: np.nanmean(x),
                'Author Following': lambda x: np.nanmean(x),
                'Author Tweets': lambda x: np.nanmax(x),
                'Author Verified': unique_to_string})

In [None]:
user_information

In [None]:
df_list = []
users_information = []

# cols = ['ID', 'Author ID', 'Author Name', 'Date', 'Text', 'Replies', 'Retweets', 'Favorites', 'Quotes', 'is Retweet?',
#            'Reply To User Name', 'Mentions', 'Referenced Tweet', 'Reference Type', 'Referenced Tweet Author ID']

problems = []

def unique_to_string(x):
    unique_values = x.unique()
    return ', '.join(map(str, unique_values))

# Counters used for checkpointing
count = 0 # Number of tweets accumulated since the last checkpoint
n = 0 # Number of checkpoints saved so far

# Runtime: about 1 hour
for file in tqdm(files_v1):
    try:
        # df = pd.read_csv(file, usecols = cols)
        df = pd.read_csv(file)
        if df.empty:
            # Record the empty file and skip it, as in the previous sections
            empties.append(file)
            continue
        # Fix some datatypes
        df[['Author Followers', 'Author Following', 'Author Tweets']] = df[['Author Followers', 'Author Following', 'Author Tweets']].map(lambda x: pd.to_numeric(x, errors = 'coerce'))
        df_list.append(df)
        count += len(df)

        # Save user information
        user_information = df.groupby(['Author ID', 'Author Name']).agg({
                'Author Location': unique_to_string,
                'Author Description': unique_to_string,
                'Author Followers': lambda x: np.nanmean(x),
                'Author Following': lambda x: np.nanmean(x),
                'Author Tweets': lambda x: np.nanmax(x),
                'Author Verified': unique_to_string})
        
        users_information.append(user_information)
        
        # If we reach or exceed 7 million rows, save the file and reset
        if count >= 7_000_000:
            n += 1
            concat_df = pd.concat(df_list)
            output_filename = f"tweets_paro_{n}.gzip"
            concat_df.to_pickle(os.path.join(path, f"Tweets_DataFrames/{output_filename}"), compression='gzip')
            
            # Reset counter and list
            count = 0
            df_list = []
            
    except (ValueError, KeyError) as e:
        problems.append(file)

problems
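
The checkpointing logic in the loop above can be isolated into a small generator, which makes it easier to test on toy data. This is a minimal sketch under our own naming (`iter_checkpoints` and `max_rows` are hypothetical, not part of the notebook), and it leaves the writing to disk to the caller.

```python
import pandas as pd


def iter_checkpoints(frames, max_rows):
    """Yield one concatenated DataFrame each time at least `max_rows` rows accumulate."""
    buffer, count = [], 0
    for df in frames:
        buffer.append(df)
        count += len(df)
        if count >= max_rows:
            yield pd.concat(buffer)
            buffer, count = [], 0
    if buffer:
        # Flush whatever is left after the last full checkpoint.
        yield pd.concat(buffer)


# Toy input: three frames of three rows each, checkpoint every 5 rows.
frames = [pd.DataFrame({"ID": range(i, i + 3)}) for i in (0, 3, 6)]
checkpoint_sizes = [len(chunk) for chunk in iter_checkpoints(frames, max_rows=5)]
print(checkpoint_sizes)  # [6, 3]
```

Each yielded chunk could then be written with `to_pickle(..., compression="gzip")` under a numbered file name, as the loop above does.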

In [None]:
empties

In [None]:
# Save any remaining data after the loop
# Runtime 5 minutes
if df_list:
    n += 1
    concat_df = pd.concat(df_list)
    output_filename = f"tweets_paro_{n}.gzip"
    # If necessary, run "sudo chmod 777 Data/Tweets_DataFrames" in bash
    concat_df.to_pickle(os.path.join(path, f"Tweets_DataFrames/{output_filename}"), compression = 'gzip')

del df_list, concat_df

In [None]:
# runtime 22 minutes
concat_users_information = pd.concat(users_information)
concat_users_information = concat_users_information.groupby(['Author ID', 'Author Name']) \
    .agg({'Author Location': unique_to_string,
            'Author Description': unique_to_string,
            'Author Followers': lambda x: np.nanmean(x),
            'Author Following': lambda x: np.nanmean(x),
            'Author Tweets': lambda x: np.nanmax(x),
            'Author Verified': unique_to_string})
concat_users_information.to_pickle(os.path.join(path, "Tweets_DataFrames/users_information.gzip"), 
                                   compression = 'gzip')

In [None]:
# We should correct this
concat_users_information

In [3]:
tweets_paro = glob('/mnt/disk2/Data/Tweets_DataFrames/tweets_paro_*')

# Read every checkpoint first and concatenate once; concatenating inside the
# loop copies the accumulated frame on every iteration.
tweets = pd.concat(
    [pd.read_pickle(file, compression = "gzip") for file in tqdm(tweets_paro)],
    axis = 0)
    
# Label tweets that don't reference any other tweet as original tweets
tweets["Reference Type"] = tweets["Reference Type"].fillna("original tweet")

# Quoted tweets and mentions aren't used (filter currently disabled)
#tweets = tweets[(tweets["Reference Type"] == 'original tweet') | (tweets["Reference Type"] == 'retweeted')]

# Drop rows with no known author
tweets.dropna(subset='Author ID', inplace=True)
tweets.head()

100%|██████████| 7/7 [03:20<00:00, 28.60s/it]


Unnamed: 0,ID,Permalink,Author ID,Author Name,Author Location,Author Description,Author Followers,Author Following,Author Tweets,Author Profile Image,Author Verified,Date,Text,Replies,Retweets,Favorites,Quotes,is Retweet?,Reply To User Name,Mentions,Referenced Tweet,Reference Type,Referenced Tweet Author ID,Media URLs,Media Keys
0,1.409619e+18,/hmauriciojg/status/1409618955283668996,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/28 16:05:23,@DanielSamperO A vida hp!!. @IvanDuque fue y s...,0.0,0.0,0.0,0.0,False,DanielSamperO,DanielSamperO IvanDuque petrogustavo,1.409586e+18,replied_to,134855300.0,,
1,1.409575e+18,/hmauriciojg/status/1409574993596452867,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/28 13:10:41,@alejarojas_g A bueno de pronto si @petrogusta...,0.0,0.0,0.0,0.0,False,alejarojas_g,alejarojas_g petrogustavo,1.409192e+18,replied_to,1131821000.0,,
2,1.409302e+18,/hmauriciojg/status/1409302180847292417,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/27 19:06:38,@gabodelascasas Ahí la tiene https://t.co/2WJZ...,0.0,0.0,0.0,0.0,False,gabodelascasas,gabodelascasas,1.409298e+18,replied_to,62337500.0,https://pbs.twimg.com/media/E47Y3H4XMAMtHHu.jpg,3_1409302174933397507
3,1.407446e+18,/hmauriciojg/status/1407446306113691662,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/22 16:12:03,@JOHANVE_LAND Deberías hacerle esa pregunta ta...,0.0,0.0,0.0,0.0,False,JOHANVE_LAND,JOHANVE_LAND petrogustavo,1.407171e+18,replied_to,576647400.0,,
4,1.407176e+18,/hmauriciojg/status/1407176029635067904,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/21 22:18:04,@ClaraLopezObre @petrogustavo Que susto tan hp...,0.0,0.0,0.0,0.0,False,ClaraLopezObre,ClaraLopezObre petrogustavo,1.40675e+18,replied_to,126832600.0,,


In [4]:
tweets['Reference Type'].value_counts(dropna=False)

Reference Type
retweeted         30918011
replied_to         7847155
original tweet     4543242
quoted             2022201
Name: count, dtype: int64

In [6]:
tweets['Referenced Tweet'][tweets['Reference Type'] == 'retweeted'].isin(tweets['ID'][tweets['Reference Type'] == 'replied_to']).sum()

860561

In [9]:
original_retweet_id = set(tweets['Referenced Tweet'][tweets['Reference Type'] == 'retweeted'])
len(original_retweet_id)


5912692

In [10]:
original_tweets_id = set(tweets['ID'][tweets['Reference Type'] == 'original tweet'])
len(original_tweets_id)

4542660

In [11]:
len(original_retweet_id - original_tweets_id)

5101213

In [12]:
original_retweet_id = set(tweets['Referenced Tweet'][tweets['Reference Type'] == 'retweeted'])
original_replies_id = set(tweets['ID'][tweets['Reference Type'] == 'replied_to'])
original_quoted_id = set(tweets['ID'][tweets['Reference Type'] == 'quoted'])

In [None]:
original_replies_id - original_quoted_id - original_tweets_id

In [14]:
print(f"{len(original_retweet_id - (original_replies_id.union(original_quoted_id).union(original_tweets_id))):,}")

4,509,953


In [9]:
len(tweets['Referenced Tweet'].unique())

5912693

In [10]:
tweets[tweets['ID'] == 1391113172245831680]

Unnamed: 0,ID,Permalink,Author ID,Author Name,Author Location,Author Description,Author Followers,Author Following,Author Tweets,Author Profile Image,Author Verified,Date,Text,Replies,Retweets,Favorites,Quotes,is Retweet?,Reply To User Name,Mentions,Referenced Tweet,Reference Type,Referenced Tweet Author ID,Media URLs,Media Keys
2448,1.391113e+18,/radio1040am/status/1391113172245831691,2434157000.0,radio1040am,Popayán Colombia,Emisora de Red Sonora Radio. Pasión por el Cau...,6054.0,239.0,24269.0,https://pbs.twimg.com/profile_images/821124947...,False,2021/05/08 14:30:00,#Noticias1040 \nEl Fiscal General y el Defenso...,0.0,0.0,0.0,0.0,False,,,,original tweet,,,
1184,1.391113e+18,/nuevodiaibague/status/1391113172245831692,61925350.0,nuevodiaibague,Ibague - Colombia,El periódico de los tolimenses.\n#Tolima #Ibagué,53192.0,1829.0,234956.0,https://pbs.twimg.com/profile_images/144258337...,False,2021/05/08 14:30:00,👉 El emprendimiento se convirtió en una altern...,0.0,0.0,0.0,0.0,False,,,,original tweet,,,


In [8]:
# Uh-oh, what should we do here? Two different tweets compare equal on 'ID'.
tweets['ID'][tweets['ID'] == 1391113172245831680].iloc[0] == tweets['ID'][tweets['ID'] == 1391113172245831680].iloc[1]

True
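
A likely explanation for this collision is that the 'ID' column is stored as float64 (the outputs above print it as 1.391113e+18). A float64 represents integers exactly only up to 2^53, and near 1.39e18 consecutive floats are 256 apart, so the two permalink IDs ending in ...691 and ...692 round to the same value. The sketch below reproduces the effect on a two-row CSV built in memory; parsing the ID as a string is one possible workaround, not something the notebook currently does.

```python
import io

import pandas as pd

# Two rows whose IDs differ only in the last digit (taken from the permalinks above).
csv_text = "ID,Text\n1391113172245831691,first\n1391113172245831692,second\n"

# Parsed as float64, both IDs collapse onto the same value.
as_float = pd.read_csv(io.StringIO(csv_text), dtype={"ID": "float64"})
# Parsed as strings, the IDs stay distinct.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"ID": str})

n_unique_float = as_float["ID"].nunique()
n_unique_str = as_str["ID"].nunique()
print(n_unique_float, n_unique_str)  # 1 2
```

The same reasoning applies to 'Author ID', 'Referenced Tweet' and 'Referenced Tweet Author ID', which also hold large Twitter identifiers.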

In [4]:
retweets = tweets[tweets['Reference Type'] == 'retweeted'].drop(columns='Reference Type')
original_tweets = tweets[tweets['Reference Type'] == 'original tweet'].drop(columns='Reference Type')
def get_reference_author_name(x):
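    try:
        # Retweet text looks like "RT @user: ..."; take the handle after '@'.
        return x.split(': ')[0].split('@')[1]
    except (AttributeError, IndexError):
        # Non-string text (e.g. NaN) or no '@' before the first ': '.
        return np.nan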
    
retweets['Referenced Tweet Author Name'] = retweets['Text'].apply(get_reference_author_name)
retweets.head()

Unnamed: 0,ID,Permalink,Author ID,Author Name,Author Location,Author Description,Author Followers,Author Following,Author Tweets,Author Profile Image,Author Verified,Date,Text,Replies,Retweets,Favorites,Quotes,is Retweet?,Reply To User Name,Mentions,Referenced Tweet,Referenced Tweet Author ID,Media URLs,Media Keys,Referenced Tweet Author Name
2,1.409515e+18,/Laura_Milena98/status/1409514751328202757,788250746.0,Laura_Milena98,Bogotá / Colombia,Ingeniera Ambiental - Colombiana - Show Must G...,254.0,521.0,11942.0,https://pbs.twimg.com/profile_images/153177212...,False,2021/06/28 09:11:18,RT @Jokeraton: ¡No le creo al gobierno absolut...,0.0,1136.0,0.0,0.0,True,,Jokeraton,1.408756e+18,142491200.0,,,Jokeraton
5,1.408912e+18,/Laura_Milena98/status/1408911843230425096,788250746.0,Laura_Milena98,Bogotá / Colombia,Ingeniera Ambiental - Colombiana - Show Must G...,254.0,521.0,11942.0,https://pbs.twimg.com/profile_images/153177212...,False,2021/06/26 17:15:34,RT @majogomez30: Mi abuelo de 87 años tiene do...,0.0,3189.0,0.0,0.0,True,,majogomez30,1.408428e+18,261704700.0,,,majogomez30
9,1.408233e+18,/Laura_Milena98/status/1408232541690208256,788250746.0,Laura_Milena98,Bogotá / Colombia,Ingeniera Ambiental - Colombiana - Show Must G...,254.0,521.0,11942.0,https://pbs.twimg.com/profile_images/153177212...,False,2021/06/24 20:16:16,RT @ManuelBeltrn14: ¿Será que la cepa de Trans...,0.0,148.0,0.0,0.0,True,,ManuelBeltrn14,1.407308e+18,8.305394e+17,,3_1407308089783734273,ManuelBeltrn14
10,1.408232e+18,/Laura_Milena98/status/1408232169361846288,788250746.0,Laura_Milena98,Bogotá / Colombia,Ingeniera Ambiental - Colombiana - Show Must G...,254.0,521.0,11942.0,https://pbs.twimg.com/profile_images/153177212...,False,2021/06/24 20:14:47,RT @santorendon: La cifra de fallecidos hoy po...,0.0,328.0,0.0,0.0,True,,santorendon MinSaludCol,1.408212e+18,56713270.0,,3_1408211990200389634,santorendon
12,1.407883e+18,/Laura_Milena98/status/1407882737101545475,788250746.0,Laura_Milena98,Bogotá / Colombia,Ingeniera Ambiental - Colombiana - Show Must G...,254.0,521.0,11942.0,https://pbs.twimg.com/profile_images/153177212...,False,2021/06/23 21:06:16,RT @PATATAdibujo: Nos roban los árbitros nos r...,0.0,60.0,0.0,0.0,True,,PATATAdibujo,1.407883e+18,1.140705e+18,,,PATATAdibujo
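
The same extraction can also be written with the vectorized string accessor. This is an alternative sketch, not the notebook's method, and it is slightly stricter: it only matches text that starts with "RT @user:", whereas `get_reference_author_name` accepts any text with an '@' before the first ': '. The sample texts are made up.

```python
import pandas as pd

texts = pd.Series([
    "RT @Jokeraton: texto del tweet",   # retweet
    "Un tweet sin prefijo de retweet",  # no "RT @user:" prefix
    None,                               # missing text
])

# Capture the handle between "RT @" and the following colon; non-matches become NaN.
names = texts.str.extract(r"^RT @(\w+):", expand=False)
print(names.tolist())
```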


In [18]:
count_tweets = tweets.groupby(["Author ID", "Author Name", "Reference Type"]).size().reset_index(name = "n")
count_tweets.to_pickle(os.path.join(path, "count_tweets.gzip"), compression = "gzip")

In [19]:
count_tweets

Unnamed: 0,Author ID,Author Name,Reference Type,n
0,0.000000e+00,0,original tweet,108
1,1.000000e+00,0,original tweet,8
2,2.000000e+00,0,original tweet,11
3,3.000000e+00,0,original tweet,10
4,3.000000e+00,1,original tweet,1
...,...,...,...,...
137903,1.389769e+18,VaneLen18,retweeted,591
137904,1.389784e+18,kars0518,original tweet,3
137905,1.389784e+18,kars0518,quoted,13
137906,1.389784e+18,kars0518,replied_to,3


### Tweets Lite
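We create reduced versions of the Paro data frame. The lite tweets data frame has the same number of rows but keeps only the seven columns needed for the graph construction: 'ID' (renamed to 'Tweet ID'), 'Author ID', 'Author Name', 'Referenced Tweet Author ID', 'Date', 'Reference Type' and 'Referenced Tweet'. The retweets and the original tweets are reduced and stored in the same way.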

In [10]:
# Get just the columns that we need for the Graph construction
cols = ['ID', 'Author ID','Author Name', 'Referenced Tweet Author ID', 'Date', 'Reference Type', 'Referenced Tweet']
tweets_lite = tweets[cols].reset_index(drop = True)
tweets_lite.rename(columns={'ID': 'Tweet ID'}, inplace=True)
# Store results
# run sudo chmod 777 Data/Tweets_DataFrames in bash if it is needed
tweets_lite.to_pickle(os.path.join(path, "Tweets_DataFrames/tweets_lite.gzip"), compression = "gzip")
del tweets_lite

In [8]:
# Get just the columns that we need for the Graph construction
cols = ['ID', 'Author ID','Author Name', 'Referenced Tweet Author ID', 'Referenced Tweet Author Name', 'Date', 'Referenced Tweet']
retweets_lite = retweets[cols].reset_index(drop = True)
retweets_lite.rename(columns={'ID': 'Tweet ID'}, inplace=True)
# Store results
# run sudo chmod 777 Data/Tweets_DataFrames in bash if it is needed
retweets_lite.to_pickle(os.path.join(path, "Tweets_DataFrames/retweets_lite.gzip"), compression = "gzip")

In [9]:
# Get just the columns that we need for the Graph construction
cols = ['ID', 'Author ID','Author Name', 'Date']
original_tweets_lite = original_tweets[cols].reset_index(drop = True)
original_tweets_lite.rename(columns={'ID': 'Tweet ID'}, inplace=True)
# Store results
# run sudo chmod 777 Data/Tweets_DataFrames in bash if it is needed
original_tweets_lite.to_pickle(os.path.join(path, "Tweets_DataFrames/original_tweets_lite.gzip"), compression = "gzip")

## Outputs

The outputs of this notebook are stored in "/mnt/disk2/Data/Tweets_DataFrames" and are listed below:

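- **tweets_jan21.gzip**: Dataframe with the tweets of our users during January 2021, three months before the Paro.
- **tweets_oct19.gzip**: Dataframe with the tweets of our users during October 2019, the regional elections period.
- **tweets_paro_i.gzip**: Checkpoint dataframes (7 in the run shown above) with the tweets of our users between April 28 and June 30, 2021.
- **users_information.gzip**: Per-user profile information aggregated from the Paro files.
- **tweets_lite.gzip**: Lite version of the concatenated **tweets_paro_i.gzip** files that contains just the columns needed for the graph construction: Tweet ID, Author ID, Author Name, Referenced Tweet Author ID, Date, Reference Type and Referenced Tweet.
- **retweets_lite.gzip**: Lite version of the retweets, including the Referenced Tweet Author Name.
- **original_tweets_lite.gzip**: Lite version of the original tweets with Tweet ID, Author ID, Author Name and Date.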
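
To load any of these files in another notebook, pass `compression="gzip"` explicitly, as the cells above do. As far as we know, pandas infers gzip compression from a ".gz" extension but not from ".gzip", so the default `compression="infer"` is unlikely to work with these names. The round trip below is a minimal sketch on a toy data frame and a temporary, hypothetical file name.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Tweet ID": ["1", "2"], "Author ID": ["10", "20"]})

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example_lite.gzip")  # hypothetical file name
    df.to_pickle(target, compression="gzip")
    # Same compression argument on the way back in.
    restored = pd.read_pickle(target, compression="gzip")

print(restored.equals(df))
```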