# Concatenate tweets

In this notebook, we will consolidate all the tweet files into a single compressed pickle file for further analysis. We have three main sets of data that we need to store: 

1. Data from January 2021.
2. Data from October 2019.
3. Data from April 28 to June 30, 2021.

Each of these samples corresponds to a specific moment relevant for our analysis. The October data is used for analyzing our community during election periods, specifically the regional elections in Colombia that took place in October 2019. The data from January 2021 represents the period three months before the "Paro Nacional," allowing us to track our community before the social outbreak. Finally, we have the data from the time of the "Paro Nacional," which will be the focal point of our analysis.

In [1]:
import os
import pandas as pd
import numpy as np
from glob import glob
from tqdm import tqdm

In [2]:
path = r"/mnt/disk2/Data"
pd.set_option("display.max_columns", None)

## Regional elections: October 2019

In [5]:
# We create an empty aux list that will store the tweets.
tweets_aux = []
files_oct = glob(os.path.join(path, 'RawData','users_oct_19/*.csv'))
empties = []
for file in tqdm(files_oct):
    df = pd.read_csv(file, dtype = {'Author ID': int, 'Referenced Tweet Author ID': int})
    if df.empty:
        empties.append(file)
    else:
        tweets_aux.append(df)

# Finally, the tweet dataframe is established and tweets_aux is deleted.
tweets = pd.concat(tweets_aux)
del tweets_aux
tweets = tweets.sort_values('ID').reset_index(drop = True)

# Store results
tweets.to_pickle(os.path.join(path, "Tweets_DataFrames/tweets_oct19.gzip"), compression = "gzip")

  0%|          | 0/25125 [00:00<?, ?it/s]




ValueError: Integer column has NA values in column 22
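The `ValueError` above is raised because a plain `int` column cannot hold missing values, and some rows have no referenced author. A minimal sketch of one possible workaround is pandas' nullable `'Int64'` dtype. The CSV text below is made-up toy data, not one of the real files, and whether this dtype suits the full dataset is an assumption that should be checked against the raw files.

```python
import io
import pandas as pd

# Toy CSV (made-up values) where one referenced author ID is missing
csv_text = "ID,Author ID,Referenced Tweet Author ID\n1,10,20\n2,11,\n"
id_cols = ['Author ID', 'Referenced Tweet Author ID']

# Plain int cannot represent the missing value, so pandas raises ValueError
try:
    pd.read_csv(io.StringIO(csv_text), dtype = {col: int for col in id_cols})
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# The nullable 'Int64' dtype stores the gap as <NA> instead of failing
df = pd.read_csv(io.StringIO(csv_text), dtype = {col: 'Int64' for col in id_cols})
print(plain_int_failed, df['Referenced Tweet Author ID'].isna().sum())
```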

## Before Paro Nacional: January 2021

We identified two users whose files are corrupted: Usuario_82383620 and Usuario_2526574133.

In [4]:
# We create an empty aux list that will store the tweets.
tweets_aux = []
files_jan = glob(os.path.join(path, "RawData", "users_jan/*.csv"))

for file in tqdm(files_jan):
    df = pd.read_csv(file)
    if df.empty:
        empties.append(file)
    else:
        tweets_aux.append(df)

# Finally, the tweet dataframe is established and tweets_aux is deleted.  
tweets_jan = pd.concat(tweets_aux)
del tweets_aux
tweets_jan = tweets_jan.sort_values('ID').reset_index(drop = True)

# Store results
# run sudo chmod 777 /mnt/disk2/Data/Tweets_DataFrames in bash if it is needed
tweets_jan.to_pickle(os.path.join(path, "Tweets_DataFrames/tweets_jan21.gzip"), compression = "gzip")

100%|██████████| 34048/34048 [02:17<00:00, 247.12it/s]
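Since this section mentions corrupted user files, a small guard around `pd.read_csv` may be useful. The sketch below is only an illustration: `read_csv_or_none` is a hypothetical helper that is not part of this notebook, and it assumes corruption shows up as a pandas `ParserError` or `EmptyDataError`, which may not cover every broken file.

```python
import pandas as pd

def read_csv_or_none(file):
    """Return the DataFrame for `file`, or None if pandas cannot parse it."""
    try:
        return pd.read_csv(file)
    except (pd.errors.ParserError, pd.errors.EmptyDataError):
        # Malformed rows or a file without any columns: let the caller skip it
        return None
```

Inside a loading loop, files for which the helper returns `None` could be collected in a separate list and inspected later.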


## Paro Nacional: April 28 - June 30 2021

In [3]:
files_v1 = glob(os.path.join(path, 'RawData/Usuarios_V1/*.csv'))
len(files_v1)

37324

In [5]:
df = pd.read_csv(files_v1[0])
df.head()

Unnamed: 0,ID,Permalink,Author ID,Author Name,Author Location,Author Description,Author Followers,Author Following,Author Tweets,Author Profile Image,...,Favorites,Quotes,is Retweet?,Reply To User Name,Mentions,Referenced Tweet,Reference Type,Referenced Tweet Author ID,Media URLs,Media Keys
0,1410095246730485760,/Resistenciahera/status/1410095246730485760,918059636082823173,Resistenciahera,Cali,café adicto☕️☕️☕️☕️☕️ orgulloso de tener el co...,3181,4972,55097,https://pbs.twimg.com/profile_images/147008408...,...,0,0,True,,JUANCAELBROKY,1.410065e+18,retweeted,141943900.0,https://pbs.twimg.com/ext_tw_video_thumb/14100...,7_1410064556207058948
1,1410095063800111104,/Resistenciahera/status/1410095063800111104,918059636082823173,Resistenciahera,Cali,café adicto☕️☕️☕️☕️☕️ orgulloso de tener el co...,3181,4972,55097,https://pbs.twimg.com/profile_images/147008408...,...,0,0,True,,ElParcheCritico ClaudiaLopez,1.410063e+18,retweeted,8.628063e+17,,
2,1410093637279571970,/Resistenciahera/status/1410093637279571970,918059636082823173,Resistenciahera,Cali,café adicto☕️☕️☕️☕️☕️ orgulloso de tener el co...,3181,4972,55097,https://pbs.twimg.com/profile_images/147008408...,...,0,0,True,,Estudianteslas1,1.410092e+18,retweeted,1.402301e+18,,7_1410091724656029700
3,1410093409419894787,/Resistenciahera/status/1410093409419894787,918059636082823173,Resistenciahera,Cali,café adicto☕️☕️☕️☕️☕️ orgulloso de tener el co...,3181,4972,55097,https://pbs.twimg.com/profile_images/147008408...,...,0,0,True,,ma_camiladiaz elpais_america,1.410015e+18,retweeted,382419800.0,,3_1410015253472202757
4,1410093148932608006,/Resistenciahera/status/1410093148932608006,918059636082823173,Resistenciahera,Cali,café adicto☕️☕️☕️☕️☕️ orgulloso de tener el co...,3181,4972,55097,https://pbs.twimg.com/profile_images/147008408...,...,0,0,True,,elespectador,1.409839e+18,retweeted,14834300.0,,


In [15]:
def unique_to_string(x):
    unique_values = x.unique()
    return ', '.join(map(str, unique_values))

user_information = df.groupby(['Author ID', 'Author Name']).agg({
                'Author Location': unique_to_string,
                'Author Description': unique_to_string,
                'Author Followers': lambda x: np.nanmean(x),
                'Author Following': lambda x: np.nanmean(x),
                'Author Tweets': lambda x: np.nanmax(x),
                'Author Verified': unique_to_string})



In [16]:
user_information

Unnamed: 0_level_0,Unnamed: 1_level_0,Author Location,Author Description,Author Followers,Author Following,Author Tweets,Author Verified,Author Retweets
Author ID,Author Name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
918059636082823173,Resistenciahera,Cali,café adicto☕️☕️☕️☕️☕️ orgulloso de tener el co...,3181.0,4972.0,55097,False,8929
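To make the per-user aggregation easier to follow, here is a minimal sketch on made-up toy data. It uses the built-in `'mean'` and `'max'` aggregations, which skip missing values by default, in place of the `np.nanmean` and `np.nanmax` lambdas; that swap is a suggestion, not what the notebook runs. The output column names (`locations`, `followers`, `tweets`) are hypothetical.

```python
import numpy as np
import pandas as pd

def unique_to_string(x):
    # Join the distinct values of a column into one comma-separated string
    return ', '.join(map(str, x.unique()))

# Toy data: user 1 appears twice with different locations, user 2 has no follower count
toy = pd.DataFrame({
    'Author ID': [1, 1, 2],
    'Author Name': ['a', 'a', 'b'],
    'Author Location': ['Cali', 'Bogotá', 'Cali'],
    'Author Followers': [10.0, 20.0, np.nan],
    'Author Tweets': [5, 7, 3],
})

summary = toy.groupby(['Author ID', 'Author Name']).agg(
    locations = ('Author Location', unique_to_string),
    followers = ('Author Followers', 'mean'),
    tweets = ('Author Tweets', 'max'),
)
print(summary)
```

A group whose values are all missing simply gets `NaN` here, without the warnings that `np.nanmean` emits for an empty slice.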


In [6]:
df_list = []
users_information = []

# cols = ['ID', 'Author ID', 'Author Name', 'Date', 'Text', 'Replies', 'Retweets', 'Favorites', 'Quotes', 'is Retweet?',
#            'Reply To User Name', 'Mentions', 'Referenced Tweet', 'Reference Type', 'Referenced Tweet Author ID']

problems = []

def unique_to_string(x):
    unique_values = x.unique()
    return ', '.join(map(str, unique_values))

# Counter and variable for keeping track of file names
count = 0 # Amount of Tweets
n = 0 # Number of Checkpoint

# Runtime: about 1 hour
for file in tqdm(files_v1):
    try:
        # df = pd.read_csv(file, usecols = cols)
        df = pd.read_csv(file)
        if df.empty:
            # Record empty files and skip them: they carry no tweets or user data
            empties.append(file)
            continue
        # Fix some datatypes: entries that are not numeric become NaN
        num_cols = ['Author Followers', 'Author Following', 'Author Tweets']
        df[num_cols] = df[num_cols].apply(pd.to_numeric, errors = 'coerce')
        df_list.append(df)
        count += len(df)

        # Save user information
        user_information = df.groupby(['Author ID', 'Author Name']).agg({
                'Author Location': unique_to_string,
                'Author Description': unique_to_string,
                'Author Followers': lambda x: np.nanmean(x),
                'Author Following': lambda x: np.nanmean(x),
                'Author Tweets': lambda x: np.nanmax(x),
                'Author Verified': unique_to_string})
        
        users_information.append(user_information)
        
        # If we reach or exceed 7 million rows, save a checkpoint and reset
        if count >= 7_000_000:
            n += 1
            concat_df = pd.concat(df_list)
            output_filename = f"tweets_paro_{n}.gzip"
            concat_df.to_pickle(os.path.join(path, f"Tweets_DataFrames/{output_filename}"), compression='gzip')
            
            # Reset counter and list
            count = 0
            df_list = []
            
    except (ValueError, KeyError):
        # Files with missing columns or unparsable values are listed for later inspection
        problems.append(file)

problems


['/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_31172486-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_418406996-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_2564362444-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_721013234-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_286818396-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_186496554-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_799005686-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_3327640233-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_2183412805-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_1286016270517841931-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_820659585770016769-Juan’s MacBook Air.csv',
 '/mnt/disk2/Data/RawData/Usuarios_V1/Usuario_180601646-Juan’s MacBook Air.csv',
 '/mnt/

In [7]:
empties

['/mnt/disk2/Data/RawData/users_oct_19/Usuario_197098512.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_924784894810705920.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_1245376536745738241.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_339265087.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_1304325486298923008.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_3632842755.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_297579509.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_992142025331003392.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_90046706.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_2534100571.csv',
 '/mnt/disk2/Data/RawData/users_oct_19/Usuario_1128415295012470784.csv',
 '/mnt/disk2/Data/RawData/users_jan/Usuario_152356732.csv',
 '/mnt/disk2/Data/RawData/users_jan/Usuario_2983488825.csv',
 '/mnt/disk2/Data/RawData/users_jan/Usuario_342353211.csv',
 '/mnt/disk2/Data/RawData/users_jan/Usuario_355634793.csv',
 '/mnt/disk2/Data

In [8]:
# Save any remaining data after the loop
# Runtime 5 minutes
if df_list:
    n += 1
    concat_df = pd.concat(df_list)
    output_filename = f"tweets_paro_{n}.gzip"
    # If necessary, run "sudo chmod 777 Data/Tweets_DataFrames" in bash
    concat_df.to_pickle(os.path.join(path, f"Tweets_DataFrames/{output_filename}"), compression = 'gzip')

del df_list, concat_df
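The checkpoint logic used for the Paro files (accumulate dataframes, write a file once a row threshold is reached, then flush whatever remains) can be sketched as a small generator. `checkpoint_frames` and the toy frames are hypothetical and only illustrate the pattern; unlike the cells above, this sketch resets the index with `ignore_index = True` and leaves the writing to disk to the caller.

```python
import pandas as pd

def checkpoint_frames(frames, max_rows):
    """Yield concatenated batches once at least `max_rows` rows have accumulated."""
    batch, count = [], 0
    for frame in frames:
        if frame.empty:
            continue  # empty inputs contribute nothing
        batch.append(frame)
        count += len(frame)
        if count >= max_rows:
            yield pd.concat(batch, ignore_index = True)
            batch, count = [], 0
    if batch:
        # Flush the remaining rows that never reached the threshold
        yield pd.concat(batch, ignore_index = True)

# Toy frames with 3, 0, 4, 2 and 1 rows and a threshold of 5 rows
frames = [pd.DataFrame({'ID': range(n)}) for n in (3, 0, 4, 2, 1)]
sizes = [len(chunk) for chunk in checkpoint_frames(frames, max_rows = 5)]
print(sizes)  # [7, 3]
```

Each yielded chunk could then be written with `to_pickle` under a numbered file name, as the notebook does with `tweets_paro_{n}.gzip`.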

In [9]:
# runtime 22 minutes
concat_users_information = pd.concat(users_information)
concat_users_information = concat_users_information.groupby(['Author ID', 'Author Name']) \
    .agg({'Author Location': unique_to_string,
            'Author Description': unique_to_string,
            'Author Followers': lambda x: np.nanmean(x),
            'Author Following': lambda x: np.nanmean(x),
            'Author Tweets': lambda x: np.nanmax(x),
            'Author Verified': unique_to_string})
concat_users_information.to_pickle(os.path.join(path, "Tweets_DataFrames/users_information.gzip"), 
                                   compression = 'gzip')



In [10]:
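# TODO: the first rows are malformed (small numeric values in place of Author ID and Author Name, with shifted columns) and should be cleaned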
concat_users_information

Unnamed: 0_level_0,Unnamed: 1_level_0,Author Location,Author Description,Author Followers,Author Following,Author Tweets,Author Verified
Author ID,Author Name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0.000000e+00,0,"True, False, True, False","nan, equidad_mujer, mariumega, PattyRosi24",,1.384355e+18,,"nan, https://pbs.twimg.com/media/E2C38-kXEAIkE..."
1.000000e+00,0,False,"nan, equidad_mujer",,1.398276e+18,,"nan, https://pbs.twimg.com/media/E19UAtWXsAYjz..."
2.000000e+00,0,False,"nan, estoacaquees, thearchipielago, equidad_mu...",,1.396578e+18,,https://pbs.twimg.com/ext_tw_video_thumb/14011...
3.000000e+00,0,"False, True","MinTransporteCo, nan, Supertransporte",,1.396701e+18,,https://pbs.twimg.com/media/E0ysfTuWUAUo8by.jp...
3.000000e+00,1,False,,,,,https://pbs.twimg.com/media/E1dlhtaXEAYTwol.jpg
...,...,...,...,...,...,...,...
1.389722e+18,Neoplasticista,Colombia,Arquitecto. Contra Corriente.,91.0,4.980000e+02,3534.0,False
1.389737e+18,JC13177979,"Bogotá, D.C., Colombia",Aunque nadie ha podido regresar y hacer un nue...,94.0,1.780000e+02,7083.0,False
1.389741e+18,JhonatanVRojo,"Medellín, Colombia",El mundo es más que blanco & negro.,103.0,4.270000e+02,1257.0,False
1.389769e+18,VaneLen18,Colombia,,8.0,9.300000e+01,1179.0,False


In [3]:
tweets_paro = glob(os.path.join(path, 'Tweets_DataFrames', 'tweets_paro_*'))

# Read every checkpoint and concatenate once, instead of growing a DataFrame inside the loop
tweets = pd.concat(
    [pd.read_pickle(file, compression = "gzip") for file in tqdm(tweets_paro)],
    axis = 0)
tweets.head()

100%|██████████| 7/7 [03:20<00:00, 28.58s/it]


Unnamed: 0,ID,Permalink,Author ID,Author Name,Author Location,Author Description,Author Followers,Author Following,Author Tweets,Author Profile Image,Author Verified,Date,Text,Replies,Retweets,Favorites,Quotes,is Retweet?,Reply To User Name,Mentions,Referenced Tweet,Reference Type,Referenced Tweet Author ID,Media URLs,Media Keys
0,1.409619e+18,/hmauriciojg/status/1409618955283668996,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/28 16:05:23,@DanielSamperO A vida hp!!. @IvanDuque fue y s...,0.0,0.0,0.0,0.0,False,DanielSamperO,DanielSamperO IvanDuque petrogustavo,1.409586e+18,replied_to,134855300.0,,
1,1.409575e+18,/hmauriciojg/status/1409574993596452867,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/28 13:10:41,@alejarojas_g A bueno de pronto si @petrogusta...,0.0,0.0,0.0,0.0,False,alejarojas_g,alejarojas_g petrogustavo,1.409192e+18,replied_to,1131821000.0,,
2,1.409302e+18,/hmauriciojg/status/1409302180847292417,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/27 19:06:38,@gabodelascasas Ahí la tiene https://t.co/2WJZ...,0.0,0.0,0.0,0.0,False,gabodelascasas,gabodelascasas,1.409298e+18,replied_to,62337500.0,https://pbs.twimg.com/media/E47Y3H4XMAMtHHu.jpg,3_1409302174933397507
3,1.407446e+18,/hmauriciojg/status/1407446306113691662,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/22 16:12:03,@JOHANVE_LAND Deberías hacerle esa pregunta ta...,0.0,0.0,0.0,0.0,False,JOHANVE_LAND,JOHANVE_LAND petrogustavo,1.407171e+18,replied_to,576647400.0,,
4,1.407176e+18,/hmauriciojg/status/1407176029635067904,138377765.0,hmauriciojg,"Bucaramanga, Colombia",,22.0,558.0,873.0,https://pbs.twimg.com/profile_images/154468480...,False,2021/06/21 22:18:04,@ClaraLopezObre @petrogustavo Que susto tan hp...,0.0,0.0,0.0,0.0,False,ClaraLopezObre,ClaraLopezObre petrogustavo,1.40675e+18,replied_to,126832600.0,,
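In the output above, `ID` and `Author ID` are displayed as floats (for example `1.409619e+18`). A float64 has a 53-bit mantissa, so IDs of this size cannot all be represented exactly. The sketch below illustrates the effect with the Author ID from the earlier sample file and made-up CSV text. Reading ID columns as strings is one possible way to keep them exact; it is shown as an assumption about what downstream analysis needs, not as what this notebook does.

```python
import io
import pandas as pd

author_id = 918059636082823173  # Author ID taken from the sample file shown earlier

# Converting to float rounds the ID to the nearest representable value
survives_float = int(float(author_id)) == author_id
print(survives_float)  # False

# Toy CSV with a single row, read twice with different dtypes for the ID column
csv_text = "ID,Author ID\n1410095246730485760,918059636082823173\n"
as_float = pd.read_csv(io.StringIO(csv_text), dtype = {'Author ID': float})
as_text = pd.read_csv(io.StringIO(csv_text), dtype = {'Author ID': str})

print(int(as_float['Author ID'].iloc[0]) == author_id)  # False: digits were lost
print(as_text['Author ID'].iloc[0] == str(author_id))   # True: kept exactly
```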


### Tweets Lite
We create a reduced version of the Paro data frame. It has the same number of rows but stores only six columns: 'Author ID', 'Author Name', 'Referenced Tweet Author ID', 'Date', 'Reference Type' and 'Referenced Tweet'.

In [4]:
# Get just the columns that we need for the Graph construction
cols = ['Author ID','Author Name', 'Referenced Tweet Author ID', 'Date', 'Reference Type', 'Referenced Tweet']
tweets_lite = tweets[cols].reset_index(drop = True)
# Store results
# run sudo chmod 777 Data/Tweets_DataFrames in bash if it is needed
tweets_lite.to_pickle(os.path.join(path, "Tweets_DataFrames/tweets_lite.gzip"), compression = "gzip")

## Outputs

The outputs of this notebook are stored in "/mnt/disk2/Data/Tweets_DataFrames" and are listed below:

- **tweets_jan21.gzip**: Dataframe with the tweets of our users during January 2021, three months before the Paro Nacional.
- **tweets_oct19.gzip**: Dataframe with the tweets of our users during October 2019, the regional elections period.
- **tweets_paro_i.gzip**: Checkpoint dataframes (7 files were found by the loading cell above) with the tweets of our users between April 28 and June 30, 2021.
- **users_information.gzip**: Per-user summary (location, description, followers, following, tweet count and verified flag) aggregated from the Paro files.
- **tweets_lite.gzip**: Lite version of the concatenated Paro dataframes that contains just the columns needed for the graph construction: Author ID, Author Name, Referenced Tweet Author ID, Date, Reference Type and Referenced Tweet.
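
A short sketch of how these files can be read back. Because the files end in `.gzip` rather than `.gz`, pandas may not infer the compression from the file name, so passing `compression = "gzip"` explicitly is the safer choice. The toy dataframe and the temporary path below are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Toy dataframe standing in for one of the stored outputs
toy = pd.DataFrame({'Author ID': [1, 2], 'Reference Type': ['retweeted', 'quoted']})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.gzip")
    # Write and read with the same explicit compression setting
    toy.to_pickle(toy_path, compression = "gzip")
    restored = pd.read_pickle(toy_path, compression = "gzip")

print(restored.equals(toy))
```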