<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Renata - Dataset preparation

## Importing the required libraries

In [1]:
import re
import pandas as pd
from bs4 import BeautifulSoup

## Setting input and output filenames

Set `input_filename` to the name of the file to be processed. The output filename is derived from it by inserting `suffix` before the file extension.

In [2]:
input_filename = 'cl_st1_renata_truthbrush_20240616_raw_data.jsonl'
suffix = '_prep'

def add_suffix(filename):
    # Extract the base filename without the extension
    base_filename = re.match(r'^([A-Za-z0-9-_,\s]+)\.[A-Za-z]{1,5}$', filename).group(1)
    
    # Append suffix to the base filename
    new_filename = f'{base_filename}{suffix}'
    
    # Add the original file extension back
    new_filename += re.search(r'\.[A-Za-z]{1,5}$', filename).group()
    
    return new_filename

output_filename = add_suffix(input_filename)
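The regular expression in `add_suffix` only accepts base names made of letters, digits, hyphens, underscores, commas and whitespace, so a filename containing an extra dot would make `re.match` return `None` and raise an `AttributeError`. The following is a minimal sketch of an alternative built on the standard library's `pathlib`; the function name `add_suffix_with_pathlib` and its `suffix` parameter are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path


def add_suffix_with_pathlib(filename, suffix='_prep'):
    """Insert `suffix` between the file stem and its extension (hypothetical helper)."""
    path = Path(filename)
    # path.stem is the name without the last extension; path.suffix is that extension
    return str(path.with_name(f'{path.stem}{suffix}{path.suffix}'))


print(add_suffix_with_pathlib('cl_st1_renata_truthbrush_20240616_raw_data.jsonl'))
```

For the input filename used here, this should yield the same result as `add_suffix`, while also tolerating names such as `data.v2.jsonl`.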

## Data wrangling

### Importing the tweet raw data into a dataframe

In [3]:
df_tweets_raw_data = pd.read_json(input_filename, lines=True)

In [4]:
df_tweets_raw_data.head(5)

Unnamed: 0,id,created_at,in_reply_to_id,quote_id,in_reply_to_account_id,sensitive,spoiler_text,visibility,language,uri,...,reblogs_count,favourites_count,favourited,reblogged,muted,pinned,bookmarked,poll,emojis,_pulled
0,112628415816970752,2024-06-16 21:27:19.392000+00:00,,,,False,,public,en,https://truthsocial.com/@AlexJones/11262841581...,...,20,40,False,False,False,False,False,,[],2024-06-16T19:32:14.220929
1,112627604328468576,2024-06-16 18:00:57.062000+00:00,,,,False,,public,en,https://truthsocial.com/@AlexJones/11262760432...,...,47,139,False,False,False,False,False,,[],2024-06-16T19:32:14.221289
2,112621239396080064,2024-06-15 15:02:15.900000+00:00,,,,False,,public,en,https://truthsocial.com/@AlexJones/11262123939...,...,36,111,False,False,False,False,False,,[],2024-06-16T19:32:14.221532
3,112617723858598704,2024-06-15 00:08:13.055000+00:00,,,,False,,public,en,https://truthsocial.com/@AlexJones/11261772385...,...,80,201,False,False,False,False,False,,[],2024-06-16T19:32:14.221757
4,112616115915769904,2024-06-14 17:19:17.792000+00:00,,,,False,,public,en,https://truthsocial.com/@AlexJones/11261611591...,...,37,98,False,False,False,False,False,,[],2024-06-16T19:32:14.222031


In [5]:
df_tweets_raw_data.shape

(762993, 33)

### Checking if data types are consistent

In [6]:
df_tweets_raw_data.dtypes

id                                      int64
created_at                datetime64[ns, UTC]
in_reply_to_id                        float64
quote_id                              float64
in_reply_to_account_id                float64
sensitive                                bool
spoiler_text                           object
visibility                             object
language                               object
uri                                    object
url                                    object
content                                object
account                                object
media_attachments                      object
mentions                               object
tags                                   object
card                                   object
group                                  object
quote                                  object
in_reply_to                           float64
reblog                                 object
sponsored                         

### Checking the columns `id`, `created_at`, `language`, `url`, `content` and `account` for missing values

In [7]:
print(df_tweets_raw_data['id'].isnull().sum())

0


In [8]:
print(df_tweets_raw_data['created_at'].isnull().sum())

0


The `language` field is not reliable: it is missing for a large share of the posts, as the following count shows.

In [9]:
print(df_tweets_raw_data['language'].isnull().sum())

253915


In [10]:
print(df_tweets_raw_data['url'].isnull().sum())

0


In [11]:
print(df_tweets_raw_data['content'].isnull().sum())

0


In [12]:
print(df_tweets_raw_data['account'].isnull().sum())

0
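The six checks above can also be run in a single call by selecting the columns of interest first. The sketch below uses a small invented dataframe (`df_example`) as a stand-in for `df_tweets_raw_data`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_tweets_raw_data (values invented for illustration)
df_example = pd.DataFrame({
    'id': [1, 2, 3],
    'language': ['en', None, 'en'],
    'content': ['<p>a</p>', '<p>b</p>', '<p>c</p>'],
})

columns_to_check = ['id', 'language', 'content']

# One missing-value count per column, returned as a Series
missing_counts = df_example[columns_to_check].isnull().sum()
print(missing_counts)
```

Note that `isnull()` counts `None` and `NaN` only; empty strings are not treated as missing.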


### Dropping unnecessary columns

#### Listing the columns

In [13]:
df_tweets_raw_data.columns.values.tolist()

['id',
 'created_at',
 'in_reply_to_id',
 'quote_id',
 'in_reply_to_account_id',
 'sensitive',
 'spoiler_text',
 'visibility',
 'language',
 'uri',
 'url',
 'content',
 'account',
 'media_attachments',
 'mentions',
 'tags',
 'card',
 'group',
 'quote',
 'in_reply_to',
 'reblog',
 'sponsored',
 'replies_count',
 'reblogs_count',
 'favourites_count',
 'favourited',
 'reblogged',
 'muted',
 'pinned',
 'bookmarked',
 'poll',
 'emojis',
 '_pulled']

#### Selecting the columns that are being dropped

1. Copy the previous list and comment out the columns to be kept: `id`, `created_at`, `language`, `url`, `content` and `account`;
2. Paste the edited list into the following command before running it.

In [14]:
df_tweets_raw_data = df_tweets_raw_data.drop(columns=[
#    'id',
#    'created_at',
    'in_reply_to_id',
    'quote_id',
    'in_reply_to_account_id',
    'sensitive',
    'spoiler_text',
    'visibility',
#    'language',
    'uri',
#    'url',
#    'content',
#    'account',
    'media_attachments',
    'mentions',
    'tags',
    'card',
    'group',
    'quote',
    'in_reply_to',
    'reblog',
    'sponsored',
    'replies_count',
    'reblogs_count',
    'favourites_count',
    'favourited',
    'reblogged',
    'muted',
    'pinned',
    'bookmarked',
    'poll',
    'emojis',
    '_pulled'
])

In [15]:
df_tweets_raw_data.columns.values.tolist()

['id', 'created_at', 'language', 'url', 'content', 'account']
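The manual editing of the drop list can be avoided by naming the columns to keep instead. The following is a sketch of that alternative on an invented dataframe; `df_example` and `columns_to_keep` are illustrative names, and the result should be equivalent to the `drop` call as long as every listed column exists.

```python
import pandas as pd

# Toy stand-in with a few of the original columns (values invented)
df_example = pd.DataFrame({
    'id': [1],
    'created_at': ['2024-06-16'],
    'language': ['en'],
    'url': ['https://example.org/1'],
    'content': ['<p>hi</p>'],
    'account': [{'id': '10', 'username': 'someone'}],
    'sensitive': [False],
    'pinned': [False],
})

columns_to_keep = ['id', 'created_at', 'language', 'url', 'content', 'account']

# Selecting by name keeps only the listed columns, in the listed order
df_example = df_example[columns_to_keep]
print(df_example.columns.tolist())
```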

### Listing the values of the parameter `language`

This field is not reliable. Although many different languages are reported, all of the posts appear to be in English. Therefore, no posts are excluded on the basis of this information.

In [16]:
df_tweets_raw_data['language'].unique()

array(['en', None, 'co', 'de', 'nl', 'lb', 'es', 'no', 'eo', 'sr', 'af',
       'st', 'fy', 'mt', 'bg', 'pl', 'ca', 'da', 'ga', 'et', 'ms', 'fr',
       'it', 'la', 'lt', '', 'jv', 'el', 'sv', 'pt', 'lv', 'id', 'vi',
       'zu', 'gl', 'gd', 'hu', 'fi', 'ig', 'sn', 'sl', 'ha', 'mg', 'so',
       'cy', 'zh', 'xh', 'eu', 'ht', 'ja', 'mi', 'sw', 'mk', 'bs', 'su',
       'yo', 'sm', 'az', 'sq', 'ny', 'sk', 'ru', 'uz', 'cs', 'ku', 'tg',
       'hi', 'ky', 'ro', 'tr', 'kk', 'fa', 'hr', 'be', 'ko', 'mr', 'uk',
       'ne', 'ka'], dtype=object)
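`unique()` lists the reported values but not how often each occurs. A possible complement, sketched here on an invented series, is `value_counts(dropna=False)`, which also counts missing entries; the empty string is tallied as a separate value.

```python
import pandas as pd

# Toy stand-in for df_tweets_raw_data['language'] (values invented)
language = pd.Series(['en', 'en', None, '', 'de', 'en'])

# dropna=False keeps missing values in the tally
language_counts = language.value_counts(dropna=False)
print(language_counts)
```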

### Extracting the column `username`

In [17]:
# Flatten the nested JSON 'account' attribute
df_tweets_raw_data_flattened_user = pd.json_normalize(df_tweets_raw_data['account'])

# Extract the 'username' attribute
username = df_tweets_raw_data_flattened_user['username']

# Create a new column 'username'
df_tweets_raw_data['username'] = username

### Extracting the column `author_id`

In [18]:
# Extract the 'id' attribute
author_id = df_tweets_raw_data_flattened_user['id']

# Create a new column 'author_id'
df_tweets_raw_data['author_id'] = author_id
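Assigning a column taken from the flattened dataframe relies on index alignment: depending on the pandas version, `pd.json_normalize` may return a fresh 0-based index, which matches `df_tweets_raw_data` here only because no rows have been dropped or reordered. The following sketch, on an invented dataframe with a deliberately non-default index, reads the keys directly from each `account` dictionary so that the original index is preserved. It assumes every `account` value is a dictionary.

```python
import pandas as pd

# Toy stand-in with a non-default index (values invented)
df_example = pd.DataFrame(
    {'account': [{'id': '10', 'username': 'alice'},
                 {'id': '20', 'username': 'bob'}]},
    index=[5, 9],
)

# map() returns a Series with the same index as the source column
df_example['username'] = df_example['account'].map(lambda account: account.get('username'))
df_example['author_id'] = df_example['account'].map(lambda account: account.get('id'))
print(df_example[['username', 'author_id']])
```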

### Extracting the column `tweet_url`

In [19]:
# Extract the 'url' attribute
tweet_url = df_tweets_raw_data['url']

# Create a new column 'tweet_url'
df_tweets_raw_data['tweet_url'] = tweet_url

### Extracting the column `text`

`BeautifulSoup` is used to extract plain text from the HTML stored in the column `content`.

In [20]:
# Extracting plain text from the 'content' column and storing it in column 'text'
df_tweets_raw_data['text'] = df_tweets_raw_data['content'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
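`get_text()` concatenates all text nodes without a separator, which is what keeps a URL split across several `<span>` elements in one piece. The same behaviour would, however, glue consecutive paragraphs together. Assuming some posts contain more than one `<p>` element (not verified for this dataset), the sketch below joins the text of each paragraph with a newline. The helper `html_to_text` is hypothetical, and `'html.parser'` is used instead of `'lxml'` only to avoid an extra dependency in the example.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the text of each <p> joined by newlines (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = soup.find_all('p')
    if not paragraphs:
        # Fall back to plain extraction when there are no <p> elements
        return soup.get_text()
    return '\n'.join(p.get_text() for p in paragraphs)


# Invented two-paragraph post with a link split across spans
html = ('<p>First paragraph.</p>'
        '<p>See <a href="#"><span>https://</span><span>example.org</span></a></p>')

print(repr(BeautifulSoup(html, 'html.parser').get_text()))
print(repr(html_to_text(html)))
```

Text that sits outside any `<p>` element would be lost by this helper whenever paragraphs are present, so it is a starting point rather than a drop-in replacement.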

In [21]:
df_tweets_raw_data

Unnamed: 0,id,created_at,language,url,content,account,username,author_id,tweet_url,text
0,112628415816970752,2024-06-16 21:27:19.392000+00:00,en,https://truthsocial.com/@AlexJones/11262841581...,"<p><a href=""https://truthsocial.com/tags/AlexJ...","{'id': '107838014712814235', 'username': 'Alex...",AlexJones,107838014712814235,https://truthsocial.com/@AlexJones/11262841581...,#AlexJonesShow Sunday Live: Hollywood Comes to...
1,112627604328468576,2024-06-16 18:00:57.062000+00:00,en,https://truthsocial.com/@AlexJones/11262760432...,<p>Watch: Alex Jones Calls Out Globalist Tyran...,"{'id': '107838014712814235', 'username': 'Alex...",AlexJones,107838014712814235,https://truthsocial.com/@AlexJones/11262760432...,Watch: Alex Jones Calls Out Globalist Tyrants ...
2,112621239396080064,2024-06-15 15:02:15.900000+00:00,en,https://truthsocial.com/@AlexJones/11262123939...,<p>Alex Jones Emergency Saturday Broadcast: Le...,"{'id': '107838014712814235', 'username': 'Alex...",AlexJones,107838014712814235,https://truthsocial.com/@AlexJones/11262123939...,Alex Jones Emergency Saturday Broadcast: Learn...
3,112617723858598704,2024-06-15 00:08:13.055000+00:00,en,https://truthsocial.com/@AlexJones/11261772385...,<p>Exclusive: Alex Jones Makes First Statement...,"{'id': '107838014712814235', 'username': 'Alex...",AlexJones,107838014712814235,https://truthsocial.com/@AlexJones/11261772385...,Exclusive: Alex Jones Makes First Statements A...
4,112616115915769904,2024-06-14 17:19:17.792000+00:00,en,https://truthsocial.com/@AlexJones/11261611591...,"<p><a href=""https://truthsocial.com/tags/AlexJ...","{'id': '107838014712814235', 'username': 'Alex...",AlexJones,107838014712814235,https://truthsocial.com/@AlexJones/11261611591...,"#AlexJonesShow LIVE: Tucker Carlson, Russell B..."
...,...,...,...,...,...,...,...,...,...,...
762988,107821332077590816,2022-02-18 22:22:42.617000+00:00,en,https://truthsocial.com/@truthsocial/107821332...,<p>Remember this is SOCIAL media. It’s meant ...,"{'id': '107759501782461327', 'username': 'trut...",truthsocial,107759501782461327,https://truthsocial.com/@truthsocial/107821332...,Remember this is SOCIAL media. It’s meant to ...
762989,107820127429653232,2022-02-18 17:16:21.149000+00:00,en,https://truthsocial.com/@truthsocial/107820127...,<p>The dream has become a reality. You have yo...,"{'id': '107759501782461327', 'username': 'trut...",truthsocial,107759501782461327,https://truthsocial.com/@truthsocial/107820127...,The dream has become a reality. You have your ...
762990,108116645317044176,2022-04-12 02:04:45.040000+00:00,en,https://truthsocial.com/@warrendavidson/108116...,"<p>NATO should be reinforced, not re-imagined ...","{'id': '107838824115445312', 'username': 'warr...",warrendavidson,107838824115445312,https://truthsocial.com/@warrendavidson/108116...,"NATO should be reinforced, not re-imagined wit..."
762991,108086202801522400,2022-04-06 17:02:49.117000+00:00,en,https://truthsocial.com/@warrendavidson/108086...,"<p>Yesterday, I was one of 63 members of Congr...","{'id': '107838824115445312', 'username': 'warr...",warrendavidson,107838824115445312,https://truthsocial.com/@warrendavidson/108086...,"Yesterday, I was one of 63 members of Congress..."


### Inspecting the data

In [22]:
inspected_row = 160
print('username:' + df_tweets_raw_data.loc[inspected_row, 'username'])
print('content:' + df_tweets_raw_data.loc[inspected_row, 'content'])
print('text:' + df_tweets_raw_data.loc[inspected_row, 'text'])
print('tweet_url:' + df_tweets_raw_data.loc[inspected_row, 'tweet_url'])

username:AlexJones
content:<p>Viral Talk Show Host Jackson Hinkle Joins Alex Jones Live In-Studio For Epic Debate  <a href="https://links.truthsocial.com/link/112317395759093712" rel="nofollow noopener noreferrer" target="_blank"><span class="invisible">https://</span><span class="ellipsis">madmaxworld.tv/watch?id=6625bc</span><span class="invisible">f80b8f941d11961839</span></a></p>
text:Viral Talk Show Host Jackson Hinkle Joins Alex Jones Live In-Studio For Epic Debate  https://madmaxworld.tv/watch?id=6625bcf80b8f941d11961839
tweet_url:https://truthsocial.com/@AlexJones/112317395758273362


### Creating the output file

In [23]:
df_tweets_raw_data[['created_at', 'author_id', 'username', 'tweet_url', 'content', 'text']].to_json(output_filename, orient='records', lines=True)
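For orientations other than `table`, `to_json` has by default written datetime columns as epoch timestamps rather than readable strings. If ISO 8601 strings are preferred in the output file (an assumption about downstream needs, not something the notebook states), `date_format='iso'` can be passed. The sketch below uses an invented one-row dataframe and serialises to a string instead of a file so that the result can be inspected directly.

```python
import json

import pandas as pd

# Toy stand-in with a timezone-aware timestamp (values invented)
df_example = pd.DataFrame({
    'created_at': pd.to_datetime(['2024-06-16 21:27:19.392'], utc=True),
    'username': ['someone'],
})

# With no path given, to_json returns the JSON Lines text
jsonl_text = df_example.to_json(orient='records', lines=True, date_format='iso')

# Each non-empty line is an independent JSON record
records = [json.loads(line) for line in jsonl_text.splitlines() if line]
print(records[0]['created_at'])
```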