<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1 - Quérem

## Prerequisites

Make sure the prerequisites in [CL_LMDA_prerequisites](https://github.com/laelgelc/laelgelc/blob/main/CL_LMDA_prerequisites.ipynb) are satisfied.

## Dataset

Download the following dataset (right-click the link and choose `Open link in a new tab` to download the file):
- [tweets_all2.tsv](https://pucsp-my.sharepoint.com/:u:/g/personal/ra00341729_pucsp_edu_br/Edr0Gb9w9s1CsLhz8bpGVn8BGQwPRoXeiM01wpCDWpN5qA?e=g1WZwE)

## Importing the required libraries

In [1]:
import pandas as pd
import demoji
import re
import os
from collections import Counter
import html

## Data wrangling

### Importing the tweet raw data into a DataFrame

In [2]:
df_tweets_raw_data = pd.read_csv('tweets_all2.tsv', sep='\t')

In [3]:
df_tweets_raw_data.head(5)

Unnamed: 0,file,created_at,text_uniq,conversation_id,author_id,username,tweet_url,text,text_emojified,photo_uniq_id
0,file,created_at,tweet_00000001,conversation_id,author_id,username,tweet_url,text,text_emojified,file00000001.url
1,bolsonaro201803_n_00101.json,2018-03-30 10:21:59,tweet_00000002,convo_979665031997612034,id_731971002817744896,gizelda_m,https://twitter.com/gizelda_m/status/979665031...,RT @MendesOnca: Fracasso Bolsonaro só tem públ...,RT @MendesOnca: Fracasso Bolsonaro só tem públ...,
2,bolsonaro201803_n_00101.json,2018-03-30 10:21:32,tweet_00000003,convo_979664919955214337,id_14372459,unknown,unknown,"RT @pelegrini65: Após caluniar, ameaçar, incit...","RT @pelegrini65: Após caluniar, ameaçar, incit...",
3,bolsonaro201803_n_00101.json,2018-03-28 15:31:51,tweet_00000004,convo_979018234736373760,id_287765295,pelegrini65,https://twitter.com/pelegrini65/status/9790182...,"Após caluniar, ameaçar, incitar as pessoas con...","Após caluniar, ameaçar, incitar as pessoas con...",
4,bolsonaro201803_n_00101.json,2018-03-30 10:21:21,tweet_00000005,convo_979664873788502016,id_2551060160,PauliGVivas,https://twitter.com/PauliGVivas/status/9796648...,"RT @BohnGass: ""Nenhum apoiador de Bolsonaro ap...","RT @BohnGass: ""Nenhum apoiador de Bolsonaro ap...",


In [4]:
# Dropping the first row, which contains no useful data, and resetting the index
df_tweets_raw_data = df_tweets_raw_data.drop(index=0).reset_index(drop=True)

In [5]:
# Dropping the columns 'text_emojified' and 'photo_uniq_id' which are not used in this analysis
df_tweets_raw_data = df_tweets_raw_data.drop(columns=['text_emojified', 'photo_uniq_id'])

In [6]:
df_tweets_raw_data.shape

(1783138, 8)

#### Inspecting a few tweets

In [7]:
inspected_row = 0
print('username:' + df_tweets_raw_data.loc[inspected_row, 'username'])
print('text:' + df_tweets_raw_data.loc[inspected_row, 'text'])
print('tweet_url:' + df_tweets_raw_data.loc[inspected_row, 'tweet_url'])

username:gizelda_m
text:RT @MendesOnca: Fracasso Bolsonaro só tem público nas redes sociais. De robôs e fakes, e todos cafajestes https://t.co/B6trtJrQ7v https://…
tweet_url:https://twitter.com/gizelda_m/status/979665031997612034


### Inspecting the dataset and eliminating malformed data

#### Identifying rows that are empty in column `text`

In [8]:
print(df_tweets_raw_data['text'].isnull().sum())

24


In [9]:
df_tweets_raw_data[df_tweets_raw_data['text'].isnull()]

Unnamed: 0,file,created_at,text_uniq,conversation_id,author_id,username,tweet_url,text
18096,#Bolsonaro2018,@ultimosegundo Ahhhhhhhh.. Então tem q prender...,,,,,,
18097,#Bolsonaro2018,,,,,,,
19543,Alguém no PT deve ter uma paixão platônica pel...,@analisdocsPF @paulo_x Foi o Bolsonaro! {{Emoj...,,,,,,
19544,Alguém no PT deve ter uma paixão platônica pel...,,,,,,,
37888,Generalizando e rotulando tem que cortar mal p...,,,,,,,
37889,Bolsonaro racista safado!,@Tilitelly2 @VEJA @MPF_PGR Ele falou que todos...,,,,,,
37890,Generalizando e rotulando tem que cortar mal p...,,,,,,,
37891,Bolsonaro racista safado!,,,,,,,
49732,"Oportunista, irresponsável.",@jairbolsonaro,,,,,,
49733,"Oportunista, irresponsável.",,,,,,,


#### Dropping the rows that are empty in the column `text`

In [10]:
# Drop the rows whose column 'text' is NaN
df_tweets_raw_data = df_tweets_raw_data.dropna(subset=['text'])

# Reset the index
df_tweets_raw_data = df_tweets_raw_data.reset_index(drop=True)

In [11]:
print(df_tweets_raw_data['text'].isnull().sum())

0


#### Checking if data types are consistent

In [12]:
df_tweets_raw_data.dtypes

file               object
created_at         object
text_uniq          object
conversation_id    object
author_id          object
username           object
tweet_url          object
text               object
dtype: object

In [13]:
df_tweets_raw_data['created_at'] = pd.to_datetime(df_tweets_raw_data['created_at'])

In [14]:
df_tweets_raw_data.dtypes

file                       object
created_at         datetime64[ns]
text_uniq                  object
conversation_id            object
author_id                  object
username                   object
tweet_url                  object
text                       object
dtype: object

## Sampling the raw data according to filtering expressions

In [15]:
# Defining the filtering expressions
filter_words = ['arma', 'pátria', 'ladrão', 'cristão', 'comunista', 'família', 'liberdade', 'conservador', 'deus']

# Creating a boolean mask for filtering
mask = df_tweets_raw_data['text'].str.contains('|'.join(filter_words), case=False)

# Applying the mask to create 'df_tweets_filtered'
df_tweets_filtered = df_tweets_raw_data[mask]
df_tweets_filtered = df_tweets_filtered.reset_index(drop=True)
df_tweets_filtered.shape

(114289, 8)
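
The mask above matches substrings, so `arma` also selects tweets containing `armas` or `armadilha`, and `deus` also selects `adeus`. The sketch below contrasts that behaviour with a whole-word variant, using only the standard `re` module and three made-up sentences (assumptions, not rows from the dataset). Whether whole-word matching is preferable depends on the study design, since it would also exclude inflected forms such as `armas`.

```python
import re

filter_words = ['arma', 'pátria', 'ladrão', 'cristão', 'comunista',
                'família', 'liberdade', 'conservador', 'deus']

# Substring matching, equivalent in spirit to str.contains(..., case=False)
substring_pattern = re.compile('|'.join(filter_words), flags=re.IGNORECASE)

# Hypothetical alternative: the same words wrapped in word boundaries
word_pattern = re.compile(r'\b(?:' + '|'.join(filter_words) + r')\b', flags=re.IGNORECASE)

# Made-up sample sentences
samples = ['Deus acima de todos', 'Adeus, Bolsonaro', 'Caiu numa armadilha']

substring_hits = [bool(substring_pattern.search(s)) for s in samples]
word_hits = [bool(word_pattern.search(s)) for s in samples]

print(substring_hits)
print(word_hits)
```

Inspecting a handful of matched tweets per filter word is a quick way to judge how many unintended matches the substring approach lets through.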

In [16]:
df_tweets_filtered

Unnamed: 0,file,created_at,text_uniq,conversation_id,author_id,username,tweet_url,text
0,bolsonaro201803_n_00101.json,2018-03-28 15:31:51,tweet_00000004,convo_979018234736373760,id_287765295,pelegrini65,https://twitter.com/pelegrini65/status/9790182...,"Após caluniar, ameaçar, incitar as pessoas con..."
1,bolsonaro201803_n_00101.json,2018-03-30 04:12:13,tweet_00000046,convo_979571975545872384,id_16794066,BlogdoNoblat,https://twitter.com/BlogdoNoblat/status/979571...,Bolsonaro deve saber o que está fazendo. Porqu...
2,bolsonaro201803_n_00101.json,2018-03-30 10:16:00,tweet_00000051,convo_979363111089135617,id_955901617148235776,MariaOl25529153,https://twitter.com/MariaOl25529153/status/979...,"@FlavioBolsonaro Mais um Romário na política ,..."
3,bolsonaro201803_n_00101.json,2018-03-30 10:11:56,tweet_00000065,convo_979363111089135617,id_955901617148235776,MariaOl25529153,https://twitter.com/MariaOl25529153/status/979...,@FlavioBolsonaro Jogadores de futebol na polít...
4,bolsonaro201803_n_00102.json,2018-03-30 04:12:13,tweet_00000046,convo_979571975545872384,id_16794066,BlogdoNoblat,https://twitter.com/BlogdoNoblat/status/979571...,Bolsonaro deve saber o que está fazendo. Porqu...
...,...,...,...,...,...,...,...,...
114284,bolsonaro202304_n_00206.json,2023-04-29 13:47:12,tweet_00483876,convo_1652308555968532481,id_45473463,CarlosZarattini,https://twitter.com/CarlosZarattini/status/165...,Bolsonaristas ativaram os seus robôs e perfis ...
114285,bolsonaro202304_n_00206.json,2023-04-29 19:18:25,tweet_00487132,convo_1651912640837349377,id_1236078194878541824,JDB33858086,https://twitter.com/JDB33858086/status/1651912...,@Joovito81551003 @odilabueno1 @ORenanxD @Senso...
114286,bolsonaro202304_n_00206.json,2023-04-29 19:18:22,tweet_00481588,convo_1652391894079488000,id_950791261417627649,talitaxaguiar,https://twitter.com/talitaxaguiar/status/16523...,RT @princesacowboy: meu deus do céu de vez em ...
114287,bolsonaro202304_n_00206.json,2023-04-28 16:01:48,tweet_00481589,convo_1651980041234837518,id_1100427039406997504,princesacowboy,https://twitter.com/princesacowboy/status/1651...,meu deus do céu de vez em quando eu lembro o q...


## Cleaning data

### Removing specific Unicode characters

The dataset may need to be cleaned of invisible Unicode characters.

#### Detecting `U+2066` and `U+2069` characters

- [U+2066](https://www.compart.com/en/unicode/U+2066)
- [U+2069](https://www.compart.com/en/unicode/U+2069)

Please refer to:
- [Python RegEx](https://www.w3schools.com/python/python_regex.asp)
- [regex101](https://regex101.com/)
- [RegExr](https://regexr.com/)

In [17]:
# Defining a function to detect specific Unicode characters
def extract_unicode_characters(df, column_name):
    unicode_chars = Counter()  # Initialize a Counter to store Unicode character counts

    for value in df[column_name]:
        if isinstance(value, str):
            # Alternative (disabled): use RegEx to find any non-ASCII characters (Unicode)
            # non_ascii_chars = re.findall(r'[^\x00-\x7F]+', value)
            # Use RegEx to find specific Unicode characters - adjust the expression accordingly
            specific_unicode_chars = re.findall(r'[\u2066\u2069]', value)
            unicode_chars.update(specific_unicode_chars)

    return unicode_chars

# Inspect the dataframe for specific Unicode characters
unicode_counts = extract_unicode_characters(df_tweets_filtered, 'text')

# Print the results
for char, count in unicode_counts.items():
    print(f'Character {char}: Count = {count}')

Character ⁦: Count = 96
Character ⁩: Count = 102


#### Removing `U+2066` and `U+2069` characters

In [18]:
# Defining a function to remove specific Unicode characters
def remove_specific_unicode(input_line):
    # Using RegEx to replace specific Unicode characters - adjust the expression accordingly
    cleaned_line = re.sub(r'[\u2066\u2069]', '', input_line)
    return cleaned_line

# Removing specific Unicode characters
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(remove_specific_unicode)
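
Because `U+2066` (left-to-right isolate) and `U+2069` (pop directional isolate) are invisible, the printed counts above are hard to read. As a minimal sketch on a made-up string (an assumption, not a tweet from the dataset), the detection and removal steps can be checked by printing code points instead of the characters themselves:

```python
import re
from collections import Counter

# Hypothetical sample text in which a mention and a hashtag are wrapped in U+2066 / U+2069
sample = 'Veja \u2066@usuario\u2069 e \u2066#tag\u2069'

pattern = r'[\u2066\u2069]'

# Count each invisible character, then strip them all
counts = Counter(re.findall(pattern, sample))
cleaned = re.sub(pattern, '', sample)

# Show code points rather than the invisible characters
print({f'U+{ord(char):04X}': count for char, count in counts.items()})
print(repr(cleaned))
```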

### Replacing the `LF` character by a space

Some tweets, especially the retweeted ones, contain multiple lines of text.

In [19]:
# Defining a function to replace the `LF` character by a space
def remove_cr_lf(input_line):
    # Using RegEx to replace LF by a space
    cleaned_line = re.sub(r'\n', ' ', input_line)
    return cleaned_line

# Applying the function to the 'text' column in your DataFrame
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(remove_cr_lf)

### Replacing the `U+00A0` (No-Break Space) character by a space

In [20]:
# Defining a function to replace the `U+00A0` character by a space
def replace_no_break_space(input_line):
    # Using RegEx to replace U+00A0 (No-Break Space) by a regular space
    cleaned_line = re.sub(r'\u00a0', ' ', input_line)
    return cleaned_line

# Applying the function to the 'text' column in your DataFrame
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(replace_no_break_space)

### Removing URLs

In [21]:
# Defining a function to remove URLs
def remove_urls(input_string):
    modified_string = re.sub(r"((http|https):\/\/)?([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,6}(\/[a-zA-Z0-9-._~:\/?#[\]@!$&'()*+,;=]*)?\/?", '', input_string)
    return modified_string

# Removing URLs
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(remove_urls)
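
The pattern above treats the scheme as optional, so it matches anything shaped like `word.letters`, not only strings starting with `http://` or `https://`. The sketch below applies the same pattern to two made-up sentences (assumptions for illustration): one with a shortened link, and one where a missing space after a full stop makes ordinary text look like a domain name.

```python
import re

# Same pattern as in remove_urls
URL_PATTERN = r"((http|https):\/\/)?([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,6}(\/[a-zA-Z0-9-._~:\/?#[\]@!$&'()*+,;=]*)?\/?"

# Made-up sample sentences
samples = [
    'Leia em https://t.co/B6trtJrQ7v agora',
    'Chegou ao fim.Bolsonaro venceu',
]

cleaned = [re.sub(URL_PATTERN, '', s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

In the second sentence, `fim.Bolson` is removed as if it were a domain, leaving `aro`. If this matters for the corpus, one option is to make the scheme mandatory, at the cost of keeping scheme-less links.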

### Cleaning HTML entities

In [22]:
def clean_html_entities(input_line):
    # Converting HTML entities to their corresponding characters
    decoded_line = html.unescape(input_line)
    # Removing HTML tags
    cleaned_line = re.sub(r'<.*?>', '', decoded_line)
    cleaned_line = re.sub(r'<', '', cleaned_line)
    cleaned_line = re.sub(r'>', '', cleaned_line)
    return cleaned_line

# Applying the function to the 'text' column in your DataFrame
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(clean_html_entities)
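
As a minimal sketch of what `clean_html_entities` does, the steps can be traced on a made-up string (an assumption, not a tweet from the dataset). Here `str.replace` stands in for the two single-character `re.sub` calls, which behave the same way for literal `<` and `>`.

```python
import html
import re

# Hypothetical raw text containing HTML entities
raw = 'Lula &amp; Bolsonaro &gt; &lt;b&gt;debate&lt;/b&gt;'

# Step 1: decode entities into the characters they stand for
decoded = html.unescape(raw)

# Step 2: remove anything that now looks like an HTML tag
without_tags = re.sub(r'<.*?>', '', decoded)

# Step 3: remove any stray angle brackets that remain
cleaned = without_tags.replace('<', '').replace('>', '')

print(decoded)
print(cleaned)
```

Text that was escaped as `&lt;b&gt;` becomes a tag after decoding and is removed, while the ampersand survives as a plain `&`.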

### Removing the preceding `.`

In [23]:
# Defining a function to remove a leading dot
def clean_text(input_line):
    # Removing a dot at the start of the line when it is followed by a space, '#', a word character, '+' or '@'
    cleaned_line = re.sub(r'(^\.)(?=[ #\w+@])', '', input_line)
    return cleaned_line

# Applying the function to the 'text' column in your DataFrame
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(clean_text)
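
The lookahead in the pattern restricts the removal to a single dot at the very start of the text, such as the dot some users place before an @mention. The following sketch checks this on four made-up strings (assumptions for illustration):

```python
import re

# Same pattern as in clean_text
LEADING_DOT = r'(^\.)(?=[ #\w+@])'

# Made-up sample strings
samples = ['.@jairbolsonaro falou', '. #Bolsonaro2018', '...e agora?', 'Fim.']

cleaned = [re.sub(LEADING_DOT, '', s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

An ellipsis at the start is left untouched because the first dot is followed by another dot, and dots elsewhere in the text are never affected.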

### Dropping duplicates

#### Retweets

Retweets begin with one of the following RegEx patterns in the column `text`. They are often dropped because they duplicate the original tweets. In this study, though, the dataset may not include the original tweets because it is a 1% random sample. Therefore, only the first occurrence of each retweeted text is kept.

- `\bRT @\w+\s*:`
- `\brt @\w+\s*:`
- `\bRT @\w+\s*`
- `\bRT:\s*`
- `\bRT\s*`

In [24]:
# Create a new column 'no_retweet' containing the contents of the column 'text' without any preceding 'RT @mentions:'
df_tweets_filtered['no_retweet'] = df_tweets_filtered['text'].str.replace(r'\bRT @\w+\s*:|\brt @\w+\s*:|\bRT @\w+\s*|\bRT:\s*|\bRT\s*', '', regex=True)

In [25]:
# Drop duplicate rows except the first occurrence based on 'no_retweet'
df_tweets_filtered.drop_duplicates(subset='no_retweet', keep='first', inplace=True)
df_tweets_filtered = df_tweets_filtered.reset_index(drop=True)
df_tweets_filtered.shape

(33361, 9)
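
The idea behind the `no_retweet` column can be sketched without pandas: strip the retweet prefix to build a comparison key, then keep only the first text seen for each key. The texts and user names below are made up (assumptions), and the loop is a plain-Python stand-in for `drop_duplicates(keep='first')`, not the notebook's own code.

```python
import re

# Same alternation as used to build the 'no_retweet' column
RT_PREFIX = r'\bRT @\w+\s*:|\brt @\w+\s*:|\bRT @\w+\s*|\bRT:\s*|\bRT\s*'

# Made-up texts: two retweets of the same content plus one original tweet
texts = [
    'RT @userA: Deus acima de todos',
    'RT @userB: Deus acima de todos',
    'Liberdade para todos',
]

seen = set()
kept = []
for text in texts:
    key = re.sub(RT_PREFIX, '', text)  # comparison key without the retweet prefix
    if key not in seen:
        seen.add(key)
        kept.append(text)  # the original text, prefix included, is what gets kept

print(kept)
```

The kept row still carries its `RT @user:` prefix, which is why retweets remain visible in the final corpus.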

#### Duplicate tweets

The dataset was built in such a way that, if a tweet had more than one photo, one copy of the tweet was included per unique photo. Since we are concerned with analysing just the text, those duplicates should be removed. Tweets that bear the same `tweet_url` are duplicates, so only the first one is kept.

In [26]:
df_tweets_filtered.drop_duplicates(subset='tweet_url', keep='first', inplace=True)
df_tweets_filtered = df_tweets_filtered.reset_index(drop=True)
df_tweets_filtered.shape

(21888, 9)

#### @mentioned tweets

A few users post the same tweet several times, each copy @mentioning different users. Those duplicates should be removed.

In [27]:
# Create a new column 'no_mention' containing the contents of the column 'text' with every @mention removed
df_tweets_filtered['no_mention'] = df_tweets_filtered['text'].str.replace(r'@\w+\s*', '', regex=True)

# Drop duplicate rows except the first occurrence based on 'no_mention'
df_tweets_filtered.drop_duplicates(subset='no_mention', keep='first', inplace=True)
df_tweets_filtered = df_tweets_filtered.reset_index(drop=True)
df_tweets_filtered.shape

(20840, 10)
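
Note that the pattern `@\w+\s*` removes every @mention in the text, not only those at the beginning. A small sketch on made-up texts (assumptions for illustration) shows the resulting comparison keys and which texts would survive a first-occurrence deduplication on them:

```python
import re

# Same pattern as used to build the 'no_mention' column
MENTION = r'@\w+\s*'

# Made-up texts: the same message addressed to different users, plus an unrelated one
texts = [
    '@userA @userB Tá errado',
    '@userC Tá errado',
    'Concordo com @userD aqui',
]

keys = [re.sub(MENTION, '', t) for t in texts]

# Keep the first text seen for each key (dicts preserve insertion order)
first_by_key = {}
for text, key in zip(texts, keys):
    first_by_key.setdefault(key, text)
kept = list(first_by_key.values())

print(keys)
print(kept)
```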

#### Duplicate texts

Rows whose column `text` is identical are duplicates. Only the first occurrence of each is kept.

In [28]:
df_tweets_filtered.drop_duplicates(subset='text', keep='first', inplace=True)
df_tweets_filtered = df_tweets_filtered.reset_index(drop=True)
df_tweets_filtered.shape

(20840, 10)

## Inspecting and eliminating duplicates

### Creating a DataFrame index

In [29]:
df_tweets_filtered['df_index'] = df_tweets_filtered.index.astype(str).str.zfill(6)

### Sorting the DataFrame by the column `text` to enable duplicate detection

In [30]:
# Sorting the DataFrame by the 'text' column in ascending order
df_tweets_filtered = df_tweets_filtered.sort_values(by='text', ascending=True)

### Exporting the filtered data into a file for inspection

In [31]:
df_tweets_filtered[['df_index', 'text']].to_csv('tweets_emojified.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')

### Inspecting a few tweets

In [33]:
inspected_row = 1277
print('text:' + df_tweets_filtered.loc[inspected_row, 'text'])

text:#3em1 Alckmin: “armas não resolvem o problema da violência e segurança” Bolsonaro: “quando Alckmin deixar de andar de carro blindado e seguranças armados, eu passo a acreditar nele” 😂😂😂😂😂😂


### Dropping identified duplicates

In [34]:
# Define the list of indexes to drop
indexes_to_drop = [
    1277, 
    650, 
    11054, 
    7878, 
    9648, 
    19477, 
    81, 
    175, 
    167, 
    110, 
    113, 
    21, 
    156, 
    185, 
    15920, 
    15032, 
    15031, 
    15029, 
    15023, 
    15033, 
    15034, 
    15024, 
    14982, 
    14981, 
    14987, 
    14988, 
    15056, 
    15021, 
    15070, 
    15022, 
    14978, 
    15046, 
    15071, 
    10445, 
    17674, 
    17621, 
    17621, 
    17630, 
    17613, 
    17623, 
    17672, 
    17633, 
    17616, 
    17659, 
    17631, 
    17559, 
    17639, 
    17608, 
    8793, 
    716, 
    3411, 
    7933, 
    14757, 
    10727, 
    16427, 
    16681, 
    16691, 
    14045, 
    5674, 
    17434, 
    13247, 
    20624, 
    17674, 
    17621, 
    17630, 
    17613, 
    17623, 
    17672, 
    17633, 
    17616, 
    17659, 
    17631, 
    17559, 
    17685, 
    17608, 
    6449, 
    6450, 
    9690, 
    4711, 
    4238, 
    12567, 
    8779, 
    4448, 
    4491, 
    16674, 
    16768, 
    3737, 
    3749, 
    3597, 
    1242, 
    8930, 
    10687, 
    10664, 
    3016, 
    2950, 
    7975, 
    3123, 
    5399, 
    11833, 
    12439, 
    12559, 
    12579, 
    12615, 
    12662, 
    16933, 
    16501, 
    13325, 
    4674, 
    16420, 
    16730, 
    16551, 
    16423, 
    5889, 
    5870, 
    5822, 
    5908, 
    5791, 
    5830, 
    5876, 
    10501, 
    52, 
    133, 
    5854, 
    5872, 
    5857, 
    2562, 
    646, 
    5369, 
    5406, 
    5416, 
    19284, 
    18859, 
    19194, 
    19174, 
    19121, 
    19038, 
    18952, 
    19021, 
    18840, 
    19181, 
    19170, 
    19264, 
    18898, 
    2370, 
    2357, 
    2266, 
    2290, 
    4725, 
    4924, 
    4902, 
    3331, 
    3446, 
    3393, 
    8906, 
    2949, 
    7895, 
    7983, 
    7683, 
    19574, 
    19610, 
    19633, 
    7429, 
    8166, 
    7597, 
    14637, 
    9147, 
    8996, 
    8861, 
    18907, 
    10546, 
    15942, 
    15933, 
    15930, 
    15916, 
    15923, 
    139, 
    2644, 
    10526, 
    10870, 
    14213, 
    2683, 
    5741, 
    3554, 
    12876, 
    12755, 
    12988, 
    12715, 
    2664, 
    2244, 
    101, 
    17055, 
    1115, 
    6188, 
    14847, 
    14732, 
    15088, 
    7514, 
    2363, 
    10605, 
    3163, 
    3142, 
    2280, 
    2599, 
    2342, 
    2364, 
    2226, 
    8120, 
    8009, 
    8007, 
    8280, 
    8195, 
    11303, 
    10409, 
    10771, 
    10945, 
    10806, 
    10741, 
    17021, 
    17610, 
    9887, 
    16207, 
    13424, 
    13420, 
    13574, 
    13611, 
    13456, 
    13664, 
    1965, 
    5542, 
    12859, 
    12882, 
    12492, 
    12401, 
    3170, 
    3133, 
    3169, 
    3167, 
    3164, 
    3143, 
    3166, 
    3162, 
    3158, 
    3156, 
    3157, 
    3153, 
    3150, 
    3149, 
    3147, 
    3601, 
    3471, 
    19191, 
    18851
]

# Dropping the rows with the specified indexes
df_tweets_filtered = df_tweets_filtered.drop(indexes_to_drop)

### Sorting the DataFrame by the index to revert the DataFrame back to its original order

In [35]:
# Sorting the DataFrame back to the original order by the index
df_tweets_filtered = df_tweets_filtered.sort_index()
df_tweets_filtered = df_tweets_filtered.reset_index(drop=True)

In [36]:
df_tweets_filtered

Unnamed: 0,file,created_at,text_uniq,conversation_id,author_id,username,tweet_url,text,no_retweet,no_mention,df_index
0,bolsonaro201803_n_00101.json,2018-03-28 15:31:51,tweet_00000004,convo_979018234736373760,id_287765295,pelegrini65,https://twitter.com/pelegrini65/status/9790182...,"Após caluniar, ameaçar, incitar as pessoas con...","Após caluniar, ameaçar, incitar as pessoas con...","Após caluniar, ameaçar, incitar as pessoas con...",000000
1,bolsonaro201803_n_00101.json,2018-03-30 04:12:13,tweet_00000046,convo_979571975545872384,id_16794066,BlogdoNoblat,https://twitter.com/BlogdoNoblat/status/979571...,Bolsonaro deve saber o que está fazendo. Porqu...,Bolsonaro deve saber o que está fazendo. Porqu...,Bolsonaro deve saber o que está fazendo. Porqu...,000001
2,bolsonaro201803_n_00101.json,2018-03-30 10:16:00,tweet_00000051,convo_979363111089135617,id_955901617148235776,MariaOl25529153,https://twitter.com/MariaOl25529153/status/979...,"@FlavioBolsonaro Mais um Romário na política ,...","@FlavioBolsonaro Mais um Romário na política ,...","Mais um Romário na política , que Deus ajude o...",000002
3,bolsonaro201803_n_00102.json,2018-03-28 18:29:28,tweet_00000100,convo_979062935103459331,id_44449830,lucianagenro,https://twitter.com/lucianagenro/status/979062...,A esquerda não tem conseguido comunicar suas p...,A esquerda não tem conseguido comunicar suas p...,A esquerda não tem conseguido comunicar suas p...,000003
4,bolsonaro201803_n_00102.json,2018-03-30 09:57:48,tweet_00000110,convo_979658943818584064,id_912132396,rocoguima,https://twitter.com/rocoguima/status/979658943...,RT @AurystellaS: @BlogdoNoblat Vc sabe informa...,@BlogdoNoblat Vc sabe informar quantas vezes ...,RT : Vc sabe informar quantas vezes Bolsonaro ...,000004
...,...,...,...,...,...,...,...,...,...,...,...
20596,bolsonaro202304_n_00205.json,2023-04-29 19:19:35,tweet_00487119,convo_1652392200167145472,id_1547227306913153026,LuccaSo44679209,https://twitter.com/LuccaSo44679209/status/165...,"RT @LuccaSo44679209: @CiresCanisio Não, o Lula...","@CiresCanisio Não, o Lula não da uma refinari...","RT : Não, o Lula não da uma refinaria atroca d...",020835
20597,bolsonaro202304_n_00205.json,2023-04-29 19:19:18,tweet_00487120,convo_1652359568280678401,id_1547227306913153026,LuccaSo44679209,https://twitter.com/LuccaSo44679209/status/165...,"@CiresCanisio Não, o Lula não da uma refinaria...","@CiresCanisio Não, o Lula não da uma refinaria...","Não, o Lula não da uma refinaria atroca de pro...",020836
20598,bolsonaro202304_n_00205.json,2023-04-29 19:19:14,tweet_00487123,convo_1652308555968532481,id_1554492869825683457,Andre19lll,https://twitter.com/Andre19lll/status/16523085...,@eunaovoupararde @CarlosZarattini Os índices d...,@eunaovoupararde @CarlosZarattini Os índices d...,"Os índices de desemprego, PIB, inflação e qual...",020837
20599,bolsonaro202304_n_00206.json,2023-04-29 19:18:59,tweet_00487127,convo_1651881289497149440,id_1585200142440882179,priscila19865,https://twitter.com/priscila19865/status/16518...,@ValS265451870 @Guthbsb @marcia_miami Tá errad...,@ValS265451870 @Guthbsb @marcia_miami Tá errad...,"Tá errado,EU podendo m.a.t.a.v.a um FDP desse,...",020838


## Exporting to a file

### JSONL format

In [37]:
df_tweets_filtered[['created_at', 'author_id', 'username', 'tweet_url', 'text']].to_json('tweets_filtered.jsonl', orient='records', lines=True)

### TSV format

In [38]:
df_tweets_filtered[['created_at', 'author_id', 'username', 'tweet_url', 'text']].to_csv('tweets_filtered.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')

## Importing the Target Corpus into a DataFrame

In [39]:
df_tweets_filtered = pd.read_json('tweets_filtered.jsonl', lines=True)

In [40]:
df_tweets_filtered.head(5)

Unnamed: 0,created_at,author_id,username,tweet_url,text
0,2018-03-28 15:31:51,id_287765295,pelegrini65,https://twitter.com/pelegrini65/status/9790182...,"Após caluniar, ameaçar, incitar as pessoas con..."
1,2018-03-30 04:12:13,id_16794066,BlogdoNoblat,https://twitter.com/BlogdoNoblat/status/979571...,Bolsonaro deve saber o que está fazendo. Porqu...
2,2018-03-30 10:16:00,id_955901617148235776,MariaOl25529153,https://twitter.com/MariaOl25529153/status/979...,"@FlavioBolsonaro Mais um Romário na política ,..."
3,2018-03-28 18:29:28,id_44449830,lucianagenro,https://twitter.com/lucianagenro/status/979062...,A esquerda não tem conseguido comunicar suas p...
4,2018-03-30 09:57:48,id_912132396,rocoguima,https://twitter.com/rocoguima/status/979658943...,RT @AurystellaS: @BlogdoNoblat Vc sabe informa...


In [41]:
df_tweets_filtered.shape

(20601, 5)

In [42]:
df_tweets_filtered.dtypes

created_at    datetime64[ns]
author_id             object
username              object
tweet_url             object
text                  object
dtype: object

## Replacing hashtags

In [43]:
# Defining a function to format the hashtagged string
def format_hashtagged_string(input_line):
    # Helper that lowercases a matched hashtag and wraps it in the 'HASHTAG' prefix and '_h' suffix
    def process_hashtagged_string(s):
        # Lowercase the string
        s = s.lower()
        # Add the appropriate prefix and suffix
        s = f'HASHTAG{s}_h'
        return s

    # Use RegEx to find and process each hashtagged string
    processed_line = re.sub(r'(#\w+)', lambda match: process_hashtagged_string(match.group(1)), input_line)
    return processed_line

# Formatting the hashtagged strings
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(format_hashtagged_string)
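
To make the resulting token shape concrete, here is a minimal sketch applying the same substitution to a made-up sentence (an assumption for illustration). The helper name `format_hashtag` is hypothetical and condenses the nested function above.

```python
import re

def format_hashtag(match):
    # Lowercase the matched hashtag (including its '#') and wrap it
    return f'HASHTAG{match.group(1).lower()}_h'

# Made-up sample sentence
sample = 'Vote #Bolsonaro2018 no #Brasil!'

formatted = re.sub(r'(#\w+)', format_hashtag, sample)
print(formatted)
```

Because the capture group includes the `#` character, it is retained inside the token, between the `HASHTAG` prefix and the lowercased tag.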

## Replacing emojis

### Demojifying the column `text`

In [44]:
# Defining a function to demojify a string
def demojify_line(input_line):
    demojified_line = demoji.replace_with_desc(input_line, sep='<em>')
    return demojified_line

df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(demojify_line)

#### Exporting the filtered data into a file for inspection

In [45]:
df_tweets_filtered[['text']].to_csv('tweets_emojified1.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')

### Separating the demojified strings with spaces

In [46]:
# Defining a function to separate the demojified strings with spaces
def preprocess_line(input_line):
    # Add a space before each delimiter '<em>' that is not already preceded by one
    preprocessed_line = re.sub(r'(?<! )<em>', ' <em>', input_line)
    # Add a space after each delimiter '<em>' that is not already followed by one
    preprocessed_line = re.sub(r'<em>(?! )', '<em> ', preprocessed_line)
    return preprocessed_line

# Separating the demojified strings with spaces
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(preprocess_line)

#### Exporting the filtered data into a file for inspection

In [47]:
df_tweets_filtered[['text']].to_csv('tweets_emojified2.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')

### Formatting the demojified strings

In [48]:
# Defining a function to format the demojified string
def format_demojified_string(input_line):
    # Helper that turns one demojified description into a single token
    def process_demojified_string(s):
        # Lowercase the string
        s = s.lower()
        # Replace each space, and each colon followed by a space, with an underscore
        s = re.sub(r'(: )| ', '_', s)
        # Add the appropriate prefix and suffix
        s = f'EMOJI{s}e'
        return s

    # Use RegEx to find and process each demojified string
    processed_line = re.sub(r'<em>(.*?)<em>', lambda match: process_demojified_string(match.group(1)), input_line)
    return processed_line

# Formatting the demojified strings
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(format_demojified_string)
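
The spacing and formatting steps can be traced end to end without `demoji`. The sketch below starts from a string that is assumed to resemble the output of the demojifying step (descriptions wrapped in `<em>` delimiters) and applies the same regular expressions. The helper name `to_token` is hypothetical.

```python
import re

# Assumed shape of a demojified tweet ending in two emojis
demojified = 'kkk<em>face with tears of joy<em><em>flag: Brazil<em>'

# Spacing step: pad every '<em>' delimiter with spaces where missing
spaced = re.sub(r'(?<! )<em>', ' <em>', demojified)
spaced = re.sub(r'<em>(?! )', '<em> ', spaced)

def to_token(match):
    # Lowercase, turn spaces and ': ' into underscores, then wrap the description
    description = re.sub(r'(: )| ', '_', match.group(1).lower())
    return f'EMOJI{description}e'

# Formatting step: replace each delimited description with a single token
formatted = re.sub(r'<em>(.*?)<em>', to_token, spaced)

print(repr(spaced))
print(repr(formatted))
```

The padding added in the spacing step is what produces the underscores right after `EMOJI` and right before the final `e`, and it keeps adjacent emojis as separate tokens.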

### Replacing the `pipe` character by the `-` character in the `text` column

Further on, a few columns of the DataFrame are exported into the file `tweets.txt`, whose columns are delimited by the `pipe` character. Any occurrence of the `pipe` character in the `text` column should therefore be replaced by another character beforehand.

In [49]:
# Defining a function to replace the 'pipe' character by the '-' character
def replace_pipe_with_hyphen(input_string):
    modified_string = re.sub(r'\|', '-', input_string)
    return modified_string

# Replacing the 'pipe' character by the '-' character
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(replace_pipe_with_hyphen)

#### Exporting the filtered data into a file for inspection

In [50]:
df_tweets_filtered[['text']].to_csv('tweets_emojified3.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')

## Tokenising

Please refer to [What is tokenization in NLP?](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/).

In [51]:
# Defining a function to tokenise a string
def tokenise_string(input_line):
    # Replace URLs with placeholders
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+\b'
    placeholder = '<URL>'  # Choose a unique placeholder
    urls = re.findall(url_pattern, input_line)
    tokenised_line = re.sub(url_pattern, placeholder, input_line)  # Replace URLs with placeholders
    
    # Replace curly quotes with straight ones
    tokenised_line = tokenised_line.replace('“', '"').replace('”', '"').replace("‘", "'").replace("’", "'")
    # Separate common punctuation marks with spaces
    tokenised_line = re.sub(r'([.\!?,"\'/()])', r' \1 ', tokenised_line)
    # Add a space before '#' if it is not already preceded by whitespace
    tokenised_line = re.sub(r'(?<!\s)#', r' #', tokenised_line)
    # Collapse runs of whitespace into a single space
    tokenised_line = re.sub(r'\s+', ' ', tokenised_line)
    
    # Replace the placeholders with the respective URLs
    for url in urls:
        tokenised_line = tokenised_line.replace(placeholder, url, 1)
    
    return tokenised_line

# Tokenising the strings
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(tokenise_string)
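
As a rough illustration of the effect of tokenising, the sketch below reproduces the core steps of `tokenise_string` (quote normalisation, punctuation separation, hashtag spacing and whitespace collapsing) without the URL placeholder logic, and applies them to a made-up sentence. The function name `tokenise_core` and the sample are assumptions.

```python
import re

def tokenise_core(line):
    # Replace curly quotes with straight ones
    line = line.replace('“', '"').replace('”', '"').replace('‘', "'").replace('’', "'")
    # Separate common punctuation marks with spaces
    line = re.sub(r'([.\!?,"\'/()])', r' \1 ', line)
    # Add a space before '#' if it is not already preceded by whitespace
    line = re.sub(r'(?<!\s)#', r' #', line)
    # Collapse runs of whitespace into a single space
    return re.sub(r'\s+', ' ', line)

# Made-up sample sentence
sample = 'Ele disse: “armas não resolvem”, né?#Brasil'

tokens = tokenise_core(sample)
print(tokens)
```

Colons and semicolons are not in the punctuation class, so `disse:` stays a single token in this example.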

## Creating the files `file_index.txt` and `tweets.txt`

### Creating column `text_id`

In [52]:
df_tweets_filtered['text_id'] = 't' + df_tweets_filtered.index.astype(str).str.zfill(6)

### Creating column `conversation`

In [53]:
df_tweets_filtered['conversation'] = 'v:' + df_tweets_filtered['author_id'].str.replace('id_', '')

### Creating column `date`

In [54]:
# Convert 'created_at' to datetime format
df_tweets_filtered['created_at'] = pd.to_datetime(df_tweets_filtered['created_at'])

# Extract the date part (without time) into a new column 'date'
df_tweets_filtered['date'] = df_tweets_filtered['created_at'].dt.date

# Add the prefix 'd:' to the 'date' values
df_tweets_filtered['date'] = 'd:' + df_tweets_filtered['date'].astype(str)

### Creating column `text_url`

In [55]:
df_tweets_filtered['text_url'] = 'url:' + df_tweets_filtered['tweet_url']

### Creating column `user`

In [56]:
df_tweets_filtered['user'] = 'u:' + df_tweets_filtered['username']

### Creating column `content`

In [57]:
df_tweets_filtered['content'] = 'c:' + df_tweets_filtered['text']

### Reordering the created columns

Please refer to:
- [Python - List Comprehension 1](https://www.w3schools.com/python/python_lists_comprehension.asp)
- [Python - List Comprehension 2](https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/)

In [58]:
# Reorder the columns (a list comprehension collects every column other than 'text_id', 'conversation', 'date', 'text_url', 'user' and 'content')
df_tweets_filtered = df_tweets_filtered[['text_id', 'conversation', 'date', 'text_url', 'user', 'content'] + [col for col in df_tweets_filtered.columns if col not in ['text_id', 'conversation', 'date', 'text_url', 'user', 'content']]]

In [59]:
df_tweets_filtered

Unnamed: 0,text_id,conversation,date,text_url,user,content,created_at,author_id,username,tweet_url,text
0,t000000,v:287765295,d:2018-03-28,url:https://twitter.com/pelegrini65/status/979...,u:pelegrini65,"c:Após caluniar , ameaçar , incitar as pessoas...",2018-03-28 15:31:51,id_287765295,pelegrini65,https://twitter.com/pelegrini65/status/9790182...,"Após caluniar , ameaçar , incitar as pessoas c..."
1,t000001,v:16794066,d:2018-03-30,url:https://twitter.com/BlogdoNoblat/status/97...,u:BlogdoNoblat,c:Bolsonaro deve saber o que está fazendo . Po...,2018-03-30 04:12:13,id_16794066,BlogdoNoblat,https://twitter.com/BlogdoNoblat/status/979571...,Bolsonaro deve saber o que está fazendo . Porq...
2,t000002,v:955901617148235776,d:2018-03-30,url:https://twitter.com/MariaOl25529153/status...,u:MariaOl25529153,c:@FlavioBolsonaro Mais um Romário na política...,2018-03-30 10:16:00,id_955901617148235776,MariaOl25529153,https://twitter.com/MariaOl25529153/status/979...,"@FlavioBolsonaro Mais um Romário na política ,..."
3,t000003,v:44449830,d:2018-03-28,url:https://twitter.com/lucianagenro/status/97...,u:lucianagenro,c:A esquerda não tem conseguido comunicar suas...,2018-03-28 18:29:28,id_44449830,lucianagenro,https://twitter.com/lucianagenro/status/979062...,A esquerda não tem conseguido comunicar suas p...
4,t000004,v:912132396,d:2018-03-30,url:https://twitter.com/rocoguima/status/97965...,u:rocoguima,c:RT @AurystellaS: @BlogdoNoblat Vc sabe infor...,2018-03-30 09:57:48,id_912132396,rocoguima,https://twitter.com/rocoguima/status/979658943...,RT @AurystellaS: @BlogdoNoblat Vc sabe informa...
...,...,...,...,...,...,...,...,...,...,...,...
20596,t020596,v:1547227306913153026,d:2023-04-29,url:https://twitter.com/LuccaSo44679209/status...,u:LuccaSo44679209,"c:RT @LuccaSo44679209: @CiresCanisio Não , o L...",2023-04-29 19:19:35,id_1547227306913153026,LuccaSo44679209,https://twitter.com/LuccaSo44679209/status/165...,"RT @LuccaSo44679209: @CiresCanisio Não , o Lul..."
20597,t020597,v:1547227306913153026,d:2023-04-29,url:https://twitter.com/LuccaSo44679209/status...,u:LuccaSo44679209,"c:@CiresCanisio Não , o Lula não da uma refina...",2023-04-29 19:19:18,id_1547227306913153026,LuccaSo44679209,https://twitter.com/LuccaSo44679209/status/165...,"@CiresCanisio Não , o Lula não da uma refinari..."
20598,t020598,v:1554492869825683457,d:2023-04-29,url:https://twitter.com/Andre19lll/status/1652...,u:Andre19lll,c:@eunaovoupararde @CarlosZarattini Os índices...,2023-04-29 19:19:14,id_1554492869825683457,Andre19lll,https://twitter.com/Andre19lll/status/16523085...,@eunaovoupararde @CarlosZarattini Os índices d...
20599,t020599,v:1585200142440882179,d:2023-04-29,url:https://twitter.com/priscila19865/status/1...,u:priscila19865,c:@ValS265451870 @Guthbsb @marcia_miami Tá err...,2023-04-29 19:18:59,id_1585200142440882179,priscila19865,https://twitter.com/priscila19865/status/16518...,@ValS265451870 @Guthbsb @marcia_miami Tá errad...


### Creating the file `file_index.txt`

In [60]:
df_tweets_filtered[['text_id', 'conversation', 'date', 'text_url']].to_csv('file_index.txt', sep=' ', index=False, header=False, encoding='utf-8', lineterminator='\n')

### Creating the file `tweets.txt`

In [61]:
folder = 'tweets'
try:
    os.mkdir(folder)
    print(f'Folder {folder} created!')
except FileExistsError:
    print(f'Folder {folder} already exists')

Folder tweets created!


Note: The parameters `doublequote=False` and `escapechar=' '` prevent the `content` column from being wrapped in double quotes when a sentence contains characters that would otherwise need escaping, such as the double quote `"` itself. Quoted fields cause a malformed response from TreeTagger.

In [62]:
df_tweets_filtered[['text_id', 'conversation', 'date', 'user', 'content']].to_csv(f'{folder}/tweets.txt', sep='|', index=False, header=False, encoding='utf-8', lineterminator='\n', doublequote=False, escapechar=' ')
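
To see what `doublequote=False` and `escapechar=' '` change, the following minimal sketch writes one hypothetical row twice: once with the default quoting and once with the two options. It uses the standard-library `csv` module, which accepts the same `doublequote` and `escapechar` options, rather than pandas itself, so it only approximates what `to_csv` does. The helper `write_row` and the sample row are illustrative assumptions, not part of the notebook.

```python
import csv
import io

def write_row(row, **writer_options):
    """Serialise one row with '|' as the separator and return the resulting line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='|', lineterminator='\n', **writer_options)
    writer.writerow(row)
    return buffer.getvalue()

# Hypothetical row, shaped like the text_id and content columns (not taken from the dataset)
sample_row = ['t000000', 'c:Ele disse "não" ontem']

# Default behaviour: the field is wrapped in quotes and embedded quotes are doubled
default_line = write_row(sample_row)

# With the two options: embedded quotes are prefixed with the escape character instead
escaped_line = write_row(sample_row, doublequote=False, escapechar=' ')

print(repr(default_line))
print(repr(escaped_line))
```

The exact spacing of the second line can differ between Python versions, so it is safer to inspect the printed result than to assume a particular layout.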

## Tagging with TreeTagger

- On Visual Studio Code (VS Code), open the folder where your project is located with `Open Folder...`
- Open a WSL Ubuntu Terminal on VS Code
- **Important**: Activate the `my_env` Python environment by executing `source "$HOME"/my_env/bin/activate`
- Proceed as indicated below

Purpose: Annotate the texts in `tweets/tweets.txt` with part-of-speech and lemma information.
- Input
    - `file_index.txt`
    - `tweets/tweets.txt`
- Output
    - `tweets/tagged.txt`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash treetagging.sh
--- treetagging t000000 / t018205 ---
        reading parameters ...
        tagging ...
         finished.
--- treetagging t000001 / t018205 ---
        reading parameters ...
        tagging ...
         finished.
--- treetagging t000002 / t018205 ---
        reading parameters ...
        tagging ...
         finished.
--- treetagging t000003 / t018205 ---
        reading parameters ...
        tagging ...
         finished.
<omitted>
```

## Processing `CL_St1_Ph11_Querem.ipynb`

Run the notebook `CL_St1_Ph11_Querem.ipynb`.

## Processing `tokenstypes`

Purpose: Capture the content tokens (individual occurrences of words) and the content types (distinct words, regardless of how often they occur) from `tweets/tagged.txt`.
- Input
    - `file_index.txt`
    - `tweets/tagged.txt`
- Output
    - `tweets/tokens.txt`
    - `tweets/types.txt`
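
The distinction between tokens and types can be illustrated with a few lines of Python. This is only a conceptual sketch on a hypothetical list of lemmas; it does not reproduce what `tokenstypes.sh` does internally, and the layout of `tweets/tagged.txt` is not assumed here.

```python
from collections import Counter

# Hypothetical lemmatised content words from one tweet (not taken from the dataset)
tokens = ['governo', 'querer', 'povo', 'querer', 'governo', 'querer']

# Tokens: every occurrence counts
token_count = len(tokens)

# Types: each distinct word counts once, however often it occurs
type_counts = Counter(tokens)
types = sorted(type_counts)

print(token_count)             # number of tokens
print(types)                   # list of types
print(type_counts['querer'])   # occurrences of one type
```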

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash tokenstypes.sh
--- tokenstypes t000000 / 18206 ---
--- tokenstypes t000001 / 18206 ---
--- tokenstypes t000002 / 18206 ---
--- tokenstypes t000003 / 18206 ---
--- tokenstypes t000004 / 18206 ---
--- tokenstypes t000005 / 18206 ---
<omitted>
```

## Processing `toplemmas`

Purpose: Determine the top 1,000 lemmas. **Important**: This process requires manual inspection. Non-meaningful lemmas should be excluded by updating `stoplist.sed` and rerunning the script.
- Input
    - `tweets/types.txt`
    - `stoplist.sed`: List of rules for excluding specific lemmas
- Output
    - `selectedwords` = `var_index.txt`
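
The idea of ranking lemmas by frequency after removing unwanted ones can be sketched as follows. This is an illustration only: `stoplist.sed` is a sed script whose rules are not shown here, so a plain Python set stands in for it, and the function `top_lemmas` and the sample data are hypothetical.

```python
from collections import Counter

def top_lemmas(lemmas, stoplist, n):
    """Return the n most frequent lemmas that are not in the stoplist."""
    counts = Counter(lemma for lemma in lemmas if lemma not in stoplist)
    return [lemma for lemma, _ in counts.most_common(n)]

# Hypothetical lemma occurrences (not taken from the dataset)
lemmas = ['ser', 'governo', 'ser', 'povo', 'governo', 'ser',
          'http', 'http', 'querer', 'governo', 'povo']

# Stand-in for the exclusion rules: lemmas judged non-meaningful on inspection
stoplist = {'ser', 'http'}

selected = top_lemmas(lemmas, stoplist, n=2)
print(selected)
```

In the workflow above `n` would be 1,000, and each manual inspection that finds another non-meaningful lemma adds a rule to the stoplist before the ranking is recomputed.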

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash toplemmas.sh
```

## Processing `sas`

Purpose: Prepare input data for processing in SAS.
- Input
    - `tweets/types.txt`
    - `selectedwords`
    - `file_index.txt`
- Output
    - `columns`
    - `sas/data.txt`
    - `sas/dates.txt`
    - `sas/wcount.txt`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash sas.sh
--- v000001 ---
--- v000002 ---
--- v000003 ---
--- v000004 ---
--- v000005 ---
<omitted>
--- v001000 ---
[nltk_data] Downloading package punkt to /home/eyamrog/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Word counts written to sas/wcount.txt
```

## Processing `datamatrix`

Purpose: Prepare input data for calculating the correlation matrix.
- Input
    - `file_index.txt`
    - `columns`
    - `selectedwords`
- Output
    - `file_ids.txt`
    - `data.csv`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash datamatrix.sh
--- v000001 ---
--- v000002 ---
--- v000003 ---
--- v000004 ---
--- v000005 ---
<omitted>
--- v001000 ---
--- data.csv ...---
```

## Processing `correlationmatrix`

Purpose: Calculate the correlation matrix.
- Input
    - `data.csv`
- Output
    - `correlation`
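
As a rough illustration of what a correlation matrix over lemma variables looks like, the sketch below computes Pearson correlations in plain Python. Several things here are assumptions: that the variables are coded as 1 (lemma present in a text) or 0 (absent), that Pearson is the coefficient used, and the toy values themselves. The actual `correlationmatrix.sh` script and the layout of `data.csv` may differ.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)

# Hypothetical data: one list per lemma variable, one value per text (1 = lemma present)
columns = {
    'v000001': [1, 0, 1, 0],
    'v000002': [1, 0, 1, 0],
    'v000003': [0, 1, 0, 1],
}

# Square matrix: correlation of every variable with every other variable
correlation = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

for name, row in correlation.items():
    print(name, [round(value, 2) for value in row.values()])
```

With pandas, `DataFrame.corr()` computes Pearson coefficients by default and would be the more practical choice for 1,000 variables.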

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash correlationmatrix.sh
--- python correlation ... ---
```

## Processing `formats`

Purpose: Prepare input data for processing in SAS.
- Input
    - `data.csv`
    - `selectedwords`
- Output
    - `sas/corr.txt`
    - `sas/word_labels_format.sas`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash formats.sh
--- sas/sas/corr.txt ---
--- sas/word_labels_format.sas ---
```

## Processing the statistical procedures on SAS

- Log in to your [SAS OnDemand for Academics](https://welcome.oda.sas.com/) account
- Proceed as indicated in this [video tutorial](https://youtu.be/I3u9zD3jyOA?si=68uIKVc2iusGG2KY)

## Processing `examples`

Purpose: Extract examples for analysis.
- Input
    - `sas/output_"$project"/loadtable.html`
    - `sas/output_"$project"/"$project"_scores.tsv`
    - `sas/output_"$project"/"$project"_scores_only.tsv`
- Output
    - `examples/factors`
    - `example files`

```
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash examples.sh
6780
1246
698
123
--- examples f1pos ---
--- factor 1 pos # 000001 ---
tr: warning: an unescaped backslash at end of string is not portable
--- factor 1 pos # 000002 ---
tr: warning: an unescaped backslash at end of string is not portable
--- factor 1 pos # 000003 ---
tr: warning: an unescaped backslash at end of string is not portable
--- factor 1 pos # 000004 ---
tr: warning: an unescaped backslash at end of string is not portable
--- factor 1 pos # 000005 ---
tr: warning: an unescaped backslash at end of string is not portable
<omitted>
```

## Results

Right-click on the link and choose `Open link in a new tab` to download the corresponding file.

- [CL_St1_Querem_Results.zip](https://pucsp-my.sharepoint.com/:u:/g/personal/ra00341729_pucsp_edu_br/ERbP8OEqscBJlh4l6s6_UFgBTUGtnR6PDI1NXZwVBh6Dyg?e=W8YXpq)