<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Mariana

## Prerequisites

Make sure the prerequisites in [CL_LMDA_prerequisites](https://github.com/laelgelc/laelgelc/blob/main/CL_LMDA_prerequisites.ipynb) are satisfied.

## Dataset

Please download the following dataset (Right-click on the link and choose `Save link as` to download the corresponding file):
- [mari201901.jsonl](https://laelgelcawsemrmariana.s3.sa-east-1.amazonaws.com/mari201901.jsonl)

Please refer to the [Tweet object](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet) documentation for a description of the fields.

## Importing the required libraries

In [1]:
import pandas as pd
import demoji
import re
import os
from collections import Counter

## Data wrangling

### Importing the tweet raw data into a dataframe

In [2]:
df_tweets_raw_data = pd.read_json('mari201901.jsonl', lines=True)

In [3]:
df_tweets_raw_data.head(5)

| | created_at | entities | favorite_count | favorited | filter_level | id | id_str | is_quote_status | lang | quote_count | ... | possibly_sensitive | in_reply_to_screen_name | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_status_id | in_reply_to_status_id_str | coordinates | geo | place | withheld_in_countries |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-15 14:07:11+00:00 | {'hashtags': [{'indices': [67, 77], 'text': 'V... | 0 | False | low | 1085176574671048705 | 1085176574671048704 | False | es | 0 | ... | | | | | | | | | | |
| 1 | 2019-01-15 14:07:11+00:00 | {'hashtags': [{'indices': [67, 77], 'text': 'V... | 0 | False | low | 1085176574671048705 | 1085176574671048704 | False | es | 0 | ... | | | | | | | | | | |
| 2 | 2019-01-13 15:05:36+00:00 | {'hashtags': [{'indices': [38, 48], 'text': 'V... | 0 | False | low | 1084466499962707968 | 1084466499962707968 | True | en | 0 | ... | | | | | | | | | | |
| 3 | 2019-01-13 15:05:36+00:00 | {'hashtags': [{'indices': [38, 48], 'text': 'V... | 0 | False | low | 1084466499962707968 | 1084466499962707968 | True | en | 0 | ... | | | | | | | | | | |
| 4 | 2019-01-13 13:36:24+00:00 | {'hashtags': [{'indices': [28, 38], 'text': 'V... | 0 | False | low | 1084444052030976002 | 1084444052030976000 | True | es | 0 | ... | | | | | | | | | | |


### Checking if data types are consistent

In [13]:
df_tweets_raw_data.dtypes

created_at                   datetime64[ns, UTC]
entities                                  object
favorite_count                             int64
favorited                                   bool
filter_level                              object
id                                         int64
id_str                                    object
is_quote_status                             bool
lang                                      object
quote_count                                int64
reply_count                                int64
retweet_count                              int64
retweeted                                   bool
retweeted_status                          object
source                                    object
text                                      object
timestamp_ms                      datetime64[ns]
truncated                                   bool
user                                      object
quoted_status                             object
quoted_status_id    

#### Converting `id_str` column's data type to `str`

In [11]:
df_tweets_raw_data['id_str'] = df_tweets_raw_data['id_str'].astype(str)
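
A note on precision: in the preview of the raw data, `id` and `id_str` differ in their final digits for some rows (for example `1085176574671048705` versus `1085176574671048704`). That is the typical signature of a value that passed through a 64-bit float, so by the time `astype(str)` runs the original digits may already be lost. The minimal sketch below reproduces the effect with the standard library only; the sample line is hypothetical and contains just the two relevant fields. One possible remedy, stated here as an assumption to verify against your pandas version, is to pass `dtype={'id_str': str}` to `pd.read_json` so that the column is kept as text from the start.

```python
import json

# Hypothetical one-line JSONL sample with only the two relevant fields
sample_line = '{"id": 1085176574671048705, "id_str": "1085176574671048705"}'

record = json.loads(sample_line)

# The standard JSON parser keeps 'id_str' as an exact string
exact_id = record['id_str']

# Passing the same value through a 64-bit float rounds it to a nearby representable number
rounded_id = str(int(float(record['id_str'])))

print(exact_id)    # 1085176574671048705
print(rounded_id)  # 1085176574671048704
```

If the two printed values differ, as they do here, any URL built from the rounded value may point to the wrong tweet.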

### Listing the values of the parameter `lang`

In [6]:
df_tweets_raw_data['lang'].unique()

array(['es', 'en', 'und', 'pt', 'ca', 'fr', 'eu', 'it', 'de', 'ar', 'ht',
       'zh', 'fa', 'tr', 'sv', 'cy', 'ur', 'ro', 'in', 'uk', 'el', 'hi',
       'nl', 'pl', 'ru', 'cs', 'tl', 'fi', 'no', 'lt', 'ja', 'et', 'sr',
       'hu', 'da'], dtype=object)

### Keeping only the tweets in Portuguese

In [7]:
df_tweets_raw_data = df_tweets_raw_data[df_tweets_raw_data['lang'] == 'pt'].reset_index(drop=True)

### Extracting the column `username`

In [8]:
# Flatten the nested JSON 'user' attribute
df_tweets_raw_data_flattened_user = pd.json_normalize(df_tweets_raw_data['user'])

# Extract the 'screen_name' attribute
username = df_tweets_raw_data_flattened_user['screen_name']

# Create a new column 'username'
df_tweets_raw_data['username'] = username

### Extracting the column `author_id`

In [9]:
# Extract the 'id_str' attribute
author_id = df_tweets_raw_data_flattened_user['id_str']

# Create a new column 'author_id'
df_tweets_raw_data['author_id'] = author_id

### Extracting the column `tweet_url`

In [12]:
# Construct the tweet URL using the tweet ID and user's screen name
df_tweets_raw_data['tweet_url'] = (
    'https://twitter.com/' + 
    df_tweets_raw_data['username'] + 
    '/status/' + 
    df_tweets_raw_data['id_str']
)

In [None]:
df_tweets_raw_data

In [17]:
df_tweets_raw_data['id_str']

0      1084480777365139456
1      1084480777365139456
2      1084480357959901184
3      1084480357959901184
4      1084479661709631488
              ...         
677    1088340342347313152
678    1088342930245464064
679    1087630380897832960
680    1088353160173928448
681    1087963693873287168
Name: id_str, Length: 682, dtype: object

### Inspecting the dataset and eliminating malformed data

#### Identifying rows that are empty in column `text`

In [None]:
print(df_tweets_raw_data['text'].isnull().sum())

In [None]:
df_tweets_raw_data[df_tweets_raw_data['text'].isnull()]

#### Dropping the rows that are empty in the column `text`

In [None]:
# Drop the rows whose column 'text' is NaN
df_tweets_raw_data = df_tweets_raw_data.dropna(subset=['text'])

# Reset the index
df_tweets_raw_data = df_tweets_raw_data.reset_index(drop=True)

In [None]:
print(df_tweets_raw_data['text'].isnull().sum())

#### Removing specific Unicode characters

The dataset may need to be cleaned of invisible Unicode characters.

##### Detecting `U+2066` and `U+2069` characters

- [U+2066](https://www.compart.com/en/unicode/U+2066)
- [U+2069](https://www.compart.com/en/unicode/U+2069)

Please refer to:
- [Python RegEx](https://www.w3schools.com/python/python_regex.asp)
- [regex101](https://regex101.com/)
- [RegExr](https://regexr.com/)

In [None]:
# Defining a function to detect specific Unicode characters
def extract_unicode_characters(df, column_name):
    unicode_chars = Counter()  # Initialize a Counter to store Unicode character counts

    for value in df[column_name]:
        if isinstance(value, str):
            # Alternative: use RegEx to find all non-ASCII characters (Unicode)
            # non_ascii_chars = re.findall(r'[^\x00-\x7F]+', value)
            # Use RegEx to find specific Unicode characters - adjust the expression accordingly
            specific_unicode_chars = re.findall(r'[\u2066\u2069]', value)
            unicode_chars.update(specific_unicode_chars)

    return unicode_chars

# Inspect the dataframe for specific Unicode characters
unicode_counts = extract_unicode_characters(df_tweets_raw_data, 'text')

# Print the results - the characters are invisible, so they are shown by code point
for char, count in unicode_counts.items():
    print(f'Character U+{ord(char):04X}: Count = {count}')

##### Removing `U+2066` and `U+2069` characters

In [None]:
# Defining a function to remove specific Unicode characters
def remove_specific_unicode(input_line):
    # Using RegEx to replace specific Unicode characters - adjust the expression accordingly
    cleaned_line = re.sub(r'[\u2066\u2069]', '', input_line)
    return cleaned_line

# Removing specific Unicode characters
df_tweets_raw_data['text'] = df_tweets_raw_data['text'].apply(remove_specific_unicode)
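
As a quick check of the two steps above, the following sketch applies the same expression to a single hypothetical string in which a screen name is wrapped in `U+2066` and `U+2069`. The sample text is made up for illustration. Since both characters are invisible, the counts are reported by code point.

```python
import re
from collections import Counter

# Hypothetical tweet text with a screen name wrapped in U+2066 ... U+2069
sample_text = 'Obrigado \u2066@usuario\u2069 pelo apoio'

# Count the occurrences of each targeted character
found = Counter(re.findall(r'[\u2066\u2069]', sample_text))

# Remove the targeted characters
cleaned_text = re.sub(r'[\u2066\u2069]', '', sample_text)

print({f'U+{ord(char):04X}': count for char, count in found.items()})
print(cleaned_text)  # Obrigado @usuario pelo apoio
```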

### Dropping duplicates

#### Retweets

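Retweets are identified by the RegEx patterns `/\bRT @/gm` or `/\brt @/gm`, which normally occur at the beginning of the column `text`. Note that the expression used below is not anchored, so it also matches the marker anywhere else in the text.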

In [None]:
# Creating a boolean mask for filtering - it is preceded by '~' to invert the selection
mask = ~df_tweets_raw_data['text'].str.contains(r'\bRT @|\brt @', regex=True)

# Applying the mask to overwrite the raw data dataframe, keeping only the tweets that are not retweets
df_tweets_raw_data = df_tweets_raw_data[mask]
df_tweets_raw_data = df_tweets_raw_data.reset_index(drop=True)
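
To see which texts the expression flags, here is a small sketch on hypothetical sample strings. It uses the same pattern as the cell above and only the standard library. If only markers at the very beginning of the text should count, anchoring the pattern with `^` would be one option.

```python
import re

# Same pattern as in the cell above
retweet_pattern = re.compile(r'\bRT @|\brt @')

# Hypothetical sample texts
samples = [
    'RT @usuario: Venezuela livre',     # retweet marker at the beginning
    'Apoio total! RT @usuario: texto',  # retweet marker in the middle
    'ART @usuario',                     # no word boundary before 'RT'
    'Bom dia, Venezuela',               # ordinary tweet
]

is_retweet = [bool(retweet_pattern.search(text)) for text in samples]
print(is_retweet)  # [True, True, False, False]
```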

In [None]:
df_tweets_raw_data['text']

#### Duplicate tweets

The dataset was built in such a way that if a tweet had more than one photo, one copy of the tweet was included per unique photo. Since we are concerned with analysing just the text, those duplicates should be removed. Tweets that bear the same `id_str` (and therefore the same `tweet_url`) are duplicates - we are going to keep only the first.

In [18]:
# Drop duplicate rows except the first occurrence based on 'id_str'
df_tweets_raw_data.drop_duplicates(subset='id_str', keep='first', inplace=True)
df_tweets_raw_data = df_tweets_raw_data.reset_index(drop=True)

In [19]:
df_tweets_raw_data

| | created_at | entities | favorite_count | favorited | filter_level | id | id_str | is_quote_status | lang | quote_count | ... | in_reply_to_user_id_str | in_reply_to_status_id | in_reply_to_status_id_str | coordinates | geo | place | withheld_in_countries | username | author_id | tweet_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-13 16:02:20+00:00 | {'hashtags': [{'indices': [17, 25], 'text': 'U... | 0 | False | low | 1084480777365139458 | 1084480777365139456 | False | pt | 0 | ... | | | | | | | | TaconThiago | 1057374436624609281 | https://twitter.com/TaconThiago/status/1084480... |
| 1 | 2019-01-13 16:00:40+00:00 | {'hashtags': [{'indices': [17, 25], 'text': 'U... | 0 | False | low | 1084480357959901186 | 1084480357959901184 | False | pt | 0 | ... | | | | | | | | Brunobr18373270 | 1076139993599541248 | https://twitter.com/Brunobr18373270/status/108... |
| 2 | 2019-01-13 15:57:54+00:00 | {'hashtags': [{'indices': [17, 25], 'text': 'U... | 0 | False | low | 1084479661709631494 | 1084479661709631488 | False | pt | 0 | ... | | | | | | | | Pedrodon17 | 1052964117181583361 | https://twitter.com/Pedrodon17/status/10844796... |
| 3 | 2019-01-13 15:59:09+00:00 | {'hashtags': [{'indices': [17, 25], 'text': 'U... | 0 | False | low | 1084479976244690946 | 1084479976244690944 | False | pt | 0 | ... | | | | | | | | RanieriXBarbosa | 936990437100879873 | https://twitter.com/RanieriXBarbosa/status/108... |
| 4 | 2019-01-13 16:09:20+00:00 | {'hashtags': [{'indices': [17, 25], 'text': 'U... | 0 | False | low | 1084482538993782784 | 1084482538993782784 | False | pt | 0 | ... | | | | | | | | Borges2510 | 1712396508 | https://twitter.com/Borges2510/status/10844825... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 614 | 2019-01-24 07:38:52+00:00 | {'hashtags': [{'indices': [35, 41], 'text': 't... | 0 | False | low | 1088340342347313152 | 1088340342347313152 | False | pt | 0 | ... | | | | | | | | Tioito1 | 982506039055716352 | https://twitter.com/Tioito1/status/10883403423... |
| 615 | 2019-01-24 07:49:09+00:00 | {'hashtags': [{'indices': [19, 30], 'text': 'C... | 0 | False | low | 1088342930245464065 | 1088342930245464064 | False | pt | 0 | ... | | | | | | | | jonh__fox | 896205547 | https://twitter.com/jonh__fox/status/108834293... |
| 616 | 2019-01-22 08:37:44+00:00 | {'hashtags': [{'indices': [76, 86], 'text': 'V... | 0 | False | low | 1087630380897832960 | 1087630380897832960 | False | pt | 0 | ... | | | | | | | | ppdegalicia | 13494262 | https://twitter.com/ppdegalicia/status/1087630... |
| 617 | 2019-01-24 08:29:48+00:00 | {'hashtags': [{'indices': [24, 34], 'text': 'V... | 0 | False | low | 1088353160173928448 | 1088353160173928448 | False | pt | 0 | ... | | | | | | | | mavitrejo | 980262593515356160 | https://twitter.com/mavitrejo/status/108835316... |


#### @mentioned tweets

A few users post copies of the same tweet addressed to different users through @mentions, which creates multiple copies of that tweet. Those duplicates should be removed.

In [None]:
# Create a new column 'no_mention' containing the contents of the column 'text' with every @mention removed
df_tweets_raw_data['no_mention'] = df_tweets_raw_data['text'].str.replace(r'@\w+\s*', '', regex=True)

# Drop duplicate rows except the first occurrence based on 'no_mention'
df_tweets_raw_data.drop_duplicates(subset='no_mention', keep='first', inplace=True)
df_tweets_raw_data = df_tweets_raw_data.reset_index(drop=True)
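
The idea behind the `no_mention` column can be illustrated on a few hypothetical strings: once every @mention is stripped, copies of the same tweet become identical and can be deduplicated. The sketch below uses the same expression as the cell above; the sample texts and the `dict.fromkeys` deduplication are stand-ins for the dataframe operations.

```python
import re

# Hypothetical copies of one tweet addressed to different users
samples = [
    '@ana @bruno Venezuela precisa de ajuda',
    '@carla Venezuela precisa de ajuda',
    'Venezuela precisa de ajuda',
]

# Same expression as in the cell above: every @mention plus trailing whitespace is removed
no_mention = [re.sub(r'@\w+\s*', '', text) for text in samples]

# Keep the first occurrence of each distinct mention-free text
unique_texts = list(dict.fromkeys(no_mention))
print(unique_texts)  # ['Venezuela precisa de ajuda']
```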

## Sampling the raw data according to filtering expressions

In [None]:
# Defining the filtering expressions
#filter_words = ['arma', 'pátria', 'ladrão', 'cristão', 'comunista', 'família', 'liberdade', 'conservador', 'deus']
filter_words = ['venezuela']

# Creating a boolean mask for filtering
mask = df_tweets_raw_data['text'].str.contains('|'.join(filter_words), case=False)

# Applying the mask to create 'df_tweets_filtered'
df_tweets_filtered = df_tweets_raw_data[mask]
df_tweets_filtered = df_tweets_filtered.reset_index(drop=True)
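
Because the filtering expressions are joined into a RegEx, any expression containing metacharacters such as `.` would be interpreted as a pattern rather than as literal text. A minimal sketch of one way to guard against that, assuming literal matching is what is wanted, is to pass each expression through `re.escape` before joining. The expression `sr.` and the sample texts are hypothetical.

```python
import re

# Hypothetical filtering expressions; 'sr.' contains the RegEx metacharacter '.'
filter_words = ['venezuela', 'sr.']

# Hypothetical sample texts
samples = ['Força VENEZUELA', 'Sr. Presidente', 'Sri Lanka', 'Bom dia']

# Unescaped: '.' matches any character, so 'sr.' also matches 'Sri'
raw_pattern = re.compile('|'.join(filter_words), re.IGNORECASE)
raw_matches = [bool(raw_pattern.search(text)) for text in samples]

# Escaped: each expression is matched literally
literal_pattern = re.compile('|'.join(re.escape(word) for word in filter_words), re.IGNORECASE)
literal_matches = [bool(literal_pattern.search(text)) for text in samples]

print(raw_matches)      # [True, True, True, False]
print(literal_matches)  # [True, True, False, False]
```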

In [None]:
df_tweets_filtered

### Exporting the filtered data into a file for inspection

In [None]:
df_tweets_filtered.to_csv('tweets_emojified.tsv', sep='\t', index=False)

## Replacing emojis

### Demojifying the column `text`

In [None]:
# Defining a function to demojify a string
def demojify_line(input_line):
    demojified_line = demoji.replace_with_desc(input_line, sep='<em>')
    return demojified_line

df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(demojify_line)

#### Exporting the filtered data into a file for inspection

In [None]:
df_tweets_filtered.to_csv('tweets_demojified1.tsv', sep='\t', index=False)

### Separating the demojified strings with spaces

In [None]:
# Defining a function to separate the demojified strings with spaces
def preprocess_line(input_line):
    # Add a space before each delimiter '<em>' that is not already preceded by one
    preprocessed_line = re.sub(r'(?<! )<em>', ' <em>', input_line)
    # Add a space after each delimiter '<em>' that is not already followed by one
    preprocessed_line = re.sub(r'<em>(?! )', '<em> ', preprocessed_line)
    return preprocessed_line

# Separating the demojified strings with spaces
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(preprocess_line)

#### Exporting the filtered data into a file for inspection

In [None]:
df_tweets_filtered.to_csv('tweets_demojified2.tsv', sep='\t', index=False)

### Formatting the demojified strings

In [None]:
# Defining a function to format the demojified string
def format_demojified_string(input_line):
    # Helper that formats a single emoji description captured between '<em>' delimiters
    def process_demojified_string(s):
        # Lowercase the string
        s = s.lower()
        # Replace spaces and colons followed by a space with underscores
        s = re.sub(r'(: )| ', '_', s)
        # Add the prefix 'EMOJI' and the suffix 'e'
        s = f'EMOJI{s}e'
        return s

    # Use RegEx to find and process each demojified string
    processed_line = re.sub(r'<em>(.*?)<em>', lambda match: process_demojified_string(match.group(1)), input_line)
    return processed_line

# Formatting the demojified strings
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(format_demojified_string)
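
To make the effect of the spacing and formatting steps concrete, the sketch below traces one hypothetical string through both of them. The input is an assumption about what the demojifying step produces for a tweet ending in a flag emoji (the description wrapped in `<em>` delimiters); the helper name `to_token` is made up for this illustration, and the expressions are the ones used in the cells above.

```python
import re

# Hypothetical output of the demojifying step for a tweet ending in a flag emoji
demojified = 'Bom dia<em>flag: Brazil<em>!'

# Step 1: surround every '<em>' delimiter with spaces
spaced = re.sub(r'(?<! )<em>', ' <em>', demojified)
spaced = re.sub(r'<em>(?! )', '<em> ', spaced)

# Step 2: turn each delimited description into a single EMOJI..e token
def to_token(match):
    description = match.group(1).lower()
    return 'EMOJI' + re.sub(r'(: )| ', '_', description) + 'e'

formatted = re.sub(r'<em>(.*?)<em>', to_token, spaced)

print(spaced)     # Bom dia <em> flag: Brazil <em> !
print(formatted)  # Bom dia EMOJI_flag_brazil_e !
```

The spaces inserted in step 1 end up as the underscores next to `EMOJI` and `e`, so each emoji becomes one whitespace-delimited token.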

### Replacing the `pipe` character by the `-` character in the `text` column

Further on, a few columns of the dataframe are going to be exported into the file `tweets.txt` whose columns need to be delimited by the `pipe` character. Therefore, it is recommended that any occurrences of the `pipe` character in the `text` column are replaced by another character.

In [None]:
# Defining a function to replace the 'pipe' character by the '-' character
def replace_pipe_with_hyphen(input_string):
    modified_string = re.sub(r'\|', '-', input_string)
    return modified_string

# Replacing the 'pipe' character by the '-' character
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(replace_pipe_with_hyphen)


#### Exporting the filtered data into a file for inspection

In [None]:
df_tweets_filtered.to_csv('tweets_demojified3.tsv', sep='\t', index=False)

## Tokenising

Please refer to [What is tokenization in NLP?](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/).

In [None]:
# Defining a function to tokenise a string
def tokenise_string(input_line):
    # Replace URLs with placeholders
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+\b'
    placeholder = '<URL>'  # Choose a unique placeholder
    urls = re.findall(url_pattern, input_line)
    tokenised_line = re.sub(url_pattern, placeholder, input_line)  # Replace URLs with placeholders
    
    # Replace curly quotes with straight ones
    tokenised_line = tokenised_line.replace('“', '"').replace('”', '"').replace("‘", "'").replace("’", "'")
    # Separate common punctuation marks with spaces
    tokenised_line = re.sub(r'([.\!?,"\'/()])', r' \1 ', tokenised_line)
    # Add a space before '#' if it is not already preceded by whitespace
    tokenised_line = re.sub(r'(?<!\s)#', r' #', tokenised_line)
    # Collapse runs of whitespace into a single space
    tokenised_line = re.sub(r'\s+', ' ', tokenised_line)
    
    # Replace the placeholders with the respective URLs
    for url in urls:
        tokenised_line = tokenised_line.replace(placeholder, url, 1)
    
    return tokenised_line

# Tokenising the strings
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(tokenise_string)
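
The placeholder technique used above (protect the URLs, pad the punctuation, restore the URLs) can be tried out on a single string with the simplified sketch below. It is not the notebook's function: the name `tokenise_sketch` is hypothetical, the URL pattern `https?://\S+` is a deliberately shorter stand-in, and the curly-quote replacement is left out.

```python
import re

def tokenise_sketch(line):
    # Simplified URL pattern (an assumption; the notebook uses a longer expression)
    url_pattern = r'https?://\S+'
    urls = re.findall(url_pattern, line)
    # Protect URLs so that their dots and slashes are not split
    line = re.sub(url_pattern, '<URL>', line)
    # Separate common punctuation marks with spaces
    line = re.sub(r'([.!?,"\'/()])', r' \1 ', line)
    # Add a space before '#' if it is not already preceded by whitespace
    line = re.sub(r'(?<!\s)#', ' #', line)
    # Collapse runs of whitespace into a single space
    line = re.sub(r'\s+', ' ', line)
    # Restore the URLs in their original order
    for url in urls:
        line = line.replace('<URL>', url, 1)
    return line

# Hypothetical sample text
tokenised = tokenise_sketch('Olá,mundo!Viva#Venezuela https://t.co/abc123')
print(tokenised)  # Olá , mundo ! Viva #Venezuela https://t.co/abc123
```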

## Creating the files `file_index.txt` and `tweets.txt`

### Creating column `text_id`

In [None]:
df_tweets_filtered['text_id'] = 't' + df_tweets_filtered.index.astype(str).str.zfill(6)
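
For reference, the identifier format produced by `zfill(6)` can be previewed with plain Python. The row positions below are hypothetical.

```python
# Hypothetical row positions
row_positions = [0, 1, 41, 617]

# Prefix 't' followed by the position left-padded with zeros to six digits
text_ids = ['t' + str(position).zfill(6) for position in row_positions]
print(text_ids)  # ['t000000', 't000001', 't000041', 't000617']
```

Fixed-width identifiers sort alphabetically in the same order as the rows.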

### Creating column `conversation`

In [None]:
df_tweets_filtered['conversation'] = 'v:' + df_tweets_filtered['author_id'].str.replace('id_', '')

### Creating column `date`

In [None]:
# Convert 'created_at' to datetime format
df_tweets_filtered['created_at'] = pd.to_datetime(df_tweets_filtered['created_at'])

# Extract the date part (without time) into a new column 'date'
df_tweets_filtered['date'] = df_tweets_filtered['created_at'].dt.date

# Add the prefix 'd:' to the 'date' values
df_tweets_filtered['date'] = 'd:' + df_tweets_filtered['date'].astype(str)

### Creating column `text_url`

In [None]:
df_tweets_filtered['text_url'] = 'url:' + df_tweets_filtered['tweet_url']

### Creating column `user`

In [None]:
df_tweets_filtered['user'] = 'u:' + df_tweets_filtered['username']

### Creating column `content`

In [None]:
df_tweets_filtered['content'] = 'c:' + df_tweets_filtered['text']

### Reordering the created columns

Please refer to:
- [Python - List Comprehension 1](https://www.w3schools.com/python/python_lists_comprehension.asp)
- [Python - List Comprehension 2](https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/)

In [None]:
# Reorder the columns (we use list comprehension to create a list of all columns except 'text_id', 'conversation', 'date', 'text_url', 'user' and 'content')
df_tweets_filtered = df_tweets_filtered[['text_id', 'conversation', 'date', 'text_url', 'user', 'content'] + [col for col in df_tweets_filtered.columns if col not in ['text_id', 'conversation', 'date', 'text_url', 'user', 'content']]]

In [None]:
df_tweets_filtered

### Creating the file `file_index.txt`

In [None]:
df_tweets_filtered[['text_id', 'conversation', 'date', 'text_url']].to_csv('file_index.txt', sep=' ', index=False, header=False, encoding='utf-8', lineterminator='\n')

### Creating the file `tweets.txt`

In [None]:
folder = 'tweets'
try:
    os.mkdir(folder)
    print(f'Folder {folder} created!')
except FileExistsError:
    print(f'Folder {folder} already exists')
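
An alternative to the `try`/`except` block, shown here only as a sketch, is `os.makedirs` with `exist_ok=True`, which does nothing if the folder is already there. A temporary directory stands in for the project folder so that the example leaves no files behind.

```python
import os
import tempfile

# A temporary directory stands in for the project folder (assumption for the demo)
with tempfile.TemporaryDirectory() as project_dir:
    folder = os.path.join(project_dir, 'tweets')
    os.makedirs(folder, exist_ok=True)  # creates the folder
    os.makedirs(folder, exist_ok=True)  # does nothing on the second call
    folder_exists = os.path.isdir(folder)

print(folder_exists)  # True
```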

Note: The parameters `doublequote=False` and `escapechar=' '` prevent the `content` column from being wrapped in double quotes when a sentence contains characters that would otherwise need escaping, such as the double quote `"` itself. Quoted fields cause a malformed response from TreeTagger.

In [None]:
df_tweets_filtered[['text_id', 'conversation', 'date', 'user', 'content']].to_csv(f'{folder}/tweets.txt', sep='|', index=False, header=False, encoding='utf-8', lineterminator='\n', doublequote=False, escapechar=' ')

## Tagging with TreeTagger

- On Visual Studio Code (VS Code), open the folder where your project is located with `Open Folder...`
- Open a WSL Ubuntu Terminal on VS Code
- **Important**: Activate the `my_env` Python environment by executing `source "$HOME"/my_env/bin/activate`
- Proceed as indicated

Note: You have to download and open this Jupyter Notebook on JupyterLab (provided as part of Anaconda Distribution) to visualise the procedure

Purpose: Annotate the texts in `tweets/tweets.txt` with part-of-speech and lemma information.
- Input
    - `file_index.txt`
    - `tweets/tweets.txt`
- Output
    - `tweets/tagged.txt`

## Processing `tokenstypes`

Purpose: Capture the content tokens (specific occurrences of words) and the content types (general concept of words) from `tweets/tagged.txt`.
- Input
    - `file_index.txt`
    - `tweets/tagged.txt`
- Output
    - `tweets/tokens.txt`
    - `tweets/types.txt`
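
The distinction between tokens and types can be illustrated with a short sketch. It does not reproduce the `tokenstypes` processing itself; the text identifiers and lemmas below are hypothetical, and the counting is done with the standard library.

```python
from collections import Counter

# Hypothetical lemmatised content words of two short texts
texts = {
    't000000': ['venezuela', 'querer', 'liberdade', 'venezuela'],
    't000001': ['liberdade', 'povo'],
}

# Tokens: every occurrence counts
token_count = sum(len(lemmas) for lemmas in texts.values())

# Types: each distinct lemma counts once
type_counts = Counter(lemma for lemmas in texts.values() for lemma in lemmas)

print(token_count)               # 6
print(len(type_counts))          # 4
print(type_counts['venezuela'])  # 2
```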

## Processing `toplemmas`

Purpose: Determine the top 1,000 lemmas. **Important**: This process requires manual inspection. Non-meaningful lemmas should be excluded by updating `stoplist.sed` and rerunning the process.
- Input
    - `tweets/types.txt`
    - `stoplist.sed`: List of rules that allows the exclusion of certain lemmas
- Output
    - `selectedwords` = `var_index.txt`

## Processing `sas`

Purpose: Prepare input data for processing in SAS.
- Input
    - `tweets/types.txt`
    - `selectedwords`
    - `file_index.txt`
- Output
    - `columns`
    - `sas/data.txt`
    - `sas/dates.txt`
    - `sas/wcount.txt`

## Processing `datamatrix`

Purpose: Prepare input data for calculating the correlation matrix.
- Input
    - `file_index.txt`
    - `columns`
    - `selectedwords`
- Output
    - `file_ids.txt`
    - `data.csv`

## Processing `correlationmatrix`

Purpose: Calculate the correlation matrix.
- Input
    - `data.csv`
- Output
    - `correlation`

## Processing `formats`

Purpose: Prepare input data for processing in SAS.
- Input
    - `data.csv`
    - `selectedwords`
- Output
    - `sas/corr.txt`
    - `sas/word_labels_format.sas`

## Results

Right-click on the link and choose `Save link as` to download the corresponding file.

- [CL_St1_Querem_Results.zip](https://laelgelcquerem.s3.sa-east-1.amazonaws.com/CL_St1_Querem_Results.zip)