<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1 - Guilherme

## Prerequisites

Make sure the prerequisites in [CL_LMDA_prerequisites](https://github.com/laelgelc/laelgelc/blob/main/CL_LMDA_prerequisites.ipynb) are satisfied.

### Additional prerequisites

#### WebVTT

`webvtt-py` is a Python package for reading/writing WebVTT caption files.

Please refer to:
- [webvtt-py](https://pypi.org/project/webvtt-py/)

##### Installing `webvtt-py` on Anaconda Distribution

As `webvtt-py` is not available in any of the conda channels, the following procedure should be followed on `Anaconda Prompt` to install it in the required environment, in this case `Env20240401`:

Note:
- Replace `Env20240401` with your actual environment name
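
After installation, a quick check from Python can confirm that the package is importable in the active environment. This is a minimal sketch using only the standard library; the helper name `is_installed` is an assumption introduced here for illustration and is not part of `webvtt-py`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# 'webvtt' is the import name used later in this notebook for the webvtt-py package
print('webvtt importable:', is_installed('webvtt'))
```

If the check prints `False`, the package was probably installed into a different environment from the one running the notebook.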

## WebVTT proof of concept

Please refer to:
- [CL_webvtt-py_Extraction](https://github.com/laelgelc/laelgelc/blob/main/CL_webvtt-py_Extraction.ipynb)

## Dataset

Please download the following dataset (Right-click on the link and choose `Open link in a new tab` to download the corresponding file):
- [cl_st1_guilherme-dataset.zip](https://pucsp-my.sharepoint.com/:u:/g/personal/ra00341729_pucsp_edu_br/EazB9wcuMSNEuxenV9Nb78ABGT8YTV0kwQJFgFymEDwhRA?e=PATjQW)

Extract the `.zip` file into the directory where this Jupyter Notebook is being executed.

## Importing the required libraries

In [1]:
import webvtt
import pandas as pd
import re
import os

## Data wrangling

### Defining the input and output directory names

In [2]:
input_directory = 'cl_st1_guilherme-dataset'
output_directory = input_directory + '-output'

### Defining a function to extract caption texts

In [3]:
def extract_caption_text(webvtt_file, caption_file):
    vtt = webvtt.read(webvtt_file)
    
    # Writing the text of the caption to the output file
    with open(caption_file, 'w', encoding='utf-8') as f:
        f.write('text' + '\n') # Includes the header that will be used in the dataframe
        for caption in vtt:
            f.write(caption.text + '\n')
    
    # Deduplicating the text of the caption using a dataframe
    df = pd.read_table(caption_file)
    df['text'] = df['text'].map(str)
    df.drop_duplicates(subset='text', keep='first', inplace=True)
    df = df.reset_index(drop=True)
    
    # Creating a single string containing all 'text' values separated by spaces
    text_line = ' '.join(df['text'])

    # Rewriting the output file with the single string
    with open(caption_file, 'w', encoding='utf-8') as f:
        f.write(text_line)
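
The deduplication step above matters because caption cues often repeat the previous line as the text scrolls. The following is a minimal sketch of the same idea done in memory, without the intermediate file and dataframe. The helper name `dedupe_caption_lines` is hypothetical, and the result may differ from the pandas-based version in edge cases (for example, lines containing tabs or quote characters).

```python
def dedupe_caption_lines(caption_texts):
    """Join caption lines into one string, keeping only the first occurrence of each line."""
    lines = []
    for text in caption_texts:
        # A caption may span several lines; blank lines are ignored
        lines.extend(line for line in text.split('\n') if line.strip())
    # dict.fromkeys preserves insertion order, so the first occurrence wins
    unique_lines = list(dict.fromkeys(lines))
    return ' '.join(unique_lines)


# Toy captions imitating the rolling repetition of auto-generated subtitles
captions = ['a b', 'a b\nc d', 'c d\ne f']
print(dedupe_caption_lines(captions))  # a b c d e f
```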

### Defining a function to recursively process the `input_directory` and store the results in `output_directory`

In [4]:
def process_directory(input_directory, output_directory):
    for root, dirs, files in os.walk(input_directory):
        for filename in files:
            if filename.endswith('.vtt'):
                # Constructing the corresponding caption filename
                base_name = os.path.splitext(filename)[0]
                caption_filename = base_name + '.txt'

                # Creating the output subdirectory structure
                relative_path = os.path.relpath(root, input_directory)
                output_subdir = os.path.join(output_directory, relative_path)
                os.makedirs(output_subdir, exist_ok=True)

                # Full paths for input and output files
                input_file_path = os.path.join(root, filename)
                output_file_path = os.path.join(output_subdir, caption_filename)

                # Calling 'extract_caption_text' function
                extract_caption_text(input_file_path, output_file_path)
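
To make the path mapping explicit, the sketch below isolates the logic that turns an input `.vtt` path into the mirrored `.txt` path under the output directory. The function name `map_output_path` and the directory names `in` and `out` are assumptions used only for illustration.

```python
import os


def map_output_path(input_root, output_root, vtt_path):
    """Return the .txt path under output_root that mirrors vtt_path under input_root."""
    relative_dir = os.path.relpath(os.path.dirname(vtt_path), input_root)
    base_name = os.path.splitext(os.path.basename(vtt_path))[0]
    # normpath removes the '.' component produced for files at the top level
    return os.path.normpath(os.path.join(output_root, relative_dir, base_name + '.txt'))


nested = os.path.join('in', 'playlist', 'talk.vtt')
print(map_output_path('in', 'out', nested))
```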

### Processing the dataset

In [5]:
process_directory(input_directory, output_directory)

### Importing the texts into a dataframe

In [6]:
def read_file_contents(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()
    except Exception as e:
        print(f'Error reading file {file_path}: {e}')
        return None

def process_output_directory(output_directory):
    # Initialize an empty list to store data
    data = []

    # Recursively iterate through the output_directory
    for root, _, files in os.walk(output_directory):
        for filename in files:
            file_path = os.path.join(root, filename)
            file_contents = read_file_contents(file_path)
            if file_contents is not None:
                data.append({'text': file_contents, 'filepath': file_path})

    # Create a DataFrame from the collected data
    df = pd.DataFrame(data)

    return df

# Importing the texts into the dataframe 'df_tweets_filtered'. Even though this study does not relate to 'tweets', this dataframe name is adopted in order to enable code reuse in subsequent processing stages
df_tweets_filtered = process_output_directory(output_directory)

In [7]:
df_tweets_filtered

| | text | filepath |
|---|---|---|
| 0 | e a Bíblia é a verdade absoluta é a palavra de... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 1 | e eu não vou me calar por uma série de coisas ... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 2 | eu vou dizer uma coisa para você que ele faz e... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 3 | e decida salvar seu casamento sabe por quê Por... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 4 | e você vai ter que lutar para receber as prome... | cl_st1_guilherme-dataset-output/MOTIVACIONAL P... |
| ... | ... | ... |
| 1821 | a visão correta da vida você tem que ter um ol... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |
| 1822 | conseguiu garantir só você tem que entender al... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |
| 1823 | é o desafio de vencer a nossa natureza a chama... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |
| 1824 | características de quem quer produzir obras de... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |


### Dropping duplicates

#### Duplicate texts

Checking for identical texts in terms of content of the column `text` in order to eliminate duplicates.

In [8]:
df_tweets_filtered.drop_duplicates(subset='text', keep='first', inplace=True)
df_tweets_filtered = df_tweets_filtered.reset_index(drop=True)
df_tweets_filtered.shape

(1811, 2)

In [9]:
df_tweets_filtered

| | text | filepath |
|---|---|---|
| 0 | e a Bíblia é a verdade absoluta é a palavra de... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 1 | e eu não vou me calar por uma série de coisas ... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 2 | eu vou dizer uma coisa para você que ele faz e... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 3 | e decida salvar seu casamento sabe por quê Por... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 4 | e você vai ter que lutar para receber as prome... | cl_st1_guilherme-dataset-output/MOTIVACIONAL P... |
| ... | ... | ... |
| 1806 | a visão correta da vida você tem que ter um ol... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |
| 1807 | conseguiu garantir só você tem que entender al... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |
| 1808 | é o desafio de vencer a nossa natureza a chama... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |
| 1809 | características de quem quer produzir obras de... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |


### Inspecting a few tweets

In [10]:
inspected_row = 1277
print('text:' + df_tweets_filtered.loc[inspected_row, 'text'])

text:o povo abençoado do Brasil inacreditável absurdo dos Absurdos uma juíza do Rio Grande do Sul na Lúcia resolveu que a partir da abertura oficial da campanha eleitoral é proibido usar a bandeira do Brasil porque eu tô envolvido com ela o verde e amarelo porque representa um lado ela quer apareceu não tem nada que fazer é petista Unidos onde um povo é nacionalista tá usa a Americana esquerda comunista É por isso agora aqui no Brasil parte da Europa e na América Latina esquerda influenciada pela esquerda comunista por isso que eles usam vermelho bolsonaro trouxe de volta ao brasileiro nacionalismo que o PT apagou Então as manifestações motor ciata Onde bolsonaro vai de maneira espontânea o povo leva E aí essas imagens agora onde Rua Vital PT que vai ser cinismo para enganar mais uma vez o povo que eles usam vermelho que para ele aí deu orgia para cima da Nação de seu símbolo Vocês estão vendo aí eles usam vermelho o vermelho representa a bandeira comunista aqui ó Foi não é daqui a foi

## Exporting to a file

### JSONL format

In [11]:
df_tweets_filtered[['text', 'filepath']].to_json('tweets_filtered.jsonl', orient='records', lines=True)
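
For readers unfamiliar with JSONL, the format stores one JSON object per line. The sketch below illustrates the round trip with the standard library `json` module rather than pandas; the helper names `to_jsonl` and `from_jsonl` and the sample records are assumptions for illustration.

```python
import json


def to_jsonl(rows):
    """Serialise a list of dictionaries as one JSON object per line."""
    return ''.join(json.dumps(row) + '\n' for row in rows)


def from_jsonl(payload):
    """Parse JSONL text back into a list of dictionaries."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


# Hypothetical records with the same two fields as the dataframe
records = [
    {'text': 'primeiro texto', 'filepath': 'output/a.txt'},
    {'text': 'segundo texto', 'filepath': 'output/b.txt'},
]

jsonl_text = to_jsonl(records)
print(jsonl_text)
print(from_jsonl(jsonl_text) == records)  # True
```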

### TSV format

In [12]:
df_tweets_filtered[['text', 'filepath']].to_csv('tweets_filtered.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')

## Importing the Target Corpus into a DataFrame

In [2]:
df_tweets_filtered = pd.read_json('tweets_filtered.jsonl', lines=True)

### Dropping identified duplicates

An examination of the final results showed that the texts in rows 730 and 738 are duplicates with slight differences in their transcripts.

In [13]:
inspected_row = 738
print('text:' + df_tweets_filtered.loc[inspected_row, 'text'])

text:a página de cristo para todos eu tenho muita certeza que a palavra do senhor é algo que produz vida que traz esperança e que nos dá uma nova dimensão do nosso dever eu estou apresentando pra você mais uma mensagem que vai falar ao seu coração se você quer receber um catálogo de todo o nosso material é só você entrar em contato com a nossa central de telefones 0 o perh adorava 21 2598 2019 você pode receber na sua casa no brasil ou no exterior todo o nosso material nós queremos entregar a mensagem que pode transformar a vida do homem se prepare eu tenho certeza que deus vai falar com você através desta mensagem [Música] diz assim a palavra carta de paulo aos efésios capítulo 6 os quatro primeiros versículos voz filhos serem obedientes a vossos pais no senhor porque isto é justo ou ateu pai ea tua mãe que é o primeiro mandamento com promessa para activar bem e vivas muito tempo sobre a terra e voz pais não provoquei sairá vossos filhos mas criei os na doutrina e admoestação do senho

In [14]:
# Define the list of indexes to drop
indexes_to_drop = [
    738
]

# Dropping the rows with the specified indexes
df_tweets_filtered = df_tweets_filtered.drop(indexes_to_drop)
df_tweets_filtered = df_tweets_filtered.reset_index(drop=True)
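
Near-duplicates such as the pair described above were identified by examining the results. As a complement, candidate pairs could be flagged automatically with a string similarity measure. The following is a minimal sketch using `difflib.SequenceMatcher` from the standard library; it is not the method used in this study, and the function name `find_near_duplicates` and the `0.9` threshold are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(texts, threshold=0.9):
    """Return (i, j, ratio) for every pair of texts whose similarity reaches the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        # autojunk=False avoids the heuristic that ignores frequent characters in long strings
        ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs


sample = [
    'a palavra do senhor produz vida',
    'a palavra do senhor produz vidas',
    'texto completamente diferente aqui',
]
print(find_near_duplicates(sample))
```

The comparison is quadratic in the number of texts and slow for long transcripts, so on a corpus of this size it would be worth restricting it to a prefix of each text or to pairs that share a source folder.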

In [15]:
df_tweets_filtered.head(5)

| | text | filepath |
|---|---|---|
| 0 | e a Bíblia é a verdade absoluta é a palavra de... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 1 | e eu não vou me calar por uma série de coisas ... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 2 | eu vou dizer uma coisa para você que ele faz e... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 3 | e decida salvar seu casamento sabe por quê Por... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 4 | e você vai ter que lutar para receber as prome... | cl_st1_guilherme-dataset-output/MOTIVACIONAL P... |


In [16]:
df_tweets_filtered.shape

(1810, 2)

In [17]:
df_tweets_filtered.dtypes

text        object
filepath    object
dtype: object

## Replacing the `pipe` character with the `-` character in the `text` column

Further on, a few columns of the dataframe are exported to the file `tweets.txt`, whose columns must be delimited by the `pipe` character. Any occurrence of the `pipe` character in the `text` column should therefore be replaced with another character.

In [18]:
# Defining a function to replace the 'pipe' character by the '-' character
def replace_pipe_with_hyphen(input_string):
    modified_string = re.sub(r'\|', '-', input_string)
    return modified_string

# Replacing the 'pipe' character by the '-' character
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(replace_pipe_with_hyphen)
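
The reason for this replacement can be seen by counting the fields of a pipe-delimited line. The sketch below is illustrative only: `count_fields` is a hypothetical helper and the field values are made-up examples in the same shape as the columns created later in this notebook.

```python
def count_fields(line, delimiter='|'):
    """Count the fields obtained when a line is split on the delimiter."""
    return len(line.split(delimiter))


# Four metadata fields followed by the content field
fields = ['t000000', 'v:example', 'd:2024-04-16', 'u:silas_malafaia']
raw_text = 'c:um texto | com pipe'
clean_text = raw_text.replace('|', '-')

broken_line = '|'.join(fields + [raw_text])
clean_line = '|'.join(fields + [clean_text])

print(count_fields(broken_line))  # 6, the stray pipe creates an extra field
print(count_fields(clean_line))   # 5, as expected
```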

#### Exporting the filtered data into a file for inspection

In [19]:
df_tweets_filtered[['text']].to_csv('tweets_emojified1.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')

## Tokenising

Please refer to [What is tokenization in NLP?](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/).

In [20]:
# Defining a function to tokenise a string
def tokenise_string(input_line):
    # Replace URLs with placeholders
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+\b'
    placeholder = '<URL>'  # Choose a unique placeholder
    urls = re.findall(url_pattern, input_line)
    tokenised_line = re.sub(url_pattern, placeholder, input_line)  # Replace URLs with placeholders
    
    # Replace curly quotes with straight ones
    tokenised_line = tokenised_line.replace('“', '"').replace('”', '"').replace("‘", "'").replace("’", "'")
    # Separate common punctuation marks with spaces
    tokenised_line = re.sub(r'([.\!?,"\'/()])', r' \1 ', tokenised_line)
    # Add a space before '#' if it is not already preceded by one
    tokenised_line = re.sub(r'(?<!\s)#', r' #', tokenised_line)
    # Collapse runs of whitespace into a single space
    tokenised_line = re.sub(r'\s+', ' ', tokenised_line)
    
    # Replace the placeholders with the respective URLs
    for url in urls:
        tokenised_line = tokenised_line.replace(placeholder, url, 1)
    
    return tokenised_line

# Tokenising the strings
df_tweets_filtered['text'] = df_tweets_filtered['text'].apply(tokenise_string)
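
To see the effect of the punctuation rules in isolation, the sketch below applies only the punctuation, hashtag and whitespace steps to a short string. The function name `separate_punctuation` is an assumption; URL handling and quote normalisation are deliberately left out.

```python
import re


def separate_punctuation(line):
    """Surround common punctuation marks with spaces and collapse extra whitespace."""
    line = re.sub(r'([.\!?,"\'/()])', r' \1 ', line)
    # Add a space before '#' if it is not already preceded by one
    line = re.sub(r'(?<!\s)#', r' #', line)
    return re.sub(r'\s+', ' ', line)


print(separate_punctuation('Olá,mundo!(teste)#tag'))  # Olá , mundo ! ( teste ) #tag
print(separate_punctuation('a#b'))                    # a #b
```

Note that a string ending in a punctuation mark keeps one trailing space, since the final whitespace run is collapsed rather than stripped.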

## Creating the files `file_index.txt` and `tweets.txt`

### Creating column `text_id`

In [21]:
df_tweets_filtered['text_id'] = 't' + df_tweets_filtered.index.astype(str).str.zfill(6)
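
The identifier format produced above is the letter `t` followed by the row index padded with zeros to six digits. A minimal sketch of the same rule in plain Python, with the hypothetical helper `make_text_id`:

```python
def make_text_id(position, width=6):
    """Build an identifier such as 't000042' from a row position."""
    return 't' + str(position).zfill(width)


ids = [make_text_id(i) for i in (0, 1, 1809)]
print(ids)  # ['t000000', 't000001', 't001809']
```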

### Creating column `conversation`

In [22]:
df_tweets_filtered['conversation'] = 'v:' + df_tweets_filtered['filepath']

#### Replacing spaces with the `_` character

**Important**: Since the strings in the original columns contain spaces, Pandas creates `file_index.txt` with those columns enclosed in `"`. This character causes issues when `examples.sh` is executed. Therefore, spaces should be replaced with another character, such as the underscore.

In [23]:
# Defining a function to replace space by the '_' character
def replace_space_with_underscore(input_string):
    modified_string = re.sub(r' ', '_', input_string)
    return modified_string

In [24]:
# Replacing space by the '_' character
df_tweets_filtered['conversation'] = df_tweets_filtered['conversation'].apply(replace_space_with_underscore)
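
The quoting behaviour that motivates this replacement can be reproduced with the standard library `csv` module, which by default quotes any field that contains the delimiter. This is an illustrative sketch, not the notebook's export code; `write_row` and the sample values are assumptions.

```python
import csv
import io


def write_row(fields):
    """Write one space-delimited row with default (minimal) quoting and return it as text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=' ', lineterminator='\n').writerow(fields)
    return buffer.getvalue()


with_spaces = write_row(['t000000', 'v:some folder/file.txt'])
with_underscores = write_row(['t000000', 'v:some folder/file.txt'.replace(' ', '_')])

print(with_spaces)       # t000000 "v:some folder/file.txt"
print(with_underscores)  # t000000 v:some_folder/file.txt
```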

### Creating column `date`

The date for all texts is defined as the date Guilherme sent the dataset, 16th April 2024.

In [25]:
df_tweets_filtered['date'] = 'd:' + '2024-04-16'

### Creating column `text_url`

No URL was considered for any of the texts.

In [26]:
df_tweets_filtered['text_url'] = 'url:' + 'no_url'

### Creating column `user`

`silas_malafaia` was considered for all texts.

In [27]:
df_tweets_filtered['user'] = 'u:' + 'silas_malafaia'

### Creating column `content`

In [28]:
df_tweets_filtered['content'] = 'c:' + df_tweets_filtered['text']

### Reordering the created columns

Please refer to:
- [Python - List Comprehension 1](https://www.w3schools.com/python/python_lists_comprehension.asp)
- [Python - List Comprehension 2](https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/)

In [29]:
# Reorder the columns (a list comprehension collects all columns other than 'text_id', 'conversation', 'date', 'text_url', 'user' and 'content')
df_tweets_filtered = df_tweets_filtered[['text_id', 'conversation', 'date', 'text_url', 'user', 'content'] + [col for col in df_tweets_filtered.columns if col not in ['text_id', 'conversation', 'date', 'text_url', 'user', 'content']]]

In [30]:
df_tweets_filtered

| | text_id | conversation | date | text_url | user | content | text | filepath |
|---|---|---|---|---|---|---|---|---|
| 0 | t000000 | v:cl_st1_guilherme-dataset-output/MALAFAIA_RES... | d:2024-04-16 | url:no_url | u:silas_malafaia | c:e a Bíblia é a verdade absoluta é a palavra ... | e a Bíblia é a verdade absoluta é a palavra de... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 1 | t000001 | v:cl_st1_guilherme-dataset-output/MALAFAIA_RES... | d:2024-04-16 | url:no_url | u:silas_malafaia | c:e eu não vou me calar por uma série de coisa... | e eu não vou me calar por uma série de coisas ... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 2 | t000002 | v:cl_st1_guilherme-dataset-output/MALAFAIA_RES... | d:2024-04-16 | url:no_url | u:silas_malafaia | c:eu vou dizer uma coisa para você que ele faz... | eu vou dizer uma coisa para você que ele faz e... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 3 | t000003 | v:cl_st1_guilherme-dataset-output/MALAFAIA_RES... | d:2024-04-16 | url:no_url | u:silas_malafaia | c:e decida salvar seu casamento sabe por quê P... | e decida salvar seu casamento sabe por quê Por... | cl_st1_guilherme-dataset-output/MALAFAIA RESPO... |
| 4 | t000004 | v:cl_st1_guilherme-dataset-output/MOTIVACIONAL... | d:2024-04-16 | url:no_url | u:silas_malafaia | c:e você vai ter que lutar para receber as pro... | e você vai ter que lutar para receber as prome... | cl_st1_guilherme-dataset-output/MOTIVACIONAL P... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1805 | t001805 | v:cl_st1_guilherme-dataset-output/MINUTOS_DE_V... | d:2024-04-16 | url:no_url | u:silas_malafaia | c:a visão correta da vida você tem que ter um ... | a visão correta da vida você tem que ter um ol... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |
| 1806 | t001806 | v:cl_st1_guilherme-dataset-output/MINUTOS_DE_V... | d:2024-04-16 | url:no_url | u:silas_malafaia | c:conseguiu garantir só você tem que entender ... | conseguiu garantir só você tem que entender al... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |
| 1807 | t001807 | v:cl_st1_guilherme-dataset-output/MINUTOS_DE_V... | d:2024-04-16 | url:no_url | u:silas_malafaia | c:é o desafio de vencer a nossa natureza a cha... | é o desafio de vencer a nossa natureza a chama... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |
| 1808 | t001808 | v:cl_st1_guilherme-dataset-output/MINUTOS_DE_V... | d:2024-04-16 | url:no_url | u:silas_malafaia | c:características de quem quer produzir obras ... | características de quem quer produzir obras de... | cl_st1_guilherme-dataset-output/MINUTOS DE VIT... |


### Creating the file `file_index.txt`

In [31]:
df_tweets_filtered[['text_id', 'conversation', 'date', 'text_url']].to_csv('file_index.txt', sep=' ', index=False, header=False, encoding='utf-8', lineterminator='\n')

### Creating the file `tweets.txt`

In [32]:
folder = 'tweets'
try:
    os.mkdir(folder)
    print(f'Folder {folder} created!')
except FileExistsError:
    print(f'Folder {folder} already exists')

Folder tweets already exists


Note: The parameters `doublequote=False` and `escapechar=' '` prevent the `content` column from being wrapped in double quotes when a sentence contains characters that would otherwise need escaping, such as the double quote `"` itself. Quoted fields cause a malformed response from TreeTagger.

In [33]:
df_tweets_filtered[['text_id', 'conversation', 'date', 'user', 'content']].to_csv(f'{folder}/tweets.txt', sep='|', index=False, header=False, encoding='utf-8', lineterminator='\n', doublequote=False, escapechar=' ')

## Tagging with TreeTagger

- On Visual Studio Code (VS Code), open the folder where your project is located with `Open Folder...`
- Open a WSL Ubuntu Terminal on VS Code
- **Important**: Activate the `my_env` Python environment by executing `source "$HOME"/my_env/bin/activate`
- Proceed as indicated

Purpose: Annotate the texts in `tweets/tweets.txt` with part-of-speech and lemma information.
- Input
    - `file_index.txt`
    - `tweets/tweets.txt`
- Output
    - `tweets/tagged.txt`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash treetagging.sh
--- treetagging t000000 / t018205 ---
        reading parameters ...
        tagging ...
         finished.
--- treetagging t000001 / t018205 ---
        reading parameters ...
        tagging ...
         finished.
--- treetagging t000002 / t018205 ---
        reading parameters ...
        tagging ...
         finished.
--- treetagging t000003 / t018205 ---
        reading parameters ...
        tagging ...
         finished.
<omitted>
```

## Processing `tokenstypes`

Purpose: Capture the content tokens (specific occurrences of words) and the content types (general concept of words) from `tweets/tagged.txt`.
- Input
    - `file_index.txt`
    - `tweets/tagged.txt`
- Output
    - `tweets/tokens.txt`
    - `tweets/types.txt`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash tokenstypes.sh
--- tokenstypes t000000 / 18206 ---
--- tokenstypes t000001 / 18206 ---
--- tokenstypes t000002 / 18206 ---
--- tokenstypes t000003 / 18206 ---
--- tokenstypes t000004 / 18206 ---
--- tokenstypes t000005 / 18206 ---
<omitted>
```
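
The distinction between tokens and types can be illustrated with a short count. This sketch only demonstrates the concept on whitespace-split words; the actual `tokenstypes.sh` script works on the TreeTagger output, and the function name `tokens_and_types` is an assumption.

```python
from collections import Counter


def tokens_and_types(text):
    """Return the list of tokens and a Counter mapping each type to its frequency."""
    tokens = text.lower().split()
    types = Counter(tokens)
    return tokens, types


tokens, types = tokens_and_types('a palavra do senhor é a palavra da vida')
print(len(tokens))       # 9 tokens (every occurrence counts)
print(len(types))        # 7 types (each distinct word counts once)
print(types['palavra'])  # 2
```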

## Processing `toplemmas`

Purpose: Determine the top 1,000 lemmas. **Important**: This process requires manual inspection. Non-meaningful lemmas should be excluded by updating `stoplist.sed` and repeating the processing.
- Input
    - `tweets/types.txt`
    - `stoplist.sed`: List of rules that allow certain lemmas to be excluded
- Output
    - `selectedwords` = `var_index.txt`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash toplemmas.sh
```
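
The selection logic can be sketched in Python as a frequency count that skips stoplisted lemmas. This is an analogy only: `toplemmas.sh` applies `sed` rules from `stoplist.sed`, whereas the hypothetical `top_lemmas` function below uses a plain set, and the sample lemmas are made up.

```python
from collections import Counter


def top_lemmas(lemmas, stoplist, n=1000):
    """Return the n most frequent lemmas that are not in the stoplist."""
    counts = Counter(lemma for lemma in lemmas if lemma not in stoplist)
    return [lemma for lemma, _ in counts.most_common(n)]


lemmas = ['deus', 'ser', 'vida', 'deus', 'ser', 'ser', 'palavra', 'vida', 'deus']
stoplist = {'ser'}  # lemmas judged non-meaningful after manual inspection

print(top_lemmas(lemmas, stoplist, n=2))  # ['deus', 'vida']
```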

## Processing `sas`

Purpose: Prepare input data for processing in SAS.
- Input
    - `tweets/types.txt`
    - `selectedwords`
    - `file_index.txt`
- Output
    - `columns`
    - `sas/data.txt`
    - `sas/dates.txt`
    - `sas/wcount.txt`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash sas.sh
--- v000001 ---
--- v000002 ---
--- v000003 ---
--- v000004 ---
--- v000005 ---
<omitted>
--- v001000 ---
[nltk_data] Downloading package punkt to /home/eyamrog/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Word counts written to sas/wcount.txt
```

## Processing `datamatrix`

Purpose: Prepare input data for calculating the correlation matrix.
- Input
    - `file_index.txt`
    - `columns`
    - `selectedwords`
- Output
    - `file_ids.txt`
    - `data.csv`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash datamatrix.sh
--- v000001 ---
--- v000002 ---
--- v000003 ---
--- v000004 ---
--- v000005 ---
<omitted>
--- v001000 ---
--- data.csv ...---
```
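
Conceptually, the data matrix has one row per text and one column per selected word. The sketch below assumes each cell records the presence (`1`) or absence (`0`) of the word in the text; whether `datamatrix.sh` stores presence or counts is not stated here, so treat this as an illustration. The name `build_data_matrix` and the sample data are hypothetical.

```python
def build_data_matrix(texts_lemmas, selected_words):
    """Return one row per text with 1 where the selected word occurs and 0 otherwise."""
    matrix = []
    for lemmas in texts_lemmas:
        present = set(lemmas)
        matrix.append([1 if word in present else 0 for word in selected_words])
    return matrix


selected_words = ['deus', 'vida', 'palavra']
texts_lemmas = [
    ['deus', 'vida'],
    ['palavra'],
    ['deus', 'palavra', 'deus'],
]

for row in build_data_matrix(texts_lemmas, selected_words):
    print(row)
```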

## Processing `correlationmatrix`

Purpose: Calculate the correlation matrix.
- Input
    - `data.csv`
- Output
    - `correlation`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash correlationmatrix.sh
--- python correlation ... ---
```
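
As a reminder of what a correlation matrix contains, the sketch below computes Pearson correlations between word columns in plain Python. It is not the code run by `correlationmatrix.sh`; the function names `pearson` and `correlation_matrix` and the toy columns are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (nan if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float('nan')
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of named columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


# Each column holds one value per text for a given word
columns = {'deus': [1, 0, 1, 0], 'vida': [1, 0, 1, 0], 'palavra': [0, 1, 0, 1]}
matrix = correlation_matrix(columns)
print(matrix['deus']['vida'])     # 1.0
print(matrix['deus']['palavra'])  # -1.0
```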

## Processing `formats`

Purpose: Prepare input data for processing in SAS.
- Input
    - `data.csv`
    - `selectedwords`
- Output
    - `sas/corr.txt`
    - `sas/word_labels_format.sas`

```
eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ source "$HOME"/my_env/bin/activate
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash formats.sh
--- sas/sas/corr.txt ---
--- sas/word_labels_format.sas ---
```

## Processing the statistical procedures on SAS

- Log in to your [SAS OnDemand for Academics](https://welcome.oda.sas.com/) account
- Proceed as indicated in this [video tutorial](https://youtu.be/I3u9zD3jyOA?si=68uIKVc2iusGG2KY)

## Processing `examples`

Purpose: Extract examples for analysis.
- Input
    - `sas/output_"$project"/loadtable.html`
    - `sas/output_"$project"/"$project"_scores.tsv`
    - `sas/output_"$project"/"$project"_scores_only.tsv`
- Output
    - `examples/factors`
    - `example files`

```
(my_env) eyamrog@Rog-ASUS:/mnt/c/Users/eyamr/Downloads$ bash examples.sh
6780
1246
698
123
--- examples f1pos ---
--- factor 1 pos # 000001 ---
tr: warning: an unescaped backslash at end of string is not portable
--- factor 1 pos # 000002 ---
tr: warning: an unescaped backslash at end of string is not portable
--- factor 1 pos # 000003 ---
tr: warning: an unescaped backslash at end of string is not portable
--- factor 1 pos # 000004 ---
tr: warning: an unescaped backslash at end of string is not portable
--- factor 1 pos # 000005 ---
tr: warning: an unescaped backslash at end of string is not portable
<omitted>
```

## Results

Right-click on the link and choose `Open link in a new tab` to download the corresponding file.

- [CL_St1_Querem_Results.zip](https://pucsp-my.sharepoint.com/:u:/g/personal/ra00341729_pucsp_edu_br/ERbP8OEqscBJlh4l6s6_UFgBTUGtnR6PDI1NXZwVBh6Dyg?e=W8YXpq)