<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1_1 - Guilherme

This notebook appends the part-of-speech (POS) tag as a suffix to each lemma produced by TreeTagger.

It reads `tweets/tagged.txt`, applies the required string transformations, and writes the result to `tweets/tagged2.txt`. It must therefore be run after `treetagging.sh` has finished.
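
As a minimal illustration of the intended result, the sketch below applies the suffixing rule to a single token taken from the sample output further down. It uses plain string operations instead of the regular expression used by the notebook, and the variable names are illustrative only.

```python
# One token as it appears in the `content` column:
# word, POS tag and lemma separated by tabs.
token = "vou\tAUX.Fin.Sing\tir"

word, tag, lemma = token.split("\t")

# Dots in the tag become underscores before the tag is appended to the lemma.
suffixed = f"{word}\t{tag}\t{lemma}_{tag.replace('.', '_')}"

print(repr(suffixed))  # 'vou\tAUX.Fin.Sing\tir_AUX_Fin_Sing'
```

Only the lemma changes; the word and the original tag are kept as they were.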

Before running `tokenstypes.sh`, replace `tweets/tagged.txt` with `tweets/tagged2.txt`, keeping the original as `tweets/tagged_ori.txt`, as shown below:

```
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:32 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged2.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged.txt tagged_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged2.txt tagged.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:46 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged_ori.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ 
```
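
The same two renames could also be done from Python. The sketch below is one possible way, not part of the original workflow: the helper `promote_tagged2` is a hypothetical name, and the demonstration runs in a temporary directory with placeholder file contents so that no real data is touched.

```python
import tempfile
from pathlib import Path


def promote_tagged2(tweets_dir):
    """Keep the original as tagged_ori.txt and promote tagged2.txt to tagged.txt."""
    tweets_dir = Path(tweets_dir)
    original = tweets_dir / "tagged.txt"
    transformed = tweets_dir / "tagged2.txt"
    backup = tweets_dir / "tagged_ori.txt"

    # Refuse to run if the transformed file is missing or a backup already exists.
    if not transformed.exists():
        raise FileNotFoundError(transformed)
    if backup.exists():
        raise FileExistsError(backup)

    original.rename(backup)       # mv tagged.txt tagged_ori.txt
    transformed.rename(original)  # mv tagged2.txt tagged.txt


# Demonstration in a throwaway directory with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "tagged.txt").write_text("original\n", encoding="utf-8")
    (demo / "tagged2.txt").write_text("transformed\n", encoding="utf-8")

    promote_tagged2(demo)

    result = sorted(p.name for p in demo.iterdir())
    promoted_text = (demo / "tagged.txt").read_text(encoding="utf-8")

print(result)
```

The existence checks are a design choice of this sketch: they prevent a second run from overwriting the backup of the original file.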

## Importing the required libraries

In [1]:
import pandas as pd
import re

## Importing `tweets/tagged.txt` into a DataFrame

In [2]:
df = pd.read_csv('tweets/tagged.txt', sep='|', names=['text_id', 'conversation', 'date', 'user', 'content'])
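
To show how the `sep='|'` and `names` arguments map each line to the five columns, the sketch below parses two made-up lines from an in-memory buffer. The identifiers `demo/file_a`, `demo/file_b` and `demo_user` are invented for illustration; only the layout mirrors the real file.

```python
import io

import pandas as pd

# Two made-up lines in the same pipe-separated layout as tweets/tagged.txt.
sample = (
    "t000000|v:demo/file_a|d:2024-04-16|u:demo_user|c:e\tCCONJ\te~eu\tPRON.Sing\teu\n"
    "t000001|v:demo/file_b|d:2024-04-16|u:demo_user|c:não\tADV\tnão~vou\tAUX.Fin.Sing\tir\n"
)

columns = ['text_id', 'conversation', 'date', 'user', 'content']

# The file has no header row, so the column names are supplied explicitly.
demo_df = pd.read_csv(io.StringIO(sample), sep='|', names=columns)

print(demo_df.shape)
print(repr(demo_df.loc[1, 'content']))
```

The tabs inside `content` are preserved because only `|` acts as the field separator.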

In [3]:
df

| | text_id | conversation | date | user | content |
|---|---|---|---|---|---|
| 0 | t000000 | v:cl_st1_guilherme-dataset-output/MALAFAIA_RES... | d:2024-04-16 | u:silas_malafaia | `c:e\tCCONJ\te~a\tDET.Fem.Sing\to~Bíblia\tPROPN...` |
| 1 | t000001 | v:cl_st1_guilherme-dataset-output/MALAFAIA_RES... | d:2024-04-16 | u:silas_malafaia | `c:e\tCCONJ\te~eu\tPRON.Sing\teu~não\tADV\tnão~...` |
| 2 | t000002 | v:cl_st1_guilherme-dataset-output/MALAFAIA_RES... | d:2024-04-16 | u:silas_malafaia | `c:eu\tPRON.Fem.Sing\teu~vou\tAUX.Fin.Sing\tir~...` |
| 3 | t000003 | v:cl_st1_guilherme-dataset-output/MALAFAIA_RES... | d:2024-04-16 | u:silas_malafaia | `c:e\tCCONJ\te~decida\tVERB.Fin.Sing\tdecidir~s...` |
| 4 | t000004 | v:cl_st1_guilherme-dataset-output/MOTIVACIONAL... | d:2024-04-16 | u:silas_malafaia | `c:e\tCCONJ\te~você\tPRON.Sing\tvocê~vai\tAUX.F...` |
| ... | ... | ... | ... | ... | ... |
| 1806 | t001806 | v:cl_st1_guilherme-dataset-output/MINUTOS_DE_V... | d:2024-04-16 | u:silas_malafaia | `c:a\tDET.Fem.Sing\to~visão\tNOUN.Fem.Sing\tvis...` |
| 1807 | t001807 | v:cl_st1_guilherme-dataset-output/MINUTOS_DE_V... | d:2024-04-16 | u:silas_malafaia | `c:conseguiu\tVERB.Fin.Sing\tconseguir~garantir...` |
| 1808 | t001808 | v:cl_st1_guilherme-dataset-output/MINUTOS_DE_V... | d:2024-04-16 | u:silas_malafaia | `c:é\tAUX.Fin.Sing\tser~o\tDET.Masc.Sing\to~des...` |
| 1809 | t001809 | v:cl_st1_guilherme-dataset-output/MINUTOS_DE_V... | d:2024-04-16 | u:silas_malafaia | `c:características\tNOUN.Fem.Plur\tcaracterísti...` |


In [4]:
df.dtypes

text_id         object
conversation    object
date            object
user            object
content         object
dtype: object

### Inspecting a few texts

In [5]:
df.loc[1, 'content']

'c:e\tCCONJ\te~eu\tPRON.Sing\teu~não\tADV\tnão~vou\tAUX.Fin.Sing\tir~me\tPRON.Masc.Sing\teu~calar\tVERB.Inf\tcalar~por\tADP\tpor~uma\tDET.Fem.Sing\tum~série\tNOUN.Fem.Sing\tsérie~de\tADP\tde~coisas\tNOUN.Fem.Plur\tcoisa~primeiro\tADV\tprimeiro~que\tSCONJ\tque~química\tNOUN.Fem.Sing\tquímica~é\tADP\té~teus\tNOUN.Masc.Plur\tteu~é\tAUX.Fin.Sing\tser~aquele\tPRON.Masc.Sing\taquele~cala\tVERB.Fin.Sing\tcalar~tudo\tPRON.Masc.Sing\ttudo~é\tAUX.Fin.Sing\tser~12\tNUM\t12~eu\tPRON.Sing\teu~sou\tAUX.Fin.Sing\tser~a\tDET.Fem.Sing\to~voz\tNOUN.Fem.Sing\tvoz~profética\tADJ.Fem.Sing\tprofético~voz\tNOUN.Fem.Sing\tvoz~profética\tADJ.Fem.Sing\tprofético~não\tADV\tnão~se\tPRON\tse~cala\tVERB.Fin.Sing\tcalar~questiona\tVERB.Fin.Sing\tquestionar~eu\tPRON.Sing\teu~tô\tNOUN.Masc.Sing\ttô~tão\tADV\ttão~preparado\tVERB.Part.Masc.Sing\tpreparar~para\tADP\tpara~receber\tVERB.Inf\treceber~críticas\tNOUN.Fem.Plur\tcrítica~para\tADP\tpara~ser\tAUX.Inf\tser~enxovalhado\tVERB.Part.Masc.Sing\tenxovalhar~que\tSCONJ\tq

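## Appending a `~` character to the end of each string in the `content` column

Tokens within a text are separated by `~`, so the last token of each string is not followed by one. Appending a trailing `~` lets the regular expression in `transform_tagged_string` match every token, including the last.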

In [6]:
# Appending '~' to the end of each string in the 'content' column
df['content'] = df['content'] + '~'
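
The effect of the trailing `~` can be checked on a short string. The sketch below reuses the pattern from `transform_tagged_string` and compares the matches with and without the appended character; the two-token sample string is an illustration built from tokens in the data above.

```python
import re

# Same pattern as in transform_tagged_string: 'tag<TAB>lemma' followed by '~'.
pattern = r'([a-zA-Z0-9_.]+\t\w+)~'

# Two tokens; the last one has no trailing '~'.
content = "c:e\tCCONJ\te~eu\tPRON.Sing\teu"

matches_without = re.findall(pattern, content)
matches_with = re.findall(pattern, content + '~')

print(matches_without)  # ['CCONJ\te']
print(matches_with)     # ['CCONJ\te', 'PRON.Sing\teu']
```

Without the appended `~`, the last token is never matched and its lemma would be left without a suffix.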

### Inspecting a few texts

In [7]:
df.loc[1, 'content']

'c:e\tCCONJ\te~eu\tPRON.Sing\teu~não\tADV\tnão~vou\tAUX.Fin.Sing\tir~me\tPRON.Masc.Sing\teu~calar\tVERB.Inf\tcalar~por\tADP\tpor~uma\tDET.Fem.Sing\tum~série\tNOUN.Fem.Sing\tsérie~de\tADP\tde~coisas\tNOUN.Fem.Plur\tcoisa~primeiro\tADV\tprimeiro~que\tSCONJ\tque~química\tNOUN.Fem.Sing\tquímica~é\tADP\té~teus\tNOUN.Masc.Plur\tteu~é\tAUX.Fin.Sing\tser~aquele\tPRON.Masc.Sing\taquele~cala\tVERB.Fin.Sing\tcalar~tudo\tPRON.Masc.Sing\ttudo~é\tAUX.Fin.Sing\tser~12\tNUM\t12~eu\tPRON.Sing\teu~sou\tAUX.Fin.Sing\tser~a\tDET.Fem.Sing\to~voz\tNOUN.Fem.Sing\tvoz~profética\tADJ.Fem.Sing\tprofético~voz\tNOUN.Fem.Sing\tvoz~profética\tADJ.Fem.Sing\tprofético~não\tADV\tnão~se\tPRON\tse~cala\tVERB.Fin.Sing\tcalar~questiona\tVERB.Fin.Sing\tquestionar~eu\tPRON.Sing\teu~tô\tNOUN.Masc.Sing\ttô~tão\tADV\ttão~preparado\tVERB.Part.Masc.Sing\tpreparar~para\tADP\tpara~receber\tVERB.Inf\treceber~críticas\tNOUN.Fem.Plur\tcrítica~para\tADP\tpara~ser\tAUX.Inf\tser~enxovalhado\tVERB.Part.Masc.Sing\tenxovalhar~que\tSCONJ\tq

## Defining a function to transform the tagged strings

In [8]:
def transform_tagged_string(tagged_string):
    """Append the POS tag to the lemma of every 'tag<TAB>lemma~' segment."""
    # Ensure the input is a string (non-string values are converted)
    tagged_string = str(tagged_string)

    # Rewrite one matched 'tag<TAB>lemma' segment
    def transform_substring(match):
        tag, lemma = match.group(1).split('\t')
        if tag in ('HASHTAG', 'EMOJI'):
            # Hashtags and emojis keep their lemma unchanged
            return f'{tag}\t{lemma}~'
        # Replace any '.' with '_' in the tag to ensure compliance with the next stage of processing
        suffix = tag.replace('.', '_')
        return f'{tag}\t{lemma}_{suffix}~'

    # Match each 'tag<TAB>lemma' pair terminated by '~'.
    # The tag may contain ASCII letters, digits, '_' and '.';
    # the lemma must consist of word characters only.
    pattern = r'([a-zA-Z0-9_.]+\t\w+)~'

    # Apply the transformation to every match
    return re.sub(pattern, transform_substring, tagged_string)
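
The sketch below exercises the same logic on a short made-up string. It uses a condensed copy of the function above under the hypothetical name `suffix_lemmas` so that the block runs on its own. The hashtag token and the hyphenated lemma are assumptions about what the tagger might emit, included only to show how the pattern treats them.

```python
import re

PATTERN = re.compile(r'([a-zA-Z0-9_.]+\t\w+)~')


def suffix_lemmas(tagged_string):
    """Condensed copy of transform_tagged_string, for illustration only."""
    def repl(match):
        tag, lemma = match.group(1).split('\t')
        if tag in ('HASHTAG', 'EMOJI'):
            return f'{tag}\t{lemma}~'
        return f"{tag}\t{lemma}_{tag.replace('.', '_')}~"
    return PATTERN.sub(repl, str(tagged_string))


# Made-up input: two regular tokens, a hashtag, and a hyphenated lemma.
sample = (
    "c:a\tDET.Fem.Sing\to~"
    "voz\tNOUN.Fem.Sing\tvoz~"
    "#fé\tHASHTAG\tfé~"
    "guarda-chuva\tNOUN.Masc.Sing\tguarda-chuva~"
)

result = suffix_lemmas(sample)
print(repr(result))
```

In this sketch the first two lemmas receive their suffix and the hashtag is left untouched. The hyphenated lemma is also left untouched, because `\w+` does not match `-`; if such lemmas occur in the data, they would reach the next stage without a POS suffix.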

## Transforming the tagged strings

In [9]:
# Transforming the tagged strings
df['content'] = df['content'].apply(transform_tagged_string)

### Inspecting a few texts

In [11]:
df.loc[1, 'content']

'c:e\tCCONJ\te_CCONJ~eu\tPRON.Sing\teu_PRON_Sing~não\tADV\tnão_ADV~vou\tAUX.Fin.Sing\tir_AUX_Fin_Sing~me\tPRON.Masc.Sing\teu_PRON_Masc_Sing~calar\tVERB.Inf\tcalar_VERB_Inf~por\tADP\tpor_ADP~uma\tDET.Fem.Sing\tum_DET_Fem_Sing~série\tNOUN.Fem.Sing\tsérie_NOUN_Fem_Sing~de\tADP\tde_ADP~coisas\tNOUN.Fem.Plur\tcoisa_NOUN_Fem_Plur~primeiro\tADV\tprimeiro_ADV~que\tSCONJ\tque_SCONJ~química\tNOUN.Fem.Sing\tquímica_NOUN_Fem_Sing~é\tADP\té_ADP~teus\tNOUN.Masc.Plur\tteu_NOUN_Masc_Plur~é\tAUX.Fin.Sing\tser_AUX_Fin_Sing~aquele\tPRON.Masc.Sing\taquele_PRON_Masc_Sing~cala\tVERB.Fin.Sing\tcalar_VERB_Fin_Sing~tudo\tPRON.Masc.Sing\ttudo_PRON_Masc_Sing~é\tAUX.Fin.Sing\tser_AUX_Fin_Sing~12\tNUM\t12_NUM~eu\tPRON.Sing\teu_PRON_Sing~sou\tAUX.Fin.Sing\tser_AUX_Fin_Sing~a\tDET.Fem.Sing\to_DET_Fem_Sing~voz\tNOUN.Fem.Sing\tvoz_NOUN_Fem_Sing~profética\tADJ.Fem.Sing\tprofético_ADJ_Fem_Sing~voz\tNOUN.Fem.Sing\tvoz_NOUN_Fem_Sing~profética\tADJ.Fem.Sing\tprofético_ADJ_Fem_Sing~não\tADV\tnão_ADV~se\tPRON\tse_PRON~cala\t

## Exporting the DataFrame into `tweets/tagged2.txt`

In [12]:
df.to_csv('tweets/tagged2.txt', sep='|', index=False, header=False, encoding='utf-8', lineterminator='\n', doublequote=False, escapechar=' ')
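
One way to sanity-check the export is to write a small DataFrame with the same arguments and read it back. The sketch below does this in a temporary directory with made-up rows (`demo/file_a`, `demo/file_b` and `demo_user` are invented). The sample deliberately contains no spaces, pipes or quote characters, so it does not exercise the `doublequote` and `escapechar` options.

```python
import tempfile
from pathlib import Path

import pandas as pd

columns = ['text_id', 'conversation', 'date', 'user', 'content']

# Made-up rows; only the layout mirrors the real data.
demo_df = pd.DataFrame(
    [
        ['t000000', 'v:demo/file_a', 'd:2024-04-16', 'u:demo_user', 'c:e\tCCONJ\te_CCONJ~'],
        ['t000001', 'v:demo/file_b', 'd:2024-04-16', 'u:demo_user', 'c:não\tADV\tnão_ADV~'],
    ],
    columns=columns,
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'tagged2_demo.txt'

    # Same arguments as the export cell above.
    demo_df.to_csv(out_path, sep='|', index=False, header=False, encoding='utf-8',
                   lineterminator='\n', doublequote=False, escapechar=' ')

    raw_lines = out_path.read_text(encoding='utf-8').splitlines()
    reloaded = pd.read_csv(out_path, sep='|', names=columns)

# The reloaded values should equal what was written.
round_trip_ok = reloaded.values.tolist() == demo_df.values.tolist()
print(round_trip_ok)
```

If the real data may contain spaces or quote characters, the written file is worth inspecting directly, since those are the characters the escaping options act on.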