<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1_1 - Quérem

This solution meets the requirement of appending the part-of-speech (POS) tag, as a suffix, to each lemma determined by TreeTagger.

It reads `tweets/tagged.txt`, applies the string transformations described below and writes the result to `tweets/tagged2.txt`. It should therefore be run only after `treetagging.sh` has finished.

Before running `tokenstypes.sh`, replace `tweets/tagged.txt` with `tweets/tagged2.txt`, keeping the original file as `tweets/tagged_ori.txt`, as shown below:

```
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:32 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged2.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged.txt tagged_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged2.txt tagged.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:46 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged_ori.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ 
```
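The same two renames can also be done from Python. The following is a minimal sketch, not part of the original workflow: the function name `promote_tagged2` is hypothetical, and the file names are the ones shown in the listing above. It refuses to run if a backup already exists, so a second run cannot overwrite the original TreeTagger output.

```python
from pathlib import Path


def promote_tagged2(tweets_dir):
    """Back up tagged.txt as tagged_ori.txt and promote tagged2.txt to tagged.txt.

    Hypothetical helper mirroring the two `mv` commands shown above.
    """
    tweets_dir = Path(tweets_dir)
    original = tweets_dir / 'tagged.txt'
    transformed = tweets_dir / 'tagged2.txt'
    backup = tweets_dir / 'tagged_ori.txt'

    # Guard against running the swap twice and losing the original file
    if backup.exists():
        raise FileExistsError(f'{backup} already exists; refusing to overwrite it')
    if not original.exists() or not transformed.exists():
        raise FileNotFoundError('Both tagged.txt and tagged2.txt must exist')

    original.rename(backup)       # mv tagged.txt tagged_ori.txt
    transformed.rename(original)  # mv tagged2.txt tagged.txt
```

Calling `promote_tagged2('tweets')` from the project directory would correspond to the shell session above.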

## Importing the required libraries

In [1]:
import pandas as pd
import re

## Importing `tweets/tagged.txt` into a DataFrame

In [2]:
df = pd.read_csv('tweets/tagged.txt', sep='|', names=['text_id', 'conversation', 'date', 'user', 'content'])

In [3]:
df

| | text_id | conversation | date | user | content |
|---|---|---|---|---|---|
| 0 | t000000 | v:287765295 | d:2018-03-28 | u:pelegrini65 | `c:Após\tADP\tapós~caluniar\tVERB.Inf\tcaluniar...` |
| 1 | t000001 | v:16794066 | d:2018-03-30 | u:BlogdoNoblat | `c:Bolsonaro\tPROPN.Masc.Sing\tbolsonaro~deve\t...` |
| 2 | t000002 | v:955901617148235776 | d:2018-03-30 | u:MariaOl25529153 | `c:@FlavioBolsonaro\tVERB.Fin.Sing\ttwitterhand...` |
| 3 | t000003 | v:44449830 | d:2018-03-28 | u:lucianagenro | `c:A\tDET.Fem.Sing\to~esquerda\tNOUN.Fem.Sing\t...` |
| 4 | t000004 | v:912132396 | d:2018-03-30 | u:rocoguima | `c:RT\tPROPN.Masc.Sing\trt~@AurystellaS\tVERB.F...` |
| ... | ... | ... | ... | ... | ... |
| 20596 | t020596 | v:1547227306913153026 | d:2023-04-29 | u:LuccaSo44679209 | `c:RT\tNOUN.Masc.Sing\trt~@LuccaSo44679209\tNUM...` |
| 20597 | t020597 | v:1547227306913153026 | d:2023-04-29 | u:LuccaSo44679209 | `c:@CiresCanisio\tVERB.Fin.Sing\ttwitterhandle~...` |
| 20598 | t020598 | v:1554492869825683457 | d:2023-04-29 | u:Andre19lll | `c:@eunaovoupararde\tVERB.Fin.Sing\ttwitterhand...` |
| 20599 | t020599 | v:1585200142440882179 | d:2023-04-29 | u:priscila19865 | `c:@ValS265451870\tPROPN.Masc.Sing\ttwitterhand...` |


In [4]:
df.dtypes

text_id         object
conversation    object
date            object
user            object
content         object
dtype: object

### Inspecting a few texts

In [5]:
df.loc[1, 'content']

'c:Bolsonaro\tPROPN.Masc.Sing\tbolsonaro~deve\tAUX.Fin.Sing\tdever~saber\tVERB.Inf\tsaber~o\tPRON.Masc.Sing\to~que\tPRON.Rel.Masc.Sing\tque~está\tAUX.Fin.Sing\testar~fazendo\tVERB.Ger\tfazer~.\tPUNCT.Sent\t.~Porque\tADV\tporque~pela\tADP_DET.Fem.Sing\tpor_o~primeira\tADJ.Fem.Sing\tprimeiro~vez\tNOUN.Fem.Sing\tvez~,\tPUNCT.Comma\t,~o\tDET.Masc.Sing\to~eleitorado\tNOUN.Masc.Sing\teleitorado~feminino\tADJ.Masc.Sing\tfeminino~será\tAUX.Fin.Sing\tser~maior\tADJ.Fem.Sing\tgrande~nas\tADP_DET.Fem.Plur\tem_o~eleições\tNOUN.Fem.Plur\teleição~.\tPUNCT.Sent\t.~E\tCCONJ\te~ele\tPRON.Masc.Sing\tele~,\tPUNCT.Comma\t,~no\tADP_DET.Masc.Sing\tem_o~entanto\tNOUN.Masc.Sing\tentanto~,\tPUNCT.Comma\t,~só\tADV\tsó~fala\tVERB.Fin.Sing\tfalar~para\tADP\tpara~os\tDET.Masc.Plur\to~homens\tNOUN.Masc.Plur\thomem~e\tCCONJ\te~só\tADV\tsó~aparece\tVERB.Fin.Sing\taparecer~cercado\tVERB.Part.Masc.Sing\tcercar~de\tADP\tde~homens\tNOUN.Masc.Plur\thomem~,\tPUNCT.Comma\t,~parte\tNOUN.Fem.Sing\tparte~deles\tADP_PRON.Masc.P

## Appending a `~` character at the end of each string of the column `content`

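Tokens in the `content` column are separated by `~`, and the pattern used in the transformation below only matches a tag, a tab and a lemma when they are followed by `~`. Appending a `~` to each string allows the final token of every text to be detected as well.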

In [6]:
# Appending '~' to the end of each string in the 'content' column
df['content'] = df['content'] + '~'
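To illustrate why the trailing `~` matters, the sketch below applies the pattern defined later in this document to a short string with and without the appended character. The two-token `sample` string is a hypothetical example written in the same layout as the `content` column, not a row from the corpus.

```python
import re

# Same pattern as in transform_tagged_string: tag, tab, lemma, then '~'
pattern = r'([a-zA-Z0-9_.]+\t\w+)~'

# Hypothetical two-token text with no '~' after the last lemma
sample = 'c:deve\tAUX.Fin.Sing\tdever~saber\tVERB.Inf\tsaber'

matches_without = re.findall(pattern, sample)
matches_with = re.findall(pattern, sample + '~')

print(matches_without)  # ['AUX.Fin.Sing\tdever']
print(matches_with)     # ['AUX.Fin.Sing\tdever', 'VERB.Inf\tsaber']
```

Without the appended `~`, the last tag and lemma pair is not matched and would therefore be left without its POS suffix.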

### Inspecting a few texts

In [7]:
df.loc[1, 'content']

'c:Bolsonaro\tPROPN.Masc.Sing\tbolsonaro~deve\tAUX.Fin.Sing\tdever~saber\tVERB.Inf\tsaber~o\tPRON.Masc.Sing\to~que\tPRON.Rel.Masc.Sing\tque~está\tAUX.Fin.Sing\testar~fazendo\tVERB.Ger\tfazer~.\tPUNCT.Sent\t.~Porque\tADV\tporque~pela\tADP_DET.Fem.Sing\tpor_o~primeira\tADJ.Fem.Sing\tprimeiro~vez\tNOUN.Fem.Sing\tvez~,\tPUNCT.Comma\t,~o\tDET.Masc.Sing\to~eleitorado\tNOUN.Masc.Sing\teleitorado~feminino\tADJ.Masc.Sing\tfeminino~será\tAUX.Fin.Sing\tser~maior\tADJ.Fem.Sing\tgrande~nas\tADP_DET.Fem.Plur\tem_o~eleições\tNOUN.Fem.Plur\teleição~.\tPUNCT.Sent\t.~E\tCCONJ\te~ele\tPRON.Masc.Sing\tele~,\tPUNCT.Comma\t,~no\tADP_DET.Masc.Sing\tem_o~entanto\tNOUN.Masc.Sing\tentanto~,\tPUNCT.Comma\t,~só\tADV\tsó~fala\tVERB.Fin.Sing\tfalar~para\tADP\tpara~os\tDET.Masc.Plur\to~homens\tNOUN.Masc.Plur\thomem~e\tCCONJ\te~só\tADV\tsó~aparece\tVERB.Fin.Sing\taparecer~cercado\tVERB.Part.Masc.Sing\tcercar~de\tADP\tde~homens\tNOUN.Masc.Plur\thomem~,\tPUNCT.Comma\t,~parte\tNOUN.Fem.Sing\tparte~deles\tADP_PRON.Masc.P

## Defining a function to transform the tagged strings

In [8]:
def transform_tagged_string(tagged_string):
    """Append the POS tag to each lemma in a '~'-delimited tagged string."""
    # Ensure the input is a string
    tagged_string = str(tagged_string)

    def transform_substring(match):
        # Each match has the form 'tag<TAB>lemma'
        tag, lemma = match.group(1).split('\t')
        if tag in ('HASHTAG', 'EMOJI'):
            # Hashtags and emojis keep their lemma unchanged
            return f'{tag}\t{lemma}~'
        # Replace any '.' with '_' in the suffix to ensure compliance with the next stage of processing
        suffix = tag.replace('.', '_')
        return f'{tag}\t{lemma}_{suffix}~'

    # Regular expression matching each 'tag<TAB>lemma' pair followed by the '~' delimiter
    pattern = r'([a-zA-Z0-9_.]+\t\w+)~'

    # Apply the transformation to every match
    return re.sub(pattern, transform_substring, tagged_string)
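As a quick check of the rules above, the sketch below runs a condensed restatement of the function on a short string. The name `add_pos_suffix` and the `sample` string are hypothetical, and the `hashtag` lemma in particular is an assumed value chosen only to exercise the `HASHTAG` branch.

```python
import re


def add_pos_suffix(tagged_string):
    """Condensed restatement of transform_tagged_string, for illustration only."""
    def transform(match):
        tag, lemma = match.group(1).split('\t')
        if tag in ('HASHTAG', 'EMOJI'):
            return f'{tag}\t{lemma}~'  # lemma left unchanged
        return f"{tag}\t{lemma}_{tag.replace('.', '_')}~"

    return re.sub(r'([a-zA-Z0-9_.]+\t\w+)~', transform, str(tagged_string))


# Hypothetical text: a verb, a comma and a hashtag, already ending in '~'
sample = 'c:deve\tAUX.Fin.Sing\tdever~,\tPUNCT.Comma\t,~#voto\tHASHTAG\thashtag~'
result = add_pos_suffix(sample)

print(result == 'c:deve\tAUX.Fin.Sing\tdever_AUX_Fin_Sing~,\tPUNCT.Comma\t,~#voto\tHASHTAG\thashtag~')  # True
```

Only the verb lemma receives a suffix. The comma is left as it is because `\w+` does not match punctuation lemmas, and the hashtag is left as it is because of the explicit `HASHTAG` check.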

## Transforming the tagged strings

In [9]:
# Transforming the tagged strings
df['content'] = df['content'].apply(transform_tagged_string)

### Inspecting a few texts

In [11]:
df.loc[2, 'content']

'c:@FlavioBolsonaro\tVERB.Fin.Sing\ttwitterhandle_VERB_Fin_Sing~Mais\tADV\tmais_ADV~um\tDET.Masc.Sing\tum_DET_Masc_Sing~Romário\tPROPN.Masc.Sing\tromário_PROPN_Masc_Sing~na\tADP_DET.Fem.Sing\tem_o_ADP_DET_Fem_Sing~política\tNOUN.Fem.Sing\tpolítica_NOUN_Fem_Sing~,\tPUNCT.Comma\t,~que\tSCONJ\tque_SCONJ~Deus\tPROPN.Masc.Sing\tdeus_PROPN_Masc_Sing~ajude\tVERB.Fin.Sing\tajudar_VERB_Fin_Sing~os\tDET.Masc.Plur\to_DET_Masc_Plur~bolsonaro\tNOUN.Masc.Sing\tbolsonaro_NOUN_Masc_Sing~e\tCCONJ\te_CCONJ~não\tADV\tnão_ADV~deixem\tVERB.Fin.Plur\tdeixar_VERB_Fin_Plur~subir\tVERB.Inf\tsubir_VERB_Inf~para\tADP\tpara_ADP~cabeça\tNOUN.Fem.Sing\tcabeça_NOUN_Fem_Sing~,\tPUNCT.Comma\t,~só\tADV\tsó_ADV~falta\tVERB.Fin.Sing\tfaltar_VERB_Fin_Sing~agora\tADV\tagora_ADV~apoiar\tVERB.Inf\tapoiar_VERB_Inf~bbb\tNOUN.Fem.Plur\tbbb_NOUN_Fem_Plur~.\tPUNCT.Sent\t.~Nao\tPROPN.Masc.Sing\tnao_PROPN_Masc_Sing~caiam\tVERB.Fin.Plur\tcair_VERB_Fin_Plur~nessa\tADP_DET.Fem.Sing\tem_esse_ADP_DET_Fem_Sing~,\tPUNCT.Comma\t,~nos\tPRON

## Exporting the DataFrame into `tweets/tagged2.txt`

In [12]:
df.to_csv('tweets/tagged2.txt', sep='|', index=False, header=False, encoding='utf-8', lineterminator='\n', doublequote=False, escapechar=' ')