<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1_1 - Mariana

This solution meets the requirement of appending the part-of-speech (POS) tag as a suffix to each lemma produced by TreeTagger.

It reads `tweets/tagged.txt`, applies the string transformations described in the sections that follow, and writes the result to `tweets/tagged2.txt`. Because its input is produced by `treetagging.sh`, run it only after that script has finished.

Before running `tokenstypes.sh`, replace `tweets/tagged.txt` with `tweets/tagged2.txt`, keeping the original file as `tagged_ori.txt`, as shown below:

```
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:32 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged2.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged.txt tagged_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged2.txt tagged.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:46 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged_ori.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ 
```
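The same swap can also be scripted rather than typed at the shell. The following is a minimal sketch, not part of the original workflow: the helper name `swap_tagged_files` is hypothetical, and it assumes the three file names used in the listing above.

```python
from pathlib import Path


def swap_tagged_files(tweets_dir):
    """Keep the original tagged.txt as tagged_ori.txt and promote tagged2.txt.

    Hypothetical helper mirroring the two `mv` commands shown above.
    """
    tweets_dir = Path(tweets_dir)
    original = tweets_dir / 'tagged.txt'
    transformed = tweets_dir / 'tagged2.txt'
    backup = tweets_dir / 'tagged_ori.txt'

    # Refuse to proceed if the transformed file is missing or a backup already exists
    if not transformed.exists():
        raise FileNotFoundError(f'{transformed} not found')
    if backup.exists():
        raise FileExistsError(f'{backup} already exists; refusing to overwrite it')

    original.rename(backup)        # mv tagged.txt tagged_ori.txt
    transformed.rename(original)   # mv tagged2.txt tagged.txt
```

Calling `swap_tagged_files('tweets')` from the project directory would reproduce the two `mv` commands, with the added guard that an existing `tagged_ori.txt` is never overwritten.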

## Importing the required libraries

In [1]:
import pandas as pd
import re

## Importing `tweets/tagged.txt` into a DataFrame

In [2]:
df = pd.read_csv('tweets/tagged.txt', sep='|', names=['text_id', 'conversation', 'date', 'user', 'content'])

In [3]:
df

Unnamed: 0,text_id,conversation,date,user,content
0,t000000,v:2316329808,d:2016-10-13,u:SrtaXiss,c:RT\tPROPN.Masc.Sing\trt~@BR_DeTodos200Mi\tVE...
1,t000001,v:4876348647,d:2016-10-13,u:ins_ana_,c:RT\tNOUN.Masc.Sing\trt~@correio_dopovo\tADJ....
2,t000002,v:2858025838,d:2016-10-25,u:ireneravachero3,c:E\tCCONJ\te~quanto\tADV\tquanto~a\tDET.Fem.S...
3,t000003,v:457243275,d:2016-10-13,u:vitor_CRVG,c:@Estadao\tNOUN.Masc.Sing\ttwitterhandle~quer...
4,t000004,v:1944741320,d:2016-10-14,u:Camisa13doGalo,c:Rómulo\tPROPN.Masc.Sing\trómulo~Otero\tPROPN...
...,...,...,...,...,...
19060,t019060,v:1506118493938270208,d:2022-09-20,u:Jmalvesdc,c:RT\tPROPN.Masc.Sing\trt~@BoicaIslene\tVERB.F...
19061,t019061,v:2904143747,d:2022-09-05,u:cjcastro45,c:RT\tPROPN.Masc.Sing\trt~@Pattypschmidt\tPROP...
19062,t019062,v:1547288191731900416,d:2022-09-11,u:LimaFucuta,c:@VEJA\tVERB.Fin.Sing\ttwitterhandle~Cruz\tPR...
19063,t019063,v:123496655,d:2022-09-14,u:EMBRAC,c:RT\tPROPN.Masc.Sing\trt~@DiarioPE\tVERB.Fin....


In [4]:
df.dtypes

text_id         object
conversation    object
date            object
user            object
content         object
dtype: object
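To make the five-column layout concrete, the sketch below parses a single in-memory line instead of the real file. The sample line is hypothetical, modelled on the rows displayed above, and it assumes that each line of `tweets/tagged.txt` holds five `|`-separated fields.

```python
import io

import pandas as pd

# Hypothetical single line in the assumed layout: text_id|conversation|date|user|content
sample_line = 't000001|v:4876348647|d:2016-10-13|u:ins_ana_|c:de\tADP\tde\n'

columns = ['text_id', 'conversation', 'date', 'user', 'content']
sample_df = pd.read_csv(io.StringIO(sample_line), sep='|', names=columns)

# The tabs inside 'content' are preserved because only '|' acts as the separator
print(sample_df.loc[0, 'text_id'])
print(repr(sample_df.loc[0, 'content']))
```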

### Inspecting a few texts

In [5]:
df.loc[1, 'content']

'c:RT\tNOUN.Masc.Sing\trt~@correio_dopovo\tADJ.Masc.Sing\ttwitterhandle~:\tPUNCT.Colon\t:~Roraima\tPROPN.Masc.Sing\troraima~prepara\tVERB.Fin.Sing\tpreparar~gabinete\tNOUN.Masc.Sing\tgabinete~de\tADP\tde~emergência\tNOUN.Fem.Sing\temergência~para\tADP\tpara~crise\tNOUN.Fem.Sing\tcrise~de\tADP\tde~refugiados\tNOUN.Masc.Plur\trefugiado~venezuelanos\tADJ.Masc.Plur\tvenezuelano'

## Appending a `~` character at the end of each string of the column `content`

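In the `content` column, tokens are separated by `~`, so every `tag\tlemma` pair except the last one is followed by `~`. Appending a trailing `~` lets the regular expression used in the transformation function match the final pair as well.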

In [6]:
# Appending '~' to the end of each string in the 'content' column
df['content'] = df['content'] + '~'
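The effect of the trailing `~` can be checked on a small string. This is an illustrative sketch: the two-token string is a hypothetical example in the same `token\ttag\tlemma~` layout, and the pattern is the one used later in `transform_tagged_string`.

```python
import re

# Same pattern as in transform_tagged_string: tag, tab, lemma, then '~'
pattern = r'([a-zA-Z0-9_.]+\t\w+)~'

# Hypothetical two-token example in the token\ttag\tlemma~ layout
content = 'c:de\tADP\tde~crise\tNOUN.Fem.Sing\tcrise'

without_tilde = re.findall(pattern, content)
with_tilde = re.findall(pattern, content + '~')

print(without_tilde)  # ['ADP\tde']
print(with_tilde)     # ['ADP\tde', 'NOUN.Fem.Sing\tcrise']
```

Without the trailing `~`, the last `tag\tlemma` pair is never matched and would therefore be left without its POS suffix.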

### Inspecting a few texts

In [7]:
df.loc[1, 'content']

'c:RT\tNOUN.Masc.Sing\trt~@correio_dopovo\tADJ.Masc.Sing\ttwitterhandle~:\tPUNCT.Colon\t:~Roraima\tPROPN.Masc.Sing\troraima~prepara\tVERB.Fin.Sing\tpreparar~gabinete\tNOUN.Masc.Sing\tgabinete~de\tADP\tde~emergência\tNOUN.Fem.Sing\temergência~para\tADP\tpara~crise\tNOUN.Fem.Sing\tcrise~de\tADP\tde~refugiados\tNOUN.Masc.Plur\trefugiado~venezuelanos\tADJ.Masc.Plur\tvenezuelano~'

## Defining a function to transform the tagged strings

In [8]:
def transform_tagged_string(tagged_string):
    # Ensure the input is a string
    tagged_string = str(tagged_string)

    # Rewrite one 'TAG\tlemma~' match as 'TAG\tlemma_TAG~'
    def transform_substring(match):
        tag, lemma = match.group(1).split('\t')
        if tag in ('HASHTAG', 'EMOJI'):
            # Hashtag and emoji lemmas are left without a suffix
            return f'{tag}\t{lemma}~'
        # Replace '.' with '_' in the suffix to ensure compliance with the next stage of processing
        suffix = tag.replace('.', '_')
        return f'{tag}\t{lemma}_{suffix}~'

    # Match a tag (letters, digits, '_' or '.'), a tab and a lemma made of word characters, followed by '~'
    pattern = r'([a-zA-Z0-9_.]+\t\w+)~'

    # Apply the transformation to every match
    return re.sub(pattern, transform_substring, tagged_string)
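A short worked example may help to see the three cases the function handles. The sketch below is a condensed restatement of the same logic, not a replacement for the cell above: the names `PATTERN` and `add_pos_suffix` are hypothetical, and the sample string, including the way a hashtag token is laid out, is an assumption made for illustration.

```python
import re

# Same pattern as in transform_tagged_string
PATTERN = re.compile(r'([a-zA-Z0-9_.]+\t\w+)~')


def add_pos_suffix(match):
    """Return the matched 'TAG\\tlemma~' with the tag appended to the lemma."""
    tag, lemma = match.group(1).split('\t')
    if tag in ('HASHTAG', 'EMOJI'):
        return match.group(0)  # left unchanged
    return f"{tag}\t{lemma}_{tag.replace('.', '_')}~"


# Hypothetical sample: a noun, a hashtag and a punctuation mark
sample = 'c:crise\tNOUN.Fem.Sing\tcrise~#brasil\tHASHTAG\tbrasil~:\tPUNCT.Colon\t:~'
result = PATTERN.sub(add_pos_suffix, sample)

print(repr(result))
# 'c:crise\tNOUN.Fem.Sing\tcrise_NOUN_Fem_Sing~#brasil\tHASHTAG\tbrasil~:\tPUNCT.Colon\t:~'
```

The noun lemma receives the suffix `_NOUN_Fem_Sing`, the hashtag lemma is returned as is, and the punctuation lemma `:` is never matched because `\w+` only accepts word characters, which is consistent with the `:~` that stays unchanged in the notebook output.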

## Transforming the tagged strings

In [9]:
# Transforming the tagged strings
df['content'] = df['content'].apply(transform_tagged_string)

### Inspecting a few texts

In [10]:
df.loc[1, 'content']

'c:RT\tNOUN.Masc.Sing\trt_NOUN_Masc_Sing~@correio_dopovo\tADJ.Masc.Sing\ttwitterhandle_ADJ_Masc_Sing~:\tPUNCT.Colon\t:~Roraima\tPROPN.Masc.Sing\troraima_PROPN_Masc_Sing~prepara\tVERB.Fin.Sing\tpreparar_VERB_Fin_Sing~gabinete\tNOUN.Masc.Sing\tgabinete_NOUN_Masc_Sing~de\tADP\tde_ADP~emergência\tNOUN.Fem.Sing\temergência_NOUN_Fem_Sing~para\tADP\tpara_ADP~crise\tNOUN.Fem.Sing\tcrise_NOUN_Fem_Sing~de\tADP\tde_ADP~refugiados\tNOUN.Masc.Plur\trefugiado_NOUN_Masc_Plur~venezuelanos\tADJ.Masc.Plur\tvenezuelano_ADJ_Masc_Plur~'

## Exporting the DataFrame into `tweets/tagged2.txt`

In [11]:
df.to_csv('tweets/tagged2.txt', sep='|', index=False, header=False, encoding='utf-8', lineterminator='\n', doublequote=False, escapechar=' ')