<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 2_1 - Renata

This notebook appends the part-of-speech (POS) tag as a suffix to each lemma produced by TreeTagger.

It reads `tweets/tagged.txt`, applies the string transformations described in the sections below, and writes the result to `tweets/tagged2.txt`. Run it only after `treetagging.sh` has finished.

Before running `tokenstypes.sh`, replace `tweets/tagged.txt` with `tweets/tagged2.txt`, keeping the original file as `tagged_ori.txt`, as shown below:

```
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:32 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged2.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged.txt tagged_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged2.txt tagged.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:46 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged_ori.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ 
```
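The same swap can also be scripted. The following is a minimal sketch using Python's `pathlib`, with the file names taken from the listing above. The helper name `swap_tagged_files` is hypothetical and not part of the project. It refuses to run if a backup already exists, so an earlier `tagged_ori.txt` is never overwritten.

```python
from pathlib import Path


def swap_tagged_files(folder):
    """Keep tagged.txt as tagged_ori.txt and promote tagged2.txt to tagged.txt."""
    folder = Path(folder)
    original = folder / 'tagged.txt'
    transformed = folder / 'tagged2.txt'
    backup = folder / 'tagged_ori.txt'

    # Stop early rather than risk losing data
    if not transformed.exists():
        raise FileNotFoundError(f'{transformed} not found')
    if backup.exists():
        raise FileExistsError(f'{backup} already exists')

    original.rename(backup)       # mv tagged.txt tagged_ori.txt
    transformed.rename(original)  # mv tagged2.txt tagged.txt
```

Under these assumptions, calling `swap_tagged_files('tweets')` from the project root would perform the two `mv` steps shown in the listing.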

## Importing the required libraries

In [None]:
import pandas as pd
import re

## Importing `tweets/tagged.txt` into a DataFrame

In [None]:
df = pd.read_csv('tweets/tagged.txt', sep='|', names=['text_id', 'conversation', 'date', 'user', 'content'])

In [None]:
df

In [None]:
df.dtypes
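
To see how `sep='|'` and `names` map one line of the file to the five columns, here is a minimal sketch that parses a single line from memory. The sample row is made up: it only mimics the five-field layout and assumes that `content` holds `tag<TAB>lemma` pairs separated by `~`. The real contents of `tweets/tagged.txt` may differ.

```python
import io

import pandas as pd

# Hypothetical one-line sample with five '|'-separated fields
sample = 't000001|c000001|2023-09-11|u000001|NN\tcorpus~NNS\ttweet\n'

sample_df = pd.read_csv(
    io.StringIO(sample),
    sep='|',
    names=['text_id', 'conversation', 'date', 'user', 'content'],
)

print(sample_df.shape)                     # one row, five columns
print(repr(sample_df.loc[0, 'content']))   # tabs and '~' are kept inside the field
```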

### Inspecting a few texts

In [None]:
df.loc[1, 'content']

## Appending a `~` character at the end of each string of the column `content`

The transformation below matches each `tag<TAB>lemma` pair only when it is followed by `~`. Appending a trailing `~` ensures that the last pair in each string is matched as well.

In [None]:
# Appending '~' to the end of each string in the 'content' column
df['content'] = df['content'] + '~'
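
A short sketch of why the trailing `~` matters, using the same regular expression as the transformation function defined later in this notebook. The sample string is made up and assumes `tag<TAB>lemma` pairs separated by `~`, with no `~` after the last pair.

```python
import re

# Same pattern as in transform_tagged_string
pattern = r'([a-zA-Z0-9_.]+\t\w+)~'

# Hypothetical content: two pairs, no '~' after the last one
sample = 'NN\tcorpus~NNS\ttweet'

matches_without = re.findall(pattern, sample)
matches_with = re.findall(pattern, sample + '~')

print(matches_without)  # only the first pair is found
print(matches_with)     # both pairs are found
```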

### Inspecting a few texts

In [None]:
df.loc[1, 'content']

## Defining a function to transform the tagged strings

In [None]:
def transform_tagged_string(tagged_string):
    # Coerce the input to a string (a missing value such as NaN becomes 'nan')
    tagged_string = str(tagged_string)

    # Rewrite one matched 'tag<TAB>lemma' pair
    def transform_substring(match):
        tag, lemma = match.group(1).split('\t')
        if tag in ('HASHTAG', 'EMOJI'):
            # Hashtags and emojis are left without a suffix
            return f'{tag}\t{lemma}~'
        # Any other tag is appended to the lemma as a suffix
        return f'{tag}\t{lemma}_{tag}~'

    # Match each 'tag<TAB>lemma' pair that is followed by '~'.
    # The tag character class also accepts '.', which r'\w' alone does not match.
    pattern = r'([a-zA-Z0-9_.]+\t\w+)~'

    # re.sub calls transform_substring once per match
    return re.sub(pattern, transform_substring, tagged_string)

## Transforming the tagged strings

In [None]:
# Transforming the tagged strings
df['content'] = df['content'].apply(transform_tagged_string)

### Inspecting a few texts

In [None]:
df.loc[1, 'content']

## Exporting the DataFrame into `tweets/tagged2.txt`

In [None]:
df.to_csv('tweets/tagged2.txt', sep='|', index=False, header=False, encoding='utf-8', lineterminator='\n', doublequote=False, escapechar=' ')