<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 2_1 - Renata

This notebook appends the part-of-speech (POS) tag as a suffix to each lemma produced by TreeTagger.

It reads `tweets/tagged.txt`, applies the required string transformations and writes the result to `tweets/tagged2.txt`. It must therefore be run after `treetagging.sh` has finished.

Before running `tokenstypes.sh`, rename the files so that `tweets/tagged2.txt` takes the place of `tweets/tagged.txt` (the original is kept as `tweets/tagged_ori.txt`), as shown below:

```
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:32 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged2.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged.txt tagged_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ mv tagged2.txt tagged.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ ll
total 33172
drwxr-xr-x 2 eyamrog eyamrog     4096 Sep 11 17:46 ./
drwxr-xr-x 5 eyamrog eyamrog     4096 Sep 11 17:45 ../
-rw-r--r-- 1 eyamrog eyamrog    99405 Sep 11 17:34 tagged.txt
-rw-r--r-- 1 eyamrog eyamrog    89062 Sep 11 14:40 tagged_ori.txt
-rw-r--r-- 1 eyamrog eyamrog 16842002 Sep 11 14:38 tweets.txt
-rw-r--r-- 1 eyamrog eyamrog 16924433 Sep 11 14:20 tweets_ori.txt
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_renata/tweets$ 
```
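
The same swap can also be scripted from Python, which avoids retyping the `mv` commands on every run. The following is a minimal sketch using only `pathlib`; the function name `swap_tagged_files` and the refusal to overwrite an existing `tagged_ori.txt` are assumptions of this sketch, not part of the original workflow.

```python
from pathlib import Path


def swap_tagged_files(directory):
    """Back up tagged.txt as tagged_ori.txt, then promote tagged2.txt to tagged.txt."""
    directory = Path(directory)
    original = directory / 'tagged.txt'
    transformed = directory / 'tagged2.txt'
    backup = directory / 'tagged_ori.txt'

    # Assumption: never overwrite a backup left over from a previous run
    if backup.exists():
        raise FileExistsError(f'{backup} already exists')
    if not transformed.exists():
        raise FileNotFoundError(f'{transformed} not found')

    original.rename(backup)        # mv tagged.txt tagged_ori.txt
    transformed.rename(original)   # mv tagged2.txt tagged.txt
```

Calling `swap_tagged_files('tweets')` from the project directory would then correspond to the two `mv` commands in the session above.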

## Importing the required libraries

In [1]:
import pandas as pd
import re

## Importing `tweets/tagged.txt` into a DataFrame

In [2]:
df = pd.read_csv('tweets/tagged.txt', sep='|', names=['text_id', 'conversation', 'date', 'user', 'content'])

In [3]:
df

| | text_id | conversation | date | user | content |
|---|---|---|---|---|---|
| 0 | t000000 | v:107838014712814240 | d:2024-05-17 | u:AlexJones | `c:#alexjonesshow_h\tHASHTAG\talexjonesshow_h~F...` |
| 1 | t000001 | v:107838014712814240 | d:2024-05-08 | u:AlexJones | `c:#alexjonesshow_h\tHASHTAG\talexjonesshow_h~M...` |
| 2 | t000002 | v:107838014712814240 | d:2024-04-23 | u:AlexJones | `c:BREAKING\tNN\tbreaking~:\t:\t:~Australian\tN...` |
| 3 | t000003 | v:107838014712814240 | d:2024-03-29 | u:AlexJones | `c:BREAKING\tNN\tbreaking~:\t:\t:~CPS\tNP\tcps~...` |
| 4 | t000004 | v:107838014712814240 | d:2024-03-26 | u:AlexJones | `c:Tuesday\tNP\ttuesday~LIVE\tNP\tlive~:\t:\t:~...` |
| ... | ... | ... | ... | ... | ... |
| 56638 | t056638 | v:107759501782461328 | d:2022-07-20 | u:truthsocial | `c:We\tPP\twe~are\tVBP\tbe~humbled\tVBN\thumble...` |
| 56639 | t056639 | v:107759501782461328 | d:2022-05-09 | u:truthsocial | `c:The\tDT\tthe~wait\tNN\twait~is\tVBZ\tbe~OVER...` |
| 56640 | t056640 | v:107759501782461328 | d:2022-05-07 | u:truthsocial | `c:BIG\tJJ\tbig~NEWS\tNN\tnews~!\tSENT\t!~Our\t...` |
| 56641 | t056641 | v:107759501782461328 | d:2022-03-11 | u:truthsocial | `c:User\tNN\tuser~engagement\tNN\tengagement~on...` |


In [4]:
df.dtypes

text_id         object
conversation    object
date            object
user            object
content         object
dtype: object
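
Each `content` value packs one token per `~`-separated segment, and each segment holds the word form, its POS tag and its lemma separated by tabs. As a minimal sketch of that layout, the triples can be unpacked as follows; the helper name `split_tokens` is hypothetical, and skipping segments that do not have exactly three fields is an assumption of this sketch.

```python
def split_tokens(content):
    """Split a tagged string into (word, tag, lemma) triples."""
    triples = []
    for segment in content.split('~'):
        fields = segment.split('\t')
        # Assumption: well-formed segments have exactly three tab-separated fields
        if len(fields) == 3:
            triples.append(tuple(fields))
    return triples


sample = "c:BREAKING\tNN\tbreaking~:\t:\t:~Australian\tNP\taustralian"
for word, tag, lemma in split_tokens(sample):
    print(word, tag, lemma)
```

Note that the first word keeps the `c:` field prefix, since the sketch does not strip it.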

### Inspecting a few texts

In [5]:
df.loc[2, 'content']

"c:BREAKING\tNN\tbreaking~:\t:\t:~Australian\tNP\taustralian~Senator\tNP\tsenator~Calls\tNP\tcalls~For\tIN\tfor~Elon\tNP\telon~Musk\tNP\tmusk~'\tPOS\t'~s\tJJ\ts~Arrest\tNP\tarrest~For\tIN\tfor~Defending\tNP\tdefending~Free\tNP\tfree~Speech\tNP\tspeech"

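## Appending a `~` character to the end of each string in the `content` column

Tokens in `content` are separated by `~`, so the last token has no trailing delimiter. Appending `~` lets the regular expression in `transform_tagged_string`, which matches each tag/lemma pair followed by `~`, also match the last token.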

In [6]:
# Appending '~' to the end of each string in the 'content' column
df['content'] = df['content'] + '~'
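
The effect of the trailing `~` can be checked on a short string. The sketch below applies the pattern used by `transform_tagged_string` to a two-token sample, with and without the appended delimiter; the sample string is taken from the end of the text inspected above.

```python
import re

# Same pattern as in the transformation function defined in this notebook
pattern = r'([a-zA-Z0-9_.]+\t[a-zA-Z0-9_\-]+)~'

sample = "Free\tNP\tfree~Speech\tNP\tspeech"

without_delimiter = re.findall(pattern, sample)
with_delimiter = re.findall(pattern, sample + '~')

print(len(without_delimiter))  # only the first tag/lemma pair is found
print(len(with_delimiter))     # both pairs are found
```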

### Inspecting a few texts

In [7]:
df.loc[2, 'content']

"c:BREAKING\tNN\tbreaking~:\t:\t:~Australian\tNP\taustralian~Senator\tNP\tsenator~Calls\tNP\tcalls~For\tIN\tfor~Elon\tNP\telon~Musk\tNP\tmusk~'\tPOS\t'~s\tJJ\ts~Arrest\tNP\tarrest~For\tIN\tfor~Defending\tNP\tdefending~Free\tNP\tfree~Speech\tNP\tspeech~"

## Defining a function to transform the tagged strings

In [8]:
def transform_tagged_string(tagged_string):
    """Append the POS tag as a suffix to each lemma of a '~'-delimited tagged string."""
    # Ensure the input is a string
    tagged_string = str(tagged_string)

    # Rewrite one 'TAG\tlemma~' match as 'TAG\tlemma_TAG~'
    def transform_substring(match):
        tag, lemma = match.group(1).split('\t')
        # Hashtags and emojis keep their lemma unchanged
        if tag in ('HASHTAG', 'EMOJI'):
            return f'{tag}\t{lemma}~'
        # Replace '.' with '_' in the suffix to comply with the next stage of processing
        suffix = tag.replace('.', '_')
        return f'{tag}\t{lemma}_{suffix}~'

    # Match each 'TAG\tlemma' pair terminated by '~'.
    # Lemmas with other characters (e.g. '&', '!', '@card@') do not match and stay unchanged.
    pattern = r'([a-zA-Z0-9_.]+\t[a-zA-Z0-9_\-]+)~'

    # Apply the transformation to every match
    return re.sub(pattern, transform_substring, tagged_string)
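
To see what the function above does to individual tokens, the following sketch restates its logic in condensed form and runs it on short samples. The name `add_pos_suffix` is hypothetical, the restatement is for illustration only, and `A.B` is an invented dotted tag used solely to show the `.` to `_` replacement.

```python
import re

# Condensed restatement of the notebook's logic, for illustration only
PATTERN = re.compile(r'([a-zA-Z0-9_.]+)\t([a-zA-Z0-9_\-]+)~')


def add_pos_suffix(tagged_string):
    def repl(match):
        tag, lemma = match.group(1), match.group(2)
        if tag in ('HASHTAG', 'EMOJI'):
            return match.group(0)  # leave hashtags and emojis untouched
        return f"{tag}\t{lemma}_{tag.replace('.', '_')}~"
    return PATTERN.sub(repl, tagged_string)


sample = ("GET\tVB\tget~BOOST*\tNP\tboost*~40\tCD\t@card@~"
          "#supportinfowars_h\tHASHTAG\tsupportinfowars_h~")
once = add_pos_suffix(sample)
print(repr(once))

# 'A.B' is a hypothetical dotted tag
dotted = add_pos_suffix("word\tA.B\tword~")
print(repr(dotted))

# Applying the transformation a second time appends the suffix again
twice = add_pos_suffix(once)
print(repr(twice))
```

Because a second pass turns `get_VB` into `get_VB_VB`, the transformation should be applied only once to freshly loaded data.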

## Transforming the tagged strings

In [9]:
# Transforming the tagged strings
df['content'] = df['content'].apply(transform_tagged_string)

### Inspecting a few texts

In [10]:
df.loc[1, 'content']

'c:#alexjonesshow_h\tHASHTAG\talexjonesshow_h~Must-Watch\tNP\tmust-watch_NP~Wednesday\tNP\twednesday_NP~Broadcast\tNP\tbroadcast_NP~:\t:\t:~Alex\tNP\talex_NP~Jones\tNP\tjones_NP~&\tCC\t&~Nick\tNP\tnick_NP~Fuentes\tNP\tfuentes_NP~Discuss\tNP\tdiscuss_NP~Future\tNP\tfuture_NP~of\tIN\tof_IN~Free\tNP\tfree_NP~Speech\tNP\tspeech_NP~In\tIN\tin_IN~America\tNP\tamerica_NP~&\tCC\t&~Worldwide\tNP\tworldwide_NP~!\tSENT\t!~!\tSENT\t!~TUNE\tNN\ttune_NN~IN\tIN\tin_IN~LIVE\tNP\tlive_NP~!\tSENT\t!~GET\tVB\tget_VB~NITRIC\tNP\tnitric_NP~BOOST*\tNP\tboost*~NOW\tRB\tnow_RB~40\tCD\t@card@~%\tNN\t%~OFF\tNN\toff_NN~!\tSENT\t!~#supportinfowars_h\tHASHTAG\tsupportinfowars_h~'
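
In the output above, lemmas such as `&`, `!`, `boost*`, `@card@` and `%` carry no suffix because they contain characters outside the pattern's character classes. A minimal sketch for listing such cases in a transformed string follows; the helper name `unsuffixed_lemmas` is hypothetical, and whether these tokens should receive a suffix is left open here.

```python
def unsuffixed_lemmas(content):
    """Return (tag, lemma) pairs whose lemma does not end with its POS suffix."""
    skipped = []
    for segment in content.split('~'):
        fields = segment.split('\t')
        if len(fields) != 3:
            continue  # ignore empty or malformed segments
        _, tag, lemma = fields
        if tag in ('HASHTAG', 'EMOJI'):
            continue  # these are left unsuffixed by design
        if not lemma.endswith('_' + tag.replace('.', '_')):
            skipped.append((tag, lemma))
    return skipped


sample = "GET\tVB\tget_VB~BOOST*\tNP\tboost*~40\tCD\t@card@~%\tNN\t%~OFF\tNN\toff_NN~"
print(unsuffixed_lemmas(sample))
```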

## Exporting the DataFrame into `tweets/tagged2.txt`

In [11]:
df.to_csv('tweets/tagged2.txt', sep='|', index=False, header=False, encoding='utf-8', lineterminator='\n', doublequote=False, escapechar=' ')
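
Since the transformation rewrites only the `content` column, the output file should contain as many lines as the input. The following sketch, using only the standard library, compares the two counts before `tagged2.txt` replaces `tagged.txt`. The helper names `count_lines` and `same_line_count` are hypothetical, and the check assumes that no field contains an embedded newline.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it all into memory."""
    with open(path, encoding='utf-8') as handle:
        return sum(1 for _ in handle)


def same_line_count(input_path, output_path):
    """Return True if both files have the same number of lines."""
    return count_lines(input_path) == count_lines(output_path)
```

For this notebook, the call would be `same_line_count('tweets/tagged.txt', 'tweets/tagged2.txt')`.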