## Reading files

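We read the `.txt` files line by line and apply the following filters:

1. **Remove emojis**  
   - Emojis are stripped for aesthetic reasons and because this is a portfolio project.
2. **Mask profanity**  
   - Words from a short banned list are replaced with `[removed]`, matched as whole words and case-insensitively.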

After filtering, we normalize the content:

- **Strip invisible Unicode characters** like `\u200E` (Left-to-Right Mark) and `\u200F` (Right-to-Left Mark), and replace the narrow no-break space `\u202F` with a regular space.

These steps keep regex matching consistent in the later stages. Because the data has already been mostly cleaned, little further processing is needed.
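
The normalization and masking steps can be sketched on a single string with the standard library alone. The names `INVISIBLE`, `BANNED`, `PROFANITY`, and `clean_line` are illustrative and not part of the notebook, and the two placeholder words stand in for the real banned list. Emoji removal is left out here because `\p{...}` property classes need the third-party `regex` module.

```python
import re

# Illustrative translation table (hypothetical name, not from the notebook)
INVISIBLE = str.maketrans({
    "\u202f": " ",   # narrow no-break space -> regular space
    "\u200e": None,  # left-to-right mark -> removed
    "\u200f": None,  # right-to-left mark -> removed
})

# Placeholder words standing in for the real banned list
BANNED = ["foo", "bar"]
PROFANITY = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BANNED)) + r")\b",
    re.IGNORECASE,
)

def clean_line(line: str) -> str:
    """Normalize invisible characters, then mask banned whole words."""
    line = line.strip().translate(INVISIBLE)
    return PROFANITY.sub("[removed]", line)

print(clean_line("\u200eBAR is near the\u202fbarn\n"))
# [removed] is near the barn
```

The word boundary `\b` is what keeps `barn` intact while `BAR` is masked, so longer words that merely contain a banned word are left alone.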

In [11]:
import regex
import pandas as pd

def read_data(file_path: str) -> pd.DataFrame:
    # 1. Define the list of words to remove
    # You can add more words to this list as needed
    banned_words = [
        'shkau', 'magjup', 'mut', 'pidh', 'kar', 
        'peder', 'qr', 'rop', 'shkerdh', 'bastard'
    ]
    
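    # Build a pattern that matches any banned word as a whole word
    # (\b marks word boundaries); case-insensitivity is applied later
    # through the IGNORECASE flag at substitution time
    profanity_pattern = r'\b(' + '|'.join(banned_words) + r')\b'

    # Pattern for emojis. Caveat: the Unicode `Emoji` property also covers
    # ASCII digits, '#' and '*', so this pattern strips those as well;
    # r'\p{Extended_Pictographic}' is a narrower alternative
    emojis_pattern = r'\p{Emoji}'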

    cleaned_tweets = []

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            tweet = line.strip()
            if not tweet:
                continue  # Skip empty lines
            
            # A. Remove emojis
            tweet = regex.sub(emojis_pattern, "", tweet)
            
            # B. Clean special white-space/marks
            tweet = tweet.replace('\u202f', ' ').replace('\u200E', '').replace('\u200F', '')
            
            # C. Remove the swear words
            tweet = regex.sub(profanity_pattern, "[removed]", tweet, flags=regex.IGNORECASE)
            
            cleaned_tweets.append(tweet)
    
    
    
    # Create DataFrame
    df = pd.DataFrame(cleaned_tweets, columns=["Tweets"])
    
    return df

The `all_chats` dictionary maps each file name (without its extension) to a dataframe with a single column, `Tweets`, holding that file's cleaned lines.

In [12]:
from pathlib import Path

all_chats = {}
data_directory = Path("../../albanian-dialect-corpus/data/ks")
for file in data_directory.glob('*.txt'):
    file_name = file.stem
    all_chats[file_name] = read_data(file)
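
`Path.glob` does not promise a particular order, so the order of the keys in the dictionary (and of the text merged from it later) can differ between machines. A minimal sketch of a reproducible variant, assuming sorted file names are an acceptable order; `load_lines` and `load_directory` are hypothetical helpers, with `load_lines` standing in for `read_data` and returning plain lists instead of dataframes:

```python
import tempfile
from pathlib import Path

def load_lines(path: Path) -> list[str]:
    """Hypothetical stand-in for read_data: non-empty, stripped lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def load_directory(directory: Path) -> dict[str, list[str]]:
    """Map each file stem to its content, visiting files in sorted order."""
    return {p.stem: load_lines(p) for p in sorted(directory.glob("*.txt"))}

# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "2.txt").write_text("second file\n", encoding="utf-8")
    (tmp_dir / "1.txt").write_text("first file\n\nmore text\n", encoding="utf-8")
    (tmp_dir / "notes.md").write_text("ignored\n", encoding="utf-8")
    demo_chats = load_directory(tmp_dir)

print(list(demo_chats))  # ['1', '2']
print(demo_chats["1"])   # ['first file', 'more text']
```

Sorting is lexicographic, so `10.txt` would come before `2.txt`; zero-padded names or a numeric sort key avoid that if the numeric order matters.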

In [13]:
print(all_chats)

{'1':                                                Tweets
0             Ky njeri osht personifikimi i budalles!
1   Po po veq Hasha u kon me Erdoganin e keq e Vjo...
2   Bile edhe droga qe na pat hup na pat marr ne q...
3   Qysh kish pas than ajo bija, sa ma shum po shp...
4   Fillimisht kurgja te keqe nuk ka Blerandi. Pse...
5                   E kan hjek prej rryme Feroniklin!
6   Arbeni ish more shum efiqient  me  euro m² po ...
7                             Ja ka djeg krejt chipat
8             ... na ke thy o, ... na ke thy hy hy hy
9   a e din at meselen per tirqit e Banush Sadllar...
10  Bash era e re osht VV ama era [removed] po i v...
11  ruju  se kan mbet edhe pak dishepuj te albinit...
12  av av kokan qu ushtria kibernetike e Kores se ...
13  Ma burr se ky perfekt njeri eshte, kerkush nuk...
14       Minimum duhesh me bo tweet per me pas efekt!
15  Eventualisht edhe do te mund te ishte, por jo ...
16  Shum kisha dasht me dit, cilat jan kan masat q...
17                    

## Text sequence

The text is merged into a single sequence to prepare it for the next step, where the byte-pair encoding (BPE) algorithm is applied and the text is encoded.
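
How the pieces are joined matters at file boundaries. A minimal sketch on toy data, with plain lists standing in for the `Tweets` columns (`toy_chats`, `fused`, and `separated` are illustrative names): joining each file separately and concatenating the results leaves no space between files, while a single join over all tweets does.

```python
# Toy stand-in for all_chats: lists of strings instead of DataFrame columns
toy_chats = {
    "1": ["first tweet", "second tweet"],
    "2": ["third tweet"],
}

# Per-file join followed by plain concatenation: no separator between files
fused = ""
for tweets in toy_chats.values():
    fused += " ".join(tweets)

# Single join over every tweet: one space between all tweets, across files too
separated = " ".join(
    tweet for tweets in toy_chats.values() for tweet in tweets
)

print(fused)      # first tweet second tweetthird tweet
print(separated)  # first tweet second tweet third tweet
```

For a BPE vocabulary the difference is small but real: fused boundaries create a handful of artificial tokens such as `tweetthird` that never occur in the source text.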

In [14]:
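# Join the tweets of each file with spaces and append the result.
# Note: nothing is inserted between files, so the last tweet of one file
# runs directly into the first tweet of the next.
text_sequence = ""
for chat_df in all_chats.values():
    text_sequence += " ".join(chat_df['Tweets'].values)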

len(text_sequence)

2683993

In [16]:
with open("../output/combined_text.txt", "w", encoding="utf-8") as f:
    f.write(text_sequence)