## Reading files

We read the `.txt` files line by line and apply the following filters:

1. **Remove emojis**  
   - Emojis are stripped from every line, both for aesthetic reasons and because this is a portfolio project.

After filtering, we normalize the content:

- **Replace narrow no-break spaces** (`\u202F`, used in iOS exports) with regular spaces.
- **Strip invisible Unicode characters** like `\u200E` (Left-to-Right Mark) and `\u200F` (Right-to-Left Mark).

These steps make timestamp parsing and regex matching reliable. Because the data has already been mostly cleaned, little further processing is needed.
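
As a quick illustration of the filtering and normalization described above, here is a minimal sketch applied to one invented export line. The helper name `clean_line`, the sample text, and the emoji code-point ranges are assumptions made for this example. The ranges are an approximation and do not cover every emoji.

```python
import re

# Approximate emoji matcher built from common code-point ranges (assumption:
# this covers the emojis in the data; it is not an exhaustive definition).
EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
    "]+"
)


def clean_line(line: str) -> str:
    """Hypothetical helper: strip emojis and normalize invisible characters."""
    line = EMOJI_RE.sub("", line)
    line = line.replace("\u202f", " ")                    # narrow no-break space
    line = line.replace("\u200e", "").replace("\u200f", "")  # LRM and RLM
    return line.strip()


# Invented sample line: leading LRM, narrow no-break space before "PM", two emojis
sample = "\u200e12/31/23, 10:15\u202fPM - Alice: Happy new year! \U0001F389\U0001F600"
cleaned = clean_line(sample)
print(cleaned)  # 12/31/23, 10:15 PM - Alice: Happy new year!
```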

In [None]:
import re
import pandas as pd



def read_data(file_path: str) -> pd.DataFrame:
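    # Approximate emoji matcher. The standard `re` module does not support
    # \p{Emoji}, so common emoji code-point ranges are listed explicitly.
    emojis = re.compile(
        "["
        "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
        "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
        "\u2600-\u27BF"          # miscellaneous symbols and dingbats
        "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
        "]+"
    )

    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Strip emojis and surrounding whitespace from every line
    filtered_lines = [emojis.sub("", line).strip() for line in lines]

    # Rejoin the lines so the message regex can look ahead across newlines
    content = "\n".join(filtered_lines)

    # Replace narrow no-break space (iOS specific)
    content = content.replace('\u202f', ' ')

    # Remove LRM and RLM characters (Left-to-Right Mark and Right-to-Left Mark)
    content = content.replace('\u200E', '').replace('\u200F', '')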

    # Updated regex pattern to match both iOS and Android WhatsApp exports.
    pattern = r'(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?(?:\s?[APap][Mm])?)\s?(?:-|\~)?\s?(.*?): (.*?)(?=\n\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}|$)'
    messages = re.findall(pattern, content, re.DOTALL)
    df = pd.DataFrame(messages, columns=['timestamp', 'sender', 'message'])

    # Parse all timestamps at once; values that cannot be parsed become NaT.
    # Note: format='mixed' requires pandas 2.0 or newer.
    df['timestamp'] = pd.to_datetime(
        df['timestamp'], format='mixed', errors='coerce')
    return df
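
To see what the message pattern extracts, here is a minimal sketch that runs the same regex on a few invented Android-style lines. The sample chat and the sender names are assumptions for illustration. Only the standard `re` module is used, so the result is a list of tuples rather than a dataframe.

```python
import re

# Same pattern as in the function above
pattern = r'(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?(?:\s?[APap][Mm])?)\s?(?:-|\~)?\s?(.*?): (.*?)(?=\n\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}|$)'

# Invented sample: the second message spans two lines
sample_chat = (
    "12/31/23, 10:15 PM - Alice: Happy new year!\n"
    "1/1/24, 12:01 AM - Bob: Same to you\n"
    "and the family"
)

# re.DOTALL lets the message group run across line breaks until the
# lookahead finds the next timestamp or the end of the text
messages = re.findall(pattern, sample_chat, re.DOTALL)
for timestamp, sender, message in messages:
    print(repr(timestamp), repr(sender), repr(message))
```

Each tuple holds the timestamp, sender, and message text. The continuation line stays attached to Bob's message because the lookahead only stops at a line that starts with a new timestamp.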

The `all_chats` dictionary maps each file name (without its extension) to a dataframe with three columns: `timestamp`, `sender`, and `message`.

In [None]:
from pathlib import Path

all_chats = {}
data_directory = Path("../data/private")
for file in data_directory.glob('*.txt'):
    file_name = file.stem
    all_chats[file_name] = read_data(file)

## Text sequence

The messages from all chats are merged into a single text sequence. In the next step, this sequence is encoded with the byte pair encoding (BPE) algorithm.

In [None]:
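# Join messages within each chat and across chats with a single space,
# so the last message of one file is not glued to the first of the next
text_sequence = " ".join(
    " ".join(chat['message']) for chat in all_chats.values()
)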

len(text_sequence)

In [None]:
with open("../output/combined_text.txt", "w", encoding="utf-8") as f:
    f.write(text_sequence)
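
The BPE step mentioned above begins by counting adjacent symbol pairs in the sequence. The following is a minimal sketch of that first step only, assuming byte-level BPE over UTF-8 (the text does not say whether bytes or characters are used). The function name `most_frequent_pair` and the toy string are hypothetical stand-ins for the real `text_sequence`.

```python
from collections import Counter


def most_frequent_pair(ids: list[int]) -> tuple[tuple[int, int], int]:
    """Return the most common adjacent pair of ids and how often it occurs."""
    counts = Counter(zip(ids, ids[1:]))
    return counts.most_common(1)[0]


sample_text = "aaabdaaabac"  # toy stand-in for text_sequence
ids = list(sample_text.encode("utf-8"))  # assumption: BPE runs on UTF-8 bytes

pair, count = most_frequent_pair(ids)
print(pair, count)  # (97, 97) 4, i.e. the byte pair "aa" occurs four times
```

A full BPE trainer would replace every occurrence of that pair with a new id and repeat until the target vocabulary size is reached.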