# Extracting and Cleaning Text from Telegram ChatExport

This notebook parses all html files in Telegram chat export , extracts message metadata and text, cleans the text for analysis, and saves outputs as:
- A CSV file containing all parsed messages and fields
- One cleaned TXT file per day (messages concatenated per date)

The goal is a clean, analysis-ready corpus while preserving essential metadata.

In [90]:
import pandas as pd
from bs4 import BeautifulSoup as bs
from pathlib import Path
import re
import os

## Setup and Working Directory

Before running this notebook, you need to:

1. **Download a chat history from Telegram:**
   - Open Telegram and export the desired chat (Settings â†’ Chat Settings â†’ Export Chat History)
   - Select HTML format (default format)

2. **Save the export in a folder:**
   - Create a folder in the same location as this notebook
   - Name the folder with a relevant name (e.g., "Alice_Weidel" or "Group_Chat_2024")
   - Copy all HTML files from the Telegram export into this folder

3. **Enter the folder name when prompted:**
   - In the next cell, the program will ask you to enter the folder name
   - **Important:** Enter exactly the same name you gave the folder, as this name is used to locate the files and name the output files later in the program

In [98]:
# Iterate all files html files
folder_name = input()

folder = f'{folder_name}'


 Marine Le Pen


## Load and Parse the Telegram Export

- open and read the html file using UTF-8 encoding.
- Parse the HTML with BeautifulSoup (bs).
- We will later extract each message block (`div.message.default.clearfix`) to a tidy table.

In [99]:
# Make list of html files in the chat export folder
html_file_names = [i for i in os.listdir(folder) if i.endswith('.html')]

def parse_into_bs(html_file):
    
    html_path = Path(rf'./{folder_name}/{html_file}')
    with html_path.open(encoding="utf-8") as f:
        html = f.read()
    
    soup = bs(html, "html.parser")

    return soup

# from html to soup elements
bs4_soup = [parse_into_bs(html_file) for html_file in html_file_names]

## Functions for Parsing Messages

This section:
- Extracts message date/time from the `"title"` attribute (e.g., `02.09.2025 13:45:12 UTC+0`)
- Reads the sender name (`div.from_name`)
- Gets text content (`div.text`), preserving line breaks
- Builds a pandas DataFrame with columns:
  - `message_id`, `date`, `time`, `utc`, `from_name`, `text`

In [100]:
def get_datetime_title(body):
    el = body.select_one("div.pull_right.date.details")
    title = el.get("title", "").strip() if el else ""
    date, time, utc = "", "", ""
    if title:
        parts = title.split(" ", 2)
        if len(parts) == 3:
            date_raw, time, utc = parts
            dparts = date_raw.split(".")
            if len(dparts) == 3 and dparts[0].isdigit() and dparts[1].isdigit():
                date = f"{int(dparts[0])}.{int(dparts[1])}.{dparts[2]}"
            else:
                date = date_raw
        else:
            date = title
    return date, time, utc


def get_from_name(body):
    el = body.select_one("div.from_name")
    return el.get_text(strip=True) if el else ""


def get_text(body):
    el = body.select_one("div.text")
    if not el:
        return ""
    for br in el.find_all("br"):
        br.replace_with("\n")
    return el.get_text(strip=True, separator="\n")


def parse_message_div(msg):
    body = msg.select_one("div.body")
    if body is None:
        return {}
    date, time, utc = get_datetime_title(body)
    return {
        "message_id": msg.get("id", ""),
        "date": date,
        "time": time,
        "utc": utc,
        "from_name": get_from_name(body),
        "text": get_text(body),
    }

def parse_all_messages_to_df(soup):
    msgs = soup.select("div.message.default.clearfix")
    rows = [parse_message_div(m) for m in msgs]
    rows = [r for r in rows if r]  # fjern tomme
    return pd.DataFrame(rows, columns=[
        "message_id",
        "date",
        "time",
        "utc",
        "from_name",
        "text",
    ])


# Build DataFrame - parse the html data into a dataframe

_df = [parse_all_messages_to_df(soup) for soup in bs4_soup]

_df = pd.concat(_df).reset_index(drop=True)

df = _df.sort_values('date', ascending=True)

## Build the DataFrame and Clean Text

- Construct the DataFrame from all parsed messages.
- Clean the raw text:
  - Replace newlines with spaces
  - Remove URLs
  - Collapse multiple spaces
  - Keep word tokens only
  - Lowercase
- Add:
  - `clean_text`: cleaned message text
  - `word_count`: number of words in `clean_text`

In [101]:
# Clean the text
def clean_text(text):
    _text = text.replace('\n', ' ') # remove new lines
    _text = re.sub(r'http.+\b', ' ',  _text) # remove links
    _text = re.sub(r'\s+', ' ', _text) # remove multiple white space
    _text = re.findall(r'\b\S+\b', _text) # remove all signs but letters - return list
    _text = ' '.join(_text) # join list to string 
    _text = _text.lower() # lower all letters
    return _text.strip() # strip white spaces

# Add new column with clean text
df['clean_text'] = df['text'].apply(clean_text)

# Add the number of words in the clean text
df['word_count'] = df['clean_text'].apply( lambda x : len(x.split())) 

# Filter data so only rows with word_count more than 10 are kept
df = df.query('word_count > 9')


In [102]:
df

Unnamed: 0,message_id,date,time,utc,from_name,text,clean_text,word_count
228,message258,1.10.2021,14:22:01,UTC+01:00,Marine Le Pen,"Depuis 27 ans, le mois d'octobre est l'occasio...",depuis 27 ans le mois d'octobre est l'occasion...,48
1134,message1226,1.10.2023,13:22:19,UTC+01:00,Marine Le Pen,"La dÃ©conjugalisation de lâ€™AAH, mesure que jâ€™ai...",la dÃ©conjugalisation de lâ€™aah mesure que jâ€™ai ...,63
943,message1026,1.11.2022,12:01:39,UTC+01:00,Marine Le Pen,Sous la pression de Bruxelles et dâ€™association...,sous la pression de bruxelles et dâ€™association...,33
944,message1027,1.11.2022,19:27:14,UTC+01:00,Marine Le Pen,En ce jour rÃ©servÃ© au souvenir de nos disparus...,en ce jour rÃ©servÃ© au souvenir de nos disparus...,23
1154,message1247,1.11.2023,18:34:25,UTC+01:00,Marine Le Pen,Les disparus vivent dans nos cÅ“urs tant quâ€™il ...,les disparus vivent dans nos cÅ“urs tant quâ€™il ...,37
...,...,...,...,...,...,...,...,...
831,message910,9.6.2022,13:44:42,UTC+01:00,Marine Le Pen,ðŸŽ™ Je serai ce vendredi Ã 8h30 lâ€™invitÃ©e de RMC...,je serai ce vendredi Ã 8h30 lâ€™invitÃ©e de rmc e...,11
1080,message1169,9.6.2023,12:36:31,UTC+01:00,Marine Le Pen,"ðŸ“¹ Cette attaque d'Annecy, s'attaquer Ã des bÃ©b...",cette attaque d'annecy s'attaquer Ã des bÃ©bÃ©s ...,37
176,message200,9.8.2021,19:55:34,UTC+01:00,Marine Le Pen,ðŸ–‹ Â«Â Pass Sanitaire : les FranÃ§ais entrent en r...,pass sanitaire les franÃ§ais entrent en rÃ©siden...,10
175,message199,9.8.2021,13:49:47,UTC+01:00,Marine Le Pen,"En France, on peut donc Ãªtre clandestin, incen...",en france on peut donc Ãªtre clandestin incendi...,43


## Save to CSV

Export data to a CSV file for later analysis
(e.g., in spreadsheets, Python, R, or Orange).


In [103]:
out_dir = Path("csv_file")
out_dir.mkdir(exist_ok=True)

path = os.path.join(f"./{out_dir}/{folder}_export.csv")
df.to_csv(path, index=False, encoding='utf-8')

## Export: Concatenate Cleaned Text by Day

- Convert `date` to a proper datetime (day-first).
- Group by day and concatenate all `clean_text` messages for that day.
- Save one TXT per date into `txt_files_grouped_by_day/`.

This is useful for daily-level analysis in tools like Voyant.

In [104]:
_df = df.copy()
_df['date'] = pd.to_datetime(_df['date'], dayfirst=True)
df_grouped = _df.groupby('date')['clean_text'].apply( lambda x : ' '.join(x)).reset_index() 
df_grouped = df_grouped.rename(columns={'clean_text': 'concatenated_text'})

out_dir = Path("txt_files_grouped_by_day")
out_dir.mkdir(exist_ok=True)


def make_txt_filename(date_str):
    date_str = str(date_str).strip().replace("-", "_")[0:10]
    if not date_str:
        date_str = "unknown_date"
    
    return f'{date_str}.txt'


written = 0
for index, row in df_grouped.iterrows():
    text = str(row['concatenated_text']).strip()
    if not text:
        continue
    _fname = make_txt_filename(row.get(("date"), ""))
    fname = folder + '_' + _fname
    (out_dir / fname).write_text(text, encoding="utf-8")
    written += 1

print(f"Wrote {written} txt files in {out_dir}")

Wrote 546 txt files in txt_files_grouped_by_day
