## Data Preprocessing and Cleaning
In this notebook, we preprocess and clean the cumulated data so that it is ready for analysis and model training.
This includes removing duplicates, handling missing values, masking features where necessary, and fixing data types.

Below are the steps followed:
1. Importing required libraries and defining base directory
2. Loading the cumulated data files and their file paths
3. Define a function to carry out preprocessing and cleaning tasks
4. Call the preprocessing function on each dataset and store the cleaned data

### 1. Importing required libraries and defining base directory
In this step, we import the necessary libraries and define the base directory where the cumulated data is stored.

In [1]:
import os
import pandas as pd
from pathlib import Path
import re
import csv
import hashlib
import ast

print(os.getcwd())
os.chdir("OriginalRedditDataSet/Raw_Data/")
print(os.getcwd())

d:\Data602\Project\VentBuddy-A-Data-Driven-Companion-for-Mental-Health-Support
d:\Data602\Project\VentBuddy-A-Data-Driven-Companion-for-Mental-Health-Support\OriginalRedditDataSet\Raw_Data
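Changing the working directory with `os.chdir` makes every later relative path depend on that cell having been run exactly once. As a minimal alternative sketch, assuming the same folder layout as in this notebook, the directories can be built from a fixed root instead. The helper name `project_dirs` and its dictionary keys are illustrative and not part of the notebook.

```python
from pathlib import Path


def project_dirs(project_root):
    """Return the data directories used in this notebook, relative to project_root.

    The folder names mirror the ones used in this notebook; the function
    itself is a hypothetical helper.
    """
    data_root = Path(project_root) / "OriginalRedditDataSet"
    return {
        "raw": data_root / "Raw_Data",
        "cumulated": data_root / "Cumulated_Data" / "Combined_by_category",
        "preprocessed": data_root / "Preprocessed_Data",
    }


# Build the paths from the current directory without changing it.
dirs = project_dirs(Path.cwd())
for name, path in dirs.items():
    print(f"{name}: {path}")
```

With this approach, re-running a cell does not shift where relative paths point.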


### 2. Loading the cumulated data files and their file paths
In this step, we list the cumulated data files and store their file paths for further processing.
The file paths are used later to read the data into DataFrames for preprocessing and cleaning.

In [2]:
base_dir = Path(os.getcwd())
print(base_dir)
cumulated_data_dir = Path("../Cumulated_Data/Combined_by_category")
print(cumulated_data_dir)
# Note: this variable is not used by the cleaning function, which writes to ../Preprocessed_Data.
preprocessed_data_dir = Path("../Preprocessed")
print(preprocessed_data_dir)

cumulated_data_files = os.listdir(cumulated_data_dir)
print(cumulated_data_files)
data_file_paths = []
for file_name in cumulated_data_files:
    if file_name.endswith(".csv"):
        full_path = cumulated_data_dir / file_name
        data_file_paths.append(str(full_path))
print(data_file_paths)

d:\Data602\Project\VentBuddy-A-Data-Driven-Companion-for-Mental-Health-Support\OriginalRedditDataSet\Raw_Data
..\Cumulated_Data\Combined_by_category
..\Preprocessed
['anx_data.csv', 'dep_data.csv', 'lon_data.csv', 'mh_data.csv', 'sw_data.csv']
['..\\Cumulated_Data\\Combined_by_category\\anx_data.csv', '..\\Cumulated_Data\\Combined_by_category\\dep_data.csv', '..\\Cumulated_Data\\Combined_by_category\\lon_data.csv', '..\\Cumulated_Data\\Combined_by_category\\mh_data.csv', '..\\Cumulated_Data\\Combined_by_category\\sw_data.csv']
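The listing-and-filtering loop can also be written with `Path.glob`, which returns only the matching files. The following is a small sketch of that idea, run on a throwaway directory so it works anywhere; the function name `list_csv_files` and the file `notes.txt` are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return the .csv files in directory as a sorted list of Path objects."""
    return sorted(Path(directory).glob("*.csv"))


# Demonstration on a temporary directory holding two CSV files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["dep_data.csv", "anx_data.csv", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in list_csv_files(tmp)]
    print(found)  # ['anx_data.csv', 'dep_data.csv']
```

Sorting makes the processing order deterministic, since directory listings are not guaranteed to come back in any particular order.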


### 3. Define a function to carry out preprocessing and cleaning tasks
In this step, we define a function that takes a file path as input, cleans the data, and writes the result to a new CSV file.
It removes duplicates, handles missing values, masks sensitive features (here by dropping the `author` column), and fixes data types.
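To make the data type fixes concrete, here is a small sketch on a made-up three-row frame (the values are invented, not taken from the dataset). It shows how `errors="coerce"` turns unparseable scores and timestamps into missing values instead of raising an error.

```python
import pandas as pd

# Toy data: one valid row, one missing score, one non-numeric score.
toy = pd.DataFrame({
    "score": ["3", None, "abc"],
    "timestamp": ["2019-04-30 23:50:25", "not a date", None],
})

# Unparseable scores become missing, then 0; the nullable Int64 dtype keeps them integers.
toy["score"] = pd.to_numeric(toy["score"], errors="coerce").fillna(0).astype("Int64")

# Unparseable or missing timestamps become NaT instead of raising.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], errors="coerce")

print(toy.dtypes)
print(toy)
```

Rows whose timestamp ends up as `NaT` are the ones a later `dropna(subset=['timestamp'])` removes.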

In [16]:
def clean_reddit_data(file_path):
    """
    Steps followed:
    1) Remove columns: created_utc, Unnamed: 0, author
    2) Remove duplicate rows
    3) Drop rows where selftext is NaN or empty
    4) Fill missing score with 0
    5) Fill missing subreddit with the target value inferred from the file name
    6) Fill missing title with 'no title'
    7) Normalize subreddit/selftext/title (lower + strip)
    8) Coerce dtypes: subreddit/selftext/title->string, score->Int64, timestamp->datetime64[ns]
    9) Fix swapped subreddit/selftext rows (when selftext == target but subreddit != target)
    10) Keep only rows where subreddit == target
    11) Drop unparseable timestamps and reset index
    """

    print("### Working on file:", file_path)
    # Infer the target subreddit from the first letter of the file name
    # (e.g. anx_data.csv -> anxiety). Path(...).name uses the platform's path separator.
    file_name = Path(file_path).name
    prefix_to_subreddit = {
        'a': "anxiety",
        'd': "depression",
        'm': "mentalhealth",
        's': "suicidewatch",
        'l': "lonely",
    }
    subreddit_fill_value = prefix_to_subreddit.get(file_name[:1], "")

    print("Using subreddit fill value:", subreddit_fill_value)

    df = pd.read_csv(file_path)
    print("Size of the file before cleaning :(rows, columns)", df.shape)

    # 1) Drop unnecessary columns if present
    print("Columns before dropping unnecessary ones: \n", df.columns.tolist())
    cols_to_drop = [c for c in ['created_utc', 'Unnamed: 0', 'author'] if c in df.columns]
    if cols_to_drop:
        df.drop(columns=cols_to_drop, inplace=True, errors='ignore')
    print("Columns after dropping unnecessary ones: \n", df.columns.tolist())

    # 2) Remove exact duplicate rows
    print("Size of dataframe before removing duplicates:", df.shape)
    print("Sample duplicate rows :")
    display(df[df.duplicated(keep=False)].head(5))
    before_dups = len(df)
    df.drop_duplicates(inplace=True)
    print(f"Removed duplicates: {before_dups - len(df)}")
    print("Duplicate rows after dropping :")
    display(df[df.duplicated(keep=False)].head(5))
    print("Size of dataframe after removing duplicates:", df.shape)

    # 3) Drop rows with missing/empty selftext (ensure string ops are safe)
    print("Size of dataframe before dropping empty/NaN selftext:", df.shape)
    if 'selftext' in df.columns:
        df['selftext'] = df['selftext'].astype('string')
        before_text = len(df)
        # .copy() avoids SettingWithCopyWarning on the column assignments below
        df = df[df['selftext'].notna() & (df['selftext'].str.strip() != "")].copy()
        print(f"Dropped empty/NaN selftext: {before_text - len(df)}")
    print("Size of dataframe after dropping empty/NaN selftext:", df.shape)

    # 4) Fill missing score with 0 (will coerce dtype below)
    if 'score' in df.columns:
        df['score'] = df['score'].fillna(0)

    # 5) Fill missing subreddit with provided value
    target = subreddit_fill_value.lower().strip()
    if 'subreddit' in df.columns:
        df['subreddit'] = df['subreddit'].fillna(target)

    # 6) Fill missing title with 'no title'
    if 'title' in df.columns:
        df['title'] = df['title'].fillna('no title')

    # 7) Normalize text columns (lowercase + strip)
    if 'subreddit' in df.columns:
        df['subreddit'] = df['subreddit'].astype(str).str.lower().str.strip()
    if 'selftext' in df.columns:
        df['selftext'] = df['selftext'].astype(str).str.lower().str.strip()
    if 'title' in df.columns:
        df['title'] = df['title'].astype(str).str.lower().str.strip()

    # 8) Dtype coercion:
    for col in ['subreddit', 'selftext', 'title']:
        if col in df.columns:
            df[col] = df[col].astype('string')
    if 'score' in df.columns:
        # Non-numeric scores become NaN, are filled with 0, then stored as nullable integers
        df['score'] = pd.to_numeric(df['score'], errors='coerce').fillna(0).astype('Int64')
    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

    # 9) Fix swapped subreddit/selftext rows (only if both exist)
    if 'subreddit' in df.columns and 'selftext' in df.columns:
        mask = (df['subreddit'] != target) & (df['selftext'] == target)
        fixed = int(mask.sum())
        if fixed:
            df.loc[mask, ['subreddit', 'selftext']] = df.loc[mask, ['selftext', 'subreddit']].values
            print(f"Fixed swapped subreddit/selftext rows: {fixed}")

    # 10) Keep only rows where subreddit == target
    if 'subreddit' in df.columns:
        before_keep = len(df)
        df = df[df['subreddit'] == target]
        print(f"Dropped rows with subreddit != '{target}': {before_keep - len(df)}")

    # 11) Drop unparseable timestamps (if timestamp exists), reset index
    if 'timestamp' in df.columns:
        before_ts = len(df)
        df = df.dropna(subset=['timestamp'])
        print(f"Dropped unparseable timestamps: {before_ts - len(df)}")

    df.reset_index(drop=True, inplace=True)

    # Summary
    print("### Cleaning complete ###")
    print("Size of the file after cleaning :(rows, columns)", df.shape)
    print("Dtypes:\n", df.dtypes)
    print("Nulls per column:\n", df.isnull().sum())

    # Write cleaned dataframe to new CSV (path built with Path, so it is not tied to Windows separators)
    output_path = Path("../Preprocessed_Data") / f"{Path(file_path).stem}_cleaned.csv"
    df.to_csv(output_path, index=False)
    print("Cleaned data saved to:", output_path)
    print()
    print()

    return df
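Step 9 is the least obvious part of the function, so here is a sketch of it in isolation on a made-up three-row frame (the row contents are invented). It shows a row whose post body landed in `subreddit` and whose subreddit name landed in `selftext` being swapped back, followed by the step 10 filter.

```python
import pandas as pd

target = "anxiety"
toy = pd.DataFrame({
    "subreddit": ["anxiety", "i cannot sleep at night", "depression"],
    "selftext": ["i feel nervous all the time", "anxiety", "feeling low"],
})

# Rows where the two values are swapped: the post body sits in `subreddit`
# and the subreddit name sits in `selftext`.
mask = (toy["subreddit"] != target) & (toy["selftext"] == target)

# .values strips the column labels so the assignment is positional, not aligned by name.
toy.loc[mask, ["subreddit", "selftext"]] = toy.loc[mask, ["selftext", "subreddit"]].values

# Keep only rows that belong to the target subreddit.
toy = toy[toy["subreddit"] == target].reset_index(drop=True)
print(toy)
```

Without `.values`, pandas would align the right-hand side by column name and the assignment would leave the rows unchanged. The third row is a genuine post from another subreddit, so it is dropped rather than repaired.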

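### 4. Call the preprocessing function on each dataset and store the cleaned data
In this step, we call `clean_reddit_data` on each file path; the function writes each cleaned dataset to its own CSV file.

In [17]: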
for file_path in data_file_paths:
    cleaned_df = clean_reddit_data(file_path)

### Working on file: ..\Cumulated_Data\Combined_by_category\anx_data.csv
Using subreddit fill value: anxiety
Size of the file before cleaning :(rows, columns) (280138, 8)
Columns before dropping unnecessary ones: 
 ['Unnamed: 0', 'author', 'created_utc', 'score', 'selftext', 'subreddit', 'title', 'timestamp']
Columns after dropping unnecessary ones: 
 ['score', 'selftext', 'subreddit', 'title', 'timestamp']
Size of dataframe before removing duplicates: (280138, 5)
Sample duplicate rows :


Unnamed: 0,score,selftext,subreddit,title,timestamp
14867,1,I have nightmares every single night. It's a n...,Anxiety,I can't sleep in fear of nightmare but I don't...,2019-12-01 17:39:48
14868,1,I have nightmares every single night. It's a n...,Anxiety,I can't sleep in fear of nightmare but I don't...,2019-12-01 17:39:48
22932,1,[removed],Anxiety,Anyone who takes Lexapro for anxiety and/or de...,2019-01-11 18:27:57
22933,1,[removed],Anxiety,Anyone who takes Lexapro for anxiety and/or de...,2019-01-11 18:27:57
26437,1,I honestly don’t know where to start. Since la...,Anxiety,Good news ruined by anxiety,2019-07-22 14:10:42


Removed duplicates: 78
Duplicate rows after dropping :


Unnamed: 0,score,selftext,subreddit,title,timestamp


Size of dataframe after removing duplicates: (280060, 5)
Size of dataframe before dropping empty/NaN selftext: (280060, 5)
Dropped empty/NaN selftext: 11499
Size of dataframe after dropping empty/NaN selftext: (268561, 5)
Fixed swapped subreddit/selftext rows: 100
Dropped rows with subreddit != 'anxiety': 0
Dropped unparseable timestamps: 0
### Cleaning complete ###
Size of the file after cleaning :(rows, columns) (268561, 5)
Dtypes:
 score                 Int64
selftext     string[python]
subreddit    string[python]
title        string[python]
timestamp    datetime64[ns]
dtype: object
Nulls per column:
 score        0
selftext     0
subreddit    0
title        0
timestamp    0
dtype: int64
Cleaned data saved to: ..\Preprocessed_Data\anx_data_cleaned.csv


### Working on file: ..\Cumulated_Data\Combined_by_category\dep_data.csv
Using subreddit fill value: depression
Size of the file before cleaning :(rows, columns) (650984, 8)
Columns before dropping unnecessary ones: 
 ['Unnamed: 0', 

Unnamed: 0,score,selftext,subreddit,title,timestamp
9318,1.0,What's a minute or two of strangualation for t...,depression,That extension cord os getting mighty fuckin a...,2019-04-11 05:22:36
9319,1.0,What's a minute or two of strangualation for t...,depression,That extension cord os getting mighty fuckin a...,2019-04-11 05:22:36
21911,1.0,When are you gonna surrender type shit.,depression,Life is like besieging yourself,2019-08-16 06:28:19
21912,1.0,When are you gonna surrender type shit.,depression,Life is like besieging yourself,2019-08-16 06:28:19
22006,1.0,It's hard to know how sad other people are so ...,depression,How depressed am I,2019-08-16 02:50:50


Removed duplicates: 328
Duplicate rows after dropping :


Unnamed: 0,score,selftext,subreddit,title,timestamp


Size of dataframe after removing duplicates: (650656, 5)
Size of dataframe before dropping empty/NaN selftext: (650656, 5)
Dropped empty/NaN selftext: 4691
Size of dataframe after dropping empty/NaN selftext: (645965, 5)
Fixed swapped subreddit/selftext rows: 2099
Dropped rows with subreddit != 'depression': 7535
Dropped unparseable timestamps: 16299
### Cleaning complete ###
Size of the file after cleaning :(rows, columns) (622131, 5)
Dtypes:
 score                 Int64
selftext     string[python]
subreddit    string[python]
title        string[python]
timestamp    datetime64[ns]
dtype: object
Nulls per column:
 score        0
selftext     0
subreddit    0
title        0
timestamp    0
dtype: int64
Cleaned data saved to: ..\Preprocessed_Data\dep_data_cleaned.csv


### Working on file: ..\Cumulated_Data\Combined_by_category\lon_data.csv
Using subreddit fill value: lonely
Size of the file before cleaning :(rows, columns) (157361, 8)
Columns before dropping unnecessary ones: 
 ['Unnamed

Unnamed: 0,score,selftext,subreddit,title,timestamp
30091,1,[removed],lonely,Resize my grandma none of my family members ha...,2020-12-14 16:45:27
30092,1,[removed],lonely,Resize my grandma none of my family members ha...,2020-12-14 16:45:27
30093,1,[removed],lonely,Resize my grandma none of my family members ha...,2020-12-14 16:45:27
30260,1,So my birthday is tomorrow as I am posting thi...,lonely,Stuck in thought about life.,2020-12-13 15:25:20
30261,1,So my birthday is tomorrow as I am posting thi...,lonely,Stuck in thought about life.,2020-12-13 15:25:20


Removed duplicates: 62
Duplicate rows after dropping :


Unnamed: 0,score,selftext,subreddit,title,timestamp


Size of dataframe after removing duplicates: (157299, 5)
Size of dataframe before dropping empty/NaN selftext: (157299, 5)
Dropped empty/NaN selftext: 3541
Size of dataframe after dropping empty/NaN selftext: (153758, 5)
Fixed swapped subreddit/selftext rows: 246
Dropped rows with subreddit != 'lonely': 0
Dropped unparseable timestamps: 0
### Cleaning complete ###
Size of the file after cleaning :(rows, columns) (153758, 5)
Dtypes:
 score                 Int64
selftext     string[python]
subreddit    string[python]
title        string[python]
timestamp    datetime64[ns]
dtype: object
Nulls per column:
 score        0
selftext     0
subreddit    0
title        0
timestamp    0
dtype: int64
Cleaned data saved to: ..\Preprocessed_Data\lon_data_cleaned.csv


### Working on file: ..\Cumulated_Data\Combined_by_category\mh_data.csv
Using subreddit fill value: mentalhealth
Size of the file before cleaning :(rows, columns) (294561, 8)
Columns before dropping unnecessary ones: 
 ['Unnamed: 0', '

Unnamed: 0,score,selftext,subreddit,title,timestamp
16921,1,It's becoming increasingly difficult to separa...,mentalhealth,The doublespeak and gaslighting by government ...,2019-01-10 00:41:52
16922,1,It's becoming increasingly difficult to separa...,mentalhealth,The doublespeak and gaslighting by government ...,2019-01-10 00:41:52
30893,1,"Hello Reddit. \nI smoke marijuana, and I've be...",mentalhealth,Need some insight for my situation.,2019-05-11 16:19:02
30894,1,"Hello Reddit. \nI smoke marijuana, and I've be...",mentalhealth,Need some insight for my situation.,2019-05-11 16:19:02
41306,1,"this is going to be pretty long. okay, so:\n\n...",mentalhealth,I’m scared to be alive. Intense fear of future...,2019-09-28 02:01:27


Removed duplicates: 94
Duplicate rows after dropping :


Unnamed: 0,score,selftext,subreddit,title,timestamp


Size of dataframe after removing duplicates: (294467, 5)
Size of dataframe before dropping empty/NaN selftext: (294467, 5)
Dropped empty/NaN selftext: 4569
Size of dataframe after dropping empty/NaN selftext: (289898, 5)
Fixed swapped subreddit/selftext rows: 189
Dropped rows with subreddit != 'mentalhealth': 0
Dropped unparseable timestamps: 0
### Cleaning complete ###
Size of the file after cleaning :(rows, columns) (289898, 5)
Dtypes:
 score                 Int64
selftext     string[python]
subreddit    string[python]
title        string[python]
timestamp    datetime64[ns]
dtype: object
Nulls per column:
 score        0
selftext     0
subreddit    0
title        0
timestamp    0
dtype: int64
Cleaned data saved to: ..\Preprocessed_Data\mh_data_cleaned.csv


### Working on file: ..\Cumulated_Data\Combined_by_category\sw_data.csv
Using subreddit fill value: suicidewatch
Size of the file before cleaning :(rows, columns) (476151, 8)
Columns before dropping unnecessary ones: 
 ['Unnamed: 

Unnamed: 0,score,selftext,subreddit,title,timestamp
45681,1,"Alot has changed, im getting help going to CR ...",SuicideWatch,Its been over a year since my last post.,2019-06-01 20:36:24
45682,1,"Alot has changed, im getting help going to CR ...",SuicideWatch,Its been over a year since my last post.,2019-06-01 20:36:24
45683,1,"Alot has changed, im getting help going to CR ...",SuicideWatch,Its been over a year since my last post.,2019-06-01 20:36:23
45684,1,"Alot has changed, im getting help going to CR ...",SuicideWatch,Its been over a year since my last post.,2019-06-01 20:36:23
45685,1,"Alot has changed, im getting help going to CR ...",SuicideWatch,Its been over a year since my last post.,2019-06-01 20:36:22


Removed duplicates: 227
Duplicate rows after dropping :


Unnamed: 0,score,selftext,subreddit,title,timestamp


Size of dataframe after removing duplicates: (475924, 5)
Size of dataframe before dropping empty/NaN selftext: (475924, 5)
Dropped empty/NaN selftext: 30155
Size of dataframe after dropping empty/NaN selftext: (445769, 5)
Fixed swapped subreddit/selftext rows: 1067
Dropped rows with subreddit != 'suicidewatch': 0
Dropped unparseable timestamps: 0
### Cleaning complete ###
Size of the file after cleaning :(rows, columns) (445769, 5)
Dtypes:
 score                 Int64
selftext     string[python]
subreddit    string[python]
title        string[python]
timestamp    datetime64[ns]
dtype: object
Nulls per column:
 score        0
selftext     0
subreddit    0
title        0
timestamp    0
dtype: int64
Cleaned data saved to: ..\Preprocessed_Data\sw_data_cleaned.csv
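As a sanity check on the log above, the final row count of each file should equal the starting count minus every group of dropped rows. The sketch below copies the logged numbers by hand into a dictionary (the name `reported` and the tuple layout are illustrative) and verifies the arithmetic. The swapped-row fix is left out because it repairs rows without removing any.

```python
# Counts copied from the log above, per file:
# (rows_before, duplicates, empty_selftext, wrong_subreddit, bad_timestamps, rows_after)
reported = {
    "anx": (280138, 78, 11499, 0, 0, 268561),
    "dep": (650984, 328, 4691, 7535, 16299, 622131),
    "lon": (157361, 62, 3541, 0, 0, 153758),
    "mh": (294561, 94, 4569, 0, 0, 289898),
    "sw": (476151, 227, 30155, 0, 0, 445769),
}

for name, (before, *dropped, after) in reported.items():
    expected = before - sum(dropped)
    status = "ok" if expected == after else "MISMATCH"
    print(f"{name}: {before} - {sum(dropped)} = {expected} (logged {after}) {status}")
```

All five files reconcile. Only `dep_data.csv` loses rows to the subreddit filter (7535) and to unparseable timestamps (16299).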




Now that preprocessing is complete, below is a preview of the cleaned dataset for each category.

In [18]:
preprocessed_data_dir = Path("../Preprocessed_Data")
for preprocessed_file in sorted(os.listdir(preprocessed_data_dir)):
    if not preprocessed_file.endswith(".csv"):
        continue  # skip anything that is not a cleaned CSV
    file_path = preprocessed_data_dir / preprocessed_file
    print(f"\nPreview of {preprocessed_file}:")
    df = pd.read_csv(file_path)
    display(df.head())


Preview of anx_data_cleaned.csv:


Unnamed: 0,score,selftext,subreddit,title,timestamp
0,4,hello all. \n\nmy wife has anxiety and lately...,anxiety,wife has anxiety. how can i help?,2019-04-30 23:50:25
1,4,i wanted to write this because i feel there’s ...,anxiety,my anxiety’s kryptonite.,2019-04-30 23:31:49
2,2,"hi all, so i've been taking effexor xr 75 mg f...",anxiety,"while taking effexor, is it okay to take cloni...",2019-04-30 23:06:20
3,8,hi guys!\n\ni've finally come to the conclusio...,anxiety,"after accepting you need help, what was the fi...",2019-04-30 22:49:27
4,1,"essentially, i've had everything from poor sle...",anxiety,i've spent the last few months suffering a ser...,2019-04-30 22:30:12



Preview of dep_data_cleaned.csv:


Unnamed: 0,score,selftext,subreddit,title,timestamp
0,3,does anybody else experience “depression attac...,depression,depression attacks?,2019-04-30 23:50:57
1,2,i am struggling pretty hard this week. im mise...,depression,lifes tough and i need to vent,2019-04-30 23:50:23
2,66,when you make a post here on desktop the tab s...,depression,just a funny depression based thought,2019-04-30 23:50:08
3,2,"hey, for a couple of months i just wasn't able...",depression,how do you manage to get something done?,2019-04-30 23:48:12
4,1,“things aren't the way they were before\n\nyou...,depression,been depressed over a “f”wb. have a song that ...,2019-04-30 23:46:04



Preview of lon_data_cleaned.csv:


Unnamed: 0,score,selftext,subreddit,title,timestamp
0,69,i'm 22 but i despise dating. especially online...,lonely,tired of modern dating,2019-04-30 22:43:30
1,2,i have an hour before i go to school i need so...,lonely,an hour,2019-04-30 22:11:18
2,1,"hey guys, i (m21) am new to this reddit and ho...",lonely,thinking about removing contact with everyone,2019-04-30 20:11:59
3,1,as time went on and school ended it got easier...,lonely,my whole life,2019-04-30 18:21:11
4,2,"hi everyone,\n\ni recently created a new sub r...",lonely,new sub r/lifeafterschool for discussing life ...,2019-04-30 16:00:09



Preview of mh_data_cleaned.csv:


Unnamed: 0,score,selftext,subreddit,title,timestamp
0,5,sory sorry i know probably not the place to po...,mentalhealth,i think depreison is makng me insane,2019-04-30 23:59:46
1,9,passing the time can be hard and boring. what'...,mentalhealth,what are your low effort hobbies?,2019-04-30 23:25:01
2,1,my life is falling apart. i have nothing good ...,mentalhealth,why am i so jealous and envious?,2019-04-30 23:01:54
3,4,"like of you see say, physical scars, on a youn...",mentalhealth,"how do manditory reporters (teachers, doctors,...",2019-04-30 22:32:11
4,1,i suffer from dissociative identity disorder (...,mentalhealth,dissociative identity disorder,2019-04-30 22:16:41



Preview of sw_data_cleaned.csv:


Unnamed: 0,score,selftext,subreddit,title,timestamp
0,1,...,suicidewatch,we want to kill ourselves because some of us e...,2019-04-30 23:58:50
1,1,i’m sitting on the couch in hysterics right no...,suicidewatch,feeling pretty bad,2019-04-30 23:51:19
2,1,...,suicidewatch,we want to kill ourselves due to a simple fact...,2019-04-30 23:51:00
3,1,i have the urge to tell someone about my story...,suicidewatch,"a life of shame. i've withdrawn from everyone,...",2019-04-30 23:44:18
4,1,currently caught up in one of the deepest and ...,suicidewatch,i think i’m slowly losing my mind,2019-04-30 23:42:02
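CSV files do not store dtypes, so a plain `pd.read_csv` gives back default types rather than the `Int64`, `string`, and `datetime64[ns]` columns set during cleaning. Here is a minimal sketch of restoring them at load time; the helper name `load_cleaned` is hypothetical, and the single-row CSV text is an abridged stand-in for a cleaned file.

```python
import io

import pandas as pd

# Stand-in for one cleaned file; the columns match the cleaned output above.
csv_text = (
    "score,selftext,subreddit,title,timestamp\n"
    "4,hello all.,anxiety,wife has anxiety. how can i help?,2019-04-30 23:50:25\n"
)


def load_cleaned(source):
    """Read a cleaned CSV and restore the dtypes set during cleaning."""
    return pd.read_csv(
        source,
        dtype={
            "score": "Int64",
            "selftext": "string",
            "subreddit": "string",
            "title": "string",
        },
        parse_dates=["timestamp"],
    )


df = load_cleaned(io.StringIO(csv_text))
print(df.dtypes)
```

The same call accepts a file path in place of the `io.StringIO` object, for example one of the `*_cleaned.csv` files written above.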


### At this point, the data is preprocessed and cleaned, ready for analysis and model training.