1. All text lowercase → ideal for token consistency.
2. No URLs, hashtags, markdown, or HTML → avoids token noise.
3. No punctuation clutter → helps AntConc tokenize cleanly.
4. Natural spacing → each line is a full, analyzable sentence/paragraph.
5. Accents, emojis, and symbols removed → clean ASCII for corpus tools.
6. Joined into plain continuous sentences → perfect for collocation, keyword, and concordance analysis.
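
The properties in this checklist can be verified mechanically before a file is loaded into AntConc. The following is a minimal sketch, assuming one entry per line; the helper name `check_corpus_line` and the exact rules are illustrative assumptions and are not part of the original notebook.

```python
import re

def check_corpus_line(line):
    """Return a list of problems found in one line of a cleaned corpus file.

    Hypothetical helper: the rules mirror the checklist above and are
    assumptions about what "clean" means for this corpus.
    """
    problems = []
    if line != line.lower():
        problems.append("contains uppercase letters")
    if re.search(r'https?://|www\.', line):
        problems.append("contains a URL")
    if not line.isascii():
        problems.append("contains non-ASCII characters")
    if re.search(r'[^a-z ]', line.lower()):
        problems.append("contains digits, punctuation, or symbols")
    if line != ' '.join(line.split()):
        problems.append("has leading, trailing, or repeated whitespace")
    return problems

print(check_corpus_line("feel tired every day"))             # no problems expected
print(check_corpus_line("Check  https://example.com #sad"))  # several problems expected
```

An empty list means the line satisfies every rule; running the check over each line of a `*_cleaned.txt` file would flag entries that slipped through cleaning.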

In [3]:
import pandas as pd
import numpy as np
df = pd.read_csv("dataset/Dataset.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,text,title,target
0,0,Welcome to /r/depression's check-in post - a p...,"Regular check-in post, with information about ...",1
1,1,We understand that most people who reply immed...,Our most-broken and least-understood rules is ...,1
2,2,Anyone else just miss physical touch? I crave ...,"I haven’t been touched, or even hugged, in so ...",1
3,3,I’m just so ashamed. Everyone and everything f...,Being Depressed is Embarrassing,1
4,4,I really need a friend. I don't even have a si...,I'm desperate for a friend and to feel loved b...,1


In [2]:
df = df[['text', 'title', 'target']]
df.head()

Unnamed: 0,text,title,target
0,Welcome to /r/depression's check-in post - a p...,"Regular check-in post, with information about ...",1
1,We understand that most people who reply immed...,Our most-broken and least-understood rules is ...,1
2,Anyone else just miss physical touch? I crave ...,"I haven’t been touched, or even hugged, in so ...",1
3,I’m just so ashamed. Everyone and everything f...,Being Depressed is Embarrassing,1
4,I really need a friend. I don't even have a si...,I'm desperate for a friend and to feel loved b...,1


In [3]:
df = df.dropna(subset=['text', 'title', 'target'])
df

Unnamed: 0,text,title,target
0,Welcome to /r/depression's check-in post - a p...,"Regular check-in post, with information about ...",1
1,We understand that most people who reply immed...,Our most-broken and least-understood rules is ...,1
2,Anyone else just miss physical touch? I crave ...,"I haven’t been touched, or even hugged, in so ...",1
3,I’m just so ashamed. Everyone and everything f...,Being Depressed is Embarrassing,1
4,I really need a friend. I don't even have a si...,I'm desperate for a friend and to feel loved b...,1
...,...,...,...
5952,I’ve (24M) dealt with depression/anxiety for y...,Nobody takes me seriously,4
5953,"""I don't feel very good, it's like I don't be...",selfishness,4
5954,"I can't sleep most of the nights, meds didn't ...",Is there any way to sleep better?,4
5955,"Hi, all. I have to give a presentation at work...",Public speaking tips?,4


In [4]:
df['target'].value_counts()

target
1    1202
4    1144
0    1099
2    1085
3    1077
Name: count, dtype: int64

In [5]:
# Define label mapping
label_map = {
    0: 'Stress',
    1: 'Depression',
    2: 'Bipolar_Disorder',
    3: 'Personality_Disorder',
    4: 'Anxiety'
}

# Split and save
for target, label in label_map.items():
    subset = df[df['target'] == target]
    subset.to_csv(f"{label}.csv", index=False)
    print(f"{label}.csv saved with {len(subset)} rows.")


Stress.csv saved with 1099 rows.
Depression.csv saved with 1202 rows.
Bipolar_Disorder.csv saved with 1085 rows.
Personality_Disorder.csv saved with 1077 rows.
Anxiety.csv saved with 1144 rows.


In [4]:
stress_df = pd.read_csv("Stress.csv")
stress_df.head()

len(stress_df)  # number of rows in Stress.csv

1099

In [None]:
depression_df = pd.read_csv("dataset/Depression.csv")

depression_df.head()

len(depression_df)

1202

In [None]:
bipolar_disorder_df = pd.read_csv("dataset/Bipolar_Disorder.csv")
bipolar_disorder_df.head()

len(bipolar_disorder_df)

1085

In [None]:
personality_disorder_df = pd.read_csv("dataset/Personality_Disorder.csv")
personality_disorder_df.head()

len(personality_disorder_df)

1077

In [None]:
anxiety_df = pd.read_csv("dataset/Anxiety.csv")
anxiety_df.head()

len(anxiety_df)

1144

In [5]:
import re

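def clean_text(text):
    """Lowercase the text and keep only ASCII letters separated by single spaces."""
    # Remove URLs (http\S+ also covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove everything except ASCII letters and whitespace
    # (digits, punctuation, apostrophes, accented letters, emojis)
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text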
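
To see what `clean_text` does in practice, the sketch below applies the same steps to a made-up sample post (the sample string is an illustrative assumption, not taken from the dataset). Apostrophes and digits are removed along with other punctuation, so a contraction such as "don't" collapses to "dont".

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text; http\S+ already covers https links
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'[^A-Za-z\s]', '', text)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Made-up sample, not a row from Dataset.csv
sample = "I don't sleep!! See https://example.com/help  (3 nights now)"
print(clean_text(sample))
# i dont sleep see nights now
```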

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')  # needed for lemmatization in some setups

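# Build these once; recreating them on every call is slow when applied row by row
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    """Clean, tokenize, remove stopwords, and lemmatize; returns a list of tokens."""
    # Step 1: Clean text
    text = clean_text(text)

    # Step 2: Tokenize
    tokens = word_tokenize(text)

    # Step 3: Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Step 4: Lemmatize (the default part of speech is noun)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    return tokens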


[nltk_data] Downloading package punkt to /Users/joanneloi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joanneloi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/joanneloi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/joanneloi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
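
One interaction between the two functions is worth illustrating: `clean_text` strips apostrophes before stopword removal, so a contraction like "don't" reaches the filter as "dont" and no longer matches a stopword list that stores the apostrophe form. The sketch below shows one possible workaround, adding apostrophe-free variants to the stopword set. The tiny `sample_stop_words` set and the helper name `with_apostrophe_free_variants` are stand-ins for illustration; in the notebook the set would come from `stopwords.words('english')`.

```python
def with_apostrophe_free_variants(stop_words):
    """Return the stopword set plus each entry with apostrophes removed."""
    return set(stop_words) | {word.replace("'", "") for word in stop_words}

# Stand-in for set(stopwords.words('english')); hypothetical sample only
sample_stop_words = {"i", "don't", "it's", "the"}

# Tokens as they would look after clean_text has removed apostrophes
tokens = "i dont think its the meds".split()

plain = [t for t in tokens if t not in sample_stop_words]
extended_set = with_apostrophe_free_variants(sample_stop_words)
extended = [t for t in tokens if t not in extended_set]

print(plain)     # ['dont', 'think', 'its', 'meds']
print(extended)  # ['think', 'meds']
```

A side effect of this approach is that forms such as "cant" and "wont", which are also dictionary words, get filtered as well; whether that matters depends on the analysis.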


In [13]:
from nltk import data
try:
    data.find('tokenizers/punkt')
    data.find('corpora/stopwords')
    data.find('corpora/wordnet')
    print("All resources are available.")
except LookupError as e:
    print(e)


All resources are available.


In [None]:
import os
import nltk

# Make sure NLTK knows where to look
nltk_dir = os.path.expanduser("~/nltk_data")
nltk.data.path.append(nltk_dir)

# Download everything preprocess_text() needs
for resource in ['punkt', 'punkt_tab', 'stopwords', 'wordnet', 'omw-1.4']:
    nltk.download(resource, download_dir=nltk_dir)

print("All NLTK resources are now installed and ready.")


All NLTK resources are now installed and ready.


[nltk_data] Downloading package punkt to /Users/joanneloi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/joanneloi/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joanneloi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/joanneloi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/joanneloi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [16]:
from unidecode import unidecode

for label in label_map.values():
    print(f"\nProcessing dataset: {label}.csv")

    df_temp = pd.read_csv(f"{label}.csv")

    # --- Stage 1: Basic Cleaning ---
    df_temp['text'] = df_temp['text'].astype(str).apply(clean_text)
    df_temp['title'] = df_temp['title'].astype(str).apply(clean_text)
    df_temp.to_csv(f"{label}_cleaned.csv", index=False, encoding='utf-8')
    print(f"{label}_cleaned.csv saved (basic cleaning done).")

    # --- Stage 2: Full Preprocessing ---
    df_temp['text'] = df_temp['text'].astype(str).apply(lambda x: ' '.join(preprocess_text(x)))
    df_temp['title'] = df_temp['title'].astype(str).apply(lambda x: ' '.join(preprocess_text(x)))

    # Now both columns are strings again
    df_temp['combined'] = df_temp['title'] + " " + df_temp['text']

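    # Normalize to ASCII. clean_text() has already dropped every non-ASCII
    # character, so unidecode is only a safeguard at this point.
    df_temp['combined'] = df_temp['combined'].apply(unidecode)

    # Drop exact duplicates of the combined title + text
    before = len(df_temp)
    df_temp = df_temp.drop_duplicates(subset=['combined'])
    after = len(df_temp)
    print(f"Removed {before - after} duplicates. {after} entries remain.")

    # Drop near-empty entries (3 characters or fewer); the count printed above
    # does not reflect this filter
    df_temp = df_temp[df_temp['combined'].str.strip().str.len() > 3]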

    # Export for AntConc
    df_temp[['combined']].to_csv(f"{label}_cleaned.txt", index=False, header=False, encoding='utf-8')
    print(f"{label}_cleaned.txt ready for AntConc analysis.")


Processing dataset: Stress.csv
Stress_cleaned.csv saved (basic cleaning done).
Removed 293 duplicates. 806 entries remain.
Stress_cleaned.txt ready for AntConc analysis.

Processing dataset: Depression.csv
Depression_cleaned.csv saved (basic cleaning done).
Removed 228 duplicates. 974 entries remain.
Depression_cleaned.txt ready for AntConc analysis.

Processing dataset: Bipolar_Disorder.csv
Bipolar_Disorder_cleaned.csv saved (basic cleaning done).
Removed 276 duplicates. 809 entries remain.
Bipolar_Disorder_cleaned.txt ready for AntConc analysis.

Processing dataset: Personality_Disorder.csv
Personality_Disorder_cleaned.csv saved (basic cleaning done).
Removed 185 duplicates. 892 entries remain.
Personality_Disorder_cleaned.txt ready for AntConc analysis.

Processing dataset: Anxiety.csv
Anxiety_cleaned.csv saved (basic cleaning done).
Removed 199 duplicates. 945 entries remain.
Anxiety_cleaned.txt ready for AntConc analysis.
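
Because the cleaning regex keeps only `A-Za-z` and whitespace, accented letters are deleted before `unidecode` runs, so a word like "café" ends up as "caf" rather than "cafe". If transliteration is preferred over deletion, the normalization step has to come first. The sketch below illustrates the ordering, using the standard library's `unicodedata` as a stand-in for `unidecode` (an assumption made to keep the example dependency-free; `unidecode` covers more characters). The function names `strip_accents` and `keep_letters` are hypothetical.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def keep_letters(text):
    """Same letter filter as clean_text: ASCII letters and whitespace only."""
    text = re.sub(r'[^A-Za-z\s]', '', text).lower()
    return re.sub(r'\s+', ' ', text).strip()

# "Café déjà vu", written with escapes so the accents are precomposed characters
sample = "Caf\u00e9 d\u00e9j\u00e0 vu"

print(keep_letters(sample))                 # caf dj vu
print(keep_letters(strip_accents(sample)))  # cafe deja vu
```

Applying this ordering to the notebook would mean transliterating the raw `text` and `title` columns before `clean_text` is called, which would change the token counts reported above.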
