## DATA602 - Project: VentBuddy - A Data-Driven Companion for Mental Health Support
Team Members:
- Venkata Sai Pradyumna Nadella - 122096059
- Venkata Sai Sri Pujya Kothapalli - 121985248
- Mridhula Senthilkumar - 121944900

## Data Cumulation


This notebook collects the raw data spread across multiple folders and combines it into one dataset per category for further analysis and model training.

Below are the steps followed:
1. Import the necessary libraries and define the base directory containing the data folders.
2. Rename the folders systematically for easier access.
3. Handle the corrupted file.
4. Rename the files inside the folders systematically.
5. Combine the monthly data into one dataset per year (one file per category), then combine the data from all years into a final dataset per category for further processing.

### 1. Importing required libraries and defining base directory

In [1]:
import os
import pandas as pd
from pathlib import Path
import re
import csv
import hashlib
import ast

print(os.getcwd())
os.chdir("OriginalRedditDataSet/Raw_Data/")
print(os.getcwd())

c:\Users\prady\Desktop\UMCP_Courses\DATA602\Project\VentBuddy-A-Data-Driven-Companion-for-Mental-Health-Support
c:\Users\prady\Desktop\UMCP_Courses\DATA602\Project\VentBuddy-A-Data-Driven-Companion-for-Mental-Health-Support\OriginalRedditDataSet\Raw_Data
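
Because the `os.chdir` call above uses a relative path, running the cell a second time would look for `OriginalRedditDataSet/Raw_Data/` inside the directory it has already entered. One possible way to make the step safe to re-run is sketched below. The helper name `enter_data_dir` is hypothetical and not part of the notebook.

```python
import os
from pathlib import Path

def enter_data_dir(relative_dir):
    """Change into relative_dir unless the working directory already ends with it.

    Hypothetical helper, shown only as a sketch.
    """
    target_parts = Path(relative_dir).parts
    cwd = Path.cwd()
    # Already inside the data directory: the trailing path components match.
    if cwd.parts[-len(target_parts):] == target_parts:
        return cwd
    os.chdir(cwd / relative_dir)
    return Path.cwd()
```

Calling `enter_data_dir("OriginalRedditDataSet/Raw_Data/")` in place of the bare `os.chdir` would change directory on the first run and do nothing on later runs.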


### 2. Renaming Folders Systematically
In this step, we rename the month folders inside each year folder to a consistent format (for example, May 21) to ensure consistency and ease of access.
* The folders only need to be renamed once. When the code is run a second time, folders that already have the standard name are left unchanged.

In [None]:
import os
import re
from pathlib import Path

base_dir = Path(os.getcwd())
month_map = {
    "jan": "Jan",
    "feb": "Feb",
    "mar": "Mar",
    "apr": "Apr",
    "may": "May",
    "jun": "Jun",
    "june": "Jun",
    "jul": "Jul",
    "aug": "Aug",
    "sep": "Sep",
    "sept": "Sep",
    "oct": "Oct",
    "nov": "Nov",
    "dec": "Dec"
}

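def normalize_folder_name(folder_name, year):
    """Return the standardized "Mon YY" folder name, or None if no month is found."""
    name = folder_name.strip().lower()
    # Remove any whitespace before a trailing number, e.g. "jan 19" -> "jan19".
    name = re.sub(r"\s*(\d+)$", r"\1", name)
    # Take the first month key that occurs anywhere in the name.
    month_key = None
    for k in month_map:
        if k in name:
            month_key = k
            break
    if not month_key:
        return None
    month_std = month_map[month_key]
    # The year comes from the parent year folder, not from the subfolder name.
    year_suffix = str(year)[-2:]
    return f"{month_std} {year_suffix}"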

for year_folder in sorted(base_dir.iterdir()):
    if year_folder.is_dir():
        year = year_folder.name
        print("\nProcessing:",year_folder.name)
        for subfolder in year_folder.iterdir():
            if subfolder.is_dir():
                new_name = normalize_folder_name(subfolder.name, year)
                if new_name:
                    new_path = subfolder.parent / new_name
                    if subfolder != new_path:
                        subfolder.rename(new_path)
                        print("- Renamed: "+subfolder.name+" to "+new_name)
                    else:
                        print("- No change needed: ",subfolder.name)
                else:
                    print("Skipped unrecognized: ",subfolder.name)
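
To see what the normalization does without touching the file system, the following sketch applies a simplified copy of the logic to a few folder names. The example names are hypothetical and were not taken from the dataset; the whitespace-stripping regex is left out because it does not affect the result.

```python
# Same abbreviations as month_map above; "june" and "sept" collapse
# onto the same three-letter form as "jun" and "sep".
MONTHS = {
    "jan": "Jan", "feb": "Feb", "mar": "Mar", "apr": "Apr",
    "may": "May", "jun": "Jun", "june": "Jun", "jul": "Jul",
    "aug": "Aug", "sep": "Sep", "sept": "Sep", "oct": "Oct",
    "nov": "Nov", "dec": "Dec",
}

def normalize_folder_name(folder_name, year):
    """Return "Mon YY" for the first month key found in the name, else None."""
    name = folder_name.strip().lower()
    for key, std in MONTHS.items():
        if key in name:
            return f"{std} {str(year)[-2:]}"
    return None

# Hypothetical folder names inside a "2019" year folder.
examples = ["june 2019", "Sept19", "Jan 19", "misc"]
results = {n: normalize_folder_name(n, "2019") for n in examples}
for old, new in results.items():
    print(f"{old!r} -> {new!r}")
```

A name that is already in the standard form, such as `Jan 19`, maps to itself, which is why a second run reports no change for it. A name with no recognizable month, such as `misc`, returns `None` and is skipped.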

### 3. Handling the corrupted file
The file depmay21.numbers is the only file of type .numbers, and it is corrupted.
It was therefore converted to .csv manually and added to the folder.
The data layout in this file is also inconsistent with the other files,
so it is handled separately.
* This is a one-time process, so the code is commented out after execution.

In [None]:
import pandas as pd
import ast

df = pd.read_csv("2021\\May 21\\depmay21_old.csv")

def safe_eval(x):
    if pd.isna(x):
        return {}
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError):
        return {}

df_d_parsed = df['d_'].apply(safe_eval)
expanded_df = df_d_parsed.apply(pd.Series)

df=expanded_df

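# Convert the Unix epoch seconds to US/Eastern local time.
df['created'] = pd.to_datetime(df['created'], unit='s', utc=True)
df['created'] = df['created'].dt.tz_convert('US/Eastern')
# "%#m" and "%#d" drop the leading zero on Windows only; on Linux/macOS use "%-m" and "%-d".
df['created'] = df['created'].dt.strftime("%#m/%#d/%Y %H:%M")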

df.rename(columns={'created': 'timestamp'}, inplace=True)

df['Unnamed: 0'] = range(len(df))
df = df[['Unnamed: 0', 'author', 'created_utc', 'score', 'selftext', 'subreddit', 'title', 'timestamp']]

df.to_csv("2021\\May 21\\depmay21.csv", index=False)
print("Updated file saved successfully!")
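
The core of this fix is turning a column of stringified dictionaries into real columns. The sketch below shows the same idea with the standard library only, on made-up cell values (the authors, scores, and timestamp are hypothetical). Unlike the notebook's `safe_eval`, this variant also returns `{}` when the parsed value is not a dictionary, which is an added assumption.

```python
import ast

def safe_eval(value):
    """Parse a stringified dict; return {} for missing or malformed input."""
    if not isinstance(value, str):
        return {}
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return {}
    # Extra guard (not in the notebook): ignore literals that are not dicts.
    return parsed if isinstance(parsed, dict) else {}

# Hypothetical contents of the 'd_' column.
raw_cells = [
    "{'author': 'user_a', 'score': 3, 'created': 1620000000}",
    "{'author': 'user_b', 'score'",  # truncated cell, cannot be parsed
    None,                            # missing cell
]

records = [safe_eval(cell) for cell in raw_cells]

# Expand the dicts into a rectangular table, one column per key.
columns = sorted({key for rec in records for key in rec})
rows = [[rec.get(col) for col in columns] for rec in records]
print(columns)
for row in rows:
    print(row)
```

Rows that could not be parsed end up with empty values in every column, which mirrors what `apply(pd.Series)` produces for an empty dictionary.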

### 4. Renaming all the filenames systematically
In this step, we rename all the files in a systematic way to ensure consistency and ease of access.
This is an important step because the original names mix conventions: some files use camel case, some use underscores, some use hyphens, and so on.
Standardizing the filenames makes the files easy to read programmatically.
* The files only need to be renamed once. When the code is run a second time, files that already have the standard name are left unchanged.

In [None]:
import os
import re
import csv
from pathlib import Path

base_dir = Path(os.getcwd())
apply_rename = True
only_csv = True
log_path = Path("../data/Processed/rename_log.csv")
log_path.parent.mkdir(parents=True, exist_ok=True)

month_keys = {
    "jan": "jan", "january": "jan",
    "feb": "feb", "february": "feb",
    "mar": "mar", "march": "mar",
    "apr": "apr", "april": "apr",
    "may": "may",
    "jun": "jun", "june": "jun",
    "jul": "jul", "july": "jul",
    "aug": "aug", "august": "aug",
    "sep": "sep", "sept": "sep", "september": "sep",
    "oct": "oct", "october": "oct",
    "nov": "nov", "november": "nov",
    "dec": "dec", "december": "dec"
}

year_re = re.compile(r"(20\d{2}|\d{2})")
alpha_prefix_re = re.compile(r"^[a-zA-Z]")

first_letter_map = {
    "a": "anx",
    "d": "dep",
    "m": "mh",
    "s": "sw",
    "l": "lon"
}

def first_letter_category(stem_lower):
    if not stem_lower:
        return None
    first = stem_lower[0]
    return first_letter_map.get(first, None)

def extract_month(stem_lower, parent_month_name=None):
    for k in month_keys:
        if k in stem_lower:
            return month_keys[k]
    if parent_month_name:
        pm = parent_month_name.lower()
        for k in month_keys:
            if k in pm:
                return month_keys[k]
    return None

def extract_year(stem_lower, parent_month_name=None, parent_year_folder=None):
    m = year_re.search(stem_lower)
    if m:
        y = m.group(0)
        return y[-2:]
    if parent_month_name:
        m2 = year_re.search(parent_month_name)
        if m2:
            return m2.group(0)[-2:]
    if parent_year_folder:
        if parent_year_folder.isdigit() and len(parent_year_folder) == 4:
            return parent_year_folder[-2:]
        m3 = year_re.search(parent_year_folder)
        if m3:
            return m3.group(0)[-2:]
    return None

rename_records = []
skipped = []

if not base_dir.exists():
    raise FileNotFoundError(f"Base directory not found: {base_dir}")

planned_count = 0
for year_dir in sorted([p for p in base_dir.iterdir() if p.is_dir()]):
    for month_dir in sorted([m for m in year_dir.iterdir() if m.is_dir()]):
        for file in sorted([f for f in month_dir.iterdir() if f.is_file()]):
            if only_csv and file.suffix.lower() != ".csv":
                skipped.append((file, "non-csv"))
                rename_records.append([year_dir.name, month_dir.name, file.name, "", "skipped", "non-csv"])
                continue
            
            stem = file.stem
            stem_lower = stem.lower()
                        
            category = first_letter_category(stem_lower)
            month_extracted = extract_month(stem_lower, parent_month_name=month_dir.name)
            year_extracted = extract_year(stem_lower, parent_month_name=month_dir.name, parent_year_folder=year_dir.name)
            
            reasons = []
            if category is None:
                reasons.append("category-not-detected")
            if month_extracted is None:
                reasons.append("month-not-detected")
            if year_extracted is None:
                reasons.append("year-not-detected")
            
            if reasons:
                reason_text = ";".join(reasons)
                skipped.append((file, reason_text))
                rename_records.append([year_dir.name, month_dir.name, file.name, "", "skipped", reason_text])
                continue
            
            new_name_stem = f"{category}{month_extracted}{year_extracted}"
            new_name = new_name_stem + file.suffix.lower()
            new_path = file.with_name(new_name)
            
            counter = 1
            while new_path.exists() and new_path.resolve() != file.resolve():
                new_name = f"{new_name_stem}_{counter}{file.suffix.lower()}"
                new_path = file.with_name(new_name)
                counter += 1
            
            planned_count += 1
            try:
                if apply_rename:
                    file.rename(new_path)
                    print(f"RENAMED: {file.relative_to(base_dir)}  ->  {new_path.relative_to(base_dir)}")
                    rename_records.append([year_dir.name, month_dir.name, file.name, new_path.name, "renamed", ""])
                else:
                    print(f"WOULD RENAME: {file.relative_to(base_dir)}  ->  {new_path.relative_to(base_dir)}")
                    rename_records.append([year_dir.name, month_dir.name, file.name, new_path.name, "planned", ""])
            except Exception as e:
                err = f"rename-failed:{e}"
                print(f"ERROR renaming {file}: {err}")
                skipped.append((file, err))
                rename_records.append([year_dir.name, month_dir.name, file.name, "", "failed", err])

print(f"\nSummary: attempted renames = {planned_count}, skipped = {len(skipped)}")

with open(log_path, "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["year_folder", "month_folder", "old_name", "new_name", "status", "reason"])
    writer.writerows(rename_records)

print(f"Rename log saved to: {log_path}")

if skipped:
    print("\n=== SKIPPED FILES (relative_path : reason) ===")
    for fpath, reason in skipped:
        try:
            print(f"{fpath.relative_to(base_dir)}  :  {reason}")
        except Exception:
            print(f"{fpath}  :  {reason}")
else:
    print("\nNo files were skipped.")
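
The naming rule used above can be summarized as category prefix + three-letter month + two-digit year. The sketch below condenses that rule into one function and runs it on hypothetical filenames, without renaming anything. `standard_name` is an illustrative helper, not part of the notebook, and it omits the fallback to the year folder name and the `_1`, `_2` collision suffix.

```python
import re

FIRST_LETTER = {"a": "anx", "d": "dep", "m": "mh", "s": "sw", "l": "lon"}
# Full month names contain their three-letter form, so these keys suffice.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
YEAR_RE = re.compile(r"(20\d{2}|\d{2})")

def find_month(text):
    """Return the first three-letter month key contained in text, or None."""
    return next((m for m in MONTHS if m in text.lower()), None)

def standard_name(filename, month_folder):
    """Build '<cat><mon><yy>.<ext>', falling back to the month folder name."""
    stem, _, suffix = filename.rpartition(".")
    stem = stem.lower()
    category = FIRST_LETTER.get(stem[:1])
    month = find_month(stem) or find_month(month_folder)
    year_match = YEAR_RE.search(stem) or YEAR_RE.search(month_folder)
    if not (category and month and year_match):
        return None  # the notebook logs such files as skipped
    return f"{category}{month}{year_match.group(0)[-2:]}.{suffix.lower()}"

# Hypothetical filenames and their parent month folders.
print(standard_name("Anxiety_Jan_2019.csv", "Jan 19"))
print(standard_name("suicide-watch.csv", "Aug 20"))
print(standard_name("depMay21.csv", "May 21"))
print(standard_name("notes.csv", "May 21"))
```

When the filename itself carries no month or year, as in `suicide-watch.csv`, both are taken from the parent folder name. A file whose first letter is not one of `a`, `d`, `m`, `s`, `l` yields `None`.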


### 5. Combining data from each month into a single dataset for that year and then combining data from all years
In this step, we first combine the monthly files of each category into a single dataset for that year, and then combine the per-year datasets across all years.
This results in one final dataset per category to be used for further processing.
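
The merge relies on two filename patterns to tell monthly files from per-year files. The sketch below copies the two regular expressions used in the merge code and applies them to a few hypothetical filenames. The `classify` helper is illustrative and not part of the notebook.

```python
import re

monthly_pattern = re.compile(
    r"^(?P<cat>[a-z]+)(?P<mon>[a-z]{3})(?P<yy>\d{2})\.csv$", re.IGNORECASE
)
per_year_pattern = re.compile(
    r"^(?P<cat>[a-z]+)[_]?[_]?(?P<year>\d{4})\.csv$", re.IGNORECASE
)

def classify(name):
    """Label a filename as a monthly file, a per-year file, or unmatched."""
    m = monthly_pattern.match(name)
    if m:
        return ("monthly", m.group("cat").lower(), m.group("mon").lower(), m.group("yy"))
    m = per_year_pattern.match(name)
    if m:
        return ("per_year", m.group("cat").lower(), m.group("year"))
    return ("unmatched",)

# Hypothetical filenames.
for name in ["depmay21.csv", "mhjan19.csv", "dep_2021.csv", "depmay21_1.csv"]:
    print(name, "->", classify(name))
```

Note that a name carrying the `_1` collision suffix from the renaming step matches neither pattern, so such a file would be reported as a pattern mismatch rather than merged.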

In [None]:
# Two-step merge:
# Step A: create per-year per-category files inside each year folder
# Step B: combine per-year files across years into final per-category files

import os
import pandas as pd
from pathlib import Path
import re
import csv

base_dir = Path(os.getcwd())
per_year_log = Path("../Cumulated_Data/Processed/per_year_merge_log.csv")
final_log = Path("../Cumulated_Data/Processed/final_combine_log.csv")
final_outdir = Path("../Cumulated_Data/Combined_by_category")

dedupe_per_year = True
dedupe_final = True
categories = ["anx", "dep", "sw", "lon", "mh"]

monthly_pattern = re.compile(r"^(?P<cat>[a-z]+)(?P<mon>[a-z]{3})(?P<yy>\d{2})\.csv$", re.IGNORECASE)
per_year_pattern = re.compile(r"^(?P<cat>[a-z]+)[_]?[_]?(?P<year>\d{4})\.csv$", re.IGNORECASE)

final_outdir.mkdir(parents=True, exist_ok=True)
per_year_log.parent.mkdir(parents=True, exist_ok=True)
final_log.parent.mkdir(parents=True, exist_ok=True)

def normalize_cat(pref: str):
    p = pref.lower()
    if p.startswith("an"):
        return "anx"
    if p.startswith("de"):
        return "dep"
    if p.startswith("su") or p.startswith("sw"):
        return "sw"
    if p.startswith("lo"):
        return "lon"
    if p.startswith("m"):
        return "mh"
    return None

# STEP A: Per-year merges
per_year_records = []
all_skipped = []

for year_dir in sorted([p for p in base_dir.iterdir() if p.is_dir()]):
    year = year_dir.name
    print(f"\n=== Processing Year: {year} ===")
    monthly_files = sorted(year_dir.rglob("*.csv"))
    
    cat_to_files = {c: [] for c in categories}
    skipped_discovery = []
    for f in monthly_files:
        m = monthly_pattern.match(f.name)
        if not m:
            if per_year_pattern.match(f.name):
                continue
            skipped_discovery.append((f, "pattern-mismatch"))
            continue
        cat_raw = m.group("cat")
        cat = normalize_cat(cat_raw)
        yy = m.group("yy")
        if year[-2:] != yy:
            skipped_discovery.append((f, f"year-mismatch (file yy={yy})"))
            continue
        if cat:
            cat_to_files[cat].append(f)
        else:
            skipped_discovery.append((f, "unrecognized-category"))
    
    for cat in categories:
        print(f"  {cat}: {len(cat_to_files[cat])} monthly files found")
    if skipped_discovery:
        print(f" - {len(skipped_discovery)} files skipped during discovery in {year}:")
        for sf, reason in skipped_discovery[:20]:
            print(f"     - {sf.relative_to(base_dir)} ({reason})")
        if len(skipped_discovery) > 20:
            print(f"     ... and {len(skipped_discovery)-20} more")
    all_skipped.extend(skipped_discovery)
    
    for cat in categories:
        files_list = cat_to_files.get(cat, [])
        out_file = year_dir / f"{cat}_{year}.csv"
        if not files_list:
            per_year_records.append([year, cat, 0, 0, 0, str(out_file), "no_files"])
            print(f"   - {cat}: no monthly files -> skipped creating {out_file.name}")
            continue
        
        dfs = []
        rows_read = 0
        failed_reads = []
        for f in files_list:
            try:
                if f.stat().st_size == 0:
                    failed_reads.append((f, "empty"))
                    print(f" - Skipped empty: {f.relative_to(base_dir)}")
                    continue
                df = pd.read_csv(f, low_memory=False)
                dfs.append(df)
                rows_read += len(df)
                print(f"     read: {f.relative_to(base_dir)} ({len(df)} rows)")
            except pd.errors.EmptyDataError:
                failed_reads.append((f, "EmptyDataError"))
                print(f" - EmptyDataError: {f.relative_to(base_dir)}")
            except pd.errors.ParserError:
                failed_reads.append((f, "ParserError"))
                print(f" - ParserError: {f.relative_to(base_dir)}")
            except Exception as e:
                failed_reads.append((f, f"error:{e}"))
                print(f" - Error reading {f.relative_to(base_dir)}: {e}")
        
        if not dfs:
            per_year_records.append([year, cat, len(files_list), 0, 0, str(out_file), "all_reads_failed"])
            print(f"   - {cat}: no readable monthly CSVs; skipping {out_file.name}")
            continue
        
        try:
            combined = pd.concat(dfs, ignore_index=True, sort=False)
        except Exception as e:
            per_year_records.append([year, cat, len(files_list), rows_read, 0, str(out_file), f"concat_failed:{e}"])
            print(f"   - {cat}: concat failed: {e}")
            continue
        
        before = len(combined)
        if dedupe_per_year:
            combined = combined.drop_duplicates()
        after = len(combined)
        
        try:
            combined.to_csv(out_file, index=False)
            per_year_records.append([year, cat, len(files_list), before, after, str(out_file), "merged"])
            print(f"   → Wrote {out_file.relative_to(base_dir)} : {after} rows (before dedupe {before})")
        except Exception as e:
            per_year_records.append([year, cat, len(files_list), before, after, str(out_file), f"write_failed:{e}"])
            print(f"   - {cat}: failed to write {out_file.name}: {e}")

with open(per_year_log, "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["year", "category", "n_month_files", "rows_before_dedupe", "rows_after_dedupe", "out_path", "status"])
    writer.writerows(per_year_records)
print(f"\nPer-year merge log saved to: {per_year_log}")

if all_skipped:
    print(f"\n=== Skipped files during per-year processing ({len(all_skipped)}) ===")
    for sf, reason in all_skipped:
        try:
            print(f" - {sf.relative_to(base_dir)} : {reason}")
        except Exception:
            print(f" - {sf} : {reason}")
else:
    print("\nNo files skipped during per-year discovery.")

# STEP B: Final per-category combine across years
print("\n\n=== STEP B: Combining per-year files across years into final per-category files ===")
final_records = []
for cat in categories:
    per_year_files = []
    for year_dir in sorted([p for p in base_dir.iterdir() if p.is_dir()]):
        candidate = year_dir / f"{cat}_{year_dir.name}.csv"
        if candidate.exists():
            per_year_files.append(candidate)
    if not per_year_files:
        print(f"  - {cat}: no per-year files found -> skipping final combine")
        final_records.append([cat, 0, 0, 0, str(final_outdir / f"{cat}_data.csv"), "no_files"])
        continue
    
    dfs = []
    rows_read = 0
    failed = []
    for f in per_year_files:
        try:
            if f.stat().st_size == 0:
                failed.append((f, "empty"))
                print(f" - empty per-year file skipped: {f.relative_to(base_dir)}")
                continue
            df = pd.read_csv(f, low_memory=False)
            dfs.append(df)
            rows_read += len(df)
            print(f"    read per-year: {f.relative_to(base_dir)} ({len(df)} rows)")
        except Exception as e:
            failed.append((f, f"read-error:{e}"))
            print(f" - failed to read {f.relative_to(base_dir)}: {e}")
    if not dfs:
        final_records.append([cat, len(per_year_files), 0, 0, str(final_outdir / f"{cat}_data.csv"), "all_reads_failed"])
        print(f"  - {cat}: no readable per-year files; skipping final output")
        continue
    
    try:
        combined_final = pd.concat(dfs, ignore_index=True, sort=False)
    except Exception as e:
        final_records.append([cat, len(per_year_files), rows_read, 0, str(final_outdir / f"{cat}_data.csv"), f"concat_failed:{e}"])
        print(f"  - {cat}: concat failed: {e}")
        continue
    
    before_final = len(combined_final)
    if dedupe_final:
        combined_final = combined_final.drop_duplicates()
    after_final = len(combined_final)
    
    out_final = final_outdir / f"{cat}_data.csv"
    try:
        combined_final.to_csv(out_final, index=False)
        final_records.append([cat, len(per_year_files), before_final, after_final, str(out_final), "merged"])
        print(f"  → Wrote final: {out_final} : {after_final} rows (before dedupe {before_final})")
    except Exception as e:
        final_records.append([cat, len(per_year_files), before_final, after_final, str(out_final), f"write_failed:{e}"])
        print(f"  - {cat}: failed to write final output: {e}")

with open(final_log, "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["category", "n_per_year_files", "rows_before_dedupe", "rows_after_dedupe", "out_path", "status"])
    writer.writerows(final_records)
print(f"\nFinal combine log saved to: {final_log}")

print("\nAll done. Per-year files are created inside each year folder and final per-category files are in:")
print(final_outdir)
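
The two-step structure (merge within a year, then across years, removing exact duplicates at each step) can be illustrated on toy data. The sketch below is a standard-library analogue of `pd.concat` followed by `drop_duplicates()`, not the pandas code itself, and the rows are hypothetical.

```python
def concat_dedupe(tables):
    """Concatenate lists of row tuples and drop exact duplicates, keeping the first."""
    seen = set()
    merged = []
    for table in tables:
        for row in table:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged

# Hypothetical (author, title) rows; some posts appear in more than one file.
jan_19 = [("user_a", "post 1"), ("user_b", "post 2")]
feb_19 = [("user_b", "post 2"), ("user_c", "post 3")]
jan_20 = [("user_c", "post 3"), ("user_d", "post 4")]

year_2019 = concat_dedupe([jan_19, feb_19])    # Step A for 2019
year_2020 = concat_dedupe([jan_20])            # Step A for 2020
final = concat_dedupe([year_2019, year_2020])  # Step B across years

print(len(year_2019), len(year_2020), len(final))
```

As with `drop_duplicates()` called without a `subset`, only rows that are identical in every column are removed. Rows that differ only in an index-like column such as `Unnamed: 0` would therefore be kept as distinct.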


### At this point, the data is successfully cumulated and ready for further analysis and preprocessing. This will be carried out in the DataPreprocessing.ipynb notebook. 