## Notebook 01: Data Preprocessing

This notebook uses the `load_and_preprocess_data` function from `src.preprocessing` to load raw data, apply cleaning and filtering, and save the processed output.

**Target Data:** Semantic Scholar data. Each document is built by combining the 'title' and 'abstract' columns.
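
As a rough illustration of combining 'title' and 'abstract' into a single text field, here is a minimal sketch on a toy DataFrame. The helper name `build_docs` is hypothetical, and joining with a single space while skipping missing values is an assumption; `load_and_preprocess_data` in `src.preprocessing` may do this differently.

```python
import pandas as pd


def build_docs(df: pd.DataFrame, columns: list) -> pd.Series:
    """Join the given text columns into one string per row, skipping missing values.

    Hypothetical helper for illustration; not the function from src.preprocessing.
    """
    parts = df[columns].fillna("").astype(str)
    return parts.apply(
        lambda row: " ".join(s.strip() for s in row if s.strip()), axis=1
    )


# Toy data: the second row has no abstract, so only its title is used.
toy_docs_df = pd.DataFrame({
    "title": ["A study of X", "Author response"],
    "abstract": ["We examine X in detail.", None],
})
toy_docs_df["docs"] = build_docs(toy_docs_df, ["title", "abstract"])
print(toy_docs_df["docs"].tolist())
```

With this approach a row that lacks an abstract still yields a document, made of the title alone.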

In [None]:
# ## 1. Imports and Setup

import os
import sys
import pandas as pd
import logging

# --- Add src directory to Python path ---
# This allows importing modules from src. Adjust path if notebook is moved.
module_path = os.path.abspath(os.path.join('..')) 
if module_path not in sys.path:
    sys.path.append(module_path)
    print(f"Added {module_path} to sys.path")
else:
    print(f"{module_path} already in sys.path")

# --- Import the preprocessing function ---
try:
    from src.preprocessing import load_and_preprocess_data 
    print("Successfully imported 'load_and_preprocess_data' from src.preprocessing")
except ImportError as e:
    print(f"Error importing functions: {e}")
    print("Ensure the 'src' directory is in the Python path and preprocessing.py exists.")
except Exception as e:
    print(f"An unexpected error occurred during import: {e}")

# --- Configure Logging ---
# Basic logging setup for notebook visibility
# Use force=True to allow reconfiguring logging in Jupyter environment
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)

Added c:\WORKING\BERTopic_Modeling to sys.path
Successfully imported 'load_and_preprocess_data' from src.preprocessing


In [None]:
# ## 2. Define Configuration for Semantic Scholar Data

# --- Define paths ---
project_root_dir = os.path.abspath(os.path.join('..')) 
raw_data_file = os.path.join(project_root_dir, 'data', 'raw', 'reddit_allFoS_2015_to_2025_bulk_results.csv')
processed_data_output_file = os.path.join(project_root_dir, 'data', 'processed', 's2_processed_docs.csv')
dropped_rows_file = os.path.join(project_root_dir, 'data', 'processed', 's2_dropped_missing_abstract.csv') # Path for rows dropped due to missing required columns

logging.info(f"Project root identified as: {project_root_dir}")
logging.info(f"Raw data file path: {raw_data_file}")
logging.info(f"Processed data output file path: {processed_data_output_file}")
logging.info(f"Dropped rows output file path: {dropped_rows_file}")

# --- Parameters for load_and_preprocess_data --- 
param_file_path = raw_data_file
param_text_source_columns = ['title', 'abstract']
param_unique_id_column = 'corpusId' # Verify this matches your S2 CSV column name for unique paper ID

# Columns required for 'docs' creation. Rows where any of these is missing or empty are dropped.
param_required_columns_for_docs_creation = ['abstract']
# Where to save the rows dropped by the required-column check
param_dropped_rows_output_path = dropped_rows_file

param_data_type_specific_df_processing = None 
param_clean_apply_unescape = True
param_clean_apply_url_removal = True
param_clean_apply_html_tag_removal = True
param_clean_apply_quote_normalization = True
param_clean_apply_char_filtering = True
param_clean_char_filter_regex = r"[^a-zA-Z0-9\s,.!?':;\"-]"
param_clean_apply_html_entity_removal = True
param_clean_apply_lowercase = True
param_apply_length_filter = True
param_min_doc_length = 50  
param_max_doc_length = 10000 
param_apply_duplicate_removal = True
param_column_for_duplicate_checking = 'docs' 
param_apply_score_filter = False 
param_score_column_for_filtering = None 
param_min_score_for_filtering = None
param_max_score_for_filtering = None

logging.info("Configuration parameters defined for Semantic Scholar data.")

2025-05-08 13:39:56,814 - INFO - Project root identified as: c:\WORKING\BERTopic_Modeling
2025-05-08 13:39:56,815 - INFO - Raw data file path: c:\WORKING\BERTopic_Modeling\data\raw\reddit_allFoS_2015_to_2025_bulk_results.csv
2025-05-08 13:39:56,816 - INFO - Processed data output file path: c:\WORKING\BERTopic_Modeling\data\processed\s2_reddit_processed_docs.csv
2025-05-08 13:39:56,818 - INFO - Configuration parameters defined for Semantic Scholar data.
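
To make the cleaning flags above more concrete, the following sketch applies comparable steps to one sample string. Only the character-filter pattern is taken from `param_clean_char_filter_regex`; the function name `clean_text_sketch`, the URL and tag patterns, the whitespace collapsing, and the order of the steps are all assumptions, and the real logic lives in `src.preprocessing`.

```python
import html
import re

# Same pattern as param_clean_char_filter_regex: anything outside this set is removed.
CHAR_FILTER = re.compile(r"[^a-zA-Z0-9\s,.!?':;\"-]")


def clean_text_sketch(text: str) -> str:
    """Hypothetical cleaning pipeline; the step order is an assumption."""
    text = html.unescape(text)                                  # &amp; -> &
    text = re.sub(r"https?://\S+", " ", text)                   # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)                        # drop HTML tags
    text = text.replace("\u201c", '"').replace("\u201d", '"')   # curly double quotes
    text = text.replace("\u2018", "'").replace("\u2019", "'")   # curly single quotes
    text = CHAR_FILTER.sub(" ", text)                           # keep whitelisted characters only
    text = re.sub(r"\s+", " ", text).strip()                    # collapse whitespace
    return text.lower()


sample = (
    "<p>Reddit &amp; COVID-19: see https://example.org/x "
    "\u201cChange My View\u201d (2021)</p>"
)
print(clean_text_sketch(sample))
# reddit covid-19: see "change my view" 2021
```

Note that the character filter removes `&`, `(` and `)` but keeps hyphens, colons and straight quotes, which is why "covid-19" survives intact.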


In [None]:
# ## 3. Run Preprocessing

logging.info(f"Attempting to load and preprocess data from: {param_file_path}")

processed_df = None 

try:
    processed_df = load_and_preprocess_data(
        file_path=param_file_path,
        text_source_columns=param_text_source_columns,
        unique_id_column=param_unique_id_column,
        required_columns_for_docs_creation=param_required_columns_for_docs_creation,
        dropped_rows_output_path=param_dropped_rows_output_path,
        data_type_specific_df_processing=param_data_type_specific_df_processing,
        clean_apply_unescape=param_clean_apply_unescape,
        clean_apply_url_removal=param_clean_apply_url_removal,
        clean_apply_html_tag_removal=param_clean_apply_html_tag_removal,
        clean_apply_quote_normalization=param_clean_apply_quote_normalization,
        clean_apply_char_filtering=param_clean_apply_char_filtering,
        clean_char_filter_regex=param_clean_char_filter_regex,
        clean_apply_html_entity_removal=param_clean_apply_html_entity_removal,
        clean_apply_lowercase=param_clean_apply_lowercase,
        apply_length_filter=param_apply_length_filter,
        min_doc_length=param_min_doc_length,
        max_doc_length=param_max_doc_length,
        apply_duplicate_removal=param_apply_duplicate_removal,
        column_for_duplicate_checking=param_column_for_duplicate_checking,
        apply_score_filter=param_apply_score_filter,
        score_column_for_filtering=param_score_column_for_filtering,
        min_score_for_filtering=param_min_score_for_filtering,
        max_score_for_filtering=param_max_score_for_filtering
    )

    if processed_df is not None:
        logging.info(f"Preprocessing finished. Processed DataFrame shape: {processed_df.shape}")
        if not processed_df.empty:
            try:
                output_dir = os.path.dirname(processed_data_output_file)
                if not os.path.exists(output_dir):
                    os.makedirs(output_dir)
                    logging.info(f"Created output directory: {output_dir}")
                processed_df.to_csv(processed_data_output_file, index=False)
                logging.info(f"Processed DataFrame saved to: {processed_data_output_file}")
            except Exception as e:
                logging.error(f"Error saving processed DataFrame: {e}")
        else:
            logging.info("Processed DataFrame is empty, not saving main output file.")
    else:
        logging.warning("Preprocessing did not return a DataFrame.")

except FileNotFoundError as e:
    logging.error(f"Input file path error: {e}. Please ensure the path is correct.")
except ValueError as e:
    logging.error(f"Configuration or data error during preprocessing: {e}")
except NameError as e:
    logging.error(f"Import error - required function not loaded: {e}")
except Exception as e:
    logging.error(f"An unexpected error occurred during preprocessing: {e}", exc_info=True)

2025-05-08 13:40:00,930 - INFO - Attempting to load and preprocess data from: c:\WORKING\BERTopic_Modeling\data\raw\reddit_allFoS_2015_to_2025_bulk_results.csv


Starting preprocessing for: c:\WORKING\BERTopic_Modeling\data\raw\reddit_allFoS_2015_to_2025_bulk_results.csv
Original dataset shape: (5446, 20)
Created 'docs' column from: ['title', 'abstract']
Using 'corpusId' as reference ID.
Applying configurable text cleaning to 'docs' column...


2025-05-08 13:40:01,232 - INFO - Preprocessing finished. Processed DataFrame shape: (4932, 21)
2025-05-08 13:40:01,420 - INFO - Processed DataFrame saved to: c:\WORKING\BERTopic_Modeling\data\processed\s2_reddit_processed_docs.csv


Dropped 2 rows with empty 'docs' after text cleaning.
Filtered by length (column: 'docs', min: 50, max: 10000): Removed 488 documents. Kept 4956 documents.
Removed duplicates based on 'docs': Removed 24 documents. Kept 4932.
Finished preprocessing. Processed dataset shape: (4932, 21)
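
The length filter and duplicate removal reported in the log above can be approximated with plain pandas. This is a sketch under stated assumptions: the helper name `filter_and_dedupe` is hypothetical, length is measured in characters, and both bounds are inclusive; the function in `src.preprocessing` may differ on any of these points. The final assertion only re-checks the row counts printed in the log.

```python
import pandas as pd


def filter_and_dedupe(df, column="docs", min_len=50, max_len=10000):
    """Keep rows whose text length is within [min_len, max_len], then drop exact duplicates."""
    lengths = df[column].str.len()
    kept = df[lengths.between(min_len, max_len)]  # inclusive on both ends
    return kept.drop_duplicates(subset=column, keep="first")


# Toy data: one document is too short, and two are identical.
toy_filter_df = pd.DataFrame({"docs": ["short", "x" * 60, "x" * 60, "y" * 80]})
filtered = filter_and_dedupe(toy_filter_df)
print(len(filtered))  # 2

# Row counts reported in the log above.
original, empty_after_cleaning, length_removed, duplicates_removed = 5446, 2, 488, 24
assert original - empty_after_cleaning - length_removed - duplicates_removed == 4932
```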


In [None]:
# ## 4. Inspect Output

from IPython.display import display 

print("--- Main Processed DataFrame --- ")
if processed_df is not None and not processed_df.empty:
    print(f"\nProcessed DataFrame Info ({processed_data_output_file}):")
    processed_df.info()
    print("\nFirst 5 rows of processed data:")
    display(processed_df.head())
    if 'docs' in processed_df.columns:
        print("\nSample of 'docs' column (first 3 documents):")
        for i, doc in enumerate(processed_df['docs'].head(3)):
            print(f"Doc {i+1}: {doc[:200]}...") 
    print(f"\nOutput file should be at: {processed_data_output_file}")
    print(f"Does output file exist? {os.path.exists(processed_data_output_file)}")
elif processed_df is not None and processed_df.empty:
    print("\nPreprocessing resulted in an empty DataFrame. Check filters and source data.")
    print(f"Output file path specified: {processed_data_output_file}")
    print(f"Does (potentially empty) output file exist? {os.path.exists(processed_data_output_file)}")
else:
    print("\nPreprocessing failed or DataFrame was not created/returned correctly.")
    print(f"Expected output file: {processed_data_output_file}")

print("\n--- Dropped Rows (due to missing required columns) --- ")
if os.path.exists(dropped_rows_file):
    try:
        df_dropped_check = pd.read_csv(dropped_rows_file)
        print(f"Successfully loaded dropped rows file: {dropped_rows_file}")
        print(f"Number of rows dropped due to missing required columns: {len(df_dropped_check)}")
        if not df_dropped_check.empty:
            print("\nFirst 5 rows of dropped data:")
            display(df_dropped_check.head())
        else:
            print("The dropped rows file is empty: no rows were dropped for missing required columns, or such rows were removed by other filters first.")
    except Exception as e:
        print(f"Error loading or inspecting dropped rows file {dropped_rows_file}: {e}")
else:
    print(f"Dropped rows file not found at: {dropped_rows_file}. This is expected if no rows were dropped for missing required columns.")


Processed DataFrame Info (c:\WORKING\BERTopic_Modeling\data\processed\s2_reddit_processed_docs.csv):
<class 'pandas.core.frame.DataFrame'>
Index: 4932 entries, 0 to 5445
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   paperId                   4932 non-null   object
 1   corpusId                  4932 non-null   int64 
 2   url                       4932 non-null   object
 3   title                     4932 non-null   object
 4   abstract                  1413 non-null   object
 5   venue                     4045 non-null   object
 6   year                      4932 non-null   int64 
 7   referenceCount            4932 non-null   int64 
 8   citationCount             4932 non-null   int64 
 9   influentialCitationCount  4932 non-null   int64 
 10  isOpenAccess              4932 non-null   bool  
 11  publicationDate           4193 non-null   object
 12  author_names_str          4875 non-

Unnamed: 0,paperId,corpusId,url,title,abstract,venue,year,referenceCount,citationCount,influentialCitationCount,...,publicationDate,author_names_str,fieldsOfStudy_str,s2FieldsOfStudy_str,publicationTypes_str,externalIds_json,openAccessPdf_json,journal_json,publicationVenue_json,docs
0,0010ab98621bd8fa34de6b15e15b9da7fd28b4b3,259553347,https://www.semanticscholar.org/paper/0010ab98...,Reddit sentiment analysis for natural language...,"In the Internet age, social media has fully pe...",Applied and Computational Engineering,2023,0,0,0,...,6/14/2023,Ang Li,,Computer Science(s2-fos-model),JournalArticle,"{""DOI"": ""10.54254/2755-2721/5/20230649"", ""Corp...","{""url"": ""https://www.ewadirect.com/proceedings...","{""name"": ""Applied and Computational Engineering""}","{""id"": ""38ef5a81-0fad-4de7-abc2-0fb847d3ece7"",...",reddit sentiment analysis for natural language...
1,001aebf4db7db55746eb9f4b53a8cc4e44ab64ff,242991503,https://www.semanticscholar.org/paper/001aebf4...,"Author response for ""Reasoning in social media...",,,2021,0,0,0,...,2/23/2021,Ayşe Öcal; Lu Xiao; Jaihyun Park,,Computer Science(s2-fos-model); Psychology(s2-...,,"{""DOI"": ""10.1108/oir-08-2020-0330/v3/response1...","{""url"": """", ""status"": null, ""license"": null, ""...",,,"author response for ""reasoning in social media..."
2,002559cc9c1b9168fe9fead1be9510eb1e5e80d4,261572976,https://www.semanticscholar.org/paper/002559cc...,Sharing Reliable COVID-19 Information and Coun...,Background The rampant spread of misinformatio...,JMIR infodemiology,2023,37,2,0,...,3/28/2023,Alexis M. Koskan; Shalini Sivanandam; Kristy R...,Medicine,Medicine(external); Sociology(s2-fos-model),JournalArticle; Review,"{""PubMedCentral"": ""10625073"", ""DOI"": ""10.2196/...","{""url"": ""https://doi.org/10.2196/47677"", ""stat...","{""name"": ""JMIR Infodemiology"", ""volume"": ""3""}","{""id"": ""5954d9ea-100f-4b8f-94ce-3e2ba0431102"",...",sharing reliable covid-19 information and coun...
4,002788669c1548cedeeae6a0d6fd1e0d5eb40e43,256758312,https://www.semanticscholar.org/paper/00278866...,Hate Speech Patterns in Social Media: A Method...,Social media offers users an online platform t...,Australasian Journal of Information Systems,2023,0,4,0,...,2/8/2023,V. Wanniarachchi; C. Scogings; Teo Sušnjak; A....,Computer Science,Computer Science(external); Sociology(s2-fos-m...,JournalArticle,"{""DBLP"": ""journals/ajis/WanniarachchiSSM23"", ""...","{""url"": ""https://journal.acs.org.au/index.php/...","{""name"": ""Australas. J. Inf. Syst."", ""volume"":...","{""id"": ""37046766-bc51-4464-99ed-f39d8391d1d8"",...",hate speech patterns in social media: a method...
5,0029c7cb16972c4c878bbef08ac0373363bf6d07,265608098,https://www.semanticscholar.org/paper/0029c7cb...,RoCS-MT: Robustness Challenge Set for Machine ...,"RoCS-MT, a Robust Challenge Set for Machine Tr...",Conference on Machine Translation,2023,35,6,0,...,,Rachel Bawden; Benoît Sagot,Computer Science,Computer Science(external); Computer Science(s...,JournalArticle,"{""ACL"": ""2023.wmt-1.21"", ""DBLP"": ""conf/wmt/Baw...","{""url"": ""https://aclanthology.org/2023.wmt-1.2...","{""pages"": ""198-216""}","{""id"": ""9aacb914-3edf-4e02-b8fe-5abf21c4d2ba"",...",rocs-mt: robustness challenge set for machine ...



Sample of 'docs' column (first 3 documents):
Doc 1: reddit sentiment analysis for natural language processing in the internet age, social media has fully penetrated into people's lives. as one of the well-developed online platforms with a large user ba...
Doc 2: author response for "reasoning in social media: insights from reddit "change my view" submissions"...
Doc 3: sharing reliable covid-19 information and countering misinformation: in-depth interviews with information advocates background the rampant spread of misinformation about covid-19 has been linked to a ...

Output file should be at: c:\WORKING\BERTopic_Modeling\data\processed\s2_reddit_processed_docs.csv
Does output file exist? True
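
The `info()` output above lists 1413 non-null `abstract` values among 4932 rows, so rows without an abstract were present in that run. A quick way to see which rows a required-column check would set aside is sketched below. The helper name `split_on_required` is hypothetical, and treating whitespace-only strings as missing is an assumption about how `required_columns_for_docs_creation` behaves.

```python
import pandas as pd


def split_on_required(df, required):
    """Return (kept, dropped): rows where every required column is non-empty, and the rest."""
    present = pd.Series(True, index=df.index)
    for col in required:
        values = df[col]
        # A value counts as present if it is not NaN/None and not blank after stripping.
        present &= values.notna() & (values.astype(str).str.strip() != "")
    return df[present], df[~present]


# Toy data: row 2 has no abstract, row 3 has a whitespace-only abstract.
toy_required_df = pd.DataFrame({
    "corpusId": [1, 2, 3],
    "title": ["T1", "T2", "T3"],
    "abstract": ["Some abstract.", None, "   "],
})
kept_rows, dropped_rows = split_on_required(toy_required_df, ["abstract"])
print(kept_rows["corpusId"].tolist(), dropped_rows["corpusId"].tolist())
# [1] [2, 3]
```

Running the same split on `processed_df` with `['abstract']` would show how many processed rows lack an abstract before deciding whether to rerun with the required-column setting.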
