# Civil Comments Preprocessing


**Purpose:** This notebook documents the preprocessing steps used to transform the Civil Comments dataset into an analysis- and modeling-ready table.


> **Reproducibility & ethics:** This repository does not redistribute the Civil Comments dataset or any derived labeled exports. You must obtain the dataset under its original license/terms and provide your own input/output paths.


### What this notebook does:
- Loads the Civil Comments source file.
- Performs lightweight text/encoding cleanup (byte-string decoding).
- Parses timestamps for time-aware analysis.
- Adds identity-mention indicator columns for downstream fairness/system-risk diagnostics.


In [None]:
# Mount Google Drive to access project files and datasets
# The following command creates a virtual link between Google Drive and the Colab environment
# Files stored in the Drive can then be accessed under /content/drive/MyDrive/

from google.colab import drive

# Mount the drive at the default mount point
drive.mount('/content/drive')

# After executing this cell, a prompt will appear asking for authorization.
# Once authorized, the path /content/drive/MyDrive/ will contain all user files.


Mounted at /content/drive


## Input path configuration

Set the path to the Civil Comments source file. Use a local path when running locally, or a Drive path if running in Colab.


In [None]:
# Define the absolute path to the Civil Comments dataset
# The path reflects the folder hierarchy in Google Drive
# Keeping this reference centralized allows for easy modification later if the file location changes

dataset_path = "/content/drive/MyDrive/Dat490/Dataset/Civil_Comments_TFDS.csv"

# Quick verification: print the path to confirm correctness
print("Dataset path set to:", dataset_path)


Dataset path set to: /content/drive/MyDrive/Dat490/Dataset/Civil_Comments_TFDS.csv
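
Because the dataset is not redistributed with this repository, it can help to avoid hard-coding a single Drive location. The following is a minimal sketch, not part of the original notebook: the helper `resolve_path`, the environment variable names `CIVIL_COMMENTS_CSV` and `CIVIL_COMMENTS_OUT`, and the `data/` defaults are all hypothetical and should be adapted to your own setup.

```python
import os
from pathlib import Path


def resolve_path(env_var, default):
    """Return the path stored in env_var, or the default if it is unset."""
    # expanduser() lets a value such as "~/data/file.csv" work locally
    return Path(os.environ.get(env_var, default)).expanduser()


# Hypothetical variable names and defaults; replace with your own locations
dataset_path = resolve_path("CIVIL_COMMENTS_CSV", "data/Civil_Comments_TFDS.csv")
output_path = resolve_path("CIVIL_COMMENTS_OUT", "data/civil_with_identities.parquet")

print("Input :", dataset_path)
print("Output:", output_path)
```

In Colab, the variables can be set to the Drive paths before this cell runs; locally, the defaults (or your own values) apply.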


## Dependencies

Import the libraries used for preprocessing and lightweight validation.


In [None]:
# Importing essential libraries individually for clarity and maintainability

import pandas as pd          # Data manipulation and analysis
import numpy as np           # Numerical operations and array handling

# Optional visualization libraries (uncomment if needed later)
# import matplotlib.pyplot as plt
# import seaborn as sns

print("✅ Libraries successfully imported.")


✅ Libraries successfully imported.


## Load the dataset

Read the Civil Comments file into a DataFrame. The encoding is set explicitly to reduce issues with special characters.


In [None]:
# Load the Civil Comments dataset using the pre-defined path
df = pd.read_csv(dataset_path, encoding='utf-8')

# Basic verification of successful loading
print("✅ Dataset successfully loaded.")
print(f"Shape: {df.shape}")             # Displays the number of rows and columns
print("Preview of column names:", list(df.columns)[:10], "...")


✅ Dataset successfully loaded.
Shape: (1999514, 14)
Preview of column names: ['article_id', 'created_date', 'id', 'identity_attack', 'insult', 'obscene', 'parent_id', 'parent_text', 'publication_id', 'severe_toxicity'] ...


## Basic validation

Before transforming the data, inspect the schema and missingness to ensure the file loaded correctly (columns, dtypes, null counts).


In [None]:
# Display structural information about the dataset
# Provides data types, non-null counts, and memory usage
df.info()

# Display the first few rows to preview text and label structure
df.head(3)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999514 entries, 0 to 1999513
Data columns (total 14 columns):
 #   Column           Dtype  
---  ------           -----  
 0   article_id       int64  
 1   created_date     object 
 2   id               object 
 3   identity_attack  float64
 4   insult           float64
 5   obscene          float64
 6   parent_id        int64  
 7   parent_text      object 
 8   publication_id   object 
 9   severe_toxicity  float64
 10  sexual_explicit  float64
 11  text             object 
 12  threat           float64
 13  toxicity         float64
dtypes: float64(7), int64(2), object(5)
memory usage: 213.6+ MB


Unnamed: 0,article_id,created_date,id,identity_attack,insult,obscene,parent_id,parent_text,publication_id,severe_toxicity,sexual_explicit,text,threat,toxicity
0,153145,b'2016-11-29 17:23:57.762283+00',b'634903',0.0,0.0,0.0,0,b'',b'54',0.0,0.0,"b""btw, Globe, your new comment section is lame...",0.0,0.2
1,379147,b'2017-09-19 03:02:05.207449+00',b'5977874',0.0,0.0,0.0,5977787,"b""I get the impression that Boeing is very af...",b'54',0.0,0.0,"b""If at first you don't succeed...try again: h...",0.0,0.0
2,342612,b'2017-06-10 06:32:21.121964+00',b'5390534',0.0,0.7,0.0,5388737,"b""Here's a vote for Comey to be prosecuting fo...",b'55',0.0,0.0,"b""You don't understand what leaking is. By th...",0.0,0.7


## Decode byte-string columns

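Some columns may hold byte strings depending on how the dataset was exported. This step decodes any `bytes` or `bytearray` values into readable text. Note that `pd.read_csv` returns such values as ordinary strings that only look like byte literals (for example `b'54'`), so for this CSV the loop finds nothing to decode, as the unchanged preview below shows. The next step handles those string wrappers.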


In [None]:
# Automatically detect and decode any column that contains byte strings
# This approach preserves non-text columns (e.g., numeric, datetime)
# and only decodes columns with at least one byte or bytearray entry.

for col in df.columns:
    if df[col].apply(lambda x: isinstance(x, (bytes, bytearray))).any():
        df[col] = df[col].apply(lambda x: x.decode('utf-8', errors='ignore') if isinstance(x, (bytes, bytearray)) else x)
        print(f"Decoded column: {col}")

print("✅ Byte-string decoding completed for all applicable columns.")
df.head(3)


✅ Byte-string decoding completed for all applicable columns.


Unnamed: 0,article_id,created_date,id,identity_attack,insult,obscene,parent_id,parent_text,publication_id,severe_toxicity,sexual_explicit,text,threat,toxicity
0,153145,b'2016-11-29 17:23:57.762283+00',b'634903',0.0,0.0,0.0,0,b'',b'54',0.0,0.0,"b""btw, Globe, your new comment section is lame...",0.0,0.2
1,379147,b'2017-09-19 03:02:05.207449+00',b'5977874',0.0,0.0,0.0,5977787,"b""I get the impression that Boeing is very af...",b'54',0.0,0.0,"b""If at first you don't succeed...try again: h...",0.0,0.0
2,342612,b'2017-06-10 06:32:21.121964+00',b'5390534',0.0,0.7,0.0,5388737,"b""Here's a vote for Comey to be prosecuting fo...",b'55',0.0,0.0,"b""You don't understand what leaking is. By th...",0.0,0.7


## Clean residual byte-like artifacts

After decoding, remove any remaining byte-string patterns so text fields are consistently readable.


In [None]:
# Identify and clean any string values that look like b'example'
# The pattern is purely textual, so we use string replacement
# This preserves numeric fields and normal strings unaffected by the pattern.

def clean_byte_like(value):
    """Remove b'...' or b"..." wrappers from string representations of bytes."""
    if isinstance(value, str):
        if value.startswith("b'") and value.endswith("'"):
            return value[2:-1]
        if value.startswith('b"') and value.endswith('"'):
            return value[2:-1]
    return value

# Apply the cleaning function to all object-type columns
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].apply(clean_byte_like)

print("✅ Residual byte-like string cleanup completed.")
df.head(3)


✅ Residual byte-like string cleanup completed.


Unnamed: 0,article_id,created_date,id,identity_attack,insult,obscene,parent_id,parent_text,publication_id,severe_toxicity,sexual_explicit,text,threat,toxicity
0,153145,2016-11-29 17:23:57.762283+00,634903,0.0,0.0,0.0,0,,54,0.0,0.0,"btw, Globe, your new comment section is lame. ...",0.0,0.2
1,379147,2017-09-19 03:02:05.207449+00,5977874,0.0,0.0,0.0,5977787,I get the impression that Boeing is very afra...,54,0.0,0.0,If at first you don't succeed...try again: htt...,0.0,0.0
2,342612,2017-06-10 06:32:21.121964+00,5390534,0.0,0.7,0.0,5388737,Here's a vote for Comey to be prosecuting for ...,55,0.0,0.0,You don't understand what leaking is. By the ...,0.0,0.7
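
Stripping the `b'...'` wrapper by slicing leaves any escape sequences inside the literal untouched (for example `\xc3\xa9` or `\'`). If the export contains such escapes, one alternative is to parse the literal with `ast.literal_eval` and decode the resulting bytes. This is a sketch under that assumption, not the method used above; the function name `decode_byte_literal` is hypothetical.

```python
import ast


def decode_byte_literal(value):
    """Turn a string like "b'caf\\xc3\\xa9'" into 'café'; leave other values alone."""
    # Only strings that look like a bytes literal are candidates
    if not isinstance(value, str) or not value.startswith(("b'", 'b"')):
        return value
    try:
        parsed = ast.literal_eval(value)  # safely evaluates literals only
    except (ValueError, SyntaxError):
        return value  # malformed literal: keep the original string
    if isinstance(parsed, bytes):
        # errors="replace" marks undecodable bytes instead of dropping them
        return parsed.decode("utf-8", errors="replace")
    return value


print(decode_byte_literal("b'caf\\xc3\\xa9'"))
print(decode_byte_literal("plain text"))
```

It could be applied the same way as `clean_byte_like`, for instance `df[col].apply(decode_byte_literal)` over the object-type columns. On a table of this size it will be slower than plain slicing.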


## Parse timestamps

Convert the datetime field into a standardized datetime type. This supports sorting, filtering, and any time-based analysis.


In [None]:
# Convert the 'created_date' column from string to datetime
# This parsing automatically interprets timezone offsets and irregular formatting where possible.
# Any unparsable values are coerced to NaT (Not-a-Time) to prevent runtime errors.

df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')

# Verify conversion success and display a quick summary
print("✅ Datetime conversion completed.")
print("Datatype of 'created_date':", df['created_date'].dtype)
print("Earliest date:", df['created_date'].min())
print("Latest date:", df['created_date'].max())


✅ Datetime conversion completed.
Datatype of 'created_date': datetime64[ns, UTC]
Earliest date: 2015-09-29 10:50:41.987077+00:00
Latest date: 2017-11-11 01:01:10.822969+00:00
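
Because `errors='coerce'` silently turns unparsable values into `NaT`, it is worth counting how many values were lost in parsing rather than missing from the start. The integrity check at the end of this notebook reports 3 missing `created_date` values, which could come from either source. The sketch below shows the idea on a small hypothetical series; with the real data, the raw strings would need to be kept in a separate variable before conversion.

```python
import pandas as pd

# Hypothetical sample: two valid timestamps, one malformed value, one missing value
raw = pd.Series([
    "2016-11-29 17:23:57+00:00",
    "2017-09-19 03:02:05+00:00",
    "not a date",
    None,
])

parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# Rows that had a value before parsing but are NaT afterwards
newly_missing = parsed.isna() & raw.notna()

print("Total NaT after parsing:", int(parsed.isna().sum()))
print("Introduced by coercion:", int(newly_missing.sum()))
print(raw[newly_missing])
```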


## Identity-mention indicator features

Create binary (0/1) indicator columns for comments that mention identity terms (e.g., race, religion, gender, sexuality).

These indicators support downstream fairness and system-risk diagnostics. They are *not* demographic attributes of authors; they only flag whether a term appears in the text.


In [None]:
### Load the Identity Lexicon File (BOM-safe)

import json

lexicon_path = "/content/drive/MyDrive/Dat490/Dataset/identity_lexicon.json"

with open(lexicon_path, "r", encoding="utf-8-sig") as f:   # ← this removes the BOM automatically
    identity_lexicon = json.load(f)

print("Identity lexicon loaded successfully.")


Identity lexicon loaded successfully.


In [None]:
# Create the Identity-Mention Columns

import re

# Loop through each identity group in the lexicon
for identity, terms in identity_lexicon.items():

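    # Build a regex pattern from the list of terms
    # Word boundaries prevent partial-word matches; matching is case-insensitive (case=False below)
    # The non-capturing group (?:...) keeps pandas from warning about match groups
    pattern = r'\b(?:' + '|'.join(map(re.escape, terms)) + r')\b'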

    # Create a new column: has_<identity>
    df[f"has_{identity}"] = df["text"].str.contains(pattern, case=False, regex=True, na=False).astype(int)

print("Identity-mention columns created successfully.")


Identity-mention columns created successfully.
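
To see how word-boundary matching behaves, the following sketch applies the same approach to a toy example. The lexicon contents and sample sentences are hypothetical and are not taken from `identity_lexicon.json`. The group here is non-capturing, which also avoids the pandas warning about match groups in `str.contains`.

```python
import re

import pandas as pd

# Hypothetical mini-lexicon and sample texts, for illustration only
toy_lexicon = {
    "muslim": ["muslim", "muslims"],
    "female": ["woman", "women"],
}
texts = pd.Series(["Muslims and women spoke.", "The womanizer left.", None])

flags = pd.DataFrame(index=texts.index)
for identity, terms in toy_lexicon.items():
    # \b...\b requires whole-word matches; (?:...) groups without capturing
    pattern = r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b"
    flags[f"has_{identity}"] = (
        texts.str.contains(pattern, case=False, regex=True, na=False).astype(int)
    )

print(flags)
```

The first sentence is flagged for both groups. "womanizer" is not flagged because "woman" is not followed by a word boundary, and the missing text is treated as no match through `na=False`.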


In [None]:
# Save the processed dataset

df.to_parquet("/content/drive/MyDrive/Dat490/Dataset/civil_with_identities.parquet")
print("Processed dataset saved.")


Processed dataset saved.


---

## Dataset is ready

The dataset has been loaded and minimally standardized (encoding cleanup, timestamp parsing, identity indicators).
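
One way to confirm that every identity indicator holds only 0 and 1 is to test the values directly, since a distinct-value count of 2 would also be satisfied by, say, 0 and 2. The helper name `non_binary_columns` and the toy frame are hypothetical; the `has_` prefix follows the column naming used above.

```python
import pandas as pd


def non_binary_columns(frame, prefix="has_"):
    """Return the prefixed columns that contain any value other than 0 or 1."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    return [c for c in cols if not frame[c].isin([0, 1]).all()]


# Toy frame: has_b contains an invalid value (2)
toy = pd.DataFrame({
    "has_a": [0, 1, 1],
    "has_b": [0, 2, 1],
    "text": ["x", "y", "z"],
})

bad = non_binary_columns(toy)
print("Columns with unexpected values:", bad)
```

On the processed dataset, `non_binary_columns(df)` would be expected to return an empty list.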



In [None]:
# Load the processed dataset (after running the steps above once, later sessions can start from this cell)

df = pd.read_parquet("/content/drive/MyDrive/Dat490/Dataset/civil_with_identities.parquet")
print("Processed dataset loaded and ready.")


Processed dataset loaded and ready.


In [None]:
### Quick Dataset Integrity Check

# 1. Preview first rows
print("----- HEAD -----")
display(df.head())

# 2. Shape check
print("----- SHAPE -----")
print(df.shape)

# 3. Identity columns list
print("----- IDENTITY COLUMNS -----")
identity_cols = [col for col in df.columns if col.startswith("has_")]
print(identity_cols)

# 4. Missing values check
print("----- MISSING VALUES -----")
print(df.isna().sum())

# 5. Value check for identity columns: nunique() counts distinct values, so expect 2 per column (0 and 1)
print("----- UNIQUE VALUES IN IDENTITY COLUMNS -----")
print(df[identity_cols].nunique())


----- HEAD -----


Unnamed: 0,article_id,created_date,id,identity_attack,insult,obscene,parent_id,parent_text,publication_id,severe_toxicity,...,has_muslim,has_christian,has_jewish,has_gay,has_lesbian,has_bisexual,has_transgender,has_male,has_female,has_disability
0,153145,2016-11-29 17:23:57.762283+00:00,634903,0.0,0.0,0.0,0,,54,0.0,...,0,0,0,0,0,0,0,0,0,0
1,379147,2017-09-19 03:02:05.207449+00:00,5977874,0.0,0.0,0.0,5977787,I get the impression that Boeing is very afra...,54,0.0,...,0,0,0,0,0,0,0,0,0,0
2,342612,2017-06-10 06:32:21.121964+00:00,5390534,0.0,0.7,0.0,5388737,Here's a vote for Comey to be prosecuting for ...,55,0.0,...,0,0,0,0,0,0,0,0,0,0
3,163084,2017-01-20 23:46:35.985660+00:00,871483,0.0,0.0,0.0,870487,"Excuse me, but what religious freedom are you ...",53,0.0,...,0,0,0,0,0,0,0,0,0,0
4,161284,2017-01-11 19:39:41.734415+00:00,825427,0.0,0.0,0.0,825241,Who thought it would be a good idea to put Row...,22,0.0,...,0,0,0,0,0,0,0,0,0,0


----- SHAPE -----
(1999514, 28)
----- IDENTITY COLUMNS -----
['has_black', 'has_white', 'has_asian', 'has_latino', 'has_muslim', 'has_christian', 'has_jewish', 'has_gay', 'has_lesbian', 'has_bisexual', 'has_transgender', 'has_male', 'has_female', 'has_disability']
----- MISSING VALUES -----
article_id         0
created_date       3
id                 0
identity_attack    0
insult             0
obscene            0
parent_id          0
parent_text        0
publication_id     0
severe_toxicity    0
sexual_explicit    0
text               0
threat             0
toxicity           0
has_black          0
has_white          0
has_asian          0
has_latino         0
has_muslim         0
has_christian      0
has_jewish         0
has_gay            0
has_lesbian        0
has_bisexual       0
has_transgender    0
has_male           0
has_female         0
has_disability     0
dtype: int64
----- UNIQUE VALUES IN IDENTITY COLUMNS -----
has_black          2
has_white          2
has_asian          