# 01 Data Preprocessing

This notebook covers the initial data loading and preprocessing of the employee messages dataset (`test.csv`). We'll examine the data structure, clean the text data, and prepare it for sentiment analysis and further tasks as outlined in the project requirements.

## 1. Import Required Libraries
Import pandas, numpy, and any other libraries needed for data preprocessing.

In [None]:
import pandas as pd
import numpy as np
import os
import re

## 2. Load Raw Dataset
Load the raw dataset from the `data/raw/` directory using pandas.

In [None]:
# Load the actual dataset
raw_data_path = os.path.join('..', 'data', 'raw', 'test.csv')
df = pd.read_csv(raw_data_path)
print(f"Dataset shape: {df.shape}")
df.head()
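
Before going further, it can help to confirm that the file contains the columns the later cells rely on (`Subject`, `body`, `date`, `from`). The following is a minimal sketch, assuming those four names are the full set needed. `load_messages` and `REQUIRED_COLUMNS` are hypothetical names introduced here, and the in-memory CSV is made-up data standing in for `test.csv`.

```python
import io
import pandas as pd

# Columns referenced by later cells in this notebook (assumed to be the full required set).
REQUIRED_COLUMNS = ["Subject", "body", "date", "from"]

def load_messages(source):
    """Read a CSV (path or file-like object) and fail early if a required column is missing."""
    frame = pd.read_csv(source)
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return frame

# Illustrative data only; not taken from test.csv.
sample_csv = io.StringIO(
    "Subject,body,date,from\n"
    "Hello,First message,2010-05-10,alice@example.com\n"
)
sample = load_messages(sample_csv)
print(sample.shape)
```

Calling `load_messages(raw_data_path)` in place of `pd.read_csv(raw_data_path)` would surface a renamed or missing column at load time rather than as a `KeyError` several cells later.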

## 3. Inspect Data Structure
Display the first few rows, data types, and summary statistics to understand the dataset.

In [None]:
# Display first 5 rows
print("First 5 rows:")
print(df.head())

# Data types and non-null counts
print("\nDataset info:")
df.info()

# Summary statistics for all columns (include='all' also covers non-numeric columns)
print("\nSummary statistics:")
print(df.describe(include='all'))

# Check column names and basic structure
print(f"\nColumns: {list(df.columns)}")
print(f"Shape: {df.shape}")

## 4. Handle Missing Values
Identify missing values and empty strings in the text columns, then drop rows where both `Subject` and `body` are missing.

In [None]:
# Check for missing values
missing = df.isnull().sum()
print('Missing values per column:')
print(missing)

# Check for empty strings in text columns
print('\nEmpty strings in text columns:')
for col in ['Subject', 'body']:
    if col in df.columns:
        empty_count = (df[col] == '').sum()
        print(f"{col}: {empty_count} empty strings")

# Handle missing values - drop rows where both Subject and body are missing
df = df.dropna(subset=['Subject', 'body'], how='all')
print(f"\nDataset shape after removing rows with missing Subject AND body: {df.shape}")
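
The choice of `how='all'` matters here: a row with a body but no subject is kept. The sketch below uses a small made-up frame to contrast `how='all'` with `how='any'`. It also shows one possible way to treat whitespace-only strings as missing, which the empty-string check above does not detect; that normalization is an assumption about what might be wanted, not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; not taken from test.csv.
demo = pd.DataFrame({
    "Subject": ["Update", None, None, "   "],
    "body": ["See attached", "Body only", None, "Text"],
})

# how='all' drops a row only when every listed column is missing.
kept_all = demo.dropna(subset=["Subject", "body"], how="all")

# how='any' drops a row when at least one listed column is missing.
kept_any = demo.dropna(subset=["Subject", "body"], how="any")

# Whitespace-only strings are neither NaN nor '', so convert them to NaN first
# if they should count as missing.
normalized = demo.replace(r"^\s*$", np.nan, regex=True)

print(len(kept_all), len(kept_any))
print(normalized["Subject"].isna().sum())
```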

## 5. Data Cleaning (Text and Numeric)
Clean the text fields (lowercase, replace special characters with spaces, collapse whitespace) and combine `Subject` and `body` into a single column. No numeric columns are processed in this step.

In [None]:
# Clean text columns - Subject and body
def clean_text(text):
    if pd.isna(text):
        return ""
    text = str(text).lower()
    # Remove special characters but keep basic punctuation
    text = re.sub(r'[^\w\s\.,!?]', ' ', text)
    # Remove multiple spaces
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

# Clean Subject and body columns
df['Subject_clean'] = df['Subject'].apply(clean_text)
df['body_clean'] = df['body'].apply(clean_text)

# Replace empty cleaned subjects with the placeholder '(no subject)'
df['Subject_clean'] = df['Subject_clean'].replace('', '(no subject)')

# Combine Subject and body for comprehensive text analysis
df['combined_text'] = df['Subject_clean'] + ' ' + df['body_clean']

print("Text cleaning completed!")
print(f"Sample cleaned text:\n{df['combined_text'].iloc[0][:200]}...")
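
To see what the cleaning step keeps and discards, the function can be run on a few hand-written strings. The copy of `clean_text` below repeats the definition above so the snippet runs on its own. The sample strings are made up for illustration.

```python
import re
import pandas as pd

def clean_text(text):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    if pd.isna(text):
        return ""
    text = str(text).lower()
    text = re.sub(r"[^\w\s\.,!?]", " ", text)  # keep word chars, whitespace, and . , ! ?
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace, including newlines
    return text.strip()

# Illustrative inputs only; not taken from test.csv.
samples = ["RE: Q3 Budget -- URGENT!!!", "alice@example.com wrote:\n> thanks", None]
cleaned = [clean_text(s) for s in samples]
for original, result in zip(samples, cleaned):
    print(repr(original), "->", repr(result))
```

Two side effects are worth noting: `@` is replaced by a space, so email addresses in the text are split into two tokens, and underscores survive because `\w` matches them.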

## 6. Feature Engineering
Create new features or transform existing ones to support downstream analysis.

In [None]:
# Process date column
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6])

# Extract email domain from 'from' column
df['email_domain'] = df['from'].str.split('@').str[1]

# Create text length features (raw Subject/body lengths are NaN where the value is missing)
df['subject_length'] = df['Subject'].str.len()
df['body_length'] = df['body'].str.len()
df['combined_text_length'] = df['combined_text'].str.len()

# Create word count features
df['subject_word_count'] = df['Subject_clean'].str.split().str.len()
df['body_word_count'] = df['body_clean'].str.split().str.len()

print("Feature engineering completed!")
print(f"Columns other than Subject, body, date, from: {[col for col in df.columns if col not in ['Subject', 'body', 'date', 'from']]}")
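
The date and domain features behave in specific ways on malformed input, which is easier to see on a tiny example. This sketch uses made-up rows, including an unparseable date and a sender without an `@`, to show what the same calls produce in those cases.

```python
import pandas as pd

# Illustrative rows only; not taken from test.csv.
demo = pd.DataFrame({
    "date": ["2010-05-08", "2010-05-10", "not a date"],
    "from": ["alice@example.com", "bob@example.org", "no-address"],
})

demo["date"] = pd.to_datetime(demo["date"], errors="coerce")  # unparseable values become NaT
demo["day_of_week"] = demo["date"].dt.dayofweek               # Monday=0 ... Sunday=6
demo["is_weekend"] = demo["day_of_week"].isin([5, 6])         # NaN is not in [5, 6], so False
demo["email_domain"] = demo["from"].str.split("@").str[1]     # NaN when there is no '@'

print(demo)
```

Rows with an unparseable date are therefore flagged as not being weekends rather than as unknown. If that distinction matters downstream, checking `df['date'].isna().sum()` after parsing is one way to see how many rows are affected.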

## 7. Save Processed Data
Save the cleaned and preprocessed dataset to the `data/processed/` directory for use in later steps.

In [None]:
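processed_data_path = os.path.join('..', 'data', 'processed', 'email_data_processed.csv')
# Create the output directory if it does not exist; to_csv will not create it
os.makedirs(os.path.dirname(processed_data_path), exist_ok=True)
df.to_csv(processed_data_path, index=False)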
print(f"Processed data saved to {processed_data_path}")
print(f"Final dataset shape: {df.shape}")
print(f"Columns in processed data: {list(df.columns)}")
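
CSV stores no dtype information, so the parsed `date` column comes back as text unless it is parsed again on load. The round trip below is a sketch on a made-up two-row frame written to a temporary directory. It suggests how a later notebook might reload the processed file, assuming the `date` column name used above.

```python
import os
import tempfile
import pandas as pd

# Illustrative frame only; column names mirror the ones created above.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2010-05-08", "2010-05-10"]),
    "is_weekend": [True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "processed")
    os.makedirs(out_dir, exist_ok=True)  # to_csv does not create missing directories
    out_path = os.path.join(out_dir, "demo.csv")
    demo.to_csv(out_path, index=False)

    plain = pd.read_csv(out_path)                         # date comes back as text
    parsed = pd.read_csv(out_path, parse_dates=["date"])  # date restored as datetime64

print(plain["date"].dtype, parsed["date"].dtype)
```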