# Task 2: Exploratory Data Analysis (EDA)

This notebook implements **Task 2** of the Employee Sentiment Analysis project. The goal is to understand the dataset's structure, the distribution of its sentiment labels, and how messages and sentiment vary over time and across employees.

## Key Areas of Analysis:
1. **Data Structure**: Examine records, data types, missing values
2. **Sentiment Distribution**: Analyze sentiment labels across the dataset
3. **Time Trends**: Investigate patterns over time
4. **Employee Patterns**: Explore employee-specific trends and anomalies
5. **Message Characteristics**: Analyze message length, frequency, and content patterns
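
As a rough orientation, the five areas above map onto a handful of pandas operations. The sketch below runs on a small made-up frame rather than the project data. The `from` and `date` columns appear in this notebook, while the `sentiment` and `body` column names and all values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the real dataset. 'sentiment' and 'body' are ASSUMED
# column names; every value here is invented for illustration.
toy = pd.DataFrame({
    "from": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
    "date": pd.to_datetime(["2010-01-05", "2010-02-10", "2010-01-20", "2010-02-25"]),
    "sentiment": ["Positive", "Negative", "Neutral", "Positive"],
    "body": ["Great work", "This is late again", "Meeting at 3", "Thanks!"],
})

# 1. Data structure: column types and missing values
structure = toy.dtypes
missing = toy.isnull().sum()

# 2. Sentiment distribution: count of each label
sentiment_counts = toy["sentiment"].value_counts()

# 3. Time trends: number of messages per calendar month
monthly_counts = toy.groupby(toy["date"].dt.to_period("M")).size()

# 4. Employee patterns: number of messages per sender
per_employee = toy.groupby("from").size()

# 5. Message characteristics: length of each message in characters
toy["message_length"] = toy["body"].str.len()

print(sentiment_counts, monthly_counts, per_employee, sep="\n\n")
```

Each intermediate result is a plain Series, so it can be passed straight to a plotting call once the real data is loaded.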

## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime, timedelta
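import warnings
# Silence library warnings to keep cell output readable.
# Note: this also hides deprecation warnings from pandas and seaborn.
warnings.filterwarnings('ignore')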

# Set plotting style
plt.style.use('ggplot')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

## 2. Load Data with Sentiment Labels

We'll load the dataset with sentiment labels from Task 1 to perform comprehensive exploratory analysis.

In [None]:
# Load data with sentiment labels (from Task 1)
df = pd.read_csv('../data/processed/email_data_with_sentiment.csv')

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display basic information
print("\n=== DATASET OVERVIEW ===")
print(f"Total number of messages: {len(df):,}")
print(f"Number of unique employees: {df['from'].nunique():,}")
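# Parse dates before taking min/max: read_csv loads 'date' as strings,
# and string comparison is lexicographic rather than chronological.
dates = pd.to_datetime(df['date'], errors='coerce')
print(f"Date range: {dates.min()} to {dates.max()}")
print("Missing values per column:")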
print(df.isnull().sum())

df.head()
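
Because `read_csv` loads the `date` column as strings unless told otherwise, it can help to check how cleanly the column parses before relying on any date range. The following is a minimal sketch on a hypothetical sample. The `%m/%d/%Y` format and the sample values are assumptions, so adjust the format to whatever the real file uses.

```python
import pandas as pd

# Hypothetical sample mimicking raw CSV output: dates arrive as strings.
sample = pd.DataFrame({
    "from": ["a@x.com", "b@x.com", "c@x.com", "d@x.com"],
    "date": ["9/1/2010", "10/15/2010", "12/3/2009", "n/a"],
})

# min() on strings is lexicographic, so it does not return the earliest date.
string_min = sample["date"].min()

# Parse with an ASSUMED format; values that do not match become NaT.
parsed = pd.to_datetime(sample["date"], format="%m/%d/%Y", errors="coerce")

# Count rows that failed to parse so they are not silently dropped later.
n_unparseable = int(parsed.isna().sum())

# min()/max() on datetimes skip NaT and compare chronologically.
date_min, date_max = parsed.min(), parsed.max()

print(f"String min: {string_min}")
print(f"Unparseable dates: {n_unparseable}")
print(f"Parsed range: {date_min.date()} to {date_max.date()}")
```

Using `errors="coerce"` trades an immediate exception for a count of bad rows, which is usually more convenient during exploration. A nonzero count is worth investigating before any time-based grouping.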