# Task 2: Exploratory Data Analysis (EDA)

This notebook implements **Task 2** of the Employee Sentiment Analysis project. The goal is to understand the dataset's structure, the distribution of sentiment labels, and how both vary over time and across employees.

## Key Areas of Analysis:
1. **Data Structure**: Examine records, data types, missing values
2. **Sentiment Distribution**: Analyze sentiment labels across the dataset
3. **Time Trends**: Investigate patterns over time
4. **Employee Patterns**: Explore employee-specific trends and anomalies
5. **Message Characteristics**: Analyze message length, frequency, and content patterns

## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('ggplot')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

## 2. Load Data with Sentiment Labels

We load the dataset produced in Task 1, which carries a sentiment label for each message, and print a quick overview of its size, columns, date range, and missing values.

In [None]:
# Load data with sentiment labels (from Task 1)
df = pd.read_csv('../data/processed/email_data_with_sentiment.csv')

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display basic information
print("\n=== DATASET OVERVIEW ===")
print(f"Total number of messages: {len(df):,}")
print(f"Number of unique employees: {df['from'].nunique():,}")
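# Parse dates before taking min/max: comparing raw strings is lexicographic and can misorder dates
dates = pd.to_datetime(df['date'], errors='coerce')
print(f"Date range: {dates.min()} to {dates.max()}")
print("Missing values per column:")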
print(df.isnull().sum())

df.head()
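
The later cells group by `year` and `month` columns, which are presumably written by Task 1. If the processed file lacks them, or if `date` loads as plain strings, they can be derived from `date`. The following is a minimal sketch under that assumption; the helper name `add_date_parts` and the toy data are hypothetical and not part of the project.

```python
import pandas as pd

def add_date_parts(frame: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Return a copy with a parsed date column plus 'year' and 'month' columns.

    Hypothetical helper: existing 'year'/'month' columns are left untouched.
    """
    out = frame.copy()
    # Unparseable dates become NaT instead of raising an error
    out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    if "year" not in out.columns:
        out["year"] = out[date_col].dt.year
    if "month" not in out.columns:
        out["month"] = out[date_col].dt.month
    return out

# Toy data standing in for the real file
sample = pd.DataFrame({"date": ["2010-05-10", "2011-01-03", "not a date"]})
sample = add_date_parts(sample)
print(sample)
```

Rows whose date cannot be parsed end up with missing `year` and `month` values, so they surface in the missing-value counts rather than silently distorting the time-based plots.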

## 3. Data Structure Analysis

This cell reports the column data types, summary statistics, missing values, the number of unique values in key columns, and the dataset's memory footprint.

In [None]:
# Data structure analysis
print("=== DATA STRUCTURE ANALYSIS ===")

# Check data types
print("\nData Types:")
print(df.dtypes)

# Statistical summary
print("\nStatistical Summary:")
print(df.describe(include='all'))

# Missing values analysis
print("\nMissing Values Analysis:")
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percentage
})
print(missing_df[missing_df['Missing Count'] > 0])

# Unique values in key columns
print("\nUnique Values in Key Columns:")
key_columns = ['from', 'email_domain', 'sentiment_final', 'year', 'month']
for col in key_columns:
    if col in df.columns:
        print(f"- {col}: {df[col].nunique():,} unique values")

# Memory usage
print(f"\nDataset Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
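
To make the missing-value table easier to reuse in later tasks, the same count and percentage logic can be wrapped in a small function. This is only a sketch: `missing_summary` is a hypothetical name, and the toy frame below stands in for the real data.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Missing Percentage": counts / len(frame) * 100,
    })
    # Keep only affected columns, worst first
    affected = summary[summary["Missing Count"] > 0]
    return affected.sort_values("Missing Count", ascending=False)

# Toy data: 'body' misses 2 of 4 values, 'from' misses 1, 'date' is complete
toy = pd.DataFrame({
    "from": ["a@x.com", None, "b@x.com", "c@x.com"],
    "body": ["hi", "ok", None, None],
    "date": ["2010-01-01"] * 4,
})
report = missing_summary(toy)
print(report)
```

Columns without missing values are dropped from the report, so an empty result means the dataset is complete.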

## 4. Sentiment Distribution Analysis

This section looks at the distribution of sentiment labels overall, by year, by month, and for the ten most active employees.

In [None]:
# Sentiment distribution analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Overall sentiment distribution
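sentiment_counts = df['sentiment_final'].value_counts()
# Assign colors by label, not by position: value_counts() orders by frequency,
# while crosstab()/unstack() order their columns alphabetically.
# Assumes labels read Positive/Neutral/Negative (any case); other labels fall back to steelblue.
palette = {'positive': 'green', 'neutral': 'gray', 'negative': 'red'}
def label_colors(labels):
    return [palette.get(str(label).lower(), 'steelblue') for label in labels]
colors = label_colors(sorted(df['sentiment_final'].dropna().unique()))  # alphabetical, for the bar charts
sentiment_counts.plot(kind='pie', ax=axes[0,0], autopct='%1.1f%%',
                      colors=label_colors(sentiment_counts.index), startangle=90)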
axes[0,0].set_title('Overall Sentiment Distribution')
axes[0,0].set_ylabel('')

# Sentiment by year
yearly_sentiment = pd.crosstab(df['year'], df['sentiment_final'], normalize='index') * 100
yearly_sentiment.plot(kind='bar', ax=axes[0,1], color=colors)
axes[0,1].set_title('Sentiment Distribution by Year (%)')
axes[0,1].set_xlabel('Year')
axes[0,1].set_ylabel('Percentage')
axes[0,1].tick_params(axis='x', rotation=45)
axes[0,1].legend(title='Sentiment')

# Sentiment by month (all years combined)
monthly_sentiment = pd.crosstab(df['month'], df['sentiment_final'], normalize='index') * 100
monthly_sentiment.plot(kind='bar', ax=axes[1,0], color=colors)
axes[1,0].set_title('Sentiment Distribution by Month (%)')
axes[1,0].set_xlabel('Month')
axes[1,0].set_ylabel('Percentage')
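# Label only the months present in the data (assumes 'month' holds integers 1-12)
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
axes[1,0].set_xticks(range(len(monthly_sentiment)))
axes[1,0].set_xticklabels([month_names[int(m) - 1] for m in monthly_sentiment.index])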
axes[1,0].tick_params(axis='x', rotation=45)
axes[1,0].legend(title='Sentiment')

# Top 10 employees by message count with sentiment breakdown
top_employees = df['from'].value_counts().head(10).index
top_emp_sentiment = df[df['from'].isin(top_employees)].groupby(['from', 'sentiment_final']).size().unstack(fill_value=0)
top_emp_sentiment.plot(kind='bar', stacked=True, ax=axes[1,1], color=colors)
axes[1,1].set_title('Sentiment Distribution - Top 10 Most Active Employees')
axes[1,1].set_xlabel('Employee Email')
axes[1,1].set_ylabel('Number of Messages')
axes[1,1].tick_params(axis='x', rotation=45)
axes[1,1].legend(title='Sentiment')

plt.tight_layout()
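# Create the output directory if it does not exist, so savefig cannot fail on a missing path
import os
os.makedirs('../visualizations', exist_ok=True)
plt.savefig('../visualizations/sentiment_distribution_analysis.png', dpi=300, bbox_inches='tight')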
plt.show()

# Print detailed statistics
print("=== SENTIMENT DISTRIBUTION STATISTICS ===")
print("\nOverall Distribution:")
for sentiment, count in sentiment_counts.items():
    percentage = (count / len(df)) * 100
    print(f"- {sentiment}: {count:,} messages ({percentage:.1f}%)")

print("\nTop 5 Most Active Employees:")
top_5_employees = df['from'].value_counts().head(5)
for email, count in top_5_employees.items():
    employee_sentiment = df[df['from'] == email]['sentiment_final'].value_counts()
    sentiment_summary = ", ".join([f"{sent}: {cnt}" for sent, cnt in employee_sentiment.items()])
    print(f"- {email}: {count} messages ({sentiment_summary})")
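
The yearly and monthly charts rely on `pd.crosstab(..., normalize='index')`, which turns each row into shares of that row's total. The toy example below illustrates the effect; the label names `Positive`, `Neutral`, and `Negative` are an assumption about what Task 1 produces.

```python
import pandas as pd

# Toy data: four messages in 2010, two in 2011
toy = pd.DataFrame({
    "year": [2010, 2010, 2010, 2010, 2011, 2011],
    "sentiment_final": ["Positive", "Positive", "Neutral", "Negative",
                        "Neutral", "Neutral"],
})

# normalize='index' divides each row by its own total, so every year sums to 100%
shares = pd.crosstab(toy["year"], toy["sentiment_final"], normalize="index") * 100
print(shares.round(1))
print(list(shares.columns))
```

Because each row is normalized on its own, a year with very few messages looks as prominent as a busy one, so it helps to read the percentages alongside the raw counts. Note also that `crosstab` sorts the column labels alphabetically, so any positional list of colors has to follow that order.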