# Bangladesh Student Data Analysis: Data Processing Example

This notebook demonstrates how to use the data processing module to analyze student data.

## Setup

First, let's import the necessary libraries and the `DataProcessor` class.

In [None]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add the project root (the directory that contains src/) to the Python path
sys.path.append('../../')
from src.data_processing.data_processor import DataProcessor

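# Set up plotting style
# (the 'seaborn' matplotlib style name was deprecated in matplotlib 3.6 and later removed,
# so use seaborn's own theme function instead)
sns.set_theme(style='darkgrid', palette='husl')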

## Create Sample Data

For demonstration purposes, let's create some sample student data.

In [None]:
# Create sample student data
np.random.seed(42)
n_students = 1000

divisions = ['Dhaka', 'Chittagong', 'Khulna', 'Rajshahi', 'Sylhet', 'Barisal', 'Rangpur', 'Mymensingh']

sample_data = pd.DataFrame({
    'student_id': [f'S{i:04d}' for i in range(n_students)],
    'name': [f'Student {i}' for i in range(n_students)],
    'date_of_birth': pd.date_range(start='2000-01-01', periods=n_students),
    'division': np.random.choice(divisions, n_students),
    'gpa': np.random.normal(3.5, 0.5, n_students).clip(0, 4),
    'days_present': np.random.randint(150, 200, n_students),  # upper bound is exclusive, so 150-199
    'total_school_days': 200
})

sample_data.head()

## Process the Data

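Now let's use our `DataProcessor` class to clean and analyze the data. The later cells assume that `calculate_performance_metrics` adds the `attendance_rate` and `performance_level` columns.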

In [None]:
# Initialize the data processor
processor = DataProcessor()

# Clean the data
cleaned_data = processor.clean_student_data(sample_data)

# Calculate performance metrics
performance_data = processor.calculate_performance_metrics(cleaned_data)

performance_data.head()
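
The `DataProcessor` source is not shown in this notebook, so the exact definitions of the derived columns are unknown here. As a rough stand-in, assuming `attendance_rate` is `days_present / total_school_days` and that `performance_level` is a set of GPA bands (the function name, cut-offs, and labels below are invented for illustration and may differ from the module's), the derived columns could be computed like this:

```python
import pandas as pd


def add_performance_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for DataProcessor.calculate_performance_metrics."""
    out = df.copy()
    # Assumed definition: fraction of school days the student attended.
    out['attendance_rate'] = out['days_present'] / out['total_school_days']
    # Assumed GPA bands; bins are right-inclusive, e.g. (2.0, 3.0],
    # and include_lowest=True makes the first band include 0.0.
    out['performance_level'] = pd.cut(
        out['gpa'],
        bins=[0.0, 2.0, 3.0, 3.5, 4.0],
        labels=['Needs Support', 'Satisfactory', 'Good', 'Excellent'],
        include_lowest=True,
    )
    return out


demo = pd.DataFrame({
    'gpa': [1.8, 2.9, 3.4, 4.0],
    'days_present': [150, 170, 180, 199],
    'total_school_days': 200,
})
demo_metrics = add_performance_columns(demo)
print(demo_metrics[['gpa', 'attendance_rate', 'performance_level']])
```

Using an ordered categorical from `pd.cut` also means that bar charts and group-bys can follow the band order rather than alphabetical or frequency order.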

## Analyze Performance Distribution

Let's visualize the distribution of student performance across different divisions.

In [None]:
# Plot GPA distribution by division
plt.figure(figsize=(12, 6))
sns.boxplot(data=performance_data, x='division', y='gpa')
plt.title('GPA Distribution by Division')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Plot performance level distribution
plt.figure(figsize=(10, 6))
performance_data['performance_level'].value_counts().plot(kind='bar')
plt.title('Distribution of Performance Levels')
plt.xlabel('Performance Level')
plt.ylabel('Number of Students')
plt.tight_layout()
plt.show()

## Analyze Attendance Patterns

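Let's examine the relationship between attendance rate and GPA. Because the sample data generates `gpa` and `days_present` independently, expect no visible trend and a correlation close to zero; a real dataset may look quite different.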

In [None]:
# Create scatter plot of attendance rate vs GPA
plt.figure(figsize=(10, 6))
sns.scatterplot(data=performance_data, x='attendance_rate', y='gpa', alpha=0.5)
plt.title('Relationship between Attendance Rate and GPA')
plt.xlabel('Attendance Rate')
plt.ylabel('GPA')
plt.tight_layout()
plt.show()

# Calculate correlation
correlation = performance_data['attendance_rate'].corr(performance_data['gpa'])
print(f'Correlation between attendance rate and GPA: {correlation:.3f}')
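
The coefficient is easier to interpret with a reference point. The sketch below uses its own random generator and illustrative variable names rather than the notebook's `performance_data`. It compares a GPA series drawn without reference to attendance against one deliberately constructed to depend on it (the slope and noise level are arbitrary choices):

```python
import numpy as np
import pandas as pd

# Separate generator so the notebook's global random state is untouched.
rng = np.random.default_rng(0)
n = 1000

# Attendance rate between 0.75 and 0.995, mirroring the sample data's range.
attendance = pd.Series(rng.integers(150, 200, n) / 200)

# GPA drawn without any reference to attendance.
independent_gpa = pd.Series(rng.normal(3.5, 0.5, n).clip(0, 4))

# GPA built from attendance plus noise (arbitrary, illustrative coefficients).
dependent_gpa = (2.0 + 1.5 * attendance + rng.normal(0, 0.1, n)).clip(0, 4)

# Series.corr computes the Pearson correlation by default.
r_independent = attendance.corr(independent_gpa)
r_dependent = attendance.corr(dependent_gpa)

print(f'Independent GPA: r = {r_independent:.3f}')
print(f'Dependent GPA:   r = {r_dependent:.3f}')
```

The first coefficient should sit near zero and the second should be clearly positive, which gives a sense of scale when reading the value printed for the notebook's own data.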

## Summary Statistics by Division

Let's generate summary statistics for each division.

In [None]:
# Calculate summary statistics by division
summary_stats = performance_data.groupby('division').agg({
    'gpa': ['mean', 'std', 'min', 'max'],
    'attendance_rate': ['mean', 'std'],
    'student_id': 'count'
}).round(3)

summary_stats.columns = ['Avg GPA', 'GPA Std', 'Min GPA', 'Max GPA', 
                        'Avg Attendance', 'Attendance Std', 'Student Count']
summary_stats
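
Renaming the columns by position works, but the labels can end up on the wrong columns if the aggregation dictionary is later reordered. One alternative, shown here on a small made-up frame, is pandas named aggregation, which ties each output name to its source column and function:

```python
import pandas as pd

# Tiny illustrative frame with the same column names used in this notebook.
toy = pd.DataFrame({
    'student_id': ['S0000', 'S0001', 'S0002', 'S0003'],
    'division': ['Dhaka', 'Dhaka', 'Sylhet', 'Sylhet'],
    'gpa': [3.0, 4.0, 2.5, 3.5],
    'attendance_rate': [0.80, 0.90, 0.75, 0.95],
})

# Each keyword is an output column; each value is (source column, function).
toy_summary = toy.groupby('division').agg(
    avg_gpa=('gpa', 'mean'),
    gpa_std=('gpa', 'std'),
    min_gpa=('gpa', 'min'),
    max_gpa=('gpa', 'max'),
    avg_attendance=('attendance_rate', 'mean'),
    attendance_std=('attendance_rate', 'std'),
    student_count=('student_id', 'count'),
).round(3)

print(toy_summary)
```

The result has flat column names from the start, so no separate renaming step is needed.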

## Conclusions

This notebook demonstrates basic usage of the `DataProcessor` class for analyzing student performance data. Because the data is synthetic, the results are illustrative only. The analysis covered:

1. GPA distribution by division and the overall distribution of performance levels
2. Relationship between attendance rate and GPA
3. Summary statistics for each division

For a real analysis, you would:
- Load real data from your data sources
- Perform more detailed cleaning and validation
- Conduct more sophisticated statistical analyses
- Generate comprehensive reports based on the findings
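
As a starting point for the validation step, here is a minimal sketch of range and consistency checks on the columns used in this notebook. The function name `find_invalid_rows` and the specific rules are assumptions for illustration (the 0 to 4 GPA range simply mirrors the sample data); real rules would come from your own data dictionary.

```python
import pandas as pd


def find_invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows failing basic sanity checks, with one boolean column per rule."""
    checks = pd.DataFrame({
        # Every row that shares its student_id with another row.
        'duplicate_id': df['student_id'].duplicated(keep=False),
        # Missing GPAs also fail this check, because between() is False for NaN.
        'gpa_out_of_range': ~df['gpa'].between(0, 4),
        'attendance_exceeds_total': df['days_present'] > df['total_school_days'],
    })
    failed = checks.any(axis=1)
    return df.loc[failed].join(checks.loc[failed])


# Made-up records: one clean row and three rows with different problems.
records = pd.DataFrame({
    'student_id': ['S0001', 'S0002', 'S0002', 'S0003'],
    'gpa': [3.2, 4.6, 3.9, None],
    'days_present': [180, 150, 210, 190],
    'total_school_days': 200,
})

problems = find_invalid_rows(records)
print(problems)
```

Returning the failing rows together with the rule flags, rather than dropping them silently, makes it easier to report data quality issues back to whoever supplied the data.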