# ProdigyFlow — Kaggle Notebook

**Authors:** Komal Harshita, Priyamvadha Sahasvi Nune

---

## Abstract

This notebook presents a structured, academic-style analysis of a student performance dataset used in the ProdigyFlow pipeline. It follows the typical layout: Abstract, Introduction, Dataset Description, Methodology (Cleaning, Analysis, Visualization), Results, Discussion, Conclusion, and References.

## 1. Introduction

Education analytics can reveal insights about student performance, curriculum effectiveness, and potential interventions. In this project, ProdigyFlow automates cleaning, analysis, and visualization tasks using modular agents. This notebook documents the analysis following an academic structure to make the findings clear and reproducible.

## 2. Dataset Description

The dataset contains student records with demographic and marks information. Key attributes:

- `student_id` — unique identifier
- `location` — city
- `age` — student age (years)
- `sql_marks`, `excel_marks`, `python_marks`, `power_bi_marks`, `english_marks` — subject scores (0-100)

**Size:** 500 rows, 8 columns. The data is provided in `data/data_science_student_marks.csv` in this repository.
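
The attribute list above can be turned into a quick schema check before any analysis runs. The following is a minimal sketch, assuming the 0-100 range is inclusive; the helper name `validate_schema` and the sample rows are hypothetical and not part of the ProdigyFlow agents.

```python
import pandas as pd

# Column names come from the attribute list above; everything else
# (helper name, sample rows) is a hypothetical illustration.
MARK_COLS = ["sql_marks", "excel_marks", "python_marks",
             "power_bi_marks", "english_marks"]
EXPECTED_COLS = ["student_id", "location", "age", *MARK_COLS]


def validate_schema(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    missing = [c for c in EXPECTED_COLS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "student_id" in df.columns and df["student_id"].duplicated().any():
        problems.append("student_id is not unique")
    for col in MARK_COLS:
        if col not in df.columns:
            continue
        marks = pd.to_numeric(df[col], errors="coerce")
        # between() is False for NaN, so exclude NaN explicitly:
        # missing values are a cleaning concern, not a range violation.
        out_of_range = marks.notna() & ~marks.between(0, 100)
        if out_of_range.any():
            problems.append(f"{col}: {int(out_of_range.sum())} value(s) outside 0-100")
    return problems


sample = pd.DataFrame({
    "student_id": [1, 2, 3],
    "location": ["Pune", "Delhi", "Pune"],
    "age": [21, 23, 22],
    "sql_marks": [88, 91, 79],
    "excel_marks": [85, 120, 90],   # 120 is deliberately invalid
    "python_marks": [90, 84, 87],
    "power_bi_marks": [82, 80, 95],
    "english_marks": [77, 89, 92],
})
problems = validate_schema(sample)
print(problems)
```

Returning a list of problems rather than raising on the first one lets the notebook report every schema issue in a single pass.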

## 3. Objectives

This notebook aims to:

1. Describe the dataset and provide reproducible cleaning steps.
2. Perform exploratory data analysis (EDA) and statistical summaries.
3. Evaluate correlations and potential relationships between attributes.
4. Produce publication-quality visualizations.
5. Summarize findings in an academic style (Results, Discussion, Conclusion).

## 4. Setup & Imports

Install packages if needed (Kaggle usually has common packages preinstalled).

In [None]:
# Uncomment if running in a fresh environment
# !pip install -q pandas numpy matplotlib seaborn scipy statsmodels

import os
from pathlib import Path
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.set_theme(style='whitegrid')  # set_theme is the current name; sns.set is an older alias
pd.set_option('display.max_columns', 200)

DATA_PATH = Path('data/data_science_student_marks.csv')
REPORTS_DIR = Path('reports')
VIS_DIR = Path('visuals')
REPORTS_DIR.mkdir(parents=True, exist_ok=True)
VIS_DIR.mkdir(parents=True, exist_ok=True)

print('Setup complete. DATA_PATH exists:', DATA_PATH.exists())
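
On Kaggle, attached datasets are mounted read-only under `/kaggle/input`, so the relative `data/` path may not exist there. The following sketch shows one way to look for the CSV in several candidate roots; the helper name `find_data_file` and the folder layout in the demo are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def find_data_file(filename: str, roots: List[Path]) -> Optional[Path]:
    """Return the first file named `filename` found under any root, or None."""
    for root in roots:
        if not root.is_dir():
            continue
        matches = sorted(root.rglob(filename))
        if matches:
            return matches[0]
    return None


# Demo in a temporary directory that stands in for a dataset mount.
with tempfile.TemporaryDirectory() as tmp:
    fake_root = Path(tmp)
    nested = fake_root / "some-dataset"          # hypothetical dataset folder
    nested.mkdir()
    (nested / "data_science_student_marks.csv").write_text("student_id\n1\n")

    found = find_data_file("data_science_student_marks.csv", [fake_root])
    found_name = found.name if found else None
    not_found = find_data_file("missing.csv", [fake_root])

print(found_name, not_found)
```

In the notebook itself, the roots could be `[Path('data'), Path('/kaggle/input')]`, with the result assigned to `DATA_PATH` when it is not `None`.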

## 5. Data Loading & Initial Inspection

In [None]:
if not DATA_PATH.exists():
    raise FileNotFoundError(f'Expected data file not found at {DATA_PATH}. Please upload the CSV to this path.')

df = pd.read_csv(DATA_PATH)
print('Dataset shape:', df.shape)
df.head()

### 5.1 Data Types and Missing Values

In [None]:
df.info()

print('\nMissing values per column:')
print(df.isna().sum())

### 5.2 Summary Statistics

In [None]:
numeric_cols = ['age','sql_marks','excel_marks','python_marks','power_bi_marks','english_marks']

df[numeric_cols].describe().round(3)

## 6. Methodology — Data Cleaning

We apply minimal, reproducible cleaning: remove exact duplicates, coerce the numeric columns to numeric types (unparseable entries become missing), and fill any missing numeric values with the column median.

In [None]:
# Cleaning steps (defensive)
df_clean = df.copy()
# 1. Drop exact duplicates
before = df_clean.shape[0]
df_clean = df_clean.drop_duplicates().reset_index(drop=True)
after = df_clean.shape[0]
print(f'Dropped duplicates: {before - after}')

# 2. Ensure numeric types
for c in numeric_cols:
    df_clean[c] = pd.to_numeric(df_clean[c], errors='coerce')

# 3. Report missing values after coercion
print('\nMissing after coercion:')
print(df_clean[numeric_cols].isna().sum())

# 4. If any numeric values are missing, fill with the column median (deterministic, reproducible)
for c in numeric_cols:
    if df_clean[c].isna().any():
        # Assign the result back: chained inplace fillna may not update df_clean in recent pandas
        df_clean[c] = df_clean[c].fillna(df_clean[c].median())

print('\nMissing now:')
print(df_clean.isna().sum())

# Save cleaned copy
cleaned_path = Path('data/cleaned_student_data_academic.csv')
df_clean.to_csv(cleaned_path, index=False)
print('\nSaved cleaned dataset to', cleaned_path)
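
Median filling hides which values were originally missing. If that information matters downstream, one option is to record a boolean marker before filling. The sketch below assumes a `<column>_imputed` naming scheme and a helper called `impute_with_flags`; both are hypothetical and not part of the cleaning cell above.

```python
import pandas as pd


def impute_with_flags(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Median-fill `cols`, adding a boolean `<col>_imputed` column wherever a fill happened."""
    out = df.copy()
    for col in cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
        was_missing = out[col].isna()
        if was_missing.any():
            out[f"{col}_imputed"] = was_missing
            out[col] = out[col].fillna(out[col].median())
    return out


# Toy frame: one unparseable entry in python_marks, none in sql_marks.
toy = pd.DataFrame({
    "python_marks": [80, "n/a", 90, 100],
    "sql_marks": [70, 75, 85, 95],
})
filled = impute_with_flags(toy, ["python_marks", "sql_marks"])
print(filled)
```

Columns with nothing to fill get no marker column, so a complete dataset passes through with its shape unchanged.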

## 7. Results — Exploratory Data Analysis

### 7.1 Distributions of Marks

In [None]:
plt.figure(figsize=(12,8))
for i, col in enumerate(numeric_cols):
    plt.subplot(2,3,i+1)
    sns.histplot(df_clean[col], kde=True)
    plt.title(col)
plt.tight_layout()
plt.savefig(VIS_DIR / 'distributions_grid.png', dpi=150)
plt.show()

### 7.2 Boxplots (Detect outliers)

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df_clean[numeric_cols])
plt.title('Boxplot of numeric columns')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(VIS_DIR / 'boxplots.png', dpi=150)
plt.show()
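
The boxplot whiskers use the 1.5 × IQR rule by default, so the points drawn beyond them can also be counted numerically. A minimal sketch follows, assuming the same rule; the helper name `iqr_outlier_counts` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, cols: list, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    below = df[cols].lt(q1 - k * iqr, axis="columns")
    above = df[cols].gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()


# Toy marks: the 20 in python_marks sits far below the rest.
toy = pd.DataFrame({
    "python_marks": [84, 85, 86, 87, 88, 20],
    "sql_marks": [80, 82, 84, 86, 88, 90],
})
counts = iqr_outlier_counts(toy, ["python_marks", "sql_marks"])
print(counts)
```

Applied to the notebook's data, this would be `iqr_outlier_counts(df_clean, numeric_cols)`. A flagged value is only a candidate for review, not proof of a data error.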

### 7.3 Correlation Analysis

In [None]:
corr = df_clean[numeric_cols].corr().round(3)
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation matrix')
plt.tight_layout()
plt.savefig(VIS_DIR / 'correlation_heatmap.png', dpi=150)
plt.show()

corr
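
A heatmap with many cells is hard to scan for the strongest relationships. The sketch below flattens a correlation matrix into one entry per column pair, ordered by absolute coefficient. The helper name `ranked_pairs` and the three toy columns are hypothetical.

```python
from itertools import combinations

import pandas as pd


def ranked_pairs(corr: pd.DataFrame) -> pd.Series:
    """List each unordered column pair once, strongest |r| first."""
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    series = pd.Series(pairs, dtype=float)
    return series.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy data: b is an exact multiple of a, c runs roughly opposite to both.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
ranked = ranked_pairs(toy.corr())
print(ranked.round(3))
```

Sorting by absolute value keeps strong negative correlations near the top instead of burying them at the bottom of the list.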

### 7.4 Location-wise Analysis

In [None]:
loc_summary = df_clean.groupby('location')[numeric_cols].agg(['mean','std','count'])
loc_summary = loc_summary.round(2)
loc_summary.head(12)

In [None]:
plt.figure(figsize=(12,5))
# Show average Python marks by location
avg_python = df_clean.groupby('location')['python_marks'].mean().sort_values(ascending=False)
avg_python.plot(kind='bar')
plt.ylabel('Average Python Marks')
plt.title('Average Python Marks by Location')
plt.tight_layout()
plt.savefig(VIS_DIR / 'avg_python_by_location.png', dpi=150)
plt.show()

### 7.5 Statistical Tests

We perform a few simple tests: ANOVA to check if mean Python marks differ across locations, and Pearson correlations for numeric pairs.

In [None]:
# ANOVA for python marks across locations
locations = df_clean['location'].unique()
groups = [df_clean[df_clean['location']==loc]['python_marks'].values for loc in locations]

f_val, p_val = stats.f_oneway(*groups)
print('ANOVA F-statistic:', round(f_val,3), 'p-value:', round(p_val,4))

# Pearson r with p-values for every numeric pair (corr above holds only the coefficients)
from itertools import combinations
pairs = list(combinations(numeric_cols, 2))
pearson = {f'{a}__{b}': stats.pearsonr(df_clean[a], df_clean[b]) for a,b in pairs}
# Show the first six pairs as (r, p-value)
{ k: (round(v[0],3), round(v[1],4)) for k,v in list(pearson.items())[:6] }
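
A p-value alone does not say how large a difference between locations is. One common companion to one-way ANOVA is eta squared, the share of total variance explained by the grouping. The sketch below computes it by hand next to `stats.f_oneway`; the helper name `eta_squared` and the toy groups are hypothetical and not taken from the dataset.

```python
import numpy as np
from scipy import stats


def eta_squared(groups: list) -> float:
    """Eta squared = between-group sum of squares / total sum of squares."""
    arrays = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(arrays)
    grand_mean = all_values.mean()
    ss_between = sum(len(a) * (a.mean() - grand_mean) ** 2 for a in arrays)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)


# Hypothetical marks for three locations (illustration only)
toy_groups = [[80, 82, 84], [90, 92, 94], [85, 87, 89]]
f_val, p_val = stats.f_oneway(*toy_groups)
eta_sq = eta_squared(toy_groups)
print(f"F = {f_val:.2f}, p = {p_val:.4f}, eta squared = {eta_sq:.3f}")
```

With the notebook's variables, `eta_squared(groups)` can be called on the same `groups` list passed to `stats.f_oneway`.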

## 8. Results Narrative
We summarize key findings from the EDA and statistical tests, highlighting significant patterns and insights.

**Draft findings:**

- The dataset contains 500 student records with no missing values after cleaning.
- Mean marks across subjects are consistently in the mid-80s.
- Correlation matrix shows weak relationships between subjects (no strong linear dependencies).
- ANOVA on Python marks across locations produced a p-value > 0.05, indicating no statistically significant difference in means across cities (example; check actual output above).
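
Numbers quoted in findings like these are easier to keep in sync with the data when they are computed and saved rather than typed by hand. The following sketch collects a few headline values and writes them as JSON. The helper name `build_summary`, the file name `summary.json`, and the toy frame are assumptions; in the notebook the target folder would be `REPORTS_DIR`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def build_summary(df: pd.DataFrame, mark_cols: list) -> dict:
    """Collect headline numbers as plain Python types so they serialize to JSON."""
    return {
        "n_rows": int(len(df)),
        "n_missing": int(df.isna().sum().sum()),
        "mean_marks": {c: round(float(df[c].mean()), 2) for c in mark_cols},
    }


toy = pd.DataFrame({"python_marks": [80, 90, 85], "sql_marks": [88, 84, 86]})
summary = build_summary(toy, ["python_marks", "sql_marks"])

# A temporary directory stands in for the notebook's reports/ folder here.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "summary.json"   # hypothetical file name
    out_path.write_text(json.dumps(summary, indent=2))
    reloaded = json.loads(out_path.read_text())

print(reloaded)
```

Converting to `int` and `float` before serializing avoids the `TypeError` that `json.dumps` raises on NumPy scalar types.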

## 9. Discussion

The dataset shows consistently high student performance, with most subject averages in the mid-80s and no missing values. Correlations between subjects are weak, suggesting that strengths in one skill do not strongly predict performance in others. Age and location also show minimal influence, indicating evenly distributed abilities across the group. Overall, the dataset is clean, balanced, and ideal for demonstrating data cleaning, analysis, and visualization techniques without the variability or noise found in real-world data.

## 10. Conclusion & Future Work

Possible directions for future work:

- Add more demographic/contextual features (GPA, attendance, assignments).
- Include temporal data for trend analysis.
- Build predictive models to forecast student performance.
- Expand agent capabilities to propose interventions based on insights.

## 11. References

- ProdigyFlow repository (this project)
- Pandas documentation
- Seaborn / Matplotlib docs
- Scipy stats documentation

---

## Appendix: Reproducibility & How to run

1. Install the dependencies listed in `requirements.txt`.
2. Run cells in order or execute the pipeline via `python agents/main_agent.py`.
3. Visual outputs are saved to `visuals/` and numeric summaries to `reports/`, both relative to the working directory.

**Files referenced in this notebook:**
- `agents/cleaning_agent.py`, `agents/analysis_agent.py`, `agents/visualization_agent.py` (pipeline agents)
- `data/data_science_student_marks.csv` (raw data)
- `data/cleaned_student_data_academic.csv` (cleaned output)
