# Does Homework Really Impact Achievement?

What is the relationship between the amount of time students spend on homework and their academic achievement? We'll use a real-world dataset to explore this question and visualize our findings.

The data for this investigation comes from the [Early Childhood Longitudinal Study (ECLS)](https://nces.ed.gov/ecls/), a large-scale study conducted by the U.S. National Center for Education Statistics. The study tracks students from kindergarten through elementary school, collecting rich information about their development, learning experiences, and family backgrounds. Our subset of the data includes student demographics, homework habits, and standardized achievement scores in math and reading.

## Data Loading and Initial Exploration

1. Import the necessary libraries:
   - Pandas
   - Matplotlib

2. The data is contained in the `ecls_homework_dataset.csv` file in the `data` folder. Read it in as a dataframe named `ecls`.

3. Verify that the data loaded correctly by printing the first few rows of the dataframe.
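If you get stuck, the three steps above can be sketched roughly as follows. So that it runs without the real file, this sketch reads a few made-up rows from an in-memory string. Those values are illustrative only and are not ECLS data. The relative path in the comment assumes the notebook sits next to the `data` folder. Matplotlib is conventionally imported as `import matplotlib.pyplot as plt`.

```python
import io

import pandas as pd

# Made-up stand-in rows so the sketch runs without the real file.
# These values are illustrative only and are NOT from ECLS.
sample_csv = io.StringIO(
    "student_id,grade,homework_minutes,math_score\n"
    "1,0,0,40.0\n"
    "2,0,15,42.0\n"
    "3,1,45,55.0\n"
)

# With the real data, pass the file path instead (assumed relative path):
# ecls = pd.read_csv("data/ecls_homework_dataset.csv")
ecls = pd.read_csv(sample_csv)

# .head() returns the first five rows by default (here there are only three).
print(ecls.head())
```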

### Dataset Key

- `student_id`: A unique identifier for each student in the dataset.

- `grade`: The student's grade level, numerically coded (e.g., 0 for Kindergarten, 1 for 1st Grade, 2 for 2nd Grade).

- `ses_level`: The socioeconomic status (SES) level of the student's family, a general indicator of economic background. The specific categories are not documented here.

- `homework_category`: A categorical code for the homework time assigned or completed. The codes map to the following ranges (the `homework_minutes` column already holds the converted numeric values):

  - `1`: No homework assigned
  - `2`: Less than 30 minutes
  - `3`: 30-60 minutes
  - `4`: 1-2 hours
  - `5`: More than 2 hours

- `homework_minutes`: The numeric conversion of homework time in minutes per day, derived from `homework_category`.

- `math_score`: The student's achievement score in Math.

- `reading_score`: The student's achievement score in Reading.

- `grade_label`: A categorical label for the student's grade (e.g., 'Kindergarten', '1st Grade', '2nd Grade').

- `homework_label`: A categorical label for the homework time (e.g., 'No homework', 'Less than 30 min').

Quickly explore the data using some of the dataframe methods. Some questions you could answer are:
- How big is the dataset?
- What are the summary statistics?
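One way to answer both questions is with `.shape`, `.info()`, and `.describe()`. The sketch below builds a tiny made-up dataframe (not real ECLS values) purely so it runs on its own. In the notebook you would call the same attributes and methods on your own `ecls` dataframe.

```python
import pandas as pd

# Made-up stand-in rows (NOT real ECLS values) so the sketch runs on its own.
ecls = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "grade": [0, 0, 1, 1],
    "homework_minutes": [0, 15, 45, 90],
    "math_score": [40.0, 42.0, 55.0, 58.0],
})

# How big is the dataset? .shape is a (rows, columns) tuple.
n_rows, n_cols = ecls.shape
print(f"{n_rows} rows x {n_cols} columns")

# Column names, dtypes, and non-null counts.
ecls.info()

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
summary = ecls.describe()
print(summary)
```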

Let's start focusing on one grade level at a time:

1. Create a new dataframe that contains data only for students in Kindergarten (Grade 0) called `kinder`. *Hint: remember to set a condition first.*

2. After filtering, use the `.describe()` method to get a quick statistical summary of this specific grade's data.
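The hint about setting a condition first refers to Boolean filtering: build a True/False Series, then use it to select rows. A minimal sketch, using made-up rows rather than the real ECLS data, and a hypothetical intermediate name `is_kinder`:

```python
import pandas as pd

# Made-up stand-in rows (NOT real ECLS values) so the sketch runs on its own.
ecls = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "grade": [0, 0, 1, 2],
    "math_score": [40.0, 42.0, 55.0, 61.0],
})

# The condition: a Boolean Series that is True for Kindergarten rows.
is_kinder = ecls["grade"] == 0

# Indexing with the condition keeps only the rows where it is True.
kinder = ecls[is_kinder]

# Statistical summary for this grade only.
print(kinder.describe())
```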

Now that you have isolated the data for Kindergarten students, explore their average math scores across the different homework categories. This will show you how homework time is associated with math achievement within this specific grade.

**Use the `.groupby()` method to group the `kinder` dataframe by `homework_category`, and then use `.mean()` to calculate the average `math_score` and `reading_score` for each group.**
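As a rough sketch of the pattern, the example below groups a small made-up `kinder` dataframe (illustrative values, not ECLS data), selects the two score columns, and averages them within each homework category.

```python
import pandas as pd

# Made-up stand-in for the filtered Kindergarten data (NOT real ECLS values).
kinder = pd.DataFrame({
    "homework_category": [1, 1, 2, 2, 3],
    "math_score": [40.0, 44.0, 41.0, 45.0, 50.0],
    "reading_score": [38.0, 42.0, 40.0, 44.0, 47.0],
})

# Group rows by category, keep the two score columns, and average each group.
mean_scores = kinder.groupby("homework_category")[["math_score", "reading_score"]].mean()
print(mean_scores)
```

The result is indexed by `homework_category`, with one row per category and one column per score.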

**Repeat this analysis of math and reading scores for some of the other grades.**
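To avoid copying the same cell for every grade, one option is to wrap the filter and the grouping in a small function. The helper name `mean_scores_for_grade` is hypothetical, and the dataframe below holds made-up rows rather than real ECLS values.

```python
import pandas as pd


def mean_scores_for_grade(df, grade):
    """Hypothetical helper: mean math and reading scores per homework category for one grade."""
    one_grade = df[df["grade"] == grade]
    return one_grade.groupby("homework_category")[["math_score", "reading_score"]].mean()


# Made-up stand-in rows (NOT real ECLS values) so the sketch runs on its own.
ecls = pd.DataFrame({
    "grade": [0, 0, 1, 1, 1],
    "homework_category": [1, 2, 2, 2, 3],
    "math_score": [40.0, 42.0, 54.0, 56.0, 60.0],
    "reading_score": [38.0, 41.0, 50.0, 54.0, 57.0],
})

# Run the same analysis once per grade present in the data.
for grade in sorted(ecls["grade"].unique()):
    print(f"Grade {grade}")
    print(mean_scores_for_grade(ecls, grade))
```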

## 📓 Reflection 📓

After seeing the data, do you think homework impacts student achievement? What questions do you still have that the data could potentially answer?

## Supplementary: Visualization of the Results

This code creates a grid of bar plots, one per grade, showing the average math achievement score for each homework category. Because the plots share a y-axis, you can compare the patterns and average scores directly across grade levels in a single image.

**Just run the next cell.**

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Average math score for each (grade, homework category) pair
mean_math_scores_all_grades = ecls.groupby(['grade_label', 'homework_label']).agg(
    grade=('grade', 'first'), # Keep the numeric grade so Kindergarten sorts first
    homework_minutes=('homework_minutes', 'first'), # Keep one homework_minutes for sorting
    math_score=('math_score', 'mean')
).sort_values(by=['grade', 'homework_minutes']).reset_index()

# Visualize mean math scores by homework category for all grades in a single image.
# Use seaborn.catplot to create a grid of bar plots, one for each grade.
# 'col="grade_label"' creates separate columns (facets) for each unique grade.
g = sns.catplot(
    data=mean_math_scores_all_grades,
    x='homework_label',
    y='math_score',
    col='grade_label',
    kind='bar',
    col_wrap=3, # Adjust to wrap plots into rows if many grades
    height=4, aspect=1.2, # Adjust size of each facet
    palette='viridis',
    sharey=True # Share the Y-axis across all plots for easier comparison
)

# Set common labels and titles
g.set_axis_labels('Homework Time Category', 'Average Math Achievement Score')
g.set_titles('Grade: {col_name}')
g.fig.suptitle('Average Math Achievement Scores by Homework Category Across Grades', y=1.02, fontsize=16) # Title for the entire figure

# Rotate x-axis labels for readability in each subplot
for ax in g.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(45)
        label.set_horizontalalignment('right')

plt.tight_layout(rect=[0, 0.03, 1, 0.98]) # Adjust layout to prevent suptitle overlap and labels clipping
plt.show()
```

## Supplementary Material: A Statistical Test

The code snippet below calculates the Pearson correlation coefficient (`r`) between `homework_minutes` and `math_score`, using every row of your `ecls` dataframe (all grades pooled together). The coefficient ranges from -1 to 1 and quantifies the strength and direction of the linear relationship between the two variables. A positive `r` indicates that as one variable increases, the other tends to increase, while a negative `r` indicates that as one increases, the other tends to decrease. A value close to zero means there is little or no linear relationship.


```python
# Pearson correlation between daily homework minutes and math score
correlation = ecls['homework_minutes'].corr(ecls['math_score'])
print(f"Homework vs Math Achievement: r = {correlation:.3f}")
```
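To see what `.corr()` computes, the sketch below works out Pearson's `r` by hand. It sums the products of the deviations from each mean, then divides by the square root of the product of the summed squared deviations. It then compares the result with pandas. The dataframe holds made-up values, not ECLS data, and the names `r_manual` and `r_pandas` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows (NOT real ECLS values) so the sketch runs on its own.
ecls = pd.DataFrame({
    "homework_minutes": [0, 15, 45, 90, 150],
    "math_score": [40.0, 43.0, 47.0, 46.0, 44.0],
})

x = ecls["homework_minutes"].to_numpy(dtype=float)
y = ecls["math_score"].to_numpy(dtype=float)

# Deviations of each value from its column mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: co-variation of x and y, scaled by the spread of each.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default.
r_pandas = ecls["homework_minutes"].corr(ecls["math_score"])

print(f"manual r = {r_manual:.3f}, pandas r = {r_pandas:.3f}")
```

Because `r` only measures linear association, a relationship that rises and then falls can produce a small `r` even when the two variables are clearly related.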