# New South Wales Department of Education (NSW DOE) - Data Case Study 
## Data Analysis

In [None]:
import pandas as pd
import duckdb
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import ttest_ind

### Connect to the database

In [None]:
# This notebook only runs SELECT queries, so a read-only connection is sufficient
con = duckdb.connect('../../database/nsw_doe_data_case_study.duckdb', read_only=True)

#### Exploratory Data Analysis:

#### Data profile check

In [None]:
df = con.sql('select * from public_school_nsw_master_dataset').df()
profile_public_school_nsw_master_dataset = ProfileReport(df, title="Public School NSW Data Profiling Report")
profile_public_school_nsw_master_dataset.to_file("profile_public_school_nsw_master_dataset.html")
profile_public_school_nsw_master_dataset

In [None]:
df = con.sql('select * from multi_age_composite_unpivoted').df()
profile_multi_age_composite_unpivoted = ProfileReport(df, title="Multi Age Composite Profiling Report")
profile_multi_age_composite_unpivoted.to_file("profile_multi_age_composite_unpivoted.html")
profile_multi_age_composite_unpivoted

In [None]:
df = con.sql('select * from student_attendance_unpivoted').df()
profile_student_attendance_dataset = ProfileReport(df, title="Student Attendance Profiling Report")
profile_student_attendance_dataset.to_file("student_attendance_unpivoted.html")
profile_student_attendance_dataset

In [None]:
df = con.sql('select * from nsw_composite_school_attendance_data').df()
# Use a dedicated variable so the student attendance profile above is not overwritten
profile_nsw_composite_school_attendance_data = ProfileReport(df, title="nsw_composite_school_attendance_data Profiling Report")
profile_nsw_composite_school_attendance_data.to_file("nsw_composite_school_attendance_data.html")
profile_nsw_composite_school_attendance_data

## Action: Data analysis:

<span style="color:yellow; font-size:30px;">Hypothesis Formulation:</span>

| Title                   | Description                                                          |
|-------------------------|----------------------------------------------------------------------|
| **Objective**           | Determine if multi-age composite classes have an impact on attendance rates. |
| **Null Hypothesis (H₀)** | Multi-age composite classes have no impact on attendance rates.      |
| **Alternative Hypothesis (H₁)** | Multi-age composite classes have a significant impact on attendance rates. |


<span style="color:yellow; font-size:30px;">Statistical Test:</span>

In [None]:
# Query the columns of interest from the database
columns = [
    "Composite_class_count", "Composite_class_students", "Pct_composite_classes",
    "Pct_composite_class_students", "Attendance_pct", "ICSEA_value",
    "latest_year_enrolment_FTE", "Indigenous_pct", "LBOTE_pct"
]
query = f"SELECT {', '.join(columns)} FROM nsw_composite_school_attendance_data"
df = con.execute(query).fetch_df()

# Drop rows with 'np' and blank values in the specified columns;
# .copy() avoids a SettingWithCopyWarning when the columns are reassigned below
df = df[~df['Indigenous_pct'].isin(['np', ''])]
df = df[~df['LBOTE_pct'].isin(['np', ''])]
df = df[~df['ICSEA_value'].isin([''])].copy()

# Convert the columns to numeric (anything unparseable becomes NaN)
for col in columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Descriptive statistics
desc_stats = df.describe()
print(desc_stats)

# Correlation matrix
correlations = df.corr()
print(correlations)

# Separate schools based on ICSEA_value (split at the mean)
mean_icsea = df['ICSEA_value'].mean()
high_icsea = df[df['ICSEA_value'] > mean_icsea]
low_icsea = df[df['ICSEA_value'] <= mean_icsea]

# T-tests: ICSEA_value defines the two groups, so it is not tested against itself
test_cols = [col for col in columns if col != "ICSEA_value"]

# Bonferroni correction for multiple testing (constant, so computed once)
adjusted_alpha = 0.05 / len(test_cols)

significant_cols = []
for col in test_cols:
    t_stat, p_val = ttest_ind(high_icsea[col], low_icsea[col], nan_policy='omit')  # omitting NaN values
    if p_val < adjusted_alpha:
        significant_cols.append(col)

print("\nColumns with significant differences between high and low ICSEA schools:")
print(significant_cols)

### Descriptive Statistics:
- **Composite_class_count**: The average number of composite classes per school is approximately 4.9, with a maximum of 29 and a minimum of 0.
- **Composite_class_students**: On average, there are around 123 students in composite classes in a school, with a maximum of 779 and a minimum of 0.
- **Pct_composite_classes**: The mean percentage of composite classes in schools is approximately 49.5%. The values range from 0% to 100%.
- **Pct_composite_class_students**: The average percentage of students in composite classes is about 51.9% and ranges from 0% to 100%.
- **Attendance_pct**: The average attendance percentage is approximately 91.8%, with a maximum of 97.9% and a minimum of 50.6%.
- **ICSEA_value**: The average ICSEA value is around 980.8, with a range from 586 to 1186.
- **latest_year_enrolment_FTE**: The average enrolment for the latest year is about 353.9, with a maximum of 2079 and a minimum of 2.
- **Indigenous_pct**: On average, 14.1% of the student population in schools are Indigenous, with the percentage going as high as 100% in some schools.
- **LBOTE_pct**: On average, 26.7% of the student population in schools are from a language background other than English. The percentage varies from 0% to 100% across different schools.
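
### Correlation Matrix:
The matrix shows the strength of the linear relationship between each pair of variables. A positive value indicates that the two variables tend to rise together, while a negative value indicates that one tends to fall as the other rises. Values close to 1 or -1 indicate strong correlations, and values near 0 indicate little or no linear relationship.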

Some key takeaways:

- Composite_class_count and Composite_class_students have a strong positive correlation of 0.987, meaning that as one increases, the other tends to increase as well.
- Pct_composite_classes and Pct_composite_class_students also share a strong positive correlation of approximately 0.997.
- ICSEA_value and Indigenous_pct have a strong negative correlation of approximately -0.849. This suggests that schools with higher percentages of Indigenous students tend to have lower ICSEA values.
- Attendance_pct and ICSEA_value share a moderate positive correlation of 0.526, suggesting that schools with higher ICSEA values also tend to have higher attendance percentages.
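
To make the coefficients above more concrete, the following is a minimal sketch of how a Pearson correlation is computed: the co-variation of two variables divided by the product of their spreads. The helper `pearson_r` and the small lists of numbers are made up for illustration and are not taken from the dataset; `df.corr()` computes the same quantity for every pair of columns by default.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator) and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values, NOT taken from the NSW dataset
icsea = [900, 950, 1000, 1050, 1100]
attendance_up = [88.0, 89.0, 90.0, 91.0, 92.0]    # rises in step with icsea
attendance_down = [92.0, 91.0, 90.0, 89.0, 88.0]  # falls in step with icsea

r_up = pearson_r(icsea, attendance_up)
r_down = pearson_r(icsea, attendance_down)
print(r_up, r_down)
```

A perfectly linear increasing relationship gives a coefficient of 1 and a perfectly linear decreasing one gives -1; real data, such as the 0.526 reported for Attendance_pct and ICSEA_value, falls in between.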

#### T-tests:

In [None]:
# Split the dataframe based on ICSEA_value median
median_icsea = df['ICSEA_value'].median()
high_icsea_df = df[df['ICSEA_value'] > median_icsea]
low_icsea_df = df[df['ICSEA_value'] <= median_icsea]

# Columns to run t-tests on
columns = [
    "Composite_class_count", "Composite_class_students", "Pct_composite_classes",
    "Pct_composite_class_students", "Attendance_pct", "latest_year_enrolment_FTE",
    "Indigenous_pct", "LBOTE_pct"
]

# Running t-tests
results = {}
for col in columns:
    t_stat, p_value = stats.ttest_ind(high_icsea_df[col], low_icsea_df[col], nan_policy='omit')
    results[col] = {"t-statistic": t_stat, "p-value": p_value}

results_df = pd.DataFrame(results).T
print(results_df)

#### Outcome of T-test:

**Pct_composite_classes**

- t-statistic: -23.572692
- p-value: extremely close to 0
- Interpretation: There is a highly significant difference in the percentage of composite classes between high ICSEA and low ICSEA schools. Because the high ICSEA group is passed first to the test, the negative t-statistic means that high ICSEA schools have the lower mean.

**Pct_composite_class_students**

- t-statistic: -23.001235
- p-value: extremely close to 0
- Interpretation: There is a highly significant difference in the percentage of students in composite classes between high ICSEA and low ICSEA schools, again with the lower mean in high ICSEA schools.

**Attendance_pct**

- t-statistic: 35.825538
- p-value: extremely close to 0
- Interpretation: Attendance percentage differs significantly between high ICSEA and low ICSEA schools. The positive t-statistic means that high ICSEA schools have the higher mean attendance.

When we consider the above metrics in relation to each other:

Both Pct_composite_classes and Pct_composite_class_students show significant differences between high ICSEA and low ICSEA schools. This implies that the use of composite classes or the proportion of students in these classes might be influenced by school type or other factors related to the school's socio-economic status (as represented by the ICSEA value).

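The Attendance_pct also shows a significant difference, which means attendance is associated with ICSEA values: schools with a higher ICSEA tend to have higher attendance. The t-test alone does not show that ICSEA causes the difference.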

However, to directly determine if multi-age composite classes have an impact on attendance rates, we need to examine the correlation between composite class metrics (Pct_composite_classes and Pct_composite_class_students) and Attendance_pct:

The correlation between Pct_composite_classes and Attendance_pct is -0.208262. This indicates a weak negative relationship: schools with a higher percentage of composite classes tend to have slightly lower attendance.

Similarly, the correlation between Pct_composite_class_students and Attendance_pct is -0.205552, which also indicates a weak negative relationship between the percentage of students in composite classes and attendance.

Comments:
While there is a weak negative correlation between the use of composite classes (or the percentage of students in these classes) and attendance rates, it is essential to be cautious: correlation does not imply causation. Other factors (e.g., school resources, location, socio-economic factors) might influence both the use of composite classes and attendance rates. ICSEA is a plausible confounder here, since it is positively correlated with attendance (0.526) and the composite class percentages differ sharply between high and low ICSEA schools. Further analyses, such as a regression that controls for these factors, would be needed to determine whether there is a direct relationship between composite classes and attendance rates.
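
As a rough illustration of why controlling for ICSEA matters, the sketch below fits an ordinary least squares regression with NumPy on synthetic data. Everything in it is an assumption made for the example: the data is randomly generated rather than drawn from the NSW dataset, and it is deliberately constructed so that ICSEA drives both variables while composite classes have no direct effect on attendance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (NOT the NSW dataset): ICSEA drives both variables,
# and composite classes have no direct effect on attendance.
icsea = rng.normal(1000, 80, n)
pct_composite = 50 - 0.1 * (icsea - 1000) + rng.normal(0, 10, n)
attendance = 92 + 0.02 * (icsea - 1000) + rng.normal(0, 1, n)

# The raw correlation is negative only because both depend on ICSEA
raw_r = np.corrcoef(pct_composite, attendance)[0, 1]

# Ordinary least squares: attendance ~ intercept + pct_composite + icsea
X = np.column_stack([np.ones(n), pct_composite, icsea])
coefs, *_ = np.linalg.lstsq(X, attendance, rcond=None)
_, beta_composite, beta_icsea = coefs

print(f"raw correlation: {raw_r:.3f}")
print(f"composite coefficient, adjusted for ICSEA: {beta_composite:.4f}")
print(f"ICSEA coefficient: {beta_icsea:.4f}")
```

In this constructed case the raw correlation is clearly negative, yet the composite coefficient is close to zero once ICSEA is included. Whether the real data behaves the same way can only be established by running a comparable model on the actual columns.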

#### Hypothesis Outcome:
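If a t-test p-value is below the chosen significance level (usually 0.05), we reject that test's null hypothesis. For the t-tests here, the null hypothesis is that the metric has the same mean in high ICSEA and low ICSEA schools, so a p-value below the threshold indicates a significant difference between the two groups.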
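
For readers who want to see what sits behind the p-value, here is a minimal sketch of the pooled-variance two-sample t-statistic, which is what `ttest_ind` computes with its default `equal_var=True`. The helper `pooled_t_test` and its inputs are made up for illustration. The p-value uses a normal approximation, which is an assumption that is only reasonable for large samples; SciPy uses the exact t distribution.

```python
import math
from statistics import NormalDist, mean, variance


def pooled_t_test(a, b):
    """Two-sample t-statistic with pooled variance and an approximate p-value."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance, weighting each group by its degrees of freedom
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (mean(a) - mean(b)) / std_err
    # Two-sided p-value from a normal approximation (assumption: large samples)
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_approx


# Made-up illustrative values, NOT taken from the NSW dataset
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]

t_stat, p_approx = pooled_t_test(group_a, group_b)
print(t_stat, p_approx)
```

The sign of the statistic follows the argument order: it is negative when the first group has the lower mean, which is how the signs of the t-statistics reported earlier should be read.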

Results:

| Variable                     | p-value                            |
|------------------------------|------------------------------------|
| Composite_class_count        | 1.996499e-01 (or 0.1996)           |
| Composite_class_students     | 4.806868e-04 (or 0.0004807)        |
| Pct_composite_classes        | 4.760830e-119 (virtually 0)        |
| Pct_composite_class_students | 1.314659e-113 (virtually 0)        |
| Attendance_pct               | 9.131261e-262 (virtually 0)        |
| latest_year_enrolment_FTE    | 1.060082e-184 (virtually 0)        |
| Indigenous_pct               | 0.000000e+00 (reported as 0)       |
| LBOTE_pct                    | 2.038754e-128 (virtually 0)        |

Interpretation & Conclusion:
For Composite_class_count, the p-value (0.1996) is greater than 0.05. Therefore, we fail to reject the null hypothesis for this variable. This means there is no significant difference in the count of composite classes between high and low ICSEA schools.

For all other variables (Composite_class_students, Pct_composite_classes, Pct_composite_class_students, Attendance_pct, latest_year_enrolment_FTE, Indigenous_pct, LBOTE_pct), the p-values are far below the 0.05 significance level. Hence, we can reject the null hypothesis for these variables. This indicates that there are significant differences in these metrics between high ICSEA schools and low ICSEA schools.
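
Because eight tests are run at once, a stricter threshold can be used to guard against false positives. The sketch below applies a Bonferroni correction to the p-values reported above; the variable names are chosen for this example only.

```python
# p-values as reported in the results above
p_values = {
    "Composite_class_count": 1.996499e-01,
    "Composite_class_students": 4.806868e-04,
    "Pct_composite_classes": 4.760830e-119,
    "Pct_composite_class_students": 1.314659e-113,
    "Attendance_pct": 9.131261e-262,
    "latest_year_enrolment_FTE": 1.060082e-184,
    "Indigenous_pct": 0.0,
    "LBOTE_pct": 2.038754e-128,
}

alpha = 0.05
# Bonferroni: divide the significance level by the number of tests
adjusted_alpha = alpha / len(p_values)

significant = [name for name, p in p_values.items() if p < adjusted_alpha]
not_significant = [name for name, p in p_values.items() if p >= adjusted_alpha]

print(f"adjusted alpha: {adjusted_alpha}")
print("significant:", significant)
print("not significant:", not_significant)
```

With eight tests the adjusted threshold is 0.05 / 8 = 0.00625, and the conclusions are unchanged: only Composite_class_count fails to reach significance.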

To relate this back to the objective, "Do multi-age composite classes have an impact on attendance rates?":

The metrics directly related to composite classes (Pct_composite_classes and Pct_composite_class_students) show a significant difference between high ICSEA and low ICSEA schools.

The Attendance_pct also significantly differs between the two groups.

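Given these results, both the use of composite classes and attendance rates vary with ICSEA, and the two are weakly negatively correlated (about -0.21). This points to an association between multi-age composite classes and attendance rates. However, the t-tests compare high and low ICSEA schools rather than testing composite classes against attendance directly, so they do not on their own justify rejecting H₀. Determining the nature of the relationship (i.e., causative or merely associative) would require further, more detailed analyses that control for ICSEA and other school characteristics.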