# Notebook #2: Descriptive Statistics & Initial EDA

In this notebook, we perform basic exploratory data analysis on the cleaned dataset to understand its size, composition, and the relationships relevant to our research questions (i.e., demographics, student status, funding).

## 1.0. Import Libraries

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

<hr>

## 2.0. Import the Cleaned 'district_and_expenses' CSV Data

In [2]:
# Import and store the cleaned DataFrame from the 'district_and_expenses.csv' file
district_and_expenses = pd.read_csv('district_and_expenses.csv')

# Display the imported 'district_and_expenses' DataFrame
display(district_and_expenses)

Unnamed: 0,Fed ID,District Code,CDS Code,County Name,District Type,Grade Low,Grade High,Grade Low Census,Grade High Census,Assistance Status,...,Students with Disabilities (%),Socioeconomically Disadvantaged,Socioeconomically Disadvantaged (%),District Label,District Name,EDP 365,Expense ADA,Expense per ADA,LEA Type,Decimal Difference
0,601770.0,61119,1.611190e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,12.200000,4035.0,38.200000,Alameda Unified (Alameda),Alameda Unified,1.550948e+08,8567.86,18101.93,Unified,0.232163
1,601860.0,61127,1.611270e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,9.000000,1122.0,31.400000,Albany City Unified (Alameda),Albany City Unified,6.149090e+07,3435.41,17899.14,Unified,0.040342
2,604740.0,61143,1.611430e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,12.000000,2508.0,27.600000,Berkeley Unified (Alameda),Berkeley Unified,2.205508e+08,8572.17,25728.70,Unified,0.058892
3,607800.0,61150,1.611500e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,11.000000,3686.0,38.800000,Castro Valley Unified (Alameda),Castro Valley Unified,1.424913e+08,8991.52,15847.30,Unified,0.055328
4,612630.0,61168,1.611680e+12,Alameda,Unified,KG,12,KG,12,General Assistance,...,12.500000,327.0,54.500000,Emery Unified (Alameda),Emery Unified,1.586300e+07,554.70,28597.44,Unified,0.081666
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
927,,76349,,Mendocino,Elementary,KG,12,KG,8,General Assistance,...,14.883721,243.0,56.511628,Arena Union Elementary/Point Arena Joint Union...,Arena Union Elementary/Point Arena Joint Union...,1.016266e+07,325.53,31218.80,Comm Admin,0.320923
928,,40261,,Santa Cruz,Elementary,KG,5,KG,5,General Assistance,...,14.636480,2304.0,36.734694,Santa Cruz City Elementary/High (Santa Cruz),Santa Cruz City Elementary/High,1.152800e+08,5688.18,20266.58,Comm Admin,0.102637
929,,40246,,Sonoma,Elementary,KG,12,KG,6,,...,17.717921,3326.0,45.018950,Petaluma City Elementary/Joint Union High (Son...,Petaluma City Elementary/Joint Union High,1.252075e+08,6651.17,18824.88,Comm Admin,0.110782
930,,40253,,Sonoma,Elementary,KG,8,KG,6,,...,17.340181,7541.0,50.959589,Santa Rosa City Schools (Sonoma),Santa Rosa City Schools,2.486762e+08,11701.14,21252.30,Comm Admin,0.264663


<hr>

## 3.0. Descriptive Statistics

### 3.1. Distribution of Demographic Percentages

To understand diversity and equity across California school districts, we examine the distributions of key student demographic percentages: racial/ethnic groups, English learners, and student status populations (i.e., foster, socioeconomically disadvantaged (SED), migrant, homeless).

In [3]:
# Keep a list of the student demographic column names
DEMOGRAPHICS = ['African American (%)', 'American Indian (%)', 'Asian (%)', 'Filipino (%)',
                'Hispanic (%)', 'Pacific Islander (%)', 'White (%)', 'Two or More Races (%)',
                'English Learner (%)', 'Foster (%)', 'Homeless (%)', 'Migrant (%)',
                'Students with Disabilities (%)', 'Socioeconomically Disadvantaged (%)']

# Display the descriptive statistics for the columns listed in DEMOGRAPHICS
district_and_expenses[DEMOGRAPHICS].describe()

Unnamed: 0,African American (%),American Indian (%),Asian (%),Filipino (%),Hispanic (%),Pacific Islander (%),White (%),Two or More Races (%),English Learner (%),Foster (%),Homeless (%),Migrant (%),Students with Disabilities (%),Socioeconomically Disadvantaged (%)
count,932.0,932.0,932.0,932.0,932.0,932.0,932.0,932.0,932.0,932.0,932.0,932.0,932.0,932.0
mean,2.333693,1.766325,5.868692,1.270682,47.828038,0.282077,34.178112,5.135629,16.667545,0.536182,3.679393,1.454324,12.928851,58.668789
std,4.047831,6.849491,10.816395,2.362522,28.105139,0.507898,24.890993,4.336854,14.767007,1.279515,5.112967,3.779724,4.479624,24.187314
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8
25%,0.3,0.1,0.5,0.0,21.9,0.0,10.975,1.6,5.2,0.1,0.5,0.0,10.4,41.075
50%,0.9,0.3,1.5,0.5,45.85,0.1,31.780233,4.5,13.1,0.3,1.95621,0.0,13.0,61.15
75%,2.4,0.8,5.7,1.4,70.825,0.3,55.325,7.5,23.925,0.6,4.7,0.9,15.0,78.925
max,44.2,100.0,69.4,25.8,100.0,6.5,100.0,25.0,75.0,30.0,40.8,49.4,66.7,100.0


The summary statistics above provide a quick overview of the distributions of student demographic percentages across California school districts. This information can be used to identify the range of values, central tendencies, and spread of each demographic group.

Some insights:
- High variability in Hispanic (%) (std ~28 percentage points) suggests that Hispanic students are heavily concentrated in some districts and sparse in others.
- The high mean for Socioeconomically Disadvantaged (%) (~58.7%) indicates that SED students make up a majority of enrollment in the average district.
- The moderately high mean for English Learner (%) (~16.7%) indicates that English learners are a sizable group across CA districts.
- Low means for Foster (~0.5%) and Migrant (~1.5%) indicate that these groups are small in most districts, although the maxima (30.0% and 49.4%) show that they are concentrated in a few.
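
Standard deviations are hard to compare across columns whose means differ by an order of magnitude. One option is the coefficient of variation (std divided by mean): from the table above, that ratio is roughly 0.59 for Hispanic (%) and roughly 2.39 for Foster (%). The sketch below is illustrative only and uses a small made-up DataFrame (`toy`, hypothetical values) rather than the real data; swapping in `district_and_expenses[DEMOGRAPHICS]` should give the actual ratios.

```python
import pandas as pd

# Hypothetical toy values standing in for two of the demographic columns
toy = pd.DataFrame({
    'Hispanic (%)': [10.0, 40.0, 50.0, 90.0],
    'Foster (%)': [0.0, 0.2, 0.4, 3.0],
})

# Transpose describe() so that each demographic column becomes one row
summary = toy.describe().T

# Coefficient of variation: spread relative to the size of the mean
summary['cv'] = summary['std'] / summary['mean']

print(summary[['mean', 'std', 'cv']].sort_values('cv', ascending=False))
```

A column with a small mean can still rank first here, which is the case the raw standard deviations hide.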


### 3.2. Distribution of the 'Expense per ADA' Column

We wanted to understand the spread of per-student spending, the crux of our research questions. 

In [4]:
district_and_expenses['Expense per ADA'].describe()

count       932.000000
mean      22181.055837
std       11107.346413
min        9951.740000
25%       17181.642500
50%       19742.790000
75%       22965.720000
max      139026.730000
Name: Expense per ADA, dtype: float64

The overall distribution of `'Expense per ADA'` shows n = 932 districts with:
- Mean ~$22,181
- Median ~$19,743
- Standard Deviation ~$11,107

The mean exceeds the median, and the maximum ($139,027) lies much further from the median than the minimum ($9,952) does, which indicates a **right-skewed distribution**. Most districts are clustered at the lower end of the scale, and a small number of very high-spending outliers pull the mean above the median.

The interquartile range (IQR = Q3 (22,966) - Q1 (17,182) = $5,784) captures the range of the middle 50% of the data.
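
The IQR can also be used to flag outliers with the common 1.5 × IQR rule (Tukey fences). The sketch below applies that rule to the quartiles reported by `describe()` above; the helper name `tukey_fences` and the choice of `k = 1.5` are assumptions for illustration, not part of the original analysis.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of 'Expense per ADA' as reported by describe() in Section 3.2
Q1, Q3 = 17181.6425, 22965.72

lower, upper = tukey_fences(Q1, Q3)
print(f"Lower fence: {lower:,.2f}")
print(f"Upper fence: {upper:,.2f}")

# On the real data, the flagged districts could be selected with something like:
# spend = district_and_expenses['Expense per ADA']
# outliers = district_and_expenses[(spend < lower) | (spend > upper)]
```

Under this rule the upper fence sits at roughly $31,642, far below the maximum of $139,027, while the minimum of $9,952 stays above the lower fence of roughly $8,506. That is consistent with outliers appearing only on the high-spending side.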

### 3.3. Breakdown of 'Expense per ADA' by 'District Type' Column

We considered how `'District Type'` (i.e., `'Elementary'`, `'High'`, `'Unified'`) may influence per-pupil spending due to differing curricula, facilities, or student needs.

In [5]:
district_and_expenses.groupby('District Type')['Expense per ADA'].mean()

District Type
Elementary    22642.697093
High          20855.198451
Unified       21763.458696
Name: Expense per ADA, dtype: float64

`'Elementary'` districts spend the most per ADA on average, while `'High'` school districts spend the least, with `'Unified'` districts in between.

These differences are modest (the gap between `'Elementary'` and `'High'` is about $1,800, or roughly 9% of the `'High'` average) but noticeable, and they may be statistically significant. We will consider this in our regression analysis to determine if `'District Type'` is a significant predictor of `'Expense per ADA'`.
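
Because `'Expense per ADA'` is right-skewed (Section 3.2), group means can be pulled upward by a few extreme districts. A possible refinement is to report the group size and median alongside the mean. The sketch below shows the pattern on a small made-up DataFrame (`toy`, hypothetical values); on the real data the same `agg` call would be applied to `district_and_expenses`.

```python
import pandas as pd

# Hypothetical toy rows; one extreme 'Elementary' value mimics a high-spending outlier
toy = pd.DataFrame({
    'District Type': ['Elementary', 'Elementary', 'Elementary',
                      'High', 'High', 'Unified', 'Unified'],
    'Expense per ADA': [18000.0, 20000.0, 90000.0,
                        19000.0, 21000.0, 20000.0, 22000.0],
})

# Count, mean, and median per district type in a single table
by_type = (
    toy.groupby('District Type')['Expense per ADA']
       .agg(['count', 'mean', 'median'])
)

print(by_type)
```

In the toy data the single outlier lifts the `'Elementary'` mean well above its median, while the other groups are unaffected. A similar gap in the real output would suggest that the ranking of means is driven by outliers.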

### 3.4. Breakdown of 'Expense per ADA' by 'Locale' Column

District `'Locale'` (i.e., `'City'`, `'Town'`, `'Rural'`, `'Suburban'`) often correlates with economic factors that may affect district budgets.

We considered grouping by `'Locale'` to understand if there are significant differences in per-pupil spending across location types, which may be due to housing costs, transportation, or staffing challenges in sparser or more remote areas.

In [6]:
district_and_expenses.groupby('Locale')['Expense per ADA'].mean()

Locale
City            20187.757820
Not Reported    22276.770000
Rural           25534.498475
Suburban        20471.701115
Town            19805.738153
Name: Expense per ADA, dtype: float64

`'Rural'` locales ($25,534) showed the highest average `'Expense per ADA'`, roughly $5,000 to $5,700 above the `'City'`, `'Suburban'`, and `'Town'` averages. One possible reason is that rural districts face structural cost factors (e.g., transportation, staffing, housing) that increase per-pupil spending.

### 3.5. Correlations between 'Expense per ADA' and Demographic Percentage Columns

We wanted to explore potential relationships between spending and student demographics. To do this, we computed the Pearson correlations between `'Expense per ADA'` values and demographic percentages.

The pandas `DataFrame.corr()` method uses the Pearson method by default, which measures the strength of the linear relationship between two variables.

In [7]:
district_and_expenses[['Expense per ADA'] + DEMOGRAPHICS].corr()

Unnamed: 0,Expense per ADA,African American (%),American Indian (%),Asian (%),Filipino (%),Hispanic (%),Pacific Islander (%),White (%),Two or More Races (%),English Learner (%),Foster (%),Homeless (%),Migrant (%),Students with Disabilities (%),Socioeconomically Disadvantaged (%)
Expense per ADA,1.0,-0.015822,0.223872,-0.11973,-0.10834,-0.05404,-0.053364,0.056764,0.029782,0.035026,0.16329,0.106771,0.039248,0.188632,0.143856
African American (%),-0.015822,1.0,-0.083401,0.047436,0.204436,0.069447,0.292803,-0.266314,0.042404,-0.015922,0.089812,-0.019211,-0.138319,0.109889,0.145167
American Indian (%),0.223872,-0.083401,1.0,-0.100695,-0.075441,-0.218426,-0.065433,0.021522,0.105848,-0.170176,0.159801,0.041118,-0.060554,0.190101,0.075872
Asian (%),-0.11973,0.047436,-0.100695,1.0,0.397292,-0.256856,0.176154,-0.202308,0.193719,-0.033464,-0.103466,-0.139262,-0.158101,-0.153105,-0.360027
Filipino (%),-0.10834,0.204436,-0.075441,0.397292,1.0,-0.040743,0.361735,-0.259234,0.114708,0.033892,-0.083292,-0.06741,-0.116461,-0.023331,-0.15341
Hispanic (%),-0.05404,0.069447,-0.218426,-0.256856,-0.040743,1.0,-0.045171,-0.824279,-0.679685,0.786666,-0.033007,0.189019,0.44923,-0.071258,0.670924
Pacific Islander (%),-0.053364,0.292803,-0.065433,0.176154,0.361735,-0.045171,1.0,-0.12905,0.10335,0.033363,-0.029579,0.008261,-0.106734,0.040709,-0.020221
White (%),0.056764,-0.266314,0.021522,-0.202308,-0.259234,-0.824279,-0.12905,1.0,0.451642,-0.716231,0.015307,-0.125586,-0.322604,0.072841,-0.52494
Two or More Races (%),0.029782,0.042404,0.105848,0.193719,0.114708,-0.679685,0.10335,0.451642,1.0,-0.543285,0.027563,-0.126216,-0.33023,0.05704,-0.541559
English Learner (%),0.035026,-0.015922,-0.170176,-0.033464,0.033892,0.786666,0.033363,-0.716231,-0.543285,1.0,-0.084635,0.222317,0.498234,-0.085722,0.552155


The Pearson correlations between `'Expense per ADA'` and the demographic percentages are all weak: the largest in magnitude are American Indian (%) (~0.22), Students with Disabilities (%) (~0.19), and Foster (%) (~0.16). More analysis is needed to understand if these relationships are statistically significant (see Notebook 5).
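
Since only the `'Expense per ADA'` row of the matrix matters for this question, and since Pearson correlation is sensitive to outliers, one possible follow-up is to pull out that single column of correlations and compare Pearson against the rank-based Spearman method. The sketch below uses `DataFrame.corrwith` on a small made-up DataFrame (`toy`, hypothetical values); it is not the analysis the notebook actually ran.

```python
import pandas as pd

# Hypothetical toy data with one very high-spending row
toy = pd.DataFrame({
    'Expense per ADA': [15000.0, 18000.0, 21000.0, 26000.0, 90000.0],
    'Foster (%)': [0.1, 0.2, 0.3, 0.4, 0.5],
    'Asian (%)': [30.0, 12.0, 8.0, 3.0, 1.0],
})

target = toy['Expense per ADA']
predictors = toy.drop(columns='Expense per ADA')

# One row per demographic column, one column per correlation method
comparison = pd.DataFrame({
    'pearson': predictors.corrwith(target),
    'spearman': predictors.corrwith(target, method='spearman'),
})

print(comparison)
```

A large difference between the two columns for a given demographic would suggest that the relationship is monotonic but not linear, or that a few extreme districts dominate the Pearson value.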

<hr>

## 4.0. Initial EDA

### 4.1. Box Plot - Distribution of Total Enrollment

A box plot will help us visualize enrollment distributions, spread, and outliers.

In [8]:
fig_enroll_box = px.box(
    district_and_expenses, 
    x='Enroll Total', 
    title='Box Plot of Total Enrollment')
    
fig_enroll_box.show()

This box plot confirms our earlier observation that the distribution of **total enrollment is right-skewed**, with a few districts having much higher enrollments than the rest.

### 4.2. Box Plot - Distribution of the 'Expense per ADA' Column

A box plot for `'Expense per ADA'` will help us visualize the distribution and spread of per-pupil spending.

In [9]:
fig_exp_box = px.box(
    district_and_expenses, 
    x='Expense per ADA', 
    title='Box Plot of Expense per ADA')

fig_exp_box.show()

Consistent with the summary statistics in Section 3.2, this box plot shows that the distribution of `'Expense per ADA'` values is **right-skewed**, with a few high-spending outliers.

### 4.3. Scatter Plot - 'Expense per ADA' vs. Total Enrollment (Without Los Angeles Unified)

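Los Angeles Unified is by far the largest district in the state, so its enrollment compresses every other district toward the left edge of the plot. By removing it, we can better visualize the patterns among the other districts.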

In [10]:
scatter_drop_LA = district_and_expenses[district_and_expenses['District Name'] != 'Los Angeles Unified']

fig_scatter = px.scatter(
    scatter_drop_LA, 
    x='Enroll Total', 
    y='Expense per ADA',
    color='District Type',
    opacity=0.5,
    trendline='ols',
    hover_name='District Name',
    title='Expense per ADA vs. Total Enrollment',
    subtitle='Without LA Unified')

fig_scatter.show()

The relationship between district size and spending appears weakly negative to flat, suggesting that larger districts may spend slightly less per student than smaller districts do.

### 4.4. Scatter Plot - 'Expense per ADA' vs. Percentage of Socioeconomically Disadvantaged Pupils

We wanted to know whether school districts with more socioeconomically disadvantaged students spend more per ADA.

In [11]:
fig_econ_dis = px.scatter(
    district_and_expenses,
    x='Socioeconomically Disadvantaged (%)',
    y='Expense per ADA',
    hover_data=['District Name', 'County Name'],
    title='Expense per ADA vs. Socioeconomically Disadvantaged (%)',
    subtitle='Weak Positive Correlation',
    labels={'Socioeconomically Disadvantaged (%)': 'Socioeconomically Disadvantaged (%)', 'Expense per ADA': 'Expense per ADA ($)'},
    trendline='ols',
    trendline_color_override='red'
)
fig_econ_dis.show()

The OLS trendline suggests that the relationship appears weakly positive.

In other words, districts with higher proportions of socioeconomically disadvantaged students tend to have slightly higher per-ADA spending on average.

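One possible explanation is that the state provides additional resources and support to districts with more socioeconomically disadvantaged students, which would raise their per-ADA spending. This may be an effect of California's Local Control Funding Formula (LCFF), which directs additional funding to districts based on their counts of low-income, English learner, and foster youth students.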

### 4.5. Scatter Plot - 'Expense per ADA' vs. Percentage of Migrant Pupils

We wondered whether migrant students may impact per-pupil spending due to their unique needs and the resources required to support them.

In [12]:
fig_migrant = px.scatter(
    district_and_expenses,
    x='Migrant (%)',
    y='Expense per ADA',
    hover_data=['District Name', 'County Name'],
    title='Expense per ADA vs. Migrant (%)',
    labels={'Migrant (%)': 'Migrant (%)', 'Expense per ADA': 'Expense per ADA ($)'},
    trendline='ols',
    trendline_color_override='red'
)
fig_migrant.show()

The R-squared value is extremely low (0.0015), suggesting that there is little or no linear relationship between the percentage of migrant students and per-pupil spending. However, we will do further regression analysis to determine if this relationship is statistically significant.
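
For a simple linear regression with one predictor and an intercept, R-squared equals the square of the Pearson correlation, so the trendline value can be cross-checked against Section 3.5: the correlation for Migrant (%) is 0.039248, and 0.039248² ≈ 0.0015. The sketch below illustrates that identity with NumPy on synthetic data (the variables `x` and `y` and their generating parameters are made up for illustration).

```python
import numpy as np

# Synthetic data: a weak linear signal buried in a lot of noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 50, size=200)
y = 20000 + 30 * x + rng.normal(0, 10000, size=200)

# Fit y = slope * x + intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# R-squared: share of the variance in y explained by the fitted line
r_squared = 1 - residuals.var() / y.var()

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print(f"R-squared from the fit: {r_squared:.6f}")
print(f"Squared Pearson r:      {r ** 2:.6f}")
```

The two printed values should agree up to floating-point error, which is why the scatter-plot R-squared values in this section add no information beyond the correlations already computed.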

### 4.6. Scatter Plot - 'Expense per ADA' vs. Percentage of English Learner Pupils

As with migrant students, we wanted to know whether English Learners may impact per-pupil spending due to their unique needs and the resources required to support them.

In [13]:
fig_english_learner = px.scatter(
    district_and_expenses,
    x='English Learner (%)',
    y='Expense per ADA',
    hover_data=['District Name', 'County Name'],
    title='Expense per ADA vs. English Learner (%)',
    labels={'English Learner (%)': 'English Learner (%)', 'Expense per ADA': 'Expense per ADA ($)'},
    trendline='ols',
    trendline_color_override='red'
)
fig_english_learner.show()

The R-squared value (0.0012) is extremely low, suggesting that there is little or no linear relationship between the percentage of English Learners and per-pupil spending. However, we will do further regression analysis to determine if this relationship is statistically significant.
