# Data Collection

The purpose of this notebook is to collect all public raw data and turn them into dataframes.  
These dataframes are then filtered according to the filtering strategy below. Filtering reduces the size of the data so that each dataframe can be saved as a pickle file for later use. 

The public raw datasets are not included in the repository due to their size.  
To run this notebook, download them from here: [public datasets](https://drive.google.com/drive/folders/11psdH5PwJq7BNER6BO8Lu1YsJq8Exezv?usp=sharing)  
The original download locations of these datasets are also listed in the notebook, in the section for each dataset.

### Filtering Strategy

**Aggregate Level:** "S". Focuses analysis on individual schools.  
**CharterSchool:** "No" or "N". Excludes charter schools to focus on traditional public high schools.   
**DASS:** "No" or "N". Removes alternative/continuation programs so graduation rates reflect typical comprehensive high schools.   
**ReportingCategory:** "TA". Keeps aggregate totals for each school (not broken down by subgroup) to simplify modeling.
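
The column names and the "No"/"N" coding differ from file to file, so the same four rules are repeated by hand in the cells below. As a minimal sketch, the strategy could be expressed once as a helper. The function name `filter_schools`, its `column_map` argument, and the toy data are hypothetical and are not part of `helper.py`.

```python
import pandas as pd


def filter_schools(df, column_map, no_value="No"):
    """Apply the filtering strategy to one dataset.

    column_map maps a logical name ("aggregate", "charter", "dass",
    "category") to the dataset's actual column name. Keys that are
    missing are skipped, because not every file has every column.
    """
    wanted = {
        "aggregate": "S",
        "charter": no_value,
        "dass": no_value,
        "category": "TA",
    }
    mask = pd.Series(True, index=df.index)
    for key, value in wanted.items():
        column = column_map.get(key)
        if column is not None:
            # strip() guards against padded values such as "S "
            mask &= df[column].astype(str).str.strip() == value
    return df[mask]


# Toy data standing in for a CDE file (values are made up).
toy = pd.DataFrame(
    {
        "Aggregate Level": ["S", "S ", "D", "S"],
        "Charter School": ["No", "Yes", "No", "No"],
        "Reporting Category": ["TA", "TA", "TA", "RB"],
    }
)
filtered = filter_schools(
    toy,
    {
        "aggregate": "Aggregate Level",
        "charter": "Charter School",
        "category": "Reporting Category",
    },
)
print(filtered)
```

For files that code the flags as "N" instead of "No", the same call would pass `no_value="N"`.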


In [12]:
# import libraries
import importlib.util  # find_spec lives in the importlib.util submodule
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import subprocess
import sys

from pathlib import Path

In [13]:
# import other libraries
from helper import (
    load_cde_txt,
    clean_calschls_safety,
    clean_safety_by_connectedness,
    clean_columns,
)

# check if jcds library is installed
package_name = "jcds"

if importlib.util.find_spec(package_name) is None:
    print(f" '{package_name}' not found. Installing from Github... ")
    # pip needs the "git+" prefix to install directly from a Git repository
    subprocess.check_call(
        [
            sys.executable,
            "-m",
            "pip",
            "install",
            "git+https://github.com/junclemente/jcds.git",
        ]
    )
else:
    print(f" '{package_name}' is already installed.")

from jcds import eda as jeda
from jcds import reports as jrep

 'jcds' is already installed.


In [14]:
# main data folder path
data_folder = Path("../data")

# subfolder paths
ca_doe = data_folder / "public_data/ca_doe"
cde = data_folder / "public_data/cde"
ca_schls = data_folder / "public_data/ca_schls"

raw_pickle = data_folder / "raw_pickle"
# make sure the output folder exists before any to_pickle call
raw_pickle.mkdir(parents=True, exist_ok=True)

# CA Dept of Education

## Adjusted Cohort Graduation Rate and Outcome Data (ACGR)

**Adjusted Cohort Graduation Rate and Outcome Data**
Four-year Adjusted Cohort Graduation Rate (ACGR) and Outcome data reported by race/ethnicity, student group, and gender.  
Source: [https://www.cde.ca.gov/ds/ad/filesacgr.asp](https://www.cde.ca.gov/ds/ad/filesacgr.asp)

**Note:** To protect student privacy, data are suppressed (\*) on the data file if the cell size within a selected student population (cohort students) is 10 or less. Additionally, the “Not Reported” race/ethnicity is suppressed, regardless of actual cell size, if the student population for one or more other race/ethnicity groups is suppressed.

[Data Dictionary: ACGR](https://www.cde.ca.gov/ds/ad/fsacgr.asp)
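
Because suppressed cells arrive as the string `*`, the count and rate columns cannot be treated as numbers until they are converted. The sketch below shows one way to do that with `pd.to_numeric`. The toy rows and the choice of columns are illustrative assumptions, not output from the actual file.

```python
import pandas as pd

# Toy rows mimicking the ACGR layout; "*" marks a suppressed cell.
toy = pd.DataFrame(
    {
        "SchoolName": ["School A", "School B", "School C"],
        "CohortStudents": ["394", "*", "43"],
        "Regular HS Diploma Graduates (Rate)": ["92.4", "*", "100.0"],
    }
)

numeric_cols = ["CohortStudents", "Regular HS Diploma Graduates (Rate)"]
cleaned = toy.copy()
for col in numeric_cols:
    # errors="coerce" turns "*" (and any other non-numeric text) into NaN
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")

# Flag rows where at least one selected column was suppressed.
suppressed = cleaned[numeric_cols].isna().any(axis=1)
print(cleaned[~suppressed])
```

Whether suppressed rows should be dropped or kept as missing values is a modeling decision that this notebook leaves to later steps.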

In [15]:
# load raw dataset
df_raw = load_cde_txt(ca_doe / "acgr21.txt")

# filter dataset
df_acgr = df_raw[
    (df_raw["AggregateLevel"].str.strip() == "S")
    & (df_raw["CharterSchool"].str.strip() == "No")
    & (df_raw["DASS"].str.strip() == "No")
    & (df_raw["ReportingCategory"].str.strip() == "TA")
]

df_acgr.to_pickle(raw_pickle / "raw_acgr.pkl")

# get column list
df_acgr.columns.to_list()

['AcademicYear',
 'AggregateLevel',
 'CountyCode',
 'DistrictCode',
 'SchoolCode',
 'CountyName',
 'DistrictName',
 'SchoolName',
 'CharterSchool',
 'DASS',
 'ReportingCategory',
 'CohortStudents',
 'Regular HS Diploma Graduates (Count)',
 'Regular HS Diploma Graduates (Rate)',
 "Met UC/CSU Grad Req's (Count)",
 "Met UC/CSU Grad Req's (Rate)",
 'Seal of Biliteracy (Count)',
 'Seal of Biliteracy (Rate)',
 'Golden State Seal Merit Diploma (Count)',
 'Golden State Seal Merit Diploma (Rate',
 'CHSPE Completer (Count)',
 'CHSPE Completer (Rate)',
 'Adult Ed. HS Diploma (Count)',
 'Adult Ed. HS Diploma (Rate)',
 'SPED Certificate (Count)',
 'SPED Certificate (Rate)',
 'GED Completer (Count)',
 'GED Completer (Rate)',
 'Other Transfer (Count)',
 'Other Transfer (Rate)',
 'Dropout (Count)',
 'Dropout (Rate)',
 'Still Enrolled (Count)',
 'Still Enrolled (Rate)']

In [16]:
# select columns
cols_acgr = [
    "AcademicYear",
    "AggregateLevel",
    "CountyCode",
    "DistrictCode",
    "SchoolCode",
    "CountyName",
    "DistrictName",
    "SchoolName",
    "CharterSchool",
    "DASS",
    "ReportingCategory",
    "CohortStudents",  # QA and weighting
    # "Regular HS Diploma Graduates (Count)",
    "Regular HS Diploma Graduates (Rate)",  # target variable
    # "Met UC/CSU Grad Req's (Count)",
    "Met UC/CSU Grad Req's (Rate)",  # academic readiness/intensity feature
    # "Seal of Biliteracy (Count)",
    "Seal of Biliteracy (Rate)",  # ??? language proficiency
    # "Golden State Seal Merit Diploma (Count)",
    # "Golden State Seal Merit Diploma (Rate",
    # "CHSPE Completer (Count)",
    # "CHSPE Completer (Rate)",
    # "Adult Ed. HS Diploma (Count)",
    # "Adult Ed. HS Diploma (Rate)",
    # "SPED Certificate (Count)",
    # "SPED Certificate (Rate)",
    # "GED Completer (Count)",
    # "GED Completer (Rate)",
    # "Other Transfer (Count)",
    # "Other Transfer (Rate)",
    # "Dropout (Count)",
    "Dropout (Rate)",  # secondary target
    # "Still Enrolled (Count)",
    "Still Enrolled (Rate)",  # 5th year senior
]

df_acgr[cols_acgr]

Unnamed: 0,AcademicYear,AggregateLevel,CountyCode,DistrictCode,SchoolCode,CountyName,DistrictName,SchoolName,CharterSchool,DASS,ReportingCategory,CohortStudents,Regular HS Diploma Graduates (Rate),Met UC/CSU Grad Req's (Rate),Seal of Biliteracy (Rate),Dropout (Rate),Still Enrolled (Rate)
66594,2020-21,S,01,31609,0131755,Alameda,California School for the Blind (State Special...,California School for the Blind,No,No,TA,11,0.0,0.0,0.0,63.6,0.0
66654,2020-21,S,01,31617,0131763,Alameda,California School for the Deaf-Fremont (State ...,California School for the Deaf-Fremont,No,No,TA,38,63.2,0.0,33.3,2.6,28.9
66718,2020-21,S,01,61119,0000001,Alameda,Alameda Unified,"Nonpublic, Nonsectarian Schools",No,No,TA,*,*,*,*,*,*
66782,2020-21,S,01,61119,0106401,Alameda,Alameda Unified,Alameda Science and Technology Institute,No,No,TA,43,100.0,95.3,2.3,0.0,0.0
66910,2020-21,S,01,61119,0130229,Alameda,Alameda Unified,Alameda High,No,No,TA,394,92.4,73.9,22.8,2.3,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
254262,2020-21,S,58,72736,0000000,Yuba,Marysville Joint Unified,District Office,No,No,TA,69,56.5,0.0,0.0,43.5,0.0
254314,2020-21,S,58,72736,0000001,Yuba,Marysville Joint Unified,"Nonpublic, Nonsectarian Schools",No,No,TA,*,*,*,*,*,*
254426,2020-21,S,58,72736,5830013,Yuba,Marysville Joint Unified,Lindhurst High,No,No,TA,226,87.2,36.0,11.7,6.2,5.8
254630,2020-21,S,58,72736,5835202,Yuba,Marysville Joint Unified,Marysville High,No,No,TA,201,90.5,37.4,1.1,7.0,2.0


# CA Dept of Education - Data and Statistics

Downloadable files about California's K–12 educational system by topic area, including enrollment, assessment and accountability, English learners, foster youth, free or reduced-price meal, graduates and dropouts, and staff data.

All public raw data in this section can be downloaded from here:  
[Available Downloadable Data Files by Topic](https://www.cde.ca.gov/ds/ad/downloadabledata.asp)  

## Chronic Absenteeism Data

The Absenteeism Downloadable Files page provides access to data about student absenteeism, including chronic absenteeism and absenteeism by reason counts and rates, disaggregated by race/ethnicity, gender, student program group, and grade span.

[Data Dictionary: Chronic Absenteeism](https://www.cde.ca.gov/ds/ad/fsabd.asp)
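
As a quick sanity check, the reported rate can be recomputed from the count and the eligible cumulative enrollment. The sketch below uses three rows copied from the filtered output in this section and assumes the rate is `100 * count / eligible`, rounded to one decimal. That formula matches these rows but should be confirmed against the data dictionary.

```python
import pandas as pd

# Three rows copied from the filtered output shown in this section.
toy = pd.DataFrame(
    {
        "School Name": [
            "Alameda County Community",
            "Alameda County Juvenile Hall/Court",
            "California School for the Blind",
        ],
        "ChronicAbsenteeismEligibleCumula": [122, 107, 68],
        "ChronicAbsenteeismCount": [103, 66, 6],
        "ChronicAbsenteeismRate": [84.4, 61.7, 8.8],
    }
)

# Assumed formula: rate = 100 * count / eligible enrollment.
recomputed = (
    100 * toy["ChronicAbsenteeismCount"] / toy["ChronicAbsenteeismEligibleCumula"]
).round(1)

# Compare with the published rate, allowing for rounding noise.
matches = (recomputed - toy["ChronicAbsenteeismRate"]).abs() < 0.05
print(toy.assign(recomputed=recomputed, matches=matches))
```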


In [17]:
# load raw dataset and filter
# (this file has no DASS column, so the DASS rule cannot be applied here)
df_raw = load_cde_txt(cde / "chronicabsenteeism21.txt")

df_chron_abs = df_raw[
    (df_raw["Aggregate Level"].str.strip() == "S")
    & (df_raw["Charter School"].str.strip() == "No")
    & (df_raw["Reporting Category"].str.strip() == "TA")
]

# save
df_chron_abs.to_pickle(raw_pickle / "raw_chronic_absent.pkl")

df_chron_abs

Unnamed: 0,Academic Year,Aggregate Level,County Code,District Code,School Code,County Name,District Name,School Name,Charter School,Reporting Category,ChronicAbsenteeismEligibleCumula,ChronicAbsenteeismCount,ChronicAbsenteeismRate
57598,2020-21,S,01,10017,0130419,Alameda,Alameda County Office of Education,Alameda County Community,No,TA,122,103,84.4
57599,2020-21,S,01,10017,0130401,Alameda,Alameda County Office of Education,Alameda County Juvenile Hall/Court,No,TA,107,66,61.7
57621,2020-21,S,01,31609,0131755,Alameda,California School for the Blind (State Special...,California School for the Blind,No,TA,68,6,8.8
57644,2020-21,S,01,31617,0131763,Alameda,California School for the Deaf-Fremont (State ...,California School for the Deaf-Fremont,No,TA,329,38,11.6
58027,2020-21,S,01,61119,6090013,Alameda,Alameda Unified,Edison Elementary,No,TA,460,17,3.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
263006,2020-21,S,58,72751,6056832,Yuba,Wheatland,Lone Tree Elementary,No,TA,354,59,16.7
263008,2020-21,S,58,72751,6056840,Yuba,Wheatland,Wheatland Elementary,No,TA,314,85,27.1
263062,2020-21,S,58,72769,0123570,Yuba,Wheatland Union High,Wheatland Community Day High,No,TA,,,
263063,2020-21,S,58,72769,0133751,Yuba,Wheatland Union High,Edward P. Duplex,No,TA,82,82,100


### Absenteeism by Reason

This file contains the Absenteeism by Reason data, reported by race/ethnicity, gender, student group, and grade span. It provides "eligible" cumulative enrollment, the count of students with one or more absences, average days absent, and absences by reason.

[Data Dictionary: Absenteeism by Reason](https://www.cde.ca.gov/ds/ad/fsabr.asp)


In [18]:
# load raw dataset
df_raw = load_cde_txt(cde / "absenteeismreason22-v3.txt")

# filter dataset
df_abs = df_raw[
    (df_raw["Aggregate Level"].str.strip() == "S")
    & (df_raw["Charter School"].str.strip() == "No")
    & (df_raw["DASS"].str.strip() == "No")
    & (df_raw["Reporting Category"].str.strip() == "TA")
]

df_abs.to_pickle(raw_pickle / "raw_absent_reason.pkl")

# get column list
df_abs.columns.to_list()

['Academic Year',
 'Aggregate Level',
 'County Code',
 'District Code',
 'School Code',
 'County Name',
 'District Name',
 'School Name',
 'Charter School',
 'DASS',
 'Reporting Category',
 'Eligible Cumulative Enrollment',
 'Count of Students with One or More Absences',
 'Average Days Absent',
 'Total Days Absent',
 'Excused Absences (percent)',
 'Unexcused Absences (percent)',
 'Out-of-School Suspension Absences (percent)',
 'Incomplete Independent Study Absences (percent)',
 'Excused Absences (count)',
 'Unexcused Absences (count)',
 'Out-of-School Suspension Absences (count)',
 'Incomplete Independent Study Absences (count)']

In [19]:
# select columns
abs_cols = [
    "Academic Year",
    "Aggregate Level",
    "County Code",
    "District Code",
    "School Code",
    "County Name",
    "District Name",
    "School Name",
    "Charter School",
    "DASS",
    "Reporting Category",
    "Eligible Cumulative Enrollment",
    # "Count of Students with One or More Absences",
    # "Average Days Absent",
    # "Total Days Absent",
    # "Excused Absences (percent)",
    "Unexcused Absences (percent)",
    "Out-of-School Suspension Absences (percent)",
    # "Incomplete Independent Study Absences (percent)",
    # "Excused Absences (count)",
    # "Unexcused Absences (count)",
    # "Out-of-School Suspension Absences (count)",
    # "Incomplete Independent Study Absences (count)",
]

df_abs[abs_cols]

Unnamed: 0,Academic Year,Aggregate Level,County Code,District Code,School Code,County Name,District Name,School Name,Charter School,DASS,Reporting Category,Eligible Cumulative Enrollment,Unexcused Absences (percent),Out-of-School Suspension Absences (percent)
583,2021-22,S,01,31609,0131755,Alameda,California School for the Blind (State Special...,California School for the Blind,No,No,TA,67,46.9,0
608,2021-22,S,01,31617,0131763,Alameda,California School for the Deaf-Fremont (State ...,California School for the Deaf-Fremont,No,No,TA,329,41.8,1.8
628,2021-22,S,01,61119,0000000,Alameda,Alameda Unified,District Office,No,No,TA,22,0,0
647,2021-22,S,01,61119,0106401,Alameda,Alameda Unified,Alameda Science and Technology Institute,No,No,TA,170,20.2,0
670,2021-22,S,01,61119,0111765,Alameda,Alameda Unified,Ruby Bridges Elementary,No,No,TA,473,28.4,0.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227659,2021-22,S,58,72751,6056816,Yuba,Wheatland,Bear River,No,No,TA,586,43.9,0.2
227682,2021-22,S,58,72751,6056832,Yuba,Wheatland,Lone Tree Elementary,No,No,TA,390,29,0.1
227706,2021-22,S,58,72751,6056840,Yuba,Wheatland,Wheatland Elementary,No,No,TA,352,33.5,0
227743,2021-22,S,58,72769,0000000,Yuba,Wheatland Union High,District Office,No,No,TA,*,*,*


## Public Schools and Districts

[Public Schools and Districts](https://www.cde.ca.gov/ds/si/ds/pubschls.asp)  
The Public Schools and Districts Downloadable Files page provides access to data files containing general information about California's public schools and districts found in the California School Directory.


[Data Dictionary: Public Schools and Districts](https://www.cde.ca.gov/ds/si/ds/fspubschls.asp)
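
The directory file identifies schools by a single 14-character `CDSCode`, while files such as ACGR store county, district, and school codes in separate columns. In the rows shown in this notebook, `CDSCode` is the concatenation of the 2-digit county code, 5-digit district code, and 7-digit school code. The sketch below builds a join key that way; it assumes all codes are stored as strings with leading zeros intact, and the two toy frames reuse values from the outputs in this notebook.

```python
import pandas as pd

# Values copied from the directory and ACGR outputs in this notebook.
schools = pd.DataFrame(
    {
        "CDSCode": ["01611190130229", "58727365830013"],
        "School": ["Alameda High", "Lindhurst High"],
    }
)
acgr = pd.DataFrame(
    {
        "CountyCode": ["01", "58"],
        "DistrictCode": ["61119", "72736"],
        "SchoolCode": ["0130229", "5830013"],
        "Regular HS Diploma Graduates (Rate)": ["92.4", "87.2"],
    }
)

# county (2) + district (5) + school (7) = 14-character CDS code
acgr["CDSCode"] = acgr["CountyCode"] + acgr["DistrictCode"] + acgr["SchoolCode"]

merged = schools.merge(
    acgr[["CDSCode", "Regular HS Diploma Graduates (Rate)"]],
    on="CDSCode",
    how="left",
)
print(merged)
```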


In [35]:
df_raw = pd.read_excel(cde / "pubschls.xlsx", header=5)

df_schooldata = df_raw[
    (df_raw["StatusType"].str.strip() == "Active")
    & (df_raw["EdOpsCode"].str.strip() == "TRAD")
    & (df_raw["Charter"].str.strip() == "N")
    & (df_raw["EILCode"].str.strip().isin(["ELEMHIGH", "HS"]))
]

df_schooldata.to_pickle(raw_pickle / "raw_school_data.pkl")

df_schooldata

Unnamed: 0,CDSCode,NCESDist,NCESSchool,StatusType,County,District,School,Street,StreetAbr,City,...,Virtual,Magnet,YearRoundYN,FederalDFCDistrictID,Latitude,Longitude,AdmFName,AdmLName,LastUpDate,Multilingual
59,01611190130229,0601770,00041,Active,Alameda,Alameda Unified,Alameda High,2200 Central Avenue,2200 Central Ave.,Alameda,...,N,N,N,No Data,37.764958,-122.24593,Angela,Barrett,2024-09-30,N
63,01611190132142,0601770,00045,Active,Alameda,Alameda Unified,Encinal Junior/Senior High,210 Central Avenue,210 Central Ave.,Alameda,...,N,N,N,No Data,37.772765,-122.28900,Kirstin,Snyder,2024-09-30,N
91,01611270130450,0601860,00059,Active,Alameda,Albany City Unified,Albany High,603 Key Route Boulevard,603 Key Route Blvd.,Albany,...,N,N,N,No Data,37.896661,-122.29257,Darren,McNally,2023-02-09,N
113,01611430131177,0604740,00432,Active,Alameda,Berkeley Unified,Berkeley High,1980 Allston Way,1980 Allston Way,Berkeley,...,N,N,N,No Data,37.868913,-122.27120,Juan,Raygoza,2020-08-13,Y
153,01611500132225,0607800,00742,Active,Alameda,Castro Valley Unified,Castro Valley High,19400 Santa Maria Avenue,19400 Santa Maria Ave.,Castro Valley,...,N,N,N,No Data,37.705184,-122.07847,Christopher,Fortenberry,2023-02-09,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18291,57727100101162,0643080,11380,Active,Yolo,Woodland Joint Unified,Pioneer High,1400 Pioneer Avenue,1400 Pioneer Ave.,Woodland,...,N,N,N,No Data,38.657047,-121.74229,Heather,King,2025-08-18,Y
18298,57727105738802,0643080,07011,Active,Yolo,Woodland Joint Unified,Woodland Senior High,21 North West Street,21 North West St.,Woodland,...,N,N,N,No Data,38.683705,-121.78395,Gerald Salcido,Salcido,2024-08-08,N
18349,58727365830013,0624090,03630,Active,Yuba,Marysville Joint Unified,Lindhurst High,4446 Olive Drive,4446 Olive Dr.,Olivehurst,...,N,N,N,No Data,39.079263,-121.53070,Nohemi,Arroyo-Magaña,2025-07-22,N
18356,58727365835202,0624090,03633,Active,Yuba,Marysville Joint Unified,Marysville High,12 East 18th Street,12 East 18th St.,Marysville,...,N,N,N,No Data,39.155225,-121.58565,Joe,Seiler,2025-07-15,N


## Free or Reduced-Price Meal (Student Poverty)

The Free or Reduced-Price Meal Downloadable Files page provides access to data about students who are eligible for Free or Reduced-Price Meals (FRPM).

[Data Dictionary: FRPM](https://www.cde.ca.gov/ds/ad/fsspfrpm.asp)
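
In the FRPM output in this section, the county, district, and school codes appear as plain integers (for example `1` and `130229`), so their leading zeros are gone. Before these codes can be matched against the text-based CDE files, they likely need to be zero-padded back to 2, 5, and 7 characters. The sketch below shows one way to do that. It assumes the columns contain integers with no missing values, and the toy rows reuse codes from the output in this section.

```python
import pandas as pd

# Codes as they appear after reading the Excel file (leading zeros lost).
frpm_toy = pd.DataFrame(
    {
        "County Code": [1, 58],
        "District Code": [61119, 72769],
        "School Code": [130229, 133751],
    }
)

# Expected widths: county 2, district 5, school 7 characters.
widths = {"County Code": 2, "District Code": 5, "School Code": 7}

padded = frpm_toy.copy()
for col, width in widths.items():
    padded[col] = padded[col].astype(str).str.zfill(width)

# Optional: a 14-character key comparable to CDSCode in the directory file.
padded["CDSCode"] = (
    padded["County Code"] + padded["District Code"] + padded["School Code"]
)
print(padded)
```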


In [21]:
df_raw = pd.read_excel(
    cde / "frpm2122_v2.xlsx", sheet_name="FRPM School-Level Data ", header=1
)

df_raw = clean_columns(df_raw)

df_frpm = df_raw[df_raw["Charter School (Y/N)"].str.strip() == "N"]

df_frpm.to_pickle(raw_pickle / "raw_frpm.pkl")

df_frpm

Unnamed: 0,Academic Year,County Code,District Code,School Code,County Name,District Name,School Name,District Type,School Type,Educational Option Type,...,Free Meal Count (K-12),Percent (%) Eligible Free (K-12),FRPM Count (K-12),Percent (%) Eligible FRPM (K-12),Enrollment (Ages 5-17),Free Meal Count (Ages 5-17),Percent (%) Eligible Free (Ages 5-17),FRPM Count (Ages 5-17),Percent (%) Eligible FRPM (Ages 5-17),CALPADS Fall 1 Certification Status
0,2021-2022,1,10017,130419,Alameda,Alameda County Office of Education,Alameda County Community,County Office of Education (COE),County Community,County Community School,...,45,0.789474,47,0.824561,37,29,0.783784,31,0.837838,Y
1,2021-2022,1,10017,130401,Alameda,Alameda County Office of Education,Alameda County Juvenile Hall/Court,County Office of Education (COE),Juvenile Court Schools,Juvenile Court School,...,64,1.000000,64,1.000000,56,56,1.000000,56,1.000000,Y
14,2021-2022,1,31609,131755,Alameda,California School for the Blind (State Special...,California School for the Blind,State Special Schools,State Special Schools,State Special School,...,62,1.000000,62,1.000000,43,43,1.000000,43,1.000000,Y
15,2021-2022,1,31617,131763,Alameda,California School for the Deaf-Fremont (State ...,California School for the Deaf-Fremont,State Special Schools,State Special Schools,State Special School,...,318,1.000000,318,1.000000,263,263,1.000000,263,1.000000,Y
17,2021-2022,1,61119,130229,Alameda,Alameda Unified,Alameda High,Unified School District,High Schools (Public),Traditional,...,311,0.172013,327,0.180863,1743,293,0.168101,308,0.176707,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10550,2021-2022,58,72751,6056832,Yuba,Wheatland,Lone Tree Elementary,Elementary School District,Elementary Schools (Public),Traditional,...,68,0.193732,119,0.339031,338,66,0.195266,117,0.346154,Y
10552,2021-2022,58,72751,6056840,Yuba,Wheatland,Wheatland Elementary,Elementary School District,Elementary Schools (Public),Traditional,...,178,0.523529,189,0.555882,329,170,0.516717,181,0.550152,Y
10554,2021-2022,58,72769,133751,Yuba,Wheatland Union High,Edward P. Duplex,High School District,Continuation High Schools,Continuation School,...,32,0.711111,45,1.000000,28,16,0.571429,28,1.000000,Y
10556,2021-2022,58,72769,123570,Yuba,Wheatland Union High,Wheatland Community Day High,High School District,District Community Day Schools,Community Day School,...,3,0.600000,4,0.800000,5,3,0.600000,4,0.800000,Y


## CBEDS Data about Schools & Districts

Downloadable data files for information about schools and districts, including estimated number of teacher hires, work visa applications, home-to-school transportation, kindergarten program type, and educational calendar.    


[Data Dictionary: CBEDS](https://www.cde.ca.gov/ds/ad/fscbedsorab19.asp)
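
The CBEDS file is in long format: each row holds one `Description` and `Value` pair for a school. For modeling, one row per school is usually more convenient. The sketch below reshapes a few toy rows into wide format with `pivot_table`. Taking the first value per school and description is an assumption that only holds if each pair occurs once.

```python
import pandas as pd

# Toy rows in the same long layout as the CBEDS output in this section.
cbeds_toy = pd.DataFrame(
    {
        "Cdscode": ["01100170112607"] * 3 + ["58727695838305"] * 2,
        "Description": [
            "Visa Applications Submitted",
            "Visa Applications Granted",
            "Traditional",
            "Traditional",
            "Start Date",
        ],
        "Value": ["0", "0", "True", "True", "20210811"],
    }
)

# One row per school, one column per Description.
wide = cbeds_toy.pivot_table(
    index="Cdscode",
    columns="Description",
    values="Value",
    aggfunc="first",  # assumes one value per school/description pair
)
print(wide)
```

Descriptions that a school did not report end up as missing values in the wide table.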


In [22]:
df_raw = load_cde_txt(cde / "cbedsora21b.txt")

df_cbeds = df_raw[df_raw["Level"].str.strip() == "S"]

df_cbeds.to_pickle(raw_pickle / "raw_cbeds.pkl")

df_cbeds

Unnamed: 0,Cdscode,CountyName,DistrictName,SchoolName,Description,Level,Section,RowNumber,Value,Year
18,01100170112607,Alameda,Alameda County Office of Education,Envision Academy for Arts & Technology,Kindergarten None,S,B,4,True,2122
19,01100170112607,Alameda,Alameda County Office of Education,Envision Academy for Arts & Technology,Transitional Kindergarten None,S,B,8,True,2122
20,01100170112607,Alameda,Alameda County Office of Education,Envision Academy for Arts & Technology,Visa Applications Submitted,S,C,1,0,2122
21,01100170112607,Alameda,Alameda County Office of Education,Envision Academy for Arts & Technology,Visa Applications Granted,S,C,2,0,2122
22,01100170112607,Alameda,Alameda County Office of Education,Envision Academy for Arts & Technology,Traditional,S,D,1,True,2122
...,...,...,...,...,...,...,...,...,...,...
58706,58727695838305,Yuba,Wheatland Union High,Wheatland Union High,Kindergarten None,S,B,4,True,2122
58707,58727695838305,Yuba,Wheatland Union High,Wheatland Union High,Transitional Kindergarten None,S,B,8,True,2122
58708,58727695838305,Yuba,Wheatland Union High,Wheatland Union High,Traditional,S,D,1,True,2122
58709,58727695838305,Yuba,Wheatland Union High,Wheatland Union High,Start Date,S,D,4,20210811,2122


## Staff Data Files

The Staff Downloadable Files page provides access to data about certificated and classified staff demographic information, staff assignments, student/staff ratios, and estimated teacher hires.

### Student / Staff Ratio

[Data Dictionary: Student-Staff Ratio](https://www.cde.ca.gov/ds/ad/fsstrat.asp)


In [23]:
df_raw = load_cde_txt(cde / "strat2122.txt")


df_ss_ratio = df_raw[
    (df_raw["Aggregate Level"].str.strip() == "S")
    & (df_raw["Charter School"].str.strip() == "N")
    & (df_raw["DASS"].str.strip() == "N")
]

df_ss_ratio.to_pickle(raw_pickle / "raw_student_staff_ratio.pkl")

df_ss_ratio

Unnamed: 0,Academic Year,Aggregate Level,County Code,District Code,School Code,County Name,District Name,School Name,Charter School,DASS,School Grade Span,TOTAL_ENR_N,TCH_FTE_N,ADM_FTE_N,PSV_FTE_N,OTH_FTE_N,STU_TCH_RATIO,STU_ADM_RATIO,STU_PSV_RATIO,STU_OTH_RATIO
556,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,GS_K12,0,0.0,4.5,0.0,1.0,*,*,*,*
571,2021-22,S,01,31609,0131755,Alameda,California School for the Blind (State Special...,California School for the Blind,N,N,GS_K12,62,13.0,5.0,16.0,15.0,4.8,12.4,3.9,4.1
572,2021-22,S,01,31617,0000000,Alameda,California School for the Deaf-Fremont (State ...,District Office,N,N,GS_K12,0,0.0,8.0,12.5,13.0,*,*,*,*
573,2021-22,S,01,31617,0131763,Alameda,California School for the Deaf-Fremont (State ...,California School for the Deaf-Fremont,N,N,GS_K12,318,71.8,7.9,2.0,15.3,4.4,40.3,159,20.7
574,2021-22,S,01,61119,0000000,Alameda,Alameda Unified,District Office,N,N,GS_K12,0,0.0,15.0,24.2,13.7,*,*,*,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30235,2021-22,S,58,72751,6056816,Yuba,Wheatland,Bear River,N,N,GS_K12,568,30.0,2.0,3.2,0.0,18.9,284,180.3,*
30236,2021-22,S,58,72751,6056832,Yuba,Wheatland,Lone Tree Elementary,N,N,GS_K6,351,17.0,0.5,2.0,0.0,20.6,*,175.5,*
30237,2021-22,S,58,72751,6056840,Yuba,Wheatland,Wheatland Elementary,N,N,GS_K6,340,17.0,1.0,2.8,0.0,20,340,123.6,*
30239,2021-22,S,58,72769,0000000,Yuba,Wheatland Union High,District Office,N,N,GS_K12,3,0.8,1.0,0.1,0.0,*,3,*,*


### Staff Education

[Data Dictionary: Staff Education](https://www.cde.ca.gov/ds/ad/fssted.asp)


In [24]:
df_raw = load_cde_txt(cde / "sted2122.txt")

df_staff_ed = df_raw[
    (df_raw["Aggregate Level"].str.strip() == "S")
    & (df_raw["Charter School"].str.strip() == "N")
    & (df_raw["DASS"].str.strip() == "N")
]

df_staff_ed.to_pickle(raw_pickle / "raw_staff_edu.pkl")

df_staff_ed

Unnamed: 0,Academic Year,Aggregate Level,County Code,District Code,School Code,County Name,District Name,School Name,Charter School,DASS,...,Staff Gender,Total Staff Count,Associate,Baccalaureate,Baccalaureate Plus,Master,Master Plus,Doctorate,Special (Juris Doctor),None
7395,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,...,ALL,5,0,0,0,0,4,1,0,0
7396,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,...,GF,5,0,0,0,0,4,1,0,0
7397,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,...,ALL,6,0,0,0,0,5,1,0,0
7398,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,...,GF,6,0,0,0,0,5,1,0,0
7399,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,...,ALL,1,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
360896,2021-22,S,58,10587,0000000,Yuba,Yuba County Office of Education,District Office,N,N,...,ALL,11,0,2,1,7,0,1,0,0
360897,2021-22,S,58,10587,0000000,Yuba,Yuba County Office of Education,District Office,N,N,...,GF,9,0,2,1,6,0,0,0,0
360898,2021-22,S,58,10587,0000000,Yuba,Yuba County Office of Education,District Office,N,N,...,GM,2,0,0,0,1,0,1,0,0
360899,2021-22,S,58,10587,0000000,Yuba,Yuba County Office of Education,District Office,N,N,...,ALL,1,0,0,1,0,0,0,0,0


### Staff Experience

[Data Dictionary: Staff Experience](https://www.cde.ca.gov/ds/ad/fsstex.asp)
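
The staff files contain several rows per school, one for each combination of staff type and staff gender (and possibly grade span). To get a single summary row per school, one option is to keep only the rows where both fields are `ALL`. This sketch assumes that `ALL` denotes the overall total, which should be confirmed in the data dictionary. The toy rows follow the layout of the output in this section.

```python
import pandas as pd

# Toy rows following the layout of the staff experience output.
staff_toy = pd.DataFrame(
    {
        "School Code": ["0000000"] * 4,
        "Staff Type": ["ADM", "ALL", "ALL", "OTH"],
        "Staff Gender": ["ALL", "ALL", "GF", "ALL"],
        "Total Staff Count": [5, 6, 6, 1],
        "Average Total Years Experience": [20.6, 20.0, 20.0, 17.0],
    }
)

# Keep only the overall row: all staff types, all genders.
overall = staff_toy[
    (staff_toy["Staff Type"].str.strip() == "ALL")
    & (staff_toy["Staff Gender"].str.strip() == "ALL")
]
print(overall)
```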


In [25]:
df_raw = load_cde_txt(cde / "stex2122.txt")

df_staff_xp = df_raw[
    (df_raw["Aggregate Level"].str.strip() == "S")
    & (df_raw["Charter School"].str.strip() == "N")
    & (df_raw["DASS"].str.strip() == "N")
]

df_staff_xp.to_pickle(raw_pickle / "raw_staff_exp.pkl")

df_staff_xp

Unnamed: 0,Academic Year,Aggregate Level,County Code,District Code,School Code,County Name,District Name,School Name,Charter School,DASS,Staff Type,School Grade Span,Staff Gender,Total Staff Count,Average Total Years Experience,Average District Years Experience,Experienced,Inexperienced,First Year,Second Year
7395,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,ADM,GS_K12,ALL,5,20.6,9.2,5,0,0,0
7396,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,ADM,GS_K12,GF,5,20.6,9.2,5,0,0,0
7397,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,ALL,GS_K12,ALL,6,20.0,10.2,6,0,0,0
7398,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,ALL,GS_K12,GF,6,20.0,10.2,6,0,0,0
7399,2021-22,S,01,10017,0000000,Alameda,Alameda County Office of Education,District Office,N,N,OTH,GS_K12,ALL,1,17.0,15.0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
360896,2021-22,S,58,10587,0000000,Yuba,Yuba County Office of Education,District Office,N,N,ALL,GS_K12,ALL,11,21.6,14.6,11,0,0,0
360897,2021-22,S,58,10587,0000000,Yuba,Yuba County Office of Education,District Office,N,N,ALL,GS_K12,GF,9,22.8,16.0,9,0,0,0
360898,2021-22,S,58,10587,0000000,Yuba,Yuba County Office of Education,District Office,N,N,ALL,GS_K12,GM,2,16.5,8.5,2,0,0,0
360899,2021-22,S,58,10587,0000000,Yuba,Yuba County Office of Education,District Office,N,N,OTH,GS_K12,ALL,1,16.0,15.0,1,0,0,0


### Enrollment by School

[Data Dictionary: Enrollment by School](https://www.cde.ca.gov/ds/ad/fsenrps.asp)
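
The enrollment file has one row per school, race/ethnicity code, and gender, and it covers more than one academic year. School-level totals therefore have to be summed across those rows. The sketch below does this with `groupby` on four rows copied from the output in this section. Which columns to aggregate is an illustrative choice.

```python
import pandas as pd

# Four rows copied from the enrollment output in this section.
enroll_toy = pd.DataFrame(
    {
        "ACADEMIC_YEAR": ["2022-23"] * 4,
        "CDS_CODE": ["58727695838305"] * 4,
        "RACE_ETHNICITY": ["7", "7", "7", "9"],
        "GENDER": ["F", "M", "X", "F"],
        "GR_12": [43, 63, 0, 12],
        "ENR_TOTAL": [192, 252, 1, 52],
    }
)

# Sum across race/ethnicity and gender to get one row per school and year.
totals = enroll_toy.groupby(["ACADEMIC_YEAR", "CDS_CODE"], as_index=False)[
    ["GR_12", "ENR_TOTAL"]
].sum()
print(totals)
```

Note that the toy rows are only a subset of one school's records, so the printed totals are not that school's actual enrollment.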


In [26]:
df_raw = load_cde_txt(cde / "enr202022-v2.txt")

df_enroll = df_raw[df_raw["ENR_TYPE"] == "P"]

df_enroll.to_pickle(raw_pickle / "raw_school_enroll.pkl")

df_enroll

Unnamed: 0,ACADEMIC_YEAR,CDS_CODE,COUNTY,DISTRICT,SCHOOL,ENR_TYPE,RACE_ETHNICITY,GENDER,GR_KN,GR_1,...,GR_7,GR_8,UNGR_ELM,GR_9,GR_10,GR_11,GR_12,UNGR_SEC,ENR_TOTAL,ADULT
15,2020-21,01100170112607,ALAMEDA,Alameda County Office of Education,Envision Academy for Arts & Technology,P,0,F,0,0,...,3,0,0,1,0,1,0,0,6,0
16,2020-21,01100170112607,ALAMEDA,Alameda County Office of Education,Envision Academy for Arts & Technology,P,0,M,0,0,...,0,0,0,0,1,1,0,0,3,0
17,2020-21,01100170112607,ALAMEDA,Alameda County Office of Education,Envision Academy for Arts & Technology,P,1,F,0,0,...,0,0,0,1,1,0,1,0,3,0
18,2020-21,01100170112607,ALAMEDA,Alameda County Office of Education,Envision Academy for Arts & Technology,P,2,F,0,0,...,0,0,0,1,0,0,0,0,1,0
19,2020-21,01100170112607,ALAMEDA,Alameda County Office of Education,Envision Academy for Arts & Technology,P,2,M,0,0,...,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
793375,2022-23,58727695838305,YUBA,Wheatland Union High,Wheatland Union High,P,7,F,0,0,...,0,0,0,35,54,60,43,0,192,0
793376,2022-23,58727695838305,YUBA,Wheatland Union High,Wheatland Union High,P,7,M,0,0,...,0,0,0,61,67,61,63,0,252,0
793377,2022-23,58727695838305,YUBA,Wheatland Union High,Wheatland Union High,P,7,X,0,0,...,0,0,0,0,0,1,0,0,1,0
793378,2022-23,58727695838305,YUBA,Wheatland Union High,Wheatland Union High,P,9,F,0,0,...,0,0,0,16,10,14,12,0,52,0


## CA DOE School Climate, Health, and Learning Surveys

### Perception of Safety by Grade Level

Public raw data can be downloaded here:  
[https://calschls.org/reports-data/query-calschls/?ind=58](https://calschls.org/reports-data/query-calschls/?ind=58)

The safety and supportiveness of young people's school environments play a crucial role in their development and academic success. Students who feel safe and supported at school tend to have better emotional health and are less likely to engage in risky behaviors (1, 2). Exposure to violence in schools and school neighborhoods is associated with many negative outcomes for youth, including poor academic performance, truancy, substance use, violent behavior, depression-related feelings, and suicidal thoughts and behaviors (1, 3). Experiencing violence during childhood or adolescence also increases the likelihood of long-term physical, behavioral, and mental health problems in adulthood (1). Further, school violence not only affects the individuals involved but also can adversely impact teachers, bystanders, and surrounding communities (3).

Unfortunately, school safety is often compromised. According to a 2019 survey, nearly half (44%) of U.S. high school students had one or more violent experiences in the previous year, such as bullying, physical fighting, being threatened with a weapon at school, dating violence, or sexual violence (1). Females and LGBTQ students were significantly more likely to experience multiple types of violence when compared with males and heterosexual students, respectively (1). In addition, studies show that reports of hate crimes and mass casualty events in schools have increased in recent years (3, 4).


In [27]:
df_raw = pd.read_excel(
    ca_schls / "Kidsdata-Perceptions-of-School-Safety--by-Grade-Level--2017.xls",
    header=None,
)

df_safety = clean_calschls_safety(df_raw)

df_safety.to_pickle(raw_pickle / "raw_safety_percept_grade.pkl")

df_safety

Unnamed: 0,geography,geo_type,grade,very_safe_pct,safe_pct,neither_pct,unsafe_pct,very_unsafe_pct,years,level_of_safety_filter
0,California,State,9,0.128,0.420,0.364,0.053,0.035,2017-2019,All
1,California,State,11,0.134,0.403,0.373,0.055,0.036,2017-2019,All
2,Alameda County,County,9,0.132,0.459,0.341,0.044,0.023,2017-2019,All
3,Alameda County,County,11,0.145,0.423,0.351,0.051,0.029,2017-2019,All
4,Amador County,County,9,0.153,0.403,0.374,0.048,0.021,2017-2019,All
...,...,...,...,...,...,...,...,...,...,...
109,Ventura County,County,11,0.162,0.420,0.335,0.050,0.033,2017-2019,All
110,Yolo County,County,9,0.139,0.424,0.371,0.043,0.023,2017-2019,All
111,Yolo County,County,11,0.162,0.437,0.342,0.034,0.025,2017-2019,All
112,Yuba County,County,9,0.075,0.415,0.359,0.097,0.055,2017-2019,All


### Perception of Safety by School Connectedness

Public raw data can be downloaded here:  
[https://calschls.org/reports-data/query-calschls/?ind=60](https://calschls.org/reports-data/query-calschls/?ind=60)
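
The cleaned table for this indicator includes a `Safety_Positive` column. The implementation of `clean_safety_by_connectedness` is not shown in this notebook, but in the rows displayed in the output the column equals `Very Safe` plus `Safe`. The sketch below checks that relationship on the three California rows copied from the output; treat the formula as an observation about these rows, not a confirmed definition.

```python
import pandas as pd

# California rows copied from the connectedness output in this section.
connected_toy = pd.DataFrame(
    {
        "Geography": ["California"] * 3,
        "Connectedness": ["High", "Medium", "Low"],
        "Very Safe": [0.268, 0.052, 0.069],
        "Safe": [0.559, 0.334, 0.111],
        "Safety_Positive": [0.827, 0.386, 0.180],
    }
)

# Observed relationship: Safety_Positive = Very Safe + Safe
recomputed = (connected_toy["Very Safe"] + connected_toy["Safe"]).round(3)
consistent = (recomputed - connected_toy["Safety_Positive"]).abs() < 1e-6
print(connected_toy.assign(recomputed=recomputed, consistent=consistent))
```

The same sum (`very_safe_pct + safe_pct`) could be added to the grade-level table if a comparable positive-safety measure is needed there.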


In [28]:
df_raw = pd.read_excel(
    ca_schls / "Kidsdata-Perceptions-of-School-Safety--by-Level-of-School-C.xls",
    header=None,
)

df_connected = clean_safety_by_connectedness(df_raw)

df_connected.to_pickle(raw_pickle / "raw_safety_connect.pkl")

df_connected

Unnamed: 0,Geography,Connectedness,Very Safe,Safe,Neither Safe nor Unsafe,Unsafe,Very Unsafe,Safety_Positive
0,California,High,0.268,0.559,0.157,0.011,0.005,0.827
1,California,Medium,0.052,0.334,0.520,0.065,0.028,0.386
2,California,Low,0.069,0.111,0.428,0.196,0.196,0.180
3,Alameda County,High,0.268,0.582,0.138,0.009,0.004,0.850
4,Alameda County,Medium,0.060,0.370,0.494,0.057,0.020,0.430
...,...,...,...,...,...,...,...,...
172,Yolo County,Medium,0.064,0.385,0.475,0.053,0.022,0.449
173,Yolo County,Low,0.083,0.136,0.456,0.163,0.162,0.219
174,Yuba County,High,0.234,0.584,0.160,0.015,0.007,0.818
175,Yuba County,Medium,0.036,0.331,0.498,0.086,0.049,0.367
