# Data Preparation

### Filtering Strategy

**Aggregate Level:** "S". Focuses analysis on individual schools.  
**CharterSchool:** "No" or "N". Excludes charter schools to focus on traditional public high schools.   
**DASS:** "No" or "N". Removes alternative/continuation programs so graduation rates reflect typical comprehensive high schools.   
**ReportingCategory:** "TA". Keeps aggregate totals for each school (not broken down by subgroup) to simplify modeling.


In [1]:
# import libraries
import importlib
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import subprocess
import sys

from pathlib import Path

In [2]:
# import other libraries
from helper import (
    rpkl, # build_cdscode
)

# check if jcds library is installed
package_name = "jcds"

if importlib.util.find_spec(package_name) is None:
    print(f" '{package_name}' not found. Installing from Github... ")
    subprocess.check_call(
        [
            sys.executable,
            "-m",
            "pip",
            "install",
            "https://github.com/junclemente/jcds.git",
        ]
    )
else:
    print(f" '{package_name}' is already installed.")

from jcds import eda as jeda
from jcds import reports as jrep

 'jcds' is already installed.


In [3]:
# main data folder path
data_folder = Path("../data")
raw_pickle = Path(data_folder / "raw_pickle")

## Public Schools and Districts

Get all unique schools that only teach grades 9 - 12. 
`cdscode` is a unique code for each school that depicts county, district, and school ID numbers. 

[Data Dictionary: Public Schools and Districts](https://www.cde.ca.gov/ds/si/ds/fspubschls.asp)

In [4]:
df_schooldata = rpkl(raw_pickle, "raw_school_data.pkl")


ℹ️ 'cdscode' already exists — skipping creation

📁 Columns in raw_school_data.pkl:
['cdscode', 'ncesdist', 'ncesschool', 'statustype', 'county', 'district', 'school', 'street', 'streetabr', 'city', 'zip', 'state', 'mailstreet', 'mailstrabr', 'mailcity', 'mailzip', 'mailstate', 'phone', 'ext', 'faxnumber', 'website', 'opendate', 'closeddate', 'charter', 'charternum', 'fundingtype', 'doc', 'doctype', 'soc', 'soctype', 'edopscode', 'edopsname', 'eilcode', 'eilname', 'gsoffered', 'gsserved', 'virtual', 'magnet', 'yearroundyn', 'federaldfcdistrictid', 'latitude', 'longitude', 'admfname', 'admlname', 'lastupdate', 'multilingual']


In [5]:
cols_schooldata = ['cdscode',
#  'ncesdist',
#  'ncesschool',
#  'statustype',
#  'county',
#  'district',
#  'school',
#  'street',
#  'streetabr',
#  'city',
#  'zip',
#  'state',
#  'mailstreet',
#  'mailstrabr',
#  'mailcity',
#  'mailzip',
#  'mailstate',
#  'phone',
#  'ext',
#  'faxnumber',
#  'website',
 'opendate',
#  'closeddate',
 'charter',
#  'charternum',
#  'fundingtype',
#  'doc',
 'doctype',
#  'soc',
 'soctype',
 'edopscode',
#  'edopsname',
 'eilcode',
#  'eilname',
#  'gsoffered',
#  'gsserved',
 'virtual',
 'magnet',
 'yearroundyn',
#  'federaldfcdistrictid',
 'latitude',
 'longitude',
#  'admfname',
#  'admlname',
#  'lastupdate',
 'multilingual']


df_schooldata = df_schooldata[cols_schooldata]
df_schooldata.head()

Unnamed: 0,cdscode,opendate,charter,doctype,soctype,edopscode,eilcode,virtual,magnet,yearroundyn,latitude,longitude,multilingual
59,1611190130229,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.764958,-122.24593,N
91,1611270130450,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.896661,-122.29257,N
113,1611430131177,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.868913,-122.2712,Y
153,1611500132225,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.705184,-122.07847,N
154,1611500133876,2016-07-01 00:00:00,N,Unified School District,K-12 Schools (Public),TRAD,HS,V,N,N,37.713501,-122.09222,N


In [6]:
# check if list has uniuqe cdscode
df_schooldata["cdscode"].is_unique
df_schooldata.head()

Unnamed: 0,cdscode,opendate,charter,doctype,soctype,edopscode,eilcode,virtual,magnet,yearroundyn,latitude,longitude,multilingual
59,1611190130229,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.764958,-122.24593,N
91,1611270130450,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.896661,-122.29257,N
113,1611430131177,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.868913,-122.2712,Y
153,1611500132225,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.705184,-122.07847,N
154,1611500133876,2016-07-01 00:00:00,N,Unified School District,K-12 Schools (Public),TRAD,HS,V,N,N,37.713501,-122.09222,N


In [7]:
jrep.data_info(df_schooldata, show_columns=True)


SHAPE:
There are 1067 rows and 13 columns (0.85 MB).

DUPLICATES:
There are 0 duplicated rows.

COLUMNS/VARIABLES:
Column dType Summary:
 * object: 13
There are 0 numerical (int/float/bool) variables.
 * Columns: []
There are 13 categorical (nominal/ordinal) variables.
 * Columns: ['cdscode', 'opendate', 'charter', 'doctype', 'soctype', 'edopscode', 'eilcode', 'virtual', 'magnet', 'yearroundyn', 'latitude', 'longitude', 'multilingual']

DATETIME COLUMNS:
There are 0 datetime variables and 1 possible datetime variables.

OTHER COLUMN/VARIABLE INFO:
ID Like Columns (threshold = 95.0%): 3
Columns with mixed datatypes: 1
 * Columns: ['opendate']


# California: Department of Education

## Adjusted Cohort Graduation Rate and Outcome Data (ACGR)

**Adjusted Cohort Graduation Rate and Outcome Data**
Four-year Adjusted Cohort Graduation Rate (ACGR) and Outcome data reported by race/ethnicity, student group, and gender.  

[Data Dictionary: ACGR](https://www.cde.ca.gov/ds/ad/fsacgr.asp)

In [8]:
df_acgr = rpkl(raw_pickle, "raw_acgr.pkl")

✅ Added 'cdscode' using: countycode, districtcode, schoolcode

📁 Columns in raw_acgr.pkl:
['academicyear', 'aggregatelevel', 'countycode', 'districtcode', 'schoolcode', 'countyname', 'districtname', 'schoolname', 'charterschool', 'dass', 'reportingcategory', 'cohortstudents', 'regular_hs_diploma_graduates_count', 'regular_hs_diploma_graduates_rate', 'met_uccsu_grad_reqs_count', 'met_uccsu_grad_reqs_rate', 'seal_of_biliteracy_count', 'seal_of_biliteracy_rate', 'golden_state_seal_merit_diploma_count', 'golden_state_seal_merit_diploma_rate', 'chspe_completer_count', 'chspe_completer_rate', 'adult_ed_hs_diploma_count', 'adult_ed_hs_diploma_rate', 'sped_certificate_count', 'sped_certificate_rate', 'ged_completer_count', 'ged_completer_rate', 'other_transfer_count', 'other_transfer_rate', 'dropout_count', 'dropout_rate', 'still_enrolled_count', 'still_enrolled_rate', 'cdscode']


In [9]:
# select columns

cols_acgr = [
 'cdscode',
#  'academicyear',
#  'aggregatelevel',
#  'countycode',
#  'districtcode',
#  'schoolcode',
#  'countyname',
#  'districtname',
#  'schoolname',
#  'charterschool',
#  'dass',
#  'reportingcategory',
 'cohortstudents', # QA for weighing
#  'regular_hs_diploma_graduates_count',
 'regular_hs_diploma_graduates_rate', # target variable
#  'met_uccsu_grad_reqs_count',
 'met_uccsu_grad_reqs_rate',
#  'seal_of_biliteracy_count',
 'seal_of_biliteracy_rate', # language proficiency ???
#  'golden_state_seal_merit_diploma_count',
#  'golden_state_seal_merit_diploma_rate',
#  'chspe_completer_count',
#  'chspe_completer_rate',
#  'adult_ed_hs_diploma_count',
#  'adult_ed_hs_diploma_rate',
#  'sped_certificate_count',
#  'sped_certificate_rate',
#  'ged_completer_count',
#  'ged_completer_rate',
#  'other_transfer_count',
#  'other_transfer_rate',
#  'dropout_count',
 'dropout_rate', # secondary target
#  'still_enrolled_count',
 'still_enrolled_rate' # 5th year senior
 ]


df_acgr = df_acgr[cols_acgr]
df_acgr

Unnamed: 0,cdscode,cohortstudents,regular_hs_diploma_graduates_rate,met_uccsu_grad_reqs_rate,seal_of_biliteracy_rate,dropout_rate,still_enrolled_rate
66594,01316090131755,11,0.0,0.0,0.0,63.6,0.0
66654,01316170131763,38,63.2,0.0,33.3,2.6,28.9
66718,01611190000001,*,*,*,*,*,*
66782,01611190106401,43,100.0,95.3,2.3,0.0,0.0
66910,01611190130229,394,92.4,73.9,22.8,2.3,1.0
...,...,...,...,...,...,...,...
254262,58727360000000,69,56.5,0.0,0.0,43.5,0.0
254314,58727360000001,*,*,*,*,*,*
254426,58727365830013,226,87.2,36.0,11.7,6.2,5.8
254630,58727365835202,201,90.5,37.4,1.1,7.0,2.0


In [10]:
df_acgr["cdscode"].is_unique
# df_acgr

True

In [11]:
df_combined = df_schooldata.merge(
    df_acgr,
    on="cdscode",
    how="left"
)

df_combined.head()

Unnamed: 0,cdscode,opendate,charter,doctype,soctype,edopscode,eilcode,virtual,magnet,yearroundyn,latitude,longitude,multilingual,cohortstudents,regular_hs_diploma_graduates_rate,met_uccsu_grad_reqs_rate,seal_of_biliteracy_rate,dropout_rate,still_enrolled_rate
0,1611190130229,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.764958,-122.24593,N,394,92.4,73.9,22.8,2.3,1.0
1,1611270130450,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.896661,-122.29257,N,284,95.1,67.8,21.5,3.5,0.0
2,1611430131177,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.868913,-122.2712,Y,861,90.5,62.3,12.1,8.1,0.8
3,1611500132225,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.705184,-122.07847,N,672,96.4,72.8,25.0,2.2,0.0
4,1611500133876,2016-07-01 00:00:00,N,Unified School District,K-12 Schools (Public),TRAD,HS,V,N,N,37.713501,-122.09222,N,31,100.0,64.5,6.5,0.0,0.0


## Absenteeism

### Chronic Absenteeism Data

[Data Dictionary: Chronic Absenteeism](https://www.cde.ca.gov/ds/ad/fsabd.asp)


In [12]:
# load raw dataset and filter
df_chron_abs = rpkl(raw_pickle, "raw_chronic_absent.pkl")


✅ Added 'cdscode' using: county_code, district_code, school_code

📁 Columns in raw_chronic_absent.pkl:
['academic_year', 'aggregate_level', 'county_code', 'district_code', 'school_code', 'county_name', 'district_name', 'school_name', 'charter_school', 'reporting_category', 'chronicabsenteeismeligiblecumula', 'chronicabsenteeismcount', 'chronicabsenteeismrate', 'cdscode']


In [13]:
# select columns
chron_abs_cols = [
#  'academic_year',
#  'aggregate_level',
#  'county_code',
#  'district_code',
#  'school_code',
#  'county_name',
#  'district_name',
#  'school_name',
#  'charter_school',
#  'reporting_category',
#  'chronicabsenteeismeligiblecumula',
#  'chronicabsenteeismcount',
 'chronicabsenteeismrate',
 'cdscode']

df_chron_abs = df_chron_abs[chron_abs_cols]
df_chron_abs.head()

Unnamed: 0,chronicabsenteeismrate,cdscode
57598,84.4,1100170130419
57599,61.7,1100170130401
57621,8.8,1316090131755
57644,11.6,1316170131763
58027,3.7,1611196090013


In [14]:
df_combined = df_combined.merge(
    df_chron_abs,
    on="cdscode", 
    how="left"
)

df_combined

Unnamed: 0,cdscode,opendate,charter,doctype,soctype,edopscode,eilcode,virtual,magnet,yearroundyn,latitude,longitude,multilingual,cohortstudents,regular_hs_diploma_graduates_rate,met_uccsu_grad_reqs_rate,seal_of_biliteracy_rate,dropout_rate,still_enrolled_rate,chronicabsenteeismrate
0,01611190130229,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.764958,-122.24593,N,394,92.4,73.9,22.8,2.3,1.0,12.7
1,01611270130450,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.896661,-122.29257,N,284,95.1,67.8,21.5,3.5,0.0,70.3
2,01611430131177,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.868913,-122.27120,Y,861,90.5,62.3,12.1,8.1,0.8,5.2
3,01611500132225,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,37.705184,-122.07847,N,672,96.4,72.8,25.0,2.2,0.0,3.5
4,01611500133876,2016-07-01 00:00:00,N,Unified School District,K-12 Schools (Public),TRAD,HS,V,N,N,37.713501,-122.09222,N,31,100.0,64.5,6.5,0.0,0.0,8.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1062,57727100101162,2003-09-02 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,38.657047,-121.74229,Y,327,96.0,54.5,23.6,1.8,0.0,3.9
1063,57727105738802,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,38.683705,-121.78395,N,277,96.0,42.5,25.2,2.5,0.0,13.2
1064,58727365830013,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,39.079263,-121.53070,N,226,87.2,36.0,11.7,6.2,5.8,24
1065,58727365835202,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,39.155225,-121.58565,N,201,90.5,37.4,1.1,7.0,2.0,21.1


### Absenteeism by Reason

[Data Dictionary: Absenteeism by Reason](https://www.cde.ca.gov/ds/ad/fsabr.asp)


In [15]:
df_abs = rpkl(raw_pickle, "raw_absent_reason.pkl")


✅ Added 'cdscode' using: county_code, district_code, school_code

📁 Columns in raw_absent_reason.pkl:
['academic_year', 'aggregate_level', 'county_code', 'district_code', 'school_code', 'county_name', 'district_name', 'school_name', 'charter_school', 'dass', 'reporting_category', 'eligible_cumulative_enrollment', 'count_of_students_with_one_or_more_absences', 'average_days_absent', 'total_days_absent', 'excused_absences_percent', 'unexcused_absences_percent', 'outofschool_suspension_absences_percent', 'incomplete_independent_study_absences_percent', 'excused_absences_count', 'unexcused_absences_count', 'outofschool_suspension_absences_count', 'incomplete_independent_study_absences_count', 'cdscode']


In [16]:
# select columns
abs_cols = [
#  'academic_year',
#  'aggregate_level',
#  'county_code',
#  'district_code',
#  'school_code',
#  'county_name',
#  'district_name',
#  'school_name',
#  'charter_school',
#  'dass',
#  'reporting_category',
 'eligible_cumulative_enrollment',
#  'count_of_students_with_one_or_more_absences',
#  'average_days_absent',
#  'total_days_absent',
#  'excused_absences_percent',
 'unexcused_absences_percent',
 'outofschool_suspension_absences_percent',
#  'incomplete_independent_study_absences_percent',
#  'excused_absences_count',
#  'unexcused_absences_count',
#  'outofschool_suspension_absences_count',
#  'incomplete_independent_study_absences_count',
 'cdscode']


df_abs = df_abs[abs_cols]
df_abs.head()

Unnamed: 0,eligible_cumulative_enrollment,unexcused_absences_percent,outofschool_suspension_absences_percent,cdscode
583,67,46.9,0.0,1316090131755
608,329,41.8,1.8,1316170131763
628,22,0.0,0.0,1611190000000
647,170,20.2,0.0,1611190106401
670,473,28.4,0.2,1611190111765


In [17]:
df_combined = df_combined.merge(
    df_abs,
    on="cdscode", 
    how="left"
)

df_combined.head()

Unnamed: 0,cdscode,opendate,charter,doctype,soctype,edopscode,eilcode,virtual,magnet,yearroundyn,...,cohortstudents,regular_hs_diploma_graduates_rate,met_uccsu_grad_reqs_rate,seal_of_biliteracy_rate,dropout_rate,still_enrolled_rate,chronicabsenteeismrate,eligible_cumulative_enrollment,unexcused_absences_percent,outofschool_suspension_absences_percent
0,1611190130229,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,394,92.4,73.9,22.8,2.3,1.0,12.7,1841,23.5,0.5
1,1611270130450,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,284,95.1,67.8,21.5,3.5,0.0,70.3,1192,46.2,0.4
2,1611430131177,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,861,90.5,62.3,12.1,8.1,0.8,5.2,3281,24.1,0.0
3,1611500132225,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,672,96.4,72.8,25.0,2.2,0.0,3.5,2771,28.0,0.9
4,1611500133876,2016-07-01 00:00:00,N,Unified School District,K-12 Schools (Public),TRAD,HS,V,N,N,...,31,100.0,64.5,6.5,0.0,0.0,8.3,420,19.1,3.8


## Free or Reduced-Price Meal (Student Poverty)

[Data Dictionary: FRPM ](https://www.cde.ca.gov/ds/ad/fsspfrpm.asp)


In [18]:
df_frpm = rpkl(raw_pickle, "raw_frpm.pkl")



✅ Added 'cdscode' using: county_code, district_code, school_code

📁 Columns in raw_frpm.pkl:
['academic_year', 'county_code', 'district_code', 'school_code', 'county_name', 'district_name', 'school_name', 'district_type', 'school_type', 'educational_option_type', 'nslp_provision_status', 'charter_school_yn', 'charter_school_number', 'charter_funding_type', 'irc', 'low_grade', 'high_grade', 'enrollment_k12', 'free_meal_count_k12', 'percent__eligible_free_k12', 'frpm_count_k12', 'percent__eligible_frpm_k12', 'enrollment_ages_517', 'free_meal_count_ages_517', 'percent__eligible_free_ages_517', 'frpm_count_ages_517', 'percent__eligible_frpm_ages_517', 'calpads_fall_1_certification_status', 'cdscode']


In [19]:
cols_frpm = [
    # 'academic_year', 
    # 'county_code', 
    # 'district_code', 
    # 'school_code',
    # 'county_name', 
    # 'district_name', 
    # 'school_name', 
    # 'district_type', 
    # 'school_type', 
    # 'educational_option_type', 
    # 'nslp_provision_status', 
    # 'charter_school_yn', 
    # 'charter_school_number', 
    # 'charter_funding_type', 
    # 'irc', 
    # 'low_grade', 
    # 'high_grade', 
    # 'enrollment_k12', 
    # 'free_meal_count_k12', 
    'percent__eligible_free_k12', 
    'frpm_count_k12', 
    # 'percent__eligible_frpm_k12', 
    # 'enrollment_ages_517', 
    # 'free_meal_count_ages_517', 
    # 'percent__eligible_free_ages_517', 
    # 'frpm_count_ages_517', 
    # 'percent__eligible_frpm_ages_517', 
    'calpads_fall_1_certification_status', 
    'cdscode'
    ]


temp = [
    "Academic Year",
    "County Code",
    "District Code",
    "School Code",
    "County Name",
    "District Name",
    "School Name",
    "District Type",
    "School Type",
    "Educational Option Type",
    # "NSLP Provision Status",
    "Charter School (Y/N)",
    # "Charter School Number",
    # "Charter Funding Type",
    "IRC",
    # "Low Grade",
    # "High Grade",
    "Enrollment (K-12)",
    # "Free Meal Count (K-12)",
    "Percent (%) Eligible Free (K-12)",
    "FRPM Count (K-12)",
    "Percent (%) Eligible FRPM (K-12)",
    # "Enrollment (Ages 5-17)",
    # "Free Meal Count (Ages 5-17)",
    # "Percent (%) Eligible Free (Ages 5-17)",
    # "FRPM Count (Ages 5-17)",
    # "Percent (%) Eligible FRPM (Ages 5-17)",
    "CALPADS Fall 1 Certification Status",
    "cdscode"
]

df_frpm = df_frpm[cols_frpm]
df_frpm.head()

Unnamed: 0,percent__eligible_free_k12,frpm_count_k12,calpads_fall_1_certification_status,cdscode
0,0.789474,47,Y,1100170130419
1,1.0,64,Y,1100170130401
14,1.0,62,Y,1316090131755
15,1.0,318,Y,1316170131763
17,0.172013,327,Y,1611190130229


In [20]:
df_combined = df_combined.merge(
    df_frpm, 
    on="cdscode", 
    how="left"
)

df_combined.head()

Unnamed: 0,cdscode,opendate,charter,doctype,soctype,edopscode,eilcode,virtual,magnet,yearroundyn,...,seal_of_biliteracy_rate,dropout_rate,still_enrolled_rate,chronicabsenteeismrate,eligible_cumulative_enrollment,unexcused_absences_percent,outofschool_suspension_absences_percent,percent__eligible_free_k12,frpm_count_k12,calpads_fall_1_certification_status
0,1611190130229,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,22.8,2.3,1.0,12.7,1841,23.5,0.5,0.172013,327.0,Y
1,1611270130450,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,21.5,3.5,0.0,70.3,1192,46.2,0.4,0.174389,307.0,Y
2,1611430131177,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,12.1,8.1,0.8,5.2,3281,24.1,0.0,0.262259,935.0,Y
3,1611500132225,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,25.0,2.2,0.0,3.5,2771,28.0,0.9,0.166358,491.0,Y
4,1611500133876,2016-07-01 00:00:00,N,Unified School District,K-12 Schools (Public),TRAD,HS,V,N,N,...,6.5,0.0,0.0,8.3,420,19.1,3.8,0.27381,97.0,Y


## CBEDS Data about Schools & Districts

[Data Dictionary: CBEDS](https://www.cde.ca.gov/ds/ad/fscbedsorab19.asp)

In [21]:
df_cbeds = rpkl(raw_pickle, "raw_cbeds.pkl")

ℹ️ 'cdscode' already exists — skipping creation

📁 Columns in raw_cbeds.pkl:
['cdscode', 'countyname', 'districtname', 'schoolname', 'description', 'level', 'section', 'rownumber', 'value', 'year']


In [22]:
col_cbeds = [
    'cdscode', 
    # 'countyname', 
    # 'districtname', 
    # 'schoolname', 
    # 'description', 
    'level', 
    'section', 
    'rownumber', 
    'value', 
    'year']

df_cbeds = df_cbeds[col_cbeds]
df_cbeds.head()

Unnamed: 0,cdscode,level,section,rownumber,value,year
18,1100170112607,S,B,4,True,2122
19,1100170112607,S,B,8,True,2122
20,1100170112607,S,C,1,0,2122
21,1100170112607,S,C,2,0,2122
22,1100170112607,S,D,1,True,2122


## Staff Data Files

### Student / Staff Ratio

[Data Dictionary: Student-Staff Ratio](https://www.cde.ca.gov/ds/ad/fsstrat.asp)

In [23]:
df_ss_ratio = rpkl(raw_pickle, "raw_student_staff_ratio.pkl")

✅ Added 'cdscode' using: county_code, district_code, school_code

📁 Columns in raw_student_staff_ratio.pkl:
['academic_year', 'aggregate_level', 'county_code', 'district_code', 'school_code', 'county_name', 'district_name', 'school_name', 'charter_school', 'dass', 'school_grade_span', 'total_enr_n', 'tch_fte_n', 'adm_fte_n', 'psv_fte_n', 'oth_fte_n', 'stu_tch_ratio', 'stu_adm_ratio', 'stu_psv_ratio', 'stu_oth_ratio', 'cdscode']


In [24]:
cols_ss_ratio = [
    # 'academic_year', 
    # 'aggregate_level', 
    # 'county_code', 
    # 'district_code', 
    # 'school_code', 
    # 'county_name', 
    # 'district_name', 
    # 'school_name', 
    # 'charter_school', 
    # 'dass', 
    'school_grade_span', 
    # 'total_enr_n', 
    # 'tch_fte_n', 
    # 'adm_fte_n', 
    # 'psv_fte_n', 
    # 'oth_fte_n', 
    'stu_tch_ratio', 
    'stu_adm_ratio', 
    'stu_psv_ratio', 
    # 'stu_oth_ratio', 
    'cdscode']


[
    "Academic Year",
    "Aggregate Level",
    "County Code",
    "District Code",
    "School Code",
    "County Name",
    "District Name",
    "School Name",
    "Charter School",
    "DASS",
    "School Grade Span",
    # "TOTAL_ENR_N",
    # "TCH_FTE_N",
    # "ADM_FTE_N",
    # "PSV_FTE_N",
    # "OTH_FTE_N",
    "STU_TCH_RATIO",  # student / teacher ratio
    "STU_ADM_RATIO",  # student / admin ratio
    "STU_PSV_RATIO",  # student / counselor ratio
    # "STU_OTH_RATIO",
]

df_ss_ratio = df_ss_ratio[cols_ss_ratio]
df_ss_ratio = df_ss_ratio[df_ss_ratio["school_grade_span"] == "GS_K12"]
df_ss_ratio.head()

Unnamed: 0,school_grade_span,stu_tch_ratio,stu_adm_ratio,stu_psv_ratio,cdscode
556,GS_K12,*,*,*,1100170000000
571,GS_K12,4.8,12.4,3.9,1316090131755
572,GS_K12,*,*,*,1316170000000
573,GS_K12,4.4,40.3,159,1316170131763
574,GS_K12,*,*,*,1611190000000


In [25]:
df_combined = df_combined.merge(
    df_ss_ratio, 
    on="cdscode",
    how="left"
)

df_combined.head()

Unnamed: 0,cdscode,opendate,charter,doctype,soctype,edopscode,eilcode,virtual,magnet,yearroundyn,...,eligible_cumulative_enrollment,unexcused_absences_percent,outofschool_suspension_absences_percent,percent__eligible_free_k12,frpm_count_k12,calpads_fall_1_certification_status,school_grade_span,stu_tch_ratio,stu_adm_ratio,stu_psv_ratio
0,1611190130229,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,1841,23.5,0.5,0.172013,327.0,Y,,,,
1,1611270130450,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,1192,46.2,0.4,0.174389,307.0,Y,,,,
2,1611430131177,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,3281,24.1,0.0,0.262259,935.0,Y,,,,
3,1611500132225,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,2771,28.0,0.9,0.166358,491.0,Y,,,,
4,1611500133876,2016-07-01 00:00:00,N,Unified School District,K-12 Schools (Public),TRAD,HS,V,N,N,...,420,19.1,3.8,0.27381,97.0,Y,GS_K12,17.7,*,224.0


### Staff Education

[Data Dictionary: Staff Education](https://www.cde.ca.gov/ds/ad/fssted.asp)


In [53]:
df_staff_ed = rpkl(raw_pickle, "raw_staff_edu.pkl")


✅ Added 'cdscode' using: county_code, district_code, school_code

📁 Columns in raw_staff_edu.pkl:
['academic_year', 'aggregate_level', 'county_code', 'district_code', 'school_code', 'county_name', 'district_name', 'school_name', 'charter_school', 'dass', 'staff_type', 'school_grade_span', 'staff_gender', 'total_staff_count', 'associate', 'baccalaureate', 'baccalaureate_plus', 'master', 'master_plus', 'doctorate', 'special_juris_doctor', 'none', 'cdscode']


In [56]:
cols_staff_ed = [
    # 'academic_year', 
    'aggregate_level', 
    # 'county_code', 
    # 'district_code', 
    # 'school_code', 
    # 'county_name', 
    # 'district_name', 
    # 'school_name', 
    'charter_school', 
    'dass', 
    'staff_type', 
    'school_grade_span', 
    'staff_gender', 
    'total_staff_count', 
    'associate', 
    'baccalaureate', 
    'baccalaureate_plus', 
    'master', 
    'master_plus', 
    'doctorate', 
    'special_juris_doctor', 
    'none', 
    'cdscode']


df_staff_ed = df_staff_ed[cols_staff_ed]
df_staff_ed = df_staff_ed[
  (df_staff_ed["school_grade_span"] == "GS_912") &
    (df_staff_ed["staff_type"] == "ALL") &
    (df_staff_ed["staff_gender"] == "ALL")
]

df_staff_ed.head()

Unnamed: 0,aggregate_level,charter_school,dass,staff_type,school_grade_span,staff_gender,total_staff_count,associate,baccalaureate,baccalaureate_plus,master,master_plus,doctorate,special_juris_doctor,none,cdscode
7491,S,N,N,ALL,GS_912,ALL,95,0,12,30,12,33,2,0,6,1611190130229
7503,S,N,N,ALL,GS_912,ALL,10,0,0,5,0,5,0,0,0,1611190106401
7723,S,N,N,ALL,GS_912,ALL,236,0,101,35,85,11,4,0,0,1611430131177
7998,S,N,N,ALL,GS_912,ALL,149,0,28,47,24,47,3,0,0,1611500132225
8116,S,N,N,ALL,GS_912,ALL,5,0,0,2,3,0,0,0,0,1611500130047


In [57]:
# normalize staff education

cols = ['total_staff_count', 
    'associate', 
    'baccalaureate', 
    'baccalaureate_plus', 
    'master', 
    'master_plus', 
    'doctorate', 
    'special_juris_doctor', 
    'none',]

df_staff_ed[cols] = df_staff_ed[cols].apply(pd.to_numeric, errors = "coerce")

df_staff_ed["pct_associate"] = df_staff_ed["associate"] / df_staff_ed["total_staff_count"]
df_staff_ed["pct_bachelors"] = df_staff_ed["baccalaureate"] / df_staff_ed["total_staff_count"]
df_staff_ed["pct_bachelors_plus"] = df_staff_ed["baccalaureate_plus"] / df_staff_ed["total_staff_count"]
df_staff_ed["pct_master"] = df_staff_ed["master"] / df_staff_ed["total_staff_count"]
df_staff_ed["pct_master_plus"] = df_staff_ed["master_plus"] / df_staff_ed["total_staff_count"]
df_staff_ed["pct_doctorate"] = df_staff_ed["doctorate"] / df_staff_ed["total_staff_count"]
df_staff_ed["pct_juris_doctor"] = df_staff_ed["special_juris_doctor"] / df_staff_ed["total_staff_count"]
df_staff_ed["pct_no_degree"] = df_staff_ed["total_staff_count"]

df_staff_ed.head()

Unnamed: 0,aggregate_level,charter_school,dass,staff_type,school_grade_span,staff_gender,total_staff_count,associate,baccalaureate,baccalaureate_plus,...,none,cdscode,pct_associate,pct_bachelors,pct_bachelors_plus,pct_master,pct_master_plus,pct_doctorate,pct_juris_doctor,pct_no_degree
7491,S,N,N,ALL,GS_912,ALL,95,0,12,30,...,6,1611190130229,0.0,0.126316,0.315789,0.126316,0.347368,0.021053,0.0,95
7503,S,N,N,ALL,GS_912,ALL,10,0,0,5,...,0,1611190106401,0.0,0.0,0.5,0.0,0.5,0.0,0.0,10
7723,S,N,N,ALL,GS_912,ALL,236,0,101,35,...,0,1611430131177,0.0,0.427966,0.148305,0.360169,0.04661,0.016949,0.0,236
7998,S,N,N,ALL,GS_912,ALL,149,0,28,47,...,0,1611500132225,0.0,0.187919,0.315436,0.161074,0.315436,0.020134,0.0,149
8116,S,N,N,ALL,GS_912,ALL,5,0,0,2,...,0,1611500130047,0.0,0.0,0.4,0.6,0.0,0.0,0.0,5


In [58]:
staff_ed_cols = [
 'cdscode',
 'pct_associate',
 'pct_bachelors',
 'pct_bachelors_plus',
 'pct_master',
 'pct_master_plus',
 'pct_doctorate',
 'pct_juris_doctor',
 'pct_no_degree']

staff_ed = df_staff_ed[staff_ed_cols]
staff_ed

Unnamed: 0,cdscode,pct_associate,pct_bachelors,pct_bachelors_plus,pct_master,pct_master_plus,pct_doctorate,pct_juris_doctor,pct_no_degree
7491,01611190130229,0.000000,0.126316,0.315789,0.126316,0.347368,0.021053,0.0,95
7503,01611190106401,0.000000,0.000000,0.500000,0.000000,0.500000,0.000000,0.0,10
7723,01611430131177,0.000000,0.427966,0.148305,0.360169,0.046610,0.016949,0.0,236
7998,01611500132225,0.000000,0.187919,0.315436,0.161074,0.315436,0.020134,0.0,149
8116,01611500130047,0.000000,0.000000,0.400000,0.600000,0.000000,0.000000,0.0,5
...,...,...,...,...,...,...,...,...,...
358551,57727100101162,0.000000,0.181818,0.420455,0.272727,0.125000,0.000000,0.0,88
358626,57727105738802,0.000000,0.168831,0.493506,0.233766,0.103896,0.000000,0.0,77
360655,58727365830013,0.014925,0.462687,0.104478,0.238806,0.029851,0.000000,0.0,67
360700,58727365835202,0.000000,0.431034,0.155172,0.241379,0.051724,0.017241,0.0,58


In [59]:
df_combined = df_combined.merge(
    staff_ed, 
    on="cdscode",
    how="left"
)

df_combined.head()

Unnamed: 0,cdscode,opendate,charter,doctype,soctype,edopscode,eilcode,virtual,magnet,yearroundyn,...,stu_adm_ratio,stu_psv_ratio,pct_associate,pct_bachelors,pct_bachelors_plus,pct_master,pct_master_plus,pct_doctorate,pct_juris_doctor,pct_no_degree
0,1611190130229,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,,,0.0,0.126316,0.315789,0.126316,0.347368,0.021053,0.0,95.0
1,1611270130450,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,,,,,,,,,,
2,1611430131177,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,,,0.0,0.427966,0.148305,0.360169,0.04661,0.016949,0.0,236.0
3,1611500132225,1980-07-01 00:00:00,N,Unified School District,High Schools (Public),TRAD,HS,N,N,N,...,,,0.0,0.187919,0.315436,0.161074,0.315436,0.020134,0.0,149.0
4,1611500133876,2016-07-01 00:00:00,N,Unified School District,K-12 Schools (Public),TRAD,HS,V,N,N,...,*,224.0,,,,,,,,


### Staff Experience

[Data Dictionary: Staff Experience](https://www.cde.ca.gov/ds/ad/fsstex.asp)

In [60]:
df_staff_xp = rpkl(raw_pickle, "raw_staff_exp.pkl")


✅ Added 'cdscode' using: county_code, district_code, school_code

📁 Columns in raw_staff_exp.pkl:
['academic_year', 'aggregate_level', 'county_code', 'district_code', 'school_code', 'county_name', 'district_name', 'school_name', 'charter_school', 'dass', 'staff_type', 'school_grade_span', 'staff_gender', 'total_staff_count', 'average_total_years_experience', 'average_district_years_experience', 'experienced', 'inexperienced', 'first_year', 'second_year', 'cdscode']


In [63]:
cols_staff_xp = [
    # 'academic_year', 
    # 'aggregate_level', 
    # 'county_code', 
    # 'district_code', 
    # 'school_code', 
    # 'county_name', 
    # 'district_name', 
    # 'school_name', 
    # 'charter_school', 
    # 'dass', 
    'staff_type', 
    'school_grade_span', 
    'staff_gender', 
    'total_staff_count', 
    'average_total_years_experience', 
    'average_district_years_experience', 
    'experienced', 
    'inexperienced', 
    'first_year', 
    'second_year', 
    'cdscode']




# [
#     # "Academic Year",
#     # "Aggregate Level",
#     # "County Code",
#     # "District Code",
#     # "School Code",
#     # "County Name",
#     # "District Name",
#     # "School Name",
#     # "Charter School",
#     # "DASS",
#     "Staff Type",
#     "School Grade Span",
#     "Staff Gender",
#     "Total Staff Count",
#     "Average Total Years Experience",
#     "Average District Years Experience",
#     "Experienced",  # 2+ years experience
#     "Inexperienced",  # <2 years experience
#     "First Year",  # No of staff in 1st year
#     "Second Year",  # No of staff in 2nd year
# ]

df_staff_xp = df_staff_xp[cols_staff_xp]
df_staff_xp = df_staff_xp[
    (df_staff_xp["staff_type"] == "ALL") &
    (df_staff_xp["staff_gender"] == "ALL") &
    (df_staff_xp["school_grade_span"] == "GS_912")
]

df_staff_xp.head()

Unnamed: 0,staff_type,school_grade_span,staff_gender,total_staff_count,average_total_years_experience,average_district_years_experience,experienced,inexperienced,first_year,second_year,cdscode
7491,ALL,GS_912,ALL,95,13.6,10.0,82,13,10,3,1611190130229
7503,ALL,GS_912,ALL,10,18.3,12.8,10,0,0,0,1611190106401
7723,ALL,GS_912,ALL,236,10.7,10.3,202,34,22,12,1611430131177
7998,ALL,GS_912,ALL,149,13.8,9.5,136,13,8,5,1611500132225
8116,ALL,GS_912,ALL,5,16.2,7.6,5,0,0,0,1611500130047


## Enrollment by School

[Data Dictionary: Enrollment by School](https://www.cde.ca.gov/ds/ad/fsenrps.asp)


In [None]:
df_enroll = rpkl(raw_pickle, "raw_school_enroll.pkl")

df_enroll.columns.to_list()

In [None]:
cols_enroll = [
    "ACADEMIC_YEAR",
    "CDS_CODE",
    "COUNTY",
    "DISTRICT",
    "SCHOOL",
    "ENR_TYPE",
    "RACE_ETHNICITY",
    "GENDER",
    # "GR_KN",
    # "GR_1",
    # "GR_2",
    # "GR_3",
    # "GR_4",
    # "GR_5",
    # "GR_6",
    # "GR_7",
    # "GR_8",
    # "UNGR_ELM",
    "GR_9",
    "GR_10",
    "GR_11",
    "GR_12",
    "UNGR_SEC",
    "ENR_TOTAL",
    # "ADULT",
]

df_enroll[cols_enroll]

# Ca DOE School Climate, Health, and Learning Surveys

## Perception of Safety by Grade Level

In [None]:
df_safety = rpkl(raw_pickle, "raw_safety_percept_grade.pkl")

df_safety.columns.to_list()

In [None]:
cols_safety = ['geography',
 'geo_type',
 'grade',
 'very_safe_pct',
 'safe_pct',
 'neither_pct',
 'unsafe_pct',
 'very_unsafe_pct',
 'years',
 'level_of_safety_filter']

df_safety[cols_safety]

## Perception of Safety by School Connectedness

In [None]:
df_connected = rpkl(raw_pickle, "raw_safety_connect.pkl")

df_connected.columns.to_list()

In [None]:
cols_connected = ['Geography',
 'Connectedness',
 'Very Safe',
 'Safe',
 'Neither Safe nor Unsafe',
 'Unsafe',
 'Very Unsafe',
 'Safety_Positive']

df_connected[cols_connected]