# Data Inspection for Absenteeism Prediction

This notebook guides you through the initial exploration of the two raw datasets:
- `chr_abs_raw.xlsx`
- `evaldata_raw.xlsx`

We will load, inspect, and summarize both datasets to inform the cleaning process.

In [1]:
# Import Required Libraries
import pandas as pd

## Load the Datasets

Let's load both Excel files from the data folder.

In [2]:
# Load datasets
chr_abs = pd.read_excel('../data/chr_abs_raw.xlsx')
evaldata = pd.read_excel('../data/evaldata_raw.xlsx')

print('chr_abs shape:', chr_abs.shape)
print('evaldata shape:', evaldata.shape)

chr_abs shape: (18638, 28)
evaldata shape: (79460, 122)


## Preview the Data

Let's look at the first few rows of each dataset to get an idea of their structure.

In [3]:
# Preview first few rows
chr_abs.head()

Unnamed: 0,ID,LastName,FirstName,Birthdate,DT,DaysEnr,DaysAbs,DaysPresent,AttRate,AttGrp,...,Current Weighted Total GPA (GT),AddressResidence,CityResidence,ZipResidence,ParentName,Telephone,PG_Email_1,SED Status,NumSusp,NumDaysSusp
0,443282,Aarif,Aslam,2010-07-02,2025-05-29,180,87,93,0.5167,Severe Chronic Absent,...,,1931 Myrtle St,Oakland,94607,Danyelle Aarif,5103055000.0,kkisa@yahoo.com,Not SED,,
1,436859,Abarca,Josiah,2018-05-31,2025-05-29,180,31,149,0.8278,Chronic Absent,...,,1001 105TH AVE,Oakland,94603,Roxana Aguilar,5106958000.0,Roxanaaguilar1011@yahoo.com,SED,,
2,435234,Abarca Carranza,Maura,2004-08-26,2025-05-29,180,121,59,0.3278,Severe Chronic Absent,...,0.0,6108 HARMON AVE,Oakland,94621,Jose Mauricio Polanco,5105411000.0,dayana.abarca1023@gmail.com,SED,,
3,408468,Abarca Climaco,Valeria,2016-03-09,2025-05-29,167,17,150,0.8982,Chronic Absent,...,,1001 105TH AVE,Oakland,94603,Edith Climaco / Jose Abarca,5104200000.0,edithclimaco87i@gmail.com,SED,,
4,440496,Abarca Escobar,Genesis,2015-11-24,2025-05-29,180,20,160,0.8889,Chronic Absent,...,,1058 75TH AVE,Oakland,94621,Angela Escobar,2098087000.0,alegriaesc0@gmail.com,SED,,
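
The preview rows suggest that `AttRate` is `DaysPresent / DaysEnr` (93 / 180 ≈ 0.5167 in the first row) and that `DaysPresent` is `DaysEnr - DaysAbs`. Neither relationship is documented in the source files, so the following is only a sketch of a consistency check under that assumption. The helper name `check_attendance_consistency` and the tolerance are illustrative.

```python
import pandas as pd

def check_attendance_consistency(df: pd.DataFrame, tol: float = 1e-3) -> pd.DataFrame:
    """Return rows whose attendance columns disagree with each other.

    Assumes (from the preview, not from documentation) that
    DaysPresent = DaysEnr - DaysAbs and AttRate = DaysPresent / DaysEnr.
    """
    days_mismatch = df["DaysPresent"] != (df["DaysEnr"] - df["DaysAbs"])
    recomputed = df["DaysPresent"] / df["DaysEnr"]
    rate_mismatch = (recomputed - df["AttRate"]).abs() > tol
    return df[days_mismatch | rate_mismatch]

# Toy rows: the first two copy values from the preview, the third is deliberately wrong.
sample = pd.DataFrame({
    "DaysEnr": [180, 167, 180],
    "DaysAbs": [87, 17, 20],
    "DaysPresent": [93, 150, 160],
    "AttRate": [0.5167, 0.8982, 0.5000],
})
bad_rows = check_attendance_consistency(sample)
print(bad_rows)
```

On the real data, `check_attendance_consistency(chr_abs)` would list any rows worth a closer look before `AttRate` is used as a target.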


In [4]:
# Preview first few rows of evaldata
evaldata.head()

Unnamed: 0,ANON_ID,Birthdate,Gen,Eth_1718,Fluency_1718,SpEd_1718,SiteName_1718,School Address_1718,City_1718,Zip_1718,...,Grade_2324,AttRate_2324,DaysEnr_2324,DaysAbs_2324,Susp_2324,Address_2324,City_2324.1,Zip_2324.1,CurrWeightedTotGPA_2324,SED_2324
0,1,1997-08-21,F,Asian,RFEP,Not Special Ed,Oakland International HS,4521 Webster St,Oakland,94609.0,...,,,,,,,,,,
1,2,1999-10-10,F,Asian,EL,Not Special Ed,Oakland International HS,4521 Webster St,Oakland,94609.0,...,,,,,,,,,,
2,3,2019-05-09,F,,,,,,,,...,-1.0,0.9278,180.0,13.0,,7559 Hansom Dr,Oakland,94605.0,,Not SED
3,4,2007-07-05,F,African American,EO,Not Special Ed,EnCompass Academy,1025 81st Avenue,Oakland,94621.0,...,,,,,,,,,,
4,5,2016-01-26,M,,,,,,,,...,2.0,0.9556,180.0,8.0,,6912 Broadway Ter,Oakland,94611.0,,Not SED


## Data Info and Summary

Let's check the info and summary statistics for both datasets.

In [5]:
# Info and summary for chr_abs
chr_abs.info()  # info() prints its report and returns None, so it is not wrapped in display()
display(chr_abs.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18638 entries, 0 to 18637
Data columns (total 28 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   ID                                  18638 non-null  int64         
 1   LastName                            18638 non-null  object        
 2   FirstName                           18638 non-null  object        
 3   Birthdate                           18638 non-null  datetime64[ns]
 4   DT                                  18638 non-null  datetime64[ns]
 5   DaysEnr                             18638 non-null  int64         
 6   DaysAbs                             18638 non-null  int64         
 7   DaysPresent                         18638 non-null  int64         
 8   AttRate                             18638 non-null  float64       
 9   AttGrp                              18638 non-null  object        
 10  SiteName              

Unnamed: 0,ID,LastName,FirstName,Birthdate,DT,DaysEnr,DaysAbs,DaysPresent,AttRate,AttGrp,...,Current Weighted Total GPA (GT),AddressResidence,CityResidence,ZipResidence,ParentName,Telephone,PG_Email_1,SED Status,NumSusp,NumDaysSusp
count,18638.0,18638,18638,18638,18638,18638.0,18638.0,18638.0,18638.0,18638,...,6786.0,18637,18638,18638.0,18629,18595.0,18515,18638,955.0,955.0
unique,,9598,7364,,,,,,,3,...,,14663,64,,15693,,14159,2,,
top,,Williams,Jose,,,,,,,At Risk,...,,746 GRAND AVE,Oakland,,Tokuda Washington,,wallace.hazel26@gmail.com,SED,,
freq,,220,93,,,,,,,8726,...,,20,18106,,10,,8,16670,,
mean,404355.586758,,,2013-07-16 13:38:11.962657024,2025-05-28 22:39:52.788925952,172.722234,26.026237,146.695997,0.846269,,...,2.43513,,,94589.550005,,5156015000.0,,,1.549738,3.018848
min,267030.0,,,2002-07-08 00:00:00,2025-05-28 00:00:00,3.0,1.0,0.0,0.0,,...,0.0,,,946.0,,415548800.0,,,1.0,1.0
25%,385460.5,,,2009-11-22 06:00:00,2025-05-29 00:00:00,179.0,13.0,143.0,0.8278,,...,1.71,,,94603.0,,5103337000.0,,,1.0,1.0
50%,414238.0,,,2013-10-09 12:00:00,2025-05-29 00:00:00,180.0,18.0,159.0,0.8944,,...,2.67,,,94605.0,,5105851000.0,,,1.0,2.0
75%,437557.75,,,2017-03-09 00:00:00,2025-05-29 00:00:00,180.0,30.0,166.0,0.9278,,...,3.43,,,94611.0,,5108462000.0,,,2.0,4.0
max,451929.0,,,2021-12-23 00:00:00,2025-05-29 00:00:00,180.0,180.0,170.0,0.9497,,...,5.0,,,96401.0,,9852781000.0,,,11.0,20.0


In [6]:
# Info and summary for evaldata
evaldata.info()  # info() prints its report and returns None, so it is not wrapped in display()
display(evaldata.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79460 entries, 0 to 79459
Columns: 122 entries, ANON_ID to SED_2324
dtypes: datetime64[ns](1), float64(55), int64(1), object(65)
memory usage: 74.0+ MB

Unnamed: 0,ANON_ID,Birthdate,Gen,Eth_1718,Fluency_1718,SpEd_1718,SiteName_1718,School Address_1718,City_1718,Zip_1718,...,Grade_2324,AttRate_2324,DaysEnr_2324,DaysAbs_2324,Susp_2324,Address_2324,City_2324.1,Zip_2324.1,CurrWeightedTotGPA_2324,SED_2324
count,79460.0,79460,79460,40625,40625,40625,40625,40625,40625,40625.0,...,37663.0,36695.0,36695.0,36695.0,1439.0,37663,37663,37663.0,18280.0,37663
unique,,,4,9,5,2,88,77,1,,...,,,,,,28101,90,,,2
top,,,M,Latino,EO,Not Special Ed,Oakland Technical High School,4351 Broadway,Oakland,,...,,,,,,746 GRAND AVE,Oakland,,,SED
freq,,,41291,17223,19842,34939,2024,2024,40625,,...,,,,,,22,36659,,,30915
mean,39730.5,2009-07-12 00:34:53.128618240,,,,,,,,94698.938511,...,5.591164,0.889455,166.679793,16.944189,1.557331,,,94587.0766,2.505184,
min,1.0,1996-03-26 00:00:00,,,,,,,,94601.0,...,-1.0,0.0,1.0,0.0,1.0,,,1.0,0.0,
25%,19865.75,2005-07-25 00:00:00,,,,,,,,94603.0,...,2.0,0.8722,179.0,5.0,1.0,,,94602.0,1.67,
50%,39730.5,2009-08-10 00:00:00,,,,,,,,94608.0,...,5.0,0.9333,180.0,11.0,1.0,,,94606.0,2.88,
75%,59595.25,2013-08-27 00:00:00,,,,,,,,94618.0,...,9.0,0.9667,180.0,21.0,2.0,,,94611.0,3.67,
max,79460.0,2019-08-16 00:00:00,,,,,,,,96409.0,...,12.0,1.0,180.0,180.0,14.0,,,96403.0,5.0,


## Understand evaldata Column Structure

`evaldata` has 122 columns, most of them repeated once per school year with a suffix such as `_1718` (2017-18). Let's parse the column naming pattern.

In [7]:
# Parse evaldata columns by year suffix
import re

year_pattern = re.compile(r'_(\d{4})$')
years_found = set()
cols_by_year = {}

for col in evaldata.columns:
    match = year_pattern.search(col)
    if match:
        yr = match.group(1)
        years_found.add(yr)
        base = col[:match.start()]
        cols_by_year.setdefault(yr, []).append(base)
    
# Columns without year suffix
no_year_cols = [c for c in evaldata.columns if not year_pattern.search(c)]

print("Columns without year suffix:", no_year_cols)
print("\nYears found:", sorted(years_found))
print("\nFields per year:")
for yr in sorted(cols_by_year):
    print(f"  {yr}: {cols_by_year[yr]}")

Columns without year suffix: ['ANON_ID', 'Birthdate', 'Gen', 'City_1718.1', 'Zip_1718.1', 'City_1819.1', 'Zip_1819.1', 'City_1920.1', 'Zip_1920.1', 'City_2021.1', 'Zip_2021.1', 'City_2122.1', 'Zip_2122.1', 'City_2223.1', 'Zip_2223.1', 'City_2324.1', 'Zip_2324.1']

Years found: ['1718', '1819', '1920', '2021', '2122', '2223', '2324']

Fields per year:
  1718: ['Eth', 'Fluency', 'SpEd', 'SiteName', 'School Address', 'City', 'Zip', 'Grade', 'AttRate', 'DaysEnr', 'DaysAbs', 'Susp', 'Address', 'CurrWeightedTotGPA', 'SED']
  1819: ['Eth', 'Fluency', 'SpEd', 'SiteName', 'School Address', 'City', 'Zip', 'Grade', 'AttRate', 'DaysEnr', 'DaysAbs', 'Susp', 'Address', 'CurrWeightedTotGPA', 'SED']
  1920: ['Eth', 'Fluency', 'SpEd', 'SiteName', 'School Address', 'City', 'Zip', 'Grade', 'AttRate', 'DaysEnr', 'DaysAbs', 'Susp', 'Address', 'CurrWeightedTotGPA', 'SED']
  2021: ['Eth', 'Fluency', 'SpEd', 'SiteName', 'School Address', 'City', 'Zip', 'Grade', 'AttRate', 'DaysEnr', 'DaysAbs', 'Susp', 'Addres
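
The "Columns without year suffix" list above includes names such as `City_1718.1`. These do carry a year; the trailing `.1` is what pandas appends when a spreadsheet header repeats, and the `$`-anchored pattern does not match it. In the preview these columns sit right after `Address_2324`, which suggests they hold the residence city and ZIP, but that reading is an assumption. A minimal sketch of a pattern that tolerates the duplicate marker (the helper name `split_year_column` is hypothetical):

```python
import re

# Year suffix such as _1718, optionally followed by pandas' duplicate marker (.1, .2, ...).
YEAR_WITH_DUP = re.compile(r"_(\d{4})(?:\.(\d+))?$")

def split_year_column(name):
    """Split 'City_2324.1' into ('City', '2324', 1); return None if there is no year suffix."""
    match = YEAR_WITH_DUP.search(name)
    if match is None:
        return None
    base = name[:match.start()]
    dup_index = int(match.group(2)) if match.group(2) else 0
    return base, match.group(1), dup_index

for col in ["ANON_ID", "City_2324", "City_2324.1", "School Address_1718"]:
    print(col, "->", split_year_column(col))
```

With this helper, only `ANON_ID`, `Birthdate`, and `Gen` would remain in the list of columns that have no year.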

In [8]:
# How many students have data in each year?
for yr in sorted(cols_by_year):
    att_col = f'AttRate_{yr}'
    if att_col in evaldata.columns:
        n = evaldata[att_col].notna().sum()
        print(f"  {yr}: {n} students with AttRate")

  1718: 39929 students with AttRate
  1819: 39579 students with AttRate
  1920: 38839 students with AttRate
  2021: 37558 students with AttRate
  2122: 36153 students with AttRate
  2223: 36552 students with AttRate
  2324: 36695 students with AttRate
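
Because each year's fields live in their own suffixed columns, per-year summaries like the one above are often easier in long format, with one row per student and year. The following is a sketch on toy data, not the notebook's own method; `to_long` is a hypothetical helper, and it deliberately skips duplicate-header columns such as `City_2324.1`.

```python
import pandas as pd

def to_long(df: pd.DataFrame, id_cols, years) -> pd.DataFrame:
    """Stack year-suffixed columns into one row per (student, year)."""
    frames = []
    for yr in years:
        suffix = f"_{yr}"
        # Only exact '_YYYY' endings; columns such as 'City_2324.1' are skipped.
        year_cols = [c for c in df.columns if c.endswith(suffix)]
        part = df[id_cols + year_cols].copy()
        part.columns = id_cols + [c[: -len(suffix)] for c in year_cols]
        part["year"] = yr
        frames.append(part)
    return pd.concat(frames, ignore_index=True)

toy = pd.DataFrame({
    "ANON_ID": [1, 2],
    "AttRate_2223": [0.95, None],
    "AttRate_2324": [0.88, 0.91],
    "DaysEnr_2223": [180, None],
    "DaysEnr_2324": [180, 170],
})
long_df = to_long(toy, id_cols=["ANON_ID"], years=["2223", "2324"])
# Drop student-years with no attendance record (assumed to mean "not enrolled that year").
long_df = long_df.dropna(subset=["AttRate"])
print(long_df)
```

On the real data the call would be `to_long(evaldata, ["ANON_ID", "Birthdate", "Gen"], sorted(years_found))`, after which `long_df.groupby("year")["AttRate"].count()` gives the per-year counts.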


## Missing Values Analysis

In [9]:
# Missing values in chr_abs
print("=== chr_abs missing values ===")
missing_chr = chr_abs.isnull().sum()
missing_chr_pct = (missing_chr / len(chr_abs) * 100).round(1)
missing_df = pd.DataFrame({'missing': missing_chr, 'pct': missing_chr_pct})
print(missing_df[missing_df['missing'] > 0].sort_values('pct', ascending=False).to_string())

=== chr_abs missing values ===
                                    missing   pct
NumSusp                               17683  94.9
NumDaysSusp                           17683  94.9
Cumulative Weighted Total GPA (TP)    11852  63.6
Current Weighted Total GPA (GT)       11852  63.6
PG_Email_1                              123   0.7
Telephone                                43   0.2
AddressResidence                          1   0.0
ParentName                                9   0.0


In [10]:
# Missing values in evaldata - summarize by field type across years
print("=== evaldata missing values by field across years ===")
for yr in sorted(cols_by_year):
    print(f"\n--- Year {yr} ---")
    for base in cols_by_year[yr]:
        col = f'{base}_{yr}'
        if col in evaldata.columns:
            n_missing = evaldata[col].isnull().sum()
            pct = n_missing / len(evaldata) * 100
            if pct > 0:
                print(f"  {col}: {n_missing} ({pct:.1f}%)")

=== evaldata missing values by field across years ===

--- Year 1718 ---
  Eth_1718: 38835 (48.9%)
  Fluency_1718: 38835 (48.9%)
  SpEd_1718: 38835 (48.9%)
  SiteName_1718: 38835 (48.9%)
  School Address_1718: 38835 (48.9%)
  City_1718: 38835 (48.9%)
  Zip_1718: 38835 (48.9%)
  Grade_1718: 38835 (48.9%)
  AttRate_1718: 39531 (49.7%)
  DaysEnr_1718: 39531 (49.7%)
  DaysAbs_1718: 39531 (49.7%)
  Susp_1718: 77877 (98.0%)
  Address_1718: 38864 (48.9%)
  CurrWeightedTotGPA_1718: 60797 (76.5%)
  SED_1718: 38835 (48.9%)

--- Year 1819 ---
  Eth_1819: 39247 (49.4%)
  Fluency_1819: 39247 (49.4%)
  SpEd_1819: 39247 (49.4%)
  SiteName_1819: 39247 (49.4%)
  School Address_1819: 39247 (49.4%)
  City_1819: 39247 (49.4%)
  Zip_1819: 39247 (49.4%)
  Grade_1819: 39247 (49.4%)
  AttRate_1819: 39881 (50.2%)
  DaysEnr_1819: 39881 (50.2%)
  DaysAbs_1819: 39881 (50.2%)
  Susp_1819: 78121 (98.3%)
  Address_1819: 39257 (49.4%)
  CurrWeightedTotGPA_1819: 60849 (76.6%)
  SED_1819: 39247 (49.4%)

--- Year 1920 -
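
Within a year, most fields are missing for the same number of rows (38835 for 2017-18), which is consistent with those students simply not being enrolled that year. Percentages computed against all 79460 rows therefore mix "not enrolled" with "enrolled but not recorded". The sketch below restricts the denominator to rows with a recorded grade for the year; using a non-null `Grade_<year>` as a proxy for enrollment is an assumption, and `missing_among_enrolled` is a hypothetical helper.

```python
import pandas as pd

def missing_among_enrolled(df: pd.DataFrame, year: str, fields) -> pd.Series:
    """Percent missing per field, among rows that have a grade recorded for `year`.

    Treating a non-null Grade_<year> as "enrolled" is an assumption of this sketch.
    """
    enrolled = df[df[f"Grade_{year}"].notna()]
    cols = [f"{f}_{year}" for f in fields]
    return (enrolled[cols].isna().mean() * 100).round(1)

toy = pd.DataFrame({
    "Grade_2324": [2.0, 5.0, None, 9.0],
    "AttRate_2324": [0.93, None, None, 0.80],
    "Susp_2324": [None, None, None, 1.0],
})
pct = missing_among_enrolled(toy, "2324", ["AttRate", "Susp"])
print(pct)
```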

## Target Variable Analysis

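In `chr_abs`, the candidate target is `AttGrp`, which has three levels: Severe Chronic Absent, Chronic Absent, and At Risk.
Note: this dataset appears to contain only students who are already flagged, so it may not include students with good attendance.

Chronic absenteeism is commonly defined as missing 10% or more of enrolled days. This notebook operationalizes it as `AttRate < 0.90`, so a student at exactly 0.90 is not counted as chronically absent.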
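
As a sketch of how attendance rates could be turned into bands and a binary target, the block below uses the cut points that appear elsewhere in this notebook (0.80, 0.90, 0.95). The band labels and the left-closed bin convention are assumptions; the `AttGrp` column in `chr_abs` may be derived differently.

```python
import pandas as pd

# Cut points used elsewhere in this notebook (0.80, 0.90, 0.95); labels are assumptions.
BINS = [0.0, 0.80, 0.90, 0.95, float("inf")]
LABELS = ["Severe Chronic Absent", "Chronic Absent", "At Risk", "On Track"]

def attendance_band(att_rate: pd.Series) -> pd.Series:
    """Left-closed bins: 0.90 falls in 'At Risk', anything below it in 'Chronic Absent'."""
    return pd.cut(att_rate, bins=BINS, labels=LABELS, right=False)

rates = pd.Series([0.5167, 0.8278, 0.90, 0.9497, 1.0, None])
bands = attendance_band(rates)
# Binary modeling target. NaN compares as False, so drop missing rates first in practice.
is_chronic = rates < 0.90
print(pd.DataFrame({"AttRate": rates, "band": bands, "chronic": is_chronic}))
```

Comparing `attendance_band(chr_abs['AttRate'])` against `chr_abs['AttGrp']` with `pd.crosstab` would show whether the provided labels follow these cut points.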

In [11]:
# Target distribution in chr_abs
print("=== AttGrp distribution in chr_abs ===")
print(chr_abs['AttGrp'].value_counts())
print(f"\nAttRate range: {chr_abs['AttRate'].min()} - {chr_abs['AttRate'].max()}")
print(f"Max AttRate is {chr_abs['AttRate'].max()} (< 0.95), consistent with this dataset only containing at-risk/chronic absent students")

# Check: does chr_abs include ANY students with good attendance?
print(f"\nStudents with AttRate >= 0.95: {(chr_abs['AttRate'] >= 0.95).sum()}")
print(f"Students with AttRate >= 0.90: {(chr_abs['AttRate'] >= 0.90).sum()}")

=== AttGrp distribution in chr_abs ===
AttGrp
At Risk                  8726
Chronic Absent           6068
Severe Chronic Absent    3844
Name: count, dtype: int64

AttRate range: 0.0 - 0.9497
Max AttRate is 0.9497 (< 0.95), consistent with this dataset only containing at-risk/chronic absent students

Students with AttRate >= 0.95: 0
Students with AttRate >= 0.90: 8994


In [12]:
# Target distribution in evaldata (using most recent year 2324)
print("=== Chronic absence in evaldata (2023-24) ===")
att_2324 = evaldata['AttRate_2324'].dropna()
print(f"Students with AttRate data: {len(att_2324)}")
print(f"AttRate range: {att_2324.min()} - {att_2324.max()}")
print(f"\nChronic absent (AttRate < 0.90): {(att_2324 < 0.90).sum()} ({(att_2324 < 0.90).mean()*100:.1f}%)")
print(f"Severe chronic (AttRate < 0.80): {(att_2324 < 0.80).sum()} ({(att_2324 < 0.80).mean()*100:.1f}%)")
print(f"On track (AttRate >= 0.95): {(att_2324 >= 0.95).sum()} ({(att_2324 >= 0.95).mean()*100:.1f}%)")

print("\n=== AttRate distribution by year in evaldata ===")
for yr in sorted(cols_by_year):
    att_col = f'AttRate_{yr}'
    if att_col in evaldata.columns:
        vals = evaldata[att_col].dropna()
        chronic_pct = (vals < 0.90).mean() * 100
        print(f"  {yr}: n={len(vals)}, mean={vals.mean():.3f}, chronic_absent_pct={chronic_pct:.1f}%")

=== Chronic absence in evaldata (2023-24) ===
Students with AttRate data: 36695
AttRate range: 0.0 - 1.0

Chronic absent (AttRate < 0.90): 11992 (32.7%)
Severe chronic (AttRate < 0.80): 5061 (13.8%)
On track (AttRate >= 0.95): 15153 (41.3%)

=== AttRate distribution by year in evaldata ===
  1718: n=39929, mean=0.935, chronic_absent_pct=16.2%
  1819: n=39579, mean=0.891, chronic_absent_pct=33.8%
  1920: n=38839, mean=0.927, chronic_absent_pct=19.2%
  2021: n=37558, mean=0.916, chronic_absent_pct=20.1%
  2122: n=36153, mean=0.860, chronic_absent_pct=43.8%
  2223: n=36552, mean=0.842, chronic_absent_pct=58.3%
  2324: n=36695, mean=0.889, chronic_absent_pct=32.7%


## Dataset Relationship

Can we link these two datasets? `chr_abs` identifies students by a real `ID`, while `evaldata` uses an anonymized `ANON_ID`. Let's check whether any linkage is possible.

In [13]:
# Check if the two datasets can be linked
print("=== Dataset linkage analysis ===")
print(f"chr_abs IDs: {chr_abs['ID'].nunique()} unique (range {chr_abs['ID'].min()}-{chr_abs['ID'].max()})")
print(f"evaldata ANON_IDs: {evaldata['ANON_ID'].nunique()} unique (range {evaldata['ANON_ID'].min()}-{evaldata['ANON_ID'].max()})")

# Check overlap by ID value
overlap = set(chr_abs['ID']) & set(evaldata['ANON_ID'])
print(f"\nDirect ID overlap: {len(overlap)} students")

# Try matching on birthdate + gender
print("\n--- Trying to match on Birthdate + Gen ---")
chr_key = set(zip(chr_abs['Birthdate'].dt.date, chr_abs['Gen']))
eval_key = set(zip(evaldata['Birthdate'].dt.date, evaldata['Gen']))
bday_gen_overlap = chr_key & eval_key
print(f"Birthdate+Gen overlap: {len(bday_gen_overlap)} unique combos")

print(f"\nchr_abs is 2024-25 data (DT ~ May 2025)")
print(f"evaldata spans 2017-18 through 2023-24")
print(f"These datasets likely cover different time periods with some student overlap")

=== Dataset linkage analysis ===
chr_abs IDs: 17656 unique (range 267030-451929)
evaldata ANON_IDs: 79460 unique (range 1-79460)

Direct ID overlap: 0 students

--- Trying to match on Birthdate + Gen ---
Birthdate+Gen overlap: 7948 unique combos

chr_abs is 2024-25 data (DT ~ May 2025)
evaldata spans 2017-18 through 2023-24
These datasets likely cover different time periods with some student overlap
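
The overlap above counts distinct `(Birthdate, Gen)` combinations, not students. Several students can share a birthdate and gender, so a shared combination does not identify a person. Before treating such a key as a join column, it helps to measure how often it is unique within each dataset. A minimal sketch, with `key_uniqueness` as a hypothetical helper:

```python
import pandas as pd

def key_uniqueness(df: pd.DataFrame, keys) -> float:
    """Share of rows whose `keys` combination occurs exactly once in `df`."""
    shared = df.duplicated(subset=keys, keep=False)  # True for every row in a repeated group
    return float((~shared).mean())

toy = pd.DataFrame({
    "Birthdate": pd.to_datetime(["2010-07-02", "2010-07-02", "2010-07-02", "2018-05-31"]),
    "Gen": ["M", "M", "F", "F"],
})
share = key_uniqueness(toy, ["Birthdate", "Gen"])
print(f"{share:.0%} of rows have a unique Birthdate+Gen key")
```

Since `chr_abs` contains repeated `ID` values, it would need to be deduplicated before running this check on it; otherwise the same student counts as a key collision.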


## Demographic Distributions

In [14]:
# Demographic breakdowns in chr_abs
print("=== chr_abs demographics ===")
for col in ['Eth', 'Gen', 'Fluency', 'Special Ed Status', 'SED Status', 'Gr']:
    print(f"\n--- {col} ---")
    print(chr_abs[col].value_counts().to_string())

=== chr_abs demographics ===

--- Eth ---
Eth
Latino                9711
African American      4515
White                 1436
Multiple Ethnicity    1044
Asian                 1043
Not Reported           523
Pacific Islander       213
Filipino                79
Native American         74

--- Gen ---
Gen
M    9697
F    8896
N      45

--- Fluency ---
Fluency
English Only              9621
English Learner           6267
Recl English Fluent       1905
To Be Determined           473
Initial English Fluent     371
Adult EL (ADEL)              1

--- Special Ed Status ---
Special Ed Status
Not Special Ed    14572
Special Ed         4066

--- SED Status ---
SED Status
SED        16670
Not SED     1968

--- Gr ---
Gr
 0     1580
 12    1507
 2     1450
 1     1428
 3     1424
 5     1372
 11    1310
 4     1301
 8     1287
 10    1258
 6     1208
 7     1204
 9     1177
-1      865
-2      170
 15      97


In [15]:
# Demographic breakdowns in evaldata (using 2324 as reference year)
print("=== evaldata demographics (2023-24) ===")
for base in ['Eth', 'Fluency', 'SpEd', 'SED']:
    col = f'{base}_2324'
    if col in evaldata.columns:
        print(f"\n--- {col} ---")
        print(evaldata[col].value_counts().to_string())

print(f"\n--- Gen (time-invariant) ---")
print(evaldata['Gen'].value_counts().to_string())

=== evaldata demographics (2023-24) ===

--- Eth_2324 ---
Eth_2324
Latino                18214
African American       7663
White                  4225
Asian                  3505
Multiple Ethnicity     2549
Not Reported            910
Pacific Islander        311
Filipino                186
Native American         100

--- Fluency_2324 ---
Fluency_2324
EO         18694
EL         12785
RFEP        5011
IFEP        1090
TBD           75
ADEL           7
Unknown        1

--- SpEd_2324 ---
SpEd_2324
Not Special Ed    31000
Special Ed         6663

--- SED_2324 ---
SED_2324
SED        30915
Not SED     6748

--- Gen (time-invariant) ---
Gen
M    41291
F    38029
N      138
m        2
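
The `Gen` counts include two rows coded as lowercase `m`. A small normalization step before grouping or encoding avoids a spurious fourth category. Treating `m` as `M` is an assumption, and `normalize_gen` is a hypothetical helper.

```python
import pandas as pd

def normalize_gen(gen: pd.Series) -> pd.Series:
    """Strip whitespace and upper-case gender codes so 'm' and 'M' are counted together."""
    return gen.astype("string").str.strip().str.upper()

gen = pd.Series(["M", "F", "N", "m", " f ", None])
clean = normalize_gen(gen)
print(clean.value_counts().to_string())
```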


## School Analysis

In [16]:
# Schools in chr_abs
print(f"=== Schools in chr_abs ===")
print(f"Number of schools: {chr_abs['SiteName'].nunique()}")
print(f"\nTop 15 schools by student count:")
print(chr_abs['SiteName'].value_counts().head(15).to_string())

# Schools in evaldata (2324)
print(f"\n=== Schools in evaldata 2023-24 ===")
site_col = 'SiteName_2324'
if site_col in evaldata.columns:
    sites = evaldata[site_col].dropna()
    print(f"Number of schools: {sites.nunique()}")
    print(f"\nTop 15 schools by student count:")
    print(sites.value_counts().head(15).to_string())

=== Schools in chr_abs ===
Number of schools: 77

Top 15 schools by student count:
SiteName
Oakland Technical High School    885
Skyline High School              795
Oakland High School              675
Fremont High School              633
Castlemont High School           569
Coliseum College Prep Academy    520
Madison Park Academy 6-12        515
Elmhurst United Middle School    502
Lockwood STEAM Academy           418
Montera Middle School            367
Greenleaf Elementary             328
Rudsdale Continuation            327
Reach Academy                    323
Edna M Brewer Middle School      317
Laurel Elementary                302

=== Schools in evaldata 2023-24 ===
Number of schools: 81

Top 15 schools by student count:
SiteName_2324
Oakland Technical High School    1923
Oakland High School              1642
Skyline High School              1525
Fremont High School              1271
Castlemont High School            874
Coliseum College Prep Academy     869
Edna M Brewer Midd

## Grade Distribution

In [17]:
# Grade distribution
print("=== Grade in chr_abs ===")
print(chr_abs['Gr'].value_counts().sort_index().to_string())

print("\n=== Grade in evaldata (2023-24) ===")
grade_col = 'Grade_2324'
if grade_col in evaldata.columns:
    print(evaldata[grade_col].dropna().astype(int).value_counts().sort_index().to_string())
    print(f"\nNote: Grade -1 likely means pre-K or TK")

=== Grade in chr_abs ===
Gr
-2      170
-1      865
 0     1580
 1     1428
 2     1450
 3     1424
 4     1301
 5     1372
 6     1208
 7     1204
 8     1287
 9     1177
 10    1258
 11    1310
 12    1507
 15      97

=== Grade in evaldata (2023-24) ===
Grade_2324
-1     1151
 0     2891
 1     3083
 2     3170
 3     2998
 4     3024
 5     3041
 6     2451
 7     2528
 8     2528
 9     2706
 10    2755
 11    2561
 12    2776

Note: Grade -1 likely means pre-K or TK


## GPA and Suspension Analysis

In [18]:
# GPA availability — only for secondary students
print("=== GPA in chr_abs ===")
gpa_mask = chr_abs['Current Weighted Total GPA (GT)'].notna()
print(f"Students with GPA: {gpa_mask.sum()} / {len(chr_abs)} ({gpa_mask.mean()*100:.1f}%)")
print(f"Grade range of students with GPA: {chr_abs.loc[gpa_mask, 'Gr'].min()} - {chr_abs.loc[gpa_mask, 'Gr'].max()}")
print(f"GPA stats:\n{chr_abs['Current Weighted Total GPA (GT)'].describe()}")

print("\n=== Suspensions in chr_abs ===")
susp_mask = chr_abs['NumSusp'].notna()
print(f"Students with suspensions: {susp_mask.sum()} / {len(chr_abs)} ({susp_mask.mean()*100:.1f}%)")
print(f"Suspension stats:\n{chr_abs['NumSusp'].describe()}")

print("\n=== GPA in evaldata (2023-24) ===")
gpa_col = 'CurrWeightedTotGPA_2324'
if gpa_col in evaldata.columns:
    gpa_vals = evaldata[gpa_col].dropna()
    print(f"Students with GPA: {len(gpa_vals)} / {len(evaldata)}")
    print(f"GPA stats:\n{gpa_vals.describe()}")

print("\n=== Suspensions in evaldata (2023-24) ===")
susp_col = 'Susp_2324'
if susp_col in evaldata.columns:
    susp_vals = evaldata[susp_col].dropna()
    print(f"Students with suspensions: {len(susp_vals)} / {len(evaldata)}")
    print(f"Suspension stats:\n{susp_vals.describe()}")

=== GPA in chr_abs ===
Students with GPA: 6786 / 18638 (36.4%)
Grade range of students with GPA: 6 - 15
GPA stats:
count    6786.000000
mean        2.435130
std         1.239431
min         0.000000
25%         1.710000
50%         2.670000
75%         3.430000
max         5.000000
Name: Current Weighted Total GPA (GT), dtype: float64

=== Suspensions in chr_abs ===
Students with suspensions: 955 / 18638 (5.1%)
Suspension stats:
count    955.000000
mean       1.549738
std        1.137500
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max       11.000000
Name: NumSusp, dtype: float64

=== GPA in evaldata (2023-24) ===
Students with GPA: 18280 / 79460
GPA stats:
count    18280.000000
mean         2.505184
std          1.386287
min          0.000000
25%          1.670000
50%          2.880000
75%          3.670000
max          5.000000
Name: CurrWeightedTotGPA_2324, dtype: float64

=== Suspensions in evaldata (2023-24) ===
Students with suspensions: 1439 /

## Duplicate and ID Checks

In [19]:
# Check for duplicate IDs
print("=== Duplicate check ===")
print(f"chr_abs: {len(chr_abs)} rows, {chr_abs['ID'].nunique()} unique IDs")
dup_ids = chr_abs[chr_abs['ID'].duplicated(keep=False)]
if len(dup_ids) > 0:
    print(f"  Duplicate IDs found: {dup_ids['ID'].nunique()} IDs with duplicates")
    print(f"  Sample duplicates:")
    print(dup_ids.sort_values('ID').head(10)[['ID', 'LastName', 'FirstName', 'SiteName', 'Gr', 'AttRate']].to_string())
else:
    print("  No duplicate IDs")

print(f"\nevaldata: {len(evaldata)} rows, {evaldata['ANON_ID'].nunique()} unique ANON_IDs")
dup_anon = evaldata[evaldata['ANON_ID'].duplicated(keep=False)]
if len(dup_anon) > 0:
    print(f"  Duplicate ANON_IDs found: {dup_anon['ANON_ID'].nunique()}")
else:
    print("  No duplicate ANON_IDs — each row is one student")

=== Duplicate check ===
chr_abs: 18638 rows, 17656 unique IDs
  Duplicate IDs found: 982 IDs with duplicates
  Sample duplicates:
           ID     LastName  FirstName                       SiteName  Gr  AttRate
16957  270598       Watson  Brooklynn            Young Adult Program  15   0.9441
17956  270598       Watson  Brooklynn            Young Adult Program  15   0.9438
18561  273093  Zavala-Cruz   Kimberly      Madison Park Academy 6-12  12   0.8960
17528  273093  Zavala-Cruz   Kimberly      Madison Park Academy 6-12  12   0.8966
18406  274130          Xie    Allison  Oakland Technical High School  12   0.9441
17388  274130          Xie    Allison  Oakland Technical High School  12   0.9444
17569  274140         Zhen   Michelle  Oakland Technical High School  12   0.9333
18604  274140         Zhen   Michelle  Oakland Technical High School  12   0.9330
18603  274141         Zhen     Elaine  Oakland Technical High School  12   0.8268
17568  274141         Zhen     Elaine  Oakland Tec
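
The sampled duplicates differ only slightly in `AttRate` (for example 0.9441 vs 0.9438), and `DT` ranges from 2025-05-28 to 2025-05-29, so the repeated rows may be two snapshots of the same student rather than two students. One option, sketched below, is to keep the most recent snapshot per `ID`. The tie-break rule and the helper name `keep_latest_snapshot` are assumptions; the `DT` and `DaysEnr` values in the toy frame are illustrative only.

```python
import pandas as pd

def keep_latest_snapshot(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per ID: the latest DT, breaking ties by the larger DaysEnr."""
    ordered = df.sort_values(["ID", "DT", "DaysEnr"])
    return ordered.drop_duplicates(subset="ID", keep="last").reset_index(drop=True)

toy = pd.DataFrame({
    "ID": [270598, 270598, 273093],
    "DT": pd.to_datetime(["2025-05-28", "2025-05-29", "2025-05-29"]),
    "DaysEnr": [179, 180, 180],
    "AttRate": [0.9441, 0.9438, 0.8960],
})
deduped = keep_latest_snapshot(toy)
print(deduped)
```

Whichever rule is chosen, it is worth confirming with the data owner why the repeats exist before dropping rows.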

## DaysEnr Distribution (enrollment duration)

In [20]:
# Students with very few days enrolled — their AttRate may be unreliable
print("=== DaysEnr distribution in chr_abs ===")
print(chr_abs['DaysEnr'].describe())
print(f"\nStudents with DaysEnr < 30: {(chr_abs['DaysEnr'] < 30).sum()}")
print(f"Students with DaysEnr < 90 (half year): {(chr_abs['DaysEnr'] < 90).sum()}")

print("\n=== DaysEnr distribution in evaldata (2023-24) ===")
days_col = 'DaysEnr_2324'
if days_col in evaldata.columns:
    days = evaldata[days_col].dropna()
    print(days.describe())
    print(f"\nStudents with DaysEnr < 30: {(days < 30).sum()}")
    print(f"Students with DaysEnr < 90: {(days < 90).sum()}")

=== DaysEnr distribution in chr_abs ===
count    18638.000000
mean       172.722234
std         23.946625
min          3.000000
25%        179.000000
50%        180.000000
75%        180.000000
max        180.000000
Name: DaysEnr, dtype: float64

Students with DaysEnr < 30: 77
Students with DaysEnr < 90 (half year): 546

=== DaysEnr distribution in evaldata (2023-24) ===
count    36695.000000
mean       166.679793
std         36.307636
min          1.000000
25%        179.000000
50%        180.000000
75%        180.000000
max        180.000000
Name: DaysEnr_2324, dtype: float64

Students with DaysEnr < 30: 778
Students with DaysEnr < 90: 2620


## Key Observations & Next Steps

**Findings to verify after running this notebook:**

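1. **chr_abs is a filtered dataset:** it only contains students who are at risk or chronically absent (maximum AttRate 0.9497) and does NOT represent the full student population. Under the `AttRate < 0.90` target, its only negative examples are the 8,994 at-risk rows at or above 0.90; no student with good attendance (AttRate ≥ 0.95) appears. This means we cannot use it alone to train a classifier.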

2. **evaldata is the main modeling dataset:** it holds longitudinal data for 79,460 students across 7 school years (2017-18 through 2023-24) and covers the full attendance range (AttRate 0.0 to 1.0 in 2023-24). Each row is one student, with year-suffixed columns.

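3. **Dataset linkage is unclear:** chr_abs uses real IDs while evaldata uses anonymized IDs, and the two ID sets share no values (0 direct matches). The 7,948 shared Birthdate+Gen combinations are not a reliable key, because several students can share a birthdate and gender. The datasets are probably not joinable without a crosswalk.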

4. **Modeling approach:** Use evaldata as the primary dataset. Train on years 2017-18 through 2022-23 to predict 2023-24 chronic absenteeism. Define target as `AttRate_2324 < 0.90`.

5. **GPA is only available for secondary students** (about 36% of chr_abs rows, grades 6 and up). It needs careful handling: either use it as a feature for secondary students only, or exclude it.

6. **Suspensions are sparse:** only about 5% of chr_abs rows have a suspension record, and evaldata has 1,439 non-missing `Susp_2324` values. Likely best encoded as binary (any suspension yes/no) plus count.

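7. **Short-enrollment students:** students with very few DaysEnr may need to be filtered out, because an AttRate computed over a handful of days is unreliable. With a cutoff of 30 days this affects 77 chr_abs rows and 778 evaldata students in 2023-24.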
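
Items 4 through 7 can be sketched together as one table-building step. This is a minimal illustration on toy data, not a finished pipeline. It assumes features come from the single prior year (2022-23), that a missing suspension count means no recorded suspension, and that 30 enrolled days is an acceptable cutoff. `build_model_frame`, `MIN_DAYS_ENROLLED`, and the output column names are all hypothetical.

```python
import pandas as pd

MIN_DAYS_ENROLLED = 30  # hypothetical cutoff; the notebook reports counts for both 30 and 90

def build_model_frame(df: pd.DataFrame, target_year="2324", feature_year="2223") -> pd.DataFrame:
    """One row per student with a binary target and a few prior-year features."""
    # Keep students with a target-year attendance rate and enough enrolled days.
    has_target = df[f"AttRate_{target_year}"].notna()
    enough_days = df[f"DaysEnr_{target_year}"] >= MIN_DAYS_ENROLLED
    kept = df[has_target & enough_days]

    out = pd.DataFrame({"ANON_ID": kept["ANON_ID"]})
    out["chronic_absent"] = (kept[f"AttRate_{target_year}"] < 0.90).astype(int)
    out["prev_att_rate"] = kept[f"AttRate_{feature_year}"]
    # Assumption: a missing suspension count means no recorded suspension.
    susp = kept[f"Susp_{feature_year}"].fillna(0)
    out["prev_susp_count"] = susp
    out["prev_any_susp"] = (susp > 0).astype(int)
    # GPA exists mainly for secondary grades, so carry an explicit availability flag.
    gpa = kept[f"CurrWeightedTotGPA_{feature_year}"]
    out["prev_has_gpa"] = gpa.notna().astype(int)
    out["prev_gpa"] = gpa
    return out.reset_index(drop=True)

toy = pd.DataFrame({
    "ANON_ID": [1, 2, 3, 4],
    "AttRate_2324": [0.85, 0.96, None, 0.50],
    "DaysEnr_2324": [180, 180, None, 10],
    "AttRate_2223": [0.88, None, 0.90, 0.70],
    "Susp_2223": [2.0, None, None, None],
    "CurrWeightedTotGPA_2223": [2.5, None, 3.0, None],
})
frame = build_model_frame(toy)
print(frame)
```

Students who were new in 2023-24 have no prior-year attendance, so `prev_att_rate` stays missing for them; how to treat those rows is a separate modeling decision.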