# Chicago Public Schools (2011–2012)
## Working with a Real-World Dataset using SQL and Python

**Objectives**
- Load the CSV into SQLite
- Explore schema & metadata
- Answer example business questions with SQL
- Add a few simple charts with matplotlib

**Artifacts**
- **SQLite DB**: `/mnt/data/RealWorldData.db`
- **CSV (uploaded)**: [ChicagoPublicSchools.csv](/mnt/data/ChicagoPublicSchools.csv)


### 0) Setup and Connection

In [None]:
%load_ext sql
%sql sqlite:///RealWorldData.db
print("Connected: sqlite:///RealWorldData.db")


### 1) Load CSV into SQLite (idempotent)

In [None]:
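import sqlite3
import pandas as pd
df = pd.read_csv(r"/mnt/data/ChicagoPublicSchools.csv")
# to_sql needs a live connection; open the same database file the %sql magic uses
con = sqlite3.connect("RealWorldData.db")
df.to_sql("CHICAGO_PUBLIC_SCHOOLS_DATA", con=con, if_exists="replace", index=False)
con.close()
len(df)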
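
The "idempotent" part of this step comes from replacing the table on every run, so re-executing the cell never duplicates rows. A minimal standard-library sketch of the same idea follows; the table name `SAMPLE_SCHOOLS`, the helper `load_rows`, and the two sample rows are hypothetical stand-ins, not part of the real dataset.

```python
import csv
import io
import sqlite3

# Hypothetical two-row sample standing in for the real CSV file.
SAMPLE_CSV = "School_ID,Name_of_School\n1,Alpha Elementary\n2,Beta High\n"

def load_rows(con, csv_text):
    """Drop and recreate the table, insert every CSV row, return the row count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    con.execute("DROP TABLE IF EXISTS SAMPLE_SCHOOLS")
    con.execute("CREATE TABLE SAMPLE_SCHOOLS (School_ID INTEGER, Name_of_School TEXT)")
    con.executemany(
        "INSERT INTO SAMPLE_SCHOOLS VALUES (:School_ID, :Name_of_School)", rows
    )
    con.commit()
    return con.execute("SELECT COUNT(*) FROM SAMPLE_SCHOOLS").fetchone()[0]

con = sqlite3.connect(":memory:")
first = load_rows(con, SAMPLE_CSV)
second = load_rows(con, SAMPLE_CSV)  # a second run replaces rather than appends
print(first, second)
```

Both calls report the same count, which is the behavior `if_exists="replace"` gives the pandas version.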


### 2) Metadata

In [None]:
%sql SELECT name FROM sqlite_master WHERE type='table';


In [None]:
%sql SELECT COUNT(name) AS column_count FROM PRAGMA_TABLE_INFO('CHICAGO_PUBLIC_SCHOOLS_DATA');


In [None]:
%sql SELECT name, type, length(type) AS type_len FROM PRAGMA_TABLE_INFO('CHICAGO_PUBLIC_SCHOOLS_DATA');
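
The same metadata can be read without the `%sql` magic, which may help when the notebook extension is unavailable. The sketch below assumes a hypothetical three-column table named `DEMO` in an in-memory database; only the query shapes mirror the cells above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical miniature table; the real one has many more columns.
con.execute(
    'CREATE TABLE DEMO ("School ID" INTEGER, Name_of_School TEXT, Safety_Score REAL)'
)

# sqlite_master lists every table in the database.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]

# pragma_table_info is the table-valued form of PRAGMA table_info.
columns = [(row[0], row[1]) for row in con.execute(
    "SELECT name, type FROM pragma_table_info('DEMO')")]

print(tables)
print(len(columns), columns)
```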


#### 2.1) Column naming checks

In [None]:
cols = %sql SELECT name FROM PRAGMA_TABLE_INFO('CHICAGO_PUBLIC_SCHOOLS_DATA');
cols = cols.DataFrame()
cols.head(10)


In [None]:
names = cols['name'].tolist()
print("Literal 'SCHOOL ID' present? ->", "SCHOOL ID" in names)
print("Columns including 'Community' & 'Area' & 'Name':",
      [c for c in names if ("Community" in c and "Area" in c and "Name" in c)])
print("Columns containing underscores:",
      [c for c in names if "_" in c])


### 3) Core SQL Analysis

**3.1 How many Elementary Schools?**

In [None]:
%sql SELECT COUNT(*) AS elementary_count FROM CHICAGO_PUBLIC_SCHOOLS_DATA WHERE "Elementary, Middle, or High School" = 'ES';
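
The column used in this query contains spaces and commas, so it has to be written as a double-quoted identifier, while the value `'ES'` is a single-quoted string literal. A small sketch of that distinction, using a hypothetical `DEMO` table with made-up rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical sample rows; only the column name mirrors the real table.
con.execute(
    'CREATE TABLE DEMO (Name_of_School TEXT, "Elementary, Middle, or High School" TEXT)'
)
con.executemany(
    "INSERT INTO DEMO VALUES (?, ?)",
    [("Alpha", "ES"), ("Beta", "HS"), ("Gamma", "ES")],
)

# Double quotes mark an identifier; single quotes mark a string literal.
count = con.execute(
    """SELECT COUNT(*) FROM DEMO
       WHERE "Elementary, Middle, or High School" = 'ES'"""
).fetchone()[0]
print(count)
```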


**3.2 Highest Safety Score**

In [None]:
%sql SELECT MAX(CAST(Safety_Score AS INTEGER)) AS MAX_SAFETY_SCORE FROM CHICAGO_PUBLIC_SCHOOLS_DATA;


**3.3 Schools with the highest Safety Score**

In [None]:
%sql SELECT Name_of_School, Safety_Score FROM CHICAGO_PUBLIC_SCHOOLS_DATA WHERE CAST(Safety_Score AS INTEGER) = (SELECT MAX(CAST(Safety_Score AS INTEGER)) FROM CHICAGO_PUBLIC_SCHOOLS_DATA) ORDER BY Name_of_School;
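
Comparing against a `MAX` subquery, rather than using `ORDER BY ... LIMIT 1`, returns every school tied for the top score. The sketch below illustrates this on a hypothetical four-row `DEMO` table; the names and scores are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DEMO (Name_of_School TEXT, Safety_Score TEXT)")
# Hypothetical rows: two schools tie for the maximum, one score is missing.
con.executemany(
    "INSERT INTO DEMO VALUES (?, ?)",
    [("Alpha", "99"), ("Beta", "45"), ("Gamma", "99"), ("Delta", None)],
)

top = [row[0] for row in con.execute(
    """SELECT Name_of_School FROM DEMO
       WHERE CAST(Safety_Score AS INTEGER) = (
           SELECT MAX(CAST(Safety_Score AS INTEGER)) FROM DEMO
       )
       ORDER BY Name_of_School"""
)]
print(top)  # both tied schools appear; the NULL row is ignored by MAX
```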


**3.4 Top 10 schools by Average Student Attendance**

In [None]:
%sql SELECT Name_of_School, Average_Student_Attendance FROM CHICAGO_PUBLIC_SCHOOLS_DATA ORDER BY CAST(REPLACE(Average_Student_Attendance, '%', '') AS FLOAT) DESC LIMIT 10;


**3.5 Bottom 5 schools by Average Student Attendance**

In [None]:
%sql SELECT Name_of_School, Average_Student_Attendance FROM CHICAGO_PUBLIC_SCHOOLS_DATA WHERE Average_Student_Attendance IS NOT NULL ORDER BY CAST(REPLACE(Average_Student_Attendance, '%', '') AS FLOAT) ASC LIMIT 5;
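
SQLite sorts `NULL` ahead of every other value in ascending order, so a bottom-N query can be filled by rows with missing attendance unless they are excluded. The sketch below shows the difference on a hypothetical four-row `DEMO` table; `ORDER_EXPR` is just a local helper string.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DEMO (Name_of_School TEXT, Average_Student_Attendance TEXT)")
# Hypothetical rows, including one school with no attendance figure.
con.executemany(
    "INSERT INTO DEMO VALUES (?, ?)",
    [("Alpha", "96.00%"), ("Beta", "57.90%"), ("Gamma", None), ("Delta", "100.00%")],
)

# Strip the percent sign and cast so values compare as numbers, not text.
ORDER_EXPR = "CAST(REPLACE(Average_Student_Attendance, '%', '') AS FLOAT)"

unfiltered = [r[0] for r in con.execute(
    f"SELECT Name_of_School FROM DEMO ORDER BY {ORDER_EXPR} ASC LIMIT 2")]
filtered = [r[0] for r in con.execute(
    f"SELECT Name_of_School FROM DEMO "
    f"WHERE Average_Student_Attendance IS NOT NULL "
    f"ORDER BY {ORDER_EXPR} ASC LIMIT 2")]

print(unfiltered)  # the NULL row comes first
print(filtered)    # only schools with a real attendance value
```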


**3.6 Schools with attendance < 70%**

In [None]:
%sql SELECT Name_of_School, Average_Student_Attendance FROM CHICAGO_PUBLIC_SCHOOLS_DATA WHERE CAST(REPLACE(Average_Student_Attendance, '%', '') AS FLOAT) < 70 ORDER BY CAST(REPLACE(Average_Student_Attendance, '%', '') AS FLOAT) ASC;


**3.7 Total College Enrollment by Community Area**

In [None]:
%sql SELECT Community_Area_Name, SUM(CAST(College_Enrollment AS INTEGER)) AS TOTAL_ENROLLMENT FROM CHICAGO_PUBLIC_SCHOOLS_DATA GROUP BY Community_Area_Name;


**3.8 Bottom 5 Community Areas by College Enrollment**

In [None]:
%sql SELECT Community_Area_Name, SUM(CAST(College_Enrollment AS INTEGER)) AS TOTAL_ENROLLMENT FROM CHICAGO_PUBLIC_SCHOOLS_DATA GROUP BY Community_Area_Name ORDER BY TOTAL_ENROLLMENT ASC LIMIT 5;
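
The aggregation works in two stages: `GROUP BY` collapses schools into one row per community area, then `ORDER BY ... LIMIT` picks the smallest totals. A minimal sketch on a hypothetical `DEMO` table with invented area names and enrollment figures:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DEMO (Community_Area_Name TEXT, College_Enrollment INTEGER)")
# Hypothetical rows: two schools in NORTH, one each in SOUTH and WEST.
con.executemany(
    "INSERT INTO DEMO VALUES (?, ?)",
    [("NORTH", 100), ("NORTH", 250), ("SOUTH", 80), ("WEST", 500)],
)

bottom = con.execute(
    """SELECT Community_Area_Name,
              SUM(CAST(College_Enrollment AS INTEGER)) AS TOTAL_ENROLLMENT
       FROM DEMO
       GROUP BY Community_Area_Name
       ORDER BY TOTAL_ENROLLMENT ASC
       LIMIT 2"""
).fetchall()
print(bottom)
```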


**3.9 5 schools with the lowest Safety Score**

In [None]:
%sql SELECT Name_of_School, CAST(Safety_Score AS INTEGER) AS Safety_Score FROM CHICAGO_PUBLIC_SCHOOLS_DATA WHERE Safety_Score != 'None' AND Safety_Score IS NOT NULL ORDER BY CAST(Safety_Score AS INTEGER) ASC LIMIT 5;


### 4) Visualizations (matplotlib)

- Distribution of Average Student Attendance
- Top 10 schools by attendance
- Top 10 Community Areas by College Enrollment

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

att_df = %sql SELECT REPLACE(Average_Student_Attendance, '%', '') AS att FROM CHICAGO_PUBLIC_SCHOOLS_DATA;
att_df = att_df.DataFrame()
att_df['att'] = pd.to_numeric(att_df['att'], errors='coerce')
att_df = att_df.dropna()

plt.figure()
plt.hist(att_df['att'], bins=20)
plt.xlabel("Average Student Attendance (%)")
plt.ylabel("Number of Schools")
plt.title("Distribution of Average Student Attendance")
plt.show()


In [None]:
top10_att = %sql SELECT Name_of_School, CAST(REPLACE(Average_Student_Attendance, '%', '') AS FLOAT) AS att FROM CHICAGO_PUBLIC_SCHOOLS_DATA ORDER BY att DESC LIMIT 10;
top10_att = top10_att.DataFrame()

plt.figure()
plt.barh(top10_att['Name_of_School'][::-1], top10_att['att'][::-1])
plt.xlabel("Average Student Attendance (%)")
plt.title("Top 10 Schools by Attendance")
plt.tight_layout()
plt.show()


In [None]:
comm = %sql SELECT Community_Area_Name, SUM(CAST(College_Enrollment AS INTEGER)) AS TOTAL_ENROLLMENT FROM CHICAGO_PUBLIC_SCHOOLS_DATA GROUP BY Community_Area_Name ORDER BY TOTAL_ENROLLMENT DESC;
comm = comm.DataFrame().dropna(subset=['Community_Area_Name']).head(10)

plt.figure()
plt.barh(comm['Community_Area_Name'][::-1], comm['TOTAL_ENROLLMENT'][::-1])
plt.xlabel("Total College Enrollment")
plt.title("Top 10 Community Areas by College Enrollment")
plt.tight_layout()
plt.show()


### 5) Close

In [None]:
print("Notebook ready. Run cells top-to-bottom.")
