A project that analyzes Chicago’s census, crime, and school data using MySQL and Python. It identifies factors that affect educational outcomes of children and young people. A Jupyter notebook shows the analysis, queries, and results. A good way to practice SQL skills and learn about Chicago’s education system.
Using this Python notebook you will:
- Understand three Chicago datasets
- Load the three datasets into three tables in a SQLIte database
- Execute SQL queries to answer assignment questions
To complete the assignment problems in this notebook you will be using three datasets that are available on the city of Chicago's Data Portal:
This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.
A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at: https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2
This dataset shows all school level performance data used to create CPS School Report Cards for the 2011-2012 school year. This dataset is provided by the city of Chicago's Data Portal.
A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at: https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days.
A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
This assignment requires you to have these three tables populated with a subset of the whole datasets.
In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. Click on the links below to download and save the datasets (.CSV files):
NOTE: Ensure you have downloaded the datasets using the links above instead of directly from the Chicago Data Portal. The versions linked here are subsets of the original datasets and have some of the column names modified to be more database friendly which will make it easier to complete this assignment.
To analyze the data using SQL, it first needs to be loaded into SQLite DB. We will create three tables in as under:
- CENSUS_DATA
- CHICAGO_PUBLIC_SCHOOLS
- CHICAGO_CRIME_DATA
Let us now load the ipython-sql extension and establish a connection with the database
-
Here you will be loading the csv files into the pandas Dataframe and then loading the data into the above mentioned sqlite tables.
-
Next you will be connecting to the sqlite database FinalDB.
Refer to the previous lab for hints .
Hands-on Lab: Analyzing a real World Data Set
%load_ext sqlimport csv, sqlite3
con = sqlite3.connect("chicagodataset.db")
cur = con.cursor()
!pip install -q pandas==1.1.5%sql sqlite:///chicagodataset.dbimport pandas
df = pandas.read_csv('ChicagoCrimeData.csv')
df.to_sql("ChicagoCrimeData", con, if_exists='replace', index=False,method="multi")import pandas
df = pandas.read_csv('ChicagoCensusData.csv')
df.to_sql("ChicagoCensusData", con, if_exists='replace', index=False,method="multi")import pandas
df = pandas.read_csv('ChicagoPublicSchools.csv')
df.to_sql("ChicagoPublicSchools", con, if_exists='replace', index=False,method="multi")C:\Users\LENOVO\anaconda3\lib\site-packages\pandas\core\generic.py:2605: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
sql.to_sql(
%sql SELECT * FROM ChicagoCrimeData limit 5; * sqlite:///chicagodataset.db
Done.
| ID | CASE_NUMBER | DATE | BLOCK | IUCR | PRIMARY_TYPE | DESCRIPTION | LOCATION_DESCRIPTION | ARREST | DOMESTIC | BEAT | DISTRICT | WARD | COMMUNITY_AREA_NUMBER | FBICODE | X_COORDINATE | Y_COORDINATE | YEAR | LATITUDE | LONGITUDE | LOCATION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3512276 | HK587712 | 2004-08-28 | 047XX S KEDZIE AVE | 890 | THEFT | FROM BUILDING | SMALL RETAIL STORE | 0 | 0 | 911 | 9 | 14.0 | 58.0 | 6 | 1155838.0 | 1873050.0 | 2004 | 41.807440500000006 | -87.70395585 | (41.8074405, -87.703955849) |
| 3406613 | HK456306 | 2004-06-26 | 009XX N CENTRAL PARK AVE | 820 | THEFT | $500 AND UNDER | OTHER | 0 | 0 | 1112 | 11 | 27.0 | 23.0 | 6 | 1152206.0 | 1906127.0 | 2004 | 41.89827996 | -87.71640551 | (41.898279962, -87.716405505) |
| 8002131 | HT233595 | 2011-04-04 | 043XX S WABASH AVE | 820 | THEFT | $500 AND UNDER | NURSING HOME/RETIREMENT HOME | 0 | 0 | 221 | 2 | 3.0 | 38.0 | 6 | 1177436.0 | 1876313.0 | 2011 | 41.81593313 | -87.62464213 | (41.815933131, -87.624642127) |
| 7903289 | HT133522 | 2010-12-30 | 083XX S KINGSTON AVE | 840 | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | 0 | 0 | 423 | 4 | 7.0 | 46.0 | 6 | 1194622.0 | 1850125.0 | 2010 | 41.74366532 | -87.56246276 | (41.743665322, -87.562462756) |
| 10402076 | HZ138551 | 2016-02-02 | 033XX W 66TH ST | 820 | THEFT | $500 AND UNDER | ALLEY | 0 | 0 | 831 | 8 | 15.0 | 66.0 | 6 | 1155240.0 | 1860661.0 | 2016 | 41.773455299999995 | -87.70648047 | (41.773455295, -87.706480471) |
Now write and execute SQL queries to solve assignment problems
%sql SELECT COUNT (*) FROM ChicagoCrimeData; * sqlite:///chicagodataset.db
Done.
| COUNT (*) |
|---|
| 533 |
%sql SELECT COMMUNITY_AREA_NAME, PER_CAPITA_INCOME FROM ChicagoCensusData WHERE PER_CAPITA_INCOME < 11000; * sqlite:///chicagodataset.db
Done.
| COMMUNITY_AREA_NAME | PER_CAPITA_INCOME |
|---|---|
| West Garfield Park | 10934 |
| South Lawndale | 10402 |
| Fuller Park | 10432 |
| Riverdale | 8201 |
List all case numbers for crimes involving minors?(children are not considered minors for the purposes of crime analysis)
%sql SELECT DISTINCT CASE_NUMBER FROM ChicagoCrimeData where DESCRIPTION like '%MINOR%' * sqlite:///chicagodataset.db
Done.
| CASE_NUMBER |
|---|
| HL266884 |
| HK238408 |
%sql SELECT COUNT (*), PRIMARY_TYPE, DESCRIPTION FROM ChicagoCrimeData WHERE PRIMARY_TYPE LIKE 'K%'; * sqlite:///chicagodataset.db
Done.
| COUNT (*) | PRIMARY_TYPE | DESCRIPTION |
|---|---|---|
| 1 | KIDNAPPING | CHILD ABDUCTION/STRANGER |
%sql SELECT DISTINCT(PRIMARY_TYPE) FROM ChicagoCrimeData WHERE LOCATION_DESCRIPTION LIKE '%SCHOOL%' * sqlite:///chicagodataset.db
Done.
| PRIMARY_TYPE |
|---|
| BATTERY |
| CRIMINAL DAMAGE |
| NARCOTICS |
| ASSAULT |
| CRIMINAL TRESPASS |
| PUBLIC PEACE VIOLATION |
%%sql
SELECT "Elementary, Middle, or High School",AVG(SAFETY_SCORE) AS AVERAGE_SAFETY_SCORE
FROM ChicagoPublicSchools
GROUP BY "Elementary, Middle, or High School"; * sqlite:///chicagodataset.db
Done.
| Elementary, Middle, or High School | AVERAGE_SAFETY_SCORE |
|---|---|
| ES | 49.52038369304557 |
| HS | 49.62352941176471 |
| MS | 48.0 |
%%sql
SELECT COMMUNITY_AREA_NAME, PERCENT_HOUSEHOLDS_BELOW_POVERTY
FROM ChicagoCensusData
ORDER BY PERCENT_HOUSEHOLDS_BELOW_POVERTY DESC
LIMIT 5 ; * sqlite:///chicagodataset.db
Done.
| COMMUNITY_AREA_NAME | PERCENT_HOUSEHOLDS_BELOW_POVERTY |
|---|---|
| Riverdale | 56.5 |
| Fuller Park | 51.2 |
| Englewood | 46.6 |
| North Lawndale | 43.1 |
| East Garfield Park | 42.4 |
%%sql
SELECT CCD.COMMUNITY_AREA_NUMBER ,COUNT(CCD.COMMUNITY_AREA_NUMBER) AS FREQUENCY
FROM ChicagoCrimeData AS CCD
GROUP BY CCD.COMMUNITY_AREA_NUMBER
ORDER BY COUNT(CCD.COMMUNITY_AREA_NUMBER) DESC
LIMIT 1; * sqlite:///chicagodataset.db
Done.
| COMMUNITY_AREA_NUMBER | FREQUENCY |
|---|---|
| 25.0 | 43 |
Double-click here for a hint
%%sql
SELECT COMMUNITY_AREA_NAME
FROM ChicagoCensusData
WHERE HARDSHIP_INDEX = (SELECT MAX(HARDSHIP_INDEX) FROM ChicagoCensusData); * sqlite:///chicagodataset.db
Done.
| COMMUNITY_AREA_NAME |
|---|
| Riverdale |
%%sql
SELECT community_area_name
FROM ChicagoCensusData
WHERE COMMUNITY_AREA_NUMBER = (
SELECT CCD.COMMUNITY_AREA_NUMBER
FROM ChicagoCrimeData AS CCD
GROUP BY CCD.COMMUNITY_AREA_NUMBER
ORDER BY COUNT(CCD.COMMUNITY_AREA_NUMBER) DESC
LIMIT 1)
LIMIT 1; * sqlite:///chicagodataset.db
Done.
| COMMUNITY_AREA_NAME |
|---|
| Austin |
Copyright © 2020 This notebook and its source code are released under the terms of the MIT License.
| Date | Version | Changed by | Change Description |
|---|---|---|---|
| 2022-03-04 | 2.5 | Lakshmi Holla | Changed markdown. |
| 2021-05-19 | 2.4 | Lakshmi Holla | Updated the question |
| 2021-04-30 | 2.3 | Malika Singla | Updated the libraries |
| 2021-01-15 | 2.2 | Rav Ahuja | Removed problem 11 and fixed changelog |
| 2020-11-25 | 2.1 | Ramesh Sannareddy | Updated the problem statements, and datasets |
| 2020-09-05 | 2.0 | Malika Singla | Moved lab to course repo in GitLab |
| 2018-07-18 | 1.0 | Rav Ahuja | Several updates including loading instructions |
| 2018-05-04 | 0.1 | Hima Vasudevan | Created initial version |
jupyter nbconvert --to markdown DB0201EN-PeerAssign-v5_SQLite.ipynb Input In [62]
jupyter nbconvert --to markdown DB0201EN-PeerAssign-v5_SQLite.ipynb
^
SyntaxError: invalid syntax