Skip to content

A project that analyzes Chicago’s census, crime, and school data using MySQL and Python. It identifies factors that affect educational outcomes of children and young people. A Jupyter notebook shows the analysis, queries, and results. A good way to practice SQL skills and learn about Chicago’s education system.

Notifications You must be signed in to change notification settings

guzmajo/SQL-Using-Python

Repository files navigation

SQL-Using-Python

A project that analyzes Chicago’s census, crime, and school data using MySQL and Python. It identifies factors that affect educational outcomes of children and young people. A Jupyter notebook shows the analysis, queries, and results. A good way to practice SQL skills and learn about Chicago’s education system.

Introduction

Using this Python notebook you will:

  1. Understand three Chicago datasets
  2. Load the three datasets into three tables in a SQLIte database
  3. Execute SQL queries to answer assignment questions

Understand the datasets

To complete the assignment problems in this notebook you will be using three datasets that are available on the city of Chicago's Data Portal:

  1. Socioeconomic Indicators in Chicago
  2. Chicago Public Schools
  3. Chicago Crime Data

1. Socioeconomic Indicators in Chicago

This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at: https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2

2. Chicago Public Schools

This dataset shows all school level performance data used to create CPS School Report Cards for the 2011-2012 school year. This dataset is provided by the city of Chicago's Data Portal.

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at: https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t

3. Chicago Crime Data

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days.

A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

Download the datasets

This assignment requires you to have these three tables populated with a subset of the whole datasets.

In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. Click on the links below to download and save the datasets (.CSV files):

NOTE: Ensure you have downloaded the datasets using the links above instead of directly from the Chicago Data Portal. The versions linked here are subsets of the original datasets and have some of the column names modified to be more database friendly which will make it easier to complete this assignment.

Store the datasets in database tables

To analyze the data using SQL, it first needs to be loaded into SQLite DB. We will create three tables in as under:

  1. CENSUS_DATA
  2. CHICAGO_PUBLIC_SCHOOLS
  3. CHICAGO_CRIME_DATA

Let us now load the ipython-sql extension and establish a connection with the database

  • Here you will be loading the csv files into the pandas Dataframe and then loading the data into the above mentioned sqlite tables.

  • Next you will be connecting to the sqlite database FinalDB.

Refer to the previous lab for hints .

Hands-on Lab: Analyzing a real World Data Set

%load_ext sql
import csv, sqlite3

con = sqlite3.connect("chicagodataset.db")
cur = con.cursor()
!pip install -q pandas==1.1.5
%sql sqlite:///chicagodataset.db
import pandas
df = pandas.read_csv('ChicagoCrimeData.csv')
df.to_sql("ChicagoCrimeData", con, if_exists='replace', index=False,method="multi")
import pandas
df = pandas.read_csv('ChicagoCensusData.csv')
df.to_sql("ChicagoCensusData", con, if_exists='replace', index=False,method="multi")
import pandas
df = pandas.read_csv('ChicagoPublicSchools.csv')
df.to_sql("ChicagoPublicSchools", con, if_exists='replace', index=False,method="multi")
C:\Users\LENOVO\anaconda3\lib\site-packages\pandas\core\generic.py:2605: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  sql.to_sql(
%sql SELECT * FROM ChicagoCrimeData limit 5;
 * sqlite:///chicagodataset.db
Done.
ID CASE_NUMBER DATE BLOCK IUCR PRIMARY_TYPE DESCRIPTION LOCATION_DESCRIPTION ARREST DOMESTIC BEAT DISTRICT WARD COMMUNITY_AREA_NUMBER FBICODE X_COORDINATE Y_COORDINATE YEAR LATITUDE LONGITUDE LOCATION
3512276 HK587712 2004-08-28 047XX S KEDZIE AVE 890 THEFT FROM BUILDING SMALL RETAIL STORE 0 0 911 9 14.0 58.0 6 1155838.0 1873050.0 2004 41.807440500000006 -87.70395585 (41.8074405, -87.703955849)
3406613 HK456306 2004-06-26 009XX N CENTRAL PARK AVE 820 THEFT $500 AND UNDER OTHER 0 0 1112 11 27.0 23.0 6 1152206.0 1906127.0 2004 41.89827996 -87.71640551 (41.898279962, -87.716405505)
8002131 HT233595 2011-04-04 043XX S WABASH AVE 820 THEFT $500 AND UNDER NURSING HOME/RETIREMENT HOME 0 0 221 2 3.0 38.0 6 1177436.0 1876313.0 2011 41.81593313 -87.62464213 (41.815933131, -87.624642127)
7903289 HT133522 2010-12-30 083XX S KINGSTON AVE 840 THEFT FINANCIAL ID THEFT: OVER $300 RESIDENCE 0 0 423 4 7.0 46.0 6 1194622.0 1850125.0 2010 41.74366532 -87.56246276 (41.743665322, -87.562462756)
10402076 HZ138551 2016-02-02 033XX W 66TH ST 820 THEFT $500 AND UNDER ALLEY 0 0 831 8 15.0 66.0 6 1155240.0 1860661.0 2016 41.773455299999995 -87.70648047 (41.773455295, -87.706480471)

Problems

Now write and execute SQL queries to solve assignment problems

Problem 1

Find the total number of crimes recorded in the CRIME table.
%sql SELECT COUNT (*) FROM ChicagoCrimeData;
 * sqlite:///chicagodataset.db
Done.
COUNT (*)
533

Problem 2

List community areas with per capita income less than 11000.
%sql SELECT COMMUNITY_AREA_NAME, PER_CAPITA_INCOME FROM ChicagoCensusData WHERE PER_CAPITA_INCOME < 11000;
 * sqlite:///chicagodataset.db
Done.
COMMUNITY_AREA_NAME PER_CAPITA_INCOME
West Garfield Park 10934
South Lawndale 10402
Fuller Park 10432
Riverdale 8201

Problem 3

List all case numbers for crimes involving minors?(children are not considered minors for the purposes of crime analysis)
%sql SELECT DISTINCT CASE_NUMBER FROM ChicagoCrimeData where DESCRIPTION like '%MINOR%'
 * sqlite:///chicagodataset.db
Done.
CASE_NUMBER
HL266884
HK238408

Problem 4

List all kidnapping crimes involving a child?
%sql SELECT COUNT (*), PRIMARY_TYPE, DESCRIPTION FROM ChicagoCrimeData WHERE PRIMARY_TYPE LIKE 'K%';
 * sqlite:///chicagodataset.db
Done.
COUNT (*) PRIMARY_TYPE DESCRIPTION
1 KIDNAPPING CHILD ABDUCTION/STRANGER

Problem 5

What kinds of crimes were recorded at schools?
%sql SELECT DISTINCT(PRIMARY_TYPE) FROM ChicagoCrimeData WHERE LOCATION_DESCRIPTION LIKE '%SCHOOL%'
 * sqlite:///chicagodataset.db
Done.
PRIMARY_TYPE
BATTERY
CRIMINAL DAMAGE
NARCOTICS
ASSAULT
CRIMINAL TRESPASS
PUBLIC PEACE VIOLATION

Problem 6

List the average safety all for each type of school.
%%sql
    
SELECT "Elementary, Middle, or High School",AVG(SAFETY_SCORE) AS AVERAGE_SAFETY_SCORE
FROM ChicagoPublicSchools
GROUP BY "Elementary, Middle, or High School";
 * sqlite:///chicagodataset.db
Done.
Elementary, Middle, or High School AVERAGE_SAFETY_SCORE
ES 49.52038369304557
HS 49.62352941176471
MS 48.0

Problem 7

List 5 community areas with highest % of households below poverty line
%%sql

SELECT COMMUNITY_AREA_NAME, PERCENT_HOUSEHOLDS_BELOW_POVERTY
FROM ChicagoCensusData
ORDER BY PERCENT_HOUSEHOLDS_BELOW_POVERTY DESC
LIMIT 5 ;
 * sqlite:///chicagodataset.db
Done.
COMMUNITY_AREA_NAME PERCENT_HOUSEHOLDS_BELOW_POVERTY
Riverdale 56.5
Fuller Park 51.2
Englewood 46.6
North Lawndale 43.1
East Garfield Park 42.4

Problem 8

Which community area is most crime prone?
%%sql

SELECT CCD.COMMUNITY_AREA_NUMBER ,COUNT(CCD.COMMUNITY_AREA_NUMBER) AS FREQUENCY
FROM ChicagoCrimeData AS CCD 
GROUP BY CCD.COMMUNITY_AREA_NUMBER
ORDER BY COUNT(CCD.COMMUNITY_AREA_NUMBER) DESC
LIMIT 1;
 * sqlite:///chicagodataset.db
Done.
COMMUNITY_AREA_NUMBER FREQUENCY
25.0 43

Double-click here for a hint

Problem 9

Use a sub-query to find the name of the community area with highest hardship index
%%sql

SELECT COMMUNITY_AREA_NAME
FROM  ChicagoCensusData
WHERE HARDSHIP_INDEX = (SELECT MAX(HARDSHIP_INDEX) FROM ChicagoCensusData);
 * sqlite:///chicagodataset.db
Done.
COMMUNITY_AREA_NAME
Riverdale

Problem 10

Use a sub-query to determine the Community Area Name with most number of crimes?
%%sql

SELECT community_area_name
FROM ChicagoCensusData 
WHERE COMMUNITY_AREA_NUMBER = (    
    SELECT CCD.COMMUNITY_AREA_NUMBER 
    FROM ChicagoCrimeData AS CCD 
    GROUP BY CCD.COMMUNITY_AREA_NUMBER
    ORDER BY COUNT(CCD.COMMUNITY_AREA_NUMBER) DESC
    LIMIT 1)

LIMIT 1;
 * sqlite:///chicagodataset.db
Done.
COMMUNITY_AREA_NAME
Austin

Copyright © 2020 This notebook and its source code are released under the terms of the MIT License.

Author(s)

Hima Vasudevan

Rav Ahuja

Ramesh Sannreddy

Contribtuor(s)

Malika Singla

Change log

Date Version Changed by Change Description
2022-03-04 2.5 Lakshmi Holla Changed markdown.
2021-05-19 2.4 Lakshmi Holla Updated the question
2021-04-30 2.3 Malika Singla Updated the libraries
2021-01-15 2.2 Rav Ahuja Removed problem 11 and fixed changelog
2020-11-25 2.1 Ramesh Sannareddy Updated the problem statements, and datasets
2020-09-05 2.0 Malika Singla Moved lab to course repo in GitLab
2018-07-18 1.0 Rav Ahuja Several updates including loading instructions
2018-05-04 0.1 Hima Vasudevan Created initial version

© IBM Corporation 2020. All rights reserved.

jupyter nbconvert --to markdown DB0201EN-PeerAssign-v5_SQLite.ipynb
  Input In [62]
    jupyter nbconvert --to markdown DB0201EN-PeerAssign-v5_SQLite.ipynb
            ^
SyntaxError: invalid syntax

About

A project that analyzes Chicago’s census, crime, and school data using MySQL and Python. It identifies factors that affect educational outcomes of children and young people. A Jupyter notebook shows the analysis, queries, and results. A good way to practice SQL skills and learn about Chicago’s education system.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published