# Data Exploration

In this section, we load the dataset and inspect its structure, missing values, and key columns.


## Loading the Dataset

In this step, we load the crime dataset into a pandas DataFrame.

Because the full dataset is large (hundreds of megabytes and millions of records),
we load only the first 200,000 rows, which keeps development and experimentation fast.

This keeps memory use manageable and lets us explore the data efficiently. Note that `nrows` takes the first rows of the file rather than a random sample, so the counts reported below describe this subset only.
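
If totals over the whole file are needed later, one option is to stream the CSV in chunks rather than raising `nrows`. The sketch below is an illustration only and not part of the original notebook: `count_primary_types` is a hypothetical helper, and the small in-memory CSV stands in for the real file.

```python
import io
from collections import Counter

import pandas as pd


def count_primary_types(source, chunksize=50_000):
    """Count crimes per 'Primary Type' while holding only one chunk in memory."""
    totals = Counter()
    # usecols limits parsing to the one column we need; chunksize makes
    # read_csv return an iterator of DataFrames instead of one big frame.
    for chunk in pd.read_csv(source, usecols=["Primary Type"], chunksize=chunksize):
        totals.update(chunk["Primary Type"].value_counts().to_dict())
    return pd.Series(totals, dtype="int64").sort_values(ascending=False)


# Tiny stand-in for the real CSV (hypothetical rows, for illustration only).
demo_csv = io.StringIO(
    "ID,Primary Type\n"
    "1,THEFT\n2,BATTERY\n3,THEFT\n4,ASSAULT\n5,BATTERY\n6,THEFT\n"
)
type_counts = count_primary_types(demo_csv, chunksize=2)
print(type_counts)
```

With the real file, the same helper could be pointed at the CSV path; the chunk size trades memory for the number of read passes.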


In [5]:
import pandas as pd

# Load first 200,000 rows (safe for large file)
df = pd.read_csv("../data/raw/chicago_crime.csv", nrows=200000)

print("Dataset Loaded Successfully ✅")

print("\nShape of dataset:")
print(df.shape)

print("\nColumns in dataset:")
print(df.columns)

print("\nFirst 5 rows:")
df.head()


Dataset Loaded Successfully ✅

Shape of dataset:
(200000, 23)

Columns in dataset:
Index(['Unnamed: 0', 'ID', 'Case Number', 'Date', 'Block', 'IUCR',
       'Primary Type', 'Description', 'Location Description', 'Arrest',
       'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code',
       'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude',
       'Longitude', 'Location'],
      dtype='str')

First 5 rows:


| | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 10508693 | HZ250496 | 05/03/2016 11:40:00 PM | 013XX S SAWYER AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | ... | 24.0 | 29.0 | 08B | 1154907.0 | 1893681.0 | 2016 | 05/10/2016 03:56:50 PM | 41.864073 | -87.706819 | (41.864073157, -87.706818608) |
| 1 | 89 | 10508695 | HZ250409 | 05/03/2016 09:40:00 PM | 061XX S DREXEL AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | RESIDENCE | False | ... | 20.0 | 42.0 | 08B | 1183066.0 | 1864330.0 | 2016 | 05/10/2016 03:56:50 PM | 41.782922 | -87.604363 | (41.782921527, -87.60436317) |
| 2 | 197 | 10508697 | HZ250503 | 05/03/2016 11:31:00 PM | 053XX W CHICAGO AVE | 470 | PUBLIC PEACE VIOLATION | RECKLESS CONDUCT | STREET | False | ... | 37.0 | 25.0 | 24 | 1140789.0 | 1904819.0 | 2016 | 05/10/2016 03:56:50 PM | 41.894908 | -87.758372 | (41.894908283, -87.758371958) |
| 3 | 673 | 10508698 | HZ250424 | 05/03/2016 10:10:00 PM | 049XX W FULTON ST | 460 | BATTERY | SIMPLE | SIDEWALK | False | ... | 28.0 | 25.0 | 08B | 1143223.0 | 1901475.0 | 2016 | 05/10/2016 03:56:50 PM | 41.885687 | -87.749516 | (41.885686845, -87.749515983) |
| 4 | 911 | 10508699 | HZ250455 | 05/03/2016 10:00:00 PM | 003XX N LOTUS AVE | 820 | THEFT | $500 AND UNDER | RESIDENCE | False | ... | 28.0 | 25.0 | 06 | 1139890.0 | 1901675.0 | 2016 | 05/10/2016 03:56:50 PM | 41.886297 | -87.761751 | (41.886297242, -87.761750709) |
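
`Unnamed: 0` in the column list above is the name pandas assigns to a CSV column whose header cell is empty, which usually means a row index was written out when the file was saved. Whether that is what happened to this file is an assumption. The sketch below reproduces the effect on a small hypothetical CSV and shows two common ways to handle it.

```python
import io

import pandas as pd

# Hypothetical CSV whose first header cell is empty, as happens when a
# DataFrame is saved with its index (values mimic the rows shown above).
csv_text = (
    ",ID,Primary Type\n"
    "3,10508693,BATTERY\n"
    "89,10508695,BATTERY\n"
    "197,10508697,PUBLIC PEACE VIOLATION\n"
)

# Default read: the blank header becomes a column named 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Option 1: drop the column after loading.
dropped = raw.drop(columns=["Unnamed: 0"])

# Option 2: treat the first column as the index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(dropped.columns.tolist())
print(indexed.index.tolist())
```

Dropping the column is the safer choice when the old index carries no meaning; `index_col=0` keeps the original row numbers available if they are needed for tracing records back to the source file.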


## Missing Value Analysis and Crime Type Exploration

In this section, we analyze:

1. Missing values in each column
2. The number of unique crime categories
3. The most frequent crime types
4. The data type of the Date column

This helps us understand data quality and prepare for cleaning and feature engineering.
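
Item 4 matters because a date column read by `read_csv` without date parsing is stored as text, so time-based features cannot be derived from it directly. The following is a minimal sketch of the conversion, assuming the timestamps follow the month/day/year, 12-hour pattern seen in the first rows shown earlier; the two-row `sample` frame is illustrative and not part of the notebook.

```python
import pandas as pd

# Two timestamps copied from the first rows shown earlier.
sample = pd.DataFrame(
    {"Date": ["05/03/2016 11:40:00 PM", "05/03/2016 09:40:00 PM"]}
)

# Assumed format: month/day/year with a 12-hour clock and AM/PM marker.
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y %I:%M:%S %p")

# After conversion the .dt accessor exposes calendar and clock fields.
print(sample["Date"].dtype)
print(sample["Date"].dt.hour.tolist())
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates and is typically faster than letting pandas infer the layout row by row.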


In [6]:
print("\nMissing values per column:")
print(df.isnull().sum())

print("\nNumber of unique crime types:")
print(df["Primary Type"].nunique())

print("\nTop 10 crime types:")
print(df["Primary Type"].value_counts().head(10))

print("\nDate column data type:")
print(df["Date"].dtype)



Missing values per column:
Unnamed: 0                 0
ID                         0
Case Number                0
Date                       0
Block                      0
IUCR                       0
Primary Type               0
Description                0
Location Description     173
Arrest                     0
Domestic                   0
Beat                       0
District                   0
Ward                       5
Community Area            40
FBI Code                   0
X Coordinate            7348
Y Coordinate            7348
Year                       0
Updated On                 0
Latitude                7348
Longitude               7348
Location                7348
dtype: int64

Number of unique crime types:
32

Top 10 crime types:
Primary Type
THEFT                  41887
BATTERY                36568
NARCOTICS              22518
CRIMINAL DAMAGE        21067
ASSAULT                12630
OTHER OFFENSE          12074
BURGLARY               11262
DECEPTIVE PRACTICE   