# AI101 – Assignment 2a: Chicago Crime Dataset Analysis

**Dataset:** Chicago Crime Dataset (from Kaggle)  
**Source:** [https://www.kaggle.com/datasets/chicago/chicago-crime](https://www.kaggle.com/datasets/chicago/chicago-crime)  

This dataset contains over 7 million reported crime incidents in Chicago from 2001 to the present. Each record includes the crime type, location, whether an arrest was made, date and time, and other details. This notebook replicates the exploratory data analysis from Assignment 2, adapted for this richer, more complex dataset.

---
## Section 1: Importing Libraries

We begin by importing all the libraries needed for data loading, manipulation, and visualization.

- **`pandas`**: For loading the CSV and manipulating the DataFrame (filtering rows, grouping, aggregating).
- **`numpy`**: For numerical operations and handling missing values represented as `NaN`.
- **`matplotlib.pyplot`**: For creating foundational plots and customizing chart appearance.
- **`seaborn`**: For higher-level statistical plots (heatmaps, count plots) that integrate naturally with pandas DataFrames.
- **`warnings.filterwarnings('ignore')`**: Suppresses all warning messages (e.g., deprecation warnings from library updates) so they don't clutter the output. Exceptions are still raised as usual, but occasionally useful warnings are hidden too, so consider removing this line while debugging.
- **`%matplotlib inline`**: Ensures all plots render directly in the notebook output cells rather than in a separate pop-up window.
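
To illustrate the trade-off in suppressing warnings, here is a minimal sketch that uses only the standard library. The `noisy()` helper is hypothetical and stands in for a library call that emits a deprecation warning. The sketch filters a single warning category, which is one possible alternative to the blanket `'ignore'` used in this notebook, not what the notebook itself does.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that warns about deprecation."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning by default
    # Ignore only deprecation warnings; other categories still get through
    warnings.filterwarnings("ignore", category=DeprecationWarning)

    result = noisy()                              # this warning is suppressed
    warnings.warn("data looks odd", UserWarning)  # this warning is recorded

print(result)
print([w.category.__name__ for w in caught])
```

Filtering by category keeps other warnings visible, which may be preferable while debugging.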

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

# Set a consistent visual style for all plots
sns.set_theme(style='whitegrid', palette='muted')
plt.rcParams['figure.dpi'] = 100

---
## Section 2: Loading the Chicago Crime Dataset

We load the dataset from the uploaded CSV file. Because the full dataset contains millions of rows, we use `nrows` to limit the load to 500,000 records — enough for a thorough analysis without exhausting Colab's memory.

- **`pd.read_csv('Crimes_-_2001_to_Present.csv', nrows=500000)`**: Reads the first 500,000 rows of the crime CSV into a DataFrame. The `nrows` parameter is critical for large datasets: attempting to load all 7+ million rows at once could crash the runtime. Note that this takes the first 500,000 rows in file order, not a random sample, so the subset may not cover all years evenly.
- **`parse_dates=['Date']`**: Tells pandas to automatically parse the `Date` column as a proper `datetime` object instead of a plain string. This allows us to extract components like year, month, and hour directly from that column later.
- **`df.head()`**: Previews the first 5 rows so we can see what columns exist and confirm the data loaded correctly.

**What the output represents:** The first five crime records in our DataFrame, showing columns like `ID`, `Date`, `Primary Type` (type of crime), `Description`, `Location Description`, `Arrest` (whether an arrest was made), `Domestic`, `District`, `Ward`, `Latitude`, and `Longitude`.
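
The effect of `nrows` and `parse_dates` can be sketched on a tiny in-memory CSV. The three rows below and their timestamp format are assumptions made for illustration, not rows taken from the real file.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the real crime file
csv_text = """ID,Date,Primary Type,Arrest
1,01/05/2015 11:30:00 PM,THEFT,false
2,03/17/2016 08:15:00 AM,BATTERY,true
3,07/04/2017 02:00:00 PM,THEFT,false
"""

# nrows=2 stops after the first two data rows; parse_dates converts the
# Date strings into datetime values
sample = pd.read_csv(io.StringIO(csv_text), nrows=2, parse_dates=['Date'])

print(sample.shape)
print(sample['Date'].dt.hour.tolist())
```

Only the first two rows are read, and the `.dt` accessor works because `Date` is no longer a plain string.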

In [None]:
df = pd.read_csv('Crimes_-_2001_to_Present.csv',
                 nrows=500000,
                 parse_dates=['Date'])

print(f"Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
df.head()

---
## Section 3: Exploring the Shape and Structure

Before cleaning or analyzing anything, we need to understand what columns we have and what types of data they contain.

- **`df.shape`**: Returns `(rows, columns)`. With 500,000 rows loaded, this confirms how many features (columns) are available.
- **`df.columns.tolist()`**: Prints the full list of column names. The Chicago Crime dataset has columns such as:
  - `ID` – Unique crime identifier
  - `Date` – Date and time of the incident
  - `Primary Type` – Category of crime (e.g., THEFT, ASSAULT, HOMICIDE)
  - `Description` – More specific description within the primary type
  - `Location Description` – Type of location (STREET, RESIDENCE, etc.)
  - `Arrest` – Boolean: was an arrest made?
  - `Domestic` – Boolean: was this a domestic incident?
  - `District`, `Ward`, `Community Area` – Geographic identifiers
  - `Latitude` / `Longitude` – GPS coordinates of the crime
- **`df.dtypes`**: Shows the data type of each column — important for knowing how to handle each one.
- **`df.info()`**: A concise summary including non-null counts, which immediately highlights columns with missing data.

**What the output represents:** A complete structural profile of our dataset — how many columns there are, what they're named, and whether they are numeric, boolean, datetime, or text.

In [None]:
print("Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())
print("\nData Types:")
print(df.dtypes)
print("\nDataset Info:")
df.info()

---
## Section 4: Checking for Missing Values

Missing data is common in large administrative datasets like crime records — location coordinates might be absent, or community area codes may not have been recorded for older incidents.

- **`df.isnull().sum()`**: For each column, counts the total number of `NaN` (null/missing) values.
- **`/ len(df) * 100`**: Converts those counts to percentages so we can compare columns with very different numbers of missing values on an equal scale.
- **`missing_df[missing_df['Missing Count'] > 0]`**: Filters to show only columns that actually have missing data, keeping the output concise.
- **`sort_values('Missing %', ascending=False)`**: Sorts from highest to lowest missingness, so the most problematic columns appear at the top.

**What the output represents:** A ranked table of columns with missing data and what percentage is missing. In the Chicago Crime dataset, `Latitude` and `Longitude` commonly have a small percentage of missing values (crimes reported without a precise location), and `Location` (a combined lat/lon string) mirrors those gaps. This informs our decision: we can still use the dataset for most analyses, but should drop rows with missing coordinates before doing any geographic mapping.
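
As a small illustration of the same pattern, the sketch below builds the missing-value table for a four-row toy DataFrame. The column values are made up; only the column names echo the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data: Latitude is missing in 2 of 4 rows, Ward in 1 of 4, ID in none
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Latitude': [41.88, np.nan, 41.75, np.nan],
    'Ward': [12.0, 3.0, np.nan, 27.0],
})

report = pd.DataFrame({
    'Missing Count': toy.isnull().sum(),
    'Missing %': (toy.isnull().sum() / len(toy) * 100).round(2),
})

# Keep only columns with gaps, worst first
report = report[report['Missing Count'] > 0].sort_values('Missing %', ascending=False)
print(report)
```

`ID` drops out of the table because it has no missing values, and `Latitude` ranks above `Ward`.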

In [None]:
missing = df.isnull().sum()
missing_pct = df.isnull().sum() / len(df) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct.round(2)
}).sort_values('Missing %', ascending=False)

print("Columns with missing values:")
print(missing_df[missing_df['Missing Count'] > 0].to_string())

---
## Section 5: Cleaning the Data

With an understanding of where the data is messy, we apply targeted cleaning steps.

- **`df.drop_duplicates(subset='ID', inplace=True)`**: Removes duplicate crime records. We use `subset='ID'` to check only the unique crime ID column: if two rows share the same `ID`, only the first is kept. Comparing on `ID` alone also catches a record that appears twice with one changed field, which a full-row comparison would treat as two distinct rows. Crimes with identical details but different IDs are kept, as they should be.
- **`df['Year'] = df['Date'].dt.year`**: Extracts the **year** from the parsed `Date` datetime column. The `.dt` accessor lets us pull out components like `.year`, `.month`, `.hour`, `.dayofweek`. This creates a new column `Year` that we'll use for time-series analysis.
- **`df['Month'] = df['Date'].dt.month`**: Extracts the numeric **month** (1 = January, 12 = December).
- **`df['Hour'] = df['Date'].dt.hour`**: Extracts the **hour** (0–23 on a 24-hour clock). This lets us analyze what time of day crimes occur most frequently.
- **`df['DayOfWeek'] = df['Date'].dt.day_name()`**: Extracts the full day name (e.g., 'Monday', 'Tuesday') for day-of-week analysis.
- **`df_geo = df.dropna(subset=['Latitude', 'Longitude']).copy()`**: Creates a separate geo-filtered DataFrame for geographic analysis. The `.copy()` makes it an independent object, so later changes to `df_geo` cannot affect `df`. We don't drop these rows from the main `df` because missing coordinates shouldn't exclude records from non-geographic analyses.

**What the output represents:** Confirmation of the cleaned dataset size and the new time-based columns we extracted, which are essential for trend analysis.
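
The deduplication and `.dt` extraction steps can be checked on a toy frame. The IDs and timestamps below are invented for the example.

```python
import pandas as pd

# Two rows share ID 101, so one of them should be dropped
toy = pd.DataFrame({
    'ID': [101, 101, 102],
    'Date': pd.to_datetime([
        '2019-06-01 23:45:00',
        '2019-06-01 23:45:00',
        '2020-01-06 08:05:00',
    ]),
})

toy = toy.drop_duplicates(subset='ID').reset_index(drop=True)

# Pull individual components out of the datetime column
toy['Year'] = toy['Date'].dt.year
toy['Month'] = toy['Date'].dt.month
toy['Hour'] = toy['Date'].dt.hour
toy['DayOfWeek'] = toy['Date'].dt.day_name()

print(toy)
```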

In [None]:
# Remove duplicate crime records
df.drop_duplicates(subset='ID', inplace=True)

# Extract time-based features from the Date column
df['Year']      = df['Date'].dt.year
df['Month']     = df['Date'].dt.month
df['Hour']      = df['Date'].dt.hour
df['DayOfWeek'] = df['Date'].dt.day_name()

# Separate DataFrame for geographic analysis (drops rows with missing coordinates)
df_geo = df.dropna(subset=['Latitude', 'Longitude']).copy()

print(f"Cleaned dataset: {df.shape[0]:,} records")
print(f"Records with location data: {df_geo.shape[0]:,}")
print(f"New columns added: {['Year','Month','Hour','DayOfWeek']}")
df.head(3)

---
## Section 6: Descriptive Statistics

We now summarize the numeric and boolean columns to understand the dataset's basic properties.

- **`df.describe()`**: Computes summary stats for all numeric columns: `ID`, `Beat`, `District`, `Ward`, `Community Area`, `X Coordinate`, `Y Coordinate`, `Year`, `Latitude`, `Longitude`, and our extracted `Month`, `Hour`.
  - **`count`**: Confirms how many non-null values each column has.
  - **`mean`**: For `Year`, this shows the average year of crimes in our 500k sample — tells us the temporal center of the data.
  - **`min` / `max`**: For `Year`, shows the range of years. For `Hour`, confirms 0–23. For `Latitude`/`Longitude`, defines the geographic bounding box of Chicago in our data.
  - **`std`**: For `Hour`, a large standard deviation suggests crimes are spread across the day rather than clustered around one time, though the hourly bar chart in Section 10 shows this more directly.
- **`df[['Arrest','Domestic']].mean() * 100`**: `Arrest` and `Domestic` are boolean columns (`True`/`False`). Taking the mean of a boolean column gives the proportion of `True` values. Multiplying by 100 gives the **percentage**. This tells us: what percent of reported crimes resulted in an arrest, and what percent were domestic incidents.

**What the output represents:** Statistical summaries confirming the data's time range, geographic scope, and key behavioral rates (arrest rate and domestic incident rate) — crucial context for interpreting all further visualizations.
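
The claim that the mean of a boolean column equals the share of `True` values is easy to verify by hand. A minimal sketch with five made-up records:

```python
import pandas as pd

toy = pd.DataFrame({
    'Arrest':   [True, False, False, True, False],   # 2 of 5 are True
    'Domestic': [False, False, True, False, False],  # 1 of 5 is True
})

# True counts as 1 and False as 0, so the mean is the proportion of True
rates = toy[['Arrest', 'Domestic']].mean() * 100
print(rates)
```

Two arrests out of five records give 40%, and one domestic incident out of five gives 20%.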

In [None]:
print("Numeric Summary Statistics:")
display(df.describe().T.round(2))

print("\nBoolean Column Rates:")
rates = df[['Arrest', 'Domestic']].mean() * 100
for col, rate in rates.items():
    print(f"  {col} rate: {rate:.1f}%")

---
## Section 7: Top Crime Types

The most fundamental question we can ask about this dataset is: what kinds of crimes are most common?

- **`df['Primary Type'].value_counts()`**: Counts the occurrences of each unique crime type in the `Primary Type` column. Values are returned in descending order — the most common crime type appears first. The Chicago crime taxonomy includes categories like THEFT, BATTERY, CRIMINAL DAMAGE, NARCOTICS, ASSAULT, etc.
- **`.head(15)`**: We take only the top 15 crime types to keep the chart readable. There are 30+ distinct crime types in the dataset, but the top 15 account for the vast majority of incidents.
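- **`sns.barplot(...)`**: Creates a horizontal bar chart where each bar's length represents the count of that crime type. Because the crime names are passed as `y` and the numeric counts as `x`, seaborn draws the bars horizontally without needing an explicit `orient='h'`.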
- **`plt.xscale('log')`**: Applies a **logarithmic scale** to the x-axis. This is necessary because crime counts vary enormously — THEFT may appear 100,000 times while HOMICIDE appears only a few hundred. A log scale compresses large differences so all bars remain visible and comparable. Without it, the rare crime types would appear as nearly invisible thin bars.

**What the output represents:** A ranked visualization of Chicago's most common crime types. THEFT typically dominates, followed by BATTERY and CRIMINAL DAMAGE. This tells us where public safety resources are most in demand and which categories merit deeper investigation.
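
To check how much of the data a top-N cut actually covers, the share can be computed directly from `value_counts()`. This is a sketch on ten invented records; the `share` calculation is an addition for illustration and is not part of the notebook.

```python
import pandas as pd

# Ten made-up records: 6 THEFT, 3 BATTERY, 1 HOMICIDE
crimes = pd.Series(['THEFT'] * 6 + ['BATTERY'] * 3 + ['HOMICIDE'], name='Primary Type')

counts = crimes.value_counts()  # sorted from most to least common
top2 = counts.head(2)

# Percentage of all records covered by the two most common types
share = top2.sum() / counts.sum() * 100
print(counts)
print(f"Top 2 types cover {share:.0f}% of records")
```

Running the same calculation with `.head(15)` on the real data would show whether the top 15 types really cover most incidents.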

In [None]:
crime_counts = df['Primary Type'].value_counts().head(15)

plt.figure(figsize=(12, 7))
sns.barplot(x=crime_counts.values, y=crime_counts.index, palette='rocket_r')
plt.title('Top 15 Crime Types in Chicago', fontsize=15, fontweight='bold')
plt.xlabel('Number of Incidents (log scale)')
plt.ylabel('Crime Type')
plt.xscale('log')
plt.tight_layout()
plt.show()

print("\nTop 15 Crime Type Counts:")
print(crime_counts.to_string())

---
## Section 8: Arrest Rate by Crime Type

Beyond just counting crimes, we want to know which types of crimes are most likely to lead to an arrest — a key metric of law enforcement effectiveness.

- **`df.groupby('Primary Type')['Arrest'].mean()`**: This is a **groupby + aggregation** operation — one of the most powerful patterns in pandas:
  1. **`groupby('Primary Type')`**: Splits the DataFrame into groups, one for each unique crime type.
  2. **`['Arrest'].mean()`**: For each group, computes the mean of the `Arrest` boolean column. Since `True` = 1 and `False` = 0, the mean equals the **proportion of arrests** within each crime type.
- **`* 100`**: Converts proportions (0–1) to percentages (0–100).
- **`.sort_values(ascending=False)`**: Sorts from highest to lowest arrest rate, so the crimes with the most consistent arrests appear at the top.
- **`plt.axvline(x=overall_arrest_rate, ...)`**: Draws a vertical dashed line at the **overall average arrest rate** across all crimes. This reference line makes it easy to see which crime types are above or below average.

**What the output represents:** A horizontal bar chart showing what percentage of each crime type resulted in an arrest. Crimes like NARCOTICS or PROSTITUTION typically have very high arrest rates (police often catch the perpetrator in the act), while crimes like MOTOR VEHICLE THEFT or BURGLARY tend to have low arrest rates (crimes discovered after the fact, with no suspect present).
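
The groupby-then-mean pattern can be traced on six invented records. Note the parentheses around the multiplication: without them, a chained `.sort_values()` would be applied to the number `100` rather than to the Series.

```python
import pandas as pd

toy = pd.DataFrame({
    'Primary Type': ['THEFT', 'THEFT', 'THEFT', 'THEFT', 'NARCOTICS', 'NARCOTICS'],
    'Arrest':       [False,   False,   False,   True,    True,        True],
})

# Per-type arrest rate in percent, highest first
arrest_rate = (
    (toy.groupby('Primary Type')['Arrest'].mean() * 100)
    .sort_values(ascending=False)
)

# Overall rate across all records, used as the reference line in the chart
overall = toy['Arrest'].mean() * 100

print(arrest_rate)
print(f"Overall: {overall:.1f}%")
```

NARCOTICS (2 of 2 arrests) sits above the 50% overall rate, while THEFT (1 of 4) sits below it.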

In [None]:
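# Parenthesize the multiplication so .sort_values() is called on the Series,
# not on the integer 100 (which would raise an AttributeError)
arrest_rate = (
    (df.groupby('Primary Type')['Arrest'].mean() * 100)
    .sort_values(ascending=False)
    .head(20)
)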

overall_arrest_rate = df['Arrest'].mean() * 100

plt.figure(figsize=(12, 8))
sns.barplot(x=arrest_rate.values, y=arrest_rate.index, palette='Blues_r')
plt.axvline(x=overall_arrest_rate, color='red', linestyle='--', linewidth=1.5,
            label=f'Overall Avg: {overall_arrest_rate:.1f}%')
plt.title('Arrest Rate by Crime Type (Top 20)', fontsize=14, fontweight='bold')
plt.xlabel('Arrest Rate (%)')
plt.ylabel('Crime Type')
plt.legend()
plt.tight_layout()
plt.show()

print(f"Overall arrest rate: {overall_arrest_rate:.1f}%")

---
## Section 9: Crime Trends Over Time

Time-series analysis reveals whether crime is increasing or decreasing in Chicago over the years captured in our dataset.

- **`df.groupby('Year').size()`**: Groups records by `Year` and counts how many crimes occurred in each year using `.size()` (which counts rows per group). This creates a Series mapping each year to its crime count.
- **`.reset_index(name='Count')`**: Converts the grouped Series back into a regular DataFrame with columns `Year` and `Count`, making it easier to plot.
- **`sns.lineplot(data=crimes_per_year, x='Year', y='Count', marker='o')`**: Draws a **line chart** of crime counts over years. The `marker='o'` adds a dot at each data point so individual year values are clearly visible. A line chart is the standard tool for showing trends over time because it emphasizes the trajectory between years.
- **`plt.fill_between(..., alpha=0.15)`**: Adds a semi-transparent shaded area under the line. This is a visual enhancement that makes it easier to perceive the overall volume trend at a glance. `alpha=0.15` sets the transparency so it doesn't obscure the line itself.

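**What the output represents:** A year-by-year trend line of crime volume in the loaded sample. A sustained downward trend would suggest fewer reported crimes over time, and sharp spikes or dips (e.g., during COVID lockdowns in 2020) would be visible and could prompt further investigation. Because only the first 500,000 rows are loaded, the yearly counts depend on how the file is ordered and may not reflect true citywide totals; any partially covered year will look artificially low.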
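
The yearly count table can be sketched on a handful of invented rows. The `Change %` column is an addition for illustration (it is not computed in the notebook) and shows one way to quantify the year-over-year movement that the line chart displays.

```python
import pandas as pd

# Six made-up records: 3 in 2019, 2 in 2020, 1 in 2021
toy = pd.DataFrame({'Year': [2019, 2019, 2019, 2020, 2020, 2021]})

per_year = toy.groupby('Year').size().reset_index(name='Count')

# Year-over-year change in percent; the first year has no predecessor
per_year['Change %'] = per_year['Count'].pct_change().mul(100).round(1)
print(per_year)
```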

In [None]:
crimes_per_year = (
    df.groupby('Year')
    .size()
    .reset_index(name='Count')
)

plt.figure(figsize=(12, 5))
ax = sns.lineplot(data=crimes_per_year, x='Year', y='Count',
                  marker='o', color='steelblue', linewidth=2)
plt.fill_between(crimes_per_year['Year'], crimes_per_year['Count'],
                 alpha=0.15, color='steelblue')
plt.title('Total Crimes per Year in Chicago', fontsize=14, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('Number of Crimes')
plt.xticks(crimes_per_year['Year'].unique(), rotation=45)
plt.tight_layout()
plt.show()

---
## Section 10: Crimes by Hour of Day and Day of Week

Understanding *when* crimes happen reveals temporal patterns that can inform patrol scheduling and preventive measures.

- **`df.groupby('Hour').size()`**: Counts total crimes in each of the 24 hours of the day (0 = midnight to 12:59 AM, 12 = noon, etc.).
- **`df.groupby('DayOfWeek').size()`**: Counts total crimes for each day name.
- **`pd.Categorical(..., categories=[...], ordered=True)`**: Converts the `DayOfWeek` column into an **ordered categorical** type with the days in a logical order (Monday through Sunday). Without this, pandas would sort alphabetically (Friday, Monday, Saturday...) which makes the chart confusing.
- **`plt.subplots(1, 2, figsize=(16, 5))`**: Creates a figure with **two side-by-side plots** in a 1-row, 2-column grid and returns the axes as an array. `axes[0]` is the first (left) plot; `axes[1]` is the second (right).
- **`axes[1].tick_params(axis='x', rotation=30)`**: Rotates the day-name labels on the x-axis by 30 degrees to prevent them from overlapping.

**What the output represents:** Two bar charts: one showing crime volume across the 24 hours of the day, another across the 7 days of the week. Urban crime data often shows a peak around midnight and a trough in the pre-dawn hours (roughly 4–6 AM). Weekends often see slightly higher counts than weekdays.
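
The difference an ordered categorical makes can be seen on a few invented day names. In this sketch `observed=False` is passed explicitly so that days with no records still appear with a count of zero.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
toy = pd.DataFrame({'DayOfWeek': ['Friday', 'Monday', 'Saturday',
                                  'Friday', 'Monday', 'Friday']})

# Plain strings: groups come back in alphabetical order
alphabetical = toy.groupby('DayOfWeek').size()

# Ordered categorical: groups follow day_order instead
toy['DayOfWeek'] = pd.Categorical(toy['DayOfWeek'], categories=day_order, ordered=True)
by_weekday = toy.groupby('DayOfWeek', observed=False).size()

print(alphabetical.index.tolist())
print(by_weekday)
```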

In [None]:
crimes_by_hour = df.groupby('Hour').size().reset_index(name='Count')

day_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
df['DayOfWeek'] = pd.Categorical(df['DayOfWeek'], categories=day_order, ordered=True)
crimes_by_day = df.groupby('DayOfWeek').size().reset_index(name='Count')

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Hour of day
sns.barplot(data=crimes_by_hour, x='Hour', y='Count', ax=axes[0], color='coral')
axes[0].set_title('Crimes by Hour of Day', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Hour (24h)')
axes[0].set_ylabel('Number of Crimes')

# Day of week
sns.barplot(data=crimes_by_day, x='DayOfWeek', y='Count', ax=axes[1], palette='Set2')
axes[1].set_title('Crimes by Day of Week', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Day of Week')
axes[1].set_ylabel('Number of Crimes')
axes[1].tick_params(axis='x', rotation=30)

plt.suptitle('Temporal Patterns in Chicago Crime', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

---
## Section 11: Top Crime Locations

The `Location Description` column tells us where crimes took place — on a street, in a residence, at a gas station, etc. Understanding the most common crime locations can guide targeted safety measures.

- **`df['Location Description'].value_counts().head(15)`**: Counts how often each location type appears and returns the top 15. The full dataset has 100+ distinct location types.
- **`sns.barplot(x=loc_counts.values, y=loc_counts.index)`**: Plots a horizontal bar chart where each row represents a location type and the bar length shows how many crimes occurred there.
- **`for i, v in enumerate(loc_counts.values):`** with **`ax.text(...)`**: This loop iterates over each bar and adds a text label showing the exact count at the end of the bar. `enumerate()` gives both the index `i` (for vertical positioning) and the value `v` (the count to display). This is more informative than forcing the reader to mentally read bar lengths against the axis.

**What the output represents:** A ranking of where Chicago crimes most frequently occur. "STREET" is typically the top location by a wide margin, followed by "RESIDENCE" and "APARTMENT". This makes intuitive sense — outdoor public spaces have the highest exposure, while indoor crimes at residences represent a large share due to domestic incidents and burglaries.
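
A manual `ax.text(...)` loop needs a hand-picked offset that suits the scale of the data. As a possible alternative, recent Matplotlib versions (3.4 and later) provide `Axes.bar_label`, which positions the labels relative to each bar. The sketch below uses plain Matplotlib bars and three invented counts; it illustrates the idea and is not the approach used in this notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts for three location types
loc_counts = pd.Series({'STREET': 1200, 'RESIDENCE': 800, 'APARTMENT': 450})

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.barh(list(loc_counts.index), loc_counts.values)
ax.invert_yaxis()  # put the largest bar at the top

# Attach a formatted count to the end of each bar, 3 points past the bar edge
labels = ax.bar_label(bars, labels=[f'{v:,}' for v in loc_counts.values], padding=3)

ax.margins(x=0.1)  # leave room on the right for the label text
fig.tight_layout()
```

The `padding` is given in points, so it stays visually constant whether the bars represent hundreds or hundreds of thousands of incidents.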

In [None]:
loc_counts = df['Location Description'].value_counts().head(15)

fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(x=loc_counts.values, y=loc_counts.index, ax=ax, palette='magma_r')

# Add count labels at end of each bar
for i, v in enumerate(loc_counts.values):
    ax.text(v + 200, i, f'{v:,}', va='center', fontsize=9)

ax.set_title('Top 15 Crime Locations in Chicago', fontsize=14, fontweight='bold')
ax.set_xlabel('Number of Incidents')
ax.set_ylabel('Location Description')
plt.tight_layout()
plt.show()

---
## Section 12: Monthly Crime Heatmap by Crime Type

A pivot heatmap lets us simultaneously visualize crime type AND seasonal trends in a single compact chart.

- **`df_top.groupby(['Primary Type', 'Month']).size().reset_index(name='Count')`**: Groups by both crime type and month simultaneously, producing a count of crimes for each crime type × month combination.
- **`.pivot(index='Primary Type', columns='Month', values='Count')`**: **Reshapes (pivots)** the data from a long format (one row per month-crime combination) into a wide matrix format where:
  - Rows are crime types
  - Columns are months (1–12)
  - Cell values are crime counts
  This is exactly the shape `sns.heatmap` expects.
- **`.fillna(0)`**: Any month/crime combination with no incidents produces `NaN` after pivoting — we replace these with `0` so they render correctly in the heatmap.
- **`top_crimes`**: We limit to the top 10 most common crime types to keep the heatmap readable — including all 30+ types would make it too cluttered.
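- **`sns.heatmap(..., cmap='YlOrRd', linewidths=0.3, annot=False)`**: Renders the matrix as a color grid. `YlOrRd` (Yellow-Orange-Red) makes it immediately intuitive: lighter yellow = fewer crimes, darker red = more crimes. `annot=False` leaves the cells unlabeled; a number format such as `fmt='.0f'` would only matter if annotations were turned on.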

**What the output represents:** A grid showing how different crime types vary across months. For example, THEFT typically spikes in summer months (more pedestrians, more outdoor activity), while BATTERY may remain consistently high year-round. This seasonal insight is valuable for predictive policing models.
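
The long-to-wide reshape can be followed on five invented records. The `reindex` step is an extra safeguard in this sketch: it guarantees one column for each of the 12 months even when a month has no rows at all, which matters if the columns are later renamed to month names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Primary Type': ['THEFT', 'THEFT', 'THEFT', 'BATTERY', 'BATTERY'],
    'Month':        [1,       1,       7,       7,         12],
})

# Long format: one row per (crime type, month) combination that occurs
long_counts = (
    toy.groupby(['Primary Type', 'Month'])
    .size()
    .reset_index(name='Count')
)

# Wide format: crime types as rows, months as columns
wide = long_counts.pivot(index='Primary Type', columns='Month', values='Count')

# Add the months that never occurred, then fill every gap with 0
wide = wide.reindex(columns=range(1, 13)).fillna(0).astype(int)
print(wide)
```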

In [None]:
top_crimes = df['Primary Type'].value_counts().head(10).index
df_top = df[df['Primary Type'].isin(top_crimes)]

pivot = (
    df_top.groupby(['Primary Type', 'Month'])
    .size()
    .reset_index(name='Count')
    .pivot(index='Primary Type', columns='Month', values='Count')
    .fillna(0)
)

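# Ensure all 12 month columns exist before renaming; a month with no rows
# in the sample would otherwise cause a length mismatch on assignment
pivot = pivot.reindex(columns=range(1, 13), fill_value=0)
month_names = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
pivot.columns = month_names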

plt.figure(figsize=(14, 6))
sns.heatmap(pivot, cmap='YlOrRd', linewidths=0.3, annot=False,
            cbar_kws={'label': 'Crime Count'})
plt.title('Monthly Crime Heatmap by Crime Type (Top 10)', fontsize=14, fontweight='bold')
plt.xlabel('Month')
plt.ylabel('Crime Type')
plt.tight_layout()
plt.show()

---
## Section 13: Crime Distribution by District

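Chicago's police districts are numbered up to 25, although only 22 are active today because a few were consolidated, so older records may carry retired district numbers. Analyzing crime counts by district reveals which areas have the highest crime burden.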

- **`df.groupby('District').size().sort_values(ascending=False)`**: Groups records by district number and counts crimes, sorted from most to fewest.
- **`df.groupby('District')['Arrest'].mean() * 100`**: Calculates the arrest rate for each district. A district with many crimes but a low arrest rate may indicate resource constraints or harder-to-solve crime types.
- **`fig, axes = plt.subplots(2, 1, figsize=(14, 10), sharex=True)`**: Creates a figure with **two vertically stacked charts**, one for total crime volume and one for arrest rate. `sharex=True` makes both charts use the same district axis, so each district's two bars line up vertically and the metrics can be compared directly.
- **`df_district.sort_values('Crimes', ascending=False)`**: We sort the combined district DataFrame by total crimes for the first chart, making it a ranked comparison. The arrest rate chart uses the same district ordering so patterns are directly comparable.

**What the output represents:** Two charts showing, district by district: (1) total crime volume, and (2) the percentage of those crimes that ended in arrest. A district with high crime AND low arrest rate is a potential focus area for resource allocation. A district with low crime but high arrest rate may indicate effective local policing.
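
The per-district table can also be built in a single `groupby` using pandas' named aggregation. This is a sketch on six invented records and an alternative to constructing the DataFrame from two separate Series; the row with a missing district is dropped automatically because `groupby` skips `NaN` keys by default.

```python
import pandas as pd

toy = pd.DataFrame({
    'District': [11.0, 11.0, 11.0, 8.0, 8.0, None],
    'Arrest':   [True, False, False, True, True, False],
})

summary = (
    toy.groupby('District')
    .agg(Crimes=('Arrest', 'size'), ArrestRate=('Arrest', 'mean'))
    .assign(ArrestRate=lambda d: d['ArrestRate'] * 100)  # proportion -> percent
    .sort_values('Crimes', ascending=False)
    .reset_index()
)

# Turn 11.0 into '11' so the axis labels read as district numbers
summary['District'] = summary['District'].astype(int).astype(str)
print(summary)
```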

In [None]:
df_district = pd.DataFrame({
    'Crimes': df.groupby('District').size(),
    'ArrestRate': df.groupby('District')['Arrest'].mean() * 100
}).dropna().reset_index()
df_district = df_district.sort_values('Crimes', ascending=False)
df_district['District'] = df_district['District'].astype(int).astype(str)

fig, axes = plt.subplots(2, 1, figsize=(14, 10), sharex=True)

# Total crimes per district
sns.barplot(data=df_district, x='District', y='Crimes', ax=axes[0], palette='flare')
axes[0].set_title('Total Crimes per District', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Number of Crimes')
axes[0].set_xlabel('')

# Arrest rate per district
sns.barplot(data=df_district, x='District', y='ArrestRate', ax=axes[1], palette='crest')
axes[1].axhline(y=df['Arrest'].mean()*100, color='red', linestyle='--',
                linewidth=1.5, label='Overall Average')
axes[1].set_title('Arrest Rate (%) per District', fontsize=13, fontweight='bold')
axes[1].set_ylabel('Arrest Rate (%)')
axes[1].set_xlabel('Police District')
axes[1].legend()

plt.suptitle('Crime Volume vs. Arrest Rate by Chicago Police District',
             fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

---
## Section 14: Summary and Key Takeaways

This final section consolidates what we learned from exploring the Chicago Crime Dataset.

- **`df.shape`**: Confirms the final cleaned record count. With up to 500,000 rows (fewer if duplicates were removed) and multiple extracted features, our dataset is now analysis-ready.
- **`df['Primary Type'].nunique()`**: `nunique()` counts the number of **unique values** in the column — this tells us how many distinct crime categories exist in our sample.
- **`df['Year'].min()` / `df['Year'].max()`**: The full year span of records in our loaded sample.
- **`df['Arrest'].mean() * 100`**: The overall arrest rate — what percentage of all reported crimes result in an arrest.
- **Printed summary**: We compile all these statistics into a readable report that could be included at the start of a presentation or submitted alongside the notebook as evidence of analysis completion.

**What the output represents:** A concise analytical summary answering the key questions about the dataset — its scale, scope, and the most important behavioral statistics. This is the kind of summary a data scientist would include in an executive briefing or project report.
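
The summary figures can be collected from the data itself instead of being typed in by hand. A minimal sketch on five invented records follows; the `summary` dictionary is a hypothetical container used only for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Primary Type': ['THEFT', 'THEFT', 'BATTERY', 'NARCOTICS', 'THEFT'],
    'Year':         [2015,    2016,    2016,      2017,        2017],
    'Arrest':       [False,   False,   True,      True,        False],
})

# Every value is computed from the DataFrame, so it stays in sync with the data
summary = {
    'records': len(toy),
    'crime_types': toy['Primary Type'].nunique(),
    'year_range': (int(toy['Year'].min()), int(toy['Year'].max())),
    'arrest_rate_pct': round(float(toy['Arrest'].mean() * 100), 1),
    'top_crime': toy['Primary Type'].value_counts().idxmax(),
}

for key, value in summary.items():
    print(f"{key:>16}: {value}")
```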

In [None]:
print("=" * 55)
print("  CHICAGO CRIME DATASET — ANALYSIS SUMMARY")
print("=" * 55)
print(f"  Total Records Analyzed:     {df.shape[0]:>10,}")
print(f"  Total Features:             {df.shape[1]:>10}")
print(f"  Distinct Crime Types:       {df['Primary Type'].nunique():>10}")
print(f"  Year Range:                 {int(df['Year'].min())} – {int(df['Year'].max())}")
print(f"  Overall Arrest Rate:        {df['Arrest'].mean()*100:>9.1f}%")
print(f"  Domestic Incident Rate:     {df['Domestic'].mean()*100:>9.1f}%")
print(f"  Top Crime Type:             {df['Primary Type'].value_counts().index[0]:>10}")
print(f"  Most Common Location:       {df['Location Description'].value_counts().index[0]:>10}")
print("=" * 55)
print()
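# NOTE: the statements below are hard-coded expectations, not computed results;
# check each one against the charts and tables above before relying on it.
print("Key Insights (expected patterns, verify against the outputs above):")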
print("  1. Theft is the most frequently reported crime type.")
print("  2. Streets and residences are the most common locations.")
print("  3. Crime rates show a general downward trend over time.")
print("  4. Narcotics-related crimes have the highest arrest rates.")
print("  5. Summer months see elevated crime compared to winter.")