# **Project Name**    -
FBI Crime Time Series Forecasting



##### **Project Type**    - EDA
##### **Contribution**    - Individual
#Member 1 - Nikhar Roy Chaudhuri

# **Project Summary -**

This project analyzes FBI crime statistics to investigate how crime trends have changed over time in the US. The primary goal was to learn more about trends in criminal incidents and to pinpoint periods of high or low activity, in order to promote greater awareness and planning. The dataset includes monthly and annual records of various crime types. To ensure reliable analysis, it was first cleaned to address missing values and discrepancies. Important patterns and seasonal trends were then identified using exploratory data analysis and visualizations such as line charts and bar plots. For instance, some crimes peaked in particular months, while others showed steady increases or decreases over time. The analysis provided insights into how crime varies by time and location, which can be helpful for law enforcement agencies, researchers, and policymakers. It also lays the foundation for future work, such as forecasting crime trends or developing targeted crime prevention strategies based on historical behavior.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The project aims to analyze historical FBI crime data to identify temporal trends and patterns in reported incidents and forecast future crime occurrences. Understanding these trends is crucial for enabling law enforcement agencies to allocate resources effectively, plan crime prevention strategies, and anticipate surges in specific crime categories.

#### **Define Your Business Objective?**

The objective of this project is to analyze historical FBI crime data to uncover trends and patterns in crime over time. By understanding how crime incidents vary by year and month, the goal is to generate insights that can help law enforcement agencies, government bodies, and policymakers make informed decisions, allocate resources efficiently, and plan effective crime prevention strategies.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
#  STEP 1: Importing Required Libraries

# For data manipulation and numerical operations
import pandas as pd
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For machine learning model building
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# For gradient boosting model
import xgboost as xgb

# For time series analysis (if needed for ARIMA, SARIMA)
import statsmodels.api as sm

# Optional: For geospatial analysis if map-based visualizations are used
# import geopandas as gpd

# Suppress warnings for cleaner outputs
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset

In [None]:
#  STEP 2: Load the Dataset

# Load the training data (Excel format)
train_df = pd.read_excel('/content/Train.xlsx')  # Update the path if needed

# Load the test data (CSV format)
test_df = pd.read_csv('/content/Test (2).csv')  # Ensure correct file name

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
#  STEP 3: Preview the Dataset

# Display the first five rows of the training data to understand structure
train_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
# 📌 STEP 4: Check Dataset Dimensions

# Print the number of rows and columns in both training and test datasets
print("Training Data Shape:", train_df.shape)
print("Test Data Shape:", test_df.shape)

### Dataset Information

In [None]:
# Dataset Info

In [None]:
# Show column data types, non-null counts, and memory usage
train_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
# Check and count how many duplicate rows exist
duplicate_count = train_df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
# Count missing values in each column
missing = train_df.isnull().sum()
print("Missing values in each column:\n", missing)

In [None]:
# Visualizing the missing values

In [None]:
# Create a heatmap to visually inspect missing values across the dataset
plt.figure(figsize=(12, 6))
sns.heatmap(train_df.isnull(), cbar=False, cmap='YlOrRd')
plt.title("Missing Value Heatmap")
plt.show()

### What did you know about your dataset?

- **Dataset size**: The training dataset has 474,565 rows and 13 columns. The test dataset has 162 rows and 4 columns.
- **Data columns**: The main columns are `TYPE`, `HUNDRED_BLOCK`, `NEIGHBOURHOOD`, `X`, `Y`, `Latitude`, `Longitude`, `HOUR`, `MINUTE`, `YEAR`, `MONTH`, `DAY`, and `Date`. They describe what crime occurred, where, and when.
- **Data types**: Most columns are numeric (`int64`, `float64`); a few, such as `TYPE` and `NEIGHBOURHOOD`, are categorical.
- **Missing data**: Some important columns have missing values:
  - `NEIGHBOURHOOD`: 51,000 missing
  - `HOUR` and `MINUTE`: 49,000 missing
  - `HUNDRED_BLOCK`: only 13 rows missing
- **Duplicates**: There are some duplicate rows that might need to be removed or investigated.
- **Time coverage**: The data includes `YEAR`, `MONTH`, and `DAY`, which is useful for time-series forecasting.

**Summary in short**: The dataset is a large crime records dataset with location and timestamp details and some missing values (especially in neighbourhood and time). It is well suited for time-based crime trend analysis or forecasting.
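
The missing-value figures above are easier to compare as percentages of the row count. The following is a minimal sketch of such a summary; the helper name `missing_summary` and the toy frame are hypothetical and only stand in for `train_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy frame standing in for train_df
toy = pd.DataFrame({
    "TYPE": ["Mischief", "Other Theft", "Mischief", "Theft from Vehicle"],
    "NEIGHBOURHOOD": ["West End", None, None, "Fairview"],
    "HOUR": [12.0, np.nan, 3.0, 18.0],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(train_df)` on the real data would give the same kind of table for all 13 columns.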

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
# 📌 Display all column names in the dataset
print("Dataset Columns:\n")
print(train_df.columns.tolist())


In [None]:
# Dataset Describe

In [None]:
# 📌 Summary statistics for numerical columns
train_df.describe()


### Variables Description

| Column Name       | Description                                                                 |
|-------------------|-----------------------------------------------------------------------------|
| `TYPE`            | Type of crime committed (e.g., Theft, Mischief, Assault). Useful for identifying crime trends. |
| `HUNDRED_BLOCK`   | Approximate street-level location of the crime.                             |
| `NEIGHBOURHOOD`   | Official neighborhood name where the incident occurred.                     |
| `X`, `Y`          | UTM coordinate values for spatial positioning.                              |
| `Latitude`, `Longitude` | Geographical coordinates (in degrees) of the crime location.      |
| `HOUR`            | Hour of the day the crime occurred (0–23).                                  |
| `MINUTE`          | Minute of the hour the crime occurred (0–59).                               |
| `YEAR`            | Year of the crime incident.                                                 |
| `MONTH`           | Month when the incident happened (1–12).                                    |
| `DAY`             | Day of the month (1–31).                                                    |
| `Date`            | Complete date of the incident in datetime format.                           |

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
# 📌 Check unique values count for each column in the dataset
print("Number of unique values per column:\n")
print(train_df.nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# STEP 1: Drop duplicate rows (if any)
initial_shape = train_df.shape
train_df.drop_duplicates(inplace=True)
print(f"Duplicates removed: {initial_shape[0] - train_df.shape[0]} rows")

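# STEP 2: Fill missing values
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply
train_df['HUNDRED_BLOCK'] = train_df['HUNDRED_BLOCK'].fillna('Unknown')
train_df['NEIGHBOURHOOD'] = train_df['NEIGHBOURHOOD'].fillna('Unknown')
train_df['HOUR'] = train_df['HOUR'].fillna(-1)
train_df['MINUTE'] = train_df['MINUTE'].fillna(-1)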

# STEP 3: Convert 'Date' to datetime format
train_df['Date'] = pd.to_datetime(train_df['Date'])

# STEP 4: Feature Engineering - Create 'Time_of_Day'
def time_of_day(hour):
    if hour == -1:
        return 'Unknown'
    elif 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

train_df['Time_of_Day'] = train_df['HOUR'].apply(time_of_day)

# STEP 5: Show summary to confirm changes
print("\n✅ Missing values after cleaning:\n")
print(train_df.isnull().sum())

print("\n🆕 Sample of new 'Time_of_Day' feature:\n")
print(train_df[['HOUR', 'Time_of_Day']].head(10))



### What all manipulations have you done and insights you found?

### Data Cleaning and Preparation:

- **Duplicates**: Checked and removed duplicate rows — in this case, no duplicates were found.
- **Missing Values Handling**:
  - Filled missing values in `HUNDRED_BLOCK` and `NEIGHBOURHOOD` with `"Unknown"` to retain entries.
  - Replaced missing values in `HOUR` and `MINUTE` with `-1` to mark them clearly without dropping rows.
- **Datetime Format**: Converted the `Date` column to `datetime` type to support future time-based filtering or resampling.

---

###  Feature Engineering:

- **Time of Day Column**: Created a new categorical feature `Time_of_Day` based on the `HOUR` column:
  - `Morning` (5 AM–12 PM)
  - `Afternoon` (12 PM–5 PM)
  - `Evening` (5 PM–9 PM)
  - `Night` (9 PM–5 AM)
  - `Unknown` (if time is missing)
- This feature will help in analyzing crime patterns during different parts of the day.
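
The same bucketing can also be written without a row-by-row `apply`, which tends to be faster on several hundred thousand rows. This is a minimal sketch using `numpy.select`; the function name `time_of_day_vectorized` is hypothetical, and it assumes missing hours have already been replaced with `-1` as described above.

```python
import numpy as np
import pandas as pd


def time_of_day_vectorized(hours: pd.Series) -> pd.Series:
    """Map hours (0-23, or -1 for missing) to time-of-day labels."""
    conditions = [
        hours == -1,
        hours.between(5, 11),    # 5 AM up to (not including) 12 PM
        hours.between(12, 16),   # 12 PM up to 5 PM
        hours.between(17, 20),   # 5 PM up to 9 PM
    ]
    choices = ["Unknown", "Morning", "Afternoon", "Evening"]
    # Every remaining hour (21-23 and 0-4) falls into Night
    labels = np.select(conditions, choices, default="Night")
    return pd.Series(labels, index=hours.index)


# Hypothetical sample of hours covering every bucket boundary
sample_hours = pd.Series([-1, 0, 4, 5, 11, 12, 16, 17, 20, 21, 23])
sample_labels = time_of_day_vectorized(sample_hours)
print(sample_labels.tolist())
```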

---

### Insights Gained So Far:

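- Missing values in `HUNDRED_BLOCK`, `NEIGHBOURHOOD`, `HOUR`, and `MINUTE` have all been addressed, either with `"Unknown"` or with `-1`.
- No duplicate crime records remain after cleaning, so incidents are not double-counted.
- In the `Time_of_Day` preview, most crimes fall in the **Afternoon** and **Evening** hours. The preview covers only the first ten rows, so the full distribution still has to be checked with a chart.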
- The dataset is now **analysis-ready** for Exploratory Data Analysis (EDA) and machine learning tasks like time series forecasting or monthly crime prediction.
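
For monthly crime prediction, the incident-level rows first need to be collapsed into one count per month and crime type. The sketch below shows one way to do that; the toy `records` frame is hypothetical, and naming the count column `Incident_Counts` is an assumption about the target format.

```python
import pandas as pd

# Hypothetical incident-level records standing in for train_df
records = pd.DataFrame({
    "YEAR":  [2011, 2011, 2011, 2011, 2011],
    "MONTH": [1, 1, 1, 2, 2],
    "TYPE":  ["Mischief", "Mischief", "Other Theft", "Mischief", "Other Theft"],
})

# One row per (YEAR, MONTH, TYPE) holding the number of incidents in that month
monthly_counts = (
    records.groupby(["YEAR", "MONTH", "TYPE"])
    .size()
    .rename("Incident_Counts")
    .reset_index()
)
print(monthly_counts)
```

The resulting table is the natural input for any later forecasting model, since each row is a single observation of the monthly series for one crime type.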



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

In [None]:
import matplotlib.pyplot as plt

# Count plot for crime types
plt.figure(figsize=(10, 5))
train_df['TYPE'].value_counts().plot(kind='bar')
plt.title("Distribution of Crime Types", fontsize=14)
plt.xlabel("Crime Type")
plt.ylabel("Number of Incidents")
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart of counts per crime type shows the distribution of crime types and makes the most frequent ones easy to identify.

##### 2. What is/are the insight(s) found from the chart?

"Theft from Vehicle" is the most common crime, followed by Mischief and Break and Enter Residential/Other.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help law enforcement allocate more patrols and resources to prevent the most frequent crimes, especially thefts. Ignoring this could increase public safety concerns and reduce trust in the system.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

In [None]:
# Chart 2 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Set figure size
plt.figure(figsize=(8, 5))

# Countplot for Time_of_Day
sns.countplot(data=train_df, x='Time_of_Day', order=['Night', 'Morning', 'Afternoon', 'Evening'])

# Add labels and title
plt.title("Crime Count by Time of Day")
plt.xlabel("Time of Day")
plt.ylabel("Number of Crimes")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar chart because it effectively compares the frequency of crimes across different times of the day, making it easy to visually identify when crimes are most and least common.

##### 2. What is/are the insight(s) found from the chart?

Crimes are most frequent during the night, followed by the evening.
Morning sees the lowest number of crimes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help optimize police patrol scheduling and resource allocation to focus more on night and evening hours, which may reduce crime. Ignoring this pattern could result in continued high night-time crime rates, leading to negative public perception and safety concerns, impacting local businesses and tourism.
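
One caveat when reading this chart is that the `Time_of_Day` buckets do not have equal length: Night spans eight clock hours while Evening spans four, so raw counts favor the longer buckets. A minimal sketch of a per-hour comparison follows; the helper `crimes_per_hour` and the toy labels are hypothetical, and the bucket widths are derived from the ranges defined in the data wrangling step.

```python
import pandas as pd

# Number of clock hours in each bucket, following the Time_of_Day ranges
BUCKET_HOURS = {"Morning": 7, "Afternoon": 5, "Evening": 4, "Night": 8}


def crimes_per_hour(time_of_day: pd.Series) -> pd.Series:
    """Divide each bucket's crime count by the number of hours it covers."""
    counts = time_of_day.value_counts().reindex(list(BUCKET_HOURS), fill_value=0)
    return counts / pd.Series(BUCKET_HOURS)


# Hypothetical labels: Night has the most incidents in absolute terms
labels = pd.Series(
    ["Night"] * 8 + ["Evening"] * 6 + ["Morning"] * 7 + ["Afternoon"] * 5 + ["Unknown"] * 2
)
rates = crimes_per_hour(labels)
print(rates)
```

In this toy example Night has the highest raw count, but Evening has the highest count per hour, which is the kind of difference worth checking before scheduling patrols from the raw chart alone.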

#### Chart - 3

In [None]:
# Chart - 3 visualization code

In [None]:
# Pie chart to visualize top 5 crime types
import matplotlib.pyplot as plt

# Count top 5 crime types
crime_counts = train_df['TYPE'].value_counts().nlargest(5)

# Plot pie chart
plt.figure(figsize=(4, 4))
plt.pie(crime_counts, labels=crime_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Top 5 Crime Types Distribution (Pie Chart)')
plt.axis('equal')
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart was chosen to show the proportion of each of the top 5 crime types. It is effective for showing each type's share as a percentage of the top-5 total.

##### 2. What is/are the insight(s) found from the chart?

Theft from Vehicle accounts for the largest portion (over 43%) of incidents among the top 5 crime types, followed by Mischief and Break and Enter Residential/Other. The remaining crime types are significantly less frequent.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help prioritize resource allocation. Since vehicle-related theft is highest, law enforcement and community programs can focus more on vehicle security.
Failing to address this may lead to increased public dissatisfaction and insurance costs, which are negative outcomes.
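
Because the pie chart is built from the top 5 types only, its percentages are shares of that subset rather than of all incidents. The sketch below computes both figures side by side; the helper `top_n_shares` and the toy series are hypothetical.

```python
import pandas as pd


def top_n_shares(types: pd.Series, n: int = 5) -> pd.DataFrame:
    """Compare each top-n type's share of the top-n subset and of all rows."""
    counts = types.value_counts()
    top = counts.nlargest(n)
    return pd.DataFrame({
        "share_of_top_n": top / top.sum() * 100,
        "share_of_all": top / counts.sum() * 100,
    })


# Hypothetical crime types: 5 x A, 3 x B, 2 x C
toy_types = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 2)
shares = top_n_shares(toy_types, n=2)
print(shares)
```

With the real data, `top_n_shares(train_df['TYPE'])` would show how much smaller each share becomes once the less frequent crime types are included in the denominator.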

#### Chart - 4

In [None]:
# Chart - 4 visualization code

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a crosstab of Month vs Type of Crime
crime_month_heatmap = pd.crosstab(train_df['MONTH'], train_df['TYPE'])

# Set the figure size
plt.figure(figsize=(12, 6))

# Create heatmap
sns.heatmap(crime_month_heatmap, cmap="YlGnBu", annot=False)

# Titles and labels
plt.title('Crime Type Frequency by Month', fontsize=16)
plt.xlabel('Crime Type')
plt.ylabel('Month')

# Rotate x-axis labels for readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is ideal for visualizing patterns over two categorical variables, in this case crime type and month, to quickly spot seasonal trends or high-incident crime categories.

##### 2. What is/are the insight(s) found from the chart?

Theft from Vehicle is consistently high across all months, especially from June to October. Some crime types remain relatively steady year-round, while others show slight seasonal spikes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, recognizing seasonal spikes in specific crimes allows law enforcement or city officials to allocate resources more efficiently during peak months. Not addressing these patterns could lead to increased incidents and public dissatisfaction.
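
On a raw-count heatmap the most frequent crime type dominates the color scale, which can hide seasonal patterns in rarer types. One option, sketched here on hypothetical toy records, is to normalize each crime type's column so that it shows the share of that type's incidents falling in each month.

```python
import pandas as pd

# Hypothetical records standing in for train_df
records = pd.DataFrame({
    "MONTH": [1, 1, 7, 7, 7, 1, 7, 7],
    "TYPE": ["Mischief"] * 4 + ["Theft from Vehicle"] * 4,
})

# normalize="columns" makes every crime type's column sum to 1
monthly_share = pd.crosstab(records["MONTH"], records["TYPE"], normalize="columns")
print(monthly_share)
```

Passing `monthly_share` to `sns.heatmap` instead of the raw crosstab would put all crime types on a comparable scale.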

#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
# 📊 Chart - 5 visualization code: Top 10 Neighbourhoods with Most Crimes

import matplotlib.pyplot as plt
import seaborn as sns

# Count crimes per neighbourhood and take top 10
top_neighbourhoods = train_df['NEIGHBOURHOOD'].value_counts().head(10)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x=top_neighbourhoods.values, y=top_neighbourhoods.index, palette="viridis")
plt.title("Top 10 Neighbourhoods with Most Crimes")
plt.xlabel("Number of Crimes")
plt.ylabel("Neighbourhood")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To identify which neighbourhoods have the highest crime counts and may need targeted safety interventions.

##### 2. What is/are the insight(s) found from the chart?

The Central Business District has the highest number of crimes, followed by West End and Fairview.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it helps authorities and businesses focus security resources where they are most needed. High crime in key areas like the Central Business District (CBD) can deter investment and hurt economic activity.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

In [None]:
# Chart - 6 visualization code
# This one shows how crime types vary by day of the week using a stacked bar chart

import seaborn as sns
import matplotlib.pyplot as plt

# Create a new column for Day of the Week
train_df['DayOfWeek'] = train_df['Date'].dt.day_name()

# Group by crime TYPE and DayOfWeek
crime_by_day = train_df.groupby(['DayOfWeek', 'TYPE']).size().unstack().fillna(0)

# Reorder days of week for better readability
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
crime_by_day = crime_by_day.reindex(days_order)

# Plotting
crime_by_day.plot(kind='bar', stacked=True, figsize=(14, 6), colormap='tab20')
plt.title('Crime Type Distribution by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=45)
plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1))
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To understand how crime patterns vary across days of the week using a stacked bar chart.

##### 2. What is/are the insight(s) found from the chart?

- Monday has the highest total crime count.
- "Theft from Vehicle" remains the top crime type every day.
- Weekend days (Saturday, Sunday) have slightly lower total crime counts compared to weekdays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Law enforcement and city patrols can allocate more resources on high-crime weekdays (especially Mondays). This insight improves safety planning and operational efficiency, reducing negative outcomes like increased victimization on busy days.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

In [None]:
# Chart – 7 visualization code

# ✅ Create 'Year_Month' column (first day of each month) from YEAR and MONTH
train_df['Year_Month'] = pd.to_datetime(train_df[['YEAR', 'MONTH']].assign(DAY=1))

# ✅ Group by 'Year_Month' and count number of crimes
monthly_trend = train_df.groupby('Year_Month').size().reset_index(name='Crime_Count')

# ✅ Plot the trend line
plt.figure(figsize=(14, 6))
plt.plot(monthly_trend['Year_Month'], monthly_trend['Crime_Count'], marker='o', linestyle='-')
plt.title('Monthly Crime Trend Over Time')
plt.xlabel('Month')
plt.ylabel('Number of Crimes')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To observe long-term crime patterns and trends over time on a monthly basis.

##### 2. What is/are the insight(s) found from the chart?

Crime incidents showed a gradual decline from 1999 to around 2008, followed by a slight upward trend in 2010–2012.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, identifying periods of rising or falling crime can guide law enforcement in allocating resources. The recent uptick in crimes (post-2010) could indicate growing risks, needing intervention to avoid negative public safety outcomes.
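
Month-to-month swings can make a long-term decline or uptick hard to judge by eye. A 12-month rolling mean averages out one full seasonal cycle and leaves the underlying trend. The sketch below demonstrates this on a hypothetical synthetic series (a constant level of 100 plus a yearly sine wave), not on the real counts.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: level 100 plus a 12-month seasonal swing
months = pd.date_range("2000-01-01", periods=36, freq="MS")
seasonal = 10 * np.sin(2 * np.pi * np.arange(36) / 12)
counts = pd.Series(100 + seasonal, index=months)

# A 12-month window covers exactly one seasonal cycle, so the swing averages out
trend = counts.rolling(window=12).mean()
print(trend.dropna().round(2).head())
```

Applied to the notebook's data, the same call on `monthly_trend.set_index('Year_Month')['Crime_Count']` would give a smoothed line that could be plotted on top of the monthly counts.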

#### Chart - 8

In [None]:
# Chart - 8 visualization code

In [None]:
import matplotlib.pyplot as plt

# Drop rows with missing coordinates
location_df = train_df.dropna(subset=['Latitude', 'Longitude'])

# Plot hexbin
plt.figure(figsize=(10, 6))
plt.hexbin(location_df['Longitude'], location_df['Latitude'], gridsize=50, cmap='inferno', mincnt=1)
plt.colorbar(label='Crime Count')
plt.title('Crime Density by Location (Longitude vs Latitude)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a hexbin plot to visualize the spatial distribution of crimes. It efficiently highlights high-density crime zones using geographic coordinates (latitude and longitude).

##### 2. What is/are the insight(s) found from the chart?

Crimes are heavily concentrated in specific areas of the city, as the intense color clusters show. These likely represent urban zones with higher foot traffic or vulnerable spots.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights help law enforcement or city planners allocate resources more effectively, such as placing more patrols or cameras in hotspots. High concentration in small zones indicates a safety concern in those neighborhoods, which can harm business or residential appeal if not addressed.


#### Chart - 9

In [None]:
# Chart - 9 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Create a pivot table: rows = YEAR, columns = MONTH, values = incident counts
# (uses train_df; `df` is not defined at this point when the notebook runs top to bottom)
heatmap_data = train_df.groupby(['YEAR', 'MONTH']).size().unstack()

# Plotting the heatmap with float formatting
plt.figure(figsize=(12, 6))
sns.heatmap(heatmap_data, annot=True, fmt='.0f', cmap='YlOrBr', linewidths=0.5)

plt.title('Seasonal Crime Volume Heatmap (Year vs Month)', fontsize=14)
plt.xlabel('Month')
plt.ylabel('Year')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is ideal for comparing crime volume trends across both months and years, allowing us to easily identify any seasonal patterns or anomalies in one compact visual.

##### 2. What is/are the insight(s) found from the chart?

Crime volume appears consistent across months in both 2012 and the first half of 2013, with nearly equal counts.
No particular month shows a major spike or drop in reported crimes, which suggests a roughly uniform distribution over time.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that crime is evenly distributed across months helps law enforcement maintain a steady resource allocation year-round, without needing seasonal upscaling. No negative growth trends were observed, only a stable pattern, which simplifies predictive planning.
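
Whether monthly volume is really uniform can be checked with a couple of simple spread measures rather than by eye. The sketch below is one hedged way to do it; the helper `monthly_variation` and both toy series are hypothetical.

```python
import pandas as pd


def monthly_variation(counts: pd.Series) -> pd.Series:
    """Summarize how much monthly counts vary around their mean."""
    return pd.Series({
        "min": counts.min(),
        "max": counts.max(),
        "max_to_min_ratio": counts.max() / counts.min(),
        # Population standard deviation relative to the mean
        "coefficient_of_variation": counts.std(ddof=0) / counts.mean(),
    })


# Hypothetical monthly counts: one flat year and one with a clear swing
flat_year = pd.Series([100] * 12)
swing_year = pd.Series([80, 120] * 6)

flat_stats = monthly_variation(flat_year)
swing_stats = monthly_variation(swing_year)
print(flat_stats)
print(swing_stats)
```

A ratio close to 1 and a coefficient of variation close to 0 would support the reading that no month stands out; larger values would suggest that some seasonality is present.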

#### Chart - 10

In [None]:
# Chart - 10 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample DataFrame with illustrative placeholder counts, not computed from the dataset
# (update this with your actual data before interpreting the chart)
data = {
    'TYPE': ['Break and Enter Residential/Other', 'Mischief', 'Other Theft', 'Theft of Vehicle', 'Vehicle Collision or Pedestrian Struck (with Injury)'],
    '2012': [800, 1000, 950, 1100, 700],
    '2013': [820, 980, 1000, 1150, 720]
}
# Use a separate name so the sample data is not confused with the crime records
yearly_counts = pd.DataFrame(data)

# Convert to long format
df_long = pd.melt(yearly_counts, id_vars='TYPE', var_name='YEAR', value_name='Incident_Counts')

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=df_long, x='TYPE', y='Incident_Counts', hue='YEAR')
plt.title('Top 5 Crime Type Comparison Across 2012 and 2013')
plt.xlabel('Crime Type')
plt.ylabel('Number of Incidents')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.grid(axis='y')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a grouped bar chart to visually compare the incident counts of the top 5 crime types across two years (2012 and 2013). This format clearly shows year-over-year changes for each crime type side by side, making comparisons intuitive and visually distinct.

##### 2. What is/are the insight(s) found from the chart?

In the sample values plotted above:

- Theft of Vehicle had the highest incidents in both years, with a noticeable increase in 2013.
- Mischief slightly decreased from 2012 to 2013.
- Other Theft and Vehicle Collision or Pedestrian Struck (with Injury) remained fairly stable with minor variations.
- Apart from Mischief, every crime type showed either stability or a slight upward trend, suggesting persistent challenges in these categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can guide resource allocation and targeted interventions. For example, since Theft of Vehicle rose in 2013, more patrols or awareness campaigns can be planned around that issue.
On the downside, the rising trend may reflect a negative social or security impact, indicating that current strategies might not be effective, thus requiring re-evaluation.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the training data from the Excel file
file_path = "/content/Train.xlsx"
df = pd.read_excel(file_path)

# Strip extra spaces from column names if any
df.columns = df.columns.str.strip()

# Drop rows with missing HOUR or NEIGHBOURHOOD data
df = df.dropna(subset=['HOUR', 'NEIGHBOURHOOD'])

# Get top 5 neighbourhoods by total crime incidents
top_neighbourhoods = df['NEIGHBOURHOOD'].value_counts().head(5).index

# Filter for top 5 neighbourhoods
df_top = df[df['NEIGHBOURHOOD'].isin(top_neighbourhoods)]

# Group by HOUR and NEIGHBOURHOOD
hourly_crime = df_top.groupby(['HOUR', 'NEIGHBOURHOOD']).size().reset_index(name='Crime_Count')

# Plot
plt.figure(figsize=(14, 6))
sns.lineplot(data=hourly_crime, x='HOUR', y='Crime_Count', hue='NEIGHBOURHOOD', marker='o')
plt.title('Hourly Crime Distribution Across Top 5 Neighbourhoods')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Crimes')
plt.xticks(range(0, 24))
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This line chart effectively visualizes hourly crime patterns across the top 5 crime-prone neighbourhoods, revealing differences in crime intensity throughout the day.

##### 2. What is/are the insight(s) found from the chart?

- Central Business District consistently reports the highest number of crimes, peaking sharply around 18:00 (6 PM).
- West End shows a secondary peak during late evening hours.
- All neighbourhoods experience a dip in crime from 2 AM to 6 AM, indicating low activity during early hours.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights can:

- Help local law enforcement strategically allocate patrol units during high-crime hours.
- Enable businesses to adjust security staff schedules, especially in the Central Business District.
- Assist urban planners and policymakers in designing safer public spaces and improving surveillance during high-risk hours.

There is no direct negative growth, but ignoring such hourly patterns may lead to increased risks and reduced public safety.
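
The peak hours read off the line chart can also be extracted directly from the grouped counts. This is a minimal sketch on a hypothetical table shaped like `hourly_crime` (columns `HOUR`, `NEIGHBOURHOOD`, `Crime_Count`); the neighbourhood names `A` and `B` and all counts are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly counts in the same shape as hourly_crime
hourly_crime = pd.DataFrame({
    "HOUR": [2, 12, 18, 2, 12, 22],
    "NEIGHBOURHOOD": ["A"] * 3 + ["B"] * 3,
    "Crime_Count": [5, 20, 40, 3, 10, 15],
})

# idxmax returns, per neighbourhood, the row label of its largest count
peak_index = hourly_crime.groupby("NEIGHBOURHOOD")["Crime_Count"].idxmax()
peaks = hourly_crime.loc[peak_index].set_index("NEIGHBOURHOOD")[["HOUR", "Crime_Count"]]
print(peaks)
```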

#### Chart - 12

In [None]:
# Chart - 12 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = "/content/Train.xlsx"
df = pd.read_excel(file_path)

# Clean column names
df.columns = df.columns.str.strip()

# Aggregate crime counts by TYPE and NEIGHBOURHOOD
crime_heatmap_data = df.groupby(['NEIGHBOURHOOD', 'TYPE']).size().unstack(fill_value=0)

# Plotting the heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(crime_heatmap_data, cmap='YlOrBr', linewidths=0.5)

plt.title('Crime Type Distribution Across Neighbourhoods', fontsize=16)
plt.xlabel('Crime Type')
plt.ylabel('Neighbourhood')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a heatmap to visualize how various crime types are distributed across all neighborhoods. This format is excellent for quickly spotting which neighborhoods experience higher intensities of specific crimes, as color intensity reflects frequency.

##### 2. What is/are the insight(s) found from the chart?

- Central Business District consistently shows very high levels of crime, especially Theft from Vehicle and Other Theft.
- Neighborhoods like West End, Fairview, and Grandview-Woodland also show moderate to high values in specific crime categories.
- Some neighborhoods like Musqueam and Stanley Park show very low crime counts across all types, indicating low-crime areas.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights can help as follows:

- Law enforcement can allocate patrol resources to areas with crime-specific hotspots.
- City planners and councils can improve lighting, surveillance, or community policing in high-theft zones.
- Real estate and insurance agencies can adjust risk assessment strategies.
- Businesses can take preventive security measures in high-crime neighborhoods.

There is no direct insight suggesting negative growth, but ignoring the high-crime areas without action could reduce business confidence in those regions over time.
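
Since a few neighbourhoods have far more incidents than the rest, raw counts mostly show where crime is high overall. To compare the mix of crime types between neighbourhoods, each row can be divided by its own total. The sketch below uses hypothetical counts; only the neighbourhood and crime type names are taken from the text above.

```python
import pandas as pd

# Hypothetical counts in the same layout as crime_heatmap_data
counts = pd.DataFrame(
    {"Mischief": [10, 2], "Theft from Vehicle": [30, 2]},
    index=["Central Business District", "Musqueam"],
)

# Divide every row by its total so that each neighbourhood sums to 1
type_shares = counts.div(counts.sum(axis=1), axis=0)
print(type_shares)
```

Plotting `type_shares` instead of the raw counts would show which crime type is most characteristic of each neighbourhood, regardless of its overall volume.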



#### Chart - 13

In [None]:
# Chart - 13 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = "/content/Train.xlsx"
df = pd.read_excel(file_path)

# Clean column names if needed
df.columns = df.columns.str.strip()

# Coerce HOUR to numeric; values that cannot be parsed become NaN
df['HOUR'] = pd.to_numeric(df['HOUR'], errors='coerce')

# Drop rows with missing values first: astype(str) would turn NaN into the string 'nan'
df = df.dropna(subset=['TYPE', 'HOUR'])
df['TYPE'] = df['TYPE'].astype(str)
df['HOUR'] = df['HOUR'].astype(int)  # integer hours give clean axis labels

# Group by TYPE and HOUR, then count incidents
heatmap_data = df.groupby(['TYPE', 'HOUR']).size().reset_index(name='Count')

# Pivot for heatmap
heatmap_pivot = heatmap_data.pivot(index='TYPE', columns='HOUR', values='Count').fillna(0)

# Plot heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(heatmap_pivot, cmap="YlGnBu", linewidths=0.5, linecolor='gray')
plt.title('Crime Type Distribution by Hour of Day', fontsize=16)
plt.xlabel('Hour of Day')
plt.ylabel('Crime Type')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here- I chose a heatmap for this chart because it clearly shows how the different crime types fluctuate across the 24 hours of the day. It makes patterns, peaks, and anomalies easy to detect at a glance, especially when comparing many categories such as crime types.

##### 2. What is/are the insight(s) found from the chart?

Answer Here- Theft from Vehicle stands out with significantly higher counts, especially between 5 PM and midnight, making it the most frequent crime type in the late hours.
Mischief, Other Theft, and Break and Enter crimes occur fairly consistently across hours but at much lower volume.
Crimes like Vehicle Collision or Pedestrian Struck (with Injury) are relatively low across all hours, showing rare but consistent occurrences.
Early morning hours (1 AM to 6 AM) show a drop in most crime activity except for Theft from Vehicle, which maintains a notable presence.
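
The hourly pattern described above can be summarized numerically. The following is a rough sketch, not the notebook's own method: `hourly_profile` and `evening_share` are hypothetical helpers, the 17 to 23 window is an assumed definition of "evening", and the toy rows are invented for illustration.

```python
import pandas as pd

def hourly_profile(df, type_col="TYPE", hour_col="HOUR"):
    """Count incidents per crime type (rows) and hour of day (columns)."""
    clean = df[[type_col, hour_col]].copy()
    clean[hour_col] = pd.to_numeric(clean[hour_col], errors="coerce")
    clean = clean.dropna()
    clean[hour_col] = clean[hour_col].astype(int)
    table = pd.crosstab(clean[type_col], clean[hour_col])
    # Ensure all 24 hours appear as columns, even hours with no incidents.
    return table.reindex(columns=range(24), fill_value=0)

def evening_share(profile, start=17, end=23):
    """Fraction of each crime type's incidents with start <= hour <= end."""
    return profile.loc[:, start:end].sum(axis=1) / profile.sum(axis=1)

# Toy rows for illustration only; the last row has a missing hour and is dropped.
toy = pd.DataFrame({
    "TYPE": ["Theft from Vehicle"] * 4 + ["Mischief"] * 4,
    "HOUR": [18, 18, 22, 3, 2, 2, 14, None],
})

profile = hourly_profile(toy)
share = evening_share(profile)
peak_hour = profile.idxmax(axis=1)  # hour with the most incidents per type
print(share)
print(peak_hour)
```

Applied to the real data, `share` would show how much of each crime type falls in the evening window, which is one way to check the claim about Theft from Vehicle.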

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here- Law enforcement and city planning departments can optimize patrol hours and allocate resources effectively by focusing more on high-crime hours (especially evenings).
Businesses and communities can use this information to increase security during high-risk hours, particularly in areas where vehicle-related crimes are common.
It enables proactive crime prevention strategies like awareness campaigns during peak times.
If crime trends like evening spikes in Theft from Vehicle are not addressed, public trust in safety may decline, impacting property value and community satisfaction.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

In [None]:
print(df.columns.tolist())


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load your cleaned data
file_path = "/content/Train.xlsx"
df = pd.read_excel(file_path)

# Strip extra spaces from column names if needed
df.columns = df.columns.str.strip()

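# Select the numerical columns and coerce them to numeric dtypes
# (corr() can fail on non-numeric columns in pandas 2.x)
numerical_cols = ['X', 'Y', 'Latitude', 'Longitude', 'HOUR', 'MINUTE', 'YEAR', 'MONTH', 'DAY']
df_num = df[numerical_cols].apply(pd.to_numeric, errors='coerce')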

# Compute correlation matrix
corr_matrix = df_num.corr()

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here- The correlation heatmap was chosen to identify linear relationships between the numerical features in the dataset. Knowing whether features are redundant, highly related, or independent is crucial for selecting the most impactful variables and improving model performance.



##### 2. What is/are the insight(s) found from the chart?

Answer Here- A strong positive correlation exists between X, Y, and Latitude, indicating that these location-related features likely capture similar spatial information.
Longitude has a strong negative correlation with X, Y, and Latitude.
Temporal features like HOUR, MINUTE, MONTH, and DAY show very weak or negligible correlations with the spatial features, which means they likely provide independent signals.
Beyond the spatial features, no pair of variables is strongly correlated, which limits redundancy when these features are used in modeling.
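
Reading strong pairs off an annotated heatmap can be error-prone, so they can also be listed programmatically. This is a minimal sketch: `strong_pairs` is a hypothetical helper, the 0.8 cutoff is an arbitrary assumption, and the toy columns `a` to `d` are invented values, not features of the dataset.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.8):
    """List variable pairs whose absolute correlation is at least threshold."""
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy columns for illustration only.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],       # moves with a
    "c": [-1, -2, -3, -4, -5],   # moves against a
    "d": [3, 1, 4, 1, 5],        # only weakly related to a
})

pairs = strong_pairs(toy.corr())
print(pairs)
```

With the notebook's `corr_matrix` in place of `toy.corr()`, the result would be the list of feature pairs that are candidates for removal before modeling.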

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset
df = pd.read_excel("/content/Train.xlsx")

# Strip extra spaces from column names, as in the earlier charts
df.columns = df.columns.str.strip()

# Select relevant numerical columns
numerical_cols = ['X', 'Y', 'Latitude', 'Longitude', 'HOUR', 'MINUTE', 'YEAR', 'MONTH', 'DAY']

# Sample the data for performance; cap n so a small dataset does not raise an error
df_sampled = df[numerical_cols].sample(n=min(1000, len(df)), random_state=42)

# Plot the pair plot
sns.pairplot(df_sampled)
plt.suptitle("Pair Plot of Numerical Features", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here- I selected the pair plot to perform a multivariate analysis of the numerical features in the dataset. It visualizes the pairwise relationships between many variables at once, which helps uncover correlations, distributions, outliers, and patterns that are not visible in single-variable plots.

##### 2. What is/are the insight(s) found from the chart?

Answer Here- Some features, such as X and Y or Latitude and Longitude, are tightly clustered and may carry redundant or overlapping spatial information.
Most numeric fields such as HOUR, MINUTE, and DAY appear to be evenly distributed, without strong linear relationships between them.
The YEAR and MONTH fields show temporal distributions but no clear relationship with the other numerical features.
There are possible outliers in X, Y, Latitude, and Longitude, which could indicate incorrect geolocation or rare incidents.
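
The possible coordinate outliers can be flagged explicitly before deciding how to handle them. The sketch below uses the common 1.5 × IQR rule, which is an assumption on my part rather than something the notebook applies; `iqr_outliers` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy coordinate-like values for illustration only; the last one is far from the rest.
toy = pd.Series([10, 11, 12, 11, 10, 12, 11, 0], name="coord")

mask = iqr_outliers(toy)
print(toy[mask])
```

Running the same check on each of `X`, `Y`, `Latitude`, and `Longitude` would show how many rows fall outside the fences; whether those rows are errors or genuine rare incidents still needs manual inspection.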

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here- The client should leverage the time-based, type-based, and location-based insights from the crime data to:
- Allocate police resources efficiently during high-crime hours (especially evenings) and peak months.
- Prioritize high-crime neighbourhoods like the Central Business District for increased surveillance and patrolling.
- Tailor crime prevention strategies to specific crime types like Theft from Vehicle, which show clear hourly and location trends.
- Use data-driven forecasting models to predict future crime trends for proactive planning.

By doing this, the client can make informed decisions, reduce crime rates, and ensure public safety through better strategic planning.
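
As a starting point for the forecasting suggestion, a seasonal-naive baseline can be built from monthly incident counts. This is a sketch under stated assumptions, not a model used in this notebook: `monthly_counts` and `seasonal_naive` are hypothetical helpers, the `YEAR` and `MONTH` columns are taken from the dataset's column list, and the toy rows are invented. A seasonal-naive forecast simply repeats the value from the same month one year earlier, so it serves only as a benchmark that any real model should beat.

```python
import pandas as pd

def monthly_counts(df, year_col="YEAR", month_col="MONTH"):
    """Number of incidents per calendar month as a period-indexed Series."""
    counts = df.groupby([year_col, month_col]).size()
    index = pd.PeriodIndex(
        [pd.Period(year=int(y), month=int(m), freq="M") for y, m in counts.index]
    )
    series = pd.Series(counts.to_numpy(), index=index).sort_index()
    # Months with no recorded incidents become explicit zeros.
    full = pd.period_range(series.index.min(), series.index.max(), freq="M")
    return series.reindex(full, fill_value=0)

def seasonal_naive(series, horizon=12, season=12):
    """Forecast each future month with the value from one season earlier."""
    if len(series) < season:
        raise ValueError("need at least one full season of history")
    last_season = series.iloc[-season:].to_numpy()
    future = pd.period_range(series.index[-1] + 1, periods=horizon, freq="M")
    values = [last_season[i % season] for i in range(horizon)]
    return pd.Series(values, index=future)

# Toy data for illustration only: month m of 2020 has m incidents.
rows = [(2020, m) for m in range(1, 13) for _ in range(m)]
toy = pd.DataFrame(rows, columns=["YEAR", "MONTH"])

history = monthly_counts(toy)
forecast = seasonal_naive(history, horizon=3)
print(forecast)
```

The same two calls could be run per crime type or per neighbourhood by filtering the DataFrame first.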

# **Conclusion**

Through a detailed analysis of the historical FBI crime dataset, we uncovered meaningful patterns in crime distribution across time, location, and type. Visualizations such as hourly trends, seasonal heatmaps, and neighbourhood-wise comparisons revealed that certain crime types like Theft from Vehicle are heavily concentrated in specific areas and peak during late hours.

These insights can empower stakeholders such as law enforcement, city planners, and policymakers to implement data-driven strategies for:
- Targeted resource deployment
- Enhanced crime prevention policies
- Smarter urban safety initiatives

In conclusion, this analysis provides a solid foundation for proactive crime management and supports the broader goal of creating safer communities through intelligent, evidence-based decision-making.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***