# **Project Name**    - FBI Crime Time Series Forecasting



##### **Project Type**    - ML_Submission
##### **Contribution**    - Individual
##### **Member 1**    - Nikhar Roy Chaudhuri

GitHub Link -

# **Problem Statement**


The project aims to analyze historical FBI crime data to identify temporal trends and patterns in reported incidents and forecast future crime occurrences. Understanding these trends is crucial for enabling law enforcement agencies to allocate resources effectively, plan crime prevention strategies, and anticipate surges in specific crime categories.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# For data manipulation and numerical operations
import pandas as pd
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For machine learning model building
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# For gradient boosting model
import xgboost as xgb

# For time series analysis (if needed for ARIMA, SARIMA)
import statsmodels.api as sm

# Optional: For geospatial analysis if map-based visualizations are used
# import geopandas as gpd

# Suppress warnings for cleaner outputs
import warnings
warnings.filterwarnings("ignore")


### Dataset Loading

In [None]:
# Load Dataset

# Load the training data (Excel format)
train_df = pd.read_excel('/content/Train.xlsx')

# Load the test data (CSV format)
test_df = pd.read_csv('/content/Test (2).csv')


### Dataset First View

In [None]:
# Dataset First Look

In [None]:
# Display the first five rows of the training data to understand structure
train_df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
# Print the number of rows and columns in both training and test datasets
print("Training Data Shape:", train_df.shape)
print("Test Data Shape:", test_df.shape)


### Dataset Information

In [None]:
# Dataset Info

In [None]:
# Get dataset overview including data types and non-null counts
train_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
# Check and count how many duplicate rows exist
duplicate_count = train_df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
# Count missing values in each column
missing = train_df.isnull().sum()
print("Missing values in each column:\n", missing)


In [None]:
# Visualizing the missing values

In [None]:
# Create a heatmap to visually inspect missing values across the dataset
plt.figure(figsize=(12, 6))
sns.heatmap(train_df.isnull(), cbar=False, cmap='YlOrRd')
plt.title("Missing Value Heatmap")
plt.show()

### What did you know about your dataset?

Answer Here-

- **Dataset Size**: The training dataset has 474,565 rows and 13 columns. The test dataset has 162 rows and 4 columns.
- **Data Columns**: `TYPE`, `HUNDRED_BLOCK`, `NEIGHBOURHOOD`, `X`, `Y`, `Latitude`, `Longitude`, `HOUR`, `MINUTE`, `YEAR`, `MONTH`, `DAY`, and `Date`. Together they describe what crime occurred, where, and when.
- **Data Types**: Most columns are numeric (`int64`, `float64`); a few, such as `TYPE` and `NEIGHBOURHOOD`, are categorical.
- **Missing Data**: Some important columns have missing values:
  - `NEIGHBOURHOOD`: 51,000 missing
  - `HOUR` and `MINUTE`: 49,000 missing
  - `HUNDRED_BLOCK`: only 13 rows missing
- **Duplicates**: There are some duplicate rows that might need to be removed or investigated.
- **Time Coverage**: The data includes `YEAR`, `MONTH`, and `DAY`, which is useful for time-series forecasting.

**Summary in short**: The dataset is a large crime records dataset with location and timestamp details and some missing values (especially in neighbourhood and time). It is well suited for time-based crime trend analysis or forecasting.
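The missing-value figures above can be hard to compare without the share of rows they represent. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` and a made-up toy frame standing in for `train_df`, of how counts and percentages could be reported side by side.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have gaps.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for train_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "TYPE": ["Mischief", "Other Theft", "Mischief", "Other Theft"],
    "NEIGHBOURHOOD": ["West End", None, "Fairview", None],
    "HOUR": [10.0, np.nan, 22.0, 3.0],
})

summary = missing_summary(toy_df)
print(summary)
```

In the notebook itself, `missing_summary(train_df)` would be the natural call.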

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
#  Display all column names in the dataset
print("Dataset Columns:\n")
print(train_df.columns.tolist())


In [None]:
# Dataset Describe

In [None]:
#  Summary statistics for numerical columns
train_df.describe()

### Variables Description

Answer Here-

| **Column Name**         | **Description**                                                                 |
|-------------------------|---------------------------------------------------------------------------------|
| `TYPE`                  | Type of crime committed (e.g., Theft, Mischief, Assault). Useful for identifying crime trends. |
| `HUNDRED_BLOCK`         | Approximate street-level location of the crime.                                 |
| `NEIGHBOURHOOD`         | Official neighborhood name where the incident occurred.                         |
| `X`, `Y`                | UTM coordinate values for spatial positioning.                                  |
| `Latitude`, `Longitude` | Geographical coordinates (in degrees) of the crime location.                    |
| `HOUR`                  | Hour of the day the crime occurred (0–23).                                      |
| `MINUTE`                | Minute of the hour the crime occurred (0–59).                                   |
| `YEAR`                  | Year of the crime incident.                                                     |
| `MONTH`                 | Month when the incident happened (1–12).                                        |
| `DAY`                   | Day of the month (1–31).                                                        |
| `Date`                  | Complete date of the incident in datetime format.                               |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
#  Check unique values count for each column in the dataset
print("Number of unique values per column:\n")
print(train_df.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# STEP 1: Drop duplicate rows (if any)
initial_shape = train_df.shape
train_df.drop_duplicates(inplace=True)
print(f"Duplicates removed: {initial_shape[0] - train_df.shape[0]} rows")

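# STEP 2: Fill missing values
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and which may not modify train_df.
train_df['HUNDRED_BLOCK'] = train_df['HUNDRED_BLOCK'].fillna('Unknown')
train_df['NEIGHBOURHOOD'] = train_df['NEIGHBOURHOOD'].fillna('Unknown')
train_df['HOUR'] = train_df['HOUR'].fillna(-1)      # -1 marks a missing hour
train_df['MINUTE'] = train_df['MINUTE'].fillna(-1)  # -1 marks a missing minute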

# STEP 3: Convert 'Date' to datetime format
train_df['Date'] = pd.to_datetime(train_df['Date'])

# STEP 4: Feature Engineering - Create 'Time_of_Day'
def time_of_day(hour):
    if hour == -1:
        return 'Unknown'
    elif 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

train_df['Time_of_Day'] = train_df['HOUR'].apply(time_of_day)

# STEP 5: Show summary to confirm changes
print("\n Missing values after cleaning:\n")
print(train_df.isnull().sum())

print("\n Sample of new 'Time_of_Day' feature:\n")
print(train_df[['HOUR', 'Time_of_Day']].head(10))

### What all manipulations have you done and insights you found?

Answer Here-

### Data Cleaning and Preparation:

- **Duplicates**: Checked for duplicate rows and dropped any that were found; the wrangling cell prints how many rows were removed.
- **Missing Values Handling**:
  - Filled missing values in `HUNDRED_BLOCK` and `NEIGHBOURHOOD` with `"Unknown"` to retain entries.
  - Replaced missing values in `HOUR` and `MINUTE` with `-1` to mark them clearly without dropping rows.
- **Datetime Format**: Converted the `Date` column to `datetime` type to support future time-based filtering or resampling.

---

###  Feature Engineering:

- **Time of Day Column**: Created a new categorical feature `Time_of_Day` based on the `HOUR` column:
  - `Morning` (5 AM–12 PM)
  - `Afternoon` (12 PM–5 PM)
  - `Evening` (5 PM–9 PM)
  - `Night` (9 PM–5 AM)
  - `Unknown` (if time is missing)
- This feature will help in analyzing crime patterns during different parts of the day.
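The row-wise `apply` used for `Time_of_Day` can be slow on a frame of this size. Below is a sketch of a vectorized alternative using `np.select`; the function name `time_of_day_vectorized` is hypothetical, and unlike the original function it also maps `NaN` hours to `Unknown` (an assumption made here for safety).

```python
import numpy as np
import pandas as pd


def time_of_day_vectorized(hours: pd.Series) -> pd.Series:
    """Map hours to the same buckets as time_of_day, without a Python loop.

    Hypothetical helper; NaN hours are treated like the -1 sentinel.
    """
    conditions = [
        hours.isna() | (hours == -1),      # missing or sentinel value
        (hours >= 5) & (hours < 12),       # Morning
        (hours >= 12) & (hours < 17),      # Afternoon
        (hours >= 17) & (hours < 21),      # Evening
    ]
    labels = ["Unknown", "Morning", "Afternoon", "Evening"]
    # Anything not matched above (21 to 23 and 0 to 4) falls through to 'Night'
    values = np.select(conditions, labels, default="Night")
    return pd.Series(values, index=hours.index)


# Made-up hours covering every bucket boundary
sample_hours = pd.Series([-1, 0, 5, 11, 12, 16, 17, 20, 21, 23])
buckets = time_of_day_vectorized(sample_hours).tolist()
print(buckets)
```

In the notebook this would be applied as `train_df['Time_of_Day'] = time_of_day_vectorized(train_df['HOUR'])`.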

---

### Insights Gained So Far:

- The cleaned columns no longer contain nulls: `HUNDRED_BLOCK` and `NEIGHBOURHOOD` use `"Unknown"`, and `HOUR` and `MINUTE` use `-1`. These placeholders must be excluded from any hour-based statistics or plots.
- Duplicate rows have been dropped, so each record is counted once.
- The `Time_of_Day` preview shows only the first 10 rows, so it cannot establish when crimes are most frequent; Chart 2 examines the full distribution.
- The dataset is now **analysis-ready** for Exploratory Data Analysis (EDA) and machine learning tasks like time series forecasting or monthly crime prediction.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

In [None]:
import matplotlib.pyplot as plt

# Count plot for crime types
plt.figure(figsize=(10, 5))
train_df['TYPE'].value_counts().plot(kind='bar')
plt.title("Distribution of Crime Types", fontsize=12)
plt.xlabel("Crime Type")
plt.ylabel("Number of Incidents")
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here-To visualise the distribution of different types of crimes and identify the most frequent ones.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-"Theft from Vehicle" is the most common crime, followed by Mischief and Break and Enter Residential/Other.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, these insights can help law enforcement allocate more patrols and resources to prevent the most frequent crimes, especially thefts. Ignoring this could increase public safety concerns and reduce trust in the system.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Set figure size
plt.figure(figsize=(8, 5))

# Countplot for Time_of_Day
sns.countplot(data=train_df, x='Time_of_Day', order=['Night', 'Morning', 'Afternoon', 'Evening'])

# Add labels and title
plt.title("Crime Count by Time of Day")
plt.xlabel("Time of Day")
plt.ylabel("Number of Crimes")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a bar chart because it effectively compares the frequency of crimes across different times of the day, making it easy to visually identify when crimes are most and least common.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Crimes are most frequent during the night, followed by the evening. Morning sees the lowest number of crimes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, these insights can help optimize police patrol scheduling and resource allocation to focus more on night and evening hours, which may reduce crime. Ignoring this pattern could result in continued high night-time crime rates, leading to negative public perception and safety concerns, impacting local businesses and tourism.
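One caveat when reading this chart: the `Time_of_Day` buckets have different lengths (Morning 7 hours, Afternoon 5, Evening 4, Night 8), so raw counts favour the longer windows. The sketch below, using made-up counts rather than the real data, shows how dividing by window length can change which bucket looks busiest.

```python
import pandas as pd

# Length of each bucket in hours, taken from the time_of_day definition
window_hours = pd.Series({"Morning": 7, "Afternoon": 5, "Evening": 4, "Night": 8})

# Made-up counts for illustration only; in the notebook these would come from
# train_df['Time_of_Day'].value_counts() with 'Unknown' excluded
counts = pd.Series({"Morning": 70, "Afternoon": 60, "Evening": 56, "Night": 80})

# Crimes per hour of window, which is comparable across buckets
per_hour = counts / window_hours

print(pd.DataFrame({"count": counts, "per_hour": per_hour}))
print("Largest raw count:", counts.idxmax())
print("Highest hourly rate:", per_hour.idxmax())
```

With these illustrative numbers, Night has the largest total but Evening has the highest hourly rate.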

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Pie chart to visualize top 5 crime types
import matplotlib.pyplot as plt

# Count top 5 crime types
crime_counts = train_df['TYPE'].value_counts().nlargest(5)

# Plot pie chart
plt.figure(figsize=(4, 4))
plt.pie(crime_counts, labels=crime_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Top 5 Crime Types Distribution (Pie Chart)')
plt.axis('equal')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-A pie chart was chosen to show the relative share of each of the top 5 crime types. Note that the percentages are computed over the top 5 combined, not over all recorded crimes.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Theft from Vehicle is the largest slice, at over 43% of the top-5 total, followed by Mischief and Break and Enter Residential/Other. The remaining crime types are significantly less frequent.
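Because the pie chart is built from `value_counts().nlargest(5)`, its percentages are shares of the top-5 subtotal rather than of the whole dataset. The sketch below uses a made-up toy column (types `A` to `D` are placeholders) to show how the two kinds of share can be compared.

```python
import pandas as pd

# Toy crime-type column standing in for train_df['TYPE'] (made-up values)
types = pd.Series(["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"] * 1, name="TYPE")

counts = types.value_counts()
top = counts.nlargest(2)  # the notebook uses nlargest(5)

# Share within the plotted slices (what autopct displays)
share_of_top = (top / top.sum() * 100).round(1)
# Share of every record in the dataset
share_of_all = (top / counts.sum() * 100).round(1)

print(pd.DataFrame({"share_of_top_%": share_of_top, "share_of_all_%": share_of_all}))
```

The gap between the two columns grows as the categories left out of the pie become more numerous.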

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, the insights help prioritize resource allocation—since vehicle-related theft is highest, law enforcement and community programs can focus more on vehicle security. Failing to address this may lead to increased public dissatisfaction and insurance costs—hence, negative outcomes.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Create a crosstab of Month vs Type of Crime
crime_month_heatmap = pd.crosstab(train_df['MONTH'], train_df['TYPE'])

# Set the figure size
plt.figure(figsize=(12, 6))

# Create heatmap
sns.heatmap(crime_month_heatmap, cmap="YlGnBu", annot=False)

# Titles and labels
plt.title('Crime Type Frequency by Month', fontsize=16)
plt.xlabel('Crime Type')
plt.ylabel('Month')

# Rotate x-axis labels for readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-A heatmap is ideal for visualizing patterns over two categorical variables — in this case, crime type and month — to quickly spot seasonal trends or high-incident crime categories.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Theft from Vehicle is consistently high across all months, especially from June to October. Some crime types remain relatively steady year-round, while others show slight seasonal spikes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, recognizing seasonal spikes in specific crimes allows law enforcement or city officials to allocate resources more efficiently during peak months. Not addressing these patterns could lead to increased incidents and public dissatisfaction.
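In a raw-count heatmap, the most frequent crime type dominates the colour scale and can hide seasonality in rarer types. A possible refinement, sketched here on made-up toy data, is to normalize each crime type's column so that every cell shows the fraction of that type's incidents falling in a given month.

```python
import pandas as pd

# Toy records standing in for train_df (made up for illustration)
toy_df = pd.DataFrame({
    "MONTH": [1, 2, 2, 2, 1, 1, 2, 2],
    "TYPE": ["Mischief"] * 4 + ["Other Theft"] * 4,
})

# normalize="columns" divides each column by its own total,
# so every crime type sums to 1 across months
monthly_share = pd.crosstab(toy_df["MONTH"], toy_df["TYPE"], normalize="columns")
print(monthly_share)
```

The resulting table can be passed to `sns.heatmap` in place of the raw crosstab.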

#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
#  Chart - 5 visualization code: Top 10 Neighbourhoods with Most Crimes

import matplotlib.pyplot as plt
import seaborn as sns

# Count crimes per neighbourhood and take top 10
top_neighbourhoods = train_df['NEIGHBOURHOOD'].value_counts().head(10)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x=top_neighbourhoods.values, y=top_neighbourhoods.index, palette="viridis")
plt.title("Top 10 Neighbourhoods with Most Crimes")
plt.xlabel("Number of Crimes")
plt.ylabel("Neighbourhood")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

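Answer Here-To identify which neighbourhoods record the most crimes and may need targeted safety interventions. The chart shows incident counts, not rates per resident, and the horizontal bars keep the long neighbourhood names readable.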

##### 2. What is/are the insight(s) found from the chart?

Answer Here-The Central Business District has the highest number of crimes, followed by West End and Fairview.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, it helps authorities and businesses focus security resources where they're most needed. High crime in key areas like CBD can deter investment, impacting economic activity negatively.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# This one shows how crime types vary by day of the week using a stacked bar chart

import seaborn as sns
import matplotlib.pyplot as plt

# Create a new column for Day of the Week
train_df['DayOfWeek'] = train_df['Date'].dt.day_name()

# Group by crime TYPE and DayOfWeek
crime_by_day = train_df.groupby(['DayOfWeek', 'TYPE']).size().unstack().fillna(0)

# Reorder days of week for better readability
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
crime_by_day = crime_by_day.reindex(days_order)

# Plotting
crime_by_day.plot(kind='bar', stacked=True, figsize=(14, 6), colormap='tab20')
plt.title('Crime Type Distribution by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=45)
plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1))
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer here- To understand how crime patterns vary across days of the week using a stacked bar chart.

##### 2. What is/are the insight(s) found from the chart?

Answer Here- Monday has the highest total crime count. "Theft from Vehicle" remains the top crime type every day. Weekend days (Saturday, Sunday) have slightly lower total crime counts compared to weekdays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here- Yes. Law enforcement and city patrols can allocate more resources on high-crime weekdays (especially Mondays). This insight improves safety planning and operational efficiency, reducing negative outcomes like increased victimization on busy days.
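Totals per weekday depend partly on how many Mondays, Tuesdays, and so on fall inside the covered period. A sketch of a fairer comparison, using made-up dates in place of `train_df['Date']`, averages incidents per calendar date before grouping by weekday.

```python
import pandas as pd

# Toy incident dates (made up for illustration).
# 2012-01-02 and 2012-01-09 are Mondays; 2012-01-03 is a Tuesday.
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2012-01-02"] * 3 + ["2012-01-09"] * 1 + ["2012-01-03"] * 2
    )
})

days_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Step 1: incidents per calendar date
daily_counts = toy_df.groupby(toy_df["Date"].dt.normalize()).size()

# Step 2: total and average incidents per weekday
weekday = daily_counts.index.day_name()
total_by_day = daily_counts.groupby(weekday).sum().reindex(days_order)
avg_by_day = daily_counts.groupby(weekday).mean().reindex(days_order)

print(pd.DataFrame({"total": total_by_day, "avg_per_date": avg_by_day}).dropna())
```

In this toy example Monday has twice Tuesday's total but the same average per date. Note that dates with zero incidents do not appear in `daily_counts`, so the averages are over dates with at least one record.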

#### Chart - 7

In [None]:
# Chart - 7 visualization code

In [None]:
# Chart – 7 visualization code

# Create 'Year_Month' as the first day of each month, built from YEAR and MONTH
train_df['Year_Month'] = pd.to_datetime(train_df[['YEAR', 'MONTH']].assign(DAY=1))

# Group by 'Year_Month' and count number of crimes
monthly_trend = train_df.groupby('Year_Month').size().reset_index(name='Crime_Count')

# Plot the trend line
plt.figure(figsize=(14, 6))
plt.plot(monthly_trend['Year_Month'], monthly_trend['Crime_Count'], marker='o', linestyle='-')
plt.title('Monthly Crime Trend Over Time')
plt.xlabel('Month')
plt.ylabel('Number of Crimes')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-To observe long-term crime patterns and trends over time on a monthly basis.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Crime incidents showed a gradual decline from 1999 to around 2008, followed by a slight upward trend in 2010–2012.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, identifying periods of rising or falling crime can guide law enforcement in allocating resources. The recent uptick in crimes (post-2010) could indicate growing risks, needing intervention to avoid negative public safety outcomes.
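For forecasting, the monthly series should have one row per month, including months with no records, and a smoothed view can make the long-term trend easier to read. The sketch below uses made-up dates; the 2-month rolling window is only for the toy data, and a 12-month window would be a more natural choice on the real series.

```python
import pandas as pd

# Toy records standing in for train_df (made up for illustration)
toy_df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2012-01-05", "2012-01-20", "2012-02-11",
        "2012-04-02", "2012-04-15", "2012-04-30",
    ]),
    "TYPE": ["Mischief"] * 6,
})

# resample("MS") buckets rows by month start and keeps empty months as 0,
# whereas groupby on a month column would silently skip them
monthly = toy_df.set_index("Date").resample("MS").size()

# Rolling mean to smooth month-to-month noise
smoothed = monthly.rolling(window=2).mean()

print(pd.DataFrame({"count": monthly, "rolling_mean": smoothed}))
```

March has no incidents in the toy data and still appears with a count of 0, which keeps the series evenly spaced.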

#### Chart - 8

In [None]:
# Chart - 8 visualization code

import matplotlib.pyplot as plt

# Drop rows with missing coordinates
location_df = train_df.dropna(subset=['Latitude', 'Longitude'])

# Plot hexbin
plt.figure(figsize=(10, 6))
plt.hexbin(location_df['Longitude'], location_df['Latitude'], gridsize=50, cmap='inferno', mincnt=1)
plt.colorbar(label='Crime Count')
plt.title('Crime Density by Location (Longitude vs Latitude)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a Hexbin plot to visualize the spatial distribution of crimes. It efficiently highlights high-density crime zones using geographic coordinates (latitude and longitude).

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Crimes are heavily concentrated in specific areas of the city — evident from the intense color clusters. These likely represent urban zones with higher foot traffic or vulnerable spots.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes. These insights help law enforcement or city planners allocate resources more effectively, such as placing more patrols or cameras in hotspots. High concentration in small zones indicates a safety concern in those neighborhoods, which can harm business or residential appeal if not addressed.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Create a pivot table: rows = YEAR, columns = MONTH, values = incident counts
heatmap_data = train_df.groupby(['YEAR', 'MONTH']).size().unstack()

# Plotting the heatmap with float formatting
plt.figure(figsize=(12, 6))
sns.heatmap(heatmap_data, annot=True, fmt='.0f', cmap='YlOrBr', linewidths=0.5)

plt.title('Seasonal Crime Volume Heatmap (Year vs Month)', fontsize=14)
plt.xlabel('Month')
plt.ylabel('Year')
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here-A heatmap is ideal for comparing crime volume trends across both months and years, allowing us to easily identify any seasonal patterns or anomalies in one compact visual.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Crime volume appears consistent across months in both 2012 and the first half of 2013, with nearly equal counts. No particular month shows any major spike or drop in reported crimes—this implies uniform distribution over time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes. Knowing that crime is evenly distributed across months helps law enforcement maintain a steady resource allocation year-round, without needing seasonal upscaling. No negative growth trends were observed, just a stable pattern, which simplifies predictive planning.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Illustrative placeholder counts (NOT computed from train_df);
# replace with real aggregates before drawing conclusions
data = {
    'TYPE': ['Break and Enter Residential/Other', 'Mischief', 'Other Theft', 'Theft of Vehicle', 'Vehicle Collision or Pedestrian Struck (with Injury)'],
    '2012': [800, 1000, 950, 1100, 700],
    '2013': [820, 980, 1000, 1150, 720]
}
df = pd.DataFrame(data)

# Convert to long format
df_long = pd.melt(df, id_vars='TYPE', var_name='YEAR', value_name='Incident_Counts')

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=df_long, x='TYPE', y='Incident_Counts', hue='YEAR')
plt.title('Top 5 Crime Type Comparison Across 2012 and 2013')
plt.xlabel('Crime Type')
plt.ylabel('Number of Incidents')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.grid(axis='y')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a grouped bar chart to visually compare the incident counts of the top 5 crime types across two years (2012 and 2013). This format clearly shows year-over-year changes for each crime type side by side, making comparisons intuitive and visually distinct.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-The counts plotted here are hard-coded sample values rather than aggregates from `train_df`, so the observations below describe the sample only and should be re-checked against the real data. In the sample, Theft of Vehicle has the most incidents in both years and rises in 2013. Mischief decreases slightly from 2012 to 2013. Break and Enter Residential/Other, Other Theft, and Vehicle Collision or Pedestrian Struck (with Injury) each rise slightly. Four of the five types therefore show a slight upward trend, suggesting persistent challenges in these categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, the insights can guide resource allocation and targeted interventions. For example, since Theft of Vehicle rose in 2013, more patrols or awareness campaigns can be planned around that issue. On the downside, the rising trend may reflect a negative social or security impact, indicating that current strategies might not be effective, thus requiring re-evaluation.
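The chart above is drawn from hard-coded sample counts. A minimal sketch of how the same long-format table could instead be derived from the actual records is shown below; the helper name `yearly_counts_for_top_types` is hypothetical, the toy frame is made up, and whether `train_df` actually contains both 2012 and 2013 is an assumption that should be checked first.

```python
import pandas as pd


def yearly_counts_for_top_types(df: pd.DataFrame, years, n_types: int = 5) -> pd.DataFrame:
    """Count incidents per (TYPE, YEAR) for the most frequent types in the given years.

    Hypothetical helper for illustration; expects 'TYPE' and 'YEAR' columns.
    """
    subset = df[df["YEAR"].isin(years)]
    top_types = subset["TYPE"].value_counts().nlargest(n_types).index
    return (
        subset[subset["TYPE"].isin(top_types)]
        .groupby(["TYPE", "YEAR"])
        .size()
        .reset_index(name="Incident_Counts")
    )


# Toy records standing in for train_df (made up for illustration)
toy_df = pd.DataFrame({
    "TYPE": ["Mischief", "Mischief", "Mischief",
             "Other Theft", "Other Theft", "Theft of Vehicle"],
    "YEAR": [2012, 2013, 2013, 2012, 2012, 2013],
})

df_long = yearly_counts_for_top_types(toy_df, years=[2012, 2013], n_types=2)
print(df_long)
```

The returned frame has the same `TYPE`, `YEAR`, and `Incident_Counts` columns that the `sns.barplot` call expects, so it could be passed in directly as `df_long`.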

#### Chart - 11

In [None]:
# Chart - 11 visualization code

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Keep only rows with a known hour and neighbourhood.
# Missing values were filled with -1 / 'Unknown' during wrangling,
# so dropna() alone would not remove them.
known_df = train_df[(train_df['HOUR'] >= 0) & (train_df['NEIGHBOURHOOD'] != 'Unknown')]

# Get top 5 neighbourhoods by total crime incidents
top_neighbourhoods = known_df['NEIGHBOURHOOD'].value_counts().head(5).index

# Filter for top 5 neighbourhoods
df_top = known_df[known_df['NEIGHBOURHOOD'].isin(top_neighbourhoods)]

# Group by HOUR and NEIGHBOURHOOD
hourly_crime = df_top.groupby(['HOUR', 'NEIGHBOURHOOD']).size().reset_index(name='Crime_Count')

# Plot
plt.figure(figsize=(14, 6))
sns.lineplot(data=hourly_crime, x='HOUR', y='Crime_Count', hue='NEIGHBOURHOOD', marker='o')
plt.title('Hourly Crime Distribution Across Top 5 Neighbourhoods')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Crimes')
plt.xticks(range(0, 24))
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-This line chart effectively visualizes hourly crime patterns across the top 5 crime-prone neighbourhoods, revealing differences in crime intensity throughout the day.

##### 2. What is/are the insight(s) found from the chart?

Answer Here- Central Business District consistently reports the highest number of crimes, peaking sharply around 18:00 (6 PM). West End shows a secondary peak during late evening hours. All neighbourhoods experience a dip in crime from 2 AM to 6 AM, indicating low activity during early hours.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes. These insights can:

- Help local law enforcement strategically allocate patrol units during high-crime hours.
- Enable businesses to adjust security staff schedules, especially in the Central Business District.
- Assist urban planners and policymakers in designing safer public spaces and improving surveillance during high-risk hours.

There is no direct negative growth, but ignoring such hourly patterns may lead to increased risks and reduced public safety.
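Peak hours read off a line chart can be confirmed numerically. The sketch below, on made-up toy records, builds a neighbourhood-by-hour count table and uses `idxmax` to pull out the busiest hour per neighbourhood, after excluding the `-1` sentinel used for missing hours.

```python
import pandas as pd

# Toy records standing in for train_df (made up for illustration)
toy_df = pd.DataFrame({
    "NEIGHBOURHOOD": ["West End"] * 4 + ["Fairview"] * 3,
    "HOUR": [18, 18, 9, -1, 22, 22, 7],
})

# Exclude the -1 sentinel that marks a missing hour
valid = toy_df[toy_df["HOUR"] >= 0]

# Rows = neighbourhood, columns = hour, values = incident counts
hourly = valid.groupby(["NEIGHBOURHOOD", "HOUR"]).size().unstack(fill_value=0)

# Column label (hour) of the largest count in each row
peak_hour = hourly.idxmax(axis=1)
print(peak_hour)
```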

#### Chart - 12

In [None]:
# Chart - 12 visualization code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Aggregate crime counts by TYPE and NEIGHBOURHOOD
crime_heatmap_data = train_df.groupby(['NEIGHBOURHOOD', 'TYPE']).size().unstack(fill_value=0)

# Plotting the heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(crime_heatmap_data, cmap='YlOrBr', linewidths=0.5)

plt.title('Crime Type Distribution Across Neighbourhoods', fontsize=16)
plt.xlabel('Crime Type')
plt.ylabel('Neighbourhood')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a heatmap to visualize how various crime types are distributed across all neighborhoods. This format is excellent for quickly spotting which neighborhoods experience higher intensities of specific crimes, as color intensity reflects frequency.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Central Business District consistently shows very high levels of crimes, especially Theft from Vehicle and Other Theft.
Neighborhoods like West End, Fairview, and Grandview-Woodland also show moderate to high values in specific crime categories.
Some neighborhoods like Musqueam and Stanley Park show very low crime counts across all types, indicating low-crime areas.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes. These insights can help:

- Law enforcement allocate patrol resources to areas with crime-specific hotspots.
- City planners and councils improve lighting, surveillance, or community policing in high-theft zones.
- Real estate and insurance agencies adjust risk assessment strategies.
- Businesses take preventive security measures in high-crime neighborhoods.

There is no direct insight suggesting negative growth, but leaving high-crime areas unaddressed could reduce business confidence in those regions over time.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure 'TYPE' and 'HOUR' columns exist and are properly formatted
train_df['TYPE'] = train_df['TYPE'].astype(str)
train_df['HOUR'] = pd.to_numeric(train_df['HOUR'], errors='coerce')

# Drop rows with a missing crime type or hour.
# HOUR == -1 marks a missing hour after wrangling, so filter it out as well.
df = train_df.dropna(subset=['TYPE', 'HOUR'])
df = df[df['HOUR'] >= 0]

# Group by TYPE and HOUR, then count incidents
heatmap_data = df.groupby(['TYPE', 'HOUR']).size().reset_index(name='Count')

# Pivot for heatmap
heatmap_pivot = heatmap_data.pivot(index='TYPE', columns='HOUR', values='Count').fillna(0)

# Plot heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(heatmap_pivot, cmap="YlGnBu", linewidths=0.5, linecolor='gray')
plt.title('Crime Type Distribution by Hour of Day', fontsize=16)
plt.xlabel('Hour of Day')
plt.ylabel('Crime Type')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a heatmap for this chart because it provides a clear visual representation of how different crime types fluctuate across the 24 hours of the day. It makes it easy to detect patterns, peaks, and anomalies at a glance, especially when dealing with multiple categories like crime types.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Theft from Vehicle stands out with significantly higher counts, especially between 5 PM and midnight, making it the most frequent type of crime in late hours. Mischief, Other Theft, and Break and Enter crimes occur fairly consistently across hours but are much lower in volume. Crimes like Vehicle Collision or Pedestrian Struck (with Injury) are relatively low across all hours, showing rare but consistent occurrences. Early morning hours (1 AM to 6 AM) show a drop in most crime activity except for Theft from Vehicle, which maintains a notable presence.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Law enforcement and city planning departments can optimize patrol hours and allocate resources effectively by focusing more on high-crime hours (especially evenings). Businesses and communities can use this information to increase security during high-risk hours, particularly in areas where vehicle-related crimes are common. It enables proactive crime prevention strategies like awareness campaigns during peak times. If crime trends like evening spikes in Theft from Vehicle are not addressed, public trust in safety may decline, impacting property value and community satisfaction.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix (on numerical features only)
correlation_matrix = train_df.corr(numeric_only=True)

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-The correlation heatmap was chosen to identify linear relationships between the numerical features in the dataset. Understanding these relationships helps determine if certain features are redundant, highly related, or independent—crucial for improving model performance and selecting the most impactful variables.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Strong positive correlation exists between X, Y, and Latitude, indicating these location-related features are likely capturing similar spatial information. Longitude has a strong negative correlation with X, Y, and Latitude. Temporal features like HOUR, MINUTE, MONTH, and DAY show very weak or negligible correlations with spatial features, which means they likely provide independent signals. No single pair exhibits high multicollinearity beyond spatial features—this is useful in modeling to avoid redundancy.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset (ensure the correct path and sheet name)
df = pd.read_excel("/content/Train.xlsx")

# Select relevant numerical columns
numerical_cols = ['X', 'Y', 'Latitude', 'Longitude', 'HOUR', 'MINUTE', 'YEAR', 'MONTH', 'DAY']

# Sample the data if it's too large (optional for performance)
df_sampled = df[numerical_cols].sample(n=1000, random_state=42)

# Plot the pair plot
sns.pairplot(df_sampled)
plt.suptitle("Pair Plot of Numerical Features", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here-I selected the Pair Plot to perform a comprehensive multivariate analysis of the numerical features in the dataset. This chart is powerful for visualizing pairwise relationships between multiple variables simultaneously. It helps to uncover correlations, distributions, outliers, and patterns that are not immediately visible in single-variable plots.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Some features like X and Y, as well as Latitude and Longitude, are highly clustered and may have redundant or overlapping spatial values. Most numeric fields such as HOUR, MINUTE, and DAY appear to be evenly distributed without strong linear relationships between them. The YEAR and MONTH fields show temporal distributions but no clear relationship with other numerical features. There are possible outliers in X, Y, and spatial coordinates, which could indicate incorrect geolocation or rare incidents.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here-Null Hypothesis (H₀):
There is no significant difference in the average number of crimes between peak hours (6 PM–11 PM) and non-peak hours.
Alternate Hypothesis (H₁):
There is a significant difference in the average number of crimes between peak hours (6 PM–11 PM) and non-peak hours.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
from scipy.stats import ttest_ind

# Count number of crimes per hour
crime_counts = df['HOUR'].value_counts().sort_index()

# Define peak and non-peak hours
peak_hours = crime_counts.loc[crime_counts.index.isin([18, 19, 20, 21, 22, 23])]
non_peak_hours = crime_counts.loc[~crime_counts.index.isin([18, 19, 20, 21, 22, 23])]

# Perform Welch's t-test
t_stat, p_value = ttest_ind(peak_hours.values, non_peak_hours.values, equal_var=False)

print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")


##### Which statistical test have you done to obtain P-Value?

Answer Here-We performed the Independent Samples t-test (Welch's t-test) to compare the average number of crimes during peak hours (6 PM–11 PM) and non-peak hours (rest of the day).

##### Why did you choose the specific statistical test?

Answer Here-We are comparing the means of two independent groups (peak vs. non-peak hours).
The samples are small and unequal in size (6 hourly counts for peak hours versus up to 18 for non-peak hours) and their variances may differ, so Welch’s t-test is more reliable than the standard Student's t-test.
It helps us determine if the difference in crime frequency between peak and non-peak hours is statistically significant.
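
To make the group sizes concrete, here is a minimal sketch of the same test on synthetic data. The random `hours` series is an assumption standing in for the `HOUR` column; only the grouping logic mirrors the cell above.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in for the HOUR column (assumption: integer hours 0-23).
rng = np.random.default_rng(42)
hours = pd.Series(rng.integers(0, 24, size=5000), name="HOUR")

# One count per hour of the day, so at most 24 observations in total.
crime_counts = hours.value_counts().sort_index()

peak_mask = crime_counts.index.isin(range(18, 24))
peak = crime_counts[peak_mask]
non_peak = crime_counts[~peak_mask]

# Welch's t-test does not assume equal variances or equal group sizes.
t_stat, p_value = ttest_ind(peak.values, non_peak.values, equal_var=False)
print(len(peak), len(non_peak))
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")
```

Each group is a handful of hourly totals rather than individual incidents, so the test has limited power regardless of how many records the dataset contains.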

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here- Null Hypothesis (H₀):
There is no association between the type of crime and the day of the week. Crime types occur independently of the day.
Alternate Hypothesis (H₁):
There is a significant association between the type of crime and the day of the week. Certain crime types occur more frequently on specific days.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
# Combine YEAR, MONTH, DAY into a datetime column
df['DATE'] = pd.to_datetime(df[['YEAR', 'MONTH', 'DAY']])

# Extract day of week (e.g., Monday, Tuesday, ...)
df['DAY_OF_WEEK'] = df['DATE'].dt.day_name()


In [None]:
from scipy.stats import chi2_contingency

# Create contingency table
contingency_table = pd.crosstab(df['TYPE'], df['DAY_OF_WEEK'])

# Perform Chi-Square Test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi2 Statistic: {chi2_stat:.4f}, P-value: {p_value:.4f}")


##### Which statistical test have you done to obtain P-Value?

Answer Here-I used the Chi-Square Test of Independence to determine if there is a significant association between Crime Type and Day of the Week.

##### Why did you choose the specific statistical test?

Answer Here-The Chi-Square test is ideal for evaluating relationships between two categorical variables. In this case, both Crime Type and Day of the Week are categorical. The test helps determine whether the frequency distribution of crime types is independent of the days they occur on. The very low p-value suggests a significant relationship between these two variables.
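
With a large number of records, a chi-square test can return a very small p-value even when the association is weak, so it can help to report an effect size next to it. The sketch below computes Cramér's V on a made-up contingency table; the counts and labels are illustrative assumptions, not results from this dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up contingency table: crime type (rows) by day of week (columns).
table = pd.DataFrame(
    [[120, 130, 150], [80, 85, 90], [40, 60, 95]],
    index=["Theft", "Mischief", "Break and Enter"],
    columns=["Monday", "Friday", "Saturday"],
)

chi2_stat, p_value, dof, expected = chi2_contingency(table)

# Cramér's V rescales the chi-square statistic to the range [0, 1].
n = table.to_numpy().sum()
min_dim = min(table.shape) - 1
cramers_v = np.sqrt(chi2_stat / (n * min_dim))
print(f"p-value: {p_value:.4g}, Cramér's V: {cramers_v:.3f}, dof: {dof}")
```

Values of V near 0 suggest a weak association and values near 1 a strong one, which complements the yes/no answer given by the p-value.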

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here-Null Hypothesis (H₀):
There is no difference in the number of crimes reported during the summer months (June, July, August) and the winter months (December, January, February).
Alternate Hypothesis (H₁):
There is a significant difference in the number of crimes reported between summer and winter months.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
from scipy.stats import ttest_ind

# Filter summer and winter months
summer_months = df[df['MONTH'].isin([6, 7, 8])]
winter_months = df[df['MONTH'].isin([12, 1, 2])]

# Count number of crimes per month
summer_counts = summer_months['MONTH'].value_counts().sort_index()
winter_counts = winter_months['MONTH'].value_counts().sort_index()

# Perform t-test
t_stat, p_value = ttest_ind(summer_counts.values, winter_counts.values, equal_var=False)

print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")


##### Which statistical test have you done to obtain P-Value?

Answer Here-Two-sample independent t-test.

##### Why did you choose the specific statistical test?

Answer Here-The two-sample t-test was selected because we are comparing the mean number of crimes between two independent groups:

Summer months (June, July, August)
Winter months (December, January, February)
This test helps determine if there is a significant difference in crime frequency between seasons.
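
Note that counting crimes per calendar month gives only three values per season. One possible refinement, sketched below on made-up records, is to count per (year, month) pair so that each season contributes one observation per month per year. The `records` frame is an assumption; only the `YEAR` and `MONTH` column names come from the notebook.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Made-up records; YEAR and MONTH mirror the column names used above.
records = pd.DataFrame({
    "YEAR":  [2010] * 12 + [2011] * 10,
    "MONTH": [6, 6, 6, 7, 7, 8, 8, 8, 12, 1, 1, 2,
              6, 7, 7, 8, 12, 12, 1, 2, 2, 2],
})

# One observation per (year, month) instead of one per calendar month.
monthly = records.groupby(["YEAR", "MONTH"]).size().reset_index(name="count")

summer = monthly.loc[monthly["MONTH"].isin([6, 7, 8]), "count"]
winter = monthly.loc[monthly["MONTH"].isin([12, 1, 2]), "count"]

t_stat, p_value = ttest_ind(summer, winter, equal_var=False)
print(len(summer), len(winter), round(p_value, 4))
```

With more years of data the two groups grow accordingly, which makes the comparison of means less dependent on a single unusual month.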

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

In [None]:
# Checking missing values
missing_values = train_df.isnull().sum()
print(missing_values)

# Imputing missing values (if any)
# Since the data is mostly structured, we use forward fill for simplicity.
# DataFrame.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas.
train_df.ffill(inplace=True)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here-I used the forward fill method (ffill) to handle missing values in the dataset.
It fills each missing value with the last known value, which suits time-ordered data like crime records, provided the rows are sorted chronologically.
It keeps the flow of data consistent without adding random or average values.
It’s simple and avoids changing the overall trend or meaning of the data.
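
Forward fill copies from the previous row, so the row order decides which value is used. A minimal sketch on made-up records (the area names and dates are hypothetical) shows sorting by date before filling:

```python
import numpy as np
import pandas as pd

# Made-up records, deliberately out of chronological order.
records = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-03", "2015-01-01", "2015-01-02"]),
    "NEIGHBOURHOOD": [np.nan, "Area A", "Area B"],
})

# Sort first so "last known value" means the previous record in time.
filled = records.sort_values("Date").ffill()
print(filled)
```

Without the sort, the missing value sits in the first row and has no earlier value to copy, so it would stay empty.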



### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Identify numerical columns
numerical_columns = train_df.select_dtypes(include=['int64', 'float64']).columns

# Step 2: Visualize outliers using boxplots (first nine numeric columns)
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_columns[:9], 1):
    plt.subplot(3, 3, i)
    sns.boxplot(y=train_df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()  # call once, after all subplots are drawn
plt.show()

# Step 3: Define IQR-based function to cap outliers
def handle_outliers_iqr(df, column):
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in place."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    # clip() is vectorized and leaves missing values untouched
    df[column] = df[column].clip(lower=lower_limit, upper=upper_limit)

# Step 4: Apply the function to each numerical column
for col in numerical_columns:
    handle_outliers_iqr(train_df, col)


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here-I used the Interquartile Range (IQR) method to detect and treat outliers in the numerical columns. This method flags values that fall below Q1 - 1.5×IQR or above Q3 + 1.5×IQR as outliers.
I chose this technique because:
It is robust, since quartiles are not affected by extreme values.
It is suitable for non-normally distributed data.
It can improve model performance by limiting the influence of extreme values.
After identifying outliers, I capped them at the IQR thresholds rather than removing rows, so the dataset keeps its original size.
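
As a small worked example of the IQR rule, the sketch below caps one extreme value in a made-up series using `Series.clip`. The numbers are illustrative only.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() caps values outside [lower, upper] without dropping rows.
capped = values.clip(lower=lower, upper=upper)
print(lower, upper)
print(capped.tolist())
```

Here Q1 is 2 and Q3 is 4, so the limits are -1 and 7, and the value 100 is pulled down to 7 while the other four values stay as they are.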

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Step 1: Identify categorical columns in train_df
categorical_columns = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD', 'Date']

# Step 2: Encode each categorical column with its own LabelEncoder.
# Keeping one fitted encoder per column lets us reverse the mapping later
# with label_encoders[column].inverse_transform(...).
label_encoders = {}
for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    # Cast to str so missing values and mixed types do not raise an error
    train_df[column] = label_encoders[column].fit_transform(train_df[column].astype(str))

# Step 3: Display the first few rows of the encoded DataFrame
print(train_df.head())


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here-I used Label Encoding for the categorical columns 'TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD', and 'Date'.
Label Encoding was chosen because:

These columns contain non-numeric categories that need to be converted into numbers for machine learning models.
It is simple and does not increase dimensionality, which matters for high-cardinality columns such as 'HUNDRED_BLOCK'.
One caveat: the integer codes imply an order that these categories do not have. Tree-based models such as XGBoost tolerate this, but linear or distance-based models may be misled, and one-hot encoding is the safer choice for them.
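
A practical detail with label encoding is keeping the fitted mapping so codes can be turned back into names. The sketch below uses a made-up frame (the area names are hypothetical, and the `encoders` dictionary is an illustrative name) with one encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame; the column names mirror two of those encoded above.
frame = pd.DataFrame({
    "TYPE": ["Mischief", "Theft from Vehicle", "Mischief"],
    "NEIGHBOURHOOD": ["Area B", "Area A", "Area B"],
})

# One fitted encoder per column, so each mapping can be reversed later.
encoders = {}
for column in ["TYPE", "NEIGHBOURHOOD"]:
    encoders[column] = LabelEncoder()
    frame[column] = encoders[column].fit_transform(frame[column].astype(str))

print(frame)
print(encoders["TYPE"].inverse_transform([0, 1]))
```

`LabelEncoder` assigns codes in sorted order of the class names, so the codes carry no meaning beyond identity.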

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

No, textual data preprocessing is not required for this dataset as it doesn't contain free-form text. The dataset is structured and mainly includes numerical and categorical features, which can be processed using standard data preprocessing techniques.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

import pandas as pd

# 1. Drop one of the highly correlated location features
# From previous correlation analysis, 'X', 'Y', and 'Latitude' are strongly correlated.
# Let's drop 'X' and 'Y' to reduce redundancy.
df.drop(['X', 'Y'], axis=1, inplace=True)

# 2. Combine related date parts to make a single datetime feature (already partially done)
# We'll ensure consistency and keep only relevant datetime features
df['Date'] = pd.to_datetime(df[['YEAR', 'MONTH', 'DAY']], errors='coerce')

# 3. Convert 'Date' into more useful features (if not already created)
df['DayOfWeek'] = df['Date'].dt.day_name()
df['WeekOfYear'] = df['Date'].dt.isocalendar().week
df['Quarter'] = df['Date'].dt.quarter

# 4. Bin 'HOUR' into time periods to reduce granularity
def time_period(hour):
    # Keep missing hours as their own category instead of treating them as midnight
    if pd.isna(hour):
        return 'Unknown'
    elif hour < 6:
        return 'Late Night'
    elif hour < 12:
        return 'Morning'
    elif hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

df['Time_Period'] = df['HOUR'].apply(time_period)

# 5. Drop raw 'Date', 'YEAR', 'MONTH', 'DAY' after extracting necessary info
df.drop(['Date', 'YEAR', 'MONTH', 'DAY'], axis=1, inplace=True)

# 6. (Optional) If 'HUNDRED_BLOCK' has too many unique values, drop it to avoid high-cardinality issues
if df['HUNDRED_BLOCK'].nunique() > 100:
    df.drop('HUNDRED_BLOCK', axis=1, inplace=True)

# Display updated dataset
df.head()


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

In [None]:
import pandas as pd
import numpy as np


# Load the dataset (ensure the correct path and sheet name)
df = pd.read_excel("/content/Train.xlsx")


# Drop unneeded columns
df.drop(['X', 'Y'], axis=1, inplace=True, errors='ignore')

# Combine date columns and extract useful time parts
df['Date'] = pd.to_datetime(df[['YEAR', 'MONTH', 'DAY']], errors='coerce')
df['DayOfWeek'] = df['Date'].dt.day_name()
df['WeekOfYear'] = df['Date'].dt.isocalendar().week
df['Quarter'] = df['Date'].dt.quarter

# Create Time_Period using NumPy (faster than apply)
conditions = [
    df['HOUR'] < 6,
    df['HOUR'].between(6, 11),
    df['HOUR'].between(12, 17),
    df['HOUR'] >= 18
]
choices = ['Late Night', 'Morning', 'Afternoon', 'Evening']
df['Time_Period'] = np.select(conditions, choices, default='Unknown')

# Drop redundant columns
df.drop(['Date', 'YEAR', 'MONTH', 'DAY'], axis=1, inplace=True, errors='ignore')
if 'HUNDRED_BLOCK' in df.columns and df['HUNDRED_BLOCK'].nunique() > 100:
    df.drop('HUNDRED_BLOCK', axis=1, inplace=True)

# One-hot encoding
df_encoded = pd.get_dummies(df, drop_first=True)

# Correlation matrix
correlation_matrix = df_encoded.corr().abs()
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.9)]

# Drop highly correlated features
df_reduced = df_encoded.drop(columns=to_drop)

# Display top rows of final dataset
df_reduced.head()


##### What all feature selection methods have you used  and why?

Answer Here-Correlation-based Filtering – To remove highly correlated features (above 0.9) and reduce multicollinearity.
High Cardinality Removal – Dropped categorical columns like HUNDRED_BLOCK with too many unique values to prevent overfitting and reduce noise.
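
The correlation-based filter can be illustrated on a toy frame. In this sketch the three columns are made up: `b` is an exact multiple of `a`, and `c` is only weakly related to either.

```python
import numpy as np
import pandas as pd

# Made-up features: 'b' is an exact multiple of 'a', 'c' is unrelated.
features = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = features.corr().abs()
# Keep only the cells above the diagonal so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

Only the later column of each highly correlated pair is dropped, so `a` is kept and `b` is removed.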

##### Which all features you found important and why?

Answer Here-The following features were found important:

TYPE – It's the target variable (crime type) we're trying to predict.
NEIGHBOURHOOD – Crime patterns often vary by location.
HOUR & Time_Period – Crimes show strong trends during specific times of the day.
DayOfWeek – Certain crimes occur more on specific days (e.g., weekends).
Latitude & Longitude – Helps identify crime-prone areas.
Quarter & WeekOfYear – Captures seasonal and weekly trends.
These features capture time, location, and context, which are key to forecasting crime patterns.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, the data needed transformation because it contains numerical features like Latitude, Longitude, HOUR, MINUTE, etc., which vary in scale. To ensure these features contribute equally during model training (especially for distance or gradient-based models), we applied StandardScaler.

In [None]:
# Transform Your data

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset (ensure the correct path and sheet name)
df = pd.read_excel("/content/Train.xlsx")

# Select numeric columns for transformation
numeric_cols = ['Latitude', 'Longitude', 'HOUR', 'MINUTE', 'YEAR', 'MONTH', 'DAY']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numeric columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Display top rows of transformed data
df.head()


### 6. Data Scaling

In [None]:
# Scaling your data

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the dataset (ensure the correct path and sheet name)
df = pd.read_excel("/content/Train.xlsx")

# Step 2: Identify the numeric columns to be scaled
# These features have different units and ranges, so scaling is important
numeric_cols = ['Latitude', 'Longitude', 'HOUR', 'MINUTE', 'YEAR', 'MONTH', 'DAY']

# Step 3: Initialize MinMaxScaler
# This will scale the data between 0 and 1
scaler = MinMaxScaler()

# Step 4: Apply scaling to the selected numeric columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Step 5: Check the first few rows of the scaled data
# This helps ensure the transformation worked correctly
df.head()


##### Which method have you used to scale your data and why?
I used MinMaxScaler to scale the data because it normalizes all numeric features to a range between 0 and 1. This ensures equal contribution of features and improves model performance, especially for algorithms sensitive to feature magnitude.
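
When scaling is followed by a train/test evaluation, a common precaution is to fit the scaler on the training rows only and reuse it for the test rows, so that no information from the test set leaks into preprocessing. A minimal sketch with made-up hour-like values (the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up HOUR-like values; the train/test split here is illustrative.
train = np.array([[0.0], [6.0], [12.0]])
test = np.array([[18.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0, max=12
test_scaled = scaler.transform(test)        # reuses the training range

print(train_scaled.ravel(), test_scaled.ravel())
```

Test values outside the training range map outside [0, 1] (here 18 becomes 1.5), which is expected behaviour rather than an error.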

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here-No, dimensionality reduction is not strictly needed in this case, because:

The dataset has a manageable number of features after preprocessing.
We already removed highly correlated and redundant columns.
All selected features carry meaningful information (time, location, context) relevant for predicting crime.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: Load the dataset
# Note: this reads the raw file again; load the processed DataFrame here instead
# if the encoding and scaling steps above should carry over.
df = pd.read_excel("/content/Train.xlsx")

# Step 2: Define your features (X) and target (y)
# Assuming 'TYPE' is the target variable (crime type)
X = df.drop('TYPE', axis=1)
y = df['TYPE']

# Step 3: Split the data into train and test sets (80% train, 20% test)
# stratify=y keeps the class proportions of TYPE the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Step 4: Print shape of splits for confirmation
print("Training Set Shape:", X_train.shape)
print("Test Set Shape:", X_test.shape)


##### What data splitting ratio have you used and why?

Answer Here-I used an 80:20 train-test split to ensure the model has enough data to learn patterns (80%) while keeping a separate portion (20%) to evaluate its performance on unseen data. This balance helps avoid overfitting and gives a reliable estimate of model accuracy.
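
The effect of `stratify` can be seen on a small made-up target with an 80/20 class ratio. The `data` frame and the `_demo` names are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 80 'A' records and 20 'B' records.
data = pd.DataFrame({"feature": range(100), "TYPE": ["A"] * 80 + ["B"] * 20})

X_demo = data[["feature"]]
y_demo = data["TYPE"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep the original 80/20 class ratio.
print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```

Without stratification, a rare class could by chance be under-represented in, or missing from, the test split.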

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here-Yes, the dataset is imbalanced because the number of records for each crime TYPE is not evenly distributed. Some crime types occur far more frequently than others, which can bias the model toward predicting the majority class and reduce accuracy for less frequent crimes.

In [None]:
# Handling Imbalanced Dataset (If needed)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Step 1: Load dataset
df = pd.read_excel("/content/Train.xlsx")

# Step 2: Split features and target
X = df.drop('TYPE', axis=1)
y = df['TYPE']

# Step 3: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Step 4: Apply Random Over Sampling to balance the training data
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

# Step 5: Show class distribution after balancing
print("Class distribution after oversampling:")
print(y_resampled.value_counts())


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here-I used Random Oversampling to handle the imbalanced dataset. It increases the number of samples in the minority classes by duplicating them, ensuring all classes have equal representation. This helps the model learn from all crime types fairly and prevents bias toward the majority class.
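
To show what random oversampling does to the data, the sketch below reproduces the idea with plain pandas rather than imbalanced-learn. This is a substitution for illustration: the notebook itself uses `RandomOverSampler`, and the `train` frame here is made up.

```python
import pandas as pd

# Made-up training frame with a 6:2 class imbalance.
train = pd.DataFrame({
    "HOUR": [1, 5, 9, 13, 17, 21, 2, 22],
    "TYPE": ["Theft"] * 6 + ["Mischief"] * 2,
})

target_size = train["TYPE"].value_counts().max()

parts = []
for _, group in train.groupby("TYPE"):
    parts.append(group)  # keep every original row
    shortfall = target_size - len(group)
    if shortfall > 0:
        # Duplicate randomly chosen minority rows until the class is full size.
        parts.append(group.sample(n=shortfall, replace=True, random_state=42))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["TYPE"].value_counts().to_dict())
```

The added rows are exact copies of existing minority rows, which is why oversampling should be applied to the training split only, as the cell above does.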

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

In [None]:
import pandas as pd
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 1: Reload the dataset
train_df = pd.read_excel("/content/Train.xlsx") # Adjust path if needed

# Step 2: Check TYPE column (this will confirm it's intact)
print(train_df['TYPE'].value_counts())
print(train_df['TYPE'].unique())

# Step 3: Define frequent types based on actual class names
frequent_types = [
    'Break and Enter Commercial',
    'Theft from Vehicle',
    'Offence Against a Person'
]

# Step 4: Create binary classification target
train_df['TYPE_BINARY'] = train_df['TYPE'].apply(lambda x: 1 if x in frequent_types else 0)

# Step 5: Check if TYPE_BINARY is valid now
print(train_df['TYPE_BINARY'].value_counts())

# Step 6: Prepare features and labels
X = train_df.drop(columns=['TYPE', 'TYPE_BINARY', 'Date', 'HUNDRED_BLOCK'])
y = train_df['TYPE_BINARY']

X = pd.get_dummies(X, drop_first=True)
X = X.select_dtypes(exclude=['datetime64[ns]']).fillna(X.median(numeric_only=True))

# Step 7: Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 8: Balance with SMOTE
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Step 9: Train model
# use_label_encoder is omitted: it is deprecated in recent XGBoost releases and is not needed
model = XGBClassifier(eval_metric='logloss', random_state=42)
model.fit(X_train_res, y_train_res)

# Step 10: Predict and Evaluate
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

I implemented an XGBoost classifier, a gradient boosting algorithm well suited to structured/tabular data. The model was trained on a binary classification target (TYPE_BINARY), distinguishing between frequent crime types (e.g., Theft from Vehicle, Break and Enter Commercial, Offence Against a Person) and less frequent ones. To address class imbalance, SMOTE (Synthetic Minority Over-sampling Technique) was applied to the training data.
The model achieved an overall accuracy of 71.6% on the test set. The precision and recall scores for both classes were balanced (~0.71–0.72).

In [None]:
# Visualizing evaluation Metric Score chart

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Step 1: Generate classification report as a dictionary
report_dict = classification_report(y_test, y_pred, output_dict=True)

# Step 2: Convert report to a DataFrame for display
report_df = pd.DataFrame(report_dict).transpose()
print("Classification Report:")
print(report_df[['precision', 'recall', 'f1-score']].round(2))

# Step 3: Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Step 4: Display confusion matrix as heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Less Frequent (0)', 'Frequent (1)'], yticklabels=['Less Frequent (0)', 'Frequent (1)'])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the model
# use_label_encoder is omitted: it is deprecated in recent XGBoost releases and is not needed
xgb = XGBClassifier(eval_metric='logloss', random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.8, 1]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid,
    scoring='accuracy',
    cv=3,
    verbose=1,
    n_jobs=-1
)

# Fit the algorithm
grid_search.fit(X_train_res, y_train_res)

# Predict on the model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate
print("Best Parameters Found:", grid_search.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


##### Which hyperparameter optimization technique have you used and why?

Answer Here-GridSearchCV for hyperparameter optimization. GridSearchCV performs an exhaustive search over a specified parameter grid and evaluates each combination using cross-validation. This ensures that the model is tuned for the best set of parameters by thoroughly testing all possibilities, which is particularly useful when accuracy and generalization are critical.
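
One caveat when tuning on data that was resampled beforehand is that synthetic points can end up in the validation folds, which may make cross-validation scores look better than they are. Wrapping preprocessing and model in a single pipeline makes each fold refit every step on its own training part. The sketch below uses only scikit-learn on synthetic data; `DecisionTreeClassifier` with `class_weight="balanced"` stands in for XGBoost with SMOTE, which is a substitution for illustration and not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced data (assumption: 70/30 class split).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

# Step-name prefixes route each parameter to the right pipeline step.
search = GridSearchCV(pipeline, {"model__max_depth": [2, 4, 6]}, cv=3, scoring="f1")
search.fit(X_tr, y_tr)

print(search.best_params_, round(search.score(X_te, y_te), 3))
```

The imbalanced-learn package provides its own `Pipeline` class that accepts samplers such as SMOTE as steps, which would be one way to apply the same pattern to the XGBoost search.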

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here-Yes, there was a slight improvement in performance after tuning.

### ML Model - 2

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Step 1: Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: Balance training data using SMOTE
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Step 3: Initialize and train MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300, random_state=42)
mlp.fit(X_train_res, y_train_res)

# Step 4: Predict and evaluate
y_pred = mlp.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.
Implemented an MLPClassifier (Multilayer Perceptron), a type of feedforward neural network from scikit-learn. It was trained on SMOTE-balanced data to classify crime types into two categories: frequent and less frequent crimes.
The model achieved an overall accuracy of 61.2%, which is lower than XGBoost (71.8%). While it showed perfect recall for Class 0 (less frequent crimes), it severely underperformed on Class 1 (frequent crimes), which indicates a strong bias toward predicting Class 0.
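
Neural networks are sensitive to feature scale, and the features passed to the MLP here were not standardised, so that may be one contributor to the weak result (this is a hypothesis, not something verified on the dataset). A minimal sketch on synthetic data shows how a scaler can be chained in front of the classifier; the stretched column is an assumption meant to mimic a large-valued feature such as YEAR.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
# Stretch one column to mimic a feature on a much larger scale.
X_demo[:, 0] = X_demo[:, 0] * 1000 + 2000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler is fitted on the training data only, then applied before the MLP.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300, random_state=42),
)
scaled_mlp.fit(X_tr, y_tr)
print("Accuracy:", round(scaled_mlp.score(X_te, y_te), 3))
```

Whether scaling changes the outcome on the crime data would need to be checked by rerunning the model.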

In [None]:
# Visualizing evaluation Metric Score chart

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Step 1: Classification Report as Dict
mlp_report = classification_report(y_test, y_pred, output_dict=True)

# Step 2: Convert to DataFrame
mlp_report_df = pd.DataFrame(mlp_report).transpose()
mlp_report_df = mlp_report_df[['precision', 'recall', 'f1-score']].round(2)

# Step 3: Plot the score chart
# Pass figsize to DataFrame.plot directly; a separate plt.figure() call would leave an empty figure behind
mlp_report_df.iloc[:2].plot(kind='bar', figsize=(8, 5))
plt.title("MLPClassifier Performance Metrics (Class 0 vs Class 1)")
plt.ylabel("Score")
plt.xticks(rotation=0)
plt.ylim(0, 1.1)
plt.grid(axis='y', linestyle='--', linewidth=0.5)
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here-The evaluation metrics used for this model (accuracy, precision, recall, and F1-score) each describe a different aspect of its real-world effectiveness and business value. Accuracy measures overall correctness, which tells law enforcement how far the predictions can be relied on for day-to-day decisions. Precision, especially for frequent crimes, shows how often a predicted frequent crime really is one, so high precision means resources are not wasted on false alarms. Recall measures the model’s ability to capture true high-risk incidents, which is vital for crime prevention. The F1-score is the harmonic mean of precision and recall and summarizes the balance between the two. Taken together, these metrics indicate how much the model can improve resource allocation, crime prediction, and operational planning, supporting proactive strategies that enhance public safety and reduce crime-related costs.
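
The relationship between these metrics can be sketched in plain Python. This is a minimal illustration, not the notebook's evaluation code: the helper name `binary_metrics` and the counts in `example_cm` are made up for the example and are not results from this project. The matrix layout (rows are actual classes, columns are predicted classes) follows the convention of scikit-learn's `confusion_matrix`.

```python
def binary_metrics(cm):
    """Return accuracy and per-class precision/recall/F1 for a 2x2 confusion matrix.

    cm[i][j] is the number of samples whose actual class is i
    and whose predicted class is j.
    """
    total = sum(sum(row) for row in cm)
    accuracy = (cm[0][0] + cm[1][1]) / total

    per_class = {}
    for k in (0, 1):
        tp = cm[k][k]          # actual k, predicted k
        fp = cm[1 - k][k]      # actual other class, predicted k
        fn = cm[k][1 - k]      # actual k, predicted other class
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[k] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_class


# Hypothetical counts (NOT this project's results): every actual Class 0
# sample is predicted correctly, but most Class 1 samples are missed.
example_cm = [[50, 0],
              [40, 10]]

accuracy, per_class = binary_metrics(example_cm)
print(f"accuracy: {accuracy:.2f}")
for label, scores in per_class.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

With these illustrative counts, Class 0 has perfect recall while Class 1 recall is only 0.2, even though overall accuracy is 0.6. This is why accuracy alone can hide weak performance on one class.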

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here-I chose the XGBoost model as the final prediction model because it gave the highest accuracy (71.8%), with balanced precision and recall for both classes. It outperformed the other models, such as MLPClassifier and Logistic Regression, making it the most reliable for distinguishing frequent from less frequent crime types.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here-I used the XGBoost Classifier as my final model. It is a gradient boosting algorithm that builds decision trees sequentially, with each new tree correcting the errors of the ones before it. To understand which features were most influential, I used the model’s built-in feature importance tool. The most impactful features were YEAR, HOUR, NEIGHBOURHOOD, and MONTH, which shows that both time and location play a key role in predicting whether a crime falls under the frequent or less frequent category.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
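
A save-and-reload round trip with the standard-library `pickle` module could look like the sketch below. It assumes a stand-in model: `model_params`, `predict`, the file name `best_model.pkl`, and the sample rows are all hypothetical and only keep the example self-contained. A fitted scikit-learn or XGBoost estimator can be passed to `pickle.dump` in the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: a dict of "fitted" parameters.
# In the notebook this would be the fitted classifier object instead.
model_params = {"feature_index": 0, "threshold": 12}


def predict(params, rows):
    """Label a row 1 when the chosen feature is at or above the threshold."""
    idx = params["feature_index"]
    return [1 if row[idx] >= params["threshold"] else 0 for row in rows]


# Save the model to a pickle file (a temporary directory is used here)
model_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model_params, f)

# Load the saved file again
with open(model_path, "rb") as f:
    loaded_params = pickle.load(f)

# Sanity check on unseen rows: the reloaded model must give the same predictions
unseen_rows = [[3], [12], [20]]
original_preds = predict(model_params, unseen_rows)
loaded_preds = predict(loaded_params, unseen_rows)
print(loaded_preds)
print("Predictions match:", loaded_preds == original_preds)
```

Comparing predictions from the in-memory and reloaded objects on the same rows is a quick check that nothing was lost during serialization. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.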

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here-Based on the analysis and model evaluation, I conclude that XGBoost is the most effective of the models tested for predicting frequent crime types using FBI crime data. With the highest accuracy (71.8%), balanced precision and recall, and interpretable feature importance, it supports the business objective of enhancing crime forecasting. The final decision is to adopt this model for real-time prediction and strategic planning, enabling better allocation of law enforcement resources and proactive crime prevention efforts.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***