# Vancouver Theft Incidents Analysis

## Project Overview
This project analyzes how theft incidents are distributed across Vancouver neighborhoods, using spatiotemporal data mining techniques and machine learning models. It identifies high-theft areas and peak times, explores recurring crime patterns, and estimates the likelihood of theft at a given time and location, with the goal of informing community safety strategies.

## Analysis Objectives
1. Analyze spatial distribution of theft incidents
2. Identify temporal patterns and peak times
3. Develop predictive models for risk assessment
4. Generate actionable insights for safety strategies

In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import folium
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Visualization settings
plt.style.use('default')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)

## 1. Data Preprocessing

In [4]:
# Load and prepare data
def load_theft_data():
    """Load theft-related crime data"""
    df = pd.read_csv('crime.csv')
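    # Filter for theft incidents (case-insensitive match on TYPE; missing TYPE is excluded).
    # .copy() returns an independent DataFrame, so later column assignments
    # do not trigger pandas' SettingWithCopyWarning.
    theft_df = df[df['TYPE'].str.contains('Theft', case=False, na=False)].copy()
    return theft_df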

# Load data
theft_df = load_theft_data()

# Display basic information
print("Dataset Overview:")
print(f"Total theft incidents: {len(theft_df)}")
print("\nSample of the data:")
theft_df.head()

Dataset Overview:
Total theft incidents: 289015

Sample of the data:


|   | TYPE | YEAR | MONTH | DAY | HOUR | MINUTE | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Other Theft | 2003 | 5 | 12 | 16.0 | 15.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
| 1 | Other Theft | 2003 | 5 | 7 | 15.0 | 20.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
| 2 | Other Theft | 2003 | 4 | 23 | 16.0 | 40.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
| 3 | Other Theft | 2003 | 4 | 20 | 11.0 | 15.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
| 4 | Other Theft | 2003 | 4 | 12 | 17.0 | 45.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
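
The filter in `load_theft_data` keeps any row whose `TYPE` contains "theft" in any letter case and drops rows where `TYPE` is missing. The sketch below illustrates that behavior on a small hypothetical frame; the `TYPE` labels other than "Other Theft" are made up for illustration and are not taken from `crime.csv`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not read from crime.csv.
sample = pd.DataFrame({
    "TYPE": ["Other Theft", "Theft of Vehicle", "Mischief", None, "theft from vehicle"],
})

# case=False makes the match case-insensitive; na=False treats missing TYPE as "no match".
mask = sample["TYPE"].str.contains("Theft", case=False, na=False)

# .copy() makes the filtered result independent of the original frame.
theft_only = sample[mask].copy()

# Count incidents per theft category.
type_counts = theft_only["TYPE"].value_counts()
print(type_counts)
```

Running `value_counts()` on the real `theft_df['TYPE']` would show which categories the substring match actually captured, which is worth checking before any modeling.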


In [6]:
# Data preprocessing
def preprocess_data(df):
    """Preprocess theft data for analysis"""
    # First, let's identify the date column name
    date_columns = [col for col in df.columns if 'DATE' in col.upper() or 'TIME' in col.upper()]
    if date_columns:
        date_col = date_columns[0]  # Use the first date-related column
        
        # Convert datetime
        df[date_col] = pd.to_datetime(df[date_col])
        
        # Extract temporal features
        df['Year'] = df[date_col].dt.year
        df['Month'] = df[date_col].dt.month
        df['Day'] = df[date_col].dt.day
        df['Hour'] = df[date_col].dt.hour
        df['DayOfWeek'] = df[date_col].dt.day_name()
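        # right=False yields the intervals [0,6), [6,12), [12,18), [18,24),
        # so hour 0 is labeled 'Night' instead of falling outside the bins as NaN
        df['TimeOfDay'] = pd.cut(df['Hour'],
                                bins=[0,6,12,18,24],
                                right=False,
                                labels=['Night','Morning','Afternoon','Evening'])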
    else:
        print("Warning: No date column found in the dataset")
    
    return df

# Process data
theft_df = preprocess_data(theft_df)
print("Data preprocessing completed!")
theft_df.head()

Data preprocessing completed!


|   | TYPE | YEAR | MONTH | DAY | HOUR | MINUTE | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Other Theft | 2003 | 5 | 12 | 16.0 | 15.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
| 1 | Other Theft | 2003 | 5 | 7 | 15.0 | 20.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
| 2 | Other Theft | 2003 | 4 | 23 | 16.0 | 40.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
| 3 | Other Theft | 2003 | 4 | 20 | 11.0 | 15.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
| 4 | Other Theft | 2003 | 4 | 12 | 17.0 | 45.0 | 9XX TERMINAL AVE | Strathcona | 493906.5 | 5457452.47 | 49.269802 | -123.083763 |
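
The columns shown above contain no name with `DATE` or `TIME` in it; the timestamp is spread across `YEAR`, `MONTH`, `DAY`, `HOUR`, and `MINUTE`. That is why the frame comes back from `preprocess_data` with no new columns. One possible way to derive the temporal features from those component columns is sketched below. The function name `add_temporal_features` and the `Timestamp` column are assumptions made for this example, not part of the original notebook, and the third sample row has its hour changed to 0 to exercise the midnight edge case.

```python
import pandas as pd

# First two rows mirror the sample output; the third is altered (HOUR=0.0) for illustration.
sample = pd.DataFrame({
    "YEAR": [2003, 2003, 2003],
    "MONTH": [5, 5, 4],
    "DAY": [12, 7, 23],
    "HOUR": [16.0, 15.0, 0.0],
    "MINUTE": [15.0, 20.0, 40.0],
})

def add_temporal_features(df):
    """Hypothetical helper: build a timestamp and derived features from component columns."""
    out = df.copy()
    # pd.to_datetime can assemble datetimes from a frame with year/month/day/hour/minute columns.
    parts = out[["YEAR", "MONTH", "DAY", "HOUR", "MINUTE"]].rename(columns=str.lower)
    out["Timestamp"] = pd.to_datetime(parts, errors="coerce")
    out["DayOfWeek"] = out["Timestamp"].dt.day_name()
    # Left-closed bins: [0,6) Night, [6,12) Morning, [12,18) Afternoon, [18,24) Evening.
    out["TimeOfDay"] = pd.cut(
        out["HOUR"],
        bins=[0, 6, 12, 18, 24],
        right=False,
        labels=["Night", "Morning", "Afternoon", "Evening"],
    )
    return out

featured = add_temporal_features(sample)
print(featured[["Timestamp", "DayOfWeek", "TimeOfDay"]])
```

With `errors="coerce"`, rows whose components do not form a valid date become `NaT` rather than raising, so they can be counted and handled explicitly afterwards.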


## 2. Initial Data Exploration

In [7]:
# Basic statistics and data quality check
def explore_data_quality(df):
    """Examine data quality and basic statistics"""
    print("Dataset Shape:", df.shape)
    print("\nMissing Values:")
    print(df.isnull().sum())
    print("\nData Types:")
    print(df.dtypes)

explore_data_quality(theft_df)

Dataset Shape: (289015, 12)

Missing Values:
TYPE                0
YEAR                0
MONTH               0
DAY                 0
HOUR                0
MINUTE              0
HUNDRED_BLOCK       6
NEIGHBOURHOOD    1989
X                   0
Y                   0
Latitude            0
Longitude           0
dtype: int64

Data Types:
TYPE              object
YEAR               int64
MONTH              int64
DAY                int64
HOUR             float64
MINUTE           float64
HUNDRED_BLOCK     object
NEIGHBOURHOOD     object
X                float64
Y                float64
Latitude         float64
Longitude        float64
dtype: object
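
The quality check reports missing values only in `HUNDRED_BLOCK` (6) and `NEIGHBOURHOOD` (1989), and shows `HOUR` and `MINUTE` stored as `float64`. A minimal sketch of one way to handle both follows, assuming it is acceptable to label missing locations with a placeholder rather than drop the rows. The helper `clean_quality_issues`, the `"Unknown"` placeholder, and the sample rows are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration; the second row simulates missing location fields.
sample = pd.DataFrame({
    "HUNDRED_BLOCK": ["9XX TERMINAL AVE", None, "9XX TERMINAL AVE"],
    "NEIGHBOURHOOD": ["Strathcona", None, "Strathcona"],
    "HOUR": [16.0, 15.0, 11.0],
    "MINUTE": [15.0, 20.0, 15.0],
})

def clean_quality_issues(df, placeholder="Unknown"):
    """Hypothetical helper: fill missing location text and store time parts as integers."""
    out = df.copy()
    # Keep rows with missing location fields, but make the gap explicit.
    for col in ["HUNDRED_BLOCK", "NEIGHBOURHOOD"]:
        out[col] = out[col].fillna(placeholder)
    # Nullable Int64 holds whole-number hours/minutes and still tolerates missing values.
    for col in ["HOUR", "MINUTE"]:
        out[col] = out[col].astype("Int64")
    return out

cleaned = clean_quality_issues(sample)
print(cleaned.isnull().sum())
print(cleaned.dtypes)
```

Whether to fill or drop the incomplete rows depends on the analysis: neighbourhood-level aggregation may be better served by excluding them, while city-wide temporal counts can keep them.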
