# 🛡️ TravelAware – Real-Time Crime Urgency & Response Time Prediction

This notebook develops two machine learning models for the **TravelAware safety assistant app**, which helps **tourists**, **international students**, and **residents** stay informed about local crime risks in real time.



###  Objective

To train and evaluate two models using historical police data:

1. **Crime Urgency Classifier**
   - Predicts the urgency level of a reported crime (**Low**, **Medium**, **High**).
   - Based on features like time, type, location, and patrol zone.

2. **Response Time Estimator**
   - Predicts the expected **response time in minutes** based on past response behavior.

These models are integrated into the app to provide:
- Smarter, real-time crime report handling
- Realistic urgency flags and estimated police arrival times
- A more trustworthy and helpful safety assistant experience

###  Dataset Used


- **Source**: Data for this project was sourced from the **2023 WRPS Annual Data Extract CSV**, published by the **Waterloo Regional Police Service**:
[https://wrps.ca/resource/2023-wrps-annual-data-extract-csv](https://wrps.ca/resource/2023-wrps-annual-data-extract-csv)
- **Context**: Real-world police call and dispatch data from Waterloo Region
- **Size**: 372,267 records across 23 columns, including multiple time and location fields



###  Technologies Used

- Python (Pandas, Scikit-learn, Joblib)
- Random Forest Classifier & Regressor
- ColumnTransformer for preprocessing
- OneHotEncoding for categorical features
- Evaluation using accuracy, precision, recall, and F1-score (classification) and MAE, RMSE, and R² (regression)



###  Integration

The trained models are saved as `.pkl` files:
- `crime_urgency_model.pkl`
- `response_time_model.pkl`

These are used in the **Streamlit-based TravelAware app** to power:
- The **“Report Crime”** form (urgency & ETA prediction)
- Smart alerts and user feedback based on AI reasoning


> ⚠️ This notebook focuses on modeling logic only. UI and user interaction features are handled in a separate Streamlit app.


### 📥 Step 1: Load and Inspect Dataset

We begin by loading the official crime dataset provided by the **Waterloo Regional Police Service** for the year 2023. This dataset is the foundation for building our urgency and response time prediction models.

**Tasks performed:**
- Load the dataset from the local path
- Check the dataset shape (rows × columns)
- View column names
- Count missing values in each column
- Preview the first few rows for structure understanding

These steps help identify:
- Which columns are useful for modeling
- How much preprocessing is required
- Any critical data quality issues

 **File loaded from:**  
`C:/Users/kittu/Desktop/Agile/WRPSAnnualDataExtract_2023.csv`


In [2]:
import pandas as pd

# Load dataset
file_path = r"C:\Users\kittu\Desktop\Agile\WRPSAnnualDataExtract_2023.csv"
df = pd.read_csv(file_path)

# Display basic structure for documentation
df_shape = df.shape
df_columns = df.columns.tolist()
df_nulls = df.isnull().sum().sort_values(ascending=False)

# Display first few rows for inspection
df_sample = df.head(3)


(df_shape, df_columns, df_nulls.head(10))


((372267, 23),
 ['occurrencefileno',
  'GeographicLocation',
  'NearestIntersectionLocation',
  'PatrolDivision',
  'PatrolZone',
  'Municipality',
  'ReportedDateandTime',
  'InitialCallType',
  'InitialCallTypeDescription',
  'FinalCallType',
  'FinalCallTypeDescription',
  'InitialPriority',
  'FinalPriority',
  'Disposition',
  'DispatchDateandTime',
  'ArrivalDateandTime',
  'ClearedDateandTime',
  'CallDispatchDelay',
  'CallTravelTime',
  'CallOnSceneTime',
  'CallResponseTime',
  'CallServiceTime',
  'TotalUnitServiceTime'],
 CallOnSceneTime               210630
 CallTravelTime                210627
 ArrivalDateandTime            210627
 CallResponseTime              210627
 Disposition                   193875
 Municipality                  192768
 InitialCallType               190750
 InitialPriority               190750
 InitialCallTypeDescription    190750
 TotalUnitServiceTime          190617
 dtype: int64)
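To put the null counts above in perspective, it can help to express them as a share of all rows. The sketch below re-enters four of the counts from the output by hand (it does not re-read the CSV), so the variable names are illustrative only.

```python
import pandas as pd

# Figures copied from the output above; nothing here re-reads the CSV.
total_rows = 372_267
null_counts = pd.Series({
    "CallOnSceneTime": 210_630,
    "CallResponseTime": 210_627,
    "Disposition": 193_875,
    "InitialPriority": 190_750,
})

# Share of rows in which each column is missing
null_share = (null_counts / total_rows).round(3)
print(null_share)

# Upper bound on the rows that carry a response time for the regression model
rows_with_response_time = total_rows - null_counts["CallResponseTime"]
print(rows_with_response_time)
```

More than half of the rows have no `CallResponseTime`, so the regression model in Step 5 is trained on a much smaller subset than the classifier.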

###  Step 2: Data Cleaning & Feature Engineering

To prepare the data for modeling, we clean and enrich the dataset with time-based features and categorize the crime priority.

**Key steps performed:**

- Created a working copy of the original dataset to preserve raw data
- Removed rows with missing values in:
  - `FinalPriority`
  - `FinalCallTypeDescription`
  - `ReportedDateandTime`
- Converted the `ReportedDateandTime` column to datetime format
- Extracted new temporal features:
  - `Hour` (0–23)
  - `DayOfWeek` (e.g., Monday, Tuesday)
  - `Month` (1–12)

**Target Variable Preparation:**
- Created a new column `PriorityLevel` to classify crime urgency into:
  - **High**: Final priority 1 or 2
  - **Medium**: Final priority 3
  - **Low**: Final priority 4 or above

These transformations help make the data suitable for training a classification model to predict crime urgency levels.

 We also inspect the distribution of the new `PriorityLevel` classes to understand class balance.


In [3]:
# Copy of original data
df_clean = df.copy()

# Drop rows missing FinalPriority, FinalCallTypeDescription, or ReportedDateandTime (all required for Model 1)
df_clean = df_clean.dropna(subset=["FinalPriority", "FinalCallTypeDescription", "ReportedDateandTime"])

# Convert ReportedDateandTime to datetime
df_clean["ReportedDateandTime"] = pd.to_datetime(df_clean["ReportedDateandTime"], errors="coerce")

# Drop any rows where datetime conversion failed
df_clean = df_clean.dropna(subset=["ReportedDateandTime"])

# Extract useful time features
df_clean["Hour"] = df_clean["ReportedDateandTime"].dt.hour
df_clean["DayOfWeek"] = df_clean["ReportedDateandTime"].dt.day_name()
df_clean["Month"] = df_clean["ReportedDateandTime"].dt.month

# Bin FinalPriority into categorical target: Low, Medium, High
def priority_level(x):
    if x <= 2:
        return "High"
    elif x == 3:
        return "Medium"
    else:
        return "Low"

df_clean["PriorityLevel"] = df_clean["FinalPriority"].apply(priority_level)

# Check class distribution
priority_counts = df_clean["PriorityLevel"].value_counts()

# Preview cleaned structure
cleaned_preview = df_clean[[
    "FinalCallTypeDescription", "Municipality", "PatrolZone", "Hour", "DayOfWeek", "Month", "FinalPriority", "PriorityLevel"
]].head(3)


priority_counts


PriorityLevel
Low       264342
High       84110
Medium     23774
Name: count, dtype: int64
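The counts above can be turned into class shares, which also give the accuracy of a trivial model that always predicts the most common class. This is a small arithmetic sketch using the printed counts; the variable names are not from the original notebook.

```python
# Class counts copied from the output above
priority_counts = {"Low": 264_342, "High": 84_110, "Medium": 23_774}
total = sum(priority_counts.values())

# Share of each urgency level in the cleaned data
shares = {level: count / total for level, count in priority_counts.items()}
for level, share in shares.items():
    print(f"{level}: {share:.1%}")

# Accuracy of a trivial model that always predicts the most common class
majority_level = max(priority_counts, key=priority_counts.get)
baseline_accuracy = shares[majority_level]
print(majority_level, round(baseline_accuracy, 3))
```

Roughly 71% of calls are Low priority, so any classifier should be judged against that baseline rather than against 50%.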

###  Step 3: Train Crime Urgency Classification Model

We build a machine learning model to **predict the urgency level of a reported crime** (High, Medium, or Low).

**Features selected:**
- Categorical: `FinalCallTypeDescription`, `Municipality`, `PatrolZone`, `DayOfWeek`
- Numeric: `Hour`, `Month`

**Steps performed:**

1. **Data Preparation**
   - Replaced missing values in features with `"Unknown"`
   - Split data into training and test sets (80/20 split with stratification)

2. **Preprocessing Pipeline**
   - Applied `OneHotEncoding` to categorical columns
   - Passed numeric features through unchanged

3. **Modeling**
   - Used a `RandomForestClassifier` with 100 trees
   - Set `class_weight='balanced'` to handle class imbalance in urgency levels

4. **Evaluation**
   - Generated predictions on the test set
   - Computed a `classification_report` to measure performance (precision, recall, F1-score)

The classification report shows how well the model separates High, Medium, and Low priority calls on the held-out test set. The trained pipeline can then be used in the TravelAware app to estimate urgency from the features of a new report.
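To make the effect of `class_weight='balanced'` concrete, the sketch below applies scikit-learn's documented heuristic, `n_samples / (n_classes * class_count)`, to the class counts from Step 2. It uses the counts for the full cleaned data as an approximation; the model itself sees only the training split, which is stratified and should have nearly the same proportions.

```python
# Class counts from Step 2 (full cleaned data, used here as an approximation
# of the stratified training split).
counts = {"Low": 264_342, "High": 84_110, "Medium": 23_774}

n_samples = sum(counts.values())
n_classes = len(counts)

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * class_count)
balanced_weights = {
    level: n_samples / (n_classes * count) for level, count in counts.items()
}
for level, weight in balanced_weights.items():
    print(f"{level}: {weight:.2f}")
```

The rare Medium class ends up weighted about eleven times more heavily than Low, which is how the forest is pushed to pay attention to it.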


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np

# Select features and target
features = ["FinalCallTypeDescription", "Municipality", "PatrolZone", "Hour", "DayOfWeek", "Month"]
target = "PriorityLevel"

X = df_clean[features]
y = df_clean[target]

# Handle missing values in features by replacing with "Unknown"
X = X.fillna("Unknown")

# Define categorical and numeric columns
categorical_cols = ["FinalCallTypeDescription", "Municipality", "PatrolZone", "DayOfWeek"]
numeric_cols = ["Hour", "Month"]

# Create column transformer for encoding
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)
], remainder='passthrough')

# Define pipeline with preprocessing and classifier
model_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Train model
model_pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = model_pipeline.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()


report_df.round(2)


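|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| High         | 0.99      | 0.98   | 0.98     | 16822.0 |
| Low          | 0.99      | 0.99   | 0.99     | 52869.0 |
| Medium       | 0.96      | 0.98   | 0.97     | 4755.0  |
| accuracy     | 0.99      | 0.99   | 0.99     | 0.99    |
| macro avg    | 0.98      | 0.98   | 0.98     | 74446.0 |
| weighted avg | 0.99      | 0.99   | 0.99     | 74446.0 |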
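An accuracy of 0.99 is high enough that it is worth checking how much of it comes from a single feature. One hedged way to probe this, which is not part of the original notebook, is to score a plain lookup table that maps each call type to its most common priority level. The sketch below runs on a small invented frame; the same expression could be applied to `df_clean`.

```python
import pandas as pd

# Toy stand-in for df_clean; the rows are invented for illustration only.
toy = pd.DataFrame({
    "FinalCallTypeDescription": ["THEFT", "THEFT", "THEFT", "ASSAULT", "ASSAULT", "NOISE"],
    "PriorityLevel": ["Low", "Low", "Medium", "High", "High", "Low"],
})

# For each call type, count the rows that fall in its most common priority level
top_level_counts = (
    toy.groupby("FinalCallTypeDescription")["PriorityLevel"]
    .agg(lambda levels: levels.value_counts().iloc[0])
)

# Accuracy of a lookup table that maps call type -> its most common level
lookup_accuracy = top_level_counts.sum() / len(toy)
print(round(lookup_accuracy, 3))
```

If the lookup score on the real data turned out to be close to 0.99, that would suggest the forest is mostly learning the call-type-to-priority mapping and that the time and location features add little.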


###  Step 4: Save the Classification Model

After training and evaluating the urgency classification model, we save it with `joblib` for later use in deployment, e.g., inside the TravelAware Streamlit app.


In [5]:
import joblib
joblib.dump(model_pipeline, "crime_urgency_model.pkl")


['crime_urgency_model.pkl']
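The saved pipeline bundles preprocessing and the classifier, so the app only needs to load the file and pass in a one-row DataFrame with the training column names. The sketch below shows that round trip on a tiny invented dataset so it runs without the WRPS file. The file name `toy_urgency_model.pkl`, the patrol zone labels, and the training rows are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["FinalCallTypeDescription", "Municipality", "PatrolZone", "DayOfWeek"]

# Tiny invented training set so the sketch runs without the WRPS file
train = pd.DataFrame({
    "FinalCallTypeDescription": ["ASSAULT", "NOISE", "ASSAULT", "NOISE"],
    "Municipality": ["Kitchener", "Waterloo", "Waterloo", "Kitchener"],
    "PatrolZone": ["Z1", "Z2", "Z2", "Z1"],
    "Hour": [23, 14, 2, 10],
    "DayOfWeek": ["Friday", "Monday", "Saturday", "Tuesday"],
    "Month": [7, 3, 7, 11],
})
labels = ["High", "Low", "High", "Low"]

pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough",
    )),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(train, labels)

# Save and reload, as the app would do at start-up
model_path = os.path.join(tempfile.mkdtemp(), "toy_urgency_model.pkl")
joblib.dump(pipeline, model_path)
loaded = joblib.load(model_path)

# One new report with the same column names (and order) used in training
new_report = pd.DataFrame([{
    "FinalCallTypeDescription": "ASSAULT",
    "Municipality": "Cambridge",  # unseen category, ignored by the encoder
    "PatrolZone": "Z1",
    "Hour": 21,
    "DayOfWeek": "Friday",
    "Month": 7,
}])
prediction = loaded.predict(new_report)[0]
print(prediction)
```

Because the encoder was created with `handle_unknown="ignore"`, a category that never appeared in training does not raise an error; it is simply encoded as all zeros.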

###  Step 5: Predicting Emergency Response Time

To make **TravelAware** more practical and timely, we developed a **regression model** that estimates how long authorities might take to respond to a reported crime.

This prediction helps users set realistic expectations and builds trust in the app’s safety system.

####  Key Details

- **Objective**: Predict the expected response time (in minutes) based on crime-related context.
- **Target Variable**: `CallResponseTime`
- **Features Used**:
  - Crime type (e.g., Assault, Robbery)
  - Municipality and Patrol Zone
  - Time of report (Hour, Day of Week, Month)

####  Approach

- Filtered the dataset to include only rows with a non-missing `CallResponseTime`.
- Handled missing feature values and encoded categorical features.
- Used a **Random Forest Regressor**, which can capture non-linear relationships between the features and response time.
- Split the data 80/20 into training and test sets to estimate how well the model generalizes.

####  Why This Matters

With this model, TravelAware can go beyond alerting users to risks—it can **estimate how quickly help might arrive** in an emergency. This makes the system more **informative, responsive**, and **useful in real-world situations**.


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Keep only rows that have a CallResponseTime (target for regression)
df_reg = df_clean.dropna(subset=["CallResponseTime"])

# Select features (same as classification for consistency)
features_reg = ["FinalCallTypeDescription", "Municipality", "PatrolZone", "Hour", "DayOfWeek", "Month"]
target_reg = "CallResponseTime"

X_reg = df_reg[features_reg].fillna("Unknown")
y_reg = df_reg[target_reg]

# Same column roles as in the classification pipeline
categorical_cols = ["FinalCallTypeDescription", "Municipality", "PatrolZone", "DayOfWeek"]
numeric_cols = ["Hour", "Month"]

# Preprocessing
preprocessor_reg = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)
], remainder="passthrough")

# Model pipeline
reg_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor_reg),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=42))
])

# Train/test split
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Fit model
reg_pipeline.fit(Xr_train, yr_train)

# Predict
yr_pred = reg_pipeline.predict(Xr_test)



###  Step 6: Evaluating the Regression Model

To assess the performance of our emergency response time predictor, we evaluated the model using three standard regression metrics:

####  Metrics Used

- **Mean Absolute Error (MAE)**: Measures the average absolute difference between predicted and actual response times. Lower is better.
- **Root Mean Squared Error (RMSE)**: Penalizes larger errors more heavily. Useful when we want to avoid large under- or over-estimations.
- **R² Score (Coefficient of Determination)**: Indicates how well the model explains the variability of the target variable. A score closer to 1 means better fit.

These metrics provide a balanced view of how accurately and reliably the model predicts emergency response times, which is vital for building user trust in the TravelAware system.
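The three metrics can also be computed by hand, which makes their definitions explicit. The sketch below uses invented numbers (not WRPS data) and also shows why R² = 0 is the reference point: a model that always predicts the mean of the evaluation targets scores exactly zero.

```python
import numpy as np

# Invented example values, not taken from the WRPS data
y_true = np.array([5.0, 10.0, 15.0, 30.0])
y_pred = np.array([6.0, 9.0, 20.0, 20.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))          # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # square root of the mean squared error

ss_res = np.sum(errors ** 2)                    # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

# A model that always predicts the mean of y_true scores exactly R² = 0,
# so a negative R² means doing worse than that constant guess.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = 1 - np.sum((y_true - mean_pred) ** 2) / ss_tot

print(mae, rmse, r2, r2_mean)
```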


In [9]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(yr_test, yr_pred)
mse = mean_squared_error(yr_test, yr_pred)
rmse = np.sqrt(mse)
r2 = r2_score(yr_test, yr_pred)

print("MAE:", mae)
print("RMSE:", rmse)
print("R²:", r2)


MAE: 11341.114363694778
RMSE: 69484.25337006342
R²: -0.037825572412713004
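These error values are large enough to warrant a sanity check on units. The notebook text describes the target as minutes, but the unit of `CallResponseTime` is not shown in the output, so the sketch below converts the reported errors under both readings. Treating the column as seconds is an assumption, not something the source confirms.

```python
# Error values copied from the output above
mae = 11341.114363694778
rmse = 69484.25337006342

# Reading 1 (assumption): CallResponseTime is recorded in seconds
mae_minutes_if_seconds = mae / 60
rmse_hours_if_seconds = rmse / 3600
print(f"MAE  if seconds: {mae_minutes_if_seconds:.1f} minutes")
print(f"RMSE if seconds: {rmse_hours_if_seconds:.1f} hours")

# Reading 2 (as the notebook text states): the column is in minutes
mae_days_if_minutes = mae / (60 * 24)
print(f"MAE  if minutes: {mae_days_if_minutes:.1f} days")

# RMSE is several times MAE, which usually points to a few very large errors
ratio = rmse / mae
print(f"RMSE / MAE: {ratio:.1f}")
```

Under either reading the typical error is far longer than a plausible police response, and the gap between RMSE and MAE suggests a heavy tail of extreme values in the target. Inspecting the distribution of `CallResponseTime` before modeling would be a reasonable next check.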


### Step 7: Saving the Response Time Prediction Model

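Once the regression model was trained and evaluated, we saved it to disk as `response_time_model.pkl` using `joblib` so the TravelAware app can load it. Because the evaluation above produced an R² of about -0.04 (slightly worse than always predicting the mean response time), the saved model should be treated as a first baseline rather than a reliable estimator.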

#### Why This Matters:
- Ensures the trained model can be reused in real-time applications without retraining.
- Makes it easy to deploy and test across different environments.
- Supports modular development and separation of training vs. inference.

This saved model is now ready to be integrated into the Streamlit-based app to simulate and display estimated emergency response times for newly reported incidents.


In [10]:
import joblib
joblib.dump(reg_pipeline, "response_time_model.pkl")


['response_time_model.pkl']
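At inference time the app has to rebuild the same six feature columns that were used in training, including the time features derived from the report timestamp. The helper below is a hypothetical sketch of that step. `build_features`, its signature, and the example values are illustrative and do not come from the original notebook or app.

```python
import pandas as pd


def build_features(call_type, municipality, patrol_zone, reported_at):
    """Return a one-row DataFrame shaped like the training features.

    Hypothetical helper: the name and signature are illustrative and are
    not part of the original notebook or app.
    """
    ts = pd.Timestamp(reported_at)
    return pd.DataFrame([{
        "FinalCallTypeDescription": call_type,
        "Municipality": municipality,
        "PatrolZone": patrol_zone,
        "Hour": ts.hour,              # 0-23, as in Step 2
        "DayOfWeek": ts.day_name(),   # e.g. "Friday"
        "Month": ts.month,            # 1-12
    }])


# Example values are invented; "Z1" is a placeholder patrol zone label
row = build_features("ASSAULT", "Kitchener", "Z1", "2023-07-14 22:30")
print(row.to_dict(orient="records")[0])
```

A row built this way could then be passed to `joblib.load("response_time_model.pkl").predict(row)` or to the urgency model, since both pipelines expect the same columns.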

---

##  Conclusion

In this notebook, we built two key machine learning models to power the **TravelAware** real-time safety assistant:

1. **Crime Urgency Classification Model**  
   Predicts the urgency level (High / Medium / Low) of a reported crime based on time, location, and type.

2. **Emergency Response Time Regression Model**  
   Estimates the expected response time (in minutes) from authorities for a reported incident.

#### Key Takeaways:
- Data cleaning and time-based feature engineering were crucial for meaningful insights.
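- Categorical variables were one-hot encoded so they could be used by the tree-based models.
- Both models used **Random Forest** without hyperparameter tuning. The classifier reached 0.99 accuracy on the test set, while the regressor's negative R² shows that response time prediction still needs work.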
- The trained models were exported using `joblib` for easy deployment into a Streamlit application.

These models are meant to give users **real-time, data-driven safety insights**, making the app a useful tool for tourists, students, and locals navigating unfamiliar or risky areas.

---

###  Next Steps
- Integrate these models into the final Streamlit app interface.
- Add geolocation-triggered alerts, SOS functionality, and personalized safety feeds.
- Continuously monitor and retrain models using fresh data for improved accuracy over time.

Thank you for exploring this project.
