# CRISP-DM Framework for Predicting Hostel Prices on UCC Campus


## 1. Business Understanding

* **Objective:** To develop a predictive pricing model for hostels on the University of Cape Coast (UCC) campus, enabling students and stakeholders to make data-driven accommodation decisions.
* **Business Goals:**

  * Understand the determinants of hostel pricing.
  * Provide a reliable tool to forecast hostel prices based on selected features.
  * Support equitable pricing and informed budget planning for students.


## 2. Data Understanding

In [4]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import joblib
import streamlit as st


In [7]:
# Load the hostel survey dataset
df = pd.read_csv('../data/hostel_prices.csv')
df.head()


| Unnamed: 0 | gender | age_group | level_of_study | lecture_location | accommodation_type | faculty | off_campus_duration | room_category | annual_rent | includes_water | ... | room_size | furnished_bed | furnished_table | furnished_chairs | has_access_controls | has_janitorial_services | required_deposit | recent_rent_increase | avg_rent_nearby | hostel_location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 25-30 | Postgraduate | Science | 2 in a room | Faculty of Social Sciences | Less than 6 months | Shared washroom- shared kitchen | 4000.0 | No | ... | 20 - 30 sqm | Yes | Yes | Yes | Yes | No | 2500.0 | 500.0 | 5000.0 | Kwaprow |
| 1 | Male | 35-40 | Postgraduate | New Site | Private room(1 in a room) | School of Business | 3 years or more | Full self contain | 5000.0 | Yes | ... | 15 - 20 sqm | Yes |  |  | No | No | 2500.0 | 1500.0 | 4000.0 | Amamoma |
| 2 | Male | 25-30 | Postgraduate | Science | Private room(1 in a room) | Faculty of Social Sciences | 6 months to 1 year | Full self contain | 4500.0 | No | ... | 15 - 20 sqm | Yes | Yes | Yes | No | No | 2500.0 | 1000.0 | 3500.0 | Domeabra |
| 3 | Male | 26-30 | Post Graduate | Science | 5+ in a room | Faculty of Education | Between a year and 2 years | Shared washroom- No kitchen | 5000.0 | Yes | ... | 20 - 30 sqm | Yes | No | No | No | No |  |  |  |  |
| 4 | Male | 26-30 | Fourth year | Science | 2 in a room | Faculty of Education | Between a year and 2 years | Full self contain | 3000.0 | No | ... | 20 - 30 sqm | No | No | No | No | No |  |  |  | Apewosika |


In [8]:
# Summary of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 502 entries, 0 to 501
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   gender                      480 non-null    object 
 1   age_group                   502 non-null    object 
 2   level_of_study              500 non-null    object 
 3   lecture_location            500 non-null    object 
 4   accommodation_type          501 non-null    object 
 5   faculty                     501 non-null    object 
 6   off_campus_duration         501 non-null    object 
 7   room_category               500 non-null    object 
 8   annual_rent                 500 non-null    float64
 9   includes_water              500 non-null    object 
 10  includes_electricity        499 non-null    object 
 11  includes_waste_disposal     501 non-null    object 
 12  has_running_water           500 non-null    object 
 13  has_extra_storage           494 non

### Data Quality Insights

1. **Missing Values**
   Several columns have missing (null) values:

   * `gender`: 22 missing
   * `level_of_study`: 2 missing
   * `lecture_location`: 2 missing
   * `accommodation_type`: 1 missing
   * `faculty`: 1 missing
   * `off_campus_duration`: 1 missing
   * `room_category`: 2 missing
   * `annual_rent`: 2 missing
   * `includes_water`: 2 missing
   * `includes_electricity`: 3 missing
   * `includes_waste_disposal`: 1 missing
   * `has_running_water`: 2 missing
   * `has_extra_storage`: 8 missing
   * `has_wifi_internet`: 16 missing
   * `has_study_area`: 9 missing
   * `has_security_services`: 13 missing
   * `has_generator_backup_power`: 9 missing
   * `furnished_table`: 2 missing
   * `furnished_chairs`: 3 missing
   * `has_access_controls`: 12 missing
   * `has_janitorial_services`: 10 missing
   * `required_deposit`: 2 missing
   * `recent_rent_increase`: 8 missing
   * `avg_rent_nearby`: 5 missing
   * `hostel_location`: 3 missing

2. **Inconsistent Data Types**

   * Some columns that represent numerical values (like `hostel_distance_minutes`) are stored as `object`, which could indicate inconsistent formatting (e.g., "10 minutes" instead of a numeric value).

3. **Potential Categorical Misclassification**

   * Many `object` columns may be better treated as categorical for analysis and memory optimization.

4. **Non-Standardized Entries Likely**

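   * Free-text or categorical columns (e.g., `room_category`, `commute_mode`, `hostel_location`) may contain inconsistent spellings or capitalization, which can affect grouping and analysis. The first five rows already show this: `level_of_study` contains both "Postgraduate" and "Post Graduate", and `age_group` contains the overlapping bins "25-30" and "26-30".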

5. **Data Completeness**

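   * Most columns are nearly complete: the largest gap listed above, `gender`, is missing 22 of 502 rows (about 4.4%). Columns with more than 10 missing entries (`gender`, `has_wifi_internet`, `has_security_services`, `has_access_controls`) may require imputation or exclusion depending on analysis goals.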
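The checks described above can be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual cleaning code: the helper `normalize_label`, the alias mapping, the derived column names `distance_min` and `level_norm`, and the sample values are all assumptions introduced here.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are invented for illustration).
toy = pd.DataFrame({
    "gender": ["Male", None, "Female", "Male"],
    "level_of_study": ["Postgraduate", "Post Graduate", "Fourth year", None],
    "hostel_distance_minutes": ["10 minutes", "15", None, "about 20 mins"],
})

# 1. Missing values per column, largest first.
missing_counts = toy.isna().sum().sort_values(ascending=False)

# 2. Extract the first run of digits from a free-text duration column
#    and convert it to a numeric dtype; unparseable entries become NaN.
toy["distance_min"] = pd.to_numeric(
    toy["hostel_distance_minutes"].str.extract(r"(\d+)", expand=False),
    errors="coerce",
)

# 3. Standardize labels: trim, lowercase, collapse whitespace, then map
#    known spelling variants onto one canonical value (hypothetical mapping).
def normalize_label(series: pd.Series, aliases: dict) -> pd.Series:
    cleaned = series.str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
    return cleaned.replace(aliases)

toy["level_norm"] = normalize_label(
    toy["level_of_study"], {"post graduate": "postgraduate"}
).astype("category")

print(missing_counts)
print(toy[["distance_min", "level_norm"]])
```

Storing the cleaned labels as `category` also addresses the memory point raised under item 3.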





* **Data Sources:**

  * Hostel listing platforms, student accommodation databases, and manual surveys.
  * Variables include: historical prices, room types, amenities, proximity to campus, occupancy rates, internet access, furnishing level, and security features.
* **Initial Data Exploration:**

  * Summary statistics for numeric and categorical variables.
  * Identification of outliers and anomalies.
  * Correlation analysis between price and other features.
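A rough sketch of these three exploration steps is shown here. The data are invented for illustration (only the column names come from the `df.info()` output), and the 1.5 × IQR rule in the hypothetical helper `iqr_outliers` is one common outlier convention, not necessarily the one this project uses.

```python
import pandas as pd

# Made-up rents and deposits for illustration only.
toy = pd.DataFrame({
    "annual_rent": [3000.0, 4000.0, 4500.0, 5000.0, 5000.0, 20000.0],
    "required_deposit": [1500.0, 2500.0, 2500.0, 2500.0, 3000.0, 9000.0],
    "room_category": ["Full self contain"] * 3 + ["Shared washroom- No kitchen"] * 3,
})

# Summary statistics: numeric columns, then counts for a categorical one.
numeric_summary = toy.describe()
category_counts = toy["room_category"].value_counts(dropna=False)

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

outlier_mask = iqr_outliers(toy["annual_rent"])

# Pearson correlation of every numeric feature with the target.
rent_correlations = toy.corr(numeric_only=True)["annual_rent"].drop("annual_rent")

print(numeric_summary)
print(category_counts)
print(toy[outlier_mask])
print(rent_correlations)
```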

## 3. Data Preparation

* **Data Cleaning:**

  * Impute or drop missing values using domain-relevant strategies.
  * Remove duplicates and standardize entries (e.g., consistent room types).
* **Feature Engineering:**

  * Calculate distance to key locations (e.g., lecture halls, markets).
  * Group rooms by size or facility level (e.g., standard vs. executive).
  * Create binary features for amenities (e.g., Wi-Fi available: yes/no).
* **Encoding & Transformation:**

  * Apply label or one-hot encoding for categorical variables.
  * Normalize or scale numeric values where necessary.
* **Data Splitting:**

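  * Split the dataset into training (70–80%) and testing (20–30%) sets. Because `annual_rent` is continuous, any stratification would have to use binned rent values or a key categorical feature such as `room_category`.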
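The preparation steps could look something like the sketch below. It runs on a small invented frame, and the 75/25 split and the choice to leave feature imputation to the later modeling pipeline are assumptions made for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented rows for illustration; column names follow df.info() above.
toy = pd.DataFrame({
    "room_category": [
        "Full self contain", "full self contain ", "Shared washroom- No kitchen",
        "Full self contain", None, "Shared washroom- No kitchen",
        "Full self contain", "Shared washroom- No kitchen",
        "Full self contain", "Full self contain",
    ],
    "includes_water": ["Yes", "No", "No", "Yes", "Yes", None, "No", "Yes", "No", "No"],
    "annual_rent": [5000.0, 4500.0, 3000.0, 5200.0, 4000.0,
                    2800.0, None, 3100.0, 4800.0, 4800.0],
})

clean = toy.copy()

# Standardize entries so spelling variants of one category compare equal.
clean["room_category"] = clean["room_category"].str.strip().str.lower()

# Remove exact duplicates, then rows without a target value.
clean = clean.drop_duplicates().dropna(subset=["annual_rent"])

# Binary amenity feature: Yes/No -> 1/0 (missing answers stay NaN for now).
clean["includes_water"] = clean["includes_water"].map({"Yes": 1, "No": 0})

# One-hot encode the remaining categorical column.
features = pd.get_dummies(clean.drop(columns="annual_rent"), columns=["room_category"])
target = clean["annual_rent"]

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Missing feature values are deliberately not imputed here: fitting an imputer before the split would let test rows influence the training data.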

## 4. Modeling

* **Model Selection:**

  * Begin with baseline Linear Regression for interpretability.
  * Explore ensemble models such as Random Forest and XGBoost for improved accuracy.
* **Pipeline Creation:**

  * Construct a complete machine learning pipeline that includes preprocessing steps (imputation, encoding, scaling) and the regression model.
  * This ensures that the entire workflow is reproducible and maintainable.
* **Training and Tuning:**

  * Perform hyperparameter tuning using GridSearchCV or RandomizedSearchCV.
  * Use cross-validation to estimate out-of-sample error and detect overfitting during tuning.
* **Performance Metrics:**

  * Root Mean Squared Error (RMSE)
  * Mean Absolute Error (MAE)
  * R² Score
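One way the pipeline, tuning, and metrics could fit together is sketched below. The data are synthetic (the rent formula is invented), only two features are used, and the parameter grid is an arbitrary small example; none of this reflects results from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: the rent formula below is invented for illustration.
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "room_category": rng.choice(
        ["Full self contain", "Shared washroom- No kitchen"], size=n
    ),
    "required_deposit": rng.uniform(1000, 3000, size=n),
})
toy["annual_rent"] = (
    2000
    + 1500 * (toy["room_category"] == "Full self contain")
    + 0.8 * toy["required_deposit"]
    + rng.normal(0, 100, size=n)
)
toy.loc[::15, "required_deposit"] = np.nan  # a few gaps to exercise the imputer

X = toy.drop(columns="annual_rent")
y = toy["annual_rent"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocessing lives inside the pipeline so it is re-fit on each CV fold.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["required_deposit"]),
    ("cat", categorical, ["room_category"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(random_state=0)),
])

# Small example grid, scored by RMSE with 5-fold cross-validation.
search = GridSearchCV(
    pipeline,
    param_grid={"model__n_estimators": [50, 100], "model__max_depth": [None, 5]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

pred = search.predict(X_test)
metrics = {
    "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    "mae": float(mean_absolute_error(y_test, pred)),
    "r2": float(r2_score(y_test, pred)),
}
print(search.best_params_, metrics)
```

Swapping `RandomForestRegressor` for `LinearRegression` in the final step gives the interpretable baseline without changing the preprocessing.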

## 5. Evaluation

* **Model Validation:**

  * Compare predictions with actual prices from the test set.
  * Use residual plots and error distributions to assess model robustness.
* **Interpretation and Insights:**

  * Examine feature importances to understand pricing drivers.
  * Identify actionable insights for pricing strategies or student decision-making.
* **Business Review:**

  * Verify whether the model aligns with user needs and institutional goals.
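  * Check that predictions are plausible and that errors are not systematically larger for particular groups (e.g., by `hostel_location` or `room_category`).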
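As a minimal sketch of the validation and interpretation steps, the example below computes test-set residuals and ranks features by permutation importance. Permutation importance is one option chosen here for illustration (the text above does not specify which importance measure is used), and the data, including the deliberately irrelevant `noise_feature`, are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: rent depends on the deposit only (invented relationship).
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "required_deposit": rng.uniform(1000, 3000, size=n),
    "noise_feature": rng.normal(size=n),  # unrelated to rent by construction
})
y = 2000 + 0.8 * X["required_deposit"] + rng.normal(0, 100, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Residuals = actual minus predicted; a mean far from zero suggests bias.
residuals = y_test - model.predict(X_test)
residual_summary = residuals.describe()

# Permutation importance: drop in test score when one feature is shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = pd.Series(
    result.importances_mean, index=X.columns
).sort_values(ascending=False)

print(residual_summary)
print(importances)
```

Grouping `residuals` by a categorical column such as `hostel_location` would be a simple way to look for the group-level bias mentioned under Business Review.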

## 6. Deployment

* **Model Persistence:**

  * Save the finalized pipeline using serialization tools (e.g., `joblib` or `pickle`) to ensure the model can be reused without retraining.
* **Streamlit App Integration:**

  * Build a user-friendly Streamlit web application that loads the persisted pipeline and allows users to input features (e.g., room type, amenities, distance to campus) to receive instant price predictions.
  * The app will serve as a decision support tool for students, hostel managers, and administrators.
* **Monitoring and Maintenance:**

  * Regularly retrain the model as new data becomes available.
  * Implement drift detection to ensure model reliability over time.
* **Documentation:**

  * Maintain clear documentation covering model development, data sources, assumptions, and usage instructions.
  * Provide user manuals and technical notes for all stakeholders.
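The persistence step might be sketched as follows: fit a small pipeline, save it with `joblib`, then reload it inside a prediction helper of the kind an app could call. The training rows, the file name `hostel_price_pipeline.joblib`, and the helper `predict_rent` are hypothetical, introduced only for this example.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny invented training set, just enough to fit a pipeline.
train = pd.DataFrame({
    "room_category": ["Full self contain", "Shared washroom- No kitchen"] * 3,
    "required_deposit": [2500.0, 1500.0, 2600.0, 1400.0, 2400.0, 1600.0],
    "annual_rent": [5000.0, 3000.0, 5200.0, 2900.0, 4900.0, 3100.0],
})

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["room_category"])],
        remainder="passthrough",  # numeric columns pass through unchanged
    )),
    ("model", LinearRegression()),
])
pipeline.fit(train.drop(columns="annual_rent"), train["annual_rent"])

# Persist preprocessing and model together as one artifact.
model_path = Path(tempfile.mkdtemp()) / "hostel_price_pipeline.joblib"
joblib.dump(pipeline, model_path)

def predict_rent(path: Path, room_category: str, required_deposit: float) -> float:
    """Load the saved pipeline and predict rent for one set of inputs."""
    loaded = joblib.load(path)
    row = pd.DataFrame([{
        "room_category": room_category,
        "required_deposit": required_deposit,
    }])
    return float(loaded.predict(row)[0])

estimate = predict_rent(model_path, "Full self contain", 2500.0)
print(round(estimate, 2))
```

In a Streamlit app, the arguments to `predict_rent` would typically come from widgets such as `st.selectbox` and `st.number_input`, with the result shown via `st.write`. Loading the pipeline once at startup rather than on every call would be the natural refinement.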

---

**Conclusion:**
By following the CRISP-DM methodology, this project not only ensures a structured and reliable approach to predicting hostel prices but also operationalizes the solution by deploying a usable and interactive web application. The integration of a full machine learning pipeline with a persisted model ensures scalability, reproducibility, and user accessibility through the Streamlit interface.


