# Using the CRISP-DM Method for MLN601 (Machine Learning)


## Assessment 3: ML Project 

#### By: Hassan Ya'u Hamisu - A00133000

##### Due Date: Wednesday, 21/08/2024 @ 11:55 PM





# Stage 1: Business Understanding

## 1.1 Determining Business Objectives

### 1.1.1 Background
Bike-sharing systems have gained significant traction in urban areas as a sustainable and convenient means of transportation. Predicting bike rental demand is crucial for optimizing the allocation of bikes, ensuring availability, improving overall operational efficiency, and satisfying customers’ needs (Torrens University Australia, 2024).

El-Assi et al. (2017) studied the factors affecting bike-share ridership in Toronto, Canada, using year-round trip data. Their research highlights the impact of socio-demographic attributes, land use, built environment, and weather on bike-sharing demand, and shows that road network configuration, bike infrastructure, and temperature significantly influence ridership.



### 1.1.2 Business Objectives
Machine learning, a subset of Artificial Intelligence, has significantly enhanced inventory automation, cutting costs associated with misplaced items by 40% to 60% (Bokrantz et al., 2023). However, without meticulous planning, execution, and ongoing oversight of the AI model, achieving success remains uncertain, as 75 to 85 percent of ML projects fail to fulfill their sponsors' expectations (Studer et al., 2021). Presently, CRISP-DM stands as the most effective methodology for managing ML projects (Studer et al., 2021), and it will be utilized to ensure the success of this project. 

<br>
The data for this project, sourced from Hadi (2009), covers a bike-sharing company called Capital Bikeshare in Washington D.C. and includes weather and seasonal information. The project will be structured around the CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment (Leaper, 2024).
<br>

The primary objective is to develop a robust predictive model that accurately forecasts daily bike rental demand based on historical data, weather conditions, calendar information, and user type, aiding in efficient bike distribution and minimizing shortages or surpluses at docking stations.
<br>

To achieve the optimal model and avoid the bias associated with relying on a single technique, I will develop four distinct models using four different algorithms, compare their performances, and select the best one.

### 1.1.3 Business Success Criteria
Success will be evaluated by the accuracy of the selected predictive model, measured by Mean Absolute Error (MAE), R-squared expressed as a percentage, and Mean Absolute Percentage Error (MAPE), and by its ability to improve resource allocation, leading to increased satisfaction among stakeholders (customers, entrepreneurs, and government) and enhanced operational efficiency. The daily predictions also help identify peak usage days.
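
The three metrics named above can be sketched in plain Python as follows. This is a minimal illustration, not the project's evaluation code: the function names and the sample counts are hypothetical, and MAPE as written assumes no actual value is zero.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared_percent(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R-squared expressed as a percentage (100 means a perfect fit)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 100.0 * (1.0 - ss_res / ss_tot)


def mean_absolute_percentage_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAPE as a percentage; assumes no actual value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical daily rental counts, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 380.0]

mae = mean_absolute_error(actual, predicted)
r2_pct = r_squared_percent(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
print(f"MAE={mae:.2f}  R2={r2_pct:.2f}%  MAPE={mape:.2f}%")
```

Lower MAE and MAPE and a higher R-squared percentage indicate a better model.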


## 1.2 Assessing Situation

### 1.2.1 Inventory of Resources, Requirements, Assumptions, and Constraints
Blomster and Koivumäki (2022) emphasize that every entity needs diverse resources, capabilities, and competencies to advance a machine-learning project effectively. As a result, the resources for this project encompass:
<br>
- **Personnel:** Artificial Intelligence/Machine Learning Engineer (myself).
- **Resources:** Historical bike rental data, weather data, and calendar information.
- **Requirements:** Forecast bike rentals with high accuracy and deploy the model for operational use.
- **Assumptions:** Historical data is assumed to reflect future trends, and historical patterns are expected to persist without significant interruptions. External factors such as promotions and roadworks are not considered.
- **Constraints:** Limited to historical data and available features; computational resources for training models.

### 1.2.2 Risk and Contingencies
Krishnan (2015) underscores the critical role of risk identification in machine learning projects. For this project, the following risks have been identified:
- **Data Quality Issues:** Missing or inaccurate data can impact model performance. Mitigation includes data cleaning and validation steps.
- **Model Overfitting:** Models might overfit training data and perform poorly on unseen data. Cross-validation and regularization techniques will be used.
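
As a rough sketch of the overfitting mitigation mentioned above, the snippet below combines k-fold cross-validation with an L2-regularized linear model from scikit-learn. The synthetic data, the choice of `Ridge`, the `alpha` value, and the five folds are all assumptions made for illustration; the project's own settings may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the bike-sharing dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=120)

# Ridge adds an L2 penalty (alpha) that discourages overly large coefficients.
model = Ridge(alpha=1.0)

# 5-fold cross-validation: each fold is held out once while the rest train the model.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
cv_mae = -scores.mean()  # scikit-learn reports negated errors, so flip the sign
print(f"Cross-validated MAE: {cv_mae:.3f}")
```

A large gap between the training error and the cross-validated error would suggest that the model is overfitting.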

### 1.2.3 Cost and Benefit
Cost-benefit analysis has been used to assess the financial feasibility and long-term viability of healthcare investments with a machine learning approach to panel regression, and was shown to be highly effective (R & N, 2024). Consequently, the costs and benefits of this project are evaluated below.
<br>
- **Cost:** Computational resources, data acquisition, and personnel time.
- **Benefit:** Optimized bike distribution, improved stakeholder satisfaction, and operational efficiency.


## 1.3 Determine Machine Learning (ML) Goals

### 1.3.1 ML Goals
Four models will be built using the Linear Regression, Random Forest, Gradient Boosting, and Support Vector Machine (SVM) algorithms. Each model will be trained and evaluated on daily bike rental demand, and their results will be compared.
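
A comparison of this kind might be structured as in the sketch below, which fits four scikit-learn regressors on the same split and ranks them by MAE. The synthetic data, the hyperparameters, and the scaling step in front of the SVM are assumptions for illustration and are not taken from the project notebooks.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for daily features and rental counts.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 4))
y = 2000 + 3000 * X[:, 0] - 1000 * X[:, 1] + rng.normal(scale=100, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One entry per algorithm; SVR is sensitive to feature scale, so it gets a scaler.
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1000.0)),
}

# Fit every model on the same training split and score it on the same test split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, model.predict(X_test))

best_name = min(results, key=results.get)
for name, mae in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:>16}: MAE = {mae:.1f}")
print("Lowest MAE:", best_name)
```

Keeping the split and the metric identical across models is what makes the comparison fair.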

### 1.3.2 ML Success Criteria
Entities invest heavily in Big Data and Machine Learning (ML) projects, yet many of these projects are forecasted to fail (Supakkul et al., 2020). Therefore, it is crucial to define the ML success criteria before making investments and commitments. The main success criterion is to achieve the best Mean Absolute Error (MAE), R-squared as a Percentage, and Mean Absolute Percentage Error (MAPE),  demonstrating high predictive accuracy on the test data. Moreover, the selected model should be interpretable and offer actionable insights for operational decision-making.<br>

## 1.4 Producing Project Plan

### 1.4.1 Project Plan
The CRISP-DM methodology, as detailed by Leaper (2024), will be implemented as follows:
- **Business Understanding and Data Collection:** Gather data, comprehend features, and visualize relationships.
- **Data Understanding:** Analyze the data to gain insights and identify patterns.
- **Data Preparation:** Cleanse and preprocess the data, address missing values, and create relevant features.
- **Modeling:** Choose suitable algorithms, divide the data, train models, and perform hyperparameter optimization.
- **Evaluation:** Evaluate the model's performance using relevant metrics.
- **Deployment:** Ready the model for deployment and review the lessons learned.

### 1.4.2 Initial Assessment of Tools and Techniques
- **Tools:** Google Colab Environment, Python (Scikit-learn, Statsmodels, TensorFlow, PyTorch, XGBoost, LightGBM, LibSVM, LIBLINEAR, SVM, GridSearch)
- **Techniques:** Stages 2 and 3 follow the same steps for every model. From Stage 4 onward, each model follows its own pattern as required.



# Stage 2: Data Understanding

## 2.1 Initial Data Acquisition
The data for this project was obtained from the UCI Machine Learning Repository, specifically the Bike Sharing Dataset from a company known as Capital Bikeshare in Washington D.C. This dataset comprises historical records of bike rentals, supplemented with weather and calendar information.

<br>

A challenge encountered was the absence of the dataset from the three data sources listed in the assessment document (Torrens University Australia, 2024), as illustrated in Figure 1.
<br>
To resolve this issue, I conducted a manual search for the dataset (Capital-Bikeshare15 Bike Sharing Dataset at UCI) using Google. After locating the data, I confirmed its authenticity and relevance with the learning facilitator and classmates. The dataset is accessible via the UCI Machine Learning Repository: [Bike Sharing Dataset](https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset)
<br>
- The downloaded archive, “bike+sharing+dataset”, was unzipped and the extracted folder was renamed to “bike_sharing_dataset”.
- The folder contains the files “day.csv”, “hour.csv”, and "Readme.txt".
- The file paths used for execution are “bike_sharing_dataset/day.csv” and “bike_sharing_dataset/hour.csv”.
- All required packages were imported before reading the files.
<br><br>
The data obtained includes historical records of bike rentals, weather conditions, user demographics, and rental durations, providing a comprehensive foundation for the analysis. It is read as shown in Figure 2.
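
Reading the daily file could look like the sketch below, assuming pandas and the folder layout listed above. The helper name `load_bike_data` is hypothetical, and the `dteday` date column name comes from the public UCI file rather than from this document, so treat it as an assumption.

```python
from pathlib import Path

import pandas as pd


def load_bike_data(path):
    """Read a bike-sharing CSV and parse its date column.

    Assumes the file has a 'dteday' column, as in the public UCI files.
    """
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(f"Dataset not found at {csv_path}")
    return pd.read_csv(csv_path, parse_dates=["dteday"])


# Path taken from the steps above; the read is skipped if the file is absent.
DAY_CSV = Path("bike_sharing_dataset/day.csv")
if DAY_CSV.exists():
    day_df = load_bike_data(DAY_CSV)
    print(day_df.shape)
    print(day_df.head())
```

Checking that the path exists first gives a clearer error than a failed read when the folder has not been renamed as described.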

<br>
Since the daily dataset can help prevent bias and is more comprehensive, I’ll use it for the model creation to have the best daily predictions that can be enhanced to hourly in the future.



  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 2.2 Data Description


  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 2.3 Verifying Data Quality

  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 2.3.2 Checking & Removing Outliers
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 2.3.3 Checking & Handling NaN Values
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 2.3.4 Checking & Handling Infinite Values
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 2.3.5 Randomize the Datasets
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 2.4 Data Exploration

### 2.4.1 Data Exploration Report
Data exploration involves understanding the relationships between the features and the target variable cnt. Initial exploration is conducted using querying, data visualization, and reporting techniques.
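
One simple way to query the relationship between each feature and cnt is to rank the numeric features by their correlation with the target, as in the pandas sketch below. The helper name and the small sample frame are hypothetical, and the `temp` and `hum` column names are assumed for illustration.

```python
import pandas as pd


def correlations_with_target(df: pd.DataFrame, target: str = "cnt") -> pd.Series:
    """Pearson correlation of every numeric feature with the target,
    ordered from the strongest to the weakest absolute correlation."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "temp": [0.2, 0.4, 0.6, 0.8],
        "hum": [0.8, 0.5, 0.7, 0.6],
        "cnt": [1000, 2000, 3000, 4000],
    }
)

ranked = correlations_with_target(sample)
print(ranked)
```

Correlation only captures linear relationships, so it complements rather than replaces the distribution plots.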

#### 2.4.1.1 Distributions
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

#### 2.4.1.2 Correlations
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

# Stage 3: Data Preparation

## 3.1 Selecting Data

### 3.1.1 Rationale for Inclusion/Exclusion
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 3.2 Cleaning Data

### 3.2.1 Data Cleaning Report

  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 3.3 Construct Data

### 3.3.1 Derived Features/Inputs

  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 3.3.2 Generated Targets/Outputs
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 3.4 Integrating Data

### 3.4.1 Merged Data
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 3.5 Formatting Data


### 3.5.1 Reformatted Data
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

# Stage 4: Modelling

## 4.1 Selecting Modelling Technique

### 4.1.1 Model Technique
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 4.1.2 Model Technique Assumptions
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 4.2 Generating Test Design


### 4.2.1 Test Design
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 4.3 Building Model Parameter Settings
### 4.3.1 Train the Models

- Initialize the Model:
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 4.3.2 Models Description
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 4.4 Assessing Model
### 4.4.1 Model Assessment
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 4.4.2 Revised Parameter - Hyperparameter Tuning Level 1
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 4.4.3 Revised Parameter - Hyperparameter Tuning Level 2
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 4.5 Model Comparison
### 4.5.1 Comparing Models
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 4.5.2 Model Evaluation
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation


# Stage 5: Evaluation
## 5.1 Evaluating Results
### 5.1.1 Unseen Dataset Result
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation
      
### 5.1.2 Model’s Metrics Accuracy (R-squared as a Percentage & Mean Absolute Percentage Error (MAPE))
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

### 5.1.3 Does It Align with the Business Success Criteria?
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 5.2 Approved Models
### 5.2.1 Review Process
The results from the holdout data show that the Gradient Boosting model is performing well on new, unseen data, with consistent and strong metrics. The slightly higher R² and lower MAE on the holdout set are positive signs that the model generalizes well and maintains accuracy when applied to data it hasn't seen before. The slight increase in MAPE is not a major concern, as the other metrics still indicate strong performance.
<br>

Overall, this suggests that the Gradient Boosting model is robust and well-suited for making predictions on new data.

### 5.2.2 Export Model for Use (Save the model)
The model was exported as a “.pkl” file for use. It can be embedded in an application or exposed to users via an Application Programming Interface (API).


### 5.2.3 Test the Exported Model 
- Loaded the saved models
- Tested predictions with the loaded models
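
The export and reload steps can be sketched as a round trip with Python's `pickle` module. This assumes a scikit-learn model; the synthetic training data and the temporary file name are placeholders, and the project may have used a different serializer such as joblib.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Train a small stand-in model on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))
y = 1000 + 2000 * X[:, 0] + rng.normal(scale=50, size=100)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Save the fitted model to a .pkl file (placeholder location).
model_path = Path(tempfile.gettempdir()) / "gradient_boosting_demo.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm it predicts exactly like the original.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

same = np.allclose(model.predict(X[:5]), loaded.predict(X[:5]))
print("Predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.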


## 5.3 Determining Next Steps
### 5.3.1 List of Actions Required
- Develop the web interface. 
- Integrate the interface with the selected model. 
- Deploy the website and model. 
- Monitor the model's performance. 
- Implement a maintenance plan.

### 5.3.2 List of Possible Decisions
- Move forward with deployment since the model meets the success criteria. 
- Return to the modeling stage if further enhancements are necessary.


# Stage 6: Deployment

## 6.1 Create a Simple Website
- Created the interface using HTML and CSS to serve as the user interface for the application.

## 6.2 Link the Exported Model with the Website
- Exported the models as “.pkl” files and stored them in the app directory under a folder named "models" for use within the application.

## 6.3 Host the Website for Real-world Use
- Set up a Flask app engine to capture users’ inputs, perform daily predictions, and return the prediction results.
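
A minimal Flask endpoint for this flow might look like the sketch below. The `/predict` route, the JSON field names, and the `predict_demand` stand-in are all hypothetical; in the real app the stand-in would be replaced by a model loaded from the "models" folder.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical subset of input features expected from the user.
FEATURES = ["temp", "hum", "windspeed"]


def predict_demand(features):
    """Stand-in for model.predict(); the real app would call a loaded .pkl model."""
    return (
        1000.0
        + 3000.0 * features["temp"]
        - 500.0 * features["hum"]
        - 200.0 * features["windspeed"]
    )


@app.route("/predict", methods=["POST"])
def predict():
    # Read the JSON body and reject requests that lack a required field.
    payload = request.get_json(silent=True) or {}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify({"error": f"missing fields: {missing}"}), 400
    features = {name: float(payload[name]) for name in FEATURES}
    return jsonify({"predicted_cnt": round(predict_demand(features))})
```

Saved as `app.py`, this can be started locally with `flask --app app run`, and Flask's built-in test client can exercise the route without starting a server.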

## 6.4 Test the Website/Model 
  - #### Kindly
      - refer to each model (LinearRegression, RandomForest, GradientBooster, & SVM) for the coding with comments
      - refer to the Project Report for description and explanation

## 6.5 Azure Deployment & Lessons Learned
### 6.5.1 Azure Deployment
#### 6.5.1.1 Set Up Azure Environment
Deploying an AI model on Azure requires careful configuration, an active subscription, and practical use of JSON (Siddhardhan, 2024). The steps I followed to deploy the model are outlined below:
- Create an Azure Account or Sign in:
    - If you don't have an Azure account, create one at https://azure.microsoft.com/en-us/; otherwise, sign in at https://portal.azure.com/#home.
- Create a Resource Group:
    - In the Azure portal, navigate to "Resource groups" and click "Create."
    - Select your subscription, provide a name for the resource group (e.g., "MLN601"), and choose a region.
    - Alternatively, you can name the resource group "bikepredictor" and select the appropriate region.

### 6.5.2 Reflections

**Challenges Encountered:** Acquiring the dataset was challenging, as discussed in Section 2.1 of this document. The limited time frame made it difficult to explore alternative options and develop a more refined application interface.
<br>
**Successes:** A significant amount of learning occurred through the implementation of models in real-world scenarios. I successfully transformed the model into a web application, which was both an impressive and rewarding experience.


### 6.5.3 Recommendations
#### Future Work: 
- **Enhance Model and Interface:** The model and interface should handle multiple predictions at once, including via file upload, and provide users with results and charts of the predicted demand.
- **Geographic Flexibility:** The model should be adaptable to different locations, accounting for demand variations across regions.
- **API Integration:** Implement the API in a real-world setting to assess how it interacts with other applications.
- **Automated Data Fetching:** Integrate forecasts from external APIs to eliminate the need for manual data entry. Users should be able to select a location and date, with all other necessary features being automatically retrieved for prediction.
- **User Data Tracking:** The app should track user inputs and responses, enabling enhancements and providing a personalized experience for future use.
- **Hourly Prediction:** The updated version should be able to give a breakdown of hourly demand per day to ensure consistent business operations.


# REFERENCES
- Blomster, M., & Koivumäki, T. (2022). Exploring the resources, competencies, and capabilities needed for successful machine learning projects in digital marketing. Information Systems and E-Business Management, 20(1), 123–169. https://doi.org/10.1007/s10257-021-00547-y
- Bokrantz, J., Subramaniyan, M., & Skoogh, A. (2023). Realising the promises of artificial intelligence in manufacturing by enhancing CRISP-DM. Production Planning & Control, 1–21. https://doi.org/10.1080/09537287.2023.2234882
- El-Assi, W., Salah Mahmoud, M., & Nurul Habib, K. (2017). Effects of built environment and weather on bike sharing demand: a station level analysis of commercial bike sharing in Toronto. Transportation, 44(3), 589–613. https://doi.org/10.1007/s11116-015-9669-z
- Hadi, F.-T. (2009). Bike Sharing. UCI Machine Learning Repository. https://doi.org/10.24432/C5W894
- Krishnan, V. (2015). Machine Learning Aided Efficient Tools for Risk Evaluation and Operational Planning of Multiple Contingencies. In A. T. (Ed.), Chaos Modeling and Control Systems Design (pp. 291–317). Springer. https://doi.org/10.1007/978-3-319-13132-0_12
- Leaper, N. (2024, July 12). A Visual Guide to CRISP-DM Methodology. Design for Experiences Notes and Ideas for Experience Design.
- R, M., & N, S. B. (2024). Healthcare Cost-Benefit Analysis Using Machine Learning: A Panel Data Modelling Approach. 2024 IEEE International Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI), 1–6. https://doi.org/10.1109/IATMSI60426.2024.10502671
- Siddhardhan. (2024, January 7). Deploying Machine Learning Model on Azure with Python | Step-by-Step Guide | ML - Azure Deployment [Video recording]. https://www.youtube.com/watch?v=VfTVIXiffBU
- Studer, S., Binh, T. B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S., & Müller, K.-R. (2021). Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. A Machine Learning Process Model with Quality Assurance.
- Supakkul, S., Ahn, R., Junior, R. G., Villarreal, D., Zhao, L., Hill, T., & Chung, L. (2020). Validating Goal-Oriented Hypotheses of Business Problems Using Machine Learning: An Exploratory Study of Customer Churn (pp. 144–158). https://doi.org/10.1007/978-3-030-59612-5_11
- Torrens University Australia. (2024). ASSESSMENT 3 BRIEF Subject Code and Title MLN 601 Machine Learning. https://mylearn.torrens.edu.au/courses/6815/assignments/89323
 
