# Business and data understanding
------------


## Terminology


#### 1. Business terminology
The business terms used in this project are defined below.

* Key Performance Indicators (KPIs): Performance measurements that evaluate the success of an organization or of a particular activity in which it engages. In the taxi business, examples include average trip duration, fleet utilization rate, customer satisfaction scores, and number of rides per day.
* Fare: Fee paid by a passenger for use of a public transport system.
* Surge Pricing: Dynamic pricing method where prices are temporarily increased in response to increased demand and, in most cases, limited supply.
* Fleet: A group of taxis owned or operated by a single company or organization.
* Service Area: Geographical region where a business or organization provides its services or product delivery.
* Cab: Type of vehicle for hire with a driver, used by a single passenger or a small group of passengers. The same as a taxi.
* Surge multiplier: Multiplier that is in effect in the geofence or the region in which the trip ends.


#### 2. ML terminology
The ML terms used in this project are defined below.

* Feature: Individual measurable property or characteristic of a phenomenon. For instance, time, location, and weather conditions are features.
* Regression: Technique for finding the relationship between dependent and independent variables in order to predict continuous values. For example, finding how the price depends on the features.
* Data Preprocessing: Process of evaluating, filtering, manipulating, and encoding data so that a machine learning algorithm can understand it and use the resulting output. For example, encoding categorical features such as the source point and the destination point.
* Model Evaluation: Process of using different evaluation metrics to understand a machine learning model’s performance, as well as its strengths and weaknesses.
For example, to assess the model accuracy MSE can be used.
* Cross-Validation: Technique in which the model is trained on one portion of the dataset and evaluated on the remaining portion, after which the dataset is re-partitioned and the process is repeated on the new portions.
* Feature Engineering: Process of creating new features or transforming existing features to improve the performance of a machine-learning model.
In our case, instead of using the day as a number, it is better to convert it to a day-of-week feature (Monday, Tuesday, and so on).
* Feature Importance: Step in building a machine learning model that involves calculating the score for all input features in a model to establish the importance of each feature in the decision-making process.
* Normalization: Process of rescaling feature values to a common scale (for example, to the range from 0 to 1), so that features measured in different units can be used together in analytics and machine learning algorithms.
* Overfitting: Situation in which an ML model matches the training data too closely and loses its ability to predict new data.
For example, overfitting might occur if the model is too complex for the problem it is meant to solve. The model then memorizes the data instead of learning general patterns.
* Underfitting: Situation in which an ML model is too simple to capture the patterns in the data.
For instance, underfitting might occur if the model is too simple for the problem it is meant to solve. The model is then unable to learn the relationships between the features and the target.
* Mean Squared Error (MSE): Evaluation metric used to measure the amount of error in statistical models. It assesses the average squared difference between the observed and predicted values.
In our task, we will calculate the MSE between the target (correct) price and the predicted one.
* Hyperparameters: Model settings that are specified before training a model rather than learned from the data.
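
As a minimal sketch of the feature engineering and preprocessing examples above, the snippet below derives a day-of-week feature from a timestamp and one-hot encodes it. It assumes timestamps in the `Y-M-D H:M:S` layout; the helper names `day_of_week` and `one_hot` and the sample timestamp are illustrative and are not taken from the project code.

```python
from datetime import datetime

# Weekday names indexed by datetime.weekday() (Monday is 0).
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def day_of_week(timestamp: str) -> str:
    """Return the weekday name for a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return WEEKDAYS[parsed.weekday()]


def one_hot(value: str, categories: list) -> list:
    """Encode a categorical value as a 0/1 vector over the given categories."""
    return [1 if value == category else 0 for category in categories]


weekday = day_of_week("2018-11-26 03:40:46")  # illustrative timestamp
encoded = one_hot(weekday, WEEKDAYS)
print(weekday, encoded)
```

A library such as pandas offers equivalent operations on whole columns; the standard-library version is used here only to keep the sketch self-contained.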


## Scope of the project
----------


#### 1. Background

“QuickRide” is a new taxi company in the USA that already has service areas in small American cities. QuickRide has a mobile application that allows clients to order a ride. Currently, the company calculates the fare based only on mileage and does not consider rush hours or weather conditions. It wants to make its pricing algorithm more flexible.

#### 2. Business problem
The QuickRide taxi company is losing a lot of money because many users are leaving. Customers are unhappy with the prices, but the company cannot simply lower them or it will go bankrupt.
It needs an ML model that predicts fare prices based on time, location, and weather conditions. The price should balance revenue against customer satisfaction.

#### 3. Business objectives

#### Primary business goal:
Predict taxi fare pricing based on time, location and weather conditions to maximize revenue and improve customer satisfaction.
<!-- Create a ML model that will predict fare prices based on time, location and weather conditions. This model should ensure taxi prices fairness for customers while maintaining profitability for QuickRide to reduce or stop users leaving. -->

#### Related business questions:

* How do peak hours (e.g., rush hours) affect fare prices, and how can prices be adjusted for these variations?
* What is the impact of different weather conditions (e.g., rain, snow) on taxi fare pricing?
* How does the distance between the source and destination locations influence the price of the ride?
* How can the company ensure that customers perceive fare pricing as fair while still achieving QuickRide's desired revenue targets?


#### 4. ML objectives

* Predict the fare prices for taxi rides based on time of day, location (source and destination points), and weather conditions (including precipitation, temperature).
* Conduct an analysis to determine the importance of each feature (e.g., time, location, weather conditions) in predicting fare prices. This will help refine the model by focusing on the most impactful features.
* Implement fairness metrics for the model evaluation to ensure that predicted fares are perceived as fair by customers, balancing the need for profitability with customer satisfaction.
* Ensure the model is scalable and generalizes well to different datasets.
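
The feature importance objective can be illustrated with permutation importance, which is one common technique; the text does not state which method the team uses, so this is a sketch under that assumption. The toy model `toy_predict`, its coefficients, and the sample rows are hypothetical and exist only to make the example runnable.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Return the increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(targets, [predict(row) for row in rows])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        shuffled = [row[:j] + [value] + row[j + 1:]
                    for row, value in zip(rows, column)]
        scores.append(mse(targets, [predict(row) for row in shuffled]) - baseline)
    return scores


def toy_predict(row):
    """Hypothetical stand-in model: the fare depends on distance only."""
    distance, _temperature = row
    return 2.5 + 1.8 * distance


# Made-up rows of [distance, temperature]; not taken from the dataset.
rows = [[1.0, 40.0], [2.0, 35.0], [3.0, 50.0], [4.0, 45.0], [5.0, 38.0]]
targets = [toy_predict(row) for row in rows]
scores = permutation_importance(toy_predict, rows, targets, n_features=2)
print(scores)
```

A feature the model ignores (temperature in this toy setup) gets a score of zero, while shuffling a feature the model relies on can only raise the error here.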


## Success Criteria
-------------

#### 1. Business success criteria

- Achieve an 8% increase in customer satisfaction ratings within the next quarter by providing more accurate and fair fare estimates.

- Reduce average waiting time for customers by 20% within six months by better predicting demand and optimizing fleet allocation.



#### 2. ML success criteria
* Achieve $R^2 > 90\%$ and MAPE $< 20\%$ in fare predictions on the test dataset.
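
The two metrics in this criterion can be computed as in the following sketch. It uses the standard definitions of the coefficient of determination and the mean absolute percentage error; the function names and the sample values are illustrative, and the numbers are not project results.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes non-zero targets."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def meets_ml_criteria(y_true, y_pred, min_r2=0.90, max_mape=20.0):
    """Check the ML success criteria: R^2 above 90% and MAPE below 20%."""
    return r2_score(y_true, y_pred) > min_r2 and mape(y_true, y_pred) < max_mape


# Illustrative fares only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 31.0, 39.0]
print(r2_score(actual, predicted), mape(actual, predicted))
print(meets_ml_criteria(actual, predicted))
```

scikit-learn provides `r2_score` and `mean_absolute_percentage_error` in `sklearn.metrics`; note that the scikit-learn MAPE is returned as a fraction rather than a percentage.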


## Data collection

#### 1. Data collection report

* **Data Source:** A CSV file with information about Uber and Lyft rides and the weather conditions at the time of each ride.
* **Data Type:** The data consist of numerical values representing price, distance, temperature, humidity, etc.; categorical values representing cab type, timezone, etc.; text values representing source, destination, etc.; and time values representing the time of a ride.
* **Data size:** The dataset contains 138614 customer records, with 57 features each.
* **Data collection method:** The data were collected from publicly available information about Uber and Lyft rides on the Kaggle website.




#### 2. Data version control report
* **Data Version:** The current data version is v1.0, which was updated on June 19, 2024.
* **Data Change Log:** The data will be updated 4 times by adding a new portion of records each time.
* **Data Backup:** We have a backup of each dataset stored on our local machine.
* **Data Archiving:** The company archives data older than one week to a cloud storage service so that the data remain available for company purposes.
* **Data Access Control:** The company uses role-based access control to ensure that only authorized personnel can access and modify the data.

## Data quality verification


#### 1. Data description

- The data acquired for this project consist of a dataset of 138614 records with 57 fields each. The fields cover geographical, weather, time, and cab ride conditions. The data are in CSV format and are stored in a local database.

- The tables below describe the data features.


**Numerical features:**

| Feature name | Datatype | Description |
| --- | --- | --- |
| id | object | Unique Identifier for each record |
| hour | int64 | The hour of the day when the ride happened |
| day | int64 | The day of the month when the ride happened |
| month | int64 | The month when the ride happened |
| distance | float64 | The total distance of the requested ride |
| surge_multiplier | float64 | The surge pricing multiplier applied to the transaction |
| latitude | float64 | The latitude coordinate of the transaction location |
| longitude | float64 | The longitude coordinate of the transaction location |
| temperature | float64 | The temperature at the time and location of the ride |
| apparentTemperature | float64 | The perceived temperature at the time and location of the ride |
| precipIntensity | float64 | The intensity of precipitation at the time of the ride |
| precipProbability | float64 | The probability of precipitation at the time of the ride |
| humidity | float64 | The humidity level at the time of the ride |
| windSpeed | float64 | The wind speed at the time of the ride |
| windGust | float64 | The wind gust speed at the time of the ride |
| visibility | float64 | The visibility level at the time of the transaction |
| temperatureHigh | float64 | The highest temperature recorded at the time of the ride |
| temperatureLow | float64 | The lowest temperature recorded at the time of the ride |
| apparentTemperatureHigh | float64 | The highest perceived temperature recorded at the time of the ride |
| apparentTemperatureLow | float64 | The lowest perceived temperature recorded at the time of the ride |
| dewPoint | float64 | The dew point at the time of the transaction |
| pressure | float64 | The atmospheric pressure at the time of the ride |
| windBearing | int64 | The direction of the wind at the time of the ride |
| cloudCover | float64 | The cloud cover percentage at the time of the transaction |
| uvIndex | int64 | The UV index at the time of the transaction |
| visibility.1 | float64 | Visibility measurement at the time of the transaction (duplicate of visibility column) |
| ozone | float64 | The ozone level at the time of the ride |
| moonPhase | float64 | The phase of the moon on the day of the ride |
| precipIntensityMax | float64 | The maximum precipitation intensity at the time of the ride |
| temperatureMin | float64 | The minimum temperature recorded at the time of the ride |
| temperatureMax | float64 | The maximum temperature recorded at the time of the ride |
| apparentTemperatureMin | float64 | The minimum perceived temperature recorded at the time of the ride |
| apparentTemperatureMax | float64 | The maximum perceived temperature recorded at the time of the ride |


**Categorical features**:

| Feature name | Datatype | Description |
| --- | --- | --- |
| cab type | object | The type of taxi company (Uber or Lyft) |
| name | object | Category of taxi ride |
| product_id | object | Id of the category of taxi ride |
| icon | float64 | An icon representing the weather condition at the time of the ride |
| timezone | object | The timezone in which the ride happened |

**Text features**:

| Feature name | Datatype | Description |
| --- | --- | --- |
| source | object | The start point of the ride |
| destination | object | The end point of the ride |
| short_summary | object | A brief weather summary at the time of the ride |
| long_summary | object | A detailed weather summary at the time of the transaction |

**Time features**:

| Feature name | Datatype | Description |
| --- | --- | --- |
| datetime | object | The date and time of the ride |
| windGustTime | int64 | The time when the wind gust occurred |
| temperatureHighTime | int64 | The time when the highest temperature was recorded |
| temperatureLowTime | int64 | The time when the lowest temperature was recorded |
| apparentTemperatureHighTime | int64 | The time when the highest perceived temperature was recorded |
| apparentTemperatureLowTime | int64 | The time when the lowest perceived temperature was recorded |
| sunriseTime | int64 | The time of sunrise on the day of the transaction |
| sunsetTime | int64 | The time of sunset on the day of the ride |
| uvIndexTime | int64 | The time when the UV index was recorded |
| temperatureMinTime | int64 | The time when the minimum temperature was recorded |
| temperatureMaxTime | int64 | The time when the maximum temperature was recorded |
| apparentTemperatureMinTime | int64 | The time when the minimum perceived temperature was recorded |
| apparentTemperatureMaxTime | int64 | The time when the maximum perceived temperature was recorded |

#### 2. Data exploration
- During data exploration, several interesting patterns and correlations were discovered. For example, the price depends on the distance of the ride: as the length of the trip increases, so does its cost. Moreover, the day of the week, hour, product_id, name, weather summary, and source and destination points of the ride also influence the price. Surprisingly, the temperature does not affect the price of a trip significantly, perhaps because the data cover only two months. These findings demonstrate that the data have enough informative features for solving the price prediction problem. However, more data collected over different months are needed for a more precise result.
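
A dependence such as the one between distance and price can be quantified with the Pearson correlation coefficient. The sketch below is one way to compute it; the `pearson` helper and the sample values are illustrative assumptions and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up sample values for illustration only.
distance = [0.5, 1.2, 2.0, 3.1, 4.4]
price = [7.0, 9.5, 13.0, 16.5, 22.0]
print(pearson(distance, price))
```

A value close to 1 indicates a strong positive linear relationship, a value close to 0 indicates little linear relationship, and a value close to -1 indicates a strong negative one.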

#### Charts to present findings:

- The relationship between distance of the ride and price:

![](https://drive.google.com/uc?export=download&id=13Wegvp2dmzX-pSFzGYGcI_CBdDCyeTci)

- The relationship between day of week of the ride and price:
![](https://drive.google.com/uc?export=download&id=185nDNsfAAdtRXXPPqXBavKH-fd6iGF2h)

- The relationship between hour of the ride and price:
![](https://drive.google.com/uc?export=download&id=1_ca9KPSrWUxRJIHc7wQmM5sUA5KqAJIs)

- The relationship between product_id (id of the category of ride) and price:
![](https://drive.google.com/uc?export=download&id=1qsWkCOs9vGVAIwLtpLfbc3RAlZXvkpUQ)

- The relationship between name (category of ride) and price:
![](https://drive.google.com/uc?export=download&id=13moh4OodFJ9dGwVeASMC8BHoWkyPshXH)

- The relationship between short weather summary and price:
![](https://drive.google.com/uc?export=download&id=1EFSm3i9Haw1EUY-JoOvQjXuRloTRjOU_)

- The relationship between source and destination points and price:
![](https://drive.google.com/uc?export=download&id=15MglQl8eApOaSB16l_jU59sw0DVMLprT)

- Temperature and price correlation matrix:
![](https://drive.google.com/uc?export=download&id=1zF0m3PWDrUYEcx9nIUfYoIh0JjLjL7qz)


#### 3. Data requirements
The data requirements for this project are defined as follows:
- id: Id should be unique, not null and follow a pattern (424553bb-7174-41ea-aeb4-fe06d4f4b9d7).
- Cab type: Cab type should be lyft or uber.
- Price: The price of the ride should be greater than 0 and of type that can be converted to float.
- Distance: The distance of the ride should be greater than 0, not null and of type that can be converted to float.
- Datetime: The datetime of the ride should be not null, in the format Y-M-D H:M:S, and later than 2018-11-26 (the starting point of measurements).
- Hour: The hour of the ride should be not null, with a value between 0 and 23, and of a type that can be converted to int.
- Day: The day of the ride should be not null, with a value between 1 and 31, and of a type that can be converted to int.
- Month: The month of the ride should be not null, with a value between 1 and 12, and of a type that can be converted to int.
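
The requirements above could be checked per record roughly as follows. This is a sketch under stated assumptions, not the project's validation code: the dictionary keys (such as `cab_type`) are hypothetical column names, a missing price is tolerated because the requirement does not demand a non-null price, and the calendar fields use the conventional bounds (hour 0 to 23, day 1 to 31, month 1 to 12).

```python
import re
from datetime import datetime

# Pattern of the example id 424553bb-7174-41ea-aeb4-fe06d4f4b9d7.
ID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")
MEASUREMENT_START = datetime(2018, 11, 26)


def _is_positive_float(value) -> bool:
    try:
        return float(value) > 0
    except (TypeError, ValueError):
        return False


def validate_record(record: dict) -> list:
    """Return the names of the fields that violate a requirement."""
    errors = []
    if not ID_PATTERN.match(str(record.get("id", ""))):
        errors.append("id")
    if str(record.get("cab_type", "")).lower() not in {"lyft", "uber"}:
        errors.append("cab_type")
    price = record.get("price")
    if price not in (None, "") and not _is_positive_float(price):
        errors.append("price")
    if not _is_positive_float(record.get("distance")):
        errors.append("distance")
    try:
        when = datetime.strptime(record["datetime"], "%Y-%m-%d %H:%M:%S")
        if when <= MEASUREMENT_START:
            errors.append("datetime")
    except (KeyError, TypeError, ValueError):
        errors.append("datetime")
    for field, low, high in (("hour", 0, 23), ("day", 1, 31), ("month", 1, 12)):
        try:
            if not low <= int(record[field]) <= high:
                errors.append(field)
        except (KeyError, TypeError, ValueError):
            errors.append(field)
    return errors


# Illustrative record; only the id is copied from the pattern example above.
valid = {"id": "424553bb-7174-41ea-aeb4-fe06d4f4b9d7", "cab_type": "Lyft",
         "price": 5.0, "distance": 0.44, "datetime": "2018-12-16 09:30:07",
         "hour": 9, "day": 16, "month": 12}
print(validate_record(valid))
```

Returning the list of offending fields, rather than a single boolean, makes it easier to summarize which requirement fails most often across the dataset.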

#### 4. Data quality verification report
- Completeness: The data is complete in the sense that it covers all the required cases. The id, cab type, price, distance, datetime, hour, day and month columns satisfied all the requirements mentioned above.

- Correctness: The data appears to be correct, with no obvious errors. However, a manual review of the data is recommended to ensure that there are no errors.

- Missing Values: There are some missing values in the price column, but they will be handled during the data analysis step.

- Overall, the data quality is good, and the data is suitable for analysis and modeling. However, a manual review of the data is recommended to ensure that there are no errors, anomalies and outliers.
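
A quick way to quantify missing values per column is sketched below. The helper name `missing_counts`, the set of markers treated as missing, and the three-row sample are assumptions made for illustration; the real file has many more columns.

```python
import csv
import io

# Cell contents treated as missing in this sketch (an assumption).
MISSING_MARKERS = {"", "NA", "NaN", "nan"}


def missing_counts(csv_text: str) -> dict:
    """Count missing cells per column in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    return counts


# Tiny made-up sample with two missing prices.
sample = "id,price,distance\na,5.0,0.44\nb,,1.10\nc,NA,2.30\n"
print(missing_counts(sample))
```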

## Project feasibility
-------------

#### 1. Inventory of resources

* The project development team consists of a Data Engineer, Data Scientists, and an ML Engineer
* Access to the Uber and Lyft dataset of the taxi rides
* Kaggle computing resources to train the model and process the data using the GPU P100 accelerator
* Software tools: TensorFlow, DVC, Hydra, scikit-learn, pandas, NumPy, Matplotlib


#### 2. Requirements, assumptions and constraints


##### Requirements
* The project is planned to be completed by the end of July
* The solution should be scalable
* The predictions made by the ML model should have a low MSE
* The factors that affect ride prices should be comprehensible to the end users
* The dataset should be open, so that no legal issues arise


##### Assumptions
* Peak hours strongly influence the price of the taxi ride
* Weather conditions will influence the predictions. For example, rain will increase the demand

##### Constraints

* Only 30 hours of Kaggle GPU resources are available per week
* The system must be performant and available
* Lack of real-time data


#### 3. Risks and contingencies

* Problems with model, data, and code versioning. SOLUTION: Use MLOps principles.
* Expiring Kaggle resources availability. SOLUTION: Create new accounts or use local machines to train model and process data.
* Degradation of the model in production. SOLUTION: Retrain it.


#### 4. Costs and benefits
![](https://drive.google.com/uc?export=download&id=1Z0QYGJI677RSonJLLXIYBa_vNzNIYicH)

The ratio of the potential benefits of the project to its costs is very high (3/1 = 3), since our solution brings the company out of crisis and leads to profit.

#### 5. Feasibility report

We built a proof-of-concept (POC) model. It showed good results within acceptable limits (MSE < 3.4 and $R^2$ > 96%). Based on the explorations above, our project is feasible because the dataset and resources are available. Based on the POC model, we can assume that a full-scale model will solve the main business problem.


## Project plan
----------------

#### Phase 1: Business and data understanding
* Duration: 1 week
* Resources: The whole team, Company history, Dataset.
* Input: Business goal, Dataset, Project scope
* Output: Project plan, ML canvas, Report
* Dependencies: Availability of dataset
* Tasks:
  * Define terminology
  * Research scope of the project
  * Determine success criteria
  * Collect the data
  * Ensure the data quality
  * Analyze project feasibility
  * Produce project plan
  * Produce ML canvas
  * Write a report

#### Phase 2: Data engineering/preparation
* Duration: 1 week
* Resources: Data engineer, Data Scientist, Dataset, numpy, pandas.
* Input: Dataset
* Output: Dataset compatible with the model
* Dependencies: Data quality and accessibility
* Tasks:
  * Collect the data if necessary
  * Clean and preprocess the data
  * Perform exploratory data analysis (EDA)
  * Select and engineer features

#### Phase 3: Model engineering
* Duration: 1 week
* Resources: Data Scientist, ML engineer, preprocessed dataset, keras, scikit-learn.
* Input: Preprocessed data
* Output: Model with evaluation metrics
* Dependencies: Preprocessed dataset quality
* Tasks:
  * Choose the model to solve the problem
  * Split the data into train and test parts
  * Train the model
  * Evaluate it using metrics
  * Iterate through model tuning
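
The train/test split task can be sketched as follows. The helper name `split_train_test`, the 20% test fraction, and the seed are assumptions chosen for illustration; the plan does not specify them.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split_train_test(range(10))
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing models during tuning. scikit-learn offers `train_test_split` in `sklearn.model_selection` for the same purpose.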

#### Phase 4: Model validation
* Duration: 1 week
* Resources: Data Scientist, ML engineer, model, validation metrics, keras, scikit-learn.
* Input: Model from previous phase
* Output: Evaluated and upgraded model
* Dependencies: Model from previous phase
* Tasks:
  * Perform cross-validation
  * Analyze feature importance
  * Evaluate additional metrics to ensure fairness
  * Iterate on one more round of model refinement
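
The cross-validation task can be illustrated by the way k-fold splitting partitions the sample indices. This is a sketch assuming plain contiguous folds without shuffling; the helper name `k_fold_indices` and the choice of five folds are hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(10, k=5))
for train, validation in folds:
    print(validation, train)
```

Each sample is used for validation exactly once, and the k validation scores are typically averaged. scikit-learn's `KFold` in `sklearn.model_selection` provides the same splitting with optional shuffling.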
  
#### Phase 5: Model deployment
* Duration: 1 week
* Resources: ML engineer, model, github.
* Input: Final model
* Output: Deployed model
* Dependencies: Deployment services
* Tasks:
  * Develop API for model production
  * Deploy the model

#### Phase 6: Model monitoring and maintenance
* Duration: the whole period of operation
* Resources: ML engineer, Data Scientist.
* Input: Metrics and live data
* Output: Performance assessment reports
* Dependencies: monitoring system and data
* Tasks:
  * Observe the model
  * Retrain if performance degrades past the threshold
  * Give performance reports to stakeholders

Several risks could delay the production of the model, for instance integration challenges, insufficient model performance, and resource limitations. Any of these issues could cause business losses. Therefore, our aim is to use verified and fast methods to complete the product.

  [JIRA Gantt chart](https://ndevelop16.atlassian.net/jira)


#### 2. ML project Canvas

![Canvas](https://drive.google.com/uc?export=download&id=1uJ0YhoJkw_COCyM4dyWPhU8GfzrLp-dB)