Trip duration is one of the most basic metrics for any form of transportation, so accurate trip-time prediction is essential for the development of Intelligent Transportation Systems (ITS) and traveller information systems. We used optimized data mining techniques to forecast the duration of rental-bike trips in Seoul's bike-sharing programme. The forecast combines weather data with pickup and drop-off location data.
The dataset has 9.7 million instances and 26 features. The data used include trip duration, trip distance, pickup and drop-off latitude and longitude, temperature, precipitation, wind speed, humidity, solar radiation, snowfall, ground temperature and 1-hour average dust concentration.
- Trip Duration: It represents the duration of a bike trip in some unit of time, such as minutes or seconds.
- Trip Distance: The distance covered during the bike trip, typically measured in kilometres or miles. This feature can help in understanding how trip duration relates to the distance travelled.
- Pickup and Dropoff Latitude and Longitude: These are geographical coordinates representing the exact location where the bike trip starts (pickup) and ends (dropoff). Latitude and longitude are used to pinpoint specific locations on the Earth's surface.
- Temperature: The temperature at the time of the trip, usually measured in degrees Celsius or Fahrenheit. It can provide insights into how weather conditions affect trip duration.
- Precipitation: This feature indicates whether there was any form of precipitation (rain, snow, etc.) during the trip. It's often represented as a binary variable (0 for no precipitation, 1 for precipitation).
- Wind Speed: The speed of the wind at the time of the trip, measured in units like kilometres per hour (km/h) or meters per second (m/s). Wind speed can influence the ease of cycling.
- Humidity: The level of moisture or humidity in the air during the trip, often represented as a percentage. High humidity can affect comfort during cycling.
- Solar Radiation: Solar radiation measures the amount of energy received from the sun during the trip. It's typically measured in watts per square meter (W/m²) and can impact temperature and weather conditions.
- Snowfall: Indicates whether there was snowfall during the trip, usually represented as a binary variable (0 for no snowfall, 1 for snowfall). Snowfall can significantly affect road conditions and trip duration.
- Ground Temperature: The temperature of the ground or road surface at the time of the trip. This can be important, especially in cold or icy conditions.
- 1-Hour Average Dust Concentration: This feature represents the concentration of particulate matter (dust) in the air, typically measured in micrograms per cubic meter (µg/m³). It provides information about air quality, which can affect health and comfort during the trip.
- The first step is to count the null values and either replace them or remove them.
- The second step is to remove the outliers from the dataset, which is important because they are possibly unwanted instances.
- The third step is to remove highly correlated features by plotting a correlation heatmap and dropping the features with a correlation greater than 0.8 (80%).
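
The three cleaning steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the function name `clean`, the column names, the median fill and the 1.5 × IQR outlier rule are all assumptions, since the text does not say which replacement strategy or outlier criterion was used. Only the 0.8 correlation threshold comes from the description.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame, target: str, corr_threshold: float = 0.8) -> pd.DataFrame:
    """Apply the three cleaning steps to an all-numeric DataFrame."""
    # Step 1: replace nulls with the column median (assumed strategy;
    # dropping the affected rows with df.dropna() is the alternative).
    df = df.fillna(df.median(numeric_only=True))

    # Step 2: keep rows whose target lies within 1.5 * IQR of the quartiles
    # (assumed outlier rule).
    q1, q3 = df[target].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df[target].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Step 3: for every feature pair with |correlation| > threshold,
    # drop the second feature of the pair.
    corr = df.drop(columns=[target]).corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)


# Tiny made-up sample: one null, one extreme duration, and two
# temperature columns that are perfectly correlated.
raw = pd.DataFrame({
    "duration": [10.0, 12.0, 11.0, 13.0, 500.0, 12.0],
    "distance": [1.0, 1.4, np.nan, 1.5, 1.3, 1.1],
    "temp": [20.0, 21.0, 19.0, 22.0, 20.0, 18.0],
    "ground_temp": [21.0, 22.0, 20.0, 23.0, 21.0, 19.0],
})
cleaned = clean(raw, target="duration")
print(cleaned)
```

On this sample the null in `distance` is filled, the 500-minute row is removed, and `ground_temp` is dropped because it duplicates `temp`. The heatmap itself is only a visual aid; the upper-triangle mask does the same selection programmatically.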
First, we plotted violin plots to understand the distribution of the features. We then plotted further plots to understand the relationships between features, such as linear relationships, and which features are more important than others.
- The plots for the following can be found in the file.
- Certain important plots:
- We performed two data transformation techniques:
  - Standardization using z-score
  - Normalization using min-max scaling
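
As a rough illustration of the two transformations, the sketch below applies each one column-wise with NumPy. The function names `standardize` and `min_max` and the sample temperatures are made up for the example. In practice scikit-learn's `StandardScaler` and `MinMaxScaler` perform the same operations.

```python
import numpy as np


def standardize(x):
    """Z-score: subtract the column mean and divide by the column std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)


def min_max(x):
    """Min-max: rescale each column linearly onto [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)


# Hypothetical temperature column in degrees Celsius.
temps = np.array([[0.0], [10.0], [20.0]])
z = standardize(temps)
scaled = min_max(temps)
print(z.ravel())
print(scaled.ravel())
```

Both functions assume that no column is constant, since a constant column would give a zero denominator.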
Once the dataset had been transformed and the unwanted features removed, we decided to predict the duration of the trip, as that would benefit the bike rental about
Models that we trained:
- Linear Regression
- Polynomial Regression
- Random Forest Regressor
- Gradient Boosting Linear Regression
- XGB Booster
- We also applied feature engineering and optimization techniques, including taking the weather into account, then trained the model and obtained a good RMSE score, absolute error and accuracy.
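
To make the comparison concrete, here is a hedged sketch of how several of the models listed above could be trained and scored on RMSE and mean absolute error with scikit-learn. The data is synthetic and invented for the example (duration as a noisy linear function of distance and temperature), and the hyperparameters are arbitrary, so the printed scores say nothing about the real dataset. XGBoost is left out because it needs the separate `xgboost` package.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the real features (assumption, not project data).
rng = np.random.default_rng(0)
n = 500
distance = rng.uniform(0.5, 10.0, n)
temperature = rng.uniform(-5.0, 30.0, n)
X = np.column_stack([distance, temperature])
y = 4.0 * distance + 0.1 * temperature + rng.normal(0.0, 1.0, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "polynomial (degree 2)": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    mae = float(mean_absolute_error(y_test, pred))
    scores[name] = (rmse, mae)
    print(f"{name}: RMSE={rmse:.2f}, MAE={mae:.2f}")
```

Scoring on a held-out split rather than on the training data keeps the comparison fair to the tree-based models, which can fit the training set almost perfectly.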