## **1. Loading libraries and data**

In [3]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [8]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

pd.set_option('display.max_columns',200)
np.random.seed(24)

In [None]:
df = pd.read_csv(
    "/kaggle/input/seoul-bike-trip-duration-prediction/For_modeling.csv",
    dtype={
        'Duration' : 'int8',
        'Distance' : 'int8',
        'PLong' : 'float32',
        'PLatd' : 'float32',
        'DLong' : 'float32',
        'DLatd' : 'float32',
        'Haversine' : 'float32',
        'Pmonth' : 'int8',
        'Pday' : 'int8',
        'Phour' : 'int8',
        'Pmin' : 'int8',
        'PDweek' : 'int8',
        'Dmonth' : 'int8',
        'Dday' : 'int8',
        'Dhour' : 'int8',
        'Dmin' : 'int8',
        'DDweek' : 'int8',
        'Temp' : 'float32',
        'Precip' : 'float32',
        'Wind' : 'float32',
        'Humid' : 'float32',
        'Solar' : 'float32',
        'Snow' : 'float32',
        'GroundTemp' : 'float32',
        'Dust' : 'float32'
    },
    index_col=0
).sample(frac=1)
df.head()

In [None]:
df.shape

In [None]:
df.info()

## **2. Exploratory Data Analysis**

In [None]:
df.columns

In [None]:
df.describe()

**Observations:**
1. Distance: some data points have a negative distance value.
2. Haversine: some data points have a Haversine value of 0.

**Observation analysis 1. Distance**  
Exploring the data points where the distance is negative.

In [None]:
df.shape

In [None]:
df[df['Distance']<0].shape

In [None]:
df[df['Distance']<0].head()

In [None]:
# Convert all negative distances to positive values
df['Distance'] = df['Distance'].apply(lambda x: abs(x))

In [None]:
df[df['Distance']<0].shape
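
One possible explanation for the negative values, offered here as an assumption rather than something the data confirms: `Distance` is loaded as `int8`, which only holds values from -128 to 127. If the raw distances are larger than that (for example, if they are recorded in metres), casting them to `int8` wraps them around, and taking the absolute value does not recover the original numbers. The sketch below uses made-up distances to show the effect.

```python
import numpy as np

# Hypothetical raw distances; these values are illustrative, not from the dataset.
raw = np.array([100, 200, 7670], dtype=np.int64)

# Casting to int8 keeps only the lowest 8 bits, so large values wrap around.
wrapped = raw.astype(np.int8)
print(wrapped.tolist())    # [100, -56, -10]

# abs() removes the sign but does not undo the wraparound.
recovered = np.abs(wrapped.astype(np.int16))
print(recovered.tolist())  # [100, 56, 10]
```

If this is indeed the cause, loading `Distance` with a wider type such as `int32` would be a safer fix than taking absolute values.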

**Observation analysis 2. Haversine**  
Exploring the data points where the Haversine value is 0.  
This should mean that the pick-up and drop-off locations (longitude and latitude) are the same.
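
For reference, the Haversine distance is the great-circle distance between two points on a sphere, so it is 0 exactly when both points coincide. The function below is a minimal sketch of the standard formula; the name `haversine_km`, the Earth radius of 6371 km, and the kilometre unit are assumptions, since the notebook does not state how the `Haversine` column was computed.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Identical pick-up and drop-off coordinates give a distance of 0.
same_point = haversine_km(37.5, 127.0, 37.5, 127.0)

# One degree of latitude is roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(same_point, one_degree)
```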

In [None]:
df.shape

In [None]:
df[df['Haversine']==0].shape

In [None]:
df[df['Haversine']==0].reset_index(drop=True).head()

In [None]:
df[df['Haversine']==0].reset_index(drop=True).describe()

- Despite a Haversine distance of 0, which indicates identical pick-up and drop-off locations, these trips have durations ranging from 2 to 119 minutes.
- This suggests that these may be round trips, in which the bicycle returned to the pick-up location after visiting other places.
- Removing such data points could affect the performance of the model. However, we do not have any additional information about round trips, so we remove them for the time being.

- **NOTE**: In the future, we could add a feature that records round trips or, better, new longitude and latitude columns that record the coordinates of the midpoint of the journey. This would help confirm whether a trip was a round trip.
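
Until such a feature exists, one rough option is to flag candidate round trips rather than only dropping them. The sketch below is an illustration on a toy frame: the column name `is_round_trip` and the rule (zero Haversine distance but a positive travelled distance) are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up.
toy = pd.DataFrame({
    'Distance':  [0, 1200, 850, 0],
    'Haversine': [0.0, 0.0, 1.3, 0.0],
    'Duration':  [2, 35, 10, 3],
})

# Hypothetical flag: same start and end point, but the bike covered some distance.
toy['is_round_trip'] = (toy['Haversine'] == 0) & (toy['Distance'] > 0)
print(toy)
```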

In [None]:
df[df['Distance']==0].shape

In [None]:
df[df['Distance']==0].head()

In [None]:
filtered_df = df.loc[(df['Distance'] == 0) &
                      (df['Haversine'] == 0) &
                      (df['Pmonth'] == df['Dmonth']) &
                      (df['Pday'] == df['Dday']) &
                      (df['Phour'] == df['Dhour']) &
                      (df['Pmin'] == df['Dmin']) &
                      (df['PDweek'] == df['DDweek'])]

# Print the filtered DataFrame
filtered_df

In [None]:
filtered_df.shape

If both the Haversine distance and the actual distance are 0, and the pick-up and drop-off timestamps are identical, then it is likely that the bicycle did not move at all. This could be due to a number of reasons, such as:

- The bicycle was parked for the entire duration of the trip.
- The bicycle was moved a very short distance, but not enough to register on the GPS device.
- There was an error in the GPS data.

Since such records do not describe an actual trip, we remove them so that they do not distort the model.

In [None]:
df = df.drop(filtered_df.index)

In [None]:
# Drop the remaining rows with a Haversine distance of 0 (possible round trips)
filtered_df_H0 = df.loc[df['Haversine']==0]
df = df.drop(filtered_df_H0.index)

In [None]:
df.shape

In [None]:
filtered_df1 = df.loc[(df['Distance'] == 0) & (df['Haversine'] == 0)]
filtered_df1.shape

In [None]:
num_vars = ['Duration', 'Distance', 'Haversine', 'Temp', 'Precip', 'Wind',
            'Humid', 'Solar', 'Snow', 'GroundTemp', 'Dust']

# Create a 3x4 grid of axes, one box plot per numeric variable
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(18.5, 14))
axes = axes.flatten()  # flatten the axes array for easy iteration

for ax, col in zip(axes, num_vars):
    box = ax.boxplot(df[col], patch_artist=True)
    box['boxes'][0].set_facecolor('#7BFB74')
    ax.set_xticklabels([])
    ax.set_title(col)
    ax.yaxis.grid(True)

# Hide the unused axes (11 variables in a 12-cell grid)
for ax in axes[len(num_vars):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(
    data=df.select_dtypes(include=[np.number]),
    ax=ax
)
plt.xticks(rotation=45)
plt.show()

In [9]:
# Total pick-ups per month
trips_per_month = df['Pmonth'].value_counts().sort_index()

month_map = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
trips_per_month.index = trips_per_month.index.map(month_map)

# Get the pick-up months and trip counts
pick_up_months = trips_per_month.index.to_numpy()
trips_count = trips_per_month.to_numpy()

plt.figure(figsize=(10,6))
sns.barplot(x=pick_up_months, y=trips_count)
plt.xlabel("Month")
plt.ylabel("Number of pick-ups")
plt.title("Total pick-ups per month")
plt.show()

In [None]:
# Calculate average trip duration by month
df_avg_duration_by_month = df.groupby(['Pmonth'])['Duration'].mean()

# Plot average trip duration by month
plt.figure(figsize=(10, 6))
plt.plot(df_avg_duration_by_month.index, df_avg_duration_by_month.values,'b-o')
plt.xlabel('Month')
plt.xticks(df_avg_duration_by_month.index)  # one tick per month present in the data
plt.ylabel('Average Trip Duration')
plt.title('Average Trip Duration by Month')
plt.show()

In [None]:
# Calculate average trip duration by day of the month
df_avg_duration_by_day_of_month = df.groupby(['Pday'])['Duration'].mean()

# Plot average trip duration by day of the month
plt.figure(figsize=(10, 6))
plt.plot(df_avg_duration_by_day_of_month.index, df_avg_duration_by_day_of_month.values,'g-o')
plt.xlabel('Day of Month')
plt.xticks(df_avg_duration_by_day_of_month.index)  # one tick per day present in the data
plt.ylabel('Average Trip Duration')
plt.title('Average Trip Duration by Day of Month')
plt.show()

In [None]:
# Calculate average trip duration by day of the week
df_avg_duration_by_day_of_week = df.groupby(['PDweek'])['Duration'].mean()

# Plot average trip duration by day of the week
plt.figure(figsize=(10, 6))
plt.plot(df_avg_duration_by_day_of_week.index, df_avg_duration_by_day_of_week.values,'r-o')
plt.xlabel('Day of Week')
plt.xticks(df_avg_duration_by_day_of_week.index)  # one tick per weekday code present in the data
plt.ylabel('Average Trip Duration')
plt.title('Average Trip Duration by Day of Week')
plt.show()

In [None]:
# Calculate average trip duration by hour of the day
df_avg_duration_by_hour_of_day = df.groupby(['Phour'])['Duration'].mean()

# Plot average trip duration by hour of the day
plt.figure(figsize=(10, 6))
plt.plot(df_avg_duration_by_hour_of_day.index, df_avg_duration_by_hour_of_day.values,'y-o')
plt.xlabel('Hour of Day')
plt.xticks(df_avg_duration_by_hour_of_day.index)  # one tick per hour present in the data
plt.ylabel('Average Trip Duration')
plt.title('Average Trip Duration by Hour of Day')
plt.show()

**Observations:**
1. From the monthly total pick-ups and average trip duration plots:
- The data shows a clear trend of increased pick-ups during late summer and early fall. The highest pick-ups occur in September and October, followed by June, July, and August.
- Total pick-ups and average trip duration are both low in November, December, and January, which is when South Korea experiences winter.
- The average trip duration is highest in May and September. This may be due to a combination of factors, such as increased congestion, increased tourism, and favorable weather conditions.
2. From the Average Trip Duration by Day of Week plot, we can observe that trip duration is higher during weekends.
3. From the Average Trip Duration by Hour of Day plot, the average trip duration is higher from the 15th to the 20th hour of the day, typically after working hours or in the evening.

__From the above observations, we can say that temperature and time are influential factors for estimating trip duration.__
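
The monthly comparison above relies on two separate plots. As a compact cross-check, both quantities can be computed in a single grouped aggregation. The following is a minimal sketch on a made-up frame that reuses the `Pmonth` and `Duration` column names; the numbers are illustrative only and the output column names `trips` and `avg_duration` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    'Pmonth':   [1, 1, 5, 5, 5, 9, 9],
    'Duration': [10, 14, 30, 26, 34, 28, 32],
})

# Count of trips and mean duration per pick-up month in one pass.
monthly = toy.groupby('Pmonth')['Duration'].agg(trips='count', avg_duration='mean')
print(monthly)
```

Replacing `toy` with `df` would give the per-month figures behind both plots side by side.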

In [None]:
plt.figure(figsize = (25,12))
sns.heatmap(df.corr(),annot=True,center=0)
plt.title("correlation plot")
plt.show()

**Observations:**
- PLong and DLong are highly correlated, but they represent the pick-up and drop-off longitudes.
- PLatd and DLatd are highly correlated, but they represent the pick-up and drop-off latitudes.
- Temp and GroundTemp are highly correlated, and Temp is more strongly correlated with Duration (the target) than GroundTemp is.
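
With this many columns the heatmap can be hard to scan. A possible complement, sketched here under assumptions, is to list the column pairs whose absolute correlation exceeds a cutoff. The helper name `high_corr_pairs`, the 0.9 threshold, and the synthetic data are all illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, corr) for column pairs with |corr| above the threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Synthetic data for illustration: GroundTemp is Temp plus a little noise.
rng = np.random.default_rng(0)
temp = rng.normal(15, 10, size=500)
toy = pd.DataFrame({
    'Temp': temp,
    'GroundTemp': temp + rng.normal(0, 1, size=500),
    'Wind': rng.normal(2, 1, size=500),  # generated independently of Temp
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Calling `high_corr_pairs(df)` on the real data would list the pairs noted above without reading them off the plot.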

___