## Hello and welcome to my Notebook

Author: Priyanshu Jain | 22f2001329
***

### Introduction
#### Today we will explore a taxi trip dataset to understand what goes into predicting the price of a taxi at booking time, and how historical data can help make those predictions more accurate.

#### We will first get to know the dataset, then create visuals to deepen that understanding. We will use statistical methods to learn the nature of the dataset and to handle missing values and outliers affecting our data. After that we will build different models and pick the one that works best for our data.
***

In [None]:
# Loading libraries
import numpy as np # linear algebra and array
import pandas as pd # to handle and read datasets
import datetime # to handle time and date data

# Loading visualization libs
import matplotlib.pyplot as plt # to visualize data
from pandas.plotting import scatter_matrix

# Loading Feature Engineering libs
from sklearn.impute import KNNImputer, SimpleImputer # to fill missing values
from sklearn.preprocessing import OneHotEncoder, LabelEncoder # to encode categorical data
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, QuantileTransformer

# Loading model_selection libs
from sklearn.model_selection import train_test_split, GridSearchCV

# Loading models
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor
from sklearn.neural_network import MLPRegressor

import xgboost
import lightgbm

# Import Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, explained_variance_score

# suppressing warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Let us read the data from the train.csv and test.csv files
train_data = pd.read_csv('/kaggle/input/taxi-fare-guru-total-amount-prediction-challenge/train.csv')
test_data = pd.read_csv('/kaggle/input/taxi-fare-guru-total-amount-prediction-challenge/test.csv')

In [None]:
# We will start with inspecting the data
print(train_data.shape) # dataset contains 175,000 rows and 17 columns with total_amount as the target 'y'
train_data.head(5)

In [None]:
# Inspecting the test.csv on which we will make the predictions
print(test_data.shape) # dataset contains 50,000 rows and 16 columns
test_data.head(5)

In [None]:
# From understanding of the dataset and description focus will be on predicting the total_amount.
# Before we get onto that let us understand the dataset more.

print("Columns: ", train_data.columns.values)

### From the analysis and the problem statement we can divide the columns into
#### Features
1. VendorID
2. tpep_pickup_datetime
3. tpep_dropoff_datetime
4. passenger_count
5. trip_distance
6. RatecodeID
7. store_and_fwd_flag
8. PULocationID
9. DOLocationID
10. payment_type
11. extra
12. tip_amount
13. tolls_amount
14. improvement_surcharge
15. congestion_surcharge
16. Airport_fee

#### Label
1. total_amount

In [None]:
print("Columns: ", test_data.columns.values)
print(test_data.info()) 

# Null values: passenger_count, RatecodeID, store_and_fwd_flag,
# congestion_surcharge and Airport_fee have 1779 null values each

# tpep_pickup_datetime and tpep_dropoff_datetime will need to be converted to the
# datetime dtype, and payment_type and store_and_fwd_flag will need to be encoded into categories

In [None]:
# Let us understand the nature of train data
train_data.describe()

# A few observations - outliers and abnormalities
# The max value of trip_distance reaches 135182, which looks abnormal.
# tip_amount reaches 484, and total_amount also has a large negative value.
# improvement_surcharge, extra, tolls_amount, congestion_surcharge and Airport_fee also have negative values.

In [None]:
# Similarly inspecting the nature of test data
test_data.describe()

In [None]:
# Checking duplicate entries in training data
train_data.duplicated().value_counts()

In [None]:
# Checking duplicate entries in test data
test_data.duplicated().value_counts()

In [None]:
# Let us inspect the data further.
# Passenger count
print("Passenger counts unique values: ", train_data.passenger_count.value_counts())
print("Passenger counts unique values: ", test_data.passenger_count.value_counts())

# Both train and test data have entries with a passenger count of 0, which is interesting.

In [None]:
plt.hist(train_data.passenger_count)

In [None]:
plt.hist(test_data.passenger_count)

In [None]:
train_data.store_and_fwd_flag.value_counts()

# We will use LabelEncoder here as there are just two categories

In [None]:
train_data.payment_type.value_counts()

# This column calls for OneHotEncoder, as the categories have no order that would justify OrdinalEncoder or LabelEncoder

In [None]:
plt.boxplot(train_data['trip_distance'])

# There are clearly 4 outliers in the trip distance in train_data

In [None]:
Q1 = np.percentile(train_data['trip_distance'], 25, method='midpoint')
Q3 = np.percentile(train_data['trip_distance'], 75, method='midpoint')
IQR = Q3 - Q1
print(IQR)

# Above upper bound
upper = Q3 + 1.5 * IQR
upper_array = np.array(train_data['trip_distance'] >= upper)
print("Upper Bound:", upper)
print(upper_array.sum())

# Below lower bound
lower = Q1 - 1.5 * IQR
lower_array = np.array(train_data['trip_distance'] <= lower)
print("Lower Bound:", lower)
print(lower_array.sum())

# Row positions of the outliers (np.where returns a tuple, hence the [0])
upper_array_idx = np.where(upper_array)[0]
lower_array_idx = np.where(lower_array)[0]

# We can see that there are 24133 outliers by the IQR rule. This could be because
# passenger_count varies a lot, ranging from 0 to 9, with 1 being by far the
# most frequent, so most of the data is skewed towards that

In [None]:
plt.boxplot(test_data['trip_distance'])

# There is 1 outlier in the test_data trip_distance

In [None]:
print("RatecodeID unique: ", train_data['RatecodeID'].value_counts())
print("RatecodeID unique: ", test_data['RatecodeID'].value_counts())

# RatecodeID are discrete ranging from 1-6 and 99. 
# 6 is not present in training data and appears only once in test data.
# We will use OneHotEncoding for these

### No null values in:
1. extra
2. tip_amount
3. tolls_amount
4. improvement_surcharge

### Null values in:
1. congestion_surcharge
2. Airport_fee

The presence of null values in the congestion surcharge and airport fee is plausible, as not every ride picks up from or drops off at an airport. Also, congestion charges could be tied to specific times of the day or to specific locations.

We will examine `congestion_surcharge` in more detail after we transform the pickup date and time. For now let's explore `extra`, `tip_amount`, `tolls_amount`, `improvement_surcharge` and `Airport_fee`.

In [None]:
attribute_list = ['extra','tip_amount','tolls_amount','improvement_surcharge','total_amount']
scatter_matrix(train_data[attribute_list])

In [None]:
print(sorted(test_data.extra.value_counts().index))

# Extra charges are absent in 51247 cases and present in the rest. In some cases they are even negative.
# We can use this information to add a new column marking extra charges as positive, absent or negative.

In [None]:
test_data.extra.value_counts()

# Extra charges follow a specific pattern, as the values increase in fixed steps

In [None]:
plt.boxplot(test_data.extra)

# It seems there are outliers in extra charges also

In [None]:
train_data.tip_amount.value_counts()

# Unlike the other charges, tip_amount is continuous in nature, with many unique values.

In [None]:
plt.boxplot(train_data.tip_amount)

# Similarly, there are outliers in tip_amount as well.

In [None]:
train_data.tolls_amount.value_counts()
# tolls_amount is 0 in 159328 of the cases and non-zero in the others.
# Maybe we can create a categorical column that defines whether tolls_amount is present or absent.

In [None]:
train_data.improvement_surcharge.value_counts()

# improvement_surcharge is 1 in 173145 cases, -1 in 1725 cases, and 0.3 or 0 in very few cases.

In [None]:
test_data.improvement_surcharge.value_counts()

# similar four distinct values exist for the test data improvement surcharge.
# It can be observed that these surcharges are in dollars

In [None]:
train_data.congestion_surcharge.value_counts()

# Similarly congestion_surcharges are either present, absent or negative. 
# These figures seem to be percentage charges.

In [None]:
plt.scatter(train_data.congestion_surcharge, train_data.PULocationID)

# Congestion surcharge is evenly distributed across location IDs

In [None]:
plt.scatter(train_data.congestion_surcharge, train_data.DOLocationID)

In [None]:
train_data.Airport_fee.value_counts()

# Here the airport fee is either in dollars or a percentage of total_amount.
# Let's look at total_amount to come to a conclusion.

In [None]:
plt.scatter(train_data.Airport_fee, train_data.PULocationID)

In [None]:
plt.scatter(train_data.Airport_fee, train_data.DOLocationID)

# There is no specific correlation between Airport_fee and the pickup or dropoff location ID

In [None]:
train_data.total_amount.describe()

# Based on the data it is still unclear why some of the values are negative. Let us explore these values more.

In [None]:
train_data[train_data.Airport_fee<0].total_amount.describe()

# It seems that where Airport_fee is negative, total_amount is also negative

In [None]:
train_data[train_data.congestion_surcharge<0].total_amount.describe()

# The story is similar when congestion_surcharge is negative.
# Could it be that the values are marked negative because the ride was offered entirely at a discount to the customer?
# Or it could be that a refund was issued to the customer for the ride.

In [None]:
print("Total count total_amount with negative values in training data:", (train_data.total_amount<0).sum())

# total_amount was negative in only 1725 cases in the entire training dataset.
# We can experiment with these cases by flipping the sign of total_amount and of
# the charges associated with them. Our model is for predicting the total_amount,
# not for taking discounts or refunds into consideration.

### Explore the refund possibility in total_amount

## Data Exploration Summary

### About Dataset
1. We have 16 columns for prediction and 1 target variable called `total_amount`.
2. Training dataset has 175000 rows and 17 columns in total.
3. Test dataset has 50000 rows and 16 columns in total.

### Null Values
1. `passenger_count`, `RatecodeID`, `store_and_fwd_flag`, `congestion_surcharge` and `Airport_fee` are the columns with null values, and each has the same number of nulls.
2. The null count is 6077 rows, which makes up about 3.47% of the training data.
3. The test data tells a similar story for the same columns, with 1779 null rows making up about 3.56% of the data.
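Equal null counts across several columns suggest, but do not prove, that the nulls sit in the same rows. The sketch below shows one way to check this along with the per-column percentages. It runs on a made-up toy frame standing in for `train_data`; the names `toy`, `null_share` and `rows_all_null` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, 1.0],
    "RatecodeID": [1.0, np.nan, 1.0, 2.0],
    "trip_distance": [2.5, 1.1, 7.3, 0.9],
})

null_counts = toy.isna().sum()            # nulls per column
null_share = null_counts / len(toy) * 100  # the same, as a percentage of rows

# Count the rows in which every null-bearing column is null at once.
null_cols = null_counts[null_counts > 0].index
rows_all_null = toy[null_cols].isna().all(axis=1).sum()

print(pd.DataFrame({"nulls": null_counts, "percent": null_share.round(2)}))
print("Rows missing all of", list(null_cols), ":", rows_all_null)
```

If `rows_all_null` matches the per-column null count, the missing values come from the same set of rows and can be treated as one group when imputing.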

### Outliers
1. There are outliers in `trip_distance`, which will be handled using QuantileTransformer or RobustScaler.
2. `extra` and `tip_amount` also have outliers, which need to be handled.
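As a reminder of how the IQR rule flags such outliers, here is a minimal NumPy sketch on made-up distances. The helper name `iqr_bounds` and the multiplier `k=1.5` are assumptions for illustration (1.5 is the usual Tukey convention), and the default linear interpolation is used rather than the `midpoint` method.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up trip distances with one obviously abnormal value.
distances = np.array([0.8, 1.2, 1.5, 2.0, 2.4, 3.1, 3.9, 135182.0])

lower, upper = iqr_bounds(distances)
outlier_mask = (distances < lower) | (distances > upper)
outlier_idx = np.flatnonzero(outlier_mask)  # row positions of the outliers

print("Bounds:", lower, upper)
print("Outlier positions:", outlier_idx)
```

Because the fences depend only on the quartiles, a single extreme value such as 135182 does not move them, which is why the rule stays usable on heavily skewed columns.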

### Required Feature Engineering
1. `tpep_pickup_datetime` and `tpep_dropoff_datetime` will be converted to datetime.
2. `trip_duration` will be calculated.
3. `weekday` will be calculated.
4. `pick_time_of_day` and `drop_time_of_day`: the hour specifically will be extracted and used for prediction. Data exploration in Tableau gave the insight that using a polynomial of degree 8 for the prediction leads to a higher correlation. We will see how we can use this.
5. `store_and_fwd_flag` will be encoded into categories.
6. `payment_type` will be encoded using LabelEncoder.
7. `extra` can be used to create a categorical column marking extra charges as present or absent.
8. `tolls_amount` can be used to create a `tolls_present` column, where 0 means no toll and 1 means a toll was charged.
9. `improvement_surcharge` negative values to be changed to positive and used as percentages. <i>(This experiment was unsuccessful in improving the model)</i>
10. `congestion_surcharge` negative values to be changed to positive and used as percentages. <i>(This experiment was unsuccessful in improving the model)</i>
11. `congestion_surcharge` null values to be filled. Here we can explore how congestion correlates with pickup location, pickup time, trip duration and distance, to understand which of these relates to it best. Based on that we can decide how to fill the missing values.
12. `Airport_fee` negative values to be changed to positive values. <i>(This experiment was unsuccessful in improving the model)</i>
13. `Airport_fee` can also be used to create a categorical column marking the airport fee as present or absent.
14. `Airport_fee` can be considered 0 when the airport fee is missing. There is no correlation between fee and location.
15. We will also experiment with `total_amount` and convert the negative values to positive, as it could lead to better predictions. <i>(This experiment was unsuccessful in improving the model)</i>
16. Alternate approach: calculate the `gross_charges`, which depend solely on the ride distance and probably the duration. Other charges such as `extra`, `tolls_amount`, `improvement_surcharge`, `congestion_surcharge` and `Airport_fee` can then be added to that to calculate the `total_amount`.
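The present/absent columns proposed in items 7, 8 and 13 can be derived with simple comparisons. The sketch below uses made-up rows; the column names `extra_sign`, `tolls_present` and `airport_fee_present` are hypothetical, and treating a missing `Airport_fee` as 0 follows item 14.

```python
import numpy as np
import pandas as pd

# Made-up rows; the source columns follow the dataset, the flag names are hypothetical.
toy = pd.DataFrame({
    "extra": [0.0, 2.5, -1.0, 5.0],
    "tolls_amount": [0.0, 6.55, 0.0, 0.0],
    "Airport_fee": [0.0, 1.75, np.nan, 0.0],
})

# -1 = negative, 0 = absent, 1 = positive extra charge
toy["extra_sign"] = np.sign(toy["extra"]).astype(int)

# 1 when a toll was charged, 0 otherwise
toy["tolls_present"] = (toy["tolls_amount"] != 0).astype(int)

# A missing airport fee is treated as 0 before flagging
toy["airport_fee_present"] = (toy["Airport_fee"].fillna(0) != 0).astype(int)

print(toy)
```

Tree-based models can usually learn these thresholds on their own, so flags like these tend to matter more for linear models.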

### Required scaling
1. We will need to scale the data. Using a scaler that is robust to outliers will also reduce their effect.
2. `trip_distance` has outliers, which can be handled using RobustScaler or QuantileTransformer.
3. `extra` and `tip_amount` have outliers, which need handling only if we use them for prediction. If we use the alternate approach, we do not need to scale them.

In [None]:
# Let's run and test a model before engineering the data. We will treat its score as the baseline for the engineering ahead

# The choice of XGBRegressor is based on the experience of running the script multiple times before
score_list = {}
count = 0
def test(activity):
    global count
    X_train, X_test, y_train, y_test = train_test_split(train_data.drop(['total_amount',
                                                                         'tpep_pickup_datetime',
                                                                        'tpep_dropoff_datetime', 
                                                                         'payment_type','store_and_fwd_flag'], axis=1), 
                                                        train_data.total_amount, test_size = 0.2, random_state=42)
    xgbr_model = xgboost.XGBRegressor()
    xgbr_model.fit(X_train, y_train)
    score_list[activity] = [count, xgbr_model.score(X_test, y_test)]
    count += 1
    return score_list[activity]
print(test('base'))
# Now we will try to improve the model by feature engineering

## Feature Transformation
### Calculate trip_duration
Pickup time and dropoff time can give us very useful information:
1. The trip duration, by taking the difference of dropoff and pickup time.
2. The time of day when the trip started and ended, which may also correlate with the prices charged.
3. The speed of travel, by combining the duration with the trip distance. It can give us an idea of whether traffic or waiting times correlate with the charges. Slow average speeds could mean higher overall charges or a congestion_surcharge.
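The first and third ideas can be sketched on two made-up trips as follows. The `avg_speed` column is hypothetical (the notebook itself only computes `trip_duration`), and it is expressed in distance units per hour because the unit of `trip_distance` is not stated here.

```python
import numpy as np
import pandas as pd

# Two made-up trips in the same string format as the raw columns.
toy = pd.DataFrame({
    "tpep_pickup_datetime": ["2023-06-28 17:20:21", "2023-06-29 23:05:01"],
    "tpep_dropoff_datetime": ["2023-06-28 17:50:21", "2023-06-29 23:20:01"],
    "trip_distance": [5.0, 2.0],
})
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    toy[col] = pd.to_datetime(toy[col])

# Duration in minutes from the timedelta between dropoff and pickup
elapsed = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
toy["trip_duration"] = elapsed.dt.total_seconds() / 60

# Hypothetical feature: average speed in distance units per hour.
# Zero durations become NaN so the division cannot produce inf.
hours = (toy["trip_duration"] / 60).replace(0, np.nan)
toy["avg_speed"] = toy["trip_distance"] / hours

print(toy[["trip_duration", "avg_speed"]])
```

Negative or zero durations, which do occur in this dataset, would need to be handled before a speed feature like this is meaningful.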

train_data.payment_type.value_counts()

In [None]:
# Saving a copy of train and test 
raw_train = train_data.copy()
raw_test = test_data.copy()

In [None]:
# Converting datetime strings to datetime objects in both train and test data
train_data.tpep_pickup_datetime = pd.to_datetime(train_data.tpep_pickup_datetime)
train_data.tpep_dropoff_datetime = pd.to_datetime(train_data.tpep_dropoff_datetime)

# Doing the same for test set
test_data.tpep_pickup_datetime = pd.to_datetime(test_data.tpep_pickup_datetime)
test_data.tpep_dropoff_datetime = pd.to_datetime(test_data.tpep_dropoff_datetime)

In [None]:
# Calculating the duration of the trip and converting it to minutes via the .dt.total_seconds() accessor
train_data['trip_duration'] = (train_data.tpep_dropoff_datetime - train_data.tpep_pickup_datetime).dt.total_seconds() / 60

# Performing the same for test data
test_data['trip_duration'] = (test_data.tpep_dropoff_datetime - test_data.tpep_pickup_datetime).dt.total_seconds() / 60


print("Train Data negative trip durations", train_data.trip_duration[train_data.trip_duration<0].count())
# Strangely, the duration is negative in 65674 cases

print("Test Data negative trip durations", test_data.trip_duration[test_data.trip_duration<0].count())
# Similarly, the duration is negative in 18578 cases
print(test('added_trip_duration'))
# After this engineering the model performed worse than earlier

In [None]:
# Replacing the negative durations with their absolute values
train_data['trip_duration'] = train_data['trip_duration'].abs()

# Repeating the same for the test data values
test_data['trip_duration'] = test_data['trip_duration'].abs()
print(test('mod_trip_duration'))
# Converting negative durations to positive improved the model accuracy,
# but the accuracy is still lower than it was without trip_duration

### Calculate pick_time_of_day and drop_time_of_day

In [None]:
# Extracting time of the day only the hour from datetime data for pickup and dropoff

train_data['pick_time_of_day'] = train_data['tpep_pickup_datetime'].dt.hour
train_data['drop_time_of_day'] = train_data['tpep_dropoff_datetime'].dt.hour

# Performing same for test data
test_data['pick_time_of_day'] = test_data['tpep_pickup_datetime'].dt.hour
test_data['drop_time_of_day'] = test_data['tpep_dropoff_datetime'].dt.hour
print(test('add_pickup_drop'))

In [None]:
def replace_count(df, column):
    # Frequency encoding: replace each location ID with the number of times it occurs.
    # Series.map applies all replacements in a single pass, so an ID that happens to
    # equal another ID's count is not replaced a second time.
    df[column] = df[column].map(df[column].value_counts())

replace_count(train_data, 'PULocationID')
replace_count(train_data, 'DOLocationID')
replace_count(test_data, 'PULocationID')
replace_count(test_data, 'DOLocationID')

    
# D_train_PU = {}
# for i in train_data['PULocationID'].value_counts().index:
#     D_train_PU[i] = train_data['PULocationID'].value_counts()[i]

# D_train_DO = {}
# for i in train_data['DOLocationID'].value_counts().index:
#     D_train_DO[i] = train_data['DOLocationID'].value_counts()[i]

# D_test_PU = {}
# for i in test_data['PULocationID'].value_counts().index:
#     D_test_PU[i] = test_data['PULocationID'].value_counts()[i]
# D_train
print(test('add_Loc_ID_count'))

### Calculate weekend, weekdays column

In [None]:
train_data['weekdays'] = train_data['tpep_pickup_datetime'].dt.dayofweek
test_data['weekdays'] = test_data['tpep_pickup_datetime'].dt.dayofweek
print(test('add_weekday'))

In [None]:
print(test_data.weekdays.value_counts())
plt.scatter(train_data.weekdays, train_data.total_amount)

### Converting negative to positive in charges

<i>(This experiment was un-successful in improving model)</i>

In [None]:
# # extra
# train_data.loc[train_data['extra'] < 0,'extra'] = train_data.loc[train_data['extra'] < 0,'extra']*(-1)
# test_data.loc[test_data['extra'] < 0,'extra'] = test_data.loc[test_data['extra'] < 0,'extra']*(-1)
# print(test('mod_extra'))

# # # tolls_amount
# train_data.loc[train_data['tolls_amount'] < 0,'tolls_amount'] = train_data.loc[train_data['tolls_amount'] < 0,'tolls_amount']*(-1)
# test_data.loc[test_data['tolls_amount'] < 0,'tolls_amount'] = test_data.loc[test_data['tolls_amount'] < 0,'tolls_amount']*(-1)
# print(test('mod_tolls_amount'))

# # improvement_surcharge
# train_data.loc[train_data['improvement_surcharge'] < 0,'improvement_surcharge'] = train_data.loc[train_data['improvement_surcharge'] < 0,'improvement_surcharge']*(-1)
# test_data.loc[test_data['improvement_surcharge'] < 0,'improvement_surcharge'] = test_data.loc[test_data['improvement_surcharge'] < 0,'improvement_surcharge']*(-1)
# print(test('mod_imp_surcharge'))

In [None]:
# # # Replacing 0.3 with 0 for both train and test data
train_data.loc[train_data['improvement_surcharge'] == 0.3,'improvement_surcharge'] = 0
test_data.loc[test_data['improvement_surcharge'] == 0.3,'improvement_surcharge'] = 0

print(test('mod_imp_surcharge_.3'))
# Performance worsened after the run here

In [None]:
# # # congestion_surcharge
train_data.loc[train_data['congestion_surcharge'] < 0,'congestion_surcharge'] = train_data.loc[train_data['congestion_surcharge'] < 0,'congestion_surcharge']*(-1)
test_data.loc[test_data['congestion_surcharge'] < 0,'congestion_surcharge'] = test_data.loc[test_data['congestion_surcharge'] < 0,'congestion_surcharge']*(-1)
print(test('con_surcharge'))

# # # Airport_fee
train_data.loc[train_data['Airport_fee'] < 0,'Airport_fee'] = train_data.loc[train_data['Airport_fee'] < 0,'Airport_fee']*(-1)
test_data.loc[test_data['Airport_fee'] < 0,'Airport_fee'] = test_data.loc[test_data['Airport_fee'] < 0,'Airport_fee']*(-1)
print(test('airport_fee'))

In [None]:
# total_amount
# train_data.loc[train_data['total_amount'] < 0,'total_amount'] = train_data.loc[train_data['total_amount'] < 0,'total_amount']*(-1)
# print(test('mod_total_amount'))

### Filling null values with 0 for the required charges

In [None]:
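# Assigning the result back instead of calling fillna(inplace=True) on an attribute,
# which may not update the DataFrame in recent pandas versions
train_data['Airport_fee'] = train_data['Airport_fee'].fillna(0)
test_data['Airport_fee'] = test_data['Airport_fee'].fillna(0)
print(test('impute_Airport_fee'))

train_data['congestion_surcharge'] = train_data['congestion_surcharge'].fillna(0)
test_data['congestion_surcharge'] = test_data['congestion_surcharge'].fillna(0)
print(test('impute_congestion_fee'))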

In [None]:
# Checking the transformations for data

print(train_data.improvement_surcharge.value_counts())
print(test_data.improvement_surcharge.value_counts())
print("##############################################\n")
print(train_data.congestion_surcharge.value_counts())
print(test_data.congestion_surcharge.value_counts())
print("##############################################\n")
print(train_data.Airport_fee.value_counts())
print(test_data.Airport_fee.value_counts())
print("##############################################")

In [None]:
# This version of test overwrites the previous one. The columns to drop are now passed in as a list,
# since the columns that were dropped inside the old function have been modified.
target_total = train_data['total_amount']

def test(activity,L=[]):
    global count
    X_train, X_test, y_train, y_test = train_test_split(train_data.drop(L,
                                                                        axis=1),
                                                        target_total, 
                                                        test_size = 0.2, 
                                                        random_state=42)
    xgbr_model = xgboost.XGBRegressor()
    xgbr_model.fit(X_train, y_train)
    score_list[activity] = [count, xgbr_model.score(X_test, y_test)]
    count += 1
    return score_list[activity]

### Encoding

In [None]:
# store_and_fwd_flag
# LabelEncoder expects 1-D input, so the column is passed directly instead of being reshaped to 2-D
l_enc = LabelEncoder()
train_data.store_and_fwd_flag = l_enc.fit_transform(train_data.store_and_fwd_flag)
test_data.store_and_fwd_flag = l_enc.transform(test_data.store_and_fwd_flag)
print(test('encode_saf_flag',['tpep_pickup_datetime','tpep_dropoff_datetime','total_amount','payment_type']))

# payment_type
l_enc = LabelEncoder()
train_data.payment_type = l_enc.fit_transform(train_data.payment_type)
test_data.payment_type = l_enc.transform(test_data.payment_type)
print(test('payment_type',['tpep_pickup_datetime','tpep_dropoff_datetime','total_amount']))

### Filling passenger_count and RatecodeID using SimpleImputer

<i>Experiment with gross amount did not perform better than total_amount</i>

In [None]:
# Since the separate amounts are also given, we can calculate gross_amount = total_amount - extra - tip_amount - tolls_amount - improvement_surcharge - congestion_surcharge - Airport_fee
# train_data['gross_amount'] = train_data['total_amount']-train_data['extra']-train_data['tip_amount']-train_data['tolls_amount']-train_data['improvement_surcharge']-train_data['congestion_surcharge']-train_data['Airport_fee']

# target_gross = train_data['gross_amount']



In [None]:
# Tried KNNImputer with n_neighbors = 5 to fill in the missing values, but dropped it in subsequent
# runs as the gain from KNN was not enough to justify the long runtime

imputer = SimpleImputer(strategy='median')
train_data.drop(['tpep_pickup_datetime','tpep_dropoff_datetime','total_amount'],axis=1, inplace=True)
train_data_knn = imputer.fit_transform(train_data)

# Applying the same fitted SimpleImputer to the test data
test_data.drop(['tpep_pickup_datetime','tpep_dropoff_datetime'],axis=1, inplace=True)
test_data_knn = imputer.transform(test_data)

In [None]:
# Rebuilding the DataFrame from the SimpleImputer output (a NumPy array), keeping column names and index
train_data = pd.DataFrame(train_data_knn, columns=train_data.columns, index=train_data.index)
print(test('impute_missing_values'))
# Doing the same for the test data
test_data = pd.DataFrame(test_data_knn, columns=test_data.columns, index=test_data.index)

### Scaling
Let us scale trip_distance and try again. Since trip_distance has outliers we will experiment with two methods:

1. RobustScaler
2. QuantileTransformer

We will apply each and see in which case model performs the best
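To see how the two methods differ in their treatment of an extreme value, here is a small sketch on synthetic data. The exponential sample, the injected value 135182 and the parameter `n_quantiles=100` are assumptions chosen for illustration, not settings taken from the runs in this notebook.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, QuantileTransformer

# Synthetic, skewed "distances" with one extreme value injected.
rng = np.random.default_rng(0)
distance = rng.exponential(scale=3.0, size=(500, 1))
distance[0, 0] = 135182.0

# RobustScaler centres on the median and divides by the IQR,
# so the outlier does not distort the scale but stays far from the rest.
robust = RobustScaler().fit_transform(distance)

# QuantileTransformer maps values to their rank-based quantile,
# so the outlier is squeezed into the [0, 1] range with everything else.
quantile = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=0
).fit_transform(distance)

print("RobustScaler        median:", np.median(robust), "max:", robust.max())
print("QuantileTransformer min/max:", quantile.min(), quantile.max())
```

Neither method removes the outlier row. RobustScaler only keeps it from influencing the centre and scale, while QuantileTransformer compresses it at the cost of distorting distances between ordinary values.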

In [None]:
# Initializing scalers
# r_scaler = RobustScaler(with_centering=True, with_scaling=True, 
#                         quantile_range=(25.0, 75.0), copy=True, unit_variance=False)
# from sklearn.preprocessing import StandardScaler
# r_scaler = StandardScaler()

# Not using QuantileTransformer as it performed worse than r_scaler in a previous run
# q_transformer = QuantileTransformer(n_quantiles=1000, output_distribution='uniform', 
#                                     ignore_implicit_zeros=False, subsample=10000, 
#                                     random_state=None, copy=True)

In [None]:
# # Initializing scalers and column transformer
# ColumnTransformer()

ct = ColumnTransformer([('r_scaler', RobustScaler(),['trip_distance',
                                                     'extra',
                                                    'tip_amount',
                                                    'tolls_amount',
                                                    'trip_duration',
                                                    'PULocationID',
                                                    'DOLocationID',
                                                    ])], remainder='passthrough')
train_data = ct.fit_transform(train_data)
test_data = ct.transform(test_data)

# q_transformer performed worse than r_scaler, hence it was not used further
# # train_data.trip_distance = q_transformer.fit_transform(np.array(train_data.trip_distance).reshape(-1,1))
# # test_data.trip_distance = q_transformer.transform(np.array(test_data.trip_distance).reshape(-1,1))

### Saving processed data for reuse in later runs

In [None]:
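# ColumnTransformer returns NumPy arrays, so they are wrapped in DataFrames before saving
pd.DataFrame(train_data).to_csv('train_data_processed.csv', index=False)
pd.DataFrame(test_data).to_csv('test_data_processed.csv', index=False)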
# target_gross.to_csv('train_gross_processed.csv', index=False)
# target_total.to_csv('train_total_processed.csv', index=False)

### Let's use all of them now in RandomForest to train

In [None]:
train_data.shape

In [None]:
test_data.shape

## Splitting the train_data

In [None]:
X = train_data
y = target_total

In [None]:
# splitting the data for training and testing


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model Building
### List of models to try
0. DummyRegressor
1. LinearRegression
2. KNeighborsRegressor
3. DecisionTreeRegressor
4. RandomForestRegressor 
5. BaggingRegressor
6. GradientBoostingRegressor 
7. AdaBoostRegressor
8. MLPRegressor
9. xgboost
10. lightgbm
11. SVR - (Tried but terminated the use due to very long run-time)
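Since every model in the list is fitted and scored the same way, the per-model cells could also be written as one loop. The sketch below shows that pattern on synthetic data from `make_regression` with only three of the models, so it illustrates the structure rather than reproducing the notebook's results. The names `candidates` and `scores` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the processed taxi features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "DummyRegressor": DummyRegressor(strategy="mean"),
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
}

# Fit each candidate on the same split and record its R2 on the held-out part.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))

# Print the candidates from best to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} R2 = {score:.4f}")
```

Keeping the split fixed across models, as done with `random_state=42` throughout this notebook, is what makes the scores comparable.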

### Different Iterations
#### Round 1
Using all features. Trained using gross_amount
#### Round 2
Using all features. Trained using total_amount

In [None]:
# creating a dataframe to save the model metrics
metrics = pd.DataFrame(columns = ['Model','MAE', 'MSE', 'RMSE', 'R2_score', 'Explained_Variance_Score'])
def gen_metrics(idx, y_test, y_pred):
    models = ['DummyRegressor','LinearRegression',
              'KNeighborsRegressor', 'DecisionTreeRegressor',
              'RandomForestRegressor','BaggingRegressor',
             'GradientBoostingRegressor','AdaBoostRegressor',
             'MLPRegressor','XGBoost','LightGBM']
    metrics.loc[idx,'Model'] = models[idx]
    metrics.loc[idx, 'MAE'] = mean_absolute_error(y_test, y_pred)
    metrics.loc[idx, 'MSE'] = mean_squared_error(y_test, y_pred)
    metrics.loc[idx, 'RMSE'] = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False is not available in newer scikit-learn
    metrics.loc[idx, 'R2_score'] = r2_score(y_test, y_pred)
    metrics.loc[idx, 'Explained_Variance_Score'] = explained_variance_score(y_test, y_pred)

### 0. DummyRegressor

In [None]:
# first base model using DummyRegressor

dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X_train,y_train)

gen_metrics(0, y_test, dummy_regr.predict(X_test))

print("R2_value: ", dummy_regr.score(X_test,y_test))

### 1. LinearRegression

#### Round 1
.56718 using gross_amount

#### Round 2
.71029 using total_amount

In [None]:
lr_model = LinearRegression()
lr_model.fit(X_train,y_train)
gen_metrics(1, y_test, lr_model.predict(X_test))
print("R2_value: ", lr_model.score(X_test,y_test))

### 2. KNeighborsRegressor

In [None]:
knr_model = KNeighborsRegressor()
knr_model.fit(X_train, y_train)
gen_metrics(2, y_test, knr_model.predict(X_test))
print("R2_value: ", knr_model.score(X_test,y_test))

### 3. DecisionTreeRegressor

In [None]:
dtr_model = DecisionTreeRegressor()
dtr_model.fit(X_train, y_train)
gen_metrics(3, y_test, dtr_model.predict(X_test))
print("R2_value: ", dtr_model.score(X_test,y_test))

### 4. RandomForestRegressor



In [None]:
#### Round 1 
# gave us .93773 score with gross_amount

#### Round 2 
# gave us .95969 score with total_amount

rfr_model = RandomForestRegressor(n_estimators=100, random_state=42)
rfr_model.fit(X_train,y_train)
gen_metrics(4, y_test, rfr_model.predict(X_test))
print("feature_importance: ", rfr_model.feature_importances_)
print("R2_value: ", rfr_model.score(X_test,y_test))
# 100 estimators - 0.9583965044644674
# 1000 estimators - 0.9583448385695726

### 5. BaggingRegressor

In [None]:
br_model = BaggingRegressor()
br_model.fit(X_train, y_train)
gen_metrics(5, y_test, br_model.predict(X_test))
print("R2_value: ", br_model.score(X_test,y_test))

### 6. GradientBoostingRegressor 


In [None]:
gbr_model = GradientBoostingRegressor()
gbr_model.fit(X_train, y_train)
gen_metrics(6, y_test, gbr_model.predict(X_test))
print("R2_value: ", gbr_model.score(X_test,y_test))

### 7. AdaBoostRegressor

In [None]:
abr_model = AdaBoostRegressor()
abr_model.fit(X_train, y_train)
gen_metrics(7, y_test, abr_model.predict(X_test))
print("R2_value: ", abr_model.score(X_test,y_test))

### 8. MLPRegressor

In [None]:
mlpr_model = MLPRegressor()
mlpr_model.fit(X_train, y_train)
gen_metrics(8, y_test, mlpr_model.predict(X_test))
print("R2_value: ", mlpr_model.score(X_test,y_test))

### 9. xgboost

In [None]:
xgbr_model = xgboost.XGBRegressor()
xgbr_model.fit(X_train, y_train)
gen_metrics(9, y_test, xgbr_model.predict(X_test))
print("R2_value: ", xgbr_model.score(X_test,y_test))

### 10. lightgbm

In [None]:
lgbmr_model = lightgbm.LGBMRegressor()
lgbmr_model.fit(X_train, y_train)
gen_metrics(10, y_test, lgbmr_model.predict(X_test))
print("R2_value: ", lgbmr_model.score(X_test,y_test))

### 11. SVR

In [None]:
# svr_model = SVR()
# svr_model.fit(X_train, y_train)
# print("R2_value: ", svr_model.score(X_test,y_test))
# print("coefficients: ", svr_model.coef_)  # coef_ is only available when kernel='linear'

### Evaluating Untuned Models

In [None]:
metrics

### RESULT
Based on the metrics table above, `XGBRegressor` (xgboost) and `RandomForestRegressor` performed best among the untuned models: they have the lowest MAE, MSE and RMSE and the highest R2 score and explained variance score.
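
The comparison above can also be produced in a single loop instead of one cell per model. The sketch below is illustrative only: it uses synthetic data and a hypothetical `candidates` dictionary with a few of the models from this notebook, and it does not reuse the notebook's `gen_metrics` helper or `metrics` table. Sorting by RMSE puts the strongest model in the first row.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): not the taxi fare dataset
X, y = make_regression(n_samples=500, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical subset of the models tried in this notebook
candidates = {
    "DummyRegressor": DummyRegressor(),
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

rows = []
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, pred)
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_te, pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),  # RMSE is the square root of MSE
        "R2": r2_score(y_te, pred),
        "EVS": explained_variance_score(y_te, pred),
    })

# Lower error is better, so the best model ends up in the first row
comparison = pd.DataFrame(rows).set_index("model").sort_values("RMSE")
print(comparison.round(4))
```

Keeping `DummyRegressor` in the table gives a baseline: any model worth submitting should beat it clearly on every metric.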

## Final_Submission

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(train_data.drop('', ), target_total, test_size = 0.2, random_state=42)
# xgbr_model = xgboost.XGBRegressor()
# xgbr_model.fit(X_train, y_train)

In [None]:
xgbr_model = xgboost.XGBRegressor(booster='gbtree',
                                  colsample_bytree=0.9,
                                  enable_categorical=False, 
                                  learning_rate=0.15, 
                                  max_depth=6,
                                  min_child_weight=3, 
                                  n_estimators=300, 
                                  n_jobs=-1, 
                                  objective='reg:squarederror')

xgbr_model.fit(train_data, target_total)
submission = xgbr_model.predict(test_data)
# IDs run from 1 to the number of test rows (50,000 here)
my_submission = pd.DataFrame({'ID': range(1, len(submission) + 1), 'total_amount': submission})

# Any filename works; 'submission.csv' is used here
my_submission.to_csv('submission.csv', index=False)

In [None]:
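from sklearn.metrics import r2_score
# Agreement check against a previously saved submission: this compares two sets of
# predictions, so a high value means similar outputs, not accuracy on the true fares.
r2_score(pd.read_csv('submission.csv').total_amount, pd.read_csv('/kaggle/input/xgb-94564/xgb_94564.csv').total_amount)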
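
One caveat when comparing two prediction files this way: `r2_score` treats its first argument as the ground truth, so swapping the arguments generally changes the result. The sketch below illustrates this with two small hypothetical prediction vectors (made-up numbers, not values from either submission file) and adds the mean absolute difference as a symmetric alternative.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical prediction vectors standing in for two submission files
preds_new = np.array([1.0, 2.0, 3.0, 4.0])
preds_old = np.array([1.0, 2.0, 3.0, 8.0])

# The first argument plays the role of y_true, so the order matters
r2_new_as_truth = r2_score(preds_new, preds_old)
r2_old_as_truth = r2_score(preds_old, preds_new)

# A symmetric measure of disagreement between the two prediction sets
mean_abs_diff = np.mean(np.abs(preds_new - preds_old))

print("R2 with new predictions as reference:", round(r2_new_as_truth, 4))
print("R2 with old predictions as reference:", round(r2_old_as_truth, 4))
print("Mean absolute difference:", mean_abs_diff)
```

Both calls share the same squared differences, but they divide by the variance of whichever vector is passed first, which is why the two values differ.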

In [None]:
# from sklearn.ensemble import ExtraTreesRegressor

# etr_model = ExtraTreesRegressor(n_jobs=-1, random_state=123)
# etr_model.fit(train_data, target_total)
# submission = etr_model.predict(test_data)
# my_submission = pd.DataFrame({'ID': range(1,50001), 'total_amount': submission})

# # you could use any filename. We choose submission here
# my_submission.to_csv('submission.csv', index=False)

In [None]:
# from sklearn.metrics import r2_score
# r2_score(pd.read_csv("submission.csv").total_amount, pd.read_csv('/kaggle/input/xgb-94564/xgb_94564.csv').total_amount)

In [None]:
# rfr_model = RandomForestRegressor()
# rfr_model.fit(train_data, target_total)
# submission = rfr_model.predict(test_data)
# my_submission = pd.DataFrame({'ID': range(1,50001), 'total_amount': submission})

# # you could use any filename. We choose submission here
# my_submission.to_csv('submission.csv', index=False)

In [None]:
# submission = xgbr_model.predict(test_data)
# my_submission = pd.DataFrame({'ID': range(1,50001), 'total_amount': submission})

# # you could use any filename. We choose submission here
# my_submission.to_csv('submission.csv', index=False)

In [None]:
# Score improved 10 times after using StandardScaler
# Let's drop the unnecessary columns

In [None]:
# train_data = pd.read_csv('/kaggle/input/taxi-fare-guru-total-amount-prediction-challenge/train.csv')
# test_data = pd.read_csv('/kaggle/input/taxi-fare-guru-total-amount-prediction-challenge/test.csv')
# train_data.columns

### Hyperparameter tuning for xgboost model

In [None]:
# xgbr_model = xgboost.XGBRegressor(base_score=None, booster='gbtree', callbacks=None,
#              colsample_bylevel=None, colsample_bynode=None,
#              colsample_bytree=0.9, device='cpu', early_stopping_rounds=None,
#              enable_categorical=False, eval_metric=None, feature_types=None,
#              gamma=None, grow_policy=None, importance_type=None,
#              interaction_constraints=None, learning_rate=0.15, max_bin=None,
#              max_cat_threshold=None, max_cat_to_onehot=None,
#              max_delta_step=None, max_depth=7, max_leaves=None,
#              min_child_weight=3, missing=np.nan, monotone_constraints=None,
#              multi_strategy=None, n_estimators=290, n_jobs=-1,
#              num_parallel_tree=None, objective='reg:squarederror')


# clf = GridSearchCV(xgbr_model,{'max_bin': [128, 256, 512],
#     'max_leaves': [0, 1, 2],  # Only for grow_policy='lossguide'
#     'verbosity': [0, 1, 2]},verbose=1,n_jobs=2)

# xgbr_model = xgboost.XGBRegressor(max_depth = 6, n_estimators=50)
# clf.fit(X_train, y_train)
# xgbr_model.fit(X_train, y_train)
# xgbr_model.fit(train_data, target_total)
# print("R2_value: ", xgbr_model.score(X_test, y_test))

In [None]:
# clf.best_params_

In [None]:
# param_grid = {
#     'learning_rate': [0.01, 0.1, 0.2],
#     'n_estimators': [100, 200, 300],
#     'max_depth': [3, 5, 7],
#     'subsample': [0.8, 0.9, 1.0],
#     'colsample_bytree': [0.8, 0.9, 1.0],
#     'reg_alpha': [0, 0.1, 0.5],
#     'reg_lambda': [0, 0.1, 0.5],
#     'sample_type': ['uniform', 'weighted'],  # Only used by booster='dart'
#     'normalize_type': ['tree', 'forest'],  # Only used by booster='dart'
#     'rate_drop': [0, 0.1, 0.5],  # Only used by booster='dart'
#     'max_bin': [128, 256, 512],
#     'max_leaves': [0, 1, 2],  # Only for grow_policy='lossguide'
#     'verbosity': [0, 1, 2],
#     'seed': [42],
#     'nthread': [4],
#     'interaction_constraints': [None, '[[0, 1], [2, 3]]']
# }

# xgb_model = xgboost.XGBRegressor(tree_method='hist')  # Adjust booster or other parameters as needed
# grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5)
# grid_search.fit(X_train, y_train)

# print("Best parameters found: ", grid_search.best_params_)
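
The grid above spans 3^11 × 2^3 = 1,417,176 parameter combinations, which means roughly 7.1 million model fits with `cv=5`, so an exhaustive search is not practical. A randomized search that samples a fixed number of combinations is one common alternative. The sketch below is a hedged illustration only: it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in estimator, and the parameter ranges are assumptions chosen to keep the example fast. Since `XGBRegressor` follows the scikit-learn estimator interface, the same pattern should carry over with its own parameter names.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption): not the taxi fare dataset
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)

# Distributions to sample from; scipy's uniform(loc, scale) covers [loc, loc + scale]
param_distributions = {
    "learning_rate": uniform(0.01, 0.19),  # values in [0.01, 0.20]
    "n_estimators": randint(50, 151),      # integers from 50 to 150
    "max_depth": randint(3, 8),            # integers from 3 to 7
    "subsample": uniform(0.8, 0.2),        # values in [0.8, 1.0]
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),  # stand-in for XGBRegressor
    param_distributions=param_distributions,
    n_iter=8,                       # only 8 sampled combinations are evaluated
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)

print("Best parameters found: ", search.best_params_)
print("Best CV score (negative MSE): ", search.best_score_)
```

The cost is fixed by `n_iter × cv` (24 fits here) regardless of how many parameters are searched, so the budget can be raised gradually once the promising ranges are known.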

In [None]:
max(score_list.values())

In [None]:
score_list

In [None]:
# Final score: 0.96039
# Highest score reached: 0.9631

# mod extra .9630