<a href="https://colab.research.google.com/github/pranay2310/NYC-Taxi-Trip-Time-Prediction---Capstone-Project.ipynb/blob/main/NYC_Taxi_Trip_Time_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Taxi trip time Prediction : Predicting total ride duration of taxi trips in New York City</u></b>

## <b> Problem Description </b>

### Your task is to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

## <b> Data Description </b>

### The dataset is based on the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). The data was sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.

### <b>NYC Taxi Data.csv</b> - the training set (contains 1458644 trip records)


### Data fields
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* #### trip_duration - duration of the trip in seconds

# Data preprocessing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
import seaborn as sns
from sklearn import preprocessing
import math

In [None]:
#load NYC Taxi trip time dataset
data = pd.read_csv('/content/drive/MyDrive/Copy of NYC_Taxi_Data.csv')
data.head()

In [None]:
data.info()

In [None]:
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'])
data['dropoff_datetime'] = pd.to_datetime(data['dropoff_datetime'])

In [None]:
data.describe().apply(lambda s: s.apply(lambda x: format(x, 'g'))) # format the summary statistics with the general ('g') format for compact display

Compute the trip distance from the pickup and dropoff latitude/longitude values

In [None]:
from geopy.distance import great_circle

In [None]:
def find_distance(pickup_lat,pickup_long,dropoff_lat,dropoff_long):
 
 start=(pickup_lat,pickup_long)
 end=(dropoff_lat,dropoff_long)
 
 return great_circle(start,end).km

In [None]:
# compute the great-circle distance (km) travelled in each trip
data['distance'] = data.apply(lambda x: find_distance(x['pickup_latitude'],
                                                      x['pickup_longitude'],
                                                      x['dropoff_latitude'],
                                                      x['dropoff_longitude'] ), axis=1)
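A row-wise `apply` over roughly 1.4 million trips can be slow. As a hedged alternative, the sketch below computes the same kind of great-circle distance with the haversine formula on whole NumPy/pandas arrays at once. The function name `haversine_km` and the mean Earth radius of 6371.009 km are assumptions made for this illustration, not part of the original notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009  # assumed mean Earth radius for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, NumPy arrays or pandas Series."""
    # convert degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # haversine of the central angle between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# one degree of latitude along a meridian, in km
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
```

With the notebook's DataFrame, the four coordinate columns could be passed directly, for example `haversine_km(data['pickup_latitude'], data['pickup_longitude'], data['dropoff_latitude'], data['dropoff_longitude'])`.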

# Feature creation

Average speed of the vehicle (km/h): distance in km divided by trip duration in hours

In [None]:
data['avg_speed'] = (data.distance/(data.trip_duration/3600))

In [None]:
data['pickup_weekday']=data['pickup_datetime'].dt.day_name()
data['dropoff_weekday']=data['dropoff_datetime'].dt.day_name()
data['pickup_weekday_num']=data['pickup_datetime'].dt.weekday
data['pickup_hour']=data['pickup_datetime'].dt.hour
data['month']=data['pickup_datetime'].dt.month

The day is divided into four main categories: <br>


*   Morning (6AM to 12PM)
*   Afternoon (12 PM to 4 PM)
*   Evening (4PM to 10 PM)
*   Late night (10PM to 6AM)





In [None]:
# time of day at which the customer boards the taxi
def time_of_day(x):
    if x in range(6,12):
        return 'Morning'
    elif x in range(12,16):
        return 'Afternoon'
    elif x in range(16,22):
        return 'Evening'
    else:
        return 'Late night'

In [None]:
data['pickup_timeofday']=data['pickup_hour'].apply(time_of_day)
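The hour-to-category rule above can also be applied to a whole column at once. The sketch below is one possible vectorized version using `np.select`; the helper name `time_of_day_vectorized` is hypothetical and the boundaries mirror the four categories listed earlier (6 to 11 Morning, 12 to 15 Afternoon, 16 to 21 Evening, everything else Late night).

```python
import numpy as np
import pandas as pd


def time_of_day_vectorized(hours):
    """Map pickup hours (0-23) to the four time-of-day categories."""
    hours = pd.Series(hours)
    # Series.between is inclusive on both ends by default
    conditions = [
        hours.between(6, 11),   # Morning
        hours.between(12, 15),  # Afternoon
        hours.between(16, 21),  # Evening
    ]
    labels = ['Morning', 'Afternoon', 'Evening']
    # any hour not matched above falls into 'Late night'
    return pd.Series(np.select(conditions, labels, default='Late night'),
                     index=hours.index)


demo_labels = list(time_of_day_vectorized([0, 6, 11, 12, 15, 16, 21, 22, 23]))
```

The result should match what `time_of_day` returns hour by hour, while avoiding a Python-level function call per row.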

In [None]:
data.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

In [None]:
data.shape

In [None]:
data.columns

# Univariate Analysis

In [None]:
plt.rcParams["figure.figsize"] = [10,6]

In [None]:
data.head()

## Column 1 and 2: 'id' and 'vendor_id' <br>
Only two vendors provided the service.

In [None]:
sns.countplot(x='vendor_id', data=data)
plt.xlabel('Vendor ID')
plt.ylabel('Count')
plt.show()

## **column 3 and column 4**: from the 'pickup_datetime' and 'dropoff_datetime' columns we created new feature columns: '**pickup_weekday**', '**dropoff_weekday**', '**pickup_weekday_num**', '**pickup_hour**', '**month**', and '**pickup_timeofday**'.

## **column 5: passenger_count** <br>
The graph shows that very few trips have 0, 7, 8, or 9 passengers. These entries are likely outliers, so we will remove trips with 0, 7, 8, or 9 passengers.


In [None]:
data.passenger_count.value_counts().plot(kind="bar")
plt.xlabel("Number of passengers")
plt.ylabel("Number of entries")
plt.title("Number of trips by passenger count")
plt.show()



## column 6 & 7: pickup_latitude and pickup_longitude <br>


In [None]:
import folium
from folium.plugins import HeatMap
from folium import plugins


In [None]:
# centre the map on New York City (same coordinates as the dropoff map)
map_NYC = folium.Map([40.80902,-73.94190],zoom_start=7)

In [None]:
'''station = data[['pickup_latitude', 'pickup_longitude']]  # HeatMap expects [lat, lng] order
# convert to (n, 2) nd-array format for heatmap
stationArr = station.values

# plot heatmap
map_NYC.add_child(plugins.HeatMap(stationArr, radius=15))
map_NYC'''

## column 8 & 9: dropoff_latitude and dropoff_longitude <br>

In [None]:
map_NYC = folium.Map([40.80902,-73.94190],zoom_start=7)

In [None]:
'''station_drop = data[['dropoff_latitude', 'dropoff_longitude']]
# convert to (n, 2) nd-array format for heatmap
stationArr_drop = station_drop.values

# plot heatmap
map_NYC.add_child(plugins.HeatMap(stationArr_drop, radius=15))
map_NYC'''

## column 10: Store and forward flag <br>
This feature takes only two values (Y and N).

In [None]:
sns.countplot(x='store_and_fwd_flag',data=data)
plt.ylabel('Count')
plt.xlabel('Store and forward flag')
plt.show()

We can see that only about 0.5% of the trip records were held in vehicle memory before being sent to the vendor.

## column 11: trip duration <br>


In [None]:
#create boxplot to check probable outliers
sns.boxplot(x=data.trip_duration)
plt.xlabel('Trip Duration')
plt.show()

In [None]:
#to check skewness of data 
sns.distplot(data['trip_duration'],norm_hist=True)
plt.title("Trip duration before normalization")
plt.show()

In [None]:
print(f" skew coefficient is {data['trip_duration'].skew()}")

The data is right-skewed. A log transformation reduces the skewness.

In [None]:
sns.distplot(np.log(data['trip_duration']))
plt.title("Trip duration after normalization")
plt.show()

In [None]:
# log-transform the trip duration and compare the skew coefficients
data_trip_duration = np.log(data['trip_duration'])
print(f" skew coefficient went from {data['trip_duration'].skew()} to {data_trip_duration.skew()}")
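To illustrate why the log transformation helps, the sketch below uses synthetic, log-normally distributed "durations" (an assumption made purely for demonstration, not the taxi data) and compares the skew coefficient before and after taking the logarithm.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed sample standing in for trip durations (illustrative only)
rng = np.random.default_rng(0)
durations = pd.Series(rng.lognormal(mean=6.5, sigma=0.8, size=10_000))

skew_before = durations.skew()         # strongly positive for a long right tail
skew_after = np.log(durations).skew()  # close to zero once the tail is compressed

print(f"skew before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

Note that `np.log` returns `-inf` for zero values, so columns that may contain zeros need to be filtered (or transformed with `np.log1p`) before the skew is computed.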

## column 12: distance <br>

In [None]:
sns.boxplot(x=data.distance)
plt.xlabel('Distance')
plt.show()

In [None]:

sns.displot(data=data, x='distance', height=7,aspect=1)
plt.title("Distance travelled per trip (before normalization)")
plt.show()

In [None]:
print(f"Skew coefficient is {data['distance'].skew()}")

In [None]:
# plot the log of strictly positive distances only, since log(0) is -inf
sns.displot(np.log(data.loc[data["distance"] > 0, "distance"]), height=7,aspect=1.5)
plt.title("Distance travelled per trip (after normalization)")
plt.show()

In [None]:
# take the log of strictly positive distances only, since log(0) is -inf
distance_log_skew = np.log(data.loc[data['distance'] > 0, 'distance']).skew()
print(f" Skew coefficient went from {data['distance'].skew()} to {distance_log_skew}")

## column 13: Average speed

In [None]:
sns.boxplot(x="avg_speed", data=data)
plt.xlabel('Average Speed')
plt.show()

In [None]:

sns.displot(data=data, x='avg_speed', height=7,aspect=1)
plt.show()

In [None]:
print(f"Skew coefficient is {data['avg_speed'].skew()}")

In [None]:
# take the log of strictly positive speeds only, since log(0) is -inf
speed_log_skew = np.log(data.loc[data['avg_speed'] > 0, 'avg_speed']).skew()
print(f" Skew coefficient went from {data['avg_speed'].skew()} to {speed_log_skew}")

## column 14 and 15: pickup_weekday and drop off weekday


In [None]:
sns.countplot(data=data, x='pickup_weekday')
plt.ylabel('Number of trip')
plt.xlabel('Weekday')
plt.title('Number of pickup per day')
plt.show()

The number of trips does not change much across the days of the week.


In [None]:
sns.countplot(data=data, x='dropoff_weekday')
plt.ylabel('Number of trip')
plt.xlabel('Weekday')
plt.title('Number of dropoff per day')
plt.show()

## column 16: pick up hour

In [None]:
sns.countplot(data=data, x='pickup_hour')
plt.ylabel('Number of trip')
plt.xlabel('Hour')
plt.title('Number of pickup per hour')
plt.show()

Most taxis are booked in the evening.

## column 17: month

In [None]:
sns.countplot(data=data, x='month')
plt.ylabel('Number of trip')
plt.xlabel('Month')
plt.title('Number of trip per month')
plt.show()

In [None]:
data.month.value_counts()

## column 18: pick up time of day

In [None]:
# bar plot for pickup time of a day
sns.countplot(data=data, x="pickup_timeofday")
plt.xlabel("Time of day")
plt.ylabel("Number of trip")
plt.title("Number of pickups by time of day")
plt.show()


# Bivariate Analysis

### Trip Duration Vs Vendor

In [None]:
sns.set_style(style='whitegrid')
plt.figure(figsize = (7,5))
sns.barplot(x=data.vendor_id, y=data.trip_duration)
plt.xlabel('Vendor ID')
plt.ylabel('Trip Duration')
plt.show()

* There does not seem to be much difference in trip duration between vendor_id 1 and vendor_id 2.

### Trip Duration Vs Store and Forward Flag

In [None]:
plt.figure(figsize = (10,5))
sns.set_style(style='white')
sns.barplot(x=data.store_and_fwd_flag, y=data.trip_duration)
plt.xlabel('Store and Forward Flag')
plt.ylabel('Duration (seconds)')
plt.show()

* Trips with store_and_fwd_flag = Y take relatively longer on average.
* From the univariate analysis, only about 0.5% of the records have store_and_fwd_flag = Y.

### Trip Duration Vs Pickup Time

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
fig.suptitle('Trip Duration Vs Pickup Time')

sns.barplot(ax=axes[0], x='pickup_hour',y='trip_duration',data=data)
axes[0].set_title('Bar graph')

sns.lineplot(ax=axes[1], x='pickup_hour',y='trip_duration',data=data)
axes[1].set_title('Line Plot')

*   Trip duration is highest between about 1 pm and 4 pm.
*   Trip duration is lowest in the morning, around 6 am.

### Trip Duration Vs Weekday

In [None]:
plt.figure(figsize = (10,5))
sns.barplot(x='pickup_weekday',y='trip_duration',data=data)
plt.ylabel('Duration (seconds)')
plt.xlabel('')
plt.show()

*   Trip duration on Wednesday is longest among all days.

### Trip Duration Vs Month

In [None]:
plt.figure(figsize = (10,5))
sns.barplot(x='month',y='trip_duration', data=data)
plt.ylabel('Duration (seconds)')
plt.xlabel('Month of Trip ')
plt.show()

* Trip duration increases gradually from January to June.

### Trip Duration Vs Distance

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
fig.suptitle('Trip Duration Vs Distance')

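# mean trip duration within 10 km distance bins (0-10, 10-20, ..., 40-50 km)
distance_bins = pd.cut(data.distance, np.arange(0,60,10))
binned = data.groupby(distance_bins, observed=False)['trip_duration'].mean().reset_index()
sns.barplot(ax=axes[0], x='distance', y='trip_duration', data=binned)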
axes[0].set_title('Bar graph')

sns.regplot(ax=axes[1],
            x='distance',
            y='trip_duration',
            data=data)
axes[1].set_title('Scatter plot')

* As the distance increases, the trip duration also increases.

### Distance Vs Hour

In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(y='distance',x='pickup_hour',data=data)
plt.ylabel('Distance')
plt.xlabel('Pickup Hour')
plt.show()

*   Trip distance is highest during the early morning hours.
*   From 8 am to 8 pm, people mostly take short trips in the range of 3 to 4 km.



### Distance vs Weekday

In [None]:
plt.figure(figsize = (10,5))
sns.barplot(x='pickup_weekday', y='distance',data=data)
plt.ylabel('Distance')
plt.xlabel('')
plt.show()

Sunday has the highest average distance, which may be due to out-of-town trips.

### Distance Vs Month

In [None]:
plt.figure(figsize = (10,5))
sns.barplot(x='month', y='distance',data=data)
plt.ylabel('Distance')
plt.xlabel('Month of trip')
plt.show()

* There is not much difference in the distance travelled in each month.

In [None]:
data.info()

# **Feature Engineering** <br>

## **Data cleaning and wrangling**

In [None]:
data.head()

In [None]:
# remove "id", the raw datetime and coordinate columns, and "pickup_hour"
data.drop(["id",'pickup_datetime',"dropoff_datetime", "dropoff_longitude", "dropoff_latitude", "pickup_longitude", "pickup_latitude","pickup_hour"],axis=1,inplace= True)

In [None]:
data.shape

In [None]:
#remove rows with passenger count value =0,7,8,9
df = data.loc[~(data['passenger_count']==0)]
df = df.loc[~(df['passenger_count'] >= 7)]

In [None]:
df.passenger_count.value_counts()

In [None]:
#remove trips lasting 4800 seconds or more, or 60 seconds or less
df = df.loc[~(df['trip_duration'] >= 4800)]
df = df.loc[~(df['trip_duration'] <= 60)]
#plot boxplot for filtered data
sns.boxplot(data=df,x='trip_duration')
plt.show()

In [None]:
df.shape

Since trips longer than 80 min (4800 s) have been removed, we can also remove trips with a distance above 100 km. The 99th percentile of the distance travelled is about 24 km.

In [None]:
#remove distance of 100 km or more and avg_speed of 50 km/h or more (as of 2015, the maximum speed limit was 48.28 km/h)
df = df.loc[~((df['distance']>=100) | (df["avg_speed"]>=50))]

In [None]:
df.shape

In [None]:
# remove rows where the distance travelled is less than 1 km (possible outliers)
df = df[df.distance >= 1]

In [None]:
df.shape
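The filtering steps above can be gathered into one function so that the same rules are easy to reuse and audit. The following is a minimal sketch under the thresholds used in this section; the function name `clean_trips` and the small `demo` frame are hypothetical. One difference to be aware of: comparisons such as `< 50` drop rows with missing values, whereas the `~(... >= 50)` form used above keeps them.

```python
import pandas as pd


def clean_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows that pass every outlier rule used in this section."""
    keep = (
        trips['passenger_count'].between(1, 6)   # drop 0 and 7+ passengers
        & (trips['trip_duration'] > 60)          # drop trips of 60 s or less
        & (trips['trip_duration'] < 4800)        # drop trips of 4800 s or more
        & (trips['distance'] >= 1)               # drop trips shorter than 1 km
        & (trips['distance'] < 100)              # drop trips of 100 km or more
        & (trips['avg_speed'] < 50)              # drop speeds of 50 km/h or more
    )
    return trips.loc[keep].copy()


# tiny illustrative frame: only the first row satisfies every rule
demo = pd.DataFrame({
    'passenger_count': [1, 0, 2, 3, 1],
    'trip_duration':   [600, 600, 30, 900, 5000],
    'distance':        [3.0, 3.0, 2.0, 0.5, 20.0],
    'avg_speed':       [18.0, 18.0, 240.0, 2.0, 14.4],
})
cleaned = clean_trips(demo)
```

Keeping all thresholds in one mask makes it harder for the comments and the code to drift apart.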

# Correlation

In [None]:
# Correlation between the numeric columns (numeric_only requires pandas >= 1.5)
plt.figure(figsize=(15,8))
correlation = df.corr(numeric_only=True)
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')
plt.show()

As we can see, <br>
"avg_speed" has a correlation of about 0.56 with distance. <br>
"trip_duration" has a correlation of about 0.78 with distance.

In [None]:
#we can also remove "pickup_weekday_num" as the "pickup_weekday" and "dropoff_weekday" information is also available
df.drop(['pickup_weekday_num'],axis=1,inplace=True)

## Check multicollinearity

In [None]:
#Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
#define function to call multicollinearity
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)
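For reference, the variance inflation factor of feature $j$ is defined from the coefficient of determination $R_j^2$ obtained when feature $j$ is regressed on all the other features:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

A value of 1 means feature $j$ is not linearly explained by the others, and the value grows without bound as $R_j^2$ approaches 1. A commonly quoted rule of thumb (a convention, not a strict threshold) treats values above roughly 5 to 10 as a sign of problematic multicollinearity.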

In [None]:
df.columns

In [None]:
df.info()

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ["vendor_id",'passenger_count','store_and_fwd_flag', 'dropoff_weekday', 'pickup_weekday','month','pickup_timeofday']]])

We also remove the avg_speed feature from the predictor columns, as it is highly correlated with distance.

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ['passenger_count','store_and_fwd_flag', 'dropoff_weekday', 'pickup_weekday','month','pickup_timeofday','avg_speed','vendor_id']]])

## One-hot encoding

In [None]:
df.columns

In [None]:
# drop "avg_speed": it is computed from the target (trip_duration), so keeping it as a predictor would leak the target
df.drop(['avg_speed'],axis=1,inplace=True)

In [None]:
# One hot encoding
final_data = pd.get_dummies(df, columns=["vendor_id",
                                         "passenger_count",
                                         "store_and_fwd_flag",
                                         "pickup_weekday",
                                         "dropoff_weekday",
                                         "month",
                                         "pickup_timeofday"])

In [None]:
final_data.head()
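As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a toy frame (hypothetical data, not the taxi dataset). It also shows the `drop_first=True` option, which drops one dummy per categorical column; this is a possible variation for linear models, where a full set of dummies is perfectly collinear with the intercept, and is not what the notebook itself does.

```python
import pandas as pd

# toy frame with one categorical and one numeric column (illustrative only)
demo = pd.DataFrame({'vendor_id': [1, 2, 1], 'distance': [1.2, 3.4, 5.6]})

# one indicator column per category, as in the notebook
full = pd.get_dummies(demo, columns=['vendor_id'])

# same encoding, but the first category of each column is dropped
reduced = pd.get_dummies(demo, columns=['vendor_id'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```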

In [None]:
final_data.shape

Now we look at the correlation heatmap between the features to choose the best ones.

In [None]:
# Correlation between the encoded features
plt.figure(figsize=(15,15))
correlation_1 = final_data.corr()
sns.heatmap(abs(correlation_1), annot=True, cmap='coolwarm')
plt.show()

# build function

In [None]:
mean_sq_error_train = []
root_mean_sq_error_train = []
r2_list_train = []
adj_r2_list_train = []

mean_sq_error_test = []
root_mean_sq_error_test = []
r2_list_test = []
adj_r2_list_test = []

In [None]:
#score matrix (MSE, RMSE, r2, Adjusted r2)
def score_matrix(y_train, y_pred_train, y_test, y_pred_test):
  print(f'**Train dataset score**')
  print("\n")
  # Train performance
  MSE_train = mean_squared_error(y_train, y_pred_train)
  print(f'Mean squared error is: {MSE_train }')
  RMSE_train = math.sqrt(mean_squared_error(y_train, y_pred_train))
  print(f'Root Mean squared error is: {RMSE_train}')
  r2_train = r2_score(y_train, y_pred_train)
  print(f'r2: {r2_train}')
  adjusted_r2_train = 1-(1-r2_score(y_train, y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
  print(f'Adjusted r2: {adjusted_r2_train}')
  print("\n")
  #test _performace
  print(f'**Test dataset score**')
  print("\n")
  MSE_test = mean_squared_error(y_test, y_pred_test)
  print(f'Mean squared error is: {MSE_test}')
  RMSE_test = math.sqrt(mean_squared_error(y_test, y_pred_test))
  print(f'Root Mean squared error is: {RMSE_test}')
  r2_test = r2_score(y_test, y_pred_test)
  print(f'r2: {r2_test}')
  adjusted_r2_test = 1-(1-r2_score(y_test, y_pred_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
  print(f'Adjusted r2: {adjusted_r2_test}')

  mean_sq_error_train.append(MSE_train)
  root_mean_sq_error_train.append(RMSE_train)
  r2_list_train.append(r2_train)
  adj_r2_list_train.append(adjusted_r2_train)

  mean_sq_error_test.append(MSE_test)
  root_mean_sq_error_test.append(RMSE_test)
  r2_list_test.append(r2_test)
  adj_r2_list_test.append(adjusted_r2_test)
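The adjusted R² used in `score_matrix` penalises R² for the number of predictors: with `n` samples and `p` features it equals `1 - (1 - R²) * (n - 1) / (n - p - 1)`. The helper below is a small standalone sketch of that formula; the name `adjusted_r2` is hypothetical and not part of the notebook.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 for a model with n_features predictors fit on n_samples rows."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# R^2 of 0.80 with 101 samples and 10 features
example = adjusted_r2(0.80, n_samples=101, n_features=10)
print(round(example, 4))
```

Passing the sample and feature counts explicitly avoids relying on the global `X_train` and `X_test` arrays inside the scoring function.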

In [None]:
import matplotlib.pyplot as plt

# Plot the validation and training data separately
def plot_loss_curves(history):
  """
  Returns separate loss curves for training and validation metrics.
  """ 
  loss = history.history['loss']
  val_loss = history.history['val_loss']

  mse = history.history['mse']
  val_mse = history.history['val_mse']

  epochs = range(len(history.history['loss']))

  # Plot loss
  plt.plot(epochs, loss, label='training_loss')
  plt.plot(epochs, val_loss, label='val_loss')
  plt.title('Loss')
  plt.xlabel('Epochs')
  plt.legend()

  # Plot MSE (a regression metric, not an accuracy)
  plt.figure()
  plt.plot(epochs, mse, label='training_mse')
  plt.plot(epochs, val_mse, label='val_mse')
  plt.title('MSE')
  plt.xlabel('Epochs')
  plt.legend();

# Build model

## Train and test split


In [None]:
# import the libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
final_data.head()

In [None]:
final_data.shape

In [None]:
y = final_data['trip_duration']
y.head()

In [None]:
# select the predictor columns (everything except the target "trip_duration")
X = final_data.loc[:,final_data.columns != 'trip_duration']

In [None]:
X.head()

In [None]:
X.shape

In [None]:
X.info()

In [None]:
#split train and test data
X_train, X_test, y_train, y_test = train_test_split( X,y , test_size = 0.2, random_state = 0) 
print(X_train.shape)
print(X_test.shape)

In [None]:
# Transforming data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Linear regression

In [None]:
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression().fit(X_train,y_train)

In [None]:
linear_regression.score(X_train,y_train)

In [None]:
linear_regression.coef_

In [None]:
linear_regression.intercept_

In [None]:
y_pred_train =linear_regression.predict(X_train)

In [None]:
y_pred = linear_regression.predict(X_test)

In [None]:
X_train.shape

In [None]:
d = pd.DataFrame({"actual":y_test,"predicted":y_pred})

d.head()

In [None]:
score_matrix(y_train=y_train, y_pred_train = y_pred_train, y_test = y_test, y_pred_test=y_pred )

# Implementing Lasso regression

In [150]:
from sklearn.linear_model import Lasso
lasso  = Lasso(alpha=0.01 , max_iter= 3000)

lasso.fit(X_train, y_train)

Lasso(alpha=0.01, max_iter=3000)

In [None]:
lasso.score(X_train, y_train)

cross validation

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
### Cross validation
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=3)
lasso_regressor.fit(X_train, y_train)
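When the features are scaled before `GridSearchCV` is run, each cross-validation fold is evaluated on data whose scaling already used that fold's rows. One hedged way to keep the scaler inside the cross-validation loop is a scikit-learn `Pipeline`. The sketch below uses synthetic data and a shortened alpha grid, both assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic regression data standing in for the taxi features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# the scaler is refit on the training part of every fold
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('lasso', Lasso(max_iter=3000)),
])

# step-name prefix 'lasso__' routes the parameter to the Lasso step
param_grid = {'lasso__alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=3)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_['lasso__alpha']
print(best_alpha, search.best_score_)
```

The same pattern would apply to the Ridge search by swapping the estimator and the parameter prefix.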

In [None]:
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

In [None]:
y_pred_train_lasso = lasso_regressor.predict(X_train)

In [None]:
y_pred_lasso = lasso_regressor.predict(X_test)

In [None]:
score_matrix(y_train=y_train, y_pred_train = y_pred_train_lasso, y_test = y_test, y_pred_test=y_pred_lasso )

# Implementing Ridge regression

In [None]:
from sklearn.linear_model import Ridge

ridge  = Ridge(alpha=0.01)

In [None]:
ridge.fit(X_train,y_train)

In [None]:
ridge.score(X_train, y_train)

In [None]:
y_pred_train_ridge = ridge.predict(X_train)

In [None]:
y_pred_ridge = ridge.predict(X_test)

In [None]:
score_matrix(y_train=y_train, y_pred_train = y_pred_train_ridge, y_test = y_test, y_pred_test=y_pred_ridge )

In [None]:
##cross validation
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)
ridge_regressor.fit(X_train,y_train)

In [None]:
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

In [None]:
#Model Prediction
#train 
y_pred_ridge_cv_train = ridge_regressor.predict(X_train)
y_pred_ridge_cv = ridge_regressor.predict(X_test)


In [None]:
score_matrix(y_train=y_train, y_pred_train = y_pred_ridge_cv_train, y_test = y_test, y_pred_test=y_pred_ridge_cv )

# Decision tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
dt_reg = DecisionTreeRegressor( max_leaf_nodes=10, random_state=0)
dt_reg.fit(X_train, y_train)

In [None]:
dt_y_predicted_train = dt_reg.predict(X_train)

In [None]:
#prediction on test set
dt_y_predicted =dt_reg.predict(X_test