# Homework week 1

## Q1. Downloading the data

We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "Yellow Taxi Trip Records".

Download the data for January and February 2022.

Data source: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
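The download step can be scripted. The sketch below is a minimal, hedged example: the `BASE_URL` prefix is an assumption about where the parquet files are hosted (check it against the data source page above), and `trip_data_url`, `local_path`, and `download` are hypothetical helper names, not part of the homework.

```python
from pathlib import Path
from urllib.request import urlretrieve

# ASSUMPTION: file host prefix; confirm it on the TLC data source page.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_url(year: int, month: int, taxi: str = "yellow") -> str:
    """Build the (assumed) URL of one monthly parquet file."""
    return f"{BASE_URL}/{taxi}_tripdata_{year}-{month:02d}.parquet"


def local_path(year: int, month: int, taxi: str = "yellow",
               data_dir: str = "./data") -> Path:
    """Where the file is stored locally, matching the paths used in this notebook."""
    return Path(data_dir) / f"{taxi}_tripdata_{year}-{month:02d}.parquet"


def download(year: int, month: int, taxi: str = "yellow",
             data_dir: str = "./data") -> Path:
    """Download one month of data unless the file is already present."""
    target = local_path(year, month, taxi, data_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(trip_data_url(year, month, taxi), target)
    return target
```

Calling `download(2022, 1)` and `download(2022, 2)` would then fetch the January and February files into `./data`, assuming the URL prefix is still valid.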

In [1]:
!python -V

Python 3.10.11


In [2]:
import pandas as pd

In [3]:
df = pd.read_parquet('./data/yellow_tripdata_2022-01.parquet')

How many columns are there?

In [4]:
df.info()
df.shape[1]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee            float64

19

>There are 19 columns.

## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes.

What's the standard deviation of the trip duration in January?

In [5]:
# Trip duration as a timedelta, then converted to minutes (vectorized)
df['duration'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
df['duration'] = df['duration'].dt.total_seconds() / 60


In [6]:
df['duration'].std()

46.44530513776802

>The standard deviation of the trip duration is 46.45 minutes.
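To make the duration computation easy to check by hand, here is a small sketch on a made-up two-row frame (the `trips` frame and its timestamps are invented for illustration; only the column names come from the dataset). It uses the vectorized `.dt.total_seconds()` accessor on the timedelta column.

```python
import pandas as pd

# Two invented trips: 12.5 minutes and 45 minutes long.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:00:00", "2022-01-01 10:15:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2022-01-01 00:12:30", "2022-01-01 11:00:00"]),
})

# Subtracting two datetime columns yields a timedelta column.
delta = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]

# Convert the timedelta to minutes as a plain float column.
trips["duration"] = delta.dt.total_seconds() / 60

# pandas computes the sample standard deviation (ddof=1) by default.
duration_std = trips["duration"].std()
```

Note that `Series.std()` in pandas uses `ddof=1`, so the result is the sample standard deviation rather than the population one.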

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you dropped the outliers?

In [7]:
df_jan = df[(df.duration >= 1) & (df.duration <= 60)]
 

In [8]:
df_jan.shape[0] / df.shape[0] * 100

98.27547930522405

>About 98% of the records remain after keeping only trips that lasted between 1 and 60 minutes.
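The inclusive filter can also be written with `Series.between`, which includes both endpoints by default. The sketch below uses an invented `toy` frame to show that durations of exactly 1 and 60 minutes are kept, and takes an explicit `.copy()` so that later column assignments on the filtered frame do not operate on a slice of the original.

```python
import pandas as pd

# Invented durations in minutes, including both boundary values.
toy = pd.DataFrame({"duration": [0.5, 1.0, 15.0, 60.0, 61.0, 300.0]})

# between() is inclusive on both ends by default,
# so this matches (duration >= 1) & (duration <= 60).
mask = toy["duration"].between(1, 60)

# Explicit copy: the filtered frame can be modified safely later on.
kept = toy[mask].copy()

# Fraction of records that survive the filter.
fraction_kept = len(kept) / len(toy)
```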

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

In [9]:
from sklearn.feature_extraction import DictVectorizer

In [10]:
categorical = ['PULocationID', 'DOLocationID']
numerical = ['trip_distance']

# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
df_jan = df_jan.copy()
df_jan[categorical] = df_jan[categorical].astype(str)


In [11]:
df_jan.dtypes

VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                     object
DOLocationID                     object
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
airport_fee                     float64
duration                        float64
dtype: object

In [12]:
train_dicts = df_jan[categorical + numerical].to_dict('records')
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)
X_train.shape

(2421440, 516)

>Dimensionality is 516 (515 one-hot location columns plus the numerical `trip_distance` column).

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model.

* Train a plain linear regression model with default parameters
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [14]:
target = 'duration'
y_train = df_jan[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

mean_squared_error(y_train, y_pred, squared=False)

7.001496179431534

>The RMSE on the training data is 7.00.
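For reference, RMSE is the square root of the mean squared difference between targets and predictions. The helper below is a minimal NumPy sketch (`rmse` is a hypothetical name, and the numbers are invented); it can also serve as a fallback, since newer scikit-learn releases deprecate the `squared` argument of `mean_squared_error`.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Invented values: errors of -2, 2 and -3 minutes.
example = rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```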

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022).

What's the RMSE on validation?

In [15]:
df_feb = pd.read_parquet('./data/yellow_tripdata_2022-02.parquet')

# Same preprocessing as for January: duration in minutes, outliers dropped
df_feb['duration'] = df_feb.tpep_dropoff_datetime - df_feb.tpep_pickup_datetime
df_feb['duration'] = df_feb['duration'].dt.total_seconds() / 60
# Copy the filtered frame so the dtype conversion below does not act on a slice
df_feb = df_feb[(df_feb.duration >= 1) & (df_feb.duration <= 60)].copy()
df_feb[categorical] = df_feb[categorical].astype(str)

In [16]:
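# Only the location IDs are passed here; `trip_distance` was part of the
# training dicts, so DictVectorizer leaves that column at zero for validation
val_dicts = df_feb[categorical].to_dict('records')
X_val = dv.transform(val_dicts)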

In [17]:
y_val = df_feb[target].values

In [18]:
y_pred = lr.predict(X_val)

In [19]:
mean_squared_error(y_val, y_pred, squared=False)

7.795617853444456

>The RMSE on validation is 7.80
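The validation step relies on calling `transform` (not `fit_transform`) on the already fitted vectorizer, so that the validation matrix has the same columns as the training matrix. The sketch below uses invented records to illustrate this; the location ID `"999"` is a made-up value that does not appear in the toy training data, and `DictVectorizer` silently ignores it.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented training and validation records.
train_records = [
    {"PULocationID": "142", "DOLocationID": "236"},
    {"PULocationID": "236", "DOLocationID": "42"},
]
val_records = [
    {"PULocationID": "142", "DOLocationID": "42"},
    {"PULocationID": "999", "DOLocationID": "236"},  # "999" unseen in training
]

dv_toy = DictVectorizer()

# Fit on training data only: this fixes the set of columns.
X_train_toy = dv_toy.fit_transform(train_records)

# Reuse the fitted vectorizer; unseen key=value pairs are dropped.
X_val_toy = dv_toy.transform(val_records)

# Number of active one-hot columns per validation row.
active_per_row = X_val_toy.toarray().sum(axis=1).tolist()
```

The second validation row has only one active column because its pickup ID was never seen during fitting. The same mechanism applies to any feature present in training but absent from the validation dictionaries: its column simply stays at zero.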