In [27]:
# !wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
# !wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2022.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [28]:
import pandas as pd

# Read the Parquet file into a DataFrame
df = pd.read_parquet('yellow_tripdata_2022-01.parquet')

# Perform operations on the DataFrame
# For example, you can print the first few rows
len(df.columns)


19

## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trips duration in January?

* 41.45
* 46.45
* 51.45
* 56.45


In [29]:
df['duration'] = df['tpep_dropoff_datetime'] -df['tpep_pickup_datetime']
df['duration'] = df.duration.dt.total_seconds() / 60
df['duration'].std()

46.44530513776499

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records left after you dropped the outliers?

* 90%
* 92%
* 95%
* 98%

In [30]:
temp = df['duration'].size

df.drop(df[ df['duration']<1 | (df['duration']>60)].index,inplace = True)

new_size = df['duration'].size
new_size/temp * 100



98.78231167999428

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715

#### I did but is not the wanted answer


In [31]:
# from sklearn.preprocessing import OneHotEncoder
# new_df = df[['PULocationID', 'DOLocationID']].astype(str)
# encoder = OneHotEncoder()
# X = encoder.fit_transform(new_df.values)
# X.shape

### right answer

In [32]:
from sklearn.feature_extraction import DictVectorizer


categorical = ['PULocationID', 'DOLocationID']
df[categorical] = df[categorical].astype(str)
train_dicts = df[categorical].to_dict(orient='records')
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)
print(f'Feature matrix size: {X_train.shape}')

Feature matrix size: (2433928, 517)


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 6.99
* 11.99
* 16.99
* 21.99

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 

What's the RMSE on validation?

* 7.79
* 12.79
* 17.79
* 22.79