## Q1. Refactoring

Before we can start covering our code with tests, we need to 
refactor it. We'll start by getting rid of all the global variables. 

* Let's create a function `main` with two parameters: `year` and
`month`.
* Move all the code (except `read_data`) inside `main`
* Make `categorical` a parameter for `read_data` and pass it inside `main`
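
A sketch of how `batch.py` could be organized after these three steps. The original script is not reproduced in this section, so treat the details as assumptions: the body of `read_data` follows the duration filter and categorical cleanup used later in this homework, `get_input_path` and `get_output_path` are hypothetical helpers, and the input file name is only a placeholder.

```python
import pandas as pd


def read_data(filename, categorical):
    """Read a parquet file and clean it; `categorical` is a parameter, not a global."""
    df = pd.read_parquet(filename)

    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    return df


def get_input_path(year, month):
    # Hypothetical placeholder: point this at wherever your input data lives.
    return f'yellow_tripdata_{year:04d}-{month:02d}.parquet'


def get_output_path(year, month):
    # Local output file, using the naming pattern suggested in this section.
    return f'taxi_type=yellow_year={year:04d}_month={month:02d}.parquet'


def main(year, month):
    # Everything that used to be a module-level global now lives here.
    categorical = ['PULocationID', 'DOLocationID']
    input_file = get_input_path(year, month)
    output_file = get_output_path(year, month)

    df = read_data(input_file, categorical)
    # Model loading, prediction, and writing the result to `output_file`
    # would follow here; they are omitted from this sketch.
    return df, output_file
```

Because nothing runs at import time any more, a test module can import these functions without triggering file reads or predictions.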

Now we need to create the "main" block from which we'll invoke
the main function. What does the `if` statement that we use for
this look like?


Hint: after refactoring, check that the code still works. Just run it, e.g. for March 2023, and see if it finishes successfully.

To make it easier to run it, you can write results to your local
filesystem. E.g. here:

```python
output_file = f'taxi_type=yellow_year={year:04d}_month={month:02d}.parquet'
```

The `if` statement should look like this:

```python
import sys

if __name__ == "__main__":
    year = int(sys.argv[1])
    month = int(sys.argv[2])
    # create_parquet is the name used here for the function the text calls `main`
    create_parquet(year, month)
```

```python
# Just check the output file
import pandas as pd

df = pd.read_parquet("./yellow_tripdata_2023-03.parquet")
df.head()
```

|   | ride_id | predicted_duration |
|---|---|---|
| 0 | 2023/03_0 | 16.245906 |
| 1 | 2023/03_1 | 26.134796 |
| 2 | 2023/03_2 | 11.884264 |
| 3 | 2023/03_3 | 11.99772 |
| 4 | 2023/03_4 | 10.234486 |


## Q2. Installing pytest

Now we need to install `pytest`:

```bash
pipenv install --dev pytest
```

Next, create a folder `tests` and create two files. One will be
the file with tests. We can name it `test_batch.py`. 

What should be the other file? 

Hint: to be able to test `batch.py`, we need to be able to
import it. Without this other file, we won't be able to do it.

You need to create an `__init__.py` file in the `tests` folder.

## Q3. Writing first unit test

Now let's cover our code with unit tests.

We'll start with the pre-processing logic inside `read_data`.

It's difficult to test right now because it first reads
the file and then performs some transformations. We need to split this
code into two parts: reading (I/O) and transformation.

So let's create a function `prepare_data` that takes in a dataframe 
(and some other parameters too) and applies some transformation to it.

(That's basically the entire `read_data` function after reading 
the parquet file)
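
One way the split could look, as a sketch. The transformation steps are the ones applied to the test data later in this section; the exact signatures (`prepare_data(df, categorical)` and `read_data(filename, categorical)`) and the defensive `df.copy()` are assumptions rather than the required solution.

```python
import pandas as pd


def prepare_data(df, categorical):
    """Pure transformation: no file access, so it is easy to unit test."""
    df = df.copy()  # avoid mutating the caller's dataframe

    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    # Keep rides that lasted between 1 and 60 minutes (inclusive).
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Missing location IDs become the string '-1'.
    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    return df


def read_data(filename, categorical):
    """I/O only: read the parquet file, then delegate to prepare_data."""
    df = pd.read_parquet(filename)
    return prepare_data(df, categorical)
```

With this split, a unit test can build a small dataframe in memory and call `prepare_data` directly, without touching the filesystem.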

Now create a test and use this as input:

```python
data = [
    (None, None, dt(1, 1), dt(1, 10)),
    (1, 1, dt(1, 2), dt(1, 10)),
    (1, None, dt(1, 2, 0), dt(1, 2, 59)),
    (3, 4, dt(1, 2, 0), dt(2, 2, 1)),      
]

columns = ['PULocationID', 'DOLocationID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime']
df = pd.DataFrame(data, columns=columns)
```

Where `dt` is a helper function:

```python
from datetime import datetime

def dt(hour, minute, second=0):
    return datetime(2023, 1, 1, hour, minute, second)
```

Define the expected output and use the assert to make sure 
that the actual dataframe matches the expected one.

Tip: When you compare two Pandas DataFrames, the result is also a DataFrame.
The same is true for Pandas Series. Also, a DataFrame could be turned into a list of dictionaries.  
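
The tip can be made concrete with a short sketch. The two dataframes below are small made-up examples, not the homework data. The sketch shows three ways to assert equality: reducing the element-wise comparison to a single boolean, comparing lists of records, and using `pd.testing.assert_frame_equal`, the helper pandas provides for tests.

```python
import pandas as pd

actual = pd.DataFrame({'PULocationID': ['-1', '1'], 'duration': [9.0, 8.0]})
expected = pd.DataFrame({'PULocationID': ['-1', '1'], 'duration': [9.0, 8.0]})

# 1. `actual == expected` is a DataFrame of booleans, and `assert` on a
#    DataFrame raises ValueError, so reduce it to a single value first.
assert (actual == expected).all().all()

# 2. Convert both sides to plain lists of dictionaries and compare those.
assert actual.to_dict(orient='records') == expected.to_dict(orient='records')

# 3. Let pandas compare; on a mismatch it raises AssertionError
#    with a message describing what differs.
pd.testing.assert_frame_equal(actual, expected)
```

The list-of-dictionaries form tends to give the most readable diff in pytest output, while `assert_frame_equal` also checks dtypes and index labels by default.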

How many rows should there be in the expected dataframe?

* 1
* 2
* 3
* 4

```python
from datetime import datetime

def dt(hour, minute, second=0):
    return datetime(2023, 1, 1, hour, minute, second)
```

```python
data = [
    (None, None, dt(1, 1), dt(1, 10)),
    (1, 1, dt(1, 2), dt(1, 10)),
    (1, None, dt(1, 2, 0), dt(1, 2, 59)),
    (3, 4, dt(1, 2, 0), dt(2, 2, 1)),
]

columns = ['PULocationID', 'DOLocationID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime']
df_test = pd.DataFrame(data, columns=columns)
df_test.to_parquet("./test.parquet")
df_test
```

|   | PULocationID | DOLocationID | tpep_pickup_datetime | tpep_dropoff_datetime |
|---|---|---|---|---|
| 0 | NaN | NaN | 2023-01-01 01:01:00 | 2023-01-01 01:10:00 |
| 1 | 1.0 | 1.0 | 2023-01-01 01:02:00 | 2023-01-01 01:10:00 |
| 2 | 1.0 | NaN | 2023-01-01 01:02:00 | 2023-01-01 01:02:59 |
| 3 | 3.0 | 4.0 | 2023-01-01 01:02:00 | 2023-01-01 02:02:01 |


```python
categorical = ['PULocationID', 'DOLocationID']

df_test['duration'] = df_test.tpep_dropoff_datetime - df_test.tpep_pickup_datetime
df_test['duration'] = df_test.duration.dt.total_seconds() / 60

df_test = df_test[(df_test.duration >= 1) & (df_test.duration <= 60)].copy()

df_test[categorical] = df_test[categorical].fillna(-1).astype('int').astype('str')
df_test
```

|   | PULocationID | DOLocationID | tpep_pickup_datetime | tpep_dropoff_datetime | duration |
|---|---|---|---|---|---|
| 0 | -1 | -1 | 2023-01-01 01:01:00 | 2023-01-01 01:10:00 | 9.0 |
| 1 | 1 | 1 | 2023-01-01 01:02:00 | 2023-01-01 01:10:00 | 8.0 |


```python
df_test.to_dict()
```

```python
{'PULocationID': {0: '-1', 1: '1'},
 'DOLocationID': {0: '-1', 1: '1'},
 'tpep_pickup_datetime': {0: Timestamp('2023-01-01 01:01:00'),
  1: Timestamp('2023-01-01 01:02:00')},
 'tpep_dropoff_datetime': {0: Timestamp('2023-01-01 01:10:00'),
  1: Timestamp('2023-01-01 01:10:00')},
 'duration': {0: 9.0, 1: 8.0}}
```

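There are 2 rows in the expected dataframe: the third ride lasts 59 seconds (under 1 minute) and the fourth lasts 60 minutes and 1 second (over 60 minutes), so both are filtered out.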

```python
temp = df_test.copy()
temp.at[0, 'PULocationID'] = "hello"
temp
```

|   | PULocationID | DOLocationID | tpep_pickup_datetime | tpep_dropoff_datetime | duration |
|---|---|---|---|---|---|
| 0 | hello | -1 | 2023-01-01 01:01:00 | 2023-01-01 01:10:00 | 9.0 |
| 1 | 1 | 1 | 2023-01-01 01:02:00 | 2023-01-01 01:10:00 | 8.0 |


```python
import pickle

year_test = 2023
month_test = 1

with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)

df_test['ride_id'] = f'{year_test:04d}/{month_test:02d}_' + df_test.index.astype('str')

dicts = df_test[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = lr.predict(X_val)

df_result = pd.DataFrame()
df_result['ride_id'] = df_test['ride_id']
df_result['predicted_duration'] = y_pred
df_result
```

|   | ride_id | predicted_duration |
|---|---|---|
| 0 | 2023/01_0 | 23.197149 |
| 1 | 2023/01_1 | 13.080101 |


```python
df_result.to_dict()
```

```python
{'ride_id': {0: '2023/01_0', 1: '2023/01_1'},
 'predicted_duration': {0: 23.19714924577506, 1: 13.08010120625567}}
```