# Best practices for development using Notebooks

This notebook demonstrates some best practices to enable productive development using Jupyter notebooks.

## Load some data

We will use some public data from BigQuery as an example.

In [1]:
%load_ext google.cloud.bigquery

In [2]:
%%bigquery df

SELECT * FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` LIMIT 1000

Query complete after 0.07s: 100%|█████████████| 2/2 [00:00<00:00, 583.07query/s]
Downloading: 100%|███████████████████████| 1000/1000 [00:02<00:00, 379.98rows/s]


In [3]:
df.head()

| | trip_id | subscriber_type | bike_id | bike_type | start_time | start_station_id | start_station_name | end_station_id | end_station_name | duration_minutes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4523505 | Local365 | 452 | classic | 2015-04-17 10:39:33+00:00 | 2540 | 17th/Guadalupe | | Ready for deployment | 6 |
| 1 | 4519813 | Local365 | 188 | classic | 2015-04-16 20:30:59+00:00 | 2568 | East 11th/Victory Grill | 2569.0 | East 11th/San Marcos | 3 |
| 2 | 9225085 | Local30 | 550 | classic | 2016-03-21 15:31:42+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 12 |
| 3 | 8956930 | Local365 | 107 | classic | 2016-03-02 17:50:51+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 11 |
| 4 | 9073016 | Local365 | 984 | classic | 2016-03-11 15:01:42+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 11 |


## Feature engineering in the notebook

Notebooks are great for experimentation, so this is a natural place to start trying things out. As an example, we add a new "day of week" feature, stored as a string so that our model treats it as a categorical feature rather than a number. Pandas encodes the days as 0 (Monday) through 6 (Sunday).

In [8]:
df['start_day_of_week'] = df.start_time.dt.dayofweek.apply(str)

In [9]:
df.head()

| | trip_id | subscriber_type | bike_id | bike_type | start_time | start_station_id | start_station_name | end_station_id | end_station_name | duration_minutes | start_day_of_week |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4523505 | Local365 | 452 | classic | 2015-04-17 10:39:33+00:00 | 2540 | 17th/Guadalupe | | Ready for deployment | 6 | 4 |
| 1 | 4519813 | Local365 | 188 | classic | 2015-04-16 20:30:59+00:00 | 2568 | East 11th/Victory Grill | 2569.0 | East 11th/San Marcos | 3 | 3 |
| 2 | 9225085 | Local30 | 550 | classic | 2016-03-21 15:31:42+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 12 | 0 |
| 3 | 8956930 | Local365 | 107 | classic | 2016-03-02 17:50:51+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 11 | 2 |
| 4 | 9073016 | Local365 | 984 | classic | 2016-03-11 15:01:42+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 11 | 4 |


In [12]:
df.dtypes

trip_id                            object
subscriber_type                    object
bike_id                            object
bike_type                          object
start_time            datetime64[ns, UTC]
start_station_id                    int64
start_station_name                 object
end_station_id                     object
end_station_name                   object
duration_minutes                    int64
start_day_of_week                  object
dtype: object

OK, done! We have defined a new feature.

## Best practice step 1: Move the code into a function

Moving your code into a function will make it more modular and easier to reuse in different parts of your notebook. 

In [14]:
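import pandas as pd

def get_day_of_week_feature(df: pd.DataFrame) -> pd.Series:
    """Return the start day of week (0 = Monday) as strings, for use as a categorical feature."""
    return df.start_time.dt.dayofweek.apply(str)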


In [15]:
# Try it out
df['start_day_of_week_fun'] = get_day_of_week_feature(df)

In [16]:
df.head()

| | trip_id | subscriber_type | bike_id | bike_type | start_time | start_station_id | start_station_name | end_station_id | end_station_name | duration_minutes | start_day_of_week | start_day_of_week_fun |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4523505 | Local365 | 452 | classic | 2015-04-17 10:39:33+00:00 | 2540 | 17th/Guadalupe | | Ready for deployment | 6 | 4 | 4 |
| 1 | 4519813 | Local365 | 188 | classic | 2015-04-16 20:30:59+00:00 | 2568 | East 11th/Victory Grill | 2569.0 | East 11th/San Marcos | 3 | 3 | 3 |
| 2 | 9225085 | Local30 | 550 | classic | 2016-03-21 15:31:42+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 12 | 0 | 0 |
| 3 | 8956930 | Local365 | 107 | classic | 2016-03-02 17:50:51+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 11 | 2 | 2 |
| 4 | 9073016 | Local365 | 984 | classic | 2016-03-11 15:01:42+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 11 | 4 | 4 |
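
Another benefit of a function is that it can be checked in isolation. The sketch below repeats the function definition so that it is self-contained, then runs it on a tiny hand-made frame. The `sample` frame and the variable names are illustrative assumptions, not part of the original notebook; the two timestamps are taken from rows 0 and 2 above.

```python
import pandas as pd


def get_day_of_week_feature(df: pd.DataFrame) -> pd.Series:
    # Same logic as the notebook function: day of week as a string label.
    return df.start_time.dt.dayofweek.apply(str)


# Hypothetical two-row frame: a Friday and a Monday (pandas uses Monday = 0).
sample = pd.DataFrame(
    {
        "start_time": pd.to_datetime(
            ["2015-04-17 10:39:33", "2016-03-21 15:31:42"], utc=True
        )
    }
)

result = get_day_of_week_feature(sample)

# The feature should be string labels, not integers.
assert result.tolist() == ["4", "0"]
```

A check like this is easy to move into a test file later, once the function lives outside the notebook.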


## Best practice step 2: Move your code into a source code file

Keeping code in notebooks makes it difficult to track changes, because a notebook file stores cell outputs and execution counts alongside the source code. Simply running a cell counts as a "change": running a single cell can alter many lines of the notebook file, which makes it very hard to see what actually changed in the code.

If the code is in a separate source code file, then this file will contain only the text of the source code, and the file will only change when we change the source code.
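
To make the problem concrete, here is a minimal sketch that mimics how a notebook is stored on disk. A `.ipynb` file is JSON in which each code cell carries its `source`, its `execution_count`, and its `outputs`. The `make_notebook` helper and the simplified dictionaries are assumptions for illustration only; a real notebook contains more metadata.

```python
import json


def make_notebook(execution_count, outputs):
    # Hypothetical, stripped-down notebook structure with a single code cell.
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": outputs,
                "source": ["df.head()"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


# The same cell before and after it has been re-run: the code is untouched.
before = make_notebook(3, [])
after = make_notebook(
    23, [{"output_type": "stream", "name": "stdout", "text": ["...\n"]}]
)

same_source = before["cells"][0]["source"] == after["cells"][0]["source"]
same_file = json.dumps(before, indent=1) == json.dumps(after, indent=1)

print("source unchanged:", same_source)
print("file unchanged:", same_file)
```

The source code is identical in both versions, yet the serialized files differ, which is what a line-based diff would report.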

Move the function into a source file, for example `src/features.py`:

In [17]:
!cat src/features.py

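import pandas as pd

def get_day_of_week_feature(df: pd.DataFrame) -> pd.Series:
    """Return the start day of week (0 = Monday) as strings, for use as a categorical feature."""
    return df.start_time.dt.dayofweek.apply(str)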


Now we want to use this function in our notebook to continue our work. We need to tell the Python kernel where the source code is:

In [18]:
import sys
# Make modules under src/ importable (the path is relative to the kernel's working directory)
sys.path.insert(1, 'src')

We also need to tell it that the source code can change. By default, once a module has been imported, Python caches it, and later `import` statements reuse the cached copy instead of re-reading the file. While we are working on the code, we want our edits to be picked up without restarting the kernel, so we change this default behaviour:

In [19]:
%load_ext autoreload
# Mode 2: reload all imported modules before each cell is executed
%autoreload 2
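
For comparison, the following sketch shows the caching behaviour in plain Python and the manual alternative, `importlib.reload`. The module name `demo_features` and its `label` function are hypothetical and exist only for this demonstration; the module is written to a temporary directory so that nothing in `src/` is touched.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical throwaway module, created only for this demonstration.
    module_path = Path(tmp) / "demo_features.py"
    module_path.write_text("def label():\n    return 'v1'\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()

    import demo_features
    first = demo_features.label()

    # Edit the file on disk. The new text has a different length, so a stale
    # bytecode cache cannot be mistaken for the current source.
    module_path.write_text("def label():\n    return 'version two'\n")

    # A second import statement reuses the already loaded module.
    import demo_features
    cached = demo_features.label()

    # An explicit reload re-executes the file and picks up the edit.
    importlib.reload(demo_features)
    reloaded = demo_features.label()

    # Clean up so the temporary directory is not left on the import path.
    sys.path.remove(tmp)
    del sys.modules["demo_features"]

print(first, cached, reloaded)
```

With `autoreload` enabled, this reload step happens automatically, so there is no need to call `importlib.reload` by hand after every edit.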

Now we can load our function from this file (I'm loading it as `get_dow_feature_from_file` so that you can see I'm using the code loaded from the file):

In [21]:
from features import get_day_of_week_feature as get_dow_feature_from_file

In [22]:
df['start_day_of_week_file'] = get_dow_feature_from_file(df)

In [23]:
df.head()

| | trip_id | subscriber_type | bike_id | bike_type | start_time | start_station_id | start_station_name | end_station_id | end_station_name | duration_minutes | start_day_of_week | start_day_of_week_fun | start_day_of_week_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4523505 | Local365 | 452 | classic | 2015-04-17 10:39:33+00:00 | 2540 | 17th/Guadalupe | | Ready for deployment | 6 | 4 | 4 | 4 |
| 1 | 4519813 | Local365 | 188 | classic | 2015-04-16 20:30:59+00:00 | 2568 | East 11th/Victory Grill | 2569.0 | East 11th/San Marcos | 3 | 3 | 3 | 3 |
| 2 | 9225085 | Local30 | 550 | classic | 2016-03-21 15:31:42+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 12 | 0 | 0 | 0 |
| 3 | 8956930 | Local365 | 107 | classic | 2016-03-02 17:50:51+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 11 | 2 | 2 | 2 |
| 4 | 9073016 | Local365 | 984 | classic | 2016-03-11 15:01:42+00:00 | 3381 | East 7th & Pleasant Valley | 2536.0 | Waller & 6th St. | 11 | 4 | 4 | 4 |


## Best practice step 3: Bring your code under version control

Add your file to git:

```
git add src/features.py
git commit -m "Add feature engineering code"
git push
```