# Convert a notebook project to Kedro

This page describes how to convert your notebook project to use Kedro in increments. 

It starts with a version of the spaceflights example that does NOT use Kedro and that you can run inside a notebook. It then converts portions of the code to use Kedro features while keeping the example runnable from within a notebook.

## Spaceflights in a notebook

If you are unfamiliar with the spaceflights example, it is used to introduce the basics of Kedro in a tutorial that runs exclusively as a Kedro project, that is, as a set of `.py` files rather than in a notebook. The premise is as follows:

_It is 2160, and the space tourism industry is booming. Globally, thousands of space shuttle companies take tourists to the Moon and back. You have been able to source data that lists the amenities offered in each space shuttle, customer reviews, and company information._

_Project: You want to construct a model that predicts the price for each trip to the Moon and the corresponding return flight._

Run the cells in this section to experiment with the spaceflights example within a notebook. 
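
The cells below read three files from a `data` directory next to the notebook. As a minimal sketch, assuming that layout, the following check reports which files are missing before you run the rest of the section. The helper name `find_missing_files` is hypothetical and is not part of Kedro or the spaceflights tutorial.

```python
from pathlib import Path

# File names taken from the cells in this section; the helper itself is illustrative.
REQUIRED_FILES = ("companies.csv", "reviews.csv", "shuttles.xlsx")


def find_missing_files(data_dir="data", required=REQUIRED_FILES):
    """Return the names of required files that are not present in data_dir."""
    data_path = Path(data_dir)
    return [name for name in required if not (data_path / name).is_file()]


missing = find_missing_files()
if missing:
    print(f"Missing from the data directory: {', '.join(missing)}")
else:
    print("All three data files are in place.")
```

If the check reports missing files, add them to the `data` directory before continuing.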

In [None]:
import pandas as pd

In [None]:
# This cell expects a `data` directory next to the notebook that contains
# three downloaded files: companies.csv, reviews.csv and shuttles.xlsx.

companies = pd.read_csv('data/companies.csv')
reviews = pd.read_csv('data/reviews.csv')
shuttles = pd.read_excel('data/shuttles.xlsx', engine='openpyxl')

In [None]:
companies.head()

In [None]:
reviews.head()

In [None]:
shuttles.head()

In [None]:
companies["iata_approved"] = companies["iata_approved"] == "t"
companies["company_rating"] = companies["company_rating"].str.replace("%", "").astype(float)

companies.head()

In [None]:
shuttles["d_check_complete"] = shuttles["d_check_complete"] == "t"
shuttles["moon_clearance_complete"] = shuttles["moon_clearance_complete"] == "t"
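# regex=False treats "$" as a literal character rather than a regex end-of-string anchor
shuttles["price"] = shuttles["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)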

shuttles.head()

In [None]:
rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
rated_shuttles.head()

In [None]:
model_input_table = rated_shuttles.merge(companies, left_on="company_id", right_on="id")
model_input_table = model_input_table.dropna()
model_input_table.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = model_input_table[[
    "engines",
    "passenger_capacity",
    "crew",
    "d_check_complete",
    "moon_clearance_complete",
    "iata_approved",
    "company_rating",
    "review_scores_rating",
]]
y = model_input_table["price"]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

In [None]:
len(X_train), len(X_test)

In [None]:
len(y_train), len(y_test)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()
model

In [None]:
model.fit(X_train, y_train)

In [None]:
model.predict(X_test)

In [None]:
from sklearn.metrics import r2_score

In [None]:
y_pred = model.predict(X_test)

In [None]:
r2_score(y_test, y_pred)

## Use Kedro for data management

Even if you’re not ready for a full Kedro project, you can still take advantage of its data handling solution in your existing project from within a notebook. This section shows you how.

Kedro’s Data Catalog is a registry of all data sources available for use by the project, and offers a separate place to declare details of the datasets your projects use. Kedro provides [built-in datasets for numerous file types and file systems](https://docs.kedro.org/en/stable/kedro_datasets.html), so you don’t have to write any of the logic for reading or writing data. 

Kedro offers a range of datasets, including CSV, Excel, Parquet, Feather, HDF5, JSON, Pickle, SQL Tables, SQL Queries, Spark DataFrames and more. These datasets are backed by the APIs of pandas, Spark, NetworkX, Matplotlib, YAML and more. Kedro relies on [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to read and save data from a variety of data stores, including local file systems, network file systems, cloud object stores, and Hadoop. You can pass arguments to the load and save operations, and use versioning and credentials for data access.

To start using the Data Catalog, create a `catalog.yml` in the same directory as your notebook to define datasets that can be used when writing your functions. For example:

```yml
companies:
  type: pandas.CSVDataSet
  filepath: data/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/reviews.csv

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/shuttles.xlsx
```
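
If you would rather create the file from the notebook than by hand, a minimal sketch using only the standard library could look like this. It assumes the notebook's working directory is where `catalog.yml` should live; the `write_catalog` helper is hypothetical and simply writes out the same three entries shown above.

```python
from pathlib import Path

# The same three dataset entries as the YAML shown above.
CATALOG_YAML = """\
companies:
  type: pandas.CSVDataSet
  filepath: data/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/reviews.csv

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/shuttles.xlsx
"""


def write_catalog(path="catalog.yml", overwrite=False):
    """Write the catalog to path; leave an existing file alone unless overwrite=True."""
    target = Path(path)
    if target.exists() and not overwrite:
        return False
    target.write_text(CATALOG_YAML, encoding="utf-8")
    return True


created = write_catalog()
print("Created catalog.yml" if created else "catalog.yml already exists, left unchanged")
```

The helper does not overwrite by default, so re-running the cell will not discard edits you have made to the file.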

Then by using Kedro to load the `catalog.yml` file, you can reference the Data Catalog in your Jupyter notebook.

In [49]:
# Using Kedro's DataCatalog

from kedro.io import DataCatalog

import yaml
# load the configuration file 
with open("catalog.yml") as f:
    conf_catalog = yaml.safe_load(f)

# Create the DataCatalog instance from the configuration
catalog = DataCatalog.from_config(conf_catalog)

# Load the datasets from the catalog
companies = catalog.load("companies")
reviews = catalog.load("reviews")
shuttles = catalog.load("shuttles")

In [50]:
companies.head()

|   | id | company_rating | company_location | total_fleet_count | iata_approved |
|---|---|---|---|---|---|
| 0 | 35029 | 100% | Niue | 4.0 | f |
| 1 | 30292 | 67% | Anguilla | 6.0 | f |
| 2 | 19032 | 67% | Russian Federation | 4.0 | f |
| 3 | 8238 | 91% | Barbados | 15.0 | t |
| 4 | 30342 |  | Sao Tome and Principe | 2.0 | t |


In [51]:
reviews.head()

|   | shuttle_id | review_scores_rating | review_scores_comfort | review_scores_amenities | review_scores_trip | review_scores_crew | review_scores_location | review_scores_price | number_of_reviews | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63561 | 97.0 | 10.0 | 9.0 | 10.0 | 10.0 | 9.0 | 10.0 | 133 | 1.65 |
| 1 | 36260 | 90.0 | 8.0 | 9.0 | 10.0 | 9.0 | 9.0 | 9.0 | 3 | 0.09 |
| 2 | 57015 | 95.0 | 9.0 | 10.0 | 9.0 | 10.0 | 9.0 | 9.0 | 14 | 0.14 |
| 3 | 14035 | 93.0 | 10.0 | 9.0 | 9.0 | 9.0 | 10.0 | 9.0 | 39 | 0.42 |
| 4 | 10036 | 98.0 | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | 9.0 | 92 | 0.94 |


In [52]:
shuttles.head()

|   | id | shuttle_location | shuttle_type | engine_type | engine_vendor | engines | passenger_capacity | cancellation_policy | crew | d_check_complete | moon_clearance_complete | price | company_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63561 | Niue | Type V5 | Quantum | ThetaBase Services | 1.0 | 2 | strict | 1.0 | f | f | $1,325.0 | 35029 |
| 1 | 36260 | Anguilla | Type V5 | Quantum | ThetaBase Services | 1.0 | 2 | strict | 1.0 | t | f | $1,780.0 | 30292 |
| 2 | 57015 | Russian Federation | Type V5 | Quantum | ThetaBase Services | 1.0 | 2 | moderate | 0.0 | f | f | $1,715.0 | 19032 |
| 3 | 14035 | Barbados | Type V5 | Plasma | ThetaBase Services | 3.0 | 6 | strict | 3.0 | f | f | $4,770.0 | 8238 |
| 4 | 10036 | Sao Tome and Principe | Type V2 | Plasma | ThetaBase Services | 2.0 | 4 | strict | 2.0 | f | f | $2,820.0 | 30342 |


The rest of the spaceflights notebook code from above can now run as previously. 

## Use Kedro for configuration

When writing exploratory code, it’s tempting to hard-code values to save time, but doing so makes the code harder to maintain in the longer term. The example code above calls `sklearn.model_selection.train_test_split()`, passing in the features and target taken from the model input table and returning the train and test datasets. The values supplied to `test_size` and `random_state` are hard-coded.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
```

[Good software engineering practice](https://towardsdatascience.com/five-software-engineering-principles-for-collaborative-data-science-ab26667a311) suggests that we extract *‘magic numbers’* into named constants. These are sometimes defined at the top of a file, or stored outside the code in a utility or configuration file, in a format such as YAML.

```yml
# parameters.yml

model_options:
  test_size: 0.3
  random_state: 3
```  

By loading `parameters.yml`, you can reference the values from the notebook code.
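
As a minimal sketch, assuming `parameters.yml` sits in the same directory as the notebook and that PyYAML is installed (it is already imported above to read `catalog.yml`), the values could be read like this. The `load_model_options` helper is hypothetical and is not a Kedro API.

```python
from pathlib import Path

import yaml  # PyYAML, the same package used above to read catalog.yml


def load_model_options(path="parameters.yml"):
    """Read the parameters file and return its `model_options` mapping."""
    with open(path, encoding="utf-8") as f:
        parameters = yaml.safe_load(f)
    return parameters["model_options"]


# Only read the file if it exists, so the cell is safe to run before it is created.
if Path("parameters.yml").exists():
    model_options = load_model_options()
    test_size = model_options["test_size"]
    random_state = model_options["random_state"]
    print(test_size, random_state)
```

The call to `train_test_split()` can then take `test_size=test_size` and `random_state=random_state` in place of the literal values, so a change to `parameters.yml` is picked up without editing the notebook code.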

