# Palmer Penguins dataset
## kedro_example

### Loading your data

1. If you are using `kedro jupyter notebook` and the name shown in the top-right corner of the screen matches the name of your root folder (in this case, palmer_penguins), the data catalog and context can be accessed directly.

In [2]:
# this snippet suppresses all warning messages in the notebook output
import warnings
warnings.filterwarnings('ignore')

In [5]:
# contains info about the current project
context

<palmer_penguins.run.ProjectContext at 0x120d8b7d0>

In [6]:
# returns a dictionary of all the pipelines, with their nodes, available in your context
context.pipelines

{'de': Pipeline([
 Node(make_scatter_plot, 'size_penguins', 'penguins_scatter_plot@matplotlib', None),
 Node(split_data, ['size_penguins', 'params:example_test_data_ratio'], {'train_x': 'example_train_x', 'train_y': 'example_train_y', 'test_x': 'example_test_x', 'test_y': 'example_test_y'}, None),
 Node(<lambda>, 'penguins_scatter_plot@byteform', 'penguins_scatter_plot_base64', None)
 ]),
 'ds': Pipeline([
 Node(train_model, ['example_train_x', 'example_train_y', 'parameters'], 'example_model', None),
 Node(predict, {'model': 'example_model', 'test_x': 'example_test_x'}, 'example_predictions', None),
 Node(report_accuracy, ['example_predictions', 'example_test_y'], None, None)
 ]),
 '__default__': Pipeline([
 Node(make_scatter_plot, 'size_penguins', 'penguins_scatter_plot@matplotlib', None),
 Node(split_data, ['size_penguins', 'params:example_test_data_ratio'], {'train_x': 'example_train_x', 'train_y': 'example_train_y', 'test_x': 'example_test_x', 'test_y': 'example_test_y'}, None),
 

In [None]:
# the catalog (a.k.a. data connector) contains the datasets for the project
catalog

For details on kedro context and data catalog go here [add links].

In [None]:
# lists all the datasets registered in the data catalog
catalog.list()

2. If you are using a <b>regular Jupyter</b> notebook, the code snippet below will access the context and data catalog from your Kedro project.

In [None]:
# #only run if NOT USING `kedro jupyter notebook`
# from kedro.context import load_context
# #loads the context and catalog from the kedro project
# context = load_context('../')
# catalog = context.catalog

### The dataframe

In [None]:
# loads the size_penguins dataset from the data catalog
df = catalog.load('size_penguins')

In [None]:
df.head()

### Understanding the data

In [None]:
f"The size_penguins dataset has {df.shape[0]} rows and {df.shape[1]} columns"

In [None]:
# prints a detailed summary of the dataset
df.describe(include='all')

### Data Analysis

"Covariance indicates the direction of the linear relationship between variables." ([source](https://towardsdatascience.com/let-us-understand-the-correlation-matrix-and-covariance-matrix-d42e6b643c22#78e7))

In [None]:
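print('Covariance:')
# restrict to numeric columns; recent pandas versions raise an error on text columns such as species
df.select_dtypes(include='number').cov()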
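To make the quoted definition concrete, the sketch below computes a sample covariance by hand in plain Python. The measurements are made-up illustration values, not rows from the size_penguins dataset, and `sample_covariance` is a hypothetical helper. It divides by n - 1, which is also the default normalisation in pandas.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # sum the products of the paired deviations from each mean
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# made-up values, for illustration only
flipper_mm = [180.0, 190.0, 200.0, 210.0]
mass_g = [3500.0, 3800.0, 4300.0, 4800.0]
falling = [4.0, 3.0, 2.0, 1.0]

cov_up = sample_covariance(flipper_mm, mass_g)     # positive: the values rise together
cov_down = sample_covariance(flipper_mm, falling)  # negative: one rises while the other falls
print(cov_up, cov_down)
```

Only the sign is easy to interpret here: the magnitude depends on the units of both variables, which is why the notebook looks at correlation next.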

"Correlation measures both the strength and direction of the linear relationship between two variables." ([source](https://towardsdatascience.com/let-us-understand-the-correlation-matrix-and-covariance-matrix-d42e6b643c22#78e7))

In [None]:
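print('Correlation:')
# restrict to numeric columns; recent pandas versions raise an error on text columns such as species
df.select_dtypes(include='number').corr()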
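The sketch below shows, in plain Python, why correlation also expresses strength: it is the covariance divided by the spread of both variables, so it always lies between -1 and 1 and does not change when the units change. The values are made up for illustration, and `pearson_correlation` is a hypothetical helper implementing the Pearson coefficient (the pandas default method).

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # the 1 / (n - 1) factors of covariance and variances cancel out
    co_deviation = sum(a * b for a, b in zip(dx, dy))
    spread_x = math.sqrt(sum(a * a for a in dx))
    spread_y = math.sqrt(sum(b * b for b in dy))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return co_deviation / (spread_x * spread_y)


# made-up values, for illustration only
flipper_mm = [180.0, 190.0, 200.0, 210.0]
mass_g = [3500.0, 3800.0, 4300.0, 4800.0]
mass_kg = [m / 1000 for m in mass_g]  # same measurements in different units

r_grams = pearson_correlation(flipper_mm, mass_g)
r_kilograms = pearson_correlation(flipper_mm, mass_kg)
print(r_grams, r_kilograms)  # the two values agree up to floating-point rounding
```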

Count the missing values in each column:

In [None]:
df.isnull().sum()
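Once the missing values are counted, the next step is usually to decide how to handle them. The sketch below uses a small made-up DataFrame (only the column names are borrowed from this notebook) to keep just the columns that have gaps and then drop the incomplete rows. Dropping rows is one option among several and is an assumption here, not a step taken in this notebook.

```python
import pandas as pd

# toy data, for illustration only
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
        "culmen_length_mm": [39.1, None, 46.5, 38.9],
        "body_mass_g": [3750.0, 5000.0, None, None],
    }
)

# number of missing values in every column
missing_per_column = toy.isnull().sum()

# keep only the columns that actually have gaps
columns_with_missing = missing_per_column[missing_per_column > 0]

# one possible treatment: drop every row that has at least one missing value
complete_rows = toy.dropna()

print(columns_with_missing)
print(f"{len(complete_rows)} of {len(toy)} rows are complete")
```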

### Simple data visualisation

In [None]:
print('Number of samples per species:')
df['species'].value_counts()

#### Barplot

In [None]:
# barplot > find out how ;)
# df['species'].value_counts().plot(kind='bar')

#### Scatter plot

A scatter plot shows how two features relate to each other. It helps to identify the features that separate the species well and are therefore useful for a classification model.

In [None]:
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots()

In [None]:
df.species.unique()

Scatter plot of culmen length against culmen depth, with one colour per penguin species

In [None]:
df[df["species"] == "Adelie"].plot.scatter(x="culmen_length_mm", y="culmen_depth_mm", label="Adelie", color="magenta", ax=ax)

In [None]:
df[df["species"] == "Chinstrap"].plot.scatter(x="culmen_length_mm", y="culmen_depth_mm", label="Chinstrap", color="blue", ax=ax)

In [None]:
df[df["species"] == "Gentoo"].plot.scatter(x="culmen_length_mm", y="culmen_depth_mm", label="Gentoo", color="green", ax=ax)

In [None]:
fig.set_size_inches(15,15)

In [None]:
# not necessary when translating functions to kedro nodes
fig.savefig("scatter_plot_species")
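The three per-species cells above repeat the same call with a different label and colour. The sketch below folds them into a single function that returns the figure, which is roughly the shape a function needs in order to be used as a Kedro node. The name `plot_species_scatter`, the colour mapping constant, and the toy data are assumptions for illustration; this is not the project's actual `make_scatter_plot` implementation.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# same colours as the cells above
SPECIES_COLORS = {"Adelie": "magenta", "Chinstrap": "blue", "Gentoo": "green"}


def plot_species_scatter(df):
    """Draw culmen length against culmen depth, one colour per species."""
    fig, ax = plt.subplots(figsize=(15, 15))
    for species, group in df.groupby("species"):
        group.plot.scatter(
            x="culmen_length_mm",
            y="culmen_depth_mm",
            label=species,
            color=SPECIES_COLORS.get(species, "grey"),  # fallback for unknown species
            ax=ax,
        )
    return fig


# toy data, for illustration only
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
        "culmen_length_mm": [39.1, 38.5, 46.5, 50.0, 46.1, 49.0],
        "culmen_depth_mm": [18.7, 17.9, 17.9, 19.5, 13.2, 15.0],
    }
)
fig = plot_species_scatter(toy)
```

Returning the figure instead of calling `fig.savefig` leaves the decision about where to store the plot to the caller.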

#### Heat Map

Heat map showing the correlation between flipper_length_mm and body_mass_g (the longer the flipper, the bigger the bird).
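The notebook does not include the code for this heat map. One possible sketch, using matplotlib's `imshow` on the correlation matrix of the numeric columns, is shown below. The toy values are made up, and `plot_correlation_heatmap` is a hypothetical helper rather than part of the project.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_correlation_heatmap(df):
    """Draw the Pearson correlation matrix of the numeric columns as a heat map."""
    corr = df.select_dtypes(include="number").corr()
    labels = list(corr.columns)

    fig, ax = plt.subplots()
    # fix the colour scale to the full correlation range so plots are comparable
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    return fig, corr


# toy data, for illustration only
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "flipper_length_mm": [180.0, 190.0, 200.0, 210.0],
        "body_mass_g": [3500.0, 3800.0, 4300.0, 4800.0],
    }
)
fig, corr = plot_correlation_heatmap(toy)
print(corr)
```

With the real data, the same call would be `plot_correlation_heatmap(df)`, optionally after selecting only the flipper_length_mm and body_mass_g columns.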

Footnote:  
For more information on how to use IPython and Jupyter notebooks/Labs with Kedro follow this [link](https://kedro.readthedocs.io/en/stable/10_tools_integration/02_ipython.html "Kedro docs").