## In this notebook

- Data preparation: clean the raw Palmer penguins CSV and save a staged copy.
- Links:
    - [data description](https://allisonhorst.github.io/palmerpenguins/)
    - [data source](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data)

In [1]:
from pathlib import Path

import pandas as pd
import polars as pl

In [2]:
DATA_PATH_RAW = Path("/usr/src/app/data/palmer_penguins_raw.csv")
DATA_PATH_STAGE = Path("/usr/src/app/data/palmer_penguins.csv")

## Prepare data

In [3]:
df = (
    pd
    .read_csv(DATA_PATH_RAW)
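    .rename(
        # Normalize headers: drop "(o/oo)" and parentheses, then lower-case with underscores.
        # A callable avoids reading the CSV a second time just to list its columns.
        columns=lambda col: (
            col
            .replace("(o/oo)", "")
            .replace("(", "")
            .replace(")", "")
            .strip()
            .replace(" ", "_")
            .lower()
        )
    )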
    .rename(columns={
        "individual_id": "pinguin_id",
        "studyname": "study_name",
        "culmen_length_mm": "bill_length",
        "culmen_depth_mm": "bill_depth",
        "flipper_length_mm": "flipper_len",
    })
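    .assign(
        # Keep only the first word of the species name, lower-cased (e.g. "adelie").
        species=lambda _df: _df.species.str.lower().str.split().str[0],
        # Parse egg dates (month/day/two-digit year) and store them as ISO date strings.
        date=lambda _df: pd.to_datetime(_df.date_egg, format="%m/%d/%y").astype(str),
    )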
    .loc[:, [
        "pinguin_id", 
        "date", 
        "species", 
        # "island", 
        "bill_length", 
        "bill_depth",
        # "flipper_len",
        # "body_mass_g",
        "sex",
    ]]
)

df

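|     | pinguin_id | date       | species | bill_length | bill_depth | sex    |
| --- | ---------- | ---------- | ------- | ----------- | ---------- | ------ |
| 0   | N1A1       | 2007-11-11 | adelie  | 39.1        | 18.7       | MALE   |
| 1   | N1A2       | 2007-11-11 | adelie  | 39.5        | 17.4       | FEMALE |
| 2   | N2A1       | 2007-11-16 | adelie  | 40.3        | 18.0       | FEMALE |
| 3   | N2A2       | 2007-11-16 | adelie  | NaN         | NaN        | NaN    |
| 4   | N3A1       | 2007-11-16 | adelie  | 36.7        | 19.3       | FEMALE |
| ... | ...        | ...        | ...     | ...         | ...        | ...    |
| 339 | N38A2      | 2009-12-01 | gentoo  | NaN         | NaN        | NaN    |
| 340 | N39A1      | 2009-11-22 | gentoo  | 46.8        | 14.3       | FEMALE |
| 341 | N39A2      | 2009-11-22 | gentoo  | 50.4        | 15.7       | MALE   |
| 342 | N43A1      | 2009-11-22 | gentoo  | 45.2        | 14.8       | FEMALE |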
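
The cleaning steps in the cell above can be tried in isolation. The following is a minimal sketch in plain Python, not part of the original pipeline: the helper names (`normalize_column`, `short_species`, `iso_date`) are hypothetical, and the sample headers and values are assumptions about what the raw file roughly looks like rather than data read from it.

```python
from datetime import datetime


def normalize_column(name: str) -> str:
    """Apply the same header clean-up chain as the notebook cell."""
    return (
        name
        .replace("(o/oo)", "")
        .replace("(", "")
        .replace(")", "")
        .strip()
        .replace(" ", "_")
        .lower()
    )


def short_species(value: str) -> str:
    """Lower-case the species label and keep its first word."""
    return value.lower().split()[0]


def iso_date(value: str) -> str:
    """Parse a month/day/two-digit-year string into an ISO date string."""
    return datetime.strptime(value, "%m/%d/%y").date().isoformat()


# Assumed sample inputs (hypothetical, not loaded from the CSV).
sample_headers = ["studyName", "Individual ID", "Culmen Length (mm)", "Delta 15 N (o/oo)"]
normalized = [normalize_column(header) for header in sample_headers]

print(normalized)
print(short_species("Adelie Penguin (Pygoscelis adeliae)"))
print(iso_date("11/11/07"))
```

Note that the header clean-up only lower-cases and replaces spaces, so a camel-case header such as `studyName` becomes `studyname`; this is why the notebook follows up with an explicit rename to `study_name`.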


In [4]:
# check data in polars

print(
    pl
    .from_pandas(df)
)

shape: (344, 6)
┌────────────┬────────────┬─────────┬─────────────┬────────────┬────────┐
│ pinguin_id ┆ date       ┆ species ┆ bill_length ┆ bill_depth ┆ sex    │
│ ---        ┆ ---        ┆ ---     ┆ ---         ┆ ---        ┆ ---    │
│ str        ┆ str        ┆ str     ┆ f64         ┆ f64        ┆ str    │
╞════════════╪════════════╪═════════╪═════════════╪════════════╪════════╡
│ N1A1       ┆ 2007-11-11 ┆ adelie  ┆ 39.1        ┆ 18.7       ┆ MALE   │
│ N1A2       ┆ 2007-11-11 ┆ adelie  ┆ 39.5        ┆ 17.4       ┆ FEMALE │
│ N2A1       ┆ 2007-11-16 ┆ adelie  ┆ 40.3        ┆ 18.0       ┆ FEMALE │
│ N2A2       ┆ 2007-11-16 ┆ adelie  ┆ null        ┆ null       ┆ null   │
│ N3A1       ┆ 2007-11-16 ┆ adelie  ┆ 36.7        ┆ 19.3       ┆ FEMALE │
│ …          ┆ …          ┆ …       ┆ …           ┆ …          ┆ …      │
│ N38A2      ┆ 2009-12-01 ┆ gentoo  ┆ null        ┆ null       ┆ null   │
│ N39A1      ┆ 2009-11-22 ┆ gentoo  ┆ 46.8        ┆ 14.3       ┆ FEMALE │
│ N39A2      ┆ 2009-11

In [5]:
df.to_csv(DATA_PATH_STAGE, index=False)
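
One way to sanity-check the staging step is a CSV round trip. The sketch below is an illustration under assumptions, not the notebook's own code: it uses a hypothetical three-row frame shaped like the prepared data and an in-memory buffer in place of `DATA_PATH_STAGE`, then confirms that columns, shape, and missing values survive writing and re-reading.

```python
import io

import pandas as pd

# Hypothetical sample shaped like the prepared frame (values are illustrative).
sample = pd.DataFrame({
    "pinguin_id": ["N1A1", "N1A2", "N2A2"],
    "date": ["2007-11-11", "2007-11-11", "2007-11-16"],
    "species": ["adelie", "adelie", "adelie"],
    "bill_length": [39.1, 39.5, None],
    "bill_depth": [18.7, 17.4, None],
    "sex": ["MALE", "FEMALE", None],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Compare structure and count missing values per column after the round trip.
same_columns = list(restored.columns) == list(sample.columns)
same_shape = restored.shape == sample.shape
missing_per_column = restored.isna().sum().to_dict()

print(same_columns, same_shape)
print(missing_per_column)
```

Passing `index=False`, as the notebook does, keeps the row index out of the file, so re-reading it does not produce an extra unnamed column.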

## Results

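- The dataset was prepared: column names were normalized, species names shortened, egg dates converted to ISO date strings, and the selected columns saved to `DATA_PATH_STAGE`.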