# Smart Pandas Example
This notebook demonstrates the capabilities of the smart-pandas package for a standard ML pipeline.

## Basic Usage
To use smart-pandas, all you need to do is change your pandas import so that it comes from the smart-pandas package. Importing pandas via smart-pandas registers the custom smart-pandas API extension, which is how we access the package's functionality.

In [1]:
from smart_pandas import pandas as pd
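
The `smart_pandas` attribute used throughout this notebook behaves like a pandas accessor. As a rough illustration of that mechanism (not smart-pandas's actual implementation), the sketch below registers a hypothetical `demo_tags` accessor with pandas' own `register_dataframe_accessor`. The accessor name, the `init` signature and the `columns_with_tag` helper are all assumptions made for this example.

```python
import pandas as pd


# "demo_tags" is a hypothetical accessor name used only for this sketch.
@pd.api.extensions.register_dataframe_accessor("demo_tags")
class DemoTagsAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj
        self._tags = {}

    def init(self, tags):
        """Attach a column -> list-of-tags mapping to this dataframe."""
        self._tags = dict(tags)

    def columns_with_tag(self, tag):
        """Return the columns that carry `tag` and exist in the data."""
        cols = [
            col
            for col, col_tags in self._tags.items()
            if tag in col_tags and col in self._obj.columns
        ]
        return self._obj[cols]


df = pd.DataFrame({"user_id": ["1", "2", "3"], "age": [25, 30, 35]})
df.demo_tags.init({"user_id": ["unique_identifier"], "age": ["raw_feature"]})
print(df.demo_tags.columns_with_tag("unique_identifier"))
```

pandas caches the accessor instance on each dataframe object, which is why the mapping passed to `init` is still available on later calls against the same dataframe.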

Then you can load your pandas dataframe however you normally would. Here we simply build a small dataframe by hand to use throughout this notebook.

In [2]:
# an example dataframe
data = pd.DataFrame(
    {
        "user_id": ["1", "2", "3"],
        "timestamp": [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-01-02"), pd.Timestamp("2020-01-03")],
        "name": ["Emily", "Adam", "Charles"],
        "weight": [60, 74, 80],
        "height": [165, 182, 185],
        "age": [25, 30, 35],
        "life_expectancy": [90, 80, 80],
    }
)

You also need to define a smart pandas config file, which we have at `example_config.yaml`. The config looks as follows:

```
{
  name: "life_expectancy_modelling_data",
  columns: [
    {
      name: "user_id",
      data_schema: {"dtype": "str"},
      tags: ["unique_identifier"],
      description: "Unique identifier for the person"
    },
    {
      name: "timestamp",
      data_schema: {"dtype": "datetime"},
      tags: ["row_timestamp"],
      description: "Timestamp of the data"
    },
    {
      name: "name",
      data_schema: {"dtype": "str"},
      tags: ["metadata"],
      description: "Name of the person"
    },
    {
      name: "weight",
      data_schema: {"dtype": "float"},
      tags: ["raw_feature"],
      description: "Weight of the person in kg"
    },
    {
      name: "height",
      data_schema: {"dtype": "float"},
      tags: ["raw_feature"],
      description: "Height of the person in cm"
    },
    {
      name: "age",
      data_schema: {"dtype": "int"},
      tags: ["raw_feature", "model_feature"],
      description: "Age of the person in years"
    },
    {
      name: "bmi",
      data_schema: {"dtype": "float"},
      tags: ["derived_feature", "model_feature"],
      description: "BMI of the person"
    },
    {
      name: "life_expectancy",
      data_schema: {"dtype": "int"},
      tags: ["target"],
      description: "Life expectancy of the person in years"
    },
  ]
}

```
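
The tags are what drive the column lookups later in this notebook. As a minimal sketch of that idea, the snippet below groups column names by tag using a plain Python copy of part of the config. The `columns_by_tag` helper is hypothetical and is not part of smart-pandas; it only shows how one column can belong to several groups.

```python
# A subset of the config above, written as a plain Python structure.
config = {
    "name": "life_expectancy_modelling_data",
    "columns": [
        {"name": "user_id", "tags": ["unique_identifier"]},
        {"name": "weight", "tags": ["raw_feature"]},
        {"name": "height", "tags": ["raw_feature"]},
        {"name": "age", "tags": ["raw_feature", "model_feature"]},
        {"name": "bmi", "tags": ["derived_feature", "model_feature"]},
        {"name": "life_expectancy", "tags": ["target"]},
    ],
}


def columns_by_tag(config):
    """Group column names by tag, preserving the order in the config."""
    grouped = {}
    for column in config["columns"]:
        for tag in column["tags"]:
            grouped.setdefault(tag, []).append(column["name"])
    return grouped


tags = columns_by_tag(config)
print(tags["raw_feature"])    # ['weight', 'height', 'age']
print(tags["model_feature"])  # ['age', 'bmi']
```

Note that `age` appears under both `raw_feature` and `model_feature`, while `bmi` is a model feature that does not exist in the raw data.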

Then you can initialise the smart-pandas configuration and attach it to your pandas dataframe with the following call:

In [4]:
data.smart_pandas.init(config_path="example_config.yaml")

Now we have access to the `smart_pandas` attribute on our pandas dataframe, and can access various calls. Below are some examples:

In [5]:
# access the unique identifier column
data.smart_pandas.unique_identifier

|   | user_id |
|---|---------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |


In [6]:
# access the raw feature columns
data.smart_pandas.raw_features

|   | weight | height | age |
|---|--------|--------|-----|
| 0 | 60 | 165 | 25 |
| 1 | 74 | 182 | 30 |
| 2 | 80 | 185 | 35 |


## State
smart-pandas tracks the synchronisation between the data columns and the configuration file through the `state` attribute. The `state` attribute represents the data at certain phases in the ML data lifecycle. The state is made up of two attributes, the `StateName` and the `MLStage`. The `StateName` represents the current point in a specific data pipeline, whereas the `MLStage` identifies which data pipeline we are in. Let's see an example with our data.

In [7]:
data.smart_pandas.state

StateName.RAW, MLStage.TRAINING

The `StateName` for our current data is `RAW`, which indicates that we currently have the raw features and not the processed model features. The `MLStage` for our current data is `TRAINING`; smart-pandas identifies this via the presence of the `target` column in the data.

An important difference between `StateName` and `MLStage` is that the `StateName` is mutable and the `MLStage` is not. Once an `MLStage` is set for a given dataset, it cannot change without re-initialising the smart-pandas config for that dataframe. This should make intuitive sense: we are either in a training workflow or an inference workflow, and we won't be moving between them. The `StateName`, however, will necessarily change as we move through the steps of our data transformations.
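
To make the mutable/immutable split concrete, here is a small sketch of a state holder in which the stage is fixed at construction and the state name can be reassigned. The `TrackedState` class and the `INFERENCE` member name are assumptions for illustration; only `RAW`, `PROCESSED`, `UNKNOWN`, `CORRUPTED` and `TRAINING` appear in this notebook, and the real smart-pandas classes may be organised differently.

```python
from enum import Enum


class StateName(Enum):
    RAW = "raw"
    PROCESSED = "processed"
    UNKNOWN = "unknown"
    CORRUPTED = "corrupted"


class MLStage(Enum):
    TRAINING = "training"
    INFERENCE = "inference"  # assumed name; not shown in this notebook


class TrackedState:
    """Hypothetical holder: state_name can change, ml_stage cannot."""

    def __init__(self, columns, target):
        self.state_name = StateName.RAW
        # Decided once, from the presence of the target column.
        self._ml_stage = MLStage.TRAINING if target in columns else MLStage.INFERENCE

    @property
    def ml_stage(self):
        # No setter is defined, so assigning to ml_stage raises AttributeError.
        return self._ml_stage


state = TrackedState(["user_id", "age", "life_expectancy"], target="life_expectancy")
state.state_name = StateName.PROCESSED  # allowed: the state name is mutable

try:
    state.ml_stage = MLStage.INFERENCE  # rejected: the stage is read-only
except AttributeError:
    print("ml_stage is read-only")

print(state.state_name, state.ml_stage)
```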

Let's see what happens to the state after we have done our feature engineering!

In [8]:
def feature_engineering(data: pd.DataFrame) -> pd.DataFrame:
    """An example feature engineering function."""

    data.loc[:, "bmi"] = data.loc[:, "weight"] / (data.loc[:, "height"] / 100) ** 2
    data.drop(columns=["weight", "height"], inplace=True)
    return data
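
As a quick check of the arithmetic in `feature_engineering`, the sketch below applies the same formula (weight in kg divided by the square of height in metres) to the three example rows using plain Python. The `people` and `bmis` names are introduced here purely for illustration.

```python
# weight in kg and height in cm, as in the example dataframe
people = {"Emily": (60, 165), "Adam": (74, 182), "Charles": (80, 185)}

bmis = {}
for name, (weight_kg, height_cm) in people.items():
    height_m = height_cm / 100  # convert cm to m before squaring
    bmis[name] = weight_kg / height_m**2

for name, bmi in bmis.items():
    print(f"{name}: {bmi:.1f}")
# Emily: 22.0
# Adam: 22.3
# Charles: 23.4
```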

In [9]:
# perform the feature engineering
data = feature_engineering(data)

# update the state based on the new columns
data.smart_pandas.update_state()

# check the state
data.smart_pandas.state

StateName.PROCESSED, MLStage.TRAINING

You can see that the `StateName` has now moved from `RAW` to `PROCESSED`. This is based on the presence of all the model features (check the tags in the config above) in the dataframe. It tells us our data is ready to be fed into the model!

The state has two values which represent a loss of the expected data structure: `UNKNOWN` and `CORRUPTED`. A `CORRUPTED` state results from a strict violation of one of the data assumptions that smart-pandas builds on. For example, smart-pandas requires your data to have a unique identifier. If that column is missing from your data, your state will be `CORRUPTED`.

In [10]:
# copy the id col to append back after this test
id_col = data.smart_pandas.unique_identifier

# drop the id column
data.drop(columns=data.smart_pandas.config.unique_identifier, inplace=True)

# update the state to reflect the missing id column
data.smart_pandas.update_state()

# check the state
data.smart_pandas.state



StateName.CORRUPTED, MLStage.TRAINING

The `UNKNOWN` state represents data which is somewhere between `RAW` and `PROCESSED`. That is to say, it is not missing any key columns, but it contains a partial combination of the raw and derived features. For example, if we generate the `bmi` feature as above, but we do not drop the non-model features, we will enter an `UNKNOWN` state.

In [11]:
# restore the id column that was dropped in the previous cell
data.loc[:, data.smart_pandas.config.unique_identifier] = id_col
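
The four state names described in this section can be summarised as a small decision function. This is only a sketch of one reading of the rules above, written against plain column names; the `infer_state_name` helper and the exact order of the checks are assumptions, and the real smart-pandas logic may differ.

```python
def infer_state_name(columns, tags):
    """Classify a set of column names using the rules described above.

    `tags` maps a tag name to the list of columns carrying that tag.
    """
    present = set(columns)
    raw = set(tags["raw_feature"])
    derived = set(tags["derived_feature"])
    model = set(tags["model_feature"])

    if not set(tags["unique_identifier"]) <= present:
        return "CORRUPTED"  # a required key column is missing
    if raw <= present and not (derived & present):
        return "RAW"  # every raw feature, no derived features yet
    if model <= present and not ((raw - model) & present):
        return "PROCESSED"  # every model feature, raw-only columns dropped
    return "UNKNOWN"  # a partial mix of raw and derived features


tags = {
    "unique_identifier": ["user_id"],
    "raw_feature": ["weight", "height", "age"],
    "derived_feature": ["bmi"],
    "model_feature": ["age", "bmi"],
}

print(infer_state_name(["user_id", "weight", "height", "age"], tags))         # RAW
print(infer_state_name(["user_id", "age", "bmi"], tags))                      # PROCESSED
print(infer_state_name(["age", "bmi"], tags))                                 # CORRUPTED
print(infer_state_name(["user_id", "weight", "height", "age", "bmi"], tags))  # UNKNOWN
```

The last call corresponds to the `UNKNOWN` case described above: `bmi` has been generated, but `weight` and `height` have not been dropped.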

## Data Validation
Smart Pandas uses [Pandera schemas](https://pandera.readthedocs.io/en/stable/) under the hood to validate your data at various points in your pipeline. It builds a schema from the schema settings defined in the config and the current state of the dataframe, then validates the data against it. Like much of pandas, `validate` offers an `inplace` option: when it is set, any type coercion performed by the schema validation is applied to your data. Below is an example:

In [12]:
# re-initialise the dataframe to include the raw features rather than the processed model features
data = pd.DataFrame(
    {
        "user_id": ["1", "2", "3"],
        "timestamp": [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-01-02"), pd.Timestamp("2020-01-03")],
        "name": ["Emily", "Adam", "Charles"],
        "weight": [60, 74, 80],
        "height": [165, 182, 185],
        "age": [25, 30, 35],
        "life_expectancy": [90, 80, 80],
    }
)

data.smart_pandas.init(config_path="example_config.yaml")

In [13]:
data.smart_pandas.validate(inplace=True)

Our data passed the validation checks without any issues, great! If we want to know exactly what we are validating, we can access the schema directly.

In [14]:
print(data.smart_pandas.schema)

<Schema DataFrameSchema(
    columns={
        'user_id': <Schema Column(name=user_id, type=DataType(str))>
        'timestamp': <Schema Column(name=timestamp, type=DataType(datetime64[ns]))>
        'name': <Schema Column(name=name, type=DataType(str))>
        'weight': <Schema Column(name=weight, type=DataType(float64))>
        'height': <Schema Column(name=height, type=DataType(float64))>
        'age': <Schema Column(name=age, type=DataType(int64))>
        'life_expectancy': <Schema Column(name=life_expectancy, type=DataType(int64))>
    },
    checks=[],
    parsers=[],
    coerce=True,
    dtype=None,
    index=None,
    strict=True,
    name=None,
    ordered=False,
    unique_column_names=True,
    metadata=None, 
    add_missing_columns=False
)>
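
Two settings in the printed schema are worth a closer look: `coerce=True` and `strict=True`. In the example data, `weight` and `height` are created as integers, while the schema expects `float64`, so an in-place validation would cast them. The sketch below imitates that behaviour with plain pandas rather than Pandera. `check_and_coerce` is a hypothetical stand-in, not the smart-pandas or Pandera implementation, and it covers only three numeric columns.

```python
import pandas as pd

# Expected dtypes for three of the columns in the schema above.
expected_dtypes = {"weight": "float64", "height": "float64", "age": "int64"}


def check_and_coerce(df, dtypes):
    """Plain-pandas stand-in for a strict, coercing schema check."""
    unexpected = [col for col in df.columns if col not in dtypes]
    missing = [col for col in dtypes if col not in df.columns]
    if unexpected or missing:  # strict: the column sets must match exactly
        raise ValueError(f"unexpected={unexpected}, missing={missing}")
    return df.astype(dtypes)  # coerce: cast each column to its expected dtype


raw = pd.DataFrame(
    {"weight": [60, 74, 80], "height": [165, 182, 185], "age": [25, 30, 35]}
)
checked = check_and_coerce(raw, expected_dtypes)
print(checked.dtypes)
```

After the call, `weight` and `height` are `float64` and `age` stays `int64`. Passing a dataframe with an extra column would raise instead of being silently accepted, which is the role `strict=True` plays in the schema.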


A good question to ask might be: what happens if we try to validate the data after we have performed some feature engineering? Let's generate our `bmi` feature again and see what happens when we attempt to validate.

In [15]:
data = feature_engineering(data)

In [16]:
import pandera as pa

try:
    data.smart_pandas.validate(inplace=True, update_state=False)  # ignore the update_state flag for now, it will be explained below!
except pa.errors.SchemaError as e:
    print(e)

column 'bmi' not in DataFrameSchema {'user_id': <Schema Column(name=user_id, type=DataType(str))>, 'timestamp': <Schema Column(name=timestamp, type=DataType(datetime64[ns]))>, 'name': <Schema Column(name=name, type=DataType(str))>, 'weight': <Schema Column(name=weight, type=DataType(float64))>, 'height': <Schema Column(name=height, type=DataType(float64))>, 'age': <Schema Column(name=age, type=DataType(int64))>, 'life_expectancy': <Schema Column(name=life_expectancy, type=DataType(int64))>}


The validation has failed! This is because the schema is built from the current state of the data. Since the state is still `RAW` (because we haven't updated it), the schema does not expect to find the derived column `bmi` in the data. Fortunately, the `validate` method provides the handy `update_state` flag. Setting it to `True` updates the state (and therefore the resulting schema) before the validation runs.

In [17]:
data.smart_pandas.validate(inplace=True, update_state=True) 

Now our data passes the validation, and if we check the schema we can see that it has been updated to reflect the current state of the data.

In [18]:
print(data.smart_pandas.schema)

<Schema DataFrameSchema(
    columns={
        'user_id': <Schema Column(name=user_id, type=DataType(str))>
        'timestamp': <Schema Column(name=timestamp, type=DataType(datetime64[ns]))>
        'name': <Schema Column(name=name, type=DataType(str))>
        'age': <Schema Column(name=age, type=DataType(int64))>
        'bmi': <Schema Column(name=bmi, type=DataType(float64))>
        'life_expectancy': <Schema Column(name=life_expectancy, type=DataType(int64))>
    },
    checks=[],
    parsers=[],
    coerce=True,
    dtype=None,
    index=None,
    strict=True,
    name=None,
    ordered=False,
    unique_column_names=True,
    metadata=None, 
    add_missing_columns=False
)>
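
The two schemas printed in this section differ only in which columns they cover. As a sketch of how that selection could follow from the tags and the state name, the hypothetical `expected_columns` helper below reproduces both column lists from the config. It is an illustration under the assumption that derived features are excluded in `RAW` and raw-only features are excluded in `PROCESSED`, not the actual smart-pandas code.

```python
# (name, tags) pairs in the same order as the config.
config_columns = [
    ("user_id", ["unique_identifier"]),
    ("timestamp", ["row_timestamp"]),
    ("name", ["metadata"]),
    ("weight", ["raw_feature"]),
    ("height", ["raw_feature"]),
    ("age", ["raw_feature", "model_feature"]),
    ("bmi", ["derived_feature", "model_feature"]),
    ("life_expectancy", ["target"]),
]


def expected_columns(config_columns, state_name):
    """Pick the columns a schema would cover in a given state."""
    selected = []
    for name, tags in config_columns:
        is_raw_only = "raw_feature" in tags and "model_feature" not in tags
        is_derived = "derived_feature" in tags
        if state_name == "RAW" and is_derived:
            continue  # derived features do not exist yet
        if state_name == "PROCESSED" and is_raw_only:
            continue  # raw-only features have been dropped
        selected.append(name)
    return selected


print(expected_columns(config_columns, "RAW"))
print(expected_columns(config_columns, "PROCESSED"))
```

The first list matches the columns of the schema printed before feature engineering, and the second matches the schema printed after `update_state=True`.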
