<img src='../images/xebia-logo.png' width='300px' align='right' style="padding: 15px">

# Python Packaging

In this notebook, you will practice organizing Python code into a package.

The first step will be to set up the basic infrastructure required to run these notebooks. 

**If you are comfortable using git, we recommend checking out a new branch to follow along (`git checkout -b BRANCH_NAME`) during the training.**

You need to:

1. Create a new **package project** with a `pyproject.toml` file in the root of the repository. Run: `uv init --package --name animal_shelter` 
    
2. Install **Pandas** as a dependency with `uv add pandas`. Note that this also creates a virtual environment and a lock file if they do not exist yet.

3. Install **Jupyter** as a development dependency, so that we can run these notebooks from the virtual environment: `uv add --dev jupyter`.

<mark>**Question:**</mark> Why are we adding Jupyter as a development dependency, but Pandas as a regular dependency?
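
Once the three steps above are done, a quick sanity check can confirm that the notebook kernel really runs inside the project's virtual environment. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and it assumes that uv has installed the new `animal_shelter` project into the environment alongside Pandas.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The interpreter path should point into the project's virtual environment.
print("Interpreter:", sys.executable)

# An empty list means both the dependency and the package itself are importable.
print("Missing:", missing_modules(["pandas", "animal_shelter"]))
```

If a name shows up as missing, the kernel is probably running from a different environment than the one uv created.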

## The use case

Now let's have a look at the use case this training uses as an example. It concerns an animal shelter that is trying to predict the outcome (e.g. adopted, transferred) of the animals that come through it.

In [1]:
import pandas as pd
import re

def load_data(path):
    """Load the data and convert the column names.

    Parameters
    ----------
    path : str
        Path to data
    Returns
    -------
    df : pandas.DataFrame
        DataFrame with data
    """
    df = (
        pd.read_csv(path, parse_dates=["DateTime"])
        .rename(columns=lambda x: x.replace("upon", "Upon"))
        .rename(columns=convert_camel_case)
        .fillna("Unknown")
    )
    return df


def convert_camel_case(name):
    """Convert camelCaseString to snake_case_string."""
    s1 = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub("([a-z0-9])([A-Z])", r"\1_\2", s1).lower()
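
To see what the renaming in `load_data` does, `convert_camel_case` can be tried on a few column names by hand. The sketch below repeats the function so that it runs on its own; the example names are assumptions about how the raw CSV headers are spelled, inferred from the `parse_dates=["DateTime"]` argument and the `"upon"` replacement above.

```python
import re


def convert_camel_case(name):
    """Convert camelCaseString to snake_case_string (same logic as above)."""
    s1 = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub("([a-z0-9])([A-Z])", r"\1_\2", s1).lower()


# Assumed raw header spellings, after "upon" has been replaced by "Upon".
for raw_name in ["DateTime", "OutcomeType", "AnimalID", "SexUponOutcome"]:
    print(raw_name, "->", convert_camel_case(raw_name))
```

Without replacing `"upon"` by `"Upon"` first, a header spelled `SexuponOutcome` would come out as `sexupon_outcome`, which is presumably why `load_data` capitalizes it before converting.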

In [3]:
animal_outcomes = load_data('../data/train.csv')
animal_outcomes.head(10)

| | animal_id | name | date_time | outcome_type | outcome_subtype | animal_type | sex_upon_outcome | age_upon_outcome | breed | color |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | Unknown | Dog | Neutered Male | 1 year | Shetland Sheepdog Mix | Brown/White |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | 1 year | Domestic Shorthair Mix | Cream Tabby |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | 2 years | Pit Bull Mix | Blue/White |
| 3 | A683430 | Unknown | 2014-07-11 19:09:00 | Transfer | Partner | Cat | Intact Male | 3 weeks | Domestic Shorthair Mix | Blue Cream |
| 4 | A667013 | Unknown | 2013-11-15 12:52:00 | Transfer | Partner | Dog | Neutered Male | 2 years | Lhasa Apso/Miniature Poodle | Tan |
| 5 | A677334 | Elsa | 2014-04-25 13:04:00 | Transfer | Partner | Dog | Intact Female | 1 month | Cairn Terrier/Chihuahua Shorthair | Black/Tan |
| 6 | A699218 | Jimmy | 2015-03-28 13:11:00 | Transfer | Partner | Cat | Intact Male | 3 weeks | Domestic Shorthair Mix | Blue Tabby |
| 7 | A701489 | Unknown | 2015-04-30 17:02:00 | Transfer | Partner | Cat | Unknown | 3 weeks | Domestic Shorthair Mix | Brown Tabby |
| 8 | A671784 | Lucy | 2014-02-04 17:17:00 | Adoption | Unknown | Dog | Spayed Female | 5 months | American Pit Bull Terrier Mix | Red/White |
| 9 | A677747 | Unknown | 2014-05-03 07:48:00 | Adoption | Offsite | Dog | Spayed Female | 1 year | Cairn Terrier | White |


Feel free to spend some time on preliminary data exploration, for example:

1. What is the data about?
2. Which ML task could we carry out here? What would the predicted outcome be?
3. What are the features, and how would you need to transform the data to obtain them?

In [None]:
# Explore the animal_outcomes DataFrame here if you like
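
One possible starting point for the exploration is to count outcomes, overall and per animal type. The sketch below uses a handful of rows copied from the preview above so that it runs on its own; on the real data the same calls would be applied to `animal_outcomes`.

```python
import pandas as pd

# A few rows copied from the preview above (subset of the columns).
sample = pd.DataFrame(
    {
        "animal_type": ["Dog", "Cat", "Dog", "Cat", "Dog"],
        "outcome_type": [
            "Return_to_owner",
            "Euthanasia",
            "Adoption",
            "Transfer",
            "Transfer",
        ],
    }
)

# How often does each outcome occur?
outcome_counts = sample["outcome_type"].value_counts()
print(outcome_counts)

# How do the outcomes differ between cats and dogs?
outcomes_by_type = pd.crosstab(sample["animal_type"], sample["outcome_type"])
print(outcomes_by_type)
```

Five rows are far too few to draw conclusions from; the point is only the shape of the questions you can ask of the full dataset.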

From this dataset, you can generate the following features about each animal that may be helpful to train a machine learning model later on.

- boolean indicator for whether it is a dog
- boolean indicator for whether it has a name
- categorical feature indicating its sex
- categorical feature indicating whether it is neutered
- categorical feature indicating its hair type
- age upon outcome in days
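
The last feature is the least obvious one: ages are stored as strings such as `3 weeks` and need to be converted to a number of days. The sketch below shows the idea in plain Python for a single value; the helper name `age_to_days` is hypothetical, and the day counts per period (30 per month, 365 per year) are the approximations this notebook's feature code uses.

```python
import math

# Approximate number of days in each period.
DAYS_PER_PERIOD = {"day": 1, "week": 7, "month": 30, "year": 365}


def age_to_days(age):
    """Convert a string such as '3 weeks' to days; 'Unknown' becomes NaN."""
    if age == "Unknown":
        return math.nan
    amount, period = age.split()
    # Strip the plural "s" so that "weeks" and "week" share one entry.
    return float(amount) * DAYS_PER_PERIOD[period.rstrip("s")]


for age in ["1 year", "3 weeks", "10 months", "Unknown"]:
    print(age, "->", age_to_days(age))
```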

All these features are already generated for you with the `add_features(df)` function below:

In [4]:
import numpy as np
import pandas as pd


def add_features(df):
    """Add some features to our data.
    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with data (see load_data)
    Returns
    -------
    with_features : pandas.DataFrame
        DataFrame with some column features added
    """
    df['is_dog'] = check_is_dog(df['animal_type'])


    # Check if it has a name.
    df['has_name'] = df['name'].str.lower() != 'unknown'


    # Get sex.
    sexUponOutcome = df['sex_upon_outcome']
    sex = pd.Series('unknown', index=sexUponOutcome.index)

    sex.loc[sexUponOutcome.str.endswith('Female')] = 'female'
    sex.loc[sexUponOutcome.str.endswith('Male')] = 'male'
    df['sex'] = sex



    # Check if neutered.
    neutered = sexUponOutcome.str.lower()
    neutered.loc[neutered.str.contains('neutered')] = 'fixed'
    neutered.loc[neutered.str.contains('spayed')] = 'fixed'


    neutered.loc[neutered.str.contains('intact')] = 'intact'
    neutered.loc[~neutered.isin(['fixed', 'intact'])] = 'unknown'


    df['neutered'] = neutered


    # Get hair type.

    hairType = df['breed'].str.lower()
    Valid_hair_types = ['shorthair', 'medium hair', 'longhair']



    for hair in Valid_hair_types:
        is_hair_type = hairType.str.contains(hair)
        hairType[is_hair_type] = hair

    hairType[~hairType.isin(Valid_hair_types)] = 'unknown'


    df['hair_type'] = hairType


    # Age in days upon outcome.

    Split_Age = df['age_upon_outcome'].str.split()
    time = Split_Age.apply(lambda x: x[0] if x[0] != 'Unknown' else np.nan)
    period = Split_Age.apply(lambda x: x[1] if x[0] != 'Unknown' else None)
    period_Mapping = {'year': 365, 'years': 365, 'weeks': 7, 'week': 7,
                      'month': 30, 'months': 30, 'days': 1, 'day': 1}
    days_upon_outcome = time.astype(float) * period.map(period_Mapping)
    df['days_upon_outcome'] = days_upon_outcome



    return df

def check_is_dog(animal_type):
    """Check if the animal is a dog, otherwise return False.
    Parameters
    ----------
    animal_type : pandas.Series
        Type of animal
    Returns
    -------
    result : pandas.Series
        Dog or not
    """
    # Check if it's either a cat or a dog.
    is_cat_dog = animal_type.str.lower().isin(['dog', 'cat'])
    if not is_cat_dog.all():
        print('Found something else but dogs and cats:\n%s',
              animal_type[~is_cat_dog])
        raise RuntimeError("Found pets that are not dogs or cats.")
    is_dog = animal_type.str.lower() == 'dog'
    return is_dog

In [5]:
animal_outcomes = load_data('../data/train.csv')
with_features = add_features(animal_outcomes)
with_features.head()

| | animal_id | name | date_time | outcome_type | outcome_subtype | animal_type | sex_upon_outcome | age_upon_outcome | breed | color | is_dog | has_name | sex | neutered | hair_type | days_upon_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | Unknown | Dog | Neutered Male | 1 year | Shetland Sheepdog Mix | Brown/White | True | True | male | fixed | unknown | 365.0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | 1 year | Domestic Shorthair Mix | Cream Tabby | False | True | female | fixed | shorthair | 365.0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | 2 years | Pit Bull Mix | Blue/White | True | True | male | fixed | unknown | 730.0 |
| 3 | A683430 | Unknown | 2014-07-11 19:09:00 | Transfer | Partner | Cat | Intact Male | 3 weeks | Domestic Shorthair Mix | Blue Cream | False | False | male | intact | shorthair | 21.0 |
| 4 | A667013 | Unknown | 2013-11-15 12:52:00 | Transfer | Partner | Dog | Neutered Male | 2 years | Lhasa Apso/Miniature Poodle | Tan | True | False | male | fixed | unknown | 730.0 |


There are some bad practices going on in the functions above, but don't worry about their quality for now. Let's focus on packaging the code.

## <mark>Exercise:</mark> Successfully import the functions as a package
Your goal is to copy-paste the code from the cells above into a package that exports the functionality that a user (e.g. an analyst writing a report in a notebook or a service serving predictions) would use. 

They should be able to import the functions as in the cell below:

In [7]:
from animal_shelter.data import load_data
from animal_shelter.features import add_features
animal_outcomes = load_data('../data/test.csv')
with_features = add_features(animal_outcomes)
with_features.head()

| | id | name | date_time | animal_type | sex_upon_outcome | age_upon_outcome | breed | color | is_dog | has_name | sex | neutered | hair_type | days_upon_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Summer | 2015-10-12 12:15:00 | Dog | Intact Female | 10 months | Labrador Retriever Mix | Red/White | True | True | female | intact | unknown | 300.0 |
| 1 | 2 | Cheyenne | 2014-07-26 17:59:00 | Dog | Spayed Female | 2 years | German Shepherd/Siberian Husky | Black/Tan | True | True | female | fixed | unknown | 730.0 |
| 2 | 3 | Gus | 2016-01-13 12:20:00 | Cat | Neutered Male | 1 year | Domestic Shorthair Mix | Brown Tabby | False | True | male | fixed | shorthair | 365.0 |
| 3 | 4 | Pongo | 2013-12-28 18:12:00 | Dog | Intact Male | 4 months | Collie Smooth Mix | Tricolor | True | True | male | intact | unknown | 120.0 |
| 4 | 5 | Skooter | 2015-09-24 17:59:00 | Dog | Neutered Male | 2 years | Miniature Poodle Mix | White | True | True | male | fixed | unknown | 730.0 |


***Hints:*** 

1. The package should live in the `src` folder, so that its init file is at `./src/animal_shelter/__init__.py`.
2. You will need two modules inside that package folder, `data.py` and `features.py`, to run the code above successfully.
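
The layout from the hints can be tried out safely in a temporary directory before touching the real project. The sketch below builds a throwaway package with stub functions; the package name `demo_shelter` and the stub bodies are hypothetical, chosen so that nothing clashes with your actual `animal_shelter` package. Extending `sys.path` by hand here only emulates what installing the project into the virtual environment achieves.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway src-layout project in a temporary directory.
project_root = Path(tempfile.mkdtemp())
package_dir = project_root / "src" / "demo_shelter"  # hypothetical package name
package_dir.mkdir(parents=True)

# An (empty) __init__.py marks the directory as a regular package.
(package_dir / "__init__.py").write_text("")

# One module per group of related functions; the bodies are stubs.
(package_dir / "data.py").write_text(
    "def load_data(path):\n"
    "    return f'would load {path}'\n"
)
(package_dir / "features.py").write_text(
    "def add_features(df):\n"
    "    return f'would add features to {df}'\n"
)

# Emulate an installed package by making src/ importable.
sys.path.insert(0, str(project_root / "src"))
importlib.invalidate_caches()

from demo_shelter.data import load_data
from demo_shelter.features import add_features

print(add_features(load_data("train.csv")))
```

The import lines have the same shape as the ones in the exercise: `package.module`, then the function name.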

You can run the cell below to automatically reload changes to the source code of any imported package, which is very useful during development.

In [None]:
%load_ext autoreload
%autoreload 2
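
To get a feeling for the problem that autoreload solves, the sketch below shows by hand that a running Python session keeps using the code it imported earlier until the module is reloaded. It uses only the standard library; the module name `reload_demo` is hypothetical, and `importlib.reload` stands in for what the autoreload extension roughly automates.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid cached bytecode so the edited source is always what gets executed.
sys.dont_write_bytecode = True

# Create a throwaway module (hypothetical name) in a temporary directory.
module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / "reload_demo.py"
module_path.write_text("def greet():\n    return 'version 1'\n")
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

import reload_demo

before = reload_demo.greet()

# Simulate editing the source file while the session keeps running.
module_path.write_text("def greet():\n    return 'version two'\n")

# Without a reload, the session still uses the code it imported earlier.
stale = reload_demo.greet()

# Reloading re-executes the module's source and picks up the edit.
importlib.reload(reload_demo)
after = reload_demo.greet()

print(before, "|", stale, "|", after)
```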