## Pandas DataFrame Validation with Engarde

In this notebook, we'll look at how to validate data in `pandas.DataFrame` objects. Tom Augspurger created the [engarde](https://github.com/TomAugspurger/engarde) library, which lets you test a DataFrame against specific validation rules, either with function decorators or by calling its built-in check functions directly.
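
To make the decorator idea concrete, here is a minimal sketch of how a validation decorator can wrap a function that returns a DataFrame. The `expect_columns` decorator and the `load_demo` function are hypothetical names made up for illustration; this is not engarde's implementation, only the general pattern under the assumption that the check runs on the function's return value.

```python
import functools

import pandas as pd


def expect_columns(expected):
    """Hypothetical decorator: assert the returned DataFrame has at least these columns."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Run the check on the function's output, then pass it through unchanged.
            missing = set(expected) - set(result.columns)
            assert not missing, f"missing columns: {sorted(missing)}"
            return result
        return wrapper
    return decorator


@expect_columns(['city', 'sale_amount'])
def load_demo():
    # Toy data, not the sales file used in this notebook.
    return pd.DataFrame({'city': ['Berlin'], 'sale_amount': [10.0]})


demo = load_demo()
```

Because the check lives in the decorator, the function body stays focused on the transformation, and a failed expectation surfaces as an `AssertionError` at the step that broke it.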

In [None]:
import pandas as pd
import engarde.decorators as ed
from datetime import datetime

In [None]:
sales = pd.read_csv('../data/sales_data_duped_with_nulls.csv')

## Data Quality Check

In [None]:
sales.head()

In [None]:
sales.dtypes

### Engarde lets us check datatypes, so first we record the dtypes we expect after our first function runs, including any changes that function makes

In [None]:
new_dtypes = {
    'timestamp': object,
    'city': object,
    'store_id': int,
    'sale_number': float,
    'sale_amount': float,
    'associate': object
}

In [None]:
@ed.has_dtypes(new_dtypes)
@ed.is_shape((None, 6))
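def update_dtypes(sales):
    # Parse the timestamp strings into datetime.date objects;
    # a column of date objects still has the object dtype.
    sales.timestamp = sales.timestamp.map(
        lambda x: datetime.strptime(
            x, '%Y-%m-%dT%H:%M:%S').date())
    return sales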

In [None]:
sales = update_dtypes(sales)

In [None]:
sales.timestamp.iloc[0]
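
As a rough picture of what a dtype check amounts to, the sketch below compares a DataFrame's dtypes against an expected mapping in plain pandas. The `dtype_mismatches` helper and the toy data are hypothetical and not part of engarde. The example also shows one reason such a check is worth having: with pandas' default dtypes, an integer column that contains a missing value is stored as `float64`.

```python
import pandas as pd


def dtype_mismatches(df, expected):
    """Hypothetical helper: return {column: (expected, actual)} for each mismatch."""
    mismatches = {}
    for column, dtype in expected.items():
        actual = df[column].dtype
        if actual != dtype:
            mismatches[column] = (dtype, actual)
    return mismatches


# Toy data: the None in store_id forces the column to float64.
toy = pd.DataFrame({
    'store_id': [1, None],
    'sale_amount': [9.5, 20.0],
})

mismatches = dtype_mismatches(
    toy, {'store_id': 'int64', 'sale_amount': 'float64'})
print(mismatches)
```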

## Now we want to remove poor-quality data. Let's drop duplicates and any rows missing values in important columns we might need later

In [None]:
@ed.has_dtypes(new_dtypes)
@ed.is_shape((None, 6))
@ed.none_missing()
def remove_poor_quality_data(sales):
    sales = sales.drop_duplicates()
    sales = sales.dropna(subset=['sale_amount', 'store_id', 
                                 'sale_number', 
                                 'city', 'associate'])
    return sales

In [None]:
sales = remove_poor_quality_data(sales)

In [None]:
sales.isnull().any()
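
The interaction between `dropna(subset=...)` and a missing-values check can be sketched on toy data. The column names below are made up for illustration, and the sketch assumes the check looks at every column of the frame rather than only the ones passed to `subset`.

```python
import pandas as pd

# Toy data with missing values spread across three columns.
toy = pd.DataFrame({
    'store_id': [1.0, 2.0, None],
    'sale_amount': [9.5, None, 3.0],
    'discount': [None, 0.1, 0.2],
})

# Count missing values per column before cleaning.
missing_before = toy.isnull().sum()

# Drop rows only where the listed columns are missing.
cleaned = toy.dropna(subset=['store_id', 'sale_amount'])

# Columns outside the subset can still contain missing values.
still_missing = cleaned.isnull().any()
print(still_missing)
```

Here `discount` is not in the subset, so its missing value survives the cleaning step, and a check over all columns would still flag it.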

In [None]:
final_types = new_dtypes.copy()
final_types.update({
    'store_total': float,
    'associate_total': float,
    'city_total': float
})

In [None]:
@ed.has_dtypes(final_types)
@ed.none_missing()
def calculate_store_sales(sales):
    sales['store_total'] = sales.groupby(
        'store_id').transform(sum)['sale_amount']
    sales['associate_total'] = sales.groupby(
        'associate').transform(sum)['sale_amount']
    sales['city_total'] = sales.groupby('city')[
        'sale_amount'].transform(sum)
    return sales

In [None]:
sales.head()

In [None]:
sales = calculate_store_sales(sales)

## Exercise: Can you fix the above error?

In [None]:
# %load ../solutions/engarde.py


In [None]:
sales = calculate_store_sales(sales)

In [None]:
sales
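
The total columns above come from a groupby followed by `transform`, which returns one value per original row rather than one per group. A small sketch on toy data (not the sales file) shows the shape of the result when the column is selected before transforming:

```python
import pandas as pd

# Toy data: two sales in Berlin, one in Paris.
toy = pd.DataFrame({
    'city': ['Berlin', 'Berlin', 'Paris'],
    'sale_amount': [10.0, 5.0, 7.0],
})

# Each row receives the sum of sale_amount for its own city.
toy['city_total'] = toy.groupby('city')['sale_amount'].transform('sum')
print(toy)
```

Every row keeps its position, so the new column lines up with the original frame and can be assigned directly.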

In [None]:
@ed.is_shape((None, 9))
def save_report(sales):
    sales.to_csv('../data/sales_summary.csv')
    return sales

In [None]:
sales = save_report(sales)

In [None]:
sales.dtypes