Now that we have an optimized model, we can use cross-validation to produce out-of-sample predictions.  With these predictions, we can estimate the total financial error caused by allocating purchases to the wrong GL categories.  At this point, the real-life cost of a misallocation is not clear: the categories may be used simply to understand which ones are impacting P/L or other financial statements, or they may feed strategic planning such as marketing campaigns, vendor negotiations, etc.

In [51]:
%load_ext autoreload
%autoreload 2
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [97]:
import pandas as pd
from autocat.models.svm import get_svm_model_v1
from autocat.data.features import CombinedFeatureAdder, PandasDataFrameTransformer, feature_transactions_per_day
from autocat.data.filters import no_null_StdUnitsShipped_StdNetAmount
from autocat.data.datasets import get_training_data, get_project_data
from autocat.models.evaluation import get_scorer

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

## Model
Load the pre-trained, optimized model and make predictions on the test set.

In [81]:
TRAINING_DATA = '../data/processed/train_2018-08-24.csv'
model, X, y = get_svm_model_v1()

In [82]:
predictions = model.predict(X)

In [89]:
results = pd.DataFrame.from_records(list(zip(y, predictions)), columns=['Actual', 'Predicted'])
results['Correct'] = results.Actual == results.Predicted
print(results.Correct.value_counts(normalize=False))
print(results.Correct.value_counts(normalize=True))

True     4149
False    1461
Name: Correct, dtype: int64
True     0.739572
False    0.260428
Name: Correct, dtype: float64
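The overall accuracy does not show which categories get confused with each other. A minimal sketch of a confusion table built with `pd.crosstab` follows; `toy_results` is a made-up stand-in with the same `Actual` and `Predicted` columns as `results`, and its values are illustrative only, not taken from the model.

```python
import pandas as pd

# Toy stand-in for the `results` frame built above (illustrative values only)
toy_results = pd.DataFrame({
    'Actual':    ['Meat', 'Meat', 'Frozen', 'Frozen', 'Frozen'],
    'Predicted': ['Meat', 'Frozen', 'Frozen', 'Frozen', 'Meat'],
})

# Rows are the true category, columns are the predicted category;
# off-diagonal cells count misclassifications
confusion = pd.crosstab(toy_results['Actual'], toy_results['Predicted'])
print(confusion)
```

Passing `results['Actual']` and `results['Predicted']` instead of the toy columns should give the same view for the real predictions.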


In [96]:
results.query('Correct == False').groupby(['Actual']).size().sort_values(ascending=True)

Actual
Meat                     27
Packaged Grocery        130
Body Care               196
Packaged Tea            210
Vitamins                285
Frozen                  290
Refrigerated Grocery    323
dtype: int64
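Raw error counts favor small categories: a category with few products can only contribute few errors. One way to put the counts above in context is to divide by the number of products in each category. The helper `error_rate_by_category` below is a hypothetical name introduced for this sketch, and the toy frame is illustrative only; it assumes the `Actual` and `Correct` columns created earlier.

```python
import pandas as pd

def error_rate_by_category(results: pd.DataFrame) -> pd.DataFrame:
    """Count products and misclassifications per actual category."""
    # Flip `Correct` so that True marks an error, then group by the true label
    errors = (~results['Correct']).groupby(results['Actual'])
    summary = pd.DataFrame({
        'Total': errors.size(),
        'Errors': errors.sum().astype(int),
    })
    summary['ErrorRate'] = summary['Errors'] / summary['Total']
    return summary.sort_values('ErrorRate', ascending=True)

# Toy data (illustrative values only, not from the model)
toy = pd.DataFrame({
    'Actual':    ['Meat', 'Meat', 'Frozen', 'Frozen', 'Frozen', 'Frozen'],
    'Predicted': ['Meat', 'Frozen', 'Frozen', 'Frozen', 'Frozen', 'Meat'],
})
toy['Correct'] = toy.Actual == toy.Predicted
toy_summary = error_rate_by_category(toy)
print(toy_summary)
```

Calling `error_rate_by_category(results)` would apply the same summary to the real predictions.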

## Financial Evaluation

Load the original data and make sure it matches our training data.

In [98]:
project_data = get_project_data()
project_data.head()

,UniversalProductCode,AvgUnitsShipped,StdUnitsShipped,MinUnitsShipped,MaxUnitsShipped,AvgNetAmount,StdNetAmount,MinNetAmount,MaxNetAmount,NumberOfTransactions,NumberOfTransactionDays,GLCategory
0,69765869205,1.0,0.0,1.0,1.0,37.18375,2.495539,34.92,40.5,8,8,Packaged Grocery
1,2250613410,1.145454,0.573699,1.0,6.0,35.645381,8.054258,0.0,76.2,275,174,Packaged Grocery
2,85688520009,4.336294,4.418702,0.0,36.0,8.962798,9.049392,0.0,79.56,1576,264,Body Care
3,89477300104,1.343834,0.910368,0.0,20.0,19.427881,13.187472,0.0,231.4,3682,290,Packaged Grocery
4,25317775304,1.579902,1.617614,-3.0,26.0,72.828589,68.665828,-123.66,1071.72,1531,237,Meat


In [121]:
X.head()

index,AvgUnitsShipped,StdUnitsShipped,MinUnitsShipped,MaxUnitsShipped,AvgNetAmount,StdNetAmount,MinNetAmount,MaxNetAmount,NumberOfTransactions,NumberOfTransactionDays
677,1.018181,0.19527,0.0,4.0,50.620929,10.015416,0.0,203.84,495,218
4971,0.993097,0.334028,0.0,8.0,18.98197,6.503373,0.0,156.16,3477,290
4762,2.088541,1.485281,1.0,13.0,9.058697,6.42484,3.63,53.76,576,236
3317,1.139293,0.451081,0.0,4.0,22.904033,8.955612,0.0,83.2,481,218
4725,0.877192,0.425532,0.0,2.0,65.888421,31.647992,0.0,139.56,57,46


In [124]:
X.iloc[0], y.iloc[0]

(AvgUnitsShipped              1.018181
 StdUnitsShipped              0.195270
 MinUnitsShipped              0.000000
 MaxUnitsShipped              4.000000
 AvgNetAmount                50.620929
 StdNetAmount                10.015416
 MinNetAmount                 0.000000
 MaxNetAmount               203.840000
 NumberOfTransactions       495.000000
 NumberOfTransactionDays    218.000000
 Name: 677, dtype: float64, 'Frozen')

In [119]:
project_data.loc[677]

UniversalProductCode       4227200373
AvgUnitsShipped               1.01818
StdUnitsShipped               0.19527
MinUnitsShipped                     0
MaxUnitsShipped                     4
AvgNetAmount                  50.6209
StdNetAmount                  10.0154
MinNetAmount                        0
MaxNetAmount                   203.84
NumberOfTransactions              495
NumberOfTransactionDays           218
GLCategory                     Frozen
Name: 677, dtype: object
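The spot check above compares a single row by eye. The same comparison can be run over every row; the sketch below assumes that the index of `X` refers to row labels in `project_data` and that `y` is in the same order as `X`. `check_alignment` is a hypothetical helper written for this sketch, not part of `autocat`, and the toy frames are illustrative only.

```python
import numpy as np
import pandas as pd

def check_alignment(features, source, labels, label_col='GLCategory'):
    """Return True if every feature row and label matches the source data."""
    # Pull the source rows in the same order as the feature matrix;
    # .loc raises a KeyError if an index label is missing from the source
    aligned = source.loc[features.index]
    values_match = np.allclose(
        features.to_numpy(dtype=float),
        aligned[features.columns].to_numpy(dtype=float),
        equal_nan=True,
    )
    labels_match = (aligned[label_col].to_numpy() == np.asarray(labels)).all()
    return bool(values_match and labels_match)

# Toy data (illustrative values only)
toy_source = pd.DataFrame(
    {'AvgNetAmount': [10.0, 20.0, 30.0],
     'GLCategory': ['Meat', 'Frozen', 'Meat']},
    index=[0, 1, 2],
)
toy_X = toy_source.loc[[2, 0], ['AvgNetAmount']]
toy_y = toy_source.loc[[2, 0], 'GLCategory']
print(check_alignment(toy_X, toy_source, toy_y))
```

With the real data the call would be `check_alignment(X, project_data, y)`; `np.allclose` is used instead of exact equality to tolerate floating-point round-off from the CSV round trip.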

Load the product stats that include average price and total volume