# Profiling the rules execution

Profiling the C.E. execution requires extra dependencies, which are listed in the `requirements-profiling.txt` file.
They include common Python profiling tools, visualization packages, and some libraries for high-performance execution:

- psutil
- line_profiler
- memory_profiler
- pyinstrument
- pandas-profiling
- matplotlib
- graphviz
- snakeviz
- py-heat-magic
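
Before running the notebook, it can help to confirm that these packages are importable. The sketch below uses only the standard library. The import names are assumptions (for example, `pandas-profiling` is assumed to import as `pandas_profiling` and `py-heat-magic` as `heat`), so adjust them to your environment.

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above; the name used with
# pip does not always match the name used in an import statement.
PROFILING_MODULES = [
    "psutil", "line_profiler", "memory_profiler", "pyinstrument",
    "pandas_profiling", "matplotlib", "graphviz", "snakeviz", "heat",
]


def find_missing(modules):
    """Return the module names that cannot be found on the current path."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(PROFILING_MODULES)
print("missing:", missing or "none")
```

Anything reported as missing can then be installed from `requirements-profiling.txt`.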


## Fetch data and prepare C.E.

In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# Setup path to import the ce python modules
import os
import sys
from pathlib import PurePath

# Add the root of the custom Python modules to sys.path
root_path = PurePath(os.getcwd()).parents[0]
if str(root_path) not in set(sys.path):
    sys.path.insert(0, str(root_path))
# sys.path

### Setup and data loading

The DB credentials are loaded from the `.env` file.
Other configuration and parameters needed to run this notebook are defined in this section.
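
How the `.env` file is read is not shown in this notebook. As a minimal standard-library sketch, assuming the file holds plain `KEY=VALUE` lines, a loader could look like the following. The `load_env_file` helper is hypothetical, and dedicated tools such as python-dotenv handle quoting and edge cases more thoroughly.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a file into os.environ (existing keys win)."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


load_env_file()
print(f"db host: {os.getenv('POSTGRES_HOST')}")
```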


In [41]:
import pandas as pd
from typing import Dict

import warnings
warnings.filterwarnings('ignore')

from ce.algo import data_service, opportunities
from ce.utils import format_output, get_cell_name
from ce.algo.rules import RULES_BY_ID

In [6]:
# Verify if the PostgreSQL DB credentials and URL are correct
print(f"db host: {os.getenv('POSTGRES_HOST')}")
print(f"db port: {os.getenv('POSTGRES_PORT')}")
print(f"db name: {os.getenv('POSTGRES_DB')}")
print(f"db user: {os.getenv('POSTGRES_USER')}")


db host: localhost
db port: None
db name: newron
db user: algo


## Load the data


In [27]:
### Configuration 

country='DE'
group="LADIESSHAVERS"
period=2654

# parameters = {"item_group_code": "PTV_FLAT", "country_code": "DE",  "period_seq": 2650},
# parameters = {"item_group_code": "HEADPHONES_MOB_HEADSETS", "country_code": "DE",  "period_seq": 2650},


In [30]:
# Get the data service object to load the data for the target cell
ds = data_service.get_data_service()
data = ds.get_data(group, country, period)
market_configuration = ds.get_market_configuration(group, country)

In [31]:
data.head()

| | item_id | item_group_code | country_code | period_seq | my_rank | competitor_item_id | loc_distance_euclidean | distance_euclidean | distribution_overlap | brand | ... | brand_competitor | price_competitor | loc_price_competitor | salesunits_competitor | loc_salesunits_competitor | wgt_distr_competitor | loc_wgt_distr_competitor | no_of_periods_in_focus | tpr_efficiency_own | loc_tpr_efficiency_own |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2713050 | LADIESSHAVERS | DE | 2650 | 2.0 | 94712562.0 | 1.0 | 0.699802 | 0.789622 | PHILIPS | ... | PHILIPS | 29.096423 | 1.0 | 277.784664 | 1.0 | 0.584716 | 1.0 | 5 | | 1.0 |
| 1 | 2713050 | LADIESSHAVERS | DE | 2650 | 1.0 | 159473303.0 | 1.0 | 0.544313 | 0.789622 | PHILIPS | ... | SILK'N | 22.580141 | 1.0 | 35.5 | 1.0 | 0.087581 | 1.0 | 5 | | 1.0 |
| 2 | 2713050 | LADIESSHAVERS | DE | 2651 | 2.0 | 94712562.0 | 1.0 | 0.699802 | 0.789622 | PHILIPS | ... | PHILIPS | 29.002616 | 1.0 | 211.571429 | 1.0 | 0.847673 | 1.0 | 5 | | 1.0 |
| 3 | 2713050 | LADIESSHAVERS | DE | 2651 | 1.0 | 159473303.0 | 1.0 | 0.544313 | 0.789622 | PHILIPS | ... | SILK'N | 21.856154 | 1.0 | 13.0 | 1.0 | 0.01906 | 1.0 | 5 | | 1.0 |
| 4 | 2713050 | LADIESSHAVERS | DE | 2652 | 2.0 | 94712562.0 | 1.0 | 0.699802 | 0.789622 | PHILIPS | ... | PHILIPS | 29.034604 | 1.0 | 268.428571 | 1.0 | 0.57783 | 1.0 | 5 | | 1.0 |


In [32]:
market_configuration

| | country_code | item_group_code | low_price_percentage | high_price_percentage | medium_price_percentage | lower_price_range_threshold | upper_price_range_threshold |
|---|---|---|---|---|---|---|---|
| 19 | DE | LADIESSHAVERS | 0.15 | 0.05 | 0.1 | 80.01 | 196.06 |


## Run a specific rule

In [36]:
# Specify which rule you want to run.
# See all available rules in RULES_BY_ID
RULES_BY_ID.keys()

dict_keys([49, 53, 61, 82])

In [40]:
rule_module = RULES_BY_ID[61]
output_df = rule_module.evaluate(data, market_configuration, period)
output_df

| | item_id | item_group_code | country_code | period_seq | my_rank | competitor_item_id | loc_distance_euclidean | distance_euclidean | brand | price_own | ... | asp_ratio | price_gap_per | price_gap | slope_own | rule_61 | loc_slope | loc_price_gap | loc | comparators | evidence_types |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 110218230 | LADIESSHAVERS | DE | 2654 | 3.0 | 103389959.0 | 1.0 | 0.235513 | PHILIPS | 68.9 | ... | 0.844739 | 0.155261 | 1 | -1.4 | 1 | 1.0 | 1.0 | 1.0 | {140571552.0, 141746306.0, 141514709.0, 103389... | [PRICE_AND_SALESUNITS_HISTORY] |
| 1 | 110218230 | LADIESSHAVERS | DE | 2654 | 2.0 | 140571552.0 | 1.0 | 0.19679 | PHILIPS | 68.9 | ... | 0.920168 | 0.079832 | 0 | -1.4 | 1 | 1.0 | 1.0 | 1.0 | {140571552.0, 141746306.0, 141514709.0, 103389... | [PRICE_AND_SALESUNITS_HISTORY] |
| 2 | 110218230 | LADIESSHAVERS | DE | 2654 | 5.0 | 141514709.0 | 1.0 | 0.393864 | PHILIPS | 68.9 | ... | 0.676088 | 0.323912 | 0 | -1.4 | 1 | 1.0 | 1.0 | 1.0 | {140571552.0, 141746306.0, 141514709.0, 103389... | [PRICE_AND_SALESUNITS_HISTORY] |
| 3 | 110218230 | LADIESSHAVERS | DE | 2654 | 1.0 | 141746306.0 | 1.0 | 0.038825 | PHILIPS | 68.9 | ... | 0.81139 | 0.18861 | 0 | -1.4 | 1 | 1.0 | 1.0 | 1.0 | {140571552.0, 141746306.0, 141514709.0, 103389... | [PRICE_AND_SALESUNITS_HISTORY] |
| 4 | 151604190 | LADIESSHAVERS | DE | 2654 | 2.0 | 154411263.0 | 1.0 | 0.474205 | PHILIPS | 398.76 | ... | 0.973209 | 0.026791 | 1 | -10.1 | 1 | 1.0 | 1.0 | 1.0 | {164851343.0, 154411263.0} | [PRICE_AND_SALESUNITS_HISTORY] |
| 5 | 151604190 | LADIESSHAVERS | DE | 2654 | 1.0 | 164851343.0 | 1.0 | 0.464869 | PHILIPS | 398.76 | ... | 1.100261 | 0.100261 | 0 | -10.1 | 1 | 1.0 | 1.0 | 1.0 | {164851343.0, 154411263.0} | [PRICE_AND_SALESUNITS_HISTORY] |
| 6 | 155115113 | LADIESSHAVERS | DE | 2654 | 5.0 | 141745975.0 | 1.0 | 0.470501 | PANASONIC | 112.09 | ... | 1.069649 | 0.069649 | 1 | -34.1 | 1 | 1.0 | 1.0 | 1.0 | {154643155.0, 141745975.0} | [PRICE_AND_SALESUNITS_HISTORY] |
| 7 | 155115113 | LADIESSHAVERS | DE | 2654 | 4.0 | 154643155.0 | 1.0 | 0.458822 | PANASONIC | 112.09 | ... | 0.544824 | 0.455176 | 0 | -34.1 | 1 | 1.0 | 1.0 | 1.0 | {154643155.0, 141745975.0} | [PRICE_AND_SALESUNITS_HISTORY] |


## Profile the rules execution

Let's evaluate the rules' execution performance.

- `%timeit`: measures the execution time of a statement over repeated runs
- `%prun`: runs a statement under cProfile and reports time per function
- `%lprun`: reports the time spent on each line of the selected functions (provided by `line_profiler`)


> Note: You should keep in mind that profiling typically adds an overhead to your code.
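
The same measurements are available outside IPython through the standard library: `timeit` corresponds to `%timeit` and `cProfile` to `%prun`. The sketch below uses a stand-in `work` function (hypothetical, not part of the C.E. code) and times it with and without the profiler, which makes the overhead mentioned in the note visible.

```python
import cProfile
import timeit


def work(n=20_000):
    """Stand-in workload that makes many small function calls."""
    return sum(abs(i) for i in range(n))


# Plain timing: best of 3 repeats with 5 calls each.
plain = min(timeit.repeat(work, number=5, repeat=3)) / 5

# The same call under cProfile, which hooks into every function call.
profiler = cProfile.Profile()
start = timeit.default_timer()
profiler.runcall(work)
profiled = timeit.default_timer() - start

print(f"plain:    {plain * 1e3:.2f} ms per call")
print(f"profiled: {profiled * 1e3:.2f} ms per call")
```

For this reason, absolute timings are best taken from `%timeit`, while `%prun` and `%lprun` are better suited to finding where the time goes.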

In [58]:
%load_ext snakeviz

The snakeviz extension is already loaded. To reload it, use:
  %reload_ext snakeviz


#### Rule 61

In [56]:
%timeit -n 5 -r 2 RULES_BY_ID[61].evaluate(data, market_configuration, period)

195 ms ± 3.1 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)


In [61]:
%prun -D rule_61.prof RULES_BY_ID[61].evaluate(data, market_configuration, period)

 
*** Profile stats marshalled to file 'rule_61.prof'. 


         534632 function calls (529541 primitive calls) in 0.422 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    95349    0.023    0.000    0.037    0.000 {built-in method builtins.isinstance}
    34306    0.010    0.000    0.014    0.000 generic.py:10(_check)
     1496    0.009    0.000    0.009    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     1976    0.008    0.000    0.021    0.000 blocks.py:292(getitem_block)
      310    0.008    0.000    0.017    0.000 managers.py:238(_rebuild_blknos_and_blklocs)
  561/229    0.008    0.000    0.038    0.000 base.py:293(__new__)
    52363    0.007    0.000    0.007    0.000 {built-in method builtins.getattr}
5105/4796    0.006    0.000    0.008    0.000 {built-in method numpy.array}
1390/1262    0.006    0.000    0.082    0.000 series.py:201(__init__)
        2    0.006    0.003    0.131    0.066 generic.py:1535(filter)
     2654    0.006    0.000    0.012    0.000 generic

In [62]:
!snakeviz rule_61.prof

snakeviz web server started on 127.0.0.1:8080; enter Ctrl-C to exit
http://127.0.0.1:8080/snakeviz/%2FUsers%2Fjean.metz%2Fworkspace%2FGFK%2Fconsulting-engine%2Fprofiling%2Frule_61.prof
^C

Bye!
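
A dumped `.prof` file can also be inspected without the browser, using the standard library's `pstats` module. The sketch below assumes a file produced by `cProfile`. The file name `demo.prof` and the `work` function are placeholders, and `rule_61.prof` would be loaded the same way.

```python
import cProfile
import io
import pstats


def work():
    """Placeholder workload so the example has something to profile."""
    return sorted(str(i) for i in range(5_000))


# Write a profile to disk, comparable to `%prun -D <file>`.
profiler = cProfile.Profile()
profiler.runcall(work)
profiler.dump_stats("demo.prof")

# Load it back and print the five most expensive entries by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats("demo.prof", stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```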


## Profile all rules together

In [63]:
%timeit -n 5 -r 2 opportunities.process_rules(data, market_configuration, period)

321 ms ± 5.04 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)
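
To see how that total splits across rules, each entry of a registry such as `RULES_BY_ID` can be timed separately. The sketch below uses a stand-in registry of plain functions, since the real rule modules need the notebook's data. `time_each` is a hypothetical helper, not part of the `ce` package.

```python
import timeit


def time_each(registry, *args, number=5, repeat=2):
    """Return {rule_id: best seconds per call}, slowest first."""
    timings = {}
    for rule_id, func in registry.items():
        runs = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
        timings[rule_id] = min(runs) / number
    return dict(sorted(timings.items(), key=lambda kv: kv[1], reverse=True))


# Stand-in rules. With the notebook objects, the registry could be built as
# {rid: mod.evaluate for rid, mod in RULES_BY_ID.items()} (an assumption).
demo_rules = {
    49: lambda rows: sum(rows),
    61: lambda rows: sorted(rows, reverse=True),
}

for rule_id, seconds in time_each(demo_rules, list(range(50_000))).items():
    print(f"rule {rule_id}: {seconds * 1e3:.2f} ms")
```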


In [66]:
# cProfile: break down the execution function by function.
# You can use `-s cumulative` to sort the output by cumulative time.
%prun -D all_rules.prof opportunities.process_rules(data, market_configuration, period)

 
*** Profile stats marshalled to file 'all_rules.prof'. 


         848196 function calls (840304 primitive calls) in 0.751 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   155776    0.042    0.000    0.068    0.000 {built-in method builtins.isinstance}
    56901    0.018    0.000    0.026    0.000 generic.py:10(_check)
  941/419    0.015    0.000    0.076    0.000 base.py:293(__new__)
     2507    0.014    0.000    0.014    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    88171    0.013    0.000    0.013    0.000 {built-in method builtins.getattr}
    11359    0.011    0.000    0.028    0.000 common.py:1460(is_extension_array_dtype)
7718/7280    0.011    0.000    0.013    0.000 {built-in method numpy.array}
      412    0.010    0.000    0.023    0.000 managers.py:238(_rebuild_blknos_and_blklocs)
    11402    0.009    0.000    0.014    0.000 base.py:413(find)
2306/1860    0.009    0.000    0.025    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}

In [67]:
!snakeviz all_rules.prof

snakeviz web server started on 127.0.0.1:8080; enter Ctrl-C to exit
http://127.0.0.1:8080/snakeviz/%2FUsers%2Fjean.metz%2Fworkspace%2FGFK%2Fconsulting-engine%2Fprofiling%2Fall_rules.prof
^C

Bye!


#### Use line profiler

To use `line_profiler`, you would normally need to modify your code and decorate the functions you want to profile with `@profile`.

When profiling with the extension directly in Jupyter, this is not necessary. Instead, you load the extension by adding `%load_ext line_profiler` to a cell and pass the functions to profile with `-f`.


In [68]:
%load_ext line_profiler

In [76]:
%lprun?

Docstring:
Execute a statement under the line-by-line profiler from the
line_profiler module.

Usage:
  %lprun -f func1 -f func2 <statement>

The given statement (which doesn't require quote marks) is run via the
LineProfiler. Profiling is enabled for the functions specified by the -f
options. The statistics will be shown side-by-side with the code through the
pager once the statement has completed.

Options:

-f <function>: LineProfiler only profiles functions and methods it is told
to profile.  This option tells the profiler about these functions. Multiple
-f options may be used. The argument may be any expression that gives
a Python function or method object. However, one must be careful to avoid
spaces that may confuse the option parser.

-m <module>: Get all the functions/methods in a module

One or more -f or -m options are required to get any useful results.

-D <filename>: dump the raw statistics out to a pickle file on disk. The
usual extension for this is ".lprof".

In [71]:
# the -f argument specifies the function you'd like to profile
%lprun -f opportunities.process_rules opportunities.process_rules(data, market_configuration, period)

Timer unit: 1e-06 s

Total time: 0.954264 s
File: /Users/jean.metz/workspace/GFK/consulting-engine/ce/algo/opportunities.py
Function: process_rules at line 37

Line #      Hits         Time  Per Hit   % Time  Line Contents
    37                                           def process_rules(
    38                                               data: pd.DataFrame, market_config: pd.DataFrame, period_seq: int, rule_id: Optional[int] = None
    39                                           ) -> pd.DataFrame:
    40                                               """Processes each rule in turn, returning a DataFrame with the results."""
    41                                           
    42         1          7.0      7.0      0.0      rules = [RULES_BY_ID[rule_id]] if rule_id is not None else list(RULES_BY_ID.values())
    43                                           
    44         1          1.0      1.0      0.0      opportunities = []
    45                                           
   

In [75]:
# let's dive into the slowest rule function: rule_61
from ce.algo.rules import rule_61
%lprun -f rule_61.evaluate opportunities.process_rules(data, market_configuration, period)

Timer unit: 1e-06 s

Total time: 0.531388 s
File: /Users/jean.metz/workspace/GFK/consulting-engine/ce/algo/rules/rule_61.py
Function: evaluate at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
    13                                           def evaluate(data: pd.DataFrame, market_config: pd.DataFrame, period_seq: int) -> pd.DataFrame:
    14         1         18.0     18.0      0.0      if data.empty:
    15                                                   return pd.DataFrame()
    16                                           
    17                                               # Ensure that distribution overlap is more than 50%
    18         1      24809.0  24809.0      4.7      data = filter_distribution_overlap(data=data, period_seq=period_seq, threshold=0.5)
    19                                           
    20         1        883.0    883.0      0.2      filter_same_brand = data["brand"] != data["brand_competitor"]
    21         1        422.0    

### Filter by brand

In [80]:
brand_distribution = data['brand'].value_counts()
brand_distribution.head(20)

BRAUN              1166
PHILIPS             771
UNBRANDED           394
REMINGTON           280
BEURER              181
SMOOTHSKIN          110
PANASONIC           110
SILK'N              105
FINISHING TOUCH      80
BABYLISS             65
KEMEI                47
AILORIA              44
VEET                 42
MECO                 37
ROWENTA              36
VITALMAXX            36
GENIUS               36
CREMAX               35
SCHIELE              27
DAGA                 27
Name: brand, dtype: int64

In [85]:
# Use three brands as targets to try out the rules execution with different market sizes.
brands = brand_distribution.index
n = len(brands)
brand_1 = brands[0]
brand_2 = brands[3]
brand_3 = brands[n-1]

In [86]:
# cell.keys()

In [87]:
# period_seq = cell['period_seq']
# config = cell['config']

cell_b1 = data[data["brand"] == brand_1]
cell_b2 = data[data["brand"] == brand_2]
cell_b3 = data[data["brand"] == brand_3]

print(f"{brand_1} shape: {cell_b1.shape}")
print(f"{brand_2} shape: {cell_b2.shape}")
print(f"{brand_3} shape: {cell_b3.shape}")

BRAUN shape: (1166, 27)
REMINGTON shape: (280, 27)
MARSKE shape: (1, 27)


In [88]:
%timeit -n 5 -r 2 opportunities.process_rules(cell_b1, market_configuration, period)

223 ms ± 8.95 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)


In [89]:
%timeit -n 5 -r 2 opportunities.process_rules(cell_b2, market_configuration, period)

145 ms ± 3.42 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)


In [90]:
%timeit -n 5 -r 2 opportunities.process_rules(cell_b3, market_configuration, period)

49.9 ms ± 2.93 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)
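
These three timings suggest that run time does not scale linearly with the number of rows: even the one-row cell takes about 50 ms, which points to a fixed per-call cost. One way to examine this more systematically is to time the same callable on progressively larger inputs. The sketch below uses a stand-in workload. `time_by_size` and `fake_rules` are hypothetical, and in the notebook the callable would be `opportunities.process_rules` applied to slices of `data`.

```python
import timeit


def time_by_size(func, inputs, number=5, repeat=2):
    """Return a list of (input size, best seconds per call) tuples."""
    results = []
    for item in inputs:
        runs = timeit.repeat(lambda: func(item), number=number, repeat=repeat)
        results.append((len(item), min(runs) / number))
    return results


def fake_rules(rows):
    """Stand-in for the rules pipeline: a sort plus one pass over the rows."""
    return [value * 2 for value in sorted(rows)]


# Sizes mirror the three brand subsets used above.
inputs = [list(range(n, 0, -1)) for n in (1, 280, 1166)]
for size, seconds in time_by_size(fake_rules, inputs):
    print(f"{size:>5} rows: {seconds * 1e6:.1f} us per call")
```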


In [91]:
%prun -D brand_1.prof opportunities.process_rules(cell_b1, market_configuration, period)

 
*** Profile stats marshalled to file 'brand_1.prof'. 


         454070 function calls (450001 primitive calls) in 0.349 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    86686    0.021    0.000    0.035    0.000 {built-in method builtins.isinstance}
    32171    0.010    0.000    0.014    0.000 generic.py:10(_check)
  540/269    0.008    0.000    0.041    0.000 base.py:293(__new__)
    50365    0.007    0.000    0.007    0.000 {built-in method builtins.getattr}
     1498    0.006    0.000    0.006    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      718    0.006    0.000    0.046    0.000 algorithms.py:1616(take_nd)
     7180    0.006    0.000    0.016    0.000 common.py:1460(is_extension_array_dtype)
     2855    0.006    0.000    0.015    0.000 _dtype.py:321(_name_get)
     7211    0.006    0.000    0.009    0.000 base.py:413(find)
4081/3875    0.005    0.000    0.006    0.000 {built-in method numpy.array}
18422/15382    0.005    0.000    0.006    0.000 {built-in metho

In [92]:
!snakeviz brand_1.prof

snakeviz web server started on 127.0.0.1:8080; enter Ctrl-C to exit
http://127.0.0.1:8080/snakeviz/%2FUsers%2Fjean.metz%2Fworkspace%2FGFK%2Fconsulting-engine%2Fprofiling%2Fbrand_1.prof
^C

Bye!
