# Reanalysis

---

## Overview

Additional clarifying information was provided by the Behaviorally team.
While they had previously identified a set of products as the level of analysis, what they meant was the subset of UPCs associated with those products (I don't know what they are using 'UPC' for, but it's certainly not the common usage of ["Universal Product Code"](https://www.barcode-us.info/upc-codes/)).
Nisha mentioned that the UPCs in the data correspond to individual images of interest, which is truly the correct level of analysis.
They are interested _only_ in the UPCs that were actually run, which makes sense.
The problem is just that they didn't identify these initially; these images of interest will be a further subset of the UPCs associated with the products in the previously identified set.

From the debrief in [analysis.ipynb](notebooks/analysis.ipynb):

> It turns out there are only 51 images of interest to Behaviorally.
> This changes the analysis substantially.
> Choosing to see this as something positive; narrowing things down this much allows us to really focus in on things.
>
> Starting with narrowing the predicted ONS values Tony had set up, this presents a much more focused approach to the rest of the data, and will make it much easier to drill down into specifics.


## Setup


### Working Directory

This just helps with using local imports from the larger project in the notebook.

In [1]:
cd ../

/Users/chrismessier/work/behaviorally


### Imports

In [2]:
import os

import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
from numpy import random as rng
import matplotlib as mpl
from matplotlib import pyplot as plt
from google.protobuf.struct_pb2 import Struct
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_pb2, status_code_pb2

import processors
import tools

#### Config

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
# for replicability
rng.seed(42)
# rng.seed(304)

In [5]:
%matplotlib inline

mpl.rcParams['figure.figsize'] = (12, 9)

sns.set(
    style='darkgrid'
)

In [6]:
from config import (
    KEY_METRICS,
    KEY_CONDITIONALS,
    TO_NORMALIZE,
    ONS_ANALYSIS_JOB_NUMBERS,  # Job Numbers of interest for this
)

In [7]:
N = 51  # sample size for images, keep LOW for dev

#### Functions

In [8]:
def plot_series(data, x, y, title=None, period_length=13):
    g = sns.lineplot(
        x=x,
        y=y,
        hue='Image ID', # hue='Brand Name', # hue="Image Name",  # hue="Product",
        data=data,
        legend=True
    )

    plt.title(title)
    plt.xlim((0,156))  # for the items not just in single periods
    plt.xlabel(x)
    plt.ylabel(y)
    plt.tight_layout()
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

    # one tick per report date; pd.DatetimeIndex guarantees .strftime is available
    dates = pd.DatetimeIndex(data['report_dates'].unique())
    g.set_xticks(range(len(dates)))  # set the ticks first
    g.set_xticklabels(dates.strftime('%b %d, %Y'))

    # only show the label on every `period_length`-th tick
    for i, label in enumerate(g.xaxis.get_ticklabels()):
        label.set_visible((i + 1) % period_length == 0)

    plt.xticks(rotation=45)

    plt.show()


## Analysis

### Background

Now that I have all of the information that I need for this analysis, I can finally string things together in a sensible manner, cleaning up what I had initially done in the [project-forensics.ipynb](./project-forensics.ipynb) file.

In [9]:
# relative to the project root set in "Working Directory" above
CLEANED_DATA = os.path.join("data", "behaviorally_merged_sales_data.csv")

In [10]:
df = pd.read_csv(CLEANED_DATA, index_col=0)
print("df shape:", df.shape)
df = df.infer_objects()
df.dropna(inplace=True)  # drop any row with a missing value
print("df shape:", df.shape)
df['REPORT'] = pd.Categorical(df['REPORT'])

df shape: (55848, 22)
df shape: (35906, 22)


In [11]:
df.sample(1)

| Unnamed: 0 | Image Name | Image ID | Raw ONS Line and Pack | job_number | upc | upc_10 | UPC 10 digit | Product | Dollar Sales | Dollar Sales % Change vs YA | ... | Unit Share of Category | Unit Share of Category Year Ago | Unit Share of SubCategory | Unit Share of SubCategory Year Ago | Price per Unit | Price per Unit % Change vs YA | Category Name | Sub-Category Name | Brand Name | REPORT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 46 | AB474_LINE15.jpg | LINE15 | 89.0 | AB474 | 20735110225 | 2073511000.0 | 2073511000.0 | TURKEY HILL FROZEN COOKIE AND CREAM ICE CREAM ... | 260221.616285 | 0.00616 | ... | 0.214314 | 0.166309 | 0.245011 | 0.189175 | 2.97168 | -0.160924 | ICE CREAM/SHERBET | ICE CREAM | TURKEY HILL | 128.0 |


In [12]:
# NOTE moved to tools.infer_report_dates
# from itertools import count

# T = 156  # total number of reports

# START_DATE = '3/31/2019'  # just from looking at the Index page from their report spreadsheet.

# report_dates = {}
# t_0 = pd.to_datetime(START_DATE)  # just from looking at the Index page from their report spreadsheet.
# for i in count(start=1):
#     t_i = t_0 + pd.Timedelta(i, unit='w')  # period t_i
#     report_dates[float(i)] = t_i
#     if i-1 == T:
#         break

# df['report_dates'] = df['REPORT'].apply(lambda x: report_dates.get(x))
# df['report_dates'] = pd.to_datetime(df['report_dates'])

In [13]:
# 'REPORT' is the ndx column for the timeseries, or "report number"
df['report_dates'] = tools.infer_report_dates(df['REPORT'])

KeyError: 'REPORT'
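
For reference, the commented-out logic above amounts to mapping report number $i$ to the start date plus $i$ weeks. Here is a minimal, vectorized sketch of that mapping. `infer_report_dates_sketch` is a hypothetical stand-in, not the actual `tools.infer_report_dates` (whose implementation is not shown here), and it assumes one report per week starting from the `START_DATE` noted in the commented-out cell.

```python
import pandas as pd

# Start date taken from the commented-out cell above.
START_DATE = "3/31/2019"


def infer_report_dates_sketch(reports, start_date=START_DATE):
    """Map report numbers (1, 2, ...) to dates, assuming one report per week.

    Hypothetical stand-in for `tools.infer_report_dates`.
    """
    t_0 = pd.to_datetime(start_date)
    # Categorical / object columns are cast to float so they can be scaled to days.
    weeks = pd.Series(reports).astype(float)
    return t_0 + pd.to_timedelta(weeks * 7, unit="D")


demo = pd.Series(pd.Categorical([1.0, 2.0, 156.0]))
demo_dates = infer_report_dates_sketch(demo)
print(demo_dates)
```

Casting to float first matters because `'REPORT'` is stored as a `pd.Categorical` in this notebook, and categoricals cannot be multiplied directly.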

Like I said in the debrief, switching to this _greatly_ narrowed set of images adds a lot of value to looking at things at the level of an individual product image. We have narrowed the products of interest down to the aforementioned 51, and we have also narrowed the scope of the analysis down to just one of the 5-ish images that we have for each product, so we can avoid the headaches of the widely varying One Number Score (ONS) values between different images of the same product.
All of this makes looking at individual results so much easier.
I think something that could be useful is showing the lifecycle of products across periods.
Let me take a look at that now!

As we are still waiting on the update from Nisha on the specific images of interest, I'll simply sample an item at random and look at it across all of the "Periods".
This means looking at a random `'upc'` value, because this will be the base identifying unit for the data.

WRONG!!!

After going through this once: right now in the prepared data there is a many-to-one correspondence between `'Image Name'` and `'upc'`, but from what @nisha has been reporting, it should be a one-to-one correspondence.
So I'm just going to switch to looking at the `'Image Name'`.
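
A quick way to see this kind of many-to-one correspondence is to count distinct image names per `'upc'`. The sketch below uses made-up toy data in place of the prepared sales data, and it only checks one direction (several images per upc); the reverse check would group by `'Image Name'` instead.

```python
import pandas as pd

# Toy data standing in for the prepared sales data (values are made up).
toy = pd.DataFrame({
    "upc": [111, 111, 222, 333],
    "Image Name": ["A_LINE1.jpg", "A_PACK1.jpg", "B_LINE1.jpg", "C_LINE1.jpg"],
})

# Number of distinct images attached to each upc; anything > 1 breaks one-to-one.
images_per_upc = toy.groupby("upc")["Image Name"].nunique()
many_to_one = images_per_upc[images_per_upc > 1]

print(many_to_one)
print("one-to-one:", many_to_one.empty)
```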


However, they're only interested in the `LINE` Images, so, let's narrow it down to that and get our random image again.

In [None]:
# Sanity check: every 'Image ID' should contain either 'LINE' or 'PACK',
# so line-vs-pack can be treated as binary. Print anything unexpected.
is_line_or_pack = df['Image ID'].str.contains('LINE|PACK')
for image_id in df.loc[~is_line_or_pack, 'Image ID']:
    print(image_id)
    

In [None]:
# keep only the 'LINE' images (case-insensitive match on the ID)
df = df[df['Image ID'].str.upper().str.contains('LINE')].copy()


Back to a random image.

Slight change, so I am taking another pass. I am going to look at a set of images, so that I can then just pass in the images of interest.
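
One way to structure this so that the real images of interest can be dropped in later is a small selector that falls back to a random sample. This is a sketch only: `select_images` and its parameters are hypothetical names, and it uses NumPy's `default_rng` rather than the seeded legacy `rng` used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def select_images(df, images_of_interest=None, n=51, seed=42):
    """Return rows for the given images, or for a random sample of `n` images.

    Hypothetical helper; `images_of_interest` would be the list from Behaviorally.
    """
    if images_of_interest is None:
        names = df["Image Name"].unique()
        generator = np.random.default_rng(seed)
        # Sample without replacement so the selected images are distinct.
        images_of_interest = generator.choice(
            names, size=min(n, len(names)), replace=False
        )
    return df[df["Image Name"].isin(images_of_interest)].copy()


toy = pd.DataFrame({
    "Image Name": ["a_LINE1.jpg", "a_LINE1.jpg", "b_LINE2.jpg", "c_LINE3.jpg"],
    "Dollar Sales": [1.0, 2.0, 3.0, 4.0],
})
explicit = select_images(toy, images_of_interest=["a_LINE1.jpg"])
sampled = select_images(toy, n=2)
print(explicit.shape, sampled["Image Name"].nunique())
```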

In [None]:
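image_names = df['Image Name'].unique()
# sample without replacement so we get N distinct images (or all of them, if fewer exist)
X = rng.choice(image_names, size=min(N, len(image_names)), replace=False)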
print(f"Our random images are: {X}!")


In [None]:
df_ = df[df['Image Name'].isin(X)].copy()
print(f"{df_.shape}")


In [None]:
df_.reset_index(inplace=True, drop=True)

### Normalize


In [None]:
df_norm = df_[['Image Name'] + TO_NORMALIZE].copy()
df_norm = df_norm.groupby('Image Name').transform(lambda x: (x - x.mean()) / x.std())
df_norm.dropna(axis=1, inplace=True)
print(df_norm.shape)

In [None]:
# add in the normalized figures

def norm_column_format(c):
    """'Dollar Sales' -> 'dollar_sales_norm'"""
    return '_'.join([s.lower() for s in c.split()] + ['norm'])

# df_norm shares df_'s (reset) index, so column assignment aligns row by row
for col in df_norm.columns:
    df_[norm_column_format(col)] = df_norm[col]

In [None]:
df_.sample(2)

In [None]:
x_val = "REPORT"

for y in df_.columns:
    if '_norm' in y:
        plot_series(df_, x_val, y, y);

for y in KEY_METRICS:
    plot_series(df_, x_val, y, y);

In [None]:
df_['Brand Name'].value_counts()

Perfect!
This _does_ map to one observation from each of the 156 periods.




Was this image _not_ shown in period $t_{-1}$?
That would be one way to gauge impact.

Perhaps this was just one of the items where there _wasn't_ a change to packaging or marketing material over the course of 3 years, but that seems odd, so I'm going to resample a few times to confirm.
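
To make the "was this image shown in every period?" question concrete, one option is to count the reports each image appears in and note its first and last report. This is a minimal sketch on made-up data, assuming `'REPORT'` holds numeric report numbers (as in the sample row earlier) and that there are 156 reports in total.

```python
import pandas as pd

T = 156  # total number of reports, as noted earlier in the notebook

# Toy data: image "a" appears in every report, image "b" only from report 100 on.
toy = pd.DataFrame({
    "Image Name": ["a"] * T + ["b"] * (T - 99),
    "REPORT": list(range(1, T + 1)) + list(range(100, T + 1)),
})

# Per-image coverage: first report, last report, and number of distinct reports.
coverage = toy.groupby("Image Name")["REPORT"].agg(
    first_report="min", last_report="max", n_reports="nunique"
)
coverage["complete"] = coverage["n_reports"] == T
print(coverage)
```

An image whose `first_report` is greater than 1 would be a candidate for the "not shown in the previous period" comparison.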

In [None]:
# x_ = rng.choice(df_['upc'].unique())
# print(f"Our random upc is: {x_}!")

# df_ = df_[df_['upc'] == x_].copy()
# print(f"{df_.shape}")


In [None]:
# df_.sample(1)

In [None]:
# assert df_['Product'].nunique() == 1, 'check'
# name = df_['Product'].values[0]

# print(name)

Given the smaller set, it might be worth the time to come up with more readable product names.
Particularly if there isn't, say, more than one type of 'Gold Bond', or if there is, hopefully they can be differentiated by size, i.e. 'Gold Bond - 4oz', 'Gold Bond - 8oz', 'Gold Bond - 16oz' (I have no idea how accurate these are...).

In [None]:
df_.head()

In [None]:
x = 'Raw ONS Line and Pack'
y = 'Dollar Sales'

# both series must come from df_; df has a different index, so .corr would misalign rows
df_[x].corr(df_[y])

In [None]:
x, *_ = KEY_METRICS  # treat the first key metric as the $x$-value

for y in KEY_METRICS:  # NOTE: only the $y$-values are plotted; the $x$-value itself is skipped.
    if y == x:
        print("skipping", x)
        continue
    
    plot_series(df_, 'REPORT', y)


_Very_ high degree of intercorrelation between these series; however, they do differ somewhat.
Look at the values between $t_{r}$ and  $t_{s}$ on the plot above. <ephemeral>

I think there's use in looking at this for each single product.
Demonstrate the periodicity, seasonality, underlying trends, etc.
They may honestly just need someone to do that basic timeseries analysis, because it did also seem like they wanted _us_ to do their basic regression analysis as well...
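
As a rough idea of what that basic timeseries pass could look like with pandas alone, the sketch below uses a synthetic weekly series (not real sales data). It estimates the trend with a centered rolling mean and checks for a yearly cycle with autocorrelation. The 52-week window is an assumption that fits weekly reports with annual seasonality.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series: linear trend + yearly (52-week) cycle + noise. Not real data.
generator = np.random.default_rng(0)
t = np.arange(156)
sales = pd.Series(
    100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 52) + generator.normal(0, 1, t.size)
)

# Trend: centered 52-week rolling mean, which averages out a yearly cycle.
trend = sales.rolling(window=52, center=True).mean()

# Seasonality: autocorrelation of the detrended series at a one-year and a half-year lag.
detrended = (sales - trend).dropna()
lag_52 = detrended.autocorr(lag=52)
lag_26 = detrended.autocorr(lag=26)
print("lag 52:", lag_52)
print("lag 26:", lag_26)
```

A strongly positive autocorrelation at lag 52 together with a strongly negative one at lag 26 is the signature of an annual cycle. On real data the same two steps would be applied per image.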
 

### Recreating the distributions

In [None]:
df_.shape

In [None]:
df.shape

In [None]:
X = rng.choice(df['Image Name'], size=(N,))  # use the configured sample size rather than a hard-coded 51

In [None]:
X

In [None]:
df_ = df[df['Image Name'].isin(X)].copy()

In [None]:
df_.sample(2)

In [None]:
d = {}  # metric name -> Pearson correlation with ONS

ONS = "Raw ONS Line and Pack"

for k in TO_NORMALIZE:
    R = df_[ONS].corr(df_[k])
    d[k] = R
    print(k, R)

In [None]:
df_.info()

In [None]:
df_[ONS].corr(df_['Dollar Sales'])