# Analysis

---

## Overview

While the work done in [correlations.ipynb](notebooks/correlations.ipynb) was good, it's important to summarize and present it cleanly to the customers.
In addition, there were some other results that @bryan and I wanted to see.
This notebook provides a clean, succinct means of doing this follow-up analysis and presentation.


## Setup


### Working Directory

This just helps with using local imports from the larger project to the notebook.

In [1]:
cd ../

/Users/chrismessier/work/behaviorally


### Imports

In [2]:
import os

import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
from numpy import random as rng
from matplotlib import pyplot as plt
from google.protobuf.struct_pb2 import Struct
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_pb2, status_code_pb2

import processors
import tools

#### Plotting Config

In [3]:
%matplotlib inline
sns.set(
    style='darkgrid'
)

In [4]:
from config import ONS_ANALYSIS_JOB_NUMBERS  # Job Numbers of interest for this

## Analysis

### Background

Now that I have all of the information I need for this analysis, I can finally string things together in a sensible manner, cleaning up what I had initially done in the [project-forensics.ipynb](./project-forensics.ipynb) file.

In [5]:
CLEANED_DATA = "/Users/chrismessier/work/behaviorally/data/behaviorally_merged_sales_data.csv"

In [6]:
df = pd.read_csv(CLEANED_DATA, index_col=0)

In [7]:
df = df.infer_objects()

In [8]:
for c in df.columns:
    print(c)

Image Name
Image ID
Raw ONS Line and Pack
job_number
upc
upc_10
UPC 10 digit
Product
Dollar Sales
Dollar Sales % Change vs YA
Unit Sales
Unit Sales % Change vs YA
Unit Share of Category
Unit Share of Category Year Ago
Unit Share of SubCategory
Unit Share of SubCategory Year Ago
Price per Unit
Price per Unit % Change vs YA
Category Name
Sub-Category Name
Brand Name
REPORT


In [9]:
df.head(2)

Unnamed: 0,Image Name,Image ID,Raw ONS Line and Pack,job_number,upc,upc_10,UPC 10 digit,Product,Dollar Sales,Dollar Sales % Change vs YA,...,Unit Share of Category,Unit Share of Category Year Ago,Unit Share of SubCategory,Unit Share of SubCategory Year Ago,Price per Unit,Price per Unit % Change vs YA,Category Name,Sub-Category Name,Brand Name,REPORT
0,AB031_LINE1.jpg,LINE1,42.0,AB031,12000012754,1200001000.0,1200001000.0,LIPTON CITRUS GREEN TEA LIQUID PREPARED TEA PL...,754888.359249,-0.015202,...,0.835315,0.906572,1.230062,1.303955,1.603665,0.101032,TEA/COFFEE - READY-TO-DRINK,CANNED AND BOTTLED TEA,LIPTON,1.0
1,AB031_LINE2.jpg,LINE2,48.0,AB031,12000012754,1200001000.0,1200001000.0,LIPTON CITRUS GREEN TEA LIQUID PREPARED TEA PL...,754888.359249,-0.015202,...,0.835315,0.906572,1.230062,1.303955,1.603665,0.101032,TEA/COFFEE - READY-TO-DRINK,CANNED AND BOTTLED TEA,LIPTON,1.0


One key thing noted in my review with @bryan was that there are issues with some of the variables of interest provided by Behaviorally.
For a couple of them, correlations just don't make causal sense: the "price-per-unit" columns.
Decisions that affect the price per unit will also affect the packaging/aesthetics; it's not that the package suddenly changes and the price per unit then changes to reflect this difference.
There may also be issues with collinearity between predictors: for instance, Dollar Sales is essentially Unit Sales multiplied by Price per Unit.
Aberrant values should be dropped promptly, with notes as to why each one was dropped.
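
As a quick sanity check on the collinearity point, the sketch below takes the Lipton row shown in the `df.head(2)` output above and compares Dollar Sales against Unit Sales times Price per Unit. The one-row frame and the tolerance `rtol=1e-4` are my own choices for illustration; the real check would run over the full `df` before the price columns are dropped.

```python
import numpy as np
import pandas as pd

# One row copied from the df.head(2) output above (illustration only).
row = pd.DataFrame({
    "Dollar Sales": [754888.359249],
    "Unit Sales": [470727.082124],
    "Price per Unit": [1.603665],
})

# If Dollar Sales is derived from the other two columns, this product
# should reproduce it up to rounding in the displayed price.
implied = row["Unit Sales"] * row["Price per Unit"]
matches = np.isclose(implied, row["Dollar Sales"], rtol=1e-4)  # rtol is an assumed tolerance

print(matches.all())
```

If this holds across the dataset, correlating against both Dollar Sales and Unit Sales is largely measuring the same thing twice, scaled by price.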

One more thing: flag the columns whose conditional distributions are worth looking at (tagged `[CONDITION]` in the list below)!


In [10]:
for c in df.columns:
    print(f"'{c}',")

'Image Name',
'Image ID',
'Raw ONS Line and Pack',
'job_number',
'upc',
'upc_10',
'UPC 10 digit',
'Product',
'Dollar Sales',
'Dollar Sales % Change vs YA',
'Unit Sales',
'Unit Sales % Change vs YA',
'Unit Share of Category',
'Unit Share of Category Year Ago',
'Unit Share of SubCategory',
'Unit Share of SubCategory Year Ago',
'Price per Unit',
'Price per Unit % Change vs YA',
'Category Name',
'Sub-Category Name',
'Brand Name',
'REPORT',


In [11]:
COLUMNS = [
    'Image Name',  # [CONDITION] can be used as as
    'Image ID',  # [CONDITION] this is the line/pack info.
    'Raw ONS Line and Pack',  # this is our "X" value. ## NOTE fix the xlim to be 0-100!
    # 'job_number',  # dropped in favor of 'Product'
    # 'upc',  # should be repeat of 'job_number'
    # 'upc_10',  # should be repeat of 'job_number'
    # 'UPC 10 digit',  # should be repeat of 'job_number'
    'Product',  # [CONDITION] keeping this for viz labeling purposes
    'Dollar Sales',  # using this as the key analytical category, given the price effects
    # 'Dollar Sales % Change vs YA',  # removing the 'temporal' categories for now to simplify the analysis
    'Unit Sales',
    # 'Unit Sales % Change vs YA',
    'Unit Share of Category',
    # 'Unit Share of Category Year Ago',
    'Unit Share of SubCategory',
    # 'Unit Share of SubCategory Year Ago',
    # 'Price per Unit',  # correlation simply doesn't make sense here
    # 'Price per Unit % Change vs YA',  # correlation simply doesn't make sense here
    'Category Name',  # [CONDITION]
    'Sub-Category Name',  # [CONDITION]
    # 'Brand Name',  # [CONDITION]
    'REPORT',  # [CONDITION]
]

In [12]:
COLUMNS

['Image Name',
 'Image ID',
 'Raw ONS Line and Pack',
 'Product',
 'Dollar Sales',
 'Unit Sales',
 'Unit Share of Category',
 'Unit Share of SubCategory',
 'Category Name',
 'Sub-Category Name',
 'REPORT']

I'm simply going to drop all of the columns that are of no interest for analysis.

In [13]:
df.drop(columns=[c for c in df.columns if c not in COLUMNS], inplace=True)

In [14]:
df.head()

Unnamed: 0,Image Name,Image ID,Raw ONS Line and Pack,Product,Dollar Sales,Unit Sales,Unit Share of Category,Unit Share of SubCategory,Category Name,Sub-Category Name,REPORT
0,AB031_LINE1.jpg,LINE1,42.0,LIPTON CITRUS GREEN TEA LIQUID PREPARED TEA PL...,754888.359249,470727.082124,0.835315,1.230062,TEA/COFFEE - READY-TO-DRINK,CANNED AND BOTTLED TEA,1.0
1,AB031_LINE2.jpg,LINE2,48.0,LIPTON CITRUS GREEN TEA LIQUID PREPARED TEA PL...,754888.359249,470727.082124,0.835315,1.230062,TEA/COFFEE - READY-TO-DRINK,CANNED AND BOTTLED TEA,1.0
2,AB031_LINE3.jpg,LINE3,42.0,LIPTON CITRUS GREEN TEA LIQUID PREPARED TEA PL...,754888.359249,470727.082124,0.835315,1.230062,TEA/COFFEE - READY-TO-DRINK,CANNED AND BOTTLED TEA,1.0
3,AB031_LINE4.jpg,LINE4,40.0,LIPTON CITRUS GREEN TEA LIQUID PREPARED TEA PL...,754888.359249,470727.082124,0.835315,1.230062,TEA/COFFEE - READY-TO-DRINK,CANNED AND BOTTLED TEA,1.0
4,AB111_LINE1.jpg,LINE1,45.0,SOUR PATCH KIDS ASSORTED SOUR CHEWY CANDY PIEC...,,,,,NON-CHOCOLATE CANDY,NON CHOCOLATE CHEWY CANDY,1.0


In [15]:
conditions = [
    'Image Name',
    'Image ID',
    'Product',  
    'Category Name',
    'Sub-Category Name',
    'REPORT',
]
metrics = [
    # 'Product',  # for normalization
    'Raw ONS Line and Pack',  # x, all else y's
    'Dollar Sales',
    'Unit Sales',
    'Unit Share of Category',
    'Unit Share of SubCategory'
]

The number of missing values might present a problem.
For now, however, I'm simply going to drop the rows that contain them.

In [16]:
df.dropna(inplace=True)
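
Before dropping anything, it may be worth recording how much is being lost. The sketch below uses a small made-up frame (the values are placeholders, not the real data) to show one way of counting missing values per column and the number of rows that `dropna` removes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame; values are placeholders.
toy = pd.DataFrame({
    "Product": ["LIPTON", "SOUR PATCH KIDS", "LIPTON"],
    "Dollar Sales": [754888.36, np.nan, 754888.36],
    "Unit Sales": [470727.08, np.nan, 470727.08],
})

# Count missing values per column, then compare row counts around dropna().
missing_per_column = toy.isna().sum()
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(f"dropped {rows_before - rows_after} of {rows_before} rows")
```

Running the same three lines on `df` just before the `dropna` cell would give the notes needed to justify the drop.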

In [17]:
df.shape

(39808, 11)

In [18]:
df_conditions = df[conditions].copy()
df_metrics = df[metrics].copy()

I'm just going to reuse the plotting code from [correlations.ipynb](notebooks/correlations.ipynb).

I can't get the normalization to work, and there's no time to revisit it, so I'm just tackling the graphs for now.
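
For whenever there is time to revisit it, here is a minimal sketch of one possible normalization: a z-score computed within each group. The helper `zscore_within` and the toy data are hypothetical, not something from the project. One caveat worth checking first: in the `df.head()` output above, every image of the same product shares identical sales figures, so grouping by `Product` would give a standard deviation of zero and `NaN` results; a coarser grouping such as `Category Name` may be needed.

```python
import pandas as pd

def zscore_within(frame, group_col, value_cols):
    """Z-score each value column within its group (hypothetical helper)."""
    grouped = frame.groupby(group_col)[value_cols]
    # transform() returns frames aligned to the original index,
    # so the arithmetic below lines up row by row.
    return (frame[value_cols] - grouped.transform("mean")) / grouped.transform("std")

# Toy data: two groups on very different scales.
toy = pd.DataFrame({
    "Category Name": ["TEA", "TEA", "TEA", "CANDY", "CANDY", "CANDY"],
    "Dollar Sales": [10.0, 20.0, 30.0, 1000.0, 2000.0, 3000.0],
})

normalized = zscore_within(toy, "Category Name", ["Dollar Sales"])
print(normalized)
```

After this step both groups sit on the same scale, which is the property the cross-category scatter plots would need.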

In [19]:
# h/t https://stackoverflow.com/a/35828995
# grouping = 'Product'
# filt_df = df_y.loc[:, df_y.columns != grouping]
# low = .05
# high = .95
# quant_df = filt_df.quantile([low, high])
# print(quant_df)
# out_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) & (x < quant_df.loc[high, x.name])], axis=0)
# pu = pd.concat([df.loc[:, grouping], filt_df], axis=1)


In [20]:
from scipy import stats
OVERWRITE = True
OUTPUT_DIR = "/Users/chrismessier/work/behaviorally/outputs/analysis"

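x, *Y = df_metrics.columns  # first metric is the x-axis; the rest are the y-values
df_ = df.copy()
# df_y = df_[Y].copy()  # HACK indexing here
# Keep only rows where every y-metric lies within 3 standard deviations of its mean.
df_ = df_[(np.abs(stats.zscore(df_[Y])) < 3).all(axis=1)]

# df_ = pd.concat([df.loc[:, x], df_y], axis=1)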

In [21]:
frmt = lambda s: '-'.join(s.upper().split()).replace('%', 'PCT')

In [25]:
os.makedirs(OUTPUT_DIR, exist_ok=True)

for y in Y:
    print(f"{x} VS. {y}")

    R = df_[x].corr(df_[y])  # Pearson correlation (pandas default)

    title = f"{x} vs. {y}\n$R={R:.3f}$"
    image_name = f"{frmt(x)}_v_{frmt(y)}.png"
    output_path = os.path.join(OUTPUT_DIR, image_name)

    if not os.path.exists(output_path) or OVERWRITE:

        plt.title(title)
        plt.scatter(df_[x], df_[y])
        plt.xlabel(x)
        plt.xlim((0, 100))  # once normalized this makes no sense
        plt.ylabel(y)
        plt.tight_layout()

        plt.savefig(output_path)

        plt.clf()  # clear the figure so the next plot starts fresh


Raw ONS Line and Pack VS. Dollar Sales
Raw ONS Line and Pack VS. Unit Sales
Raw ONS Line and Pack VS. Unit Share of Category
Raw ONS Line and Pack VS. Unit Share of SubCategory


<Figure size 432x288 with 0 Axes>

In [23]:
def quick_plot(a, b, title=None):
    """Scatter `a` against `b`, with an optional title."""
    plt.scatter(a, b)
    if title:
        plt.title(title)

Looking at the data, it's pretty obvious that I need to be doing some more "intelligent" outlier filtering, but really, I think this might be a moot point once this is normalized.
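
One candidate for the more "intelligent" filtering is the quantile approach that is commented out in the earlier cell. Below is a minimal sketch of how it could be made to work, assuming rows should be kept only when every metric falls between the 5th and 95th percentiles. The helper `quantile_filter` and the toy data are hypothetical.

```python
import pandas as pd

def quantile_filter(frame, columns, low=0.05, high=0.95):
    """Keep rows whose values in `columns` all lie within [low, high] quantiles."""
    bounds = frame[columns].quantile([low, high])
    # Comparing a DataFrame with a Series aligns the Series on column names.
    within = (frame[columns] >= bounds.loc[low]) & (frame[columns] <= bounds.loc[high])
    return frame[within.all(axis=1)]

# Toy data with one obvious outlier in "y".
toy = pd.DataFrame({
    "x": list(range(1, 22)),
    "y": list(range(1, 21)) + [1000],
})

filtered = quantile_filter(toy, ["x", "y"])
print(len(toy), "->", len(filtered))
```

Unlike the z-score cut, the quantile bounds are not pulled around by the outliers themselves, though they always trim a fixed share of rows even when nothing is wrong.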

In [None]:
df_y = df_y.drop(columns='Product')

In [None]:
df_.head()

In [None]:
df_

## 4/26 Debrief

It turns out there are only 51 images of interest to Behaviorally.
This changes the analysis substantially.
I'm choosing to see this as a positive: narrowing things down this much lets us focus on the images that actually matter.

Starting by narrowing down the predicted ONS values Tony had set up, this gives a much more focused approach to the rest of the data and will make it much easier to drill down into specifics.
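
A minimal sketch of that narrowing step, assuming the images of interest arrive as a list of file names that match the `Image Name` column. The list `IMAGES_OF_INTEREST` and the toy frame below are placeholders; the real list of 51 images is not reproduced here.

```python
import pandas as pd

# Hypothetical subset; the real list would hold the 51 image names.
IMAGES_OF_INTEREST = ["AB031_LINE1.jpg", "AB111_LINE1.jpg"]

# Toy stand-in for the predicted ONS values.
toy = pd.DataFrame({
    "Image Name": ["AB031_LINE1.jpg", "AB031_LINE2.jpg", "AB111_LINE1.jpg"],
    "Raw ONS Line and Pack": [42.0, 48.0, 45.0],
})

# Keep only the rows for images of interest.
subset = toy[toy["Image Name"].isin(IMAGES_OF_INTEREST)]

# Flag any requested image that has no row in the data at all.
missing = sorted(set(IMAGES_OF_INTEREST) - set(toy["Image Name"]))

print(subset)
print("missing:", missing)
```

Checking `missing` up front avoids silently analyzing fewer images than expected.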
