# Marketing Images

---

## Overview

This notebook narrows the analysis to the images of interest provided by @nisha on the Behaviorally team.


## Setup


### Working Directory

Changing to the project root lets the notebook import local modules from the larger project (`processors`, `tools`, `config`).

In [1]:
cd ../

/Users/chrismessier/work/behaviorally
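
As an alternative to changing directory, the project root can be added to `sys.path` so that local imports resolve without altering the working directory. This is only a sketch: the helper name `add_project_root` is hypothetical, and it assumes the notebook lives one level below the project root.

```python
import sys
from pathlib import Path


def add_project_root(root: Path) -> str:
    """Put `root` at the front of sys.path (once) and return it as a string."""
    root_str = str(root.resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


# Assumption: the notebook sits one directory below the project root.
project_root = add_project_root(Path.cwd().parent)
```

Relative data paths still resolve against the working directory with this approach, so it suits imports rather than file access.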


### Imports

In [2]:
import os

import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
from numpy import random as rng
import matplotlib as mpl
from matplotlib import pyplot as plt
from google.protobuf.struct_pb2 import Struct
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_pb2, status_code_pb2

import processors
import tools

#### Config

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
# for replicability
rng.seed(42)
# rng.seed(304)

In [5]:
%matplotlib inline

mpl.rcParams['figure.figsize'] = (12, 9)

sns.set(
    style='darkgrid'
)

In [6]:
from config import KEY_METRICS,\
    KEY_CONDITIONALS,\
    TO_NORMALIZE,\
    ONS_ANALYSIS_JOB_NUMBERS,\
    IMAGES_OF_INTEREST

In [7]:
N = 51  # sample size for images, keep LOW for dev

#### Functions

In [8]:
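def plot_series(data, x, y, title=None, period_length=13):
    """Plot `y` against `x` as one line per 'Image ID'.

    Expects `data` to have a 'report_dates' column; only every
    `period_length`-th tick label is shown to keep the axis readable.
    """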
    g = sns.lineplot(
        x=x,
        y=y,
        hue='Image ID', # hue='Brand Name', # hue="Image Name",  # hue="Product",
        data=data,
        legend=True
    )

    plt.title(title)
    plt.xlim((0,156))  # for the items not just in single periods
    plt.xlabel(x)
    plt.ylabel(y)
    plt.tight_layout()
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

    # pd.to_datetime ensures each date is a Timestamp, so strftime is available
    report_dates = pd.to_datetime(data['report_dates'].unique())
    g.set_xticks(range(len(report_dates)))  # set the ticks before the labels
    g.set_xticklabels([t.strftime('%b %d, %Y') for t in report_dates])

    # show only every `period_length`-th label
    for i, label in enumerate(g.xaxis.get_ticklabels()):
        label.set_visible((i + 1) % period_length == 0)

    plt.xticks(rotation=45)

    plt.show()


## Analysis

We were provided some instructions by @nisha:

>In the file below, we have now included one more column where we have added the names of the images linked to the UPC codes. This is column G, existing images.
>
>[Copy of IRI Data Product List_3.21.22.xlsx](https://nam12.safelinks.protection.outlook.com/ap/x-59584e83/?url=https%3A%2F%2Fbehav-my.sharepoint.com%2F%3Ax%3A%2Fg%2Fpersonal%2Fskye_guggino_behaviorally_com%2FETyxthcQfnNCjDWRYGpOyAUBTTWqgqR0PhscL_OcLCtz3w%3Fe%3DYxFLgH&data=05%7C01%7CNisha.Yadav%40behaviorally.com%7C26bd690bb01c4c12456508da28923aff%7C01c55d0a027b47dfa914b27f0803c8b4%7C0%7C0%7C637866905773423479%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=guPGh3h2T6WhrVY1a3JRqhhs%2FRU8cgquntzsShArcaE%3D&reserved=0)
>
>In column H, Scraped Online images, you will find names of image files scraped from the internet. Most of the images in this column match or are very similar to the images in column H. there are a few that are a bit different, 13 in all, and are highlighted in red.
>
>For the analysis you shared on Tuesday, please do another run where you filter by ONS scores for the images listed in column H. then, please apply another filter and take out the 13 red ones from column G.
>
>Now, as we were discussing on Tuesday, we not only need a correlation analysis, but also need a linear regression, to understand the relationship between the ONS and sales metrics from IRI. Once you have the clean file, would you run this as well?
>
>Thanks, ny





In [9]:
images_of_interest = IMAGES_OF_INTEREST

In [10]:
len(images_of_interest)

51

In [11]:
data = "/Users/chrismessier/work/behaviorally/data/behaviorally_merged_sales_data.csv"

In [12]:
df = pd.read_csv(data, index_col=0)

In [13]:
df.sample(1)

Unnamed: 0,Image Name,Image ID,Raw ONS Line and Pack,job_number,upc,upc_10,UPC 10 digit,Product,Dollar Sales,Dollar Sales % Change vs YA,...,Unit Share of Category,Unit Share of Category Year Ago,Unit Share of SubCategory,Unit Share of SubCategory Year Ago,Price per Unit,Price per Unit % Change vs YA,Category Name,Sub-Category Name,Brand Name,REPORT
154,AD112_PACK1B.jpg,PACK1B,15.0,AD112,3024480008797,244800086.0,244800086.0,REMY MARTIN 1738 ACCORD ROYAL REGULAR LIQUEUR ...,,,...,,,,,,,SPIRITS/LIQUOR,SPIRITS,REMY MARTIN 1738 ACCORD ROYAL,151.0


In [14]:
df = df[df['Image Name'].isin(images_of_interest)].copy()

In [15]:
df = df.infer_objects()

In [16]:
print(df.shape)

(7644, 22)


In [None]:
df

In [17]:
df = df.dropna(subset=['REPORT']).copy()  # could this be dropping images?

In [18]:
print(df.shape)

(6240, 22)


Dropping rows with no `REPORT` value removed 1,404 observations (7,644 rows before, 6,240 after), presumably because they had no report date.
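
To answer the question in the code comment (whether the drop removes whole images rather than just some of their rows), the sets of image names before and after `dropna` can be compared. This is a minimal sketch on invented toy data; the helper name `images_lost_by_dropna` is hypothetical, and only the column names `Image Name` and `REPORT` come from the notebook.

```python
import pandas as pd


def images_lost_by_dropna(frame, key="Image Name", required="REPORT"):
    """Return the `key` values with no rows left once rows missing `required` are dropped."""
    before = set(frame[key])
    after = set(frame.dropna(subset=[required])[key])
    return sorted(before - after)


# Toy data (invented for illustration): "b.jpg" has no report at all.
toy = pd.DataFrame({
    "Image Name": ["a.jpg", "a.jpg", "b.jpg"],
    "REPORT": [1.0, None, None],
})
lost = images_lost_by_dropna(toy)
```

Here `lost` contains only `b.jpg`, because `a.jpg` keeps one row with a report. Running the same helper on the real frame before the drop would show which images disappear entirely.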

In [19]:
df['Report'] = df['REPORT'].astype(int)

In [20]:
df.sample()

Unnamed: 0,Image Name,Image ID,Raw ONS Line and Pack,job_number,upc,upc_10,UPC 10 digit,Product,Dollar Sales,Dollar Sales % Change vs YA,...,Unit Share of Category Year Ago,Unit Share of SubCategory,Unit Share of SubCategory Year Ago,Price per Unit,Price per Unit % Change vs YA,Category Name,Sub-Category Name,Brand Name,REPORT,Report
249,AD513_LINE1.jpg,LINE1,30.0,AD513,35000985026,3500099000.0,3500099000.0,COLGATE RENEWAL COOL MINT ANTICAVITY ANTIGINGI...,536.369951,,...,,0.000435,,7.99,,TOOTHPASTE,TOOTHPASTE,COLGATE RENEWAL,94.0,94



Which images of interest are missing from the data?

In [22]:
# build the lookup once instead of converting the column on every iteration
present_images = set(df['Image Name'])
missing_images = [f for f in images_of_interest if f not in present_images]

In [23]:
missing_images

['AC296_LINE1.jpg',
 'AD387_PACK1.jpg',
 'AD692_LINE1.png',
 'L1501_LINE1.jpg',
 'AD296_PACK1.jpg',
 'AD518_LINE1.jpg',
 'AD672_LINE1.jpg',
 'L2115_LINE1.jpg',
 'AD615_LINE1.jpg',
 'AD507_LINE1A.jpg',
 'AD697_LINE1.jpg',
 'AC638_PACK1.jpg']


These images are missing from our predictions.
The file `L2115_LINE1.jpg` was already known to be missing, and their team has provided examples for it; the others were only identified now.

In [None]:
df.shape

In [None]:
df.sample(2)

In [None]:
df_norm = df[['Image Name'] + TO_NORMALIZE].copy()
df_norm = df_norm.groupby('Image Name').transform(lambda x: (x - x.mean()) / x.std())

# df_norm.dropna(axis=1, inplace=True)
print(df_norm.shape)
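
The normalization above is a z-score computed within each image: each value has its image's mean subtracted and is divided by its image's sample standard deviation. The sketch below shows this on invented toy data. The helper name `zscore_within` is hypothetical, and `Dollar Sales` is used only as an example column; whether it is in `TO_NORMALIZE` is an assumption.

```python
import pandas as pd


def zscore_within(frame, group_col, value_cols):
    """Standardize each value column to mean 0 and sample std 1 within each group."""
    grouped = frame.groupby(group_col)[value_cols]
    return grouped.transform(lambda s: (s - s.mean()) / s.std())


# Toy data (invented): two images whose sales sit on very different scales.
toy = pd.DataFrame({
    "Image Name": ["a.jpg"] * 3 + ["b.jpg"] * 3,
    "Dollar Sales": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})
toy_norm = zscore_within(toy, "Image Name", ["Dollar Sales"])
```

Both images end up with the values -1, 0 and 1, so they become comparable despite the tenfold difference in scale. A group with a single observation has an undefined sample standard deviation and yields `NaN`.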

In [None]:
df_norm.head()

In [None]:
columns_to_copy = ['Raw ONS Line and Pack', 'REPORT', 'Product']
df_norm[columns_to_copy] = df.loc[:, columns_to_copy].copy()

In [None]:
df_norm.head()

In [None]:
from collections import defaultdict

correlations = defaultdict(list)
for report in df['REPORT'].unique():
    df_ = df[df['REPORT'] == report].copy()
    for x in TO_NORMALIZE:

        R = df_['Raw ONS Line and Pack'].corr(df_[x])
        correlations[x].append(R)
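
The instructions quoted earlier also ask for a linear regression between the ONS score and the sales metrics. A minimal sketch of a one-variable least-squares fit follows, using NumPy only and invented toy numbers. The helper name `fit_line` is hypothetical, and this is not necessarily the regression setup the final analysis should use.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept, ignoring pairs with NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
    r = np.corrcoef(x[keep], y[keep])[0, 1]  # Pearson correlation on the same pairs
    return slope, intercept, r


# Toy numbers (invented): sales = 2 * ons + 5, plus one pair with a missing score.
ons = [10.0, 20.0, 30.0, 40.0, np.nan]
sales = [25.0, 45.0, 65.0, 85.0, 99.0]
slope, intercept, r = fit_line(ons, sales)
```

In the notebook's terms, `x` would be `df_['Raw ONS Line and Pack']` and `y` one of the `TO_NORMALIZE` columns for a given report. The slope adds the size of the effect to the strength that the correlation already reports.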


In [None]:
for k, v in correlations.items():
    # print(k, len(v), v)
    g = sns.displot(v, kind='kde')
    # use the larger absolute bound so the symmetric range never clips the curve
    m_ = max(abs(b) for b in g.ax.get_xbound())
    g.ax.set_xbound((-1 * m_, m_))
    plt.title(k)

In [None]:
plt.show()

In [None]:
x = 'Raw ONS Line and Pack'

for y in TO_NORMALIZE:
    title = f"{y} Over Time"
    plt.title(title)
    sns.lineplot(
        data=df_norm,
        x='REPORT',
        y=y,
        # hue='Product',
        # style='Category Name',
        legend=True
    )
    # sns.displot(x=x, data=correlations)
    plt.show()


The next step is to show the effect of normalization.

In [None]:
df.sample()