# Project Forensics

---

## Setup


### Working Directory

Changing to the project root makes local imports from the larger project available in the notebook.

In [1]:
cd ../

/Users/chrismessier/work/behaviorally


### Imports

In [2]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

In [3]:
import os

from google.protobuf.struct_pb2 import Struct
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_pb2, status_code_pb2

#### Plotting Config

In [4]:
%matplotlib inline
sns.set(
    style='darkgrid'
)

### Methods

In [5]:
def load_spreadsheet(f):
    pass

### Client Initialization

In [6]:
channel = ClarifaiChannel.get_json_channel()
stub = service_pb2_grpc.V2Stub(channel)

In [7]:
API_KEY = None

In [8]:
metadata = (('authorization', f'Key {API_KEY}'),)  # key value referenced from config module
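`API_KEY` is left as `None` above, so the header built here would literally read `Key None`. A minimal sketch of one way to supply the key without hard-coding it, assuming it is exported in an environment variable. The variable name `CLARIFAI_API_KEY` and the helper `build_auth_metadata` are hypothetical and not part of the Clarifai client:

```python
import os


def build_auth_metadata(env_var="CLARIFAI_API_KEY"):
    """Return the gRPC metadata tuple, failing early if the key is missing.

    The environment variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before initializing the client.")
    return (("authorization", f"Key {api_key}"),)
```

Failing early here avoids sending requests that would only be rejected later for a missing key.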

## Analysis

### Background

### Spreadsheets

In [9]:
doc_4 = '/Users/chrismessier/work/behaviorally/data/ONS_output_Tony_Round 1.xlsx'

## Missing IRI Data

One thing revealed on the 4/22 call with the client is that there were gaps in coverage relative to what we _should_ be seeing.
We take the ground-truth set of concepts of interest to be all of the items on the white rows of the '50 products' sheet of the [IRI Data Product List_3.21.22.xlsx spreadsheet](https://docs.google.com/spreadsheets/d/1HCsjUscUtgbhRXdIfxcJCENzz2ZhwicnJcwcEoZuoYw/edit?usp=sharing).

In [10]:
# manual copy/paste from the spreadsheet.
iri_job_numbers = [
    "AD905.00",
    "AD112.00",
    "AD456.00",
    "L2362.00",
    "AC296.00",
    "AB031.00",
    "AB220.00",
    "L2310.00",
    "AD386.00",
    "AC808.00",
    "L1874.00",
    "AB653.00",
    "AC904.00",
    "AD411.00",
    "AD387.00",
    "AD474.00",
    "AD719.00",
    "AB111.00",
    "AD517.00",
    "L2403.00",
    "AB219.00",
    "AB474.00",
    "AD230.00",
    "AD457.00",
    "AC624.00",
    "AD087.00c",
    "AD692.00",
    "L1501.00",
    "AD296.00",
    "AD445.00",
    "AB185.00",
    "L2331.00",
    "AD518.00",
    "AD216.00",
    "AD275.00",
    "AD145.00",
    # "AB759.00",  # accidentally added due to highlighting mix-up
    "AD513.00",
    "AD324.00",
    "AC870.00",
    "AC856.00",
    "AD672.00",
    "AB249.00",
    "L2115.00",
    "AD083.00",
    "AD615.00",
    "AD507.00",
    "AD644.00",
    "AD697.00",
    "AD437.00",
    "AC638.00",
    "AD549.00",
]

In [11]:
clean_iri_job_numbers = [s.split('.')[0] for s in iri_job_numbers]
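Splitting on `.` silently discards whatever follows it, so an entry such as `"AD087.00c"` is cleaned to `AD087` without any warning. A minimal sketch of a check that surfaces such entries, assuming every well-formed job number ends in exactly `.00` (an assumption read off the list above, not a documented rule). `split_job_number` is a hypothetical helper:

```python
def split_job_number(raw, expected_suffix="00"):
    """Split 'AD905.00' into ('AD905', True).

    The boolean is False when the part after the first '.' is not the
    expected suffix, e.g. for 'AD087.00c'.
    """
    base, _, suffix = raw.partition(".")
    return base, suffix == expected_suffix


sample = ["AD905.00", "AD087.00c"]
suspect = [raw for raw in sample if not split_job_number(raw)[1]]
print(suspect)  # ['AD087.00c']
```

Entries flagged this way may be worth checking against the spreadsheet before they are trusted.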

In [12]:
clean_iri_job_numbers

['AD905',
 'AD112',
 'AD456',
 'L2362',
 'AC296',
 'AB031',
 'AB220',
 'L2310',
 'AD386',
 'AC808',
 'L1874',
 'AB653',
 'AC904',
 'AD411',
 'AD387',
 'AD474',
 'AD719',
 'AB111',
 'AD517',
 'L2403',
 'AB219',
 'AB474',
 'AD230',
 'AD457',
 'AC624',
 'AD087',
 'AD692',
 'L1501',
 'AD296',
 'AD445',
 'AB185',
 'L2331',
 'AD518',
 'AD216',
 'AD275',
 'AD145',
 'AD513',
 'AD324',
 'AC870',
 'AC856',
 'AD672',
 'AB249',
 'L2115',
 'AD083',
 'AD615',
 'AD507',
 'AD644',
 'AD697',
 'AD437',
 'AC638',
 'AD549']

In [13]:
# bringing this back in, for convenience.
target_sheet = 'ONS_output_Tony'
ons_results = pd.read_excel(doc_4, sheet_name=target_sheet)

In [14]:
clean_ons_job_numbers = [s.split('_')[0] for s in ons_results['Image Name'].values] 

In [15]:
ons_jobs_set = set(clean_ons_job_numbers)
iri_jobs_set = set(clean_iri_job_numbers)

In [16]:
# NOTE this is how you get all items in the union but not intersection of two sets
# h/t: https://stackoverflow.com/a/29947893
# diff = ons_jobs_set.symmetric_difference(iri_jobs_set)
# len(diff)
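The symmetric difference mixes two separate questions: which jobs appear only in Tony's data, and which appear only in the IRI list. For a coverage check, the one-directional difference is usually the more direct tool. A small sketch with made-up job numbers (the values are illustrative, not taken from either spreadsheet):

```python
# Toy stand-ins for ons_jobs_set and iri_jobs_set; the values are invented.
ons_toy = {"A1", "A2", "A3"}
iri_toy = {"A2", "A3", "B9"}

# In either set but not in both.
either_only = ons_toy.symmetric_difference(iri_toy)

# Expected by the IRI list but absent from the ONS output.
missing_from_ons = iri_toy - ons_toy

print(sorted(either_only))       # ['A1', 'B9']
print(sorted(missing_from_ons))  # ['B9']
```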

In [17]:
print(f"""There are {len(ons_jobs_set)} total unique job_numbers in Tony's data.""")

There are 536 total unique job_numbers in Tony's data.


### Writing

To share this conveniently, I'm going to write all of it to one or more files.

In [20]:
with open('data/JobNumbers.yaml', 'w') as f:
    f.write('JobNumber:'+'\n')
    for x in iri_jobs_set:
        f.write('  - '+x+'\n')
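Iterating over a set yields its elements in no guaranteed order, so the file written above can differ from run to run. A minimal sketch of a reproducible variant that sorts before writing. `write_job_numbers` is a hypothetical helper, and the temporary path is used only for illustration:

```python
import os
import tempfile


def write_job_numbers(job_numbers, path, key="JobNumber"):
    """Write job numbers as a one-key YAML list in sorted order."""
    with open(path, "w") as f:
        f.write(f"{key}:\n")
        for job in sorted(job_numbers):
            f.write(f"  - {job}\n")


# Write a small sample to a throwaway directory and read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "JobNumbers.yaml")
write_job_numbers({"L2115", "AD905", "AB031"}, demo_path)
with open(demo_path) as f:
    print(f.read())
```

A stable ordering also keeps diffs small if the file is ever committed or compared between runs.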


In [19]:
for x in iri_jobs_set:
    if x not in ons_jobs_set:
        print(x, "is MISSING")

L2115 is MISSING
