### Objectives:
- Identify data sources for all variables of interest. Categories are based on the 2019 Canada's food guide food classification system: https://www.canada.ca/en/health-canada/services/publications/food-nutrition/2019-canada-food-guide-food-classification-system-foods-beverages-categories.html
    - vegetables and fruit
        - fruit
        - vegetables
    - grains
        - whole grains
        - whole grain and whole wheat foods
    - plant-based protein foods
        - plant-based yogurts
        - fortified plant-based cheeses
        - legumes
        - simulated meats
        - nuts and seeds
        - other plant-based foods
    - animal-based protein foods
        - yogurts and kefir
        - cheeses
        - other milk-based foods
        - red meats
            - beef
            - pork
            - lamb
            - goat
        - game meats
        - poultry
            - chicken
            - turkey
        - eggs
        - fish and shellfish
        - organ meats
    - beverages
        - water
        - fortified plant-based beverages 
        - non-fortified plant-based beverages
        - milks
        - fruit juice
        - vegetable juice
        - coffee
        - tea
        - soda
    - other foods
        - condiments, sauces, dressings
        - snack foods
        - processed meats
    - fats and oils
        - unsaturated fats and oils
        - saturated and trans fats and oils
    - unclassified foods
        - meal replacements
        - alcoholic beverages
        - herbs and spices
- Acquire 2023 monthly averages for each variable
- Where possible, extract and isolate the following variables:
    - category (if available)
    - product
    - product type (if applicable)
    - unit of measurement
    - unit quantity
    - price per unit
    - date
    - location (if available)

### Resources:
- https://www.statcan.gc.ca/en/statistical-programs/document/2301_D68_V1 (list of representative products used to calculate the CPI "basket" of goods and services)
- Primary resource: https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=1810024501 (monthly average retail prices for selected products)
    - The data resources for red meat, dairy, and produce redirect to this table
- Supplemental resources:
    - https://agriculture.canada.ca/en/market-information-system/rp/index-eng.cfm?action=pR&r=116&pdctc= (monthly average retail prices for pork, turkey, chicken, eggs)

#### Primary data source: monthly average retail prices

In [1]:
# import pandas
import pandas as pd

In [2]:
# define relative path
path = '../data_sources/raw_data'

In [6]:
# import monthly average retail prices for food and related products
monthly_avg_retail_prices_2023 = pd.read_csv(f'{path}/2023_monthly_product_averages.csv')
monthly_avg_retail_prices_2023

Unnamed: 0,REF_DATE,GEO,DGUID,Products,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,Jan-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,17.52,,,,2
1,Feb-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,17.05,,,,2
2,Mar-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,17.08,,,,2
3,Apr-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,18.17,,,,2
4,May-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,19.02,,,,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10876,May-23,British Columbia,2016A000259,"Laundry detergent, 4.43 litres",Dollars,81,units,0,v1458870250,10.11,16.21,,,,2
10877,Jun-23,British Columbia,2016A000259,"Laundry detergent, 4.43 litres",Dollars,81,units,0,v1458870250,10.11,16.75,,,,2
10878,Jul-23,British Columbia,2016A000259,"Laundry detergent, 4.43 litres",Dollars,81,units,0,v1458870250,10.11,16.71,,,,2
10879,Aug-23,British Columbia,2016A000259,"Laundry detergent, 4.43 litres",Dollars,81,units,0,v1458870250,10.11,16.22,,,,2


In [4]:
# check dtypes of dataframe
monthly_avg_retail_prices_2023.dtypes

REF_DATE          object
GEO               object
DGUID             object
Products          object
UOM               object
UOM_ID             int64
SCALAR_FACTOR     object
SCALAR_ID          int64
VECTOR            object
COORDINATE       float64
VALUE            float64
STATUS           float64
SYMBOL           float64
TERMINATED       float64
DECIMALS           int64
dtype: object

In [23]:
# filter for variables of interest
monthly_avg_retail_prices_2023 = monthly_avg_retail_prices_2023[['REF_DATE', 'GEO', 'Products', 'VALUE']]
monthly_avg_retail_prices_2023


Unnamed: 0,REF_DATE,GEO,Products,VALUE
0,Jan-23,Canada,"Beef stewing cuts, per kilogram",17.52
1,Feb-23,Canada,"Beef stewing cuts, per kilogram",17.05
2,Mar-23,Canada,"Beef stewing cuts, per kilogram",17.08
3,Apr-23,Canada,"Beef stewing cuts, per kilogram",18.17
4,May-23,Canada,"Beef stewing cuts, per kilogram",19.02
...,...,...,...,...
10876,May-23,British Columbia,"Laundry detergent, 4.43 litres",16.21
10877,Jun-23,British Columbia,"Laundry detergent, 4.43 litres",16.75
10878,Jul-23,British Columbia,"Laundry detergent, 4.43 litres",16.71
10879,Aug-23,British Columbia,"Laundry detergent, 4.43 litres",16.22


In [24]:
# cast products column to string
monthly_avg_retail_prices_2023 = monthly_avg_retail_prices_2023.astype({'Products':'string'})
monthly_avg_retail_prices_2023.dtypes

REF_DATE            object
GEO                 object
Products    string[python]
VALUE              float64
dtype: object

In [25]:
# test if product column can be split into item and units based on comma delimiter
df = monthly_avg_retail_prices_2023['Products'].str.split(',', expand = True)
df

Unnamed: 0,0,1
0,Beef stewing cuts,per kilogram
1,Beef stewing cuts,per kilogram
2,Beef stewing cuts,per kilogram
3,Beef stewing cuts,per kilogram
4,Beef stewing cuts,per kilogram
...,...,...
10876,Laundry detergent,4.43 litres
10877,Laundry detergent,4.43 litres
10878,Laundry detergent,4.43 litres
10879,Laundry detergent,4.43 litres
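
The comma split above yields two columns because no label in this extract contains more than one comma. As a hedged alternative, the sketch below splits only on the last comma with `str.rsplit(..., n=1)`, so a label with extra commas would still produce exactly two columns. The sample series is illustrative, and the multi-comma label is invented for the demonstration; it does not come from the Statistics Canada table.

```python
import pandas as pd

# Illustrative labels; the last one is hypothetical and only shows the multi-comma case.
products = pd.Series(
    [
        "Beef stewing cuts, per kilogram",
        "Laundry detergent, 4.43 litres",
        "Tea (20 bags)",
        "Made-up item, with extra comma, 500 grams",
    ],
    dtype="string",
)

# Split on the LAST comma only, so the result never has more than two columns.
split_cols = products.str.rsplit(",", n=1, expand=True)
split_cols.columns = ["product", "unit"]

# The unit part keeps the space that followed the comma, so strip it here.
split_cols["unit"] = split_cols["unit"].str.strip()

print(split_cols)
```

Labels without a comma, such as `Tea (20 bags)`, still end up with a missing unit and need the separate handling shown later in the notebook.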


In [26]:
# define a new dataframe with split product/unit data
df2 = pd.concat([monthly_avg_retail_prices_2023, monthly_avg_retail_prices_2023['Products'].str.split(',', expand=True)], axis=1).drop('Products', axis=1)
df2

Unnamed: 0,REF_DATE,GEO,VALUE,0,1
0,Jan-23,Canada,17.52,Beef stewing cuts,per kilogram
1,Feb-23,Canada,17.05,Beef stewing cuts,per kilogram
2,Mar-23,Canada,17.08,Beef stewing cuts,per kilogram
3,Apr-23,Canada,18.17,Beef stewing cuts,per kilogram
4,May-23,Canada,19.02,Beef stewing cuts,per kilogram
...,...,...,...,...,...
10876,May-23,British Columbia,16.21,Laundry detergent,4.43 litres
10877,Jun-23,British Columbia,16.75,Laundry detergent,4.43 litres
10878,Jul-23,British Columbia,16.71,Laundry detergent,4.43 litres
10879,Aug-23,British Columbia,16.22,Laundry detergent,4.43 litres


In [27]:
# rename columns 
df2.rename(columns={'REF_DATE': 'date', 'GEO':'location', 'VALUE': 'price', 0: 'product', 1: 'unit'}, inplace=True)
df2

Unnamed: 0,date,location,price,product,unit
0,Jan-23,Canada,17.52,Beef stewing cuts,per kilogram
1,Feb-23,Canada,17.05,Beef stewing cuts,per kilogram
2,Mar-23,Canada,17.08,Beef stewing cuts,per kilogram
3,Apr-23,Canada,18.17,Beef stewing cuts,per kilogram
4,May-23,Canada,19.02,Beef stewing cuts,per kilogram
...,...,...,...,...,...
10876,May-23,British Columbia,16.21,Laundry detergent,4.43 litres
10877,Jun-23,British Columbia,16.75,Laundry detergent,4.43 litres
10878,Jul-23,British Columbia,16.71,Laundry detergent,4.43 litres
10879,Aug-23,British Columbia,16.22,Laundry detergent,4.43 litres


In [28]:
# identify unique product values
df2['product'].unique().tolist()

['Beef stewing cuts',
 'Beef striploin cuts',
 'Beef top sirloin cuts',
 'Beef rib cuts',
 'Ground beef',
 'Pork loin cuts',
 'Pork rib cuts',
 'Pork shoulder cuts',
 'Whole chicken',
 'Chicken breasts',
 'Chicken thigh',
 'Chicken drumsticks',
 'Bacon',
 'Wieners',
 'Salmon',
 'Shrimp',
 'Canned salmon',
 'Canned tuna',
 'Meatless burgers',
 'Milk',
 'Soy milk',
 'Nut milk',
 'Cream',
 'Butter',
 'Margarine',
 'Block cheese',
 'Yogurt',
 'Eggs',
 'Apples',
 'Oranges',
 'Bananas',
 'Pears',
 'Lemons',
 'Limes',
 'Grapes',
 'Cantaloupe',
 'Strawberries',
 'Avocado',
 'Potatoes',
 'Sweet potatoes',
 'Tomatoes',
 'Cabbage',
 'Carrots',
 'Onions',
 'Celery',
 'Cucumber',
 'Mushrooms',
 'Iceberg lettuce',
 'Romaine lettuce',
 'Broccoli',
 'Peppers',
 'Squash',
 'Salad greens',
 'Frozen french fried potatoes',
 'Frozen green beans',
 'Frozen broccoli',
 'Frozen corn',
 'Frozen mixed vegetables',
 'Frozen peas',
 'Frozen pizza',
 'Frozen spinach',
 'Frozen strawberries',
 'White bread',
 'Fla

In [29]:
# create list of irrelevant products
irrelevant = ['Baby food', 'Infant formula', 'Deodorant', 'Toothpaste', 'Shampoo', 'Laundry detergent']

In [30]:
# remove rows where the product is irrelevant
df3 = df2[~df2['product'].isin(irrelevant)]
df3

Unnamed: 0,date,location,price,product,unit
0,Jan-23,Canada,17.52,Beef stewing cuts,per kilogram
1,Feb-23,Canada,17.05,Beef stewing cuts,per kilogram
2,Mar-23,Canada,17.08,Beef stewing cuts,per kilogram
3,Apr-23,Canada,18.17,Beef stewing cuts,per kilogram
4,May-23,Canada,19.02,Beef stewing cuts,per kilogram
...,...,...,...,...,...
10840,May-23,British Columbia,4.51,Sunflower seeds,400 grams
10841,Jun-23,British Columbia,4.21,Sunflower seeds,400 grams
10842,Jul-23,British Columbia,4.40,Sunflower seeds,400 grams
10843,Aug-23,British Columbia,4.27,Sunflower seeds,400 grams


In [31]:
# identify products that have null values for unit quantity
df3[df3['unit'].isna()]

Unnamed: 0,date,location,price,product,unit
720,Jan-23,Canada,4.27,Tea (20 bags),
721,Feb-23,Canada,4.32,Tea (20 bags),
722,Mar-23,Canada,4.31,Tea (20 bags),
723,Apr-23,Canada,4.35,Tea (20 bags),
724,May-23,Canada,4.37,Tea (20 bags),
...,...,...,...,...,...
10615,May-23,British Columbia,4.55,Tea (20 bags),
10616,Jun-23,British Columbia,4.33,Tea (20 bags),
10617,Jul-23,British Columbia,4.75,Tea (20 bags),
10618,Aug-23,British Columbia,4.78,Tea (20 bags),


In [32]:
# split all product names on the "(" delimiter to recover the unit for rows where unit is null
df4 = pd.concat([df3, df3['product'].str.split('(', expand=True)], axis=1).drop('product', axis=1)
df4

Unnamed: 0,date,location,price,unit,0,1
0,Jan-23,Canada,17.52,per kilogram,Beef stewing cuts,
1,Feb-23,Canada,17.05,per kilogram,Beef stewing cuts,
2,Mar-23,Canada,17.08,per kilogram,Beef stewing cuts,
3,Apr-23,Canada,18.17,per kilogram,Beef stewing cuts,
4,May-23,Canada,19.02,per kilogram,Beef stewing cuts,
...,...,...,...,...,...,...
10840,May-23,British Columbia,4.51,400 grams,Sunflower seeds,
10841,Jun-23,British Columbia,4.21,400 grams,Sunflower seeds,
10842,Jul-23,British Columbia,4.40,400 grams,Sunflower seeds,
10843,Aug-23,British Columbia,4.27,400 grams,Sunflower seeds,


In [33]:
# compare number of rows that are not null in new split column to number of rows that were previously null in unit column
df4[df4[1].notna()]

Unnamed: 0,date,location,price,unit,0,1
720,Jan-23,Canada,4.27,,Tea,20 bags)
721,Feb-23,Canada,4.32,,Tea,20 bags)
722,Mar-23,Canada,4.31,,Tea,20 bags)
723,Apr-23,Canada,4.35,,Tea,20 bags)
724,May-23,Canada,4.37,,Tea,20 bags)
...,...,...,...,...,...,...
10615,May-23,British Columbia,4.55,,Tea,20 bags)
10616,Jun-23,British Columbia,4.33,,Tea,20 bags)
10617,Jul-23,British Columbia,4.75,,Tea,20 bags)
10618,Aug-23,British Columbia,4.78,,Tea,20 bags)


In [34]:
# remove ")" from "20 bags)"; regex=False treats ")" as a literal character in every pandas version
df4[1] = df4[1].str.replace(")", "", regex=False)
df4[df4[1].notna()]

Unnamed: 0,date,location,price,unit,0,1
720,Jan-23,Canada,4.27,,Tea,20 bags
721,Feb-23,Canada,4.32,,Tea,20 bags
722,Mar-23,Canada,4.31,,Tea,20 bags
723,Apr-23,Canada,4.35,,Tea,20 bags
724,May-23,Canada,4.37,,Tea,20 bags
...,...,...,...,...,...,...
10615,May-23,British Columbia,4.55,,Tea,20 bags
10616,Jun-23,British Columbia,4.33,,Tea,20 bags
10617,Jul-23,British Columbia,4.75,,Tea,20 bags
10618,Aug-23,British Columbia,4.78,,Tea,20 bags
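
The split on "(" followed by removing ")" can also be done in one step. The sketch below is a possible alternative that uses `str.extract` with named groups to pull the product name and the parenthesised unit together. The regular expression and the two-row sample are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Two labels in the shape seen above: one with a parenthesised unit, one without.
products = pd.Series(["Tea (20 bags)", "Beef stewing cuts"], dtype="string")

# Named groups become column names; labels without parentheses give <NA> in both columns.
parts = products.str.extract(r"^(?P<product>[^(]+?)\s*\((?P<unit>[^)]*)\)$")

# Keep the original label as the product name wherever the pattern did not match.
parts["product"] = parts["product"].fillna(products)

print(parts)
```

The extracted `unit` column can then be used with `fillna` on the existing unit column, in the same way as the split column in the following cell.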


In [35]:
# fill null unit values with values split based on "(" delimiter
df4['unit'] = df4['unit'].fillna(df4[1])
df4

Unnamed: 0,date,location,price,unit,0,1
0,Jan-23,Canada,17.52,per kilogram,Beef stewing cuts,
1,Feb-23,Canada,17.05,per kilogram,Beef stewing cuts,
2,Mar-23,Canada,17.08,per kilogram,Beef stewing cuts,
3,Apr-23,Canada,18.17,per kilogram,Beef stewing cuts,
4,May-23,Canada,19.02,per kilogram,Beef stewing cuts,
...,...,...,...,...,...,...
10840,May-23,British Columbia,4.51,400 grams,Sunflower seeds,
10841,Jun-23,British Columbia,4.21,400 grams,Sunflower seeds,
10842,Jul-23,British Columbia,4.40,400 grams,Sunflower seeds,
10843,Aug-23,British Columbia,4.27,400 grams,Sunflower seeds,


In [36]:
# drop split column
df4.drop(1, axis=1, inplace=True)
df4

Unnamed: 0,date,location,price,unit,0
0,Jan-23,Canada,17.52,per kilogram,Beef stewing cuts
1,Feb-23,Canada,17.05,per kilogram,Beef stewing cuts
2,Mar-23,Canada,17.08,per kilogram,Beef stewing cuts
3,Apr-23,Canada,18.17,per kilogram,Beef stewing cuts
4,May-23,Canada,19.02,per kilogram,Beef stewing cuts
...,...,...,...,...,...
10840,May-23,British Columbia,4.51,400 grams,Sunflower seeds
10841,Jun-23,British Columbia,4.21,400 grams,Sunflower seeds
10842,Jul-23,British Columbia,4.40,400 grams,Sunflower seeds
10843,Aug-23,British Columbia,4.27,400 grams,Sunflower seeds


In [37]:
# rename product column
df4.rename(columns={0:'product'}, inplace=True)
df4

Unnamed: 0,date,location,price,unit,product
0,Jan-23,Canada,17.52,per kilogram,Beef stewing cuts
1,Feb-23,Canada,17.05,per kilogram,Beef stewing cuts
2,Mar-23,Canada,17.08,per kilogram,Beef stewing cuts
3,Apr-23,Canada,18.17,per kilogram,Beef stewing cuts
4,May-23,Canada,19.02,per kilogram,Beef stewing cuts
...,...,...,...,...,...
10840,May-23,British Columbia,4.51,400 grams,Sunflower seeds
10841,Jun-23,British Columbia,4.21,400 grams,Sunflower seeds
10842,Jul-23,British Columbia,4.40,400 grams,Sunflower seeds
10843,Aug-23,British Columbia,4.27,400 grams,Sunflower seeds


In [42]:
# remove leading and trailing whitespace from unit column
df4['unit'] = df4['unit'].str.strip()

In [43]:
# split unit column into individual terms based on whitespace
df5 = pd.concat([df4, df4['unit'].str.split(' ', expand=True)], axis=1).drop('unit', axis=1)
df5

Unnamed: 0,date,location,price,product,0,1
0,Jan-23,Canada,17.52,Beef stewing cuts,per,kilogram
1,Feb-23,Canada,17.05,Beef stewing cuts,per,kilogram
2,Mar-23,Canada,17.08,Beef stewing cuts,per,kilogram
3,Apr-23,Canada,18.17,Beef stewing cuts,per,kilogram
4,May-23,Canada,19.02,Beef stewing cuts,per,kilogram
...,...,...,...,...,...,...
10840,May-23,British Columbia,4.51,Sunflower seeds,400,grams
10841,Jun-23,British Columbia,4.21,Sunflower seeds,400,grams
10842,Jul-23,British Columbia,4.40,Sunflower seeds,400,grams
10843,Aug-23,British Columbia,4.27,Sunflower seeds,400,grams


In [44]:
# rename columns 
df5.rename(columns={0: 'unit_quantity', 1: 'unit'}, inplace=True)
df5

Unnamed: 0,date,location,price,product,unit_quantity,unit
0,Jan-23,Canada,17.52,Beef stewing cuts,per,kilogram
1,Feb-23,Canada,17.05,Beef stewing cuts,per,kilogram
2,Mar-23,Canada,17.08,Beef stewing cuts,per,kilogram
3,Apr-23,Canada,18.17,Beef stewing cuts,per,kilogram
4,May-23,Canada,19.02,Beef stewing cuts,per,kilogram
...,...,...,...,...,...,...
10840,May-23,British Columbia,4.51,Sunflower seeds,400,grams
10841,Jun-23,British Columbia,4.21,Sunflower seeds,400,grams
10842,Jul-23,British Columbia,4.40,Sunflower seeds,400,grams
10843,Aug-23,British Columbia,4.27,Sunflower seeds,400,grams


In [45]:
# identify unique unit quantity values
df5['unit_quantity'].unique().tolist()

['per',
 '500',
 '400',
 '300',
 '213',
 '170',
 '226',
 '1',
 '2',
 '4',
 '1.89',
 '454',
 '907',
 '1.36',
 'unit',
 '4.54',
 '227',
 '142',
 '750',
 '390',
 '600',
 '675',
 '200',
 '900',
 '2.5',
 '340',
 '20',
 '3',
 '890',
 '398',
 '796',
 '284',
 '540',
 '341',
 '350',
 '418',
 '650',
 '475',
 '450']

In [56]:
# check product rows where unit quantity is non-numeric: 'per', 'unit'
# list of products where unit quantity is 'per'
per_is_unit = df5[df5['unit_quantity'].isin(['per'])]['product'].unique().tolist()
per_is_unit

['Beef stewing cuts',
 'Beef striploin cuts',
 'Beef top sirloin cuts',
 'Beef rib cuts',
 'Ground beef',
 'Pork loin cuts',
 'Pork rib cuts',
 'Pork shoulder cuts',
 'Whole chicken',
 'Chicken breasts',
 'Chicken thigh',
 'Chicken drumsticks',
 'Salmon',
 'Apples',
 'Oranges',
 'Bananas',
 'Pears',
 'Grapes',
 'Potatoes',
 'Sweet potatoes',
 'Tomatoes',
 'Cabbage',
 'Onions',
 'Peppers',
 'Squash']

In [60]:
# where unit_quantity is 'per', make unit_quantity = 1
df5['unit_quantity'] = df5['unit_quantity'].replace('per', '1')

In [61]:
df5

Unnamed: 0,date,location,price,product,unit_quantity,unit
0,Jan-23,Canada,17.52,Beef stewing cuts,1,kilogram
1,Feb-23,Canada,17.05,Beef stewing cuts,1,kilogram
2,Mar-23,Canada,17.08,Beef stewing cuts,1,kilogram
3,Apr-23,Canada,18.17,Beef stewing cuts,1,kilogram
4,May-23,Canada,19.02,Beef stewing cuts,1,kilogram
...,...,...,...,...,...,...
10840,May-23,British Columbia,4.51,Sunflower seeds,400,grams
10841,Jun-23,British Columbia,4.21,Sunflower seeds,400,grams
10842,Jul-23,British Columbia,4.40,Sunflower seeds,400,grams
10843,Aug-23,British Columbia,4.27,Sunflower seeds,400,grams


In [57]:
# list of products where unit quantity is 'unit'
quantity_is_unit = df5[df5['unit_quantity'].isin(['unit'])]['product'].unique().tolist()
quantity_is_unit

['Lemons',
 'Limes',
 'Cantaloupe',
 'Avocado',
 'Celery',
 'Cucumber',
 'Iceberg lettuce',
 'Romaine lettuce',
 'Broccoli']

In [58]:
# identify unique unit values
unique_unit_vals = df5['unit'].unique().tolist()
unique_unit_vals

['kilogram',
 'grams',
 'litre',
 'litres',
 'dozen',
 'kilograms',
 <NA>,
 'bags',
 'millilitres']

In [59]:
# investigate null values in unit column
df5[df5['unit'].isna()]

Unnamed: 0,date,location,price,product,unit_quantity,unit
315,Jan-23,Canada,0.91,Lemons,unit,
316,Feb-23,Canada,0.97,Lemons,unit,
317,Mar-23,Canada,0.94,Lemons,unit,
318,Apr-23,Canada,0.98,Lemons,unit,
319,May-23,Canada,0.91,Lemons,unit,
...,...,...,...,...,...,...
10381,May-23,British Columbia,4.49,Broccoli,unit,
10382,Jun-23,British Columbia,4.46,Broccoli,unit,
10383,Jul-23,British Columbia,3.96,Broccoli,unit,
10384,Aug-23,British Columbia,3.78,Broccoli,unit,


In [63]:
# check whether all cases where unit_quantity is 'unit' are cases where unit is null
# (named unit_mask to avoid shadowing the built-in filter function)
unit_mask = df5['unit_quantity'] == 'unit'
df6 = df5.where(unit_mask)
df6['unit'].dropna()

Series([], Name: unit, dtype: string)
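
The check above confirms one direction only: rows with `unit_quantity == 'unit'` have a null unit. Because the next cell fills every null unit with 'each', it may be worth confirming the reverse direction as well. The sketch below compares the two boolean masks directly and lists any rows where they disagree. The three-row sample is illustrative and uses plain object columns.

```python
import pandas as pd

# Small illustrative frame in the same shape as df5.
sample = pd.DataFrame(
    {
        "product": ["Lemons", "Beef stewing cuts", "Tea"],
        "unit_quantity": ["unit", "1", "20"],
        "unit": [None, "kilogram", "bags"],
    }
)

is_unit_quantity = sample["unit_quantity"].eq("unit")
has_null_unit = sample["unit"].isna()

# Rows where exactly one of the two conditions holds would need manual review.
mismatches = sample[is_unit_quantity != has_null_unit]

print(len(mismatches))
```

An empty `mismatches` frame means the two conditions select the same rows, so filling null units with 'each' does not touch any other product.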

In [64]:
# where unit_quantity is 'unit' and unit column is null, make unit_quantity = 1 and unit = each
df5['unit_quantity'] = df5['unit_quantity'].replace('unit', '1')
df5['unit'] = df5['unit'].fillna('each')
df5

Unnamed: 0,date,location,price,product,unit_quantity,unit
0,Jan-23,Canada,17.52,Beef stewing cuts,1,kilogram
1,Feb-23,Canada,17.05,Beef stewing cuts,1,kilogram
2,Mar-23,Canada,17.08,Beef stewing cuts,1,kilogram
3,Apr-23,Canada,18.17,Beef stewing cuts,1,kilogram
4,May-23,Canada,19.02,Beef stewing cuts,1,kilogram
...,...,...,...,...,...,...
10840,May-23,British Columbia,4.51,Sunflower seeds,400,grams
10841,Jun-23,British Columbia,4.21,Sunflower seeds,400,grams
10842,Jul-23,British Columbia,4.40,Sunflower seeds,400,grams
10843,Aug-23,British Columbia,4.27,Sunflower seeds,400,grams


In [65]:
# check new unique unit values
unique_unit_vals_check = df5['unit'].unique().tolist()
unique_unit_vals_check

['kilogram',
 'grams',
 'litre',
 'litres',
 'dozen',
 'kilograms',
 'each',
 'bags',
 'millilitres']

In [66]:
# check new unique unit quantity values
unique_unit_quant_vals_check = df5['unit_quantity'].unique().tolist()
unique_unit_quant_vals_check

['1',
 '500',
 '400',
 '300',
 '213',
 '170',
 '226',
 '2',
 '4',
 '1.89',
 '454',
 '907',
 '1.36',
 '4.54',
 '227',
 '142',
 '750',
 '390',
 '600',
 '675',
 '200',
 '900',
 '2.5',
 '340',
 '20',
 '3',
 '890',
 '398',
 '796',
 '284',
 '540',
 '341',
 '350',
 '418',
 '650',
 '475',
 '450']
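
The 'per' and 'unit' handling in the cells above could be condensed into a single helper. The sketch below is one possible version: `parse_unit` is a hypothetical function name, and it assumes the unit strings take only the shapes seen in this notebook ('per kilogram', '400 grams', 'unit', '20 bags').

```python
import pandas as pd


def parse_unit(raw):
    """Return (quantity, unit) for strings like 'per kilogram', '400 grams' or 'unit'."""
    if pd.isna(raw):
        return (float("nan"), pd.NA)
    tokens = str(raw).split()  # split() also discards leading/trailing whitespace
    if not tokens:
        return (float("nan"), pd.NA)
    if tokens == ["unit"]:
        # Sold individually: one item, counted as 'each'.
        return (1.0, "each")
    if tokens[0] == "per":
        # 'per kilogram' means a quantity of one kilogram.
        return (1.0, " ".join(tokens[1:]))
    return (float(tokens[0]), " ".join(tokens[1:]))


units = pd.Series(["per kilogram", " 400 grams", "unit", "20 bags"], dtype="string")
parsed = pd.DataFrame(
    [parse_unit(u) for u in units],
    columns=["unit_quantity", "unit"],
)

print(parsed)
```

A helper like this keeps the parsing rules in one place, at the cost of hiding the intermediate checks that the step-by-step cells make visible.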

In [67]:
df5.dtypes

date                     object
location                 object
price                   float64
product          string[python]
unit_quantity    string[python]
unit             string[python]
dtype: object

In [78]:
# redefine datatypes
df7 = df5.astype({'date': 'string', 'location': 'string', 'product':'string', 'unit_quantity': 'float64', 'unit':'string'})

df7.dtypes

date             string[python]
location         string[python]
price                   float64
product          string[python]
unit_quantity           float64
unit             string[python]
dtype: object

In [82]:
# cast date column to datetime datatype
df7['date'] = pd.to_datetime(df7['date'], format='%b-%y')

In [83]:
df7

Unnamed: 0,date,location,price,product,unit_quantity,unit
0,2023-01-01,Canada,17.52,Beef stewing cuts,1.0,kilogram
1,2023-02-01,Canada,17.05,Beef stewing cuts,1.0,kilogram
2,2023-03-01,Canada,17.08,Beef stewing cuts,1.0,kilogram
3,2023-04-01,Canada,18.17,Beef stewing cuts,1.0,kilogram
4,2023-05-01,Canada,19.02,Beef stewing cuts,1.0,kilogram
...,...,...,...,...,...,...
10840,2023-05-01,British Columbia,4.51,Sunflower seeds,400.0,grams
10841,2023-06-01,British Columbia,4.21,Sunflower seeds,400.0,grams
10842,2023-07-01,British Columbia,4.40,Sunflower seeds,400.0,grams
10843,2023-08-01,British Columbia,4.27,Sunflower seeds,400.0,grams


In [84]:
df7.dtypes

date             datetime64[ns]
location         string[python]
price                   float64
product          string[python]
unit_quantity           float64
unit             string[python]
dtype: object
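
With numeric quantities in place, the "price per unit" variable from the objectives can be derived. The sketch below shows one way to do it, assuming grams are converted to kilograms and millilitres to litres, while 'dozen', 'each', and 'bags' are left as they are. The names `BASE_UNIT` and `BASE_FACTOR` and the choice of base units are assumptions for illustration; the sample rows reuse prices shown earlier in this notebook.

```python
import pandas as pd

# Assumed base unit for each unit label found in the cleaned data.
BASE_UNIT = {
    "kilogram": "kilogram", "kilograms": "kilogram", "grams": "kilogram",
    "litre": "litre", "litres": "litre", "millilitres": "litre",
    "dozen": "dozen", "each": "each", "bags": "bags",
}
# Multiplier that converts one of the original units into the base unit.
BASE_FACTOR = {
    "kilogram": 1.0, "kilograms": 1.0, "grams": 0.001,
    "litre": 1.0, "litres": 1.0, "millilitres": 0.001,
    "dozen": 1.0, "each": 1.0, "bags": 1.0,
}

sample = pd.DataFrame(
    {
        "product": ["Beef stewing cuts", "Sunflower seeds", "Tea", "Lemons"],
        "price": [17.52, 4.51, 4.27, 0.91],
        "unit_quantity": [1.0, 400.0, 20.0, 1.0],
        "unit": ["kilogram", "grams", "bags", "each"],
    }
)

sample["base_unit"] = sample["unit"].map(BASE_UNIT)
base_quantity = sample["unit_quantity"] * sample["unit"].map(BASE_FACTOR)
sample["price_per_base_unit"] = sample["price"] / base_quantity

print(sample[["product", "base_unit", "price_per_base_unit"]])
```

Normalising to a base unit makes prices comparable across package sizes, for example 400 g and 1 kg packages of the same product.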

In [85]:
# save dataframe to .csv
df7.to_csv('../data_sources/clean_data/2023_MRA_clean.csv')

### Identify unique vector IDs that may be usable in API data acquisition

In [5]:
# import clean dataframe
food_df = pd.read_csv('../data_sources/clean_data/2023_MRA_clean.csv', index_col=0)
food_df

Unnamed: 0,date,location,price,product,unit_quantity,unit
0,2023-01-01,Canada,17.52,Beef stewing cuts,1.0,kilogram
1,2023-02-01,Canada,17.05,Beef stewing cuts,1.0,kilogram
2,2023-03-01,Canada,17.08,Beef stewing cuts,1.0,kilogram
3,2023-04-01,Canada,18.17,Beef stewing cuts,1.0,kilogram
4,2023-05-01,Canada,19.02,Beef stewing cuts,1.0,kilogram
...,...,...,...,...,...,...
10840,2023-05-01,British Columbia,4.51,Sunflower seeds,400.0,grams
10841,2023-06-01,British Columbia,4.21,Sunflower seeds,400.0,grams
10842,2023-07-01,British Columbia,4.40,Sunflower seeds,400.0,grams
10843,2023-08-01,British Columbia,4.27,Sunflower seeds,400.0,grams
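
A CSV file does not store dtypes, so the `date` column comes back as text unless `read_csv` is told to parse it. The sketch below demonstrates this with an in-memory round trip; the two-row frame is illustrative, and `io.StringIO` stands in for the file on disk.

```python
import io

import pandas as pd

# Tiny frame with a real datetime column, like df7 before saving.
clean = pd.DataFrame(
    {
        "date": pd.to_datetime(["Jan-23", "Feb-23"], format="%b-%y"),
        "price": [17.52, 17.05],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer)

# Default read: the date column is loaded as text.
buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)

# With parse_dates, the column is converted back to datetime.
buffer.seek(0)
parsed = pd.read_csv(buffer, index_col=0, parse_dates=["date"])

print(plain.dtypes)
print(parsed.dtypes)
```

Passing `parse_dates=['date']` when loading `2023_MRA_clean.csv` would restore the datetime dtype set before saving.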


In [7]:
# generate a list of unique vector IDs (requires the unfiltered dataframe, which still has the VECTOR column)
vector_list = monthly_avg_retail_prices_2023['VECTOR'].unique().tolist()
vector_list

['v1353834271',
 'v1353834272',
 'v1353834273',
 'v1353834311',
 'v1353834274',
 'v1353834275',
 'v1353834276',
 'v1353834312',
 'v1353834277',
 'v1353834278',
 'v1353834279',
 'v1353834313',
 'v1353834280',
 'v1353834281',
 'v1458869929',
 'v1458869931',
 'v1353834314',
 'v1353834282',
 'v1458869922',
 'v1353834283',
 'v1353834284',
 'v1353834285',
 'v1458869932',
 'v1458869923',
 'v1353834286',
 'v1353834287',
 'v1458869921',
 'v1353834288',
 'v1353834289',
 'v1353834290',
 'v1353834291',
 'v1353834292',
 'v1353834293',
 'v1353834294',
 'v1353834295',
 'v1353834296',
 'v1353834315',
 'v1353834297',
 'v1353834298',
 'v1458869934',
 'v1353834299',
 'v1353834300',
 'v1353834316',
 'v1353834317',
 'v1353834301',
 'v1353834302',
 'v1353834303',
 'v1353834304',
 'v1353834305',
 'v1353834306',
 'v1353834307',
 'v1353834308',
 'v1353834318',
 'v1353834319',
 'v1353834309',
 'v1353834310',
 'v1458869933',
 'v1458869928',
 'v1353834320',
 'v1353834321',
 'v1353834322',
 'v1353834323',
 'v13538
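
A bare list of vector IDs does not show which product and location each ID refers to. The sketch below builds a small lookup table from the raw columns by dropping duplicate rows. The sample rows mirror values shown in the raw table above, and `vector_lookup` is a hypothetical name.

```python
import pandas as pd

# Illustrative rows in the shape of the raw Statistics Canada table.
raw = pd.DataFrame(
    {
        "REF_DATE": ["Jan-23", "Feb-23", "May-23"],
        "GEO": ["Canada", "Canada", "British Columbia"],
        "Products": [
            "Beef stewing cuts, per kilogram",
            "Beef stewing cuts, per kilogram",
            "Laundry detergent, 4.43 litres",
        ],
        "VECTOR": ["v1353834271", "v1353834271", "v1458870250"],
    }
)

# One row per vector ID, with the location and product it identifies.
vector_lookup = (
    raw[["VECTOR", "GEO", "Products"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(vector_lookup)
```

In the sample rows shown earlier, each vector ID stays constant across months for a given product and location, which is why one row per ID is kept here.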

Identify unique product values in 2023 monthly product retail averages table:

In [None]:
# generate a list of unique products in monthly retail averages table
product_list = df3['product'].unique().tolist()

In [None]:
# save list of unique products in a .csv table column
product_list_df = pd.DataFrame({'Col0': product_list})
product_list_df.to_csv('monthly_average_unique_product_list.csv', index=False)
print(product_list_df)

                      Col0
0        Beef stewing cuts
1      Beef striploin cuts
2    Beef top sirloin cuts
3            Beef rib cuts
4              Ground beef
..                     ...
100        Sunflower seeds
101              Deodorant
102             Toothpaste
103                Shampoo
104      Laundry detergent

[105 rows x 1 columns]
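
The unique product list is a natural starting point for the `category` variable named in the objectives. The sketch below shows one possible approach: a hand-written mapping from product to category and subcategory, joined onto the product list. The assignments in `CATEGORY_BY_PRODUCT` are illustrative assumptions and would need to be checked against the food guide classification before use.

```python
import pandas as pd

# Hypothetical, partial mapping from product to (category, subcategory).
CATEGORY_BY_PRODUCT = {
    "Beef stewing cuts": ("animal-based protein foods", "red meats"),
    "Whole chicken": ("animal-based protein foods", "poultry"),
    "Apples": ("vegetables and fruit", "fruit"),
    "Carrots": ("vegetables and fruit", "vegetables"),
    "Sunflower seeds": ("plant-based protein foods", "nuts and seeds"),
}

lookup = pd.DataFrame.from_dict(
    CATEGORY_BY_PRODUCT, orient="index", columns=["category", "subcategory"]
)

products_df = pd.DataFrame(
    {"product": ["Beef stewing cuts", "Apples", "Sunflower seeds", "Tea"]}
)

# Left join on the product name; unmapped products get missing categories.
categorised = products_df.join(lookup, on="product")

# Products that still need a category assigned.
unmapped = categorised.loc[categorised["category"].isna(), "product"].tolist()

print(categorised)
print(unmapped)
```

Listing the unmapped products after each join makes it easy to see which entries in the mapping are still missing.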
