### Objectives:
- Identify data sources for all variables of interest. Categories based on https://www.canada.ca/en/health-canada/services/publications/food-nutrition/2019-canada-food-guide-food-classification-system-foods-beverages-categories.html 
    - vegetables and fruit
        - fruit
        - vegetables
    - grains
        - whole grains
        - whole grain and whole wheat foods
    - plant-based protein foods
        - plant-based yogurts
        - fortified plant-based cheeses
        - legumes
        - simulated meats
        - nuts and seeds
        - other plant-based foods
    - animal-based protein foods
        - yogurts and kefir
        - cheeses
        - other milk-based foods
        - red meats
            - beef
            - pork
            - lamb
            - goat
        - game meats
        - poultry
            - chicken
            - turkey
        - eggs
        - fish and shellfish
        - organ meats
    - beverages
        - water
        - fortified plant-based beverages 
        - non-fortified plant-based beverages
        - milks
        - fruit juice
        - vegetable juice
        - coffee
        - tea
        - soda
    - other foods
        - condiments, sauces, dressings
        - snack foods
        - processed meats
    - fats and oils
        - unsaturated fats and oils
        - saturated and trans fats and oils
    - unclassified foods
        - meal replacements
        - alcoholic beverages
        - herbs and spices
- Acquire 2023 monthly averages for each variable.
- Where possible, extract and isolate the following variables:
    - category (if available)
    - product
    - product type (if applicable)
    - unit of measurement
    - unit quantity
    - price per unit
    - date
    - location (if available)

### Resources:
- https://www.statcan.gc.ca/en/statistical-programs/document/2301_D68_V1 : list of representative products used to calculate the CPI "basket" of goods and services
- Primary resource: https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=1810024501 : monthly average retail prices for selected products
    - The data resource pages for red meat, dairy, and produce redirect to this table.
- Supplemental resources:
    - https://agriculture.canada.ca/en/market-information-system/rp/index-eng.cfm?action=pR&r=116&pdctc= : monthly average retail prices for pork, turkey, chicken, and eggs

#### Primary data source: monthly average retail prices

In [22]:
# import pandas
import pandas as pd

In [23]:
# define relative path
path = '../data_sources/raw_data'

In [24]:
# import monthly average retail prices for food and related products
monthly_avg_retail_prices_2023 = pd.read_csv(f'{path}/2023_monthly_product_averages.csv')
monthly_avg_retail_prices_2023

Unnamed: 0,REF_DATE,GEO,DGUID,Products,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,Jan-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,17.52,,,,2
1,Feb-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,17.05,,,,2
2,Mar-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,17.08,,,,2
3,Apr-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,18.17,,,,2
4,May-23,Canada,2016A000011124,"Beef stewing cuts, per kilogram",Dollars,81,units,0,v1353834271,11.10,19.02,,,,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10876,May-23,British Columbia,2016A000259,"Laundry detergent, 4.43 litres",Dollars,81,units,0,v1458870250,10.11,16.21,,,,2
10877,Jun-23,British Columbia,2016A000259,"Laundry detergent, 4.43 litres",Dollars,81,units,0,v1458870250,10.11,16.75,,,,2
10878,Jul-23,British Columbia,2016A000259,"Laundry detergent, 4.43 litres",Dollars,81,units,0,v1458870250,10.11,16.71,,,,2
10879,Aug-23,British Columbia,2016A000259,"Laundry detergent, 4.43 litres",Dollars,81,units,0,v1458870250,10.11,16.22,,,,2


In [25]:
# check dtypes of dataframe
monthly_avg_retail_prices_2023.dtypes

REF_DATE          object
GEO               object
DGUID             object
Products          object
UOM               object
UOM_ID             int64
SCALAR_FACTOR     object
SCALAR_ID          int64
VECTOR            object
COORDINATE       float64
VALUE            float64
STATUS           float64
SYMBOL           float64
TERMINATED       float64
DECIMALS           int64
dtype: object
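
The dtypes above show `REF_DATE` stored as a plain object column of strings such as `Jan-23`. A minimal sketch of converting that format to real dates, assuming every value follows the same abbreviated-month and two-digit-year pattern and that month names are in English (the `sample` frame and the `date` column name are illustrative, not part of the notebook):

```python
import pandas as pd

# Illustrative sample mirroring the REF_DATE strings shown in the table above
sample = pd.DataFrame({'REF_DATE': ['Jan-23', 'Feb-23', 'Sep-23']})

# '%b-%y' = abbreviated month name, hyphen, two-digit year (assumed format)
sample['date'] = pd.to_datetime(sample['REF_DATE'], format='%b-%y')

print(sample.dtypes)
```

Each value is parsed to the first day of its month, which keeps the rows sortable and groupable by month.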

In [26]:
# filter for variables of interest
monthly_avg_retail_prices_2023 = monthly_avg_retail_prices_2023[['REF_DATE', 'GEO', 'Products', 'UOM', 'VALUE']]
monthly_avg_retail_prices_2023


Unnamed: 0,REF_DATE,GEO,Products,UOM,VALUE
0,Jan-23,Canada,"Beef stewing cuts, per kilogram",Dollars,17.52
1,Feb-23,Canada,"Beef stewing cuts, per kilogram",Dollars,17.05
2,Mar-23,Canada,"Beef stewing cuts, per kilogram",Dollars,17.08
3,Apr-23,Canada,"Beef stewing cuts, per kilogram",Dollars,18.17
4,May-23,Canada,"Beef stewing cuts, per kilogram",Dollars,19.02
...,...,...,...,...,...
10876,May-23,British Columbia,"Laundry detergent, 4.43 litres",Dollars,16.21
10877,Jun-23,British Columbia,"Laundry detergent, 4.43 litres",Dollars,16.75
10878,Jul-23,British Columbia,"Laundry detergent, 4.43 litres",Dollars,16.71
10879,Aug-23,British Columbia,"Laundry detergent, 4.43 litres",Dollars,16.22


In [27]:
# cast products column to string
monthly_avg_retail_prices_2023 = monthly_avg_retail_prices_2023.astype({'Products':'string'})
monthly_avg_retail_prices_2023.dtypes

REF_DATE            object
GEO                 object
Products    string[python]
UOM                 object
VALUE              float64
dtype: object

In [28]:
# test if product column can be split into item and units based on comma delimiter
df = monthly_avg_retail_prices_2023['Products'].str.split(',', expand = True)
df

Unnamed: 0,0,1
0,Beef stewing cuts,per kilogram
1,Beef stewing cuts,per kilogram
2,Beef stewing cuts,per kilogram
3,Beef stewing cuts,per kilogram
4,Beef stewing cuts,per kilogram
...,...,...
10876,Laundry detergent,4.43 litres
10877,Laundry detergent,4.43 litres
10878,Laundry detergent,4.43 litres
10879,Laundry detergent,4.43 litres
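
Splitting on every comma works here because the output has only two columns, but it would break if a product name ever contained a comma of its own. A hedged alternative, in the spirit of the commented-out `rsplit` cell further down, is to split once from the right. The sample below is illustrative, and `'Peppers, red, per kilogram'` is a hypothetical product name used only to show the difference:

```python
import pandas as pd

# Illustrative sample; the third entry is hypothetical, the rest mirror the table above
products = pd.Series(
    ['Beef stewing cuts, per kilogram',
     'Laundry detergent, 4.43 litres',
     'Peppers, red, per kilogram',
     'Tea (20 bags)'],
    dtype='string',
)

# Split at most once, starting from the right, on the literal ', ' separator
parts = products.str.rsplit(', ', n=1, expand=True)
parts.columns = ['product', 'unit']
print(parts)
```

Splitting on `', '` rather than `','` also avoids the leading space in the unit column. Rows without a comma, such as the tea entry, keep the full string in `product` and get a missing `unit`.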


In [29]:
# define a new dataframe with split product/unit data
df2 = pd.concat([monthly_avg_retail_prices_2023, monthly_avg_retail_prices_2023['Products'].str.split(',', expand=True)], axis=1).drop('Products', axis=1)
df2

Unnamed: 0,REF_DATE,GEO,UOM,VALUE,0,1
0,Jan-23,Canada,Dollars,17.52,Beef stewing cuts,per kilogram
1,Feb-23,Canada,Dollars,17.05,Beef stewing cuts,per kilogram
2,Mar-23,Canada,Dollars,17.08,Beef stewing cuts,per kilogram
3,Apr-23,Canada,Dollars,18.17,Beef stewing cuts,per kilogram
4,May-23,Canada,Dollars,19.02,Beef stewing cuts,per kilogram
...,...,...,...,...,...,...
10876,May-23,British Columbia,Dollars,16.21,Laundry detergent,4.43 litres
10877,Jun-23,British Columbia,Dollars,16.75,Laundry detergent,4.43 litres
10878,Jul-23,British Columbia,Dollars,16.71,Laundry detergent,4.43 litres
10879,Aug-23,British Columbia,Dollars,16.22,Laundry detergent,4.43 litres


In [30]:
# rename columns 
df2.rename(columns={'REF_DATE': 'Date', 'GEO':'Location', 'UOM':'price units', 'VALUE': 'Price', 0: 'Product', 1: 'unit'}, inplace=True)
df2

Unnamed: 0,Date,Location,price units,Price,Product,unit
0,Jan-23,Canada,Dollars,17.52,Beef stewing cuts,per kilogram
1,Feb-23,Canada,Dollars,17.05,Beef stewing cuts,per kilogram
2,Mar-23,Canada,Dollars,17.08,Beef stewing cuts,per kilogram
3,Apr-23,Canada,Dollars,18.17,Beef stewing cuts,per kilogram
4,May-23,Canada,Dollars,19.02,Beef stewing cuts,per kilogram
...,...,...,...,...,...,...
10876,May-23,British Columbia,Dollars,16.21,Laundry detergent,4.43 litres
10877,Jun-23,British Columbia,Dollars,16.75,Laundry detergent,4.43 litres
10878,Jul-23,British Columbia,Dollars,16.71,Laundry detergent,4.43 litres
10879,Aug-23,British Columbia,Dollars,16.22,Laundry detergent,4.43 litres


In [31]:
# identify products that have null values for unit quantity
df2[df2['unit'].isna()]

Unnamed: 0,Date,Location,price units,Price,Product,unit
720,Jan-23,Canada,Dollars,4.27,Tea (20 bags),
721,Feb-23,Canada,Dollars,4.32,Tea (20 bags),
722,Mar-23,Canada,Dollars,4.31,Tea (20 bags),
723,Apr-23,Canada,Dollars,4.35,Tea (20 bags),
724,May-23,Canada,Dollars,4.37,Tea (20 bags),
...,...,...,...,...,...,...
10615,May-23,British Columbia,Dollars,4.55,Tea (20 bags),
10616,Jun-23,British Columbia,Dollars,4.33,Tea (20 bags),
10617,Jul-23,British Columbia,Dollars,4.75,Tea (20 bags),
10618,Aug-23,British Columbia,Dollars,4.78,Tea (20 bags),


In [33]:
# split products with null unit quantity based on ( delimiter
df3 = pd.concat([df2, df2['Product'].str.split('(', expand=True)], axis=1).drop('Product', axis=1)
df3

Unnamed: 0,Date,Location,price units,Price,unit,0,1
0,Jan-23,Canada,Dollars,17.52,per kilogram,Beef stewing cuts,
1,Feb-23,Canada,Dollars,17.05,per kilogram,Beef stewing cuts,
2,Mar-23,Canada,Dollars,17.08,per kilogram,Beef stewing cuts,
3,Apr-23,Canada,Dollars,18.17,per kilogram,Beef stewing cuts,
4,May-23,Canada,Dollars,19.02,per kilogram,Beef stewing cuts,
...,...,...,...,...,...,...,...
10876,May-23,British Columbia,Dollars,16.21,4.43 litres,Laundry detergent,
10877,Jun-23,British Columbia,Dollars,16.75,4.43 litres,Laundry detergent,
10878,Jul-23,British Columbia,Dollars,16.71,4.43 litres,Laundry detergent,
10879,Aug-23,British Columbia,Dollars,16.22,4.43 litres,Laundry detergent,


In [34]:
# compare number of rows that are not null in new split column to number of rows that were previously null in unit column
df3[df3[1].notna()]

Unnamed: 0,Date,Location,price units,Price,unit,0,1
720,Jan-23,Canada,Dollars,4.27,,Tea,20 bags)
721,Feb-23,Canada,Dollars,4.32,,Tea,20 bags)
722,Mar-23,Canada,Dollars,4.31,,Tea,20 bags)
723,Apr-23,Canada,Dollars,4.35,,Tea,20 bags)
724,May-23,Canada,Dollars,4.37,,Tea,20 bags)
...,...,...,...,...,...,...,...
10615,May-23,British Columbia,Dollars,4.55,,Tea,20 bags)
10616,Jun-23,British Columbia,Dollars,4.33,,Tea,20 bags)
10617,Jul-23,British Columbia,Dollars,4.75,,Tea,20 bags)
10618,Aug-23,British Columbia,Dollars,4.78,,Tea,20 bags)
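
The split on `(` leaves a trailing `)` in the unit text (`20 bags)`). One possible way to avoid that, sketched here under the assumption that a product has at most one parenthesised suffix, is to capture the name and the parenthesised part with a single regular expression. The sample series and the group names are illustrative:

```python
import pandas as pd

# Illustrative sample: one product with a parenthesised unit, one without
products = pd.Series(['Tea (20 bags)', 'Beef stewing cuts'])

# product: everything before an optional "(...)" suffix; unit: the text inside it
pattern = r'^(?P<product>[^(]+?)\s*(?:\((?P<unit>[^)]*)\))?$'
extracted = products.str.extract(pattern)
print(extracted)
```

Rows without parentheses keep their full name in `product` and get a missing `unit`, so the result can be passed to `fillna` in the same way as the split column.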


In [35]:
# fill null unit values with values split based on "(" delimiter
df3['unit'] = df3['unit'].fillna(df3[1])

In [36]:
# drop split column
df3.drop(1, axis=1, inplace=True)
df3

Unnamed: 0,Date,Location,price units,Price,unit,0
0,Jan-23,Canada,Dollars,17.52,per kilogram,Beef stewing cuts
1,Feb-23,Canada,Dollars,17.05,per kilogram,Beef stewing cuts
2,Mar-23,Canada,Dollars,17.08,per kilogram,Beef stewing cuts
3,Apr-23,Canada,Dollars,18.17,per kilogram,Beef stewing cuts
4,May-23,Canada,Dollars,19.02,per kilogram,Beef stewing cuts
...,...,...,...,...,...,...
10876,May-23,British Columbia,Dollars,16.21,4.43 litres,Laundry detergent
10877,Jun-23,British Columbia,Dollars,16.75,4.43 litres,Laundry detergent
10878,Jul-23,British Columbia,Dollars,16.71,4.43 litres,Laundry detergent
10879,Aug-23,British Columbia,Dollars,16.22,4.43 litres,Laundry detergent


In [37]:
# rename product column
df3.rename(columns={0:'product'}, inplace=True)
df3

Unnamed: 0,Date,Location,price units,Price,unit,product
0,Jan-23,Canada,Dollars,17.52,per kilogram,Beef stewing cuts
1,Feb-23,Canada,Dollars,17.05,per kilogram,Beef stewing cuts
2,Mar-23,Canada,Dollars,17.08,per kilogram,Beef stewing cuts
3,Apr-23,Canada,Dollars,18.17,per kilogram,Beef stewing cuts
4,May-23,Canada,Dollars,19.02,per kilogram,Beef stewing cuts
...,...,...,...,...,...,...
10876,May-23,British Columbia,Dollars,16.21,4.43 litres,Laundry detergent
10877,Jun-23,British Columbia,Dollars,16.75,4.43 litres,Laundry detergent
10878,Jul-23,British Columbia,Dollars,16.71,4.43 litres,Laundry detergent
10879,Aug-23,British Columbia,Dollars,16.22,4.43 litres,Laundry detergent


In [38]:
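# split unit on spaces; the leading space left over from the comma split yields an empty first column
df4 = pd.concat([df3, df3['unit'].str.split(' ', expand=True)], axis=1).drop('unit', axis=1)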
df4

Unnamed: 0,Date,Location,price units,Price,product,0,1,2,3
0,Jan-23,Canada,Dollars,17.52,Beef stewing cuts,,per,kilogram,
1,Feb-23,Canada,Dollars,17.05,Beef stewing cuts,,per,kilogram,
2,Mar-23,Canada,Dollars,17.08,Beef stewing cuts,,per,kilogram,
3,Apr-23,Canada,Dollars,18.17,Beef stewing cuts,,per,kilogram,
4,May-23,Canada,Dollars,19.02,Beef stewing cuts,,per,kilogram,
...,...,...,...,...,...,...,...,...,...
10876,May-23,British Columbia,Dollars,16.21,Laundry detergent,,4.43,litres,
10877,Jun-23,British Columbia,Dollars,16.75,Laundry detergent,,4.43,litres,
10878,Jul-23,British Columbia,Dollars,16.71,Laundry detergent,,4.43,litres,
10879,Aug-23,British Columbia,Dollars,16.22,Laundry detergent,,4.43,litres,


In [39]:
# split units into unit and unit quantity
# for unit in ['kilogram', 'litre', 'litres', 'gram', 'grams', 'dozen', 'millilitres', 'bags']:
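
The unfinished loop above aims to separate the unit quantity from the unit name. A minimal sketch of one way to do this with a regular expression instead of a loop, assuming that `per <unit>` means a quantity of 1 and that every unit is a single word (both are assumptions, as are the sample values and column names):

```python
import pandas as pd

# Illustrative unit strings in the shapes produced by the splits above
units = pd.Series([' per kilogram', ' 4.43 litres', '20 bags)'])

# Remove surrounding whitespace and the stray ')' left by the '(' split
cleaned = units.str.strip().str.rstrip(')')

# Optional "per", optional numeric quantity, then a one-word unit name
pattern = r'^(?:per\s+)?(?P<unit_quantity>\d+(?:\.\d+)?)?\s*(?P<unit>[A-Za-z]+)$'
parsed = cleaned.str.extract(pattern)
parsed['unit_quantity'] = pd.to_numeric(parsed['unit_quantity'])

# Assumption: "per kilogram" style units carry an implicit quantity of 1
implicit = parsed['unit'].notna() & parsed['unit_quantity'].isna()
parsed.loc[implicit, 'unit_quantity'] = 1.0
print(parsed)
```

Strings that do not fit the pattern come back as missing in both columns, which makes them easy to list and handle separately.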

In [40]:
# # split products column into product and quantity based on comma delimiter
# data_split = pd.DataFrame(
#     [x.rsplit(', ', 1) for x in monthly_avg_retail_prices_2023.Products.tolist()], 
#     columns=['product', 'unit_quantity']
# )
# data_split

### Identify unique vector IDs that may be usable in API data acquisition

In [41]:
# generate a list of unique vector IDs in monthly retail averages table
# vector_list = monthly_avg_retail_prices_2023['VECTOR'].unique().tolist()
# vector_list
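
Note that `VECTOR` was dropped when the dataframe was filtered in cell 26, so the commented-out line would need the unfiltered table. A hedged sketch of building a vector lookup from the raw columns, using two rows copied from the output above as an illustrative stand-in for a fresh `pd.read_csv` (the `raw` and `vector_lookup` names are assumptions):

```python
import pandas as pd

# Illustrative stand-in for the unfiltered CSV; values copied from the table above
raw = pd.DataFrame({
    'VECTOR': ['v1353834271', 'v1353834271', 'v1458870250'],
    'GEO': ['Canada', 'Canada', 'British Columbia'],
    'Products': ['Beef stewing cuts, per kilogram',
                 'Beef stewing cuts, per kilogram',
                 'Laundry detergent, 4.43 litres'],
})

# One row per vector ID, keeping the location and product it identifies
vector_lookup = (
    raw[['VECTOR', 'GEO', 'Products']]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(vector_lookup)
```

Keeping `GEO` and `Products` next to each ID makes it possible to check which series a vector refers to before using it in any later data request.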

Identify unique product values in 2023 monthly product retail averages table:

In [42]:
# generate a list of unique products in monthly retail averages table
product_list = df3['product'].unique().tolist()
product_list

['Beef stewing cuts',
 'Beef striploin cuts',
 'Beef top sirloin cuts',
 'Beef rib cuts',
 'Ground beef',
 'Pork loin cuts',
 'Pork rib cuts',
 'Pork shoulder cuts',
 'Whole chicken',
 'Chicken breasts',
 'Chicken thigh',
 'Chicken drumsticks',
 'Bacon',
 'Wieners',
 'Salmon',
 'Shrimp',
 'Canned salmon',
 'Canned tuna',
 'Meatless burgers',
 'Milk',
 'Soy milk',
 'Nut milk',
 'Cream',
 'Butter',
 'Margarine',
 'Block cheese',
 'Yogurt',
 'Eggs',
 'Apples',
 'Oranges',
 'Bananas',
 'Pears',
 'Lemons',
 'Limes',
 'Grapes',
 'Cantaloupe',
 'Strawberries',
 'Avocado',
 'Potatoes',
 'Sweet potatoes',
 'Tomatoes',
 'Cabbage',
 'Carrots',
 'Onions',
 'Celery',
 'Cucumber',
 'Mushrooms',
 'Iceberg lettuce',
 'Romaine lettuce',
 'Broccoli',
 'Peppers',
 'Squash',
 'Salad greens',
 'Frozen french fried potatoes',
 'Frozen green beans',
 'Frozen broccoli',
 'Frozen corn',
 'Frozen mixed vegetables',
 'Frozen peas',
 'Frozen pizza',
 'Frozen spinach',
 'Frozen strawberries',
 'White bread',
 'Fla

In [45]:
def categorize_product(product):
    """Return the first category whose keyword occurs as a substring of the product name."""
    # dict order sets priority: the first category with a matching keyword wins
    keywords = {
        'Veg_Fruits': ['apples', 'oranges', 'bananas', 'pears', 'lemons', 'limes', 'grapes', 'cantaloupe', 'strawberries', 'avocado', 'potatoes', 'cabbage', 'carrots', 'onions', 'celery', 'cucumber', 'mushrooms', 'lettuce', 'broccoli', 'peppers', 'squash', 'salad greens'],
        'Frozen': ['frozen'],
        'Grains': ['bread', 'flatbread', 'crackers', 'cookies', 'pasta', 'rice', 'cereal', 'wheat'],
        'Protein': ['beef', 'chicken', 'pork', 'bacon', 'wieners', 'salmon', 'shrimp', 'eggs'],
        'Plant_based': ['meatless', 'soy milk', 'nut milk'],
        'Dairy': ['cream', 'butter', 'margarine', 'cheese', 'yogurt'],
        'Beverages': ['milk', 'juice', 'coffee', 'tea'],
        'Others': ['canned', 'sugar', 'ketchup', 'peanut butter', 'mayonnaise', 'dried', 'dry', 'tofu', 'hummus', 'salsa', 'pasta sauce', 'salad dressing'],
        'Oils': ['oil'],
        'Unclassified': ['almonds', 'peanuts', 'sunflower']
    }
    # case-insensitive substring match against the product name
    name = product.lower()
    for category, category_keywords in keywords.items():
        for keyword in category_keywords:
            if keyword in name:
                return category

    # no keyword matched
    return '?'

df2['Category'] = df2['Product'].apply(categorize_product)

'''
veg_fruits = ['apples', 'oranges', 'bananas', 'pears', 'lemons', 'limes', 'grapes', 'cantaloupe', 'strawberries', 'avocado', 'potatoes', 'cabbage', 'carrots', 'onions', 'celery', 'cucumber', 'mushrooms', 'lettuce', 'broccoli', 'peppers', 'squash', 'salad greens']
frozen = ['frozen']
grains = ['bread', 'flatbread', 'crackers', 'cookies', 'pasta', 'rice', 'cereal', 'wheat']
protein_animal_based = ['beef', 'chicken', 'pork', 'bacon', 'wieners', 'salmon', 'shrimp', 'eggs']
plant_based = ['meatless', 'soy milk', 'nut milk']
dairy = ['cream', 'butter', 'margarine', 'cheese', 'yogurt']
beverages = ['milk', 'juice', 'coffee', 'tea']
others = ['canned', 'sugar', 'ketchup', 'peanut butter', 'mayonnaise', 'dried', 'dry', 'tofu', 'hummus', 'salsa', 'pasta sauce', 'salad dressing']
oils = ['oil']
unclassified = ['almonds', 'peanuts', 'sunflower']

df2['category'] = df2['Product'].apply(lambda name: 'Protein (Animal-based)' if any(word in name.lower() for word in protein_animal_based) else '')

# milk -> beverages, plant_based
# strawberries -> veg_fruits, frozen
# pasta sauce -> grains, others
'''

In [51]:
#df2.drop('price units', inplace=True, axis=1) 
# df2.head(50)

Unnamed: 0,Date,Location,Price,Product,unit,Category
0,Jan-23,Canada,17.52,Beef stewing cuts,per kilogram,Protein
1,Feb-23,Canada,17.05,Beef stewing cuts,per kilogram,Protein
2,Mar-23,Canada,17.08,Beef stewing cuts,per kilogram,Protein
3,Apr-23,Canada,18.17,Beef stewing cuts,per kilogram,Protein
4,May-23,Canada,19.02,Beef stewing cuts,per kilogram,Protein
5,Jun-23,Canada,19.63,Beef stewing cuts,per kilogram,Protein
6,Jul-23,Canada,19.48,Beef stewing cuts,per kilogram,Protein
7,Aug-23,Canada,16.23,Beef stewing cuts,per kilogram,Protein
8,Sep-23,Canada,18.5,Beef stewing cuts,per kilogram,Protein
9,Jan-23,Canada,27.98,Beef striploin cuts,per kilogram,Protein


In [47]:
df2['Category']

0        Protein
1        Protein
2        Protein
3        Protein
4        Protein
          ...   
10876     Others
10877     Others
10878     Others
10879     Others
10880     Others
Name: Category, Length: 10881, dtype: object
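
The last rows above are laundry detergent, yet they land in `Others`. With plain substring matching, the keyword `dry` matches inside "Laundry". A minimal sketch of whole-word matching that avoids this kind of accidental hit, using a deliberately abbreviated keyword map (the map, the function name, and the category order are illustrative assumptions, not the notebook's final scheme):

```python
import re

# Abbreviated, illustrative keyword map; dict order still sets priority
KEYWORDS = {
    'Plant_based': ['soy milk', 'nut milk', 'meatless'],
    'Beverages': ['milk', 'tea'],
    'Others': ['dry', 'dried'],
}

def categorize_whole_words(product, keywords=KEYWORDS):
    """Return the first category with a keyword matching on word boundaries."""
    name = product.lower()
    for category, words in keywords.items():
        for word in words:
            # \b anchors keep 'dry' from matching inside 'laundry'
            if re.search(rf'\b{re.escape(word)}\b', name):
                return category
    return '?'

for item in ['Laundry detergent', 'Soy milk', 'Milk', 'Dry beans']:
    print(item, '->', categorize_whole_words(item))
```

Listing `Plant_based` before `Beverages` is what sends "Soy milk" to the plant-based group, which is one way to settle the `milk` ambiguity noted in the cell above. The trade-off is that whole-word matching no longer treats singular and plural forms as equal, so each keyword has to be spelled the way it appears in the product names.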

In [None]:
# save the list of unique products as a single-column .csv table
product_list_df = pd.DataFrame({'Col0': product_list})
product_list_df.to_csv('monthly_average_unique_product_list.csv', index=False)
print(product_list_df)

                      Col0
0        Beef stewing cuts
1      Beef striploin cuts
2    Beef top sirloin cuts
3            Beef rib cuts
4              Ground beef
..                     ...
100        Sunflower seeds
101              Deodorant
102             Toothpaste
103                Shampoo
104      Laundry detergent

[105 rows x 1 columns]
