### Objectives:
- Identify data sources for all variables of interest. Categories based on https://www.canada.ca/en/health-canada/services/publications/food-nutrition/2019-canada-food-guide-food-classification-system-foods-beverages-categories.html 
    - vegetables and fruit
        - fruit
        - vegetables
    - grains
        - whole grains
        - whole grain and whole wheat foods
    - plant-based protein foods
        - plant-based yogurts
        - fortified plant-based cheeses
        - legumes
        - simulated meats
        - nuts and seeds
        - other plant-based foods
    - animal-based protein foods
        - yogurts and kefir
        - cheeses
        - other milk-based foods
        - red meats
            - beef
            - pork
            - lamb
            - goat
        - game meats
        - poultry
            - chicken
            - turkey
        - eggs
        - fish and shellfish
        - organ meats
    - beverages
        - water
        - fortified plant-based beverages 
        - non-fortified plant-based beverages
        - milks
        - fruit juice
        - vegetable juice
        - coffee
        - tea
        - soda
    - other foods
        - condiments, sauces, dressings
        - snack foods
        - processed meats
    - fats and oils
        - unsaturated fats and oils
        - saturated and trans fats and oils
    - unclassified foods
        - meal replacements
        - alcoholic beverages
        - herbs and spices
- acquire 2023 monthly averages for each variable
- where possible, extract and isolate the following variables:
    - category (if available)
    - product
    - product type (if applicable)
    - unit of measurement
    - unit quantity
    - price per unit
    - date
    - location (if available)

Resources:
- https://www.statcan.gc.ca/en/statistical-programs/document/2301_D68_V1 (list of representative products used to calculate the CPI "basket" of goods and services)
- primary resource: https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=1810024501 (monthly average retail prices for selected products)
    - the data resources for red meat, dairy, and produce all redirect to this table
- supplemental resources:
    - https://agriculture.canada.ca/en/market-information-system/rp/index-eng.cfm?action=pR&r=116&pdctc= (monthly average retail prices for pork, turkey, chicken, eggs)

#### Primary data source: monthly average retail prices

In [1]:
# import pandas
import pandas as pd

In [2]:
# define relative path
path = '../data_sources/raw_data'

In [3]:
# import monthly average retail prices for food and related products
monthly_avg_retail_prices_2023 = pd.read_csv(f'{path}/2023_monthly_product_averages.csv')
monthly_avg_retail_prices_2023

FileNotFoundError: [Errno 2] No such file or directory: '../data_sources/raw_data/2023_monthly_product_averages.csv'
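
The `FileNotFoundError` above usually means the relative path was resolved from a different working directory than expected. A minimal sketch of a guarded load follows; the helper name `load_csv_checked` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(directory, filename):
    """Read a CSV with pandas, raising a descriptive error if the file is absent."""
    csv_path = Path(directory) / filename
    if not csv_path.is_file():
        # Report the absolute location so a wrong working directory is easy to spot.
        raise FileNotFoundError(
            f"Expected {csv_path.resolve()} (current working directory: {Path.cwd()})"
        )
    return pd.read_csv(csv_path)
```

In the notebook this could be called as `load_csv_checked(path, '2023_monthly_product_averages.csv')`, assuming `path` is defined as in the cell above.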

In [4]:
# check dtypes of dataframe
monthly_avg_retail_prices_2023.dtypes

NameError: name 'monthly_avg_retail_prices_2023' is not defined

In [6]:
# filter for variables of interest
monthly_avg_retail_prices_2023 = monthly_avg_retail_prices_2023[['REF_DATE', 'GEO', 'Products', 'UOM', 'VALUE']]
monthly_avg_retail_prices_2023


Unnamed: 0,REF_DATE,GEO,Products,UOM,VALUE
0,Jan-23,Canada,"Beef stewing cuts, per kilogram",Dollars,17.52
1,Feb-23,Canada,"Beef stewing cuts, per kilogram",Dollars,17.05
2,Mar-23,Canada,"Beef stewing cuts, per kilogram",Dollars,17.08
3,Apr-23,Canada,"Beef stewing cuts, per kilogram",Dollars,18.17
4,May-23,Canada,"Beef stewing cuts, per kilogram",Dollars,19.02
...,...,...,...,...,...
10876,May-23,British Columbia,"Laundry detergent, 4.43 litres",Dollars,16.21
10877,Jun-23,British Columbia,"Laundry detergent, 4.43 litres",Dollars,16.75
10878,Jul-23,British Columbia,"Laundry detergent, 4.43 litres",Dollars,16.71
10879,Aug-23,British Columbia,"Laundry detergent, 4.43 litres",Dollars,16.22


In [7]:
# cast products column to string
monthly_avg_retail_prices_2023 = monthly_avg_retail_prices_2023.astype({'Products':'string'})
monthly_avg_retail_prices_2023.dtypes

REF_DATE     object
GEO          object
Products     string
UOM          object
VALUE       float64
dtype: object

In [8]:
# test if product column can be split into item and units based on comma delimiter
df = monthly_avg_retail_prices_2023['Products'].str.split(',', expand = True)
df

Unnamed: 0,0,1
0,Beef stewing cuts,per kilogram
1,Beef stewing cuts,per kilogram
2,Beef stewing cuts,per kilogram
3,Beef stewing cuts,per kilogram
4,Beef stewing cuts,per kilogram
...,...,...
10876,Laundry detergent,4.43 litres
10877,Laundry detergent,4.43 litres
10878,Laundry detergent,4.43 litres
10879,Laundry detergent,4.43 litres
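
The comma split above leaves a leading space in the unit text and would produce extra columns if a product name ever contained a comma. A small sketch of a more defensive variant, using hypothetical sample rows in the same `<product>, <unit>` layout:

```python
import pandas as pd

# Hypothetical sample rows in the same "<product>, <unit>" layout as the Products column.
products = pd.Series(
    ["Beef stewing cuts, per kilogram", "Laundry detergent, 4.43 litres", "Tea (20 bags)"],
    dtype="string",
)

# Split once, from the right, so a comma inside a product name cannot create a third column.
parts = products.str.rsplit(",", n=1, expand=True)
parts.columns = ["product", "unit"]

# Remove the space that follows the comma.
parts["unit"] = parts["unit"].str.strip()
```

Rows without a comma, such as `Tea (20 bags)`, keep the full text in `product` and get a missing value in `unit`, which matches the null-unit case handled later in the notebook.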


In [9]:
# define a new dataframe with split product/unit data
df2 = pd.concat([monthly_avg_retail_prices_2023, monthly_avg_retail_prices_2023['Products'].str.split(',', expand=True)], axis=1).drop('Products', axis=1)
df2

Unnamed: 0,REF_DATE,GEO,UOM,VALUE,0,1
0,Jan-23,Canada,Dollars,17.52,Beef stewing cuts,per kilogram
1,Feb-23,Canada,Dollars,17.05,Beef stewing cuts,per kilogram
2,Mar-23,Canada,Dollars,17.08,Beef stewing cuts,per kilogram
3,Apr-23,Canada,Dollars,18.17,Beef stewing cuts,per kilogram
4,May-23,Canada,Dollars,19.02,Beef stewing cuts,per kilogram
...,...,...,...,...,...,...
10876,May-23,British Columbia,Dollars,16.21,Laundry detergent,4.43 litres
10877,Jun-23,British Columbia,Dollars,16.75,Laundry detergent,4.43 litres
10878,Jul-23,British Columbia,Dollars,16.71,Laundry detergent,4.43 litres
10879,Aug-23,British Columbia,Dollars,16.22,Laundry detergent,4.43 litres


In [10]:
# rename columns 
df2.rename(columns={'REF_DATE': 'date', 'GEO':'location', 'UOM':'price units', 'VALUE': 'price', 0: 'product', 1: 'unit'}, inplace=True)
df2

Unnamed: 0,date,location,price units,price,product,unit
0,Jan-23,Canada,Dollars,17.52,Beef stewing cuts,per kilogram
1,Feb-23,Canada,Dollars,17.05,Beef stewing cuts,per kilogram
2,Mar-23,Canada,Dollars,17.08,Beef stewing cuts,per kilogram
3,Apr-23,Canada,Dollars,18.17,Beef stewing cuts,per kilogram
4,May-23,Canada,Dollars,19.02,Beef stewing cuts,per kilogram
...,...,...,...,...,...,...
10876,May-23,British Columbia,Dollars,16.21,Laundry detergent,4.43 litres
10877,Jun-23,British Columbia,Dollars,16.75,Laundry detergent,4.43 litres
10878,Jul-23,British Columbia,Dollars,16.71,Laundry detergent,4.43 litres
10879,Aug-23,British Columbia,Dollars,16.22,Laundry detergent,4.43 litres
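
Later outputs in this notebook show `date` as `2023-01` rather than `Jan-23`, but the cell that performed the conversion is not included here. The following is only an assumption about how such a conversion could be done, shown on hypothetical sample values; note that `%b` parses English month abbreviations only under an English or C locale.

```python
import pandas as pd

# Hypothetical sample of the REF_DATE values shown above.
dates = pd.Series(["Jan-23", "Feb-23", "Aug-23"])

# Parse abbreviated month and two-digit year, then format as YYYY-MM.
parsed = pd.to_datetime(dates, format="%b-%y")
year_month = parsed.dt.strftime("%Y-%m")
```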


In [11]:
# list unique products; bracket indexing is needed because df2.product resolves to the DataFrame.product() method
df2['product'].unique()

In [42]:
# identify products that have null values for unit quantity
df2[df2['unit'].isna()]

Unnamed: 0,date,location,price units,price,product,unit
720,2023-01,Canada,Dollars,4.27,Tea (20 bags),
721,2023-02,Canada,Dollars,4.32,Tea (20 bags),
722,2023-03,Canada,Dollars,4.31,Tea (20 bags),
723,2023-04,Canada,Dollars,4.35,Tea (20 bags),
724,2023-05,Canada,Dollars,4.37,Tea (20 bags),
...,...,...,...,...,...,...
10615,2023-05,British Columbia,Dollars,4.55,Tea (20 bags),
10616,2023-06,British Columbia,Dollars,4.33,Tea (20 bags),
10617,2023-07,British Columbia,Dollars,4.75,Tea (20 bags),
10618,2023-08,British Columbia,Dollars,4.78,Tea (20 bags),


In [43]:
# split product names on the "(" delimiter to recover units for rows with a null unit
df3 = pd.concat([df2, df2['product'].str.split('(', expand=True)], axis=1).drop('product', axis=1)
df3

Unnamed: 0,date,location,price units,price,unit,0,1
0,2023-01,Canada,Dollars,17.52,per kilogram,Beef stewing cuts,
1,2023-02,Canada,Dollars,17.05,per kilogram,Beef stewing cuts,
2,2023-03,Canada,Dollars,17.08,per kilogram,Beef stewing cuts,
3,2023-04,Canada,Dollars,18.17,per kilogram,Beef stewing cuts,
4,2023-05,Canada,Dollars,19.02,per kilogram,Beef stewing cuts,
...,...,...,...,...,...,...,...
10876,2023-05,British Columbia,Dollars,16.21,4.43 litres,Laundry detergent,
10877,2023-06,British Columbia,Dollars,16.75,4.43 litres,Laundry detergent,
10878,2023-07,British Columbia,Dollars,16.71,4.43 litres,Laundry detergent,
10879,2023-08,British Columbia,Dollars,16.22,4.43 litres,Laundry detergent,


In [45]:
# compare number of rows that are not null in new split column to number of rows that were previously null in unit column
df3[df3[1].notna()]

Unnamed: 0,date,location,price units,price,unit,0,1
720,2023-01,Canada,Dollars,4.27,,Tea,20 bags)
721,2023-02,Canada,Dollars,4.32,,Tea,20 bags)
722,2023-03,Canada,Dollars,4.31,,Tea,20 bags)
723,2023-04,Canada,Dollars,4.35,,Tea,20 bags)
724,2023-05,Canada,Dollars,4.37,,Tea,20 bags)
...,...,...,...,...,...,...,...
10615,2023-05,British Columbia,Dollars,4.55,,Tea,20 bags)
10616,2023-06,British Columbia,Dollars,4.33,,Tea,20 bags)
10617,2023-07,British Columbia,Dollars,4.75,,Tea,20 bags)
10618,2023-08,British Columbia,Dollars,4.78,,Tea,20 bags)
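
Splitting on `(` leaves a trailing space on the product (`Tea `) and a trailing `)` on the unit (`20 bags)`). A sketch of an alternative that captures both pieces cleanly with one regular expression, shown on hypothetical sample rows rather than the full table:

```python
import pandas as pd

# Hypothetical sample: one product with a parenthesised unit, one without.
products = pd.Series(["Tea (20 bags)", "Beef stewing cuts"], dtype="string")

# Named groups become column names; the parenthesised part is optional.
pattern = r"^(?P<product>[^(]*?)\s*(?:\((?P<unit>[^)]*)\))?\s*$"
extracted = products.str.extract(pattern)
```

The `unit` column is missing for rows without parentheses, so it can still be used with `fillna` in the same way as the split column below.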


In [46]:
# fill null unit values with values split based on "(" delimiter
df3['unit'] = df3['unit'].fillna(df3[1])

In [53]:
# drop split column
df3.drop(1, axis=1, inplace=True)
df3

Unnamed: 0,date,location,price units,price,unit,0
0,2023-01,Canada,Dollars,17.52,per kilogram,Beef stewing cuts
1,2023-02,Canada,Dollars,17.05,per kilogram,Beef stewing cuts
2,2023-03,Canada,Dollars,17.08,per kilogram,Beef stewing cuts
3,2023-04,Canada,Dollars,18.17,per kilogram,Beef stewing cuts
4,2023-05,Canada,Dollars,19.02,per kilogram,Beef stewing cuts
...,...,...,...,...,...,...
10876,2023-05,British Columbia,Dollars,16.21,4.43 litres,Laundry detergent
10877,2023-06,British Columbia,Dollars,16.75,4.43 litres,Laundry detergent
10878,2023-07,British Columbia,Dollars,16.71,4.43 litres,Laundry detergent
10879,2023-08,British Columbia,Dollars,16.22,4.43 litres,Laundry detergent


In [56]:
# rename product column
df3.rename(columns={0:'product'}, inplace=True)
df3

Unnamed: 0,date,location,price units,price,unit,product
0,2023-01,Canada,Dollars,17.52,per kilogram,Beef stewing cuts
1,2023-02,Canada,Dollars,17.05,per kilogram,Beef stewing cuts
2,2023-03,Canada,Dollars,17.08,per kilogram,Beef stewing cuts
3,2023-04,Canada,Dollars,18.17,per kilogram,Beef stewing cuts
4,2023-05,Canada,Dollars,19.02,per kilogram,Beef stewing cuts
...,...,...,...,...,...,...
10876,2023-05,British Columbia,Dollars,16.21,4.43 litres,Laundry detergent
10877,2023-06,British Columbia,Dollars,16.75,4.43 litres,Laundry detergent
10878,2023-07,British Columbia,Dollars,16.71,4.43 litres,Laundry detergent
10879,2023-08,British Columbia,Dollars,16.22,4.43 litres,Laundry detergent


In [72]:
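# split unit text on spaces to separate the quantity from the unit of measurement
# (the leading space left over from the comma split produces the empty column 0)
df4 = pd.concat([df3, df3['unit'].str.split(' ', expand=True)], axis=1).drop('unit', axis=1)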
df4

Unnamed: 0,date,location,price units,price,product,0,1,2,3
0,2023-01,Canada,Dollars,17.52,Beef stewing cuts,,per,kilogram,
1,2023-02,Canada,Dollars,17.05,Beef stewing cuts,,per,kilogram,
2,2023-03,Canada,Dollars,17.08,Beef stewing cuts,,per,kilogram,
3,2023-04,Canada,Dollars,18.17,Beef stewing cuts,,per,kilogram,
4,2023-05,Canada,Dollars,19.02,Beef stewing cuts,,per,kilogram,
...,...,...,...,...,...,...,...,...,...
10876,2023-05,British Columbia,Dollars,16.21,Laundry detergent,,4.43,litres,
10877,2023-06,British Columbia,Dollars,16.75,Laundry detergent,,4.43,litres,
10878,2023-07,British Columbia,Dollars,16.71,Laundry detergent,,4.43,litres,
10879,2023-08,British Columbia,Dollars,16.22,Laundry detergent,,4.43,litres,


In [74]:
# inspect the last split column to see whether it holds any data
col3 = df4[3]
col3

0        <NA>
1        <NA>
2        <NA>
3        <NA>
4        <NA>
         ... 
10876    <NA>
10877    <NA>
10878    <NA>
10879    <NA>
10880    <NA>
Name: 3, Length: 10881, dtype: string

In [78]:
# confirm the last split column contains only missing values and empty strings
col3.unique()

<StringArray>
[<NA>, '']
Length: 2, dtype: string
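
The space split above scatters the quantity and the unit across several columns, with empty strings at either end. A sketch of extracting both in one step instead, on hypothetical sample values; the column names `unit_quantity` and `unit_of_measurement` and the choice to treat `per <unit>` as a quantity of 1 are assumptions, not something the notebook states.

```python
import pandas as pd

# Hypothetical sample of unit text as it appears after the earlier splits.
units = pd.Series([" per kilogram", " 4.43 litres", "20 bags)", " 500 grams"])

# Drop surrounding spaces and the stray closing parenthesis.
cleaned = units.str.strip(" )")

# Optional "per", optional number, then the unit of measurement.
pattern = r"^(?:per\s+)?(?P<unit_quantity>\d+(?:\.\d+)?)?\s*(?P<unit_of_measurement>[A-Za-z]+)$"
parts = cleaned.str.extract(pattern)

# "per <unit>" carries no explicit number, so treat it as a quantity of 1.
parts["unit_quantity"] = pd.to_numeric(parts["unit_quantity"]).fillna(1.0)
```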

In [None]:
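# split units into unit and unit quantity
# first pass: label each row with its unit of measurement (unit quantity is not extracted here)
for unit in ['kilogram', 'litre', 'litres', 'gram', 'grams', 'dozen', 'millilitres', 'bags']:
    # \b word boundaries keep 'gram' from matching 'kilogram' or 'grams'
    df3.loc[df3['unit'].str.contains(rf'\b{unit}\b', na=False), 'unit of measurement'] = unit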


In [23]:
# # split products column into product and quantity based on comma delimiter
# data_split = pd.DataFrame(
#     [x.rsplit(', ', 1) for x in monthly_avg_retail_prices_2023.Products.tolist()], 
#     columns=['product', 'unit_quantity']
# )
# data_split

Unnamed: 0,product,unit_quantity
0,Beef stewing cuts,per kilogram
1,Beef stewing cuts,per kilogram
2,Beef stewing cuts,per kilogram
3,Beef stewing cuts,per kilogram
4,Beef stewing cuts,per kilogram
...,...,...
10876,Laundry detergent,4.43 litres
10877,Laundry detergent,4.43 litres
10878,Laundry detergent,4.43 litres
10879,Laundry detergent,4.43 litres


### Identify unique vector IDs that may be usable in API data acquisition

In [7]:
# generate a list of unique vector IDs in monthly retail averages table
vector_list = monthly_avg_retail_prices_2023['VECTOR'].unique().tolist()
vector_list

['v1353834271',
 'v1353834272',
 'v1353834273',
 'v1353834311',
 'v1353834274',
 'v1353834275',
 'v1353834276',
 'v1353834312',
 'v1353834277',
 'v1353834278',
 'v1353834279',
 'v1353834313',
 'v1353834280',
 'v1353834281',
 'v1458869929',
 'v1458869931',
 'v1353834314',
 'v1353834282',
 'v1458869922',
 'v1353834283',
 'v1353834284',
 'v1353834285',
 'v1458869932',
 'v1458869923',
 'v1353834286',
 'v1353834287',
 'v1458869921',
 'v1353834288',
 'v1353834289',
 'v1353834290',
 'v1353834291',
 'v1353834292',
 'v1353834293',
 'v1353834294',
 'v1353834295',
 'v1353834296',
 'v1353834315',
 'v1353834297',
 'v1353834298',
 'v1458869934',
 'v1353834299',
 'v1353834300',
 'v1353834316',
 'v1353834317',
 'v1353834301',
 'v1353834302',
 'v1353834303',
 'v1353834304',
 'v1353834305',
 'v1353834306',
 'v1353834307',
 'v1353834308',
 'v1353834318',
 'v1353834319',
 'v1353834309',
 'v1353834310',
 'v1458869933',
 'v1458869928',
 'v1353834320',
 'v1353834321',
 'v1353834322',
 'v1353834323',
 'v13538
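
As a sketch of how these vector IDs might feed an API request: Statistics Canada's Web Data Service accepts numeric vector IDs, so the leading `v` has to be removed. The endpoint URL and the JSON field names `vectorId` and `latestN` below are assumptions based on the Web Data Service documentation and should be verified before use; the helper name `build_vector_payload` is hypothetical. No request is sent here.

```python
import json

# Assumed endpoint; confirm against the Web Data Service user guide before use.
WDS_URL = "https://www150.statcan.gc.ca/t1/wds/rest/getDataFromVectorsAndLatestNPeriods"


def build_vector_payload(vector_ids, latest_n):
    """Turn IDs such as 'v1353834271' into the list-of-dicts body the service is assumed to expect."""
    return [{"vectorId": int(v.lstrip("vV")), "latestN": latest_n} for v in vector_ids]


# Two IDs taken from the list above; 12 periods would cover one year of monthly data.
payload = build_vector_payload(["v1353834271", "v1353834272"], latest_n=12)
body = json.dumps(payload)
```

The resulting `body` string would be sent as the JSON body of a POST request to `WDS_URL`.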

Identify unique product values in 2023 monthly product retail averages table:

In [None]:
# generate a list of unique products in monthly retail averages table
product_list = df3['product'].unique().tolist()

In [None]:
# save list of unique products in a .csv table column
product_list_df = pd.DataFrame({'Col0': product_list})
product_list_df.to_csv('monthly_average_unique_product_list.csv', index=False)
print(product_list_df)

                      Col0
0        Beef stewing cuts
1      Beef striploin cuts
2    Beef top sirloin cuts
3            Beef rib cuts
4              Ground beef
..                     ...
100        Sunflower seeds
101              Deodorant
102             Toothpaste
103                Shampoo
104      Laundry detergent

[105 rows x 1 columns]
