# What's in an Avocado Toast: A Supply Chain Analysis

![](avocado_wallpaper.jpeg)

You find yourself in London, crafting a delectable avocado toast, a dish that has risen dramatically in popularity on breakfast menus since the 2010s. This straightforward recipe requires just five ingredients: a ripe avocado, half a lemon, a generous pinch of salt flakes, two slices of sourdough bread, and a good drizzle of extra virgin olive oil. Most of these ingredients are now staples in grocery stores, and as you will find with this project, that is no small feat!

In this project, you'll conduct a supply chain analysis of three ingredients used in avocado toast using the Open Food Facts database. This database contains extensive, openly-sourced information on various foods, including their origins. Through this analysis, you will gain an in-depth understanding of the complex supply chain involved in producing a single dish.

Three pairs of files are provided in the data folder:
- A CSV file for each ingredient, such as `avocado.csv`, with data about each food item and countries of origin.
- A TXT file for each ingredient, such as `relevant_avocado_categories.txt`, containing only the category tags of interest for that food.

Here are some other key points about these files:
- Some of the rows of data in each of the three CSV files do not contain relevant data for your investigation. In each dataset, you will need to filter out rows with irrelevant data, based on values in the `categories_tags` column. Examples of categories are fruits, vegetables, and fruit-based oils. Filter the DataFrame to include only rows where `categories_tags` contains one of the tags in the relevant categories for that ingredient.
- Each row of data usually has multiple category tags in the `categories_tags` column.
- Each CSV file has a column called `origins_tags`, which contains strings for the country of origin of each item.
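
To make the filtering step concrete, here is a minimal sketch on a made-up three-row DataFrame. The product rows and the `relevant_tags` set are illustrative assumptions, not values from the provided files; only the column name `categories_tags` comes from the datasets.

```python
import pandas as pd

# Toy data (hypothetical rows); the real CSVs have many more columns.
toy = pd.DataFrame({
    "product_name_en": ["Hass avocado", "Avocado chips", "Mystery item"],
    "categories_tags": ["en:fruits,en:avocados", "en:snacks,en:chips", None],
})

# Hypothetical stand-in for the contents of a relevant-categories TXT file.
relevant_tags = {"en:avocados", "en:tropical-fruits"}

# Rows without any category tags cannot be matched, so drop them first.
tagged = toy.dropna(subset=["categories_tags"])

# Split each comma-separated string into a list of individual tags.
tag_lists = tagged["categories_tags"].str.split(",")

# Keep a row if at least one of its tags is in the relevant set.
mask = tag_lists.apply(lambda tags: any(tag in relevant_tags for tag in tags))
filtered = tagged[mask]

print(filtered["product_name_en"].tolist())
```

Only the first row survives here: the second has tags but none of them is relevant, and the third has no tags at all.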

After completing this project, you'll be armed with a list of ingredients and their countries of origin and be well-positioned to launch into other analyses that explore how long, on average, these ingredients spend at sea.

[Open Food Facts database](https://world.openfoodfacts.org/)

Apply your data manipulation and analysis skills to investigate the supply chain of ingredients for making avocado toast in the U.K. Your task is to determine the following information:

- The name of the most common country of origin for each of the three key ingredients: avocados, olive oil, and sourdough.

Store the most common country of origin for each ingredient in the following variables: `top_avocado_origin`, `top_olive_oil_origin`, and `top_sourdough_origin`. Ensure that the country names contain only letters (A-Z) and spaces, with no hyphens or other characters.
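
The cleaning requirement can be sketched in plain Python. The helper name `clean_origin_tag` is hypothetical and the sample tags are made up for illustration; the sketch assumes tags follow the `en:some-country` pattern seen in the data. Note that `str.lstrip("en:")` is not a safe way to drop the prefix, because it strips any leading run of the characters `e`, `n`, and `:` rather than the literal prefix.

```python
def clean_origin_tag(tag: str) -> str:
    """Turn a tag such as 'en:united-kingdom' into 'united kingdom'."""
    # Remove only the literal 'en:' prefix, if present.
    if tag.startswith("en:"):
        tag = tag[len("en:"):]
    # Replace hyphens so the name contains only letters and spaces.
    return tag.replace("-", " ")


print(clean_origin_tag("en:united-kingdom"))  # united kingdom
print(clean_origin_tag("en:netherlands"))     # netherlands
print("en:netherlands".lstrip("en:"))         # therlands
```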

To focus your analysis, subset each of the DataFrames to include only these relevant columns: `code`, `lc`, `product_name_en`, `quantity`, `serving_size`, `packaging_tags`, `brands`, `brands_tags`, `categories_tags`, `labels_tags`, `countries`, `countries_tags`, `origins`, `origins_tags`.
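
One way to do this subsetting is at read time. The sketch below uses a tiny in-memory, tab-separated sample (made up for illustration, with only a few of the columns) in place of the real files. `usecols` tells `pd.read_csv` to load only the listed columns; reading `code` as text is an extra assumption on my part, meant to preserve leading zeros in barcodes.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for a file like data/avocado.csv.
sample = (
    "code\tlc\tproduct_name_en\textra_column\torigins_tags\n"
    "0123\ten\tAvocado\tignored\ten:peru\n"
    "0456\ten\tAvocado dip\tignored\t\n"
)

keep = ["code", "lc", "product_name_en", "origins_tags"]

# usecols loads only the columns of interest; dtype keeps 'code' as a string.
df = pd.read_csv(io.StringIO(sample), sep="\t", usecols=keep, dtype={"code": str})

print(df.columns.tolist())
print(df["code"].tolist())
```

Selecting columns after loading, as in `df[subset_columns]`, gives the same result; `usecols` simply avoids holding the unused columns in memory.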

After completing this project, feel free to explore other questions using the food data!

In [2]:
import pandas as pd
avocado = pd.read_csv('data/avocado.csv', sep='\t')
avocado

Unnamed: 0,code,lc,product_name_de,product_name_el,product_name_en,product_name_es,product_name_fi,product_name_fr,product_name_id,product_name_it,...,off:ecoscore_data.adjustments.packaging.non_recyclable_and_non_biodegradable_materials,off:ecoscore_data.adjustments.production_system.value,off:ecoscore_data.adjustments.threatened_species.value,sources_fields:org-database-usda:available_date,sources_fields:org-database-usda:fdc_category,sources_fields:org-database-usda:fdc_data_source,sources_fields:org-database-usda:fdc_id,sources_fields:org-database-usda:modified_date,sources_fields:org-database-usda:publication_date,data_sources
0,0059749979702,fr,,,,,,Naturalia Avocado Oil,,,...,1.0,0.0,,,,,,,,"App - yuka, Apps"
1,7610095131409,en,,,,,,Avocado Bowl chips,,,...,1.0,0.0,,,,,,,,"App - Yuka, Apps, Producers, Producer - zweifel"
2,4005514005578,en,,,Gelbe Linse Avocado Brotaufstrich,,,,,,...,1.0,15.0,,,,,,,,"App - yuka, Apps, App - smoothie-openfoodfacts"
3,0879890002513,en,,,Avocado toast chili lime,,,,,,...,1.0,0.0,,,,,,,,"App - Yuka, Apps, App - InFood"
4,0223086613685,en,,,Avocado,,,,,,...,1.0,0.0,,,,,,,,"App - Yuka, Apps"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1780,0819573012712,en,,,"Organic Baby Food, Apples, Kale & Avocados",,,,,,...,1.0,0.0,,,,,,,,"Database - USDA NDB, Databases"
1781,0052200072097,en,,,"Just Pineapple, Pear & Avocado",,,,,,...,1.0,0.0,,,,,,,,"Database - USDA NDB, Databases"
1782,0793613300000,en,,,Spinach Avocado Dip,,,,,,...,1.0,0.0,,,,,,,,"Database - USDA NDB, Databases"
1783,05252428,en,,,"Organic Just Apple, Raspberry & Avocado, Apple...",,,,,,...,1.0,0.0,,,,,,,,"Database - USDA NDB, Databases"


In [3]:
subset_columns = [ 'code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins','origins_tags']
avocado = avocado[subset_columns]
avocado

Unnamed: 0,code,lc,product_name_en,quantity,serving_size,packaging_tags,brands,brands_tags,categories_tags,labels_tags,countries,countries_tags,origins,origins_tags
0,0059749979702,fr,,,,,Naturalia,naturalia,"en:plant-based-foods-and-beverages,en:plant-ba...",,Canada,en:canada,,
1,7610095131409,en,,,,,Zweifel,zweifel,"en:snacks,en:salty-snacks,en:appetizers,en:chi...","en:vegetarian,en:vegan","Switzerland, World","en:switzerland,en:world",,
2,4005514005578,en,Gelbe Linse Avocado Brotaufstrich,,,,Tartex,tartex,de:abendbrotsufstrich,"en:organic,en:eu-organic,en:eg-oko-verordnung",Germany,en:germany,,
3,0879890002513,en,Avocado toast chili lime,,,,,,,,United States,en:united-states,,
4,0223086613685,en,Avocado,,,,,,,,United States,en:united-states,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1780,0819573012712,en,"Organic Baby Food, Apples, Kale & Avocados",,113 g (113 g),,Happybaby,happybaby,,,United States,en:united-states,,
1781,0052200072097,en,"Just Pineapple, Pear & Avocado",,60 g (60 g),,"Beech-Nut, Beech-Nut Nutrition Company","beech-nut,beech-nut-nutrition-company",,,United States,en:united-states,,
1782,0793613300000,en,Spinach Avocado Dip,,28 g (2 Tbsp),,Classy Delites,classy-delites,,,United States,en:united-states,,
1783,05252428,en,"Organic Just Apple, Raspberry & Avocado, Apple...",,60 g (0.25 cup),,Beech-Nut,beech-nut,,,United States,en:united-states,,


In [4]:
# The with statement closes the file automatically, so no explicit close() is needed
with open('data/relevant_avocado_categories.txt', 'r') as file:
    relevant_avocado_categories = file.read().splitlines()

relevant_avocado_categories

['en:avocadoes',
 'en:avocados',
 'en:fresh-foods',
 'en:fresh-vegetables',
 'en:fruchte',
 'en:fruits',
 'en:raw-green-avocados',
 'en:tropical-fruits',
 'en:tropische-fruchte',
 'en:vegetables-based-foods',
 'fr:hass-avocados']

In [5]:
# Turn a column of comma-separated tags into a column of lists.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
avocado = avocado.assign(categories_list=avocado['categories_tags'].str.split(','))
avocado['categories_list']

0       [en:plant-based-foods-and-beverages, en:plant-...
1       [en:snacks, en:salty-snacks, en:appetizers, en...
2                                 [de:abendbrotsufstrich]
3                                                     NaN
4                                                     NaN
                              ...                        
1780                                                  NaN
1781                                                  NaN
1782                                                  NaN
1783                                                  NaN
1784                                                  NaN
Name: categories_list, Length: 1785, dtype: object

In [6]:
# Drop rows with null values in a particular column
avocado = avocado.dropna(subset=['categories_list'])

In [None]:
# Filter a DataFrame based on a column of lists
# Keep a row if at least one of its tags appears in the relevant categories
avocado = avocado[avocado['categories_list'].apply(lambda tags: any(tag in relevant_avocado_categories for tag in tags))]

# Solution

In [None]:
### Read in the avocado data

# Read tab-delimited data
import pandas as pd
avocado = pd.read_csv('data/avocado.csv', sep='\t')

# Subset large DataFrame to include only relevant columns
subset_columns = [ 'code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins','origins_tags']
avocado = avocado[subset_columns]

# Gather relevant categories data for avocados
with open("data/relevant_avocado_categories.txt", "r") as file:
    relevant_avocado_categories = file.read().splitlines()

### Filter avocado data using relevant category tags

# Turn a column of comma-separated tags into a column of lists.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
avocado = avocado.assign(categories_list=avocado['categories_tags'].str.split(','))

# Drop rows with null values in a particular column
avocado = avocado.dropna(subset=['categories_list'])

# Filter a DataFrame based on a column of lists:
# keep a row if at least one of its tags appears in the relevant categories
avocado = avocado[avocado['categories_list'].apply(lambda tags: any(tag in relevant_avocado_categories for tag in tags))]

### Where do most avocados come from?

# Filter DataFrame for UK data
avocados_uk = avocado[(avocado['countries']=='United Kingdom')]

# Find most common country for avocado origin
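avocado_origin = avocados_uk['origins_tags'].value_counts().index[0]
# Drop the literal 'en:' prefix (str.removeprefix needs Python 3.9+).
# str.lstrip("en:") would strip any leading 'e', 'n', or ':' characters instead.
avocado_origin = avocado_origin.removeprefix("en:")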


### Create a general function to read and filter data for a particular ingredient, 
###    and return the top origin country for that food item

def read_and_filter_data(filename, relevant_categories):
    """Return the most common UK origin country for the ingredient in `filename`."""
    df = pd.read_csv('data/' + filename, sep='\t')

    # Subset large DataFrame to include only relevant columns
    subset_columns = [ 'code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins','origins_tags']
    # .copy() gives an independent DataFrame, so adding a column below
    # does not trigger pandas' SettingWithCopyWarning
    df = df[subset_columns].copy()

    # Split tags into lists
    df['categories_list'] = df['categories_tags'].str.split(',')

    # Drop rows with null categories data
    df = df.dropna(subset=['categories_list'])

    # Filter data for relevant categories
    df = df[df['categories_list'].apply(lambda tags: any(tag in relevant_categories for tag in tags))]

    # Filter data for the UK
    df_uk = df[df['countries'] == 'United Kingdom']

    # Find the origin string with the highest count
    top_origin_string = df_uk['origins_tags'].value_counts().index[0]

    # Clean up the string: drop the literal 'en:' prefix (Python 3.9+), then replace hyphens.
    # str.lstrip('en:') would strip any leading 'e', 'n', or ':' characters instead.
    top_origin_country = top_origin_string.removeprefix('en:').replace('-', ' ')

    print(f'Top origin country for {filename[:-4]}: {top_origin_country}\n')

    # End of function - return top origin country for this ingredient
    return top_origin_country


# Analyze avocado origins again, this time by calling function
top_avocado_origin = read_and_filter_data('avocado.csv',relevant_avocado_categories)

### Repeat process above with new function for the other 2 ingredients

# Gather relevant categories data for olive oil
with open("data/relevant_olive_oil_categories.txt", "r") as file:
    relevant_olive_oil_categories = file.read().splitlines()

# Call user-defined function on olive_oil.csv
top_olive_oil_origin = read_and_filter_data('olive_oil.csv',relevant_olive_oil_categories)

# Gather relevant categories data for sourdough
with open("data/relevant_sourdough_categories.txt", "r") as file:
    relevant_sourdough_categories = file.read().splitlines()

# Call user-defined function on sourdough.csv
top_sourdough_origin = read_and_filter_data('sourdough.csv',relevant_sourdough_categories)
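
The solution above keeps only rows whose `countries` value is exactly `'United Kingdom'`, so a product listed as sold in several countries would be skipped. As a possible variation (not part of the original solution, and it may change the counts), the `countries_tags` column could be searched for a UK tag instead. The rows below are made up for illustration, and the tag `en:united-kingdom` is an assumption by analogy with tags such as `en:united-states` in the output shown earlier.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the datasets.
toy = pd.DataFrame({
    "countries": ["United Kingdom", "France, United Kingdom", "Spain"],
    "countries_tags": ["en:united-kingdom", "en:france,en:united-kingdom", "en:spain"],
    "origins_tags": ["en:peru", "en:peru", "en:spain"],
})

# Exact match on the free-text column: only the first row qualifies.
exact = toy[toy["countries"] == "United Kingdom"]

# Tag-based match: split the tags and check for the UK tag in each list.
has_uk = toy["countries_tags"].str.split(",").apply(lambda tags: "en:united-kingdom" in tags)
by_tag = toy[has_uk]

print(len(exact), len(by_tag))
print(by_tag["origins_tags"].value_counts().idxmax())
```

Whether the broader match is appropriate depends on the question being asked: it counts products sold in the U.K. among other markets, not only those listed for the U.K. alone.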
