# What's in an Avocado Toast: A Supply Chain Analysis

You're in London, making an avocado toast, a quick-to-make dish that has soared in popularity on breakfast menus since the 2010s. A simple smashed avocado toast can be made with five ingredients: one ripe avocado, half a lemon, a big pinch of salt flakes, two slices of sourdough bread and a good drizzle of extra virgin olive oil. It's no small feat that most of these ingredients are readily available in grocery stores. 

In this project, you'll conduct a supply chain analysis of three of these ingredients (avocado, olive oil, and sourdough bread) using the Open Food Facts database. This database contains extensive, openly sourced information on various foods, including their origins. Through this analysis, you will gain a deeper understanding of the complex supply chain behind a single dish.

Three pairs of files are provided in the data folder:
- A CSV file for each ingredient, such as `avocado.csv`, with data about each food item and its countries of origin.
- A TXT file for each ingredient, such as `relevant_avocado_categories.txt`, containing only the category tags of interest for that food.

Here are some other key points about these files:
- Some rows in each of the three CSV files are not relevant to your investigation. In each dataset, you will need to filter out these rows based on the values in the `categories_tags` column. Examples of categories include fruits, vegetables, and fruit-based oils. Filter the DataFrame to include only rows where `categories_tags` contains at least one of the tags in the relevant categories for that ingredient.
- Each row usually has multiple category tags in the `categories_tags` column, separated by commas.
- Each CSV file has a column called `origins_tags`, which holds the item's country (or countries) of origin as a string.
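The filtering rule described in the points above can be sketched on a small, made-up DataFrame. This is a minimal illustration rather than the project's own code: the helper name `filter_by_tags` and the toy rows are assumptions introduced here, and only the `categories_tags` column name comes from the project files.

```python
import pandas as pd

def filter_by_tags(df, column, wanted_tags):
    """Keep rows whose comma-separated `column` holds at least one wanted tag.

    Hypothetical helper for illustration; not part of the project files.
    """
    wanted = set(wanted_tags)
    # Missing values become '' so that .str.split always yields a list
    tag_lists = df[column].fillna('').str.split(',')
    mask = tag_lists.apply(lambda tags: any(tag.strip() in wanted for tag in tags))
    return df[mask]

# Toy data, invented for this example
toy = pd.DataFrame({
    'product_name_en': ['Avocado', 'Guacamole dip', 'Unlabelled item'],
    'categories_tags': ['en:fruits,en:avocados', 'en:dips,en:guacamole', None],
})
relevant = ['en:avocados']

result = filter_by_tags(toy, 'categories_tags', relevant)
print(result['product_name_en'].tolist())  # ['Avocado']
```

Splitting on commas and comparing whole tags, rather than searching for a substring, avoids accidental matches such as a tag that merely contains another tag's text.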

After completing this project, you'll be armed with a list of ingredients and their countries of origin, and be well-positioned to launch into other analyses that explore how long, on average, these ingredients spend at sea.

![](avocado_wallpaper.jpeg)

In [2]:
import pandas as pd

# Clean avocado dataset
# Read avocado dataset
avocado = pd.read_csv('data/avocado.csv', delimiter='\t', header=0)

# Read relevant categories into a list
with open('data/relevant_avocado_categories.txt', 'r') as file:
    # Skip blank lines so an empty tag can never match rows with no categories
    relevant_avocado_categories = [line.strip() for line in file if line.strip()]

# Fill NaN values with an empty string
avocado['categories_tags'] = avocado['categories_tags'].fillna('')

# Filter avocado dataset by relevant categories tag
filtered_avocado = avocado[avocado['categories_tags'].str.split(',').apply(lambda x: any(tag in relevant_avocado_categories for tag in x))]

# Keep only the columns needed for the analysis
cols = ['code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins', 'origins_tags']

filtered_avocado = filtered_avocado[cols]

print(filtered_avocado.head())
print(filtered_avocado.shape)

             code  lc  ...            origins       origins_tags
5   3662994002063  fr  ...                NaN                NaN
6   8437013031011  fr  ...                NaN                NaN
14  4016249238155  de  ...  Europäische Union  en:european-union
17  8718963381532  de  ...                NaN                NaN
23  8436002746707  es  ...                NaN                NaN

[5 rows x 14 columns]
(182, 14)


In [3]:
# Clean olive oil dataset
# Read olive oil dataset
olive = pd.read_csv('data/olive_oil.csv', delimiter='\t', header=0)

# Read relevant categories into a list
with open('data/relevant_olive_oil_categories.txt', 'r') as file:
    # Skip blank lines so an empty tag can never match rows with no categories
    relevant_olive_categories = [line.strip() for line in file if line.strip()]

# Fill NaN values with an empty string
olive['categories_tags'] = olive['categories_tags'].fillna('')

# Filter olive oil dataset by relevant categories tag
filtered_olive = olive[olive['categories_tags'].str.split(',').apply(lambda x: any(tag in relevant_olive_categories for tag in x))]

# Keep only the columns needed for the analysis (same list as for avocado)
filtered_olive = filtered_olive[cols]

print(filtered_olive.head())
print(filtered_olive.shape)

            code  lc  ...           origins         origins_tags
0  3560070910366  fr  ...  Espagne, Tunisie  en:spain,en:tunisia
1  3270190008262  fr  ...  Union européenne    en:european-union
2  6191509903023  en  ...           Tunisia           en:tunisia
3  6191509900855  fr  ...           Tunisie           en:tunisia
7  8002470023154  fr  ...               NaN                  NaN

[5 rows x 14 columns]
(5290, 14)


In [4]:
# Clean sourdough dataset
# Read sourdough dataset
sourdough = pd.read_csv('data/sourdough.csv', delimiter='\t', header=0)

# Read relevant categories into a list
with open('data/relevant_sourdough_categories.txt', 'r') as file:
    # Skip blank lines so an empty tag can never match rows with no categories
    relevant_sourdough_categories = [line.strip() for line in file if line.strip()]

# Fill NaN values with an empty string
sourdough['categories_tags'] = sourdough['categories_tags'].fillna('')

# Filter sourdough dataset by relevant categories tag
filtered_sourdough = sourdough[sourdough['categories_tags'].str.split(',').apply(lambda x: any(tag in relevant_sourdough_categories for tag in x))]

# Keep only the columns needed for the analysis (same list as for avocado)
filtered_sourdough = filtered_sourdough[cols]

print(filtered_sourdough.head())
print(filtered_sourdough.shape)

                    code  lc  ... origins origins_tags
4   10500016941075200179  fr  ...     NaN          NaN
5   10500016930534800159  fr  ...     NaN          NaN
9               10086988  en  ...     NaN          NaN
15              00252751  en  ...     NaN          NaN
25         5059697263682  en  ...     NaN          NaN

[5 rows x 14 columns]
(399, 14)


In [16]:
# Filter the datasets based on UK as the consumption country

uk_tags = ["en:united-kingdom"]

filtered_avocado = filtered_avocado[filtered_avocado['countries_tags'].fillna('').str.split(',').apply(lambda x: any(tag in uk_tags for tag in x))]

filtered_olive = filtered_olive[filtered_olive['countries_tags'].fillna('').str.split(',').apply(lambda x: any(tag in uk_tags for tag in x))]

filtered_sourdough = filtered_sourdough[filtered_sourdough['countries_tags'].fillna('').str.split(',').apply(lambda x: any(tag in uk_tags for tag in x))]

print(filtered_avocado.info())
print(filtered_olive.info())
print(filtered_sourdough.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17 entries, 361 to 1771
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   code             17 non-null     object
 1   lc               17 non-null     object
 2   product_name_en  14 non-null     object
 3   quantity         10 non-null     object
 4   serving_size     3 non-null      object
 5   packaging_tags   13 non-null     object
 6   brands           15 non-null     object
 7   brands_tags      15 non-null     object
 8   categories_tags  17 non-null     object
 9   labels_tags      5 non-null      object
 10  countries        17 non-null     object
 11  countries_tags   17 non-null     object
 12  origins          6 non-null      object
 13  origins_tags     6 non-null      object
dtypes: object(14)
memory usage: 2.0+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 172 entries, 23 to 8272
Data columns (total 14 columns):
 #   Column         

In [22]:
# Find the most common origin country for each ingredient

top_avocado_origin = filtered_avocado['origins'].value_counts().idxmax()
top_olive_oil_origin = filtered_olive['origins'].value_counts().idxmax()
top_sourdough_origin = filtered_sourdough['origins'].value_counts().idxmax()

print(top_avocado_origin)
print(top_olive_oil_origin)
print(top_sourdough_origin)



Peru
Greece
United Kingdom
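
The counts above use the free-text `origins` column, where the same country can appear under different spellings or languages and a single row can list several countries. As a possible alternative, here is a minimal sketch that counts individual tags from `origins_tags` instead. The helper name `top_origin` and the toy rows are assumptions made for illustration. Results from this approach may differ from the output above, and they would be tags such as `en:peru` rather than plain country names.

```python
import pandas as pd

def top_origin(df, column='origins_tags'):
    """Return the most frequent single origin tag, or None if none are recorded.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    tags = (
        df[column]
        .dropna()          # ignore products with no recorded origin
        .str.split(',')    # 'en:spain,en:tunisia' -> ['en:spain', 'en:tunisia']
        .explode()         # one row per individual tag
        .str.strip()
    )
    tags = tags[tags != '']
    if tags.empty:
        return None
    return tags.value_counts().idxmax()

# Toy data, invented for this example
toy = pd.DataFrame({
    'origins_tags': ['en:spain,en:tunisia', 'en:tunisia', None, 'en:greece'],
})
print(top_origin(toy))  # en:tunisia
```

With this approach, a product sourced from two countries contributes one count to each, instead of being counted as a separate combined origin.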
