# Food & Climate: Data-Driven Sustainability Project
![image.png](https://media.istockphoto.com/id/1805849861/pt/foto/harvesting-in-agriculture-crop-field.jpg?s=612x612&w=0&k=20&c=j00EhF1-GVLbuZtc7fzzWzTUSKXBhqTX_KFO3HLmZT0=) ![image.png](https://media.istockphoto.com/id/124677784/pt/foto/silhueta-de-tractor.jpg?s=612x612&w=0&k=20&c=KGG6YRaCGEOcwZRpkdAiokeOUUmz13HZ4FJcUhZ9L3g=)![image.png](https://media.istockphoto.com/id/1297005736/pt/foto/young-couple-villagers-with-milk-cans.jpg?s=612x612&w=0&k=20&c=b5s5kLwnW_QAyf5EbqdPiLR1JL99EgrLfi6sAv4S_84=) ![image.png](https://media.istockphoto.com/id/1479516556/pt/foto/grain-field-inspection.jpg?s=612x612&w=0&k=20&c=WHYjpj8KQKK7hBg64lyCPu663heNACoG9cngxgNeM-E=) 

# Table of contents:



## 1. Project Overview

1.1. Introduction and KPIs:

Over the past century, the global population has surged, with projections by the United Nations (2019) estimating it could reach 10 billion by 2050. This rapid growth has led to a corresponding increase in demand for food, energy, and water, placing immense pressure on natural resources and global food systems.

Technological advancements and economic expansion over the last 70 years have helped meet these demands, but they have also contributed to environmental challenges. One of the most pressing concerns is climate change, with consistent data indicating that Earth's annual maximum temperatures are rising at an alarming rate.

Agriculture and livestock production play a significant role in this phenomenon, accounting for an estimated 25-30% of total global greenhouse gas emissions. The goal of this project is therefore to analyze historical global food production trends, quantify greenhouse gas emissions from food systems, and explore the relationship between agriculture-related emissions and climate anomalies. The analysis uses Python, Google Cloud Platform with BigQuery, and Tableau for data visualization, together with hypothesis testing and other statistical methods.

To that end, the project defines the following KPIs:
* KPI: Total Global Food Production (by weight)
* KPI: GHG Emissions per Food Category/Product (Total & Per Kg)
* KPI: Country-Level GHG Emissions from Food Production
* KPI: Per Capita GHG Emissions from Food Production
* KPI: Correlation between Total Food Production/GHG Emissions and Global/Regional Temperature Anomalies.
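
As a rough illustration of how the production and emissions KPIs can be combined, the sketch below multiplies a production figure (the FAO data reports it in units of 1000 tonnes) by a per-kilogram emission factor. The function name `total_emissions_mt` is hypothetical, and pairing the FAO item "Wheat and products" with the "Wheat & Rye (Bread)" emission factor is an assumption made only for this example. The two input values come from the dataset previews shown later in this notebook.

```python
def total_emissions_mt(production_kt: float, factor_kg_co2e_per_kg: float) -> float:
    """Return total emissions in million tonnes (Mt) of CO2-equivalent."""
    production_kg = production_kt * 1_000 * 1_000  # 1000 tonnes -> tonnes -> kg
    emissions_kg = production_kg * factor_kg_co2e_per_kg
    return emissions_kg / 1e9  # kg -> Mt

# Afghanistan, wheat as food, 2013: 4895 (unit: 1000 tonnes)
# Assumed emission factor: 1.4 kg CO2e per kg of product
wheat_2013_mt = total_emissions_mt(4895, 1.4)
print(f"{wheat_2013_mt:.3f} Mt CO2e")
```

Keeping the unit conversion in one small function makes it harder to mix up thousands of tonnes and kilograms when the two datasets are merged.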

1.2. Research Questions & Key Focus Areas:

💡 Global Food Production Trends
* How has food and feed production evolved over time?
* Which countries are the largest producers?
* Which countries are major contributors to both food production and its environmental impact?
* Which food products and feedstocks dominate global production?

💡 Environmental Impact & CO₂ Emissions
* Which foods contribute the most to greenhouse gas emissions?
* What production stages generate the highest emissions?
* How has the global temperature varied over the years?
* Which countries have experienced the most significant temperature changes?
* Is there a discernible relationship between food production, its environmental impact, and rising temperature anomalies?

1.3. Data Selection & Sources

For a comprehensive analysis, this project integrates three datasets, all available on Kaggle:

📌 Food Production Data (FAOSTAT - Food and Agriculture Organization)
[Kaggle: https://www.kaggle.com/datasets/dorbicycle/world-foodfeed-production]
- Covers global food and feed production from 1961 to 2013
- Includes country-level production statistics for various food items
- Source: FAOSTAT Food Production Dataset

📌 Environmental Impact of Food (Our World in Data)
[Kaggle: https://www.kaggle.com/datasets/selfvivek/environment-impact-of-food-production]
- Provides data on greenhouse gas emissions from 43 major food products
- Covers various stages of the production chain (e.g., farming, processing, transportation)
- Source: Environmental Impact of Food Dataset

📌 Climate Data - Temperature Anomalies (FAOSTAT) 
[Kaggle: https://www.kaggle.com/datasets/sevgisarac/temperature-change/data?select=Environment_Temperature_change_E_All_Data_NOFLAG.csv]
- Tracks annual surface temperature variations per country from 1961 to 2019
- Records temperature anomalies compared to the baseline period (1951-1980)
- Source: FAOSTAT Temperature Change Dataset

These datasets will be cleaned, merged, and analyzed to uncover patterns in food production, regional emissions hotspots, and climate trends linked to agriculture.

1.4. Hypothesis Testing: Structured Approach
* 📌 Null Hypothesis (H₀): There is no significant difference between the mean GHG emissions of meat & dairy products and those of plant-based food products.
* 📌 Alternative Hypothesis (H₁): Meat & dairy production generates significantly higher GHG emissions than plant-based food.
* ✅ Statistical Test: Use t-test for independent samples to compare emissions from meat/dairy vs. plant-based products.
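
The test statistic behind such a comparison can be sketched with the standard library alone. The example below uses Welch's variant of the independent-samples t-test, which does not assume equal variances; that choice, the function name `welch_t`, and the sample numbers are all assumptions for illustration and are not taken from the dataset.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and degrees of freedom for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    se_a = variance(sample_a) / n_a  # squared standard error of each sample mean
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof

# Placeholder numbers for illustration only (NOT taken from the dataset)
meat_dairy = [20.0, 35.0, 12.0, 9.0, 25.0]
plant_based = [1.5, 0.9, 2.6, 4.0, 0.3, 1.1]

t_stat, dof = welch_t(meat_dairy, plant_based)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

With SciPy available, `scipy.stats.ttest_ind(meat_dairy, plant_based, equal_var=False, alternative="greater")` computes the same statistic and also returns the one-sided p-value.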

Technologies & Tools
* Data Engineering: Python (Pandas), Kaggle Datasets
* Storage & Querying: Google BigQuery, SQL
* Statistical Analysis: SciPy, Statsmodels
* Visualization: Matplotlib, Seaborn, Tableau


## 2. Data selection
* 2.1. Import the libraries 
* 2.2. Data Engineering: Python (Pandas), Kaggle Datasets using APIs
* 2.3. Storage & Querying: Google BigQuery, SQL

### 2.1. Import the libraries

In [115]:
# Import the libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import tabulate # Pretty print tabular data

from IPython.display import display
%matplotlib inline
import plotly.offline as py
import statsmodels.api as sm
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px
from plotly.subplots import make_subplots
py.init_notebook_mode(connected=True)
import plotly.io as pio
from sklearn.ensemble import IsolationForest # Machine Learning model for anomaly detection

# Set default renderer for vscode
pio.renderers.default = 'vscode'

# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter
import os
# Ensure the Kaggle folder exists
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)

import warnings
warnings.filterwarnings('ignore')

### 2.2. Data Engineering: Python (Pandas), Kaggle Datasets using APIs

#### 2.2.1. Set up the directory

In [116]:
# Import your data here
data = ".../Documents/GitHub/Food-Climate-Data-Driven-Sustainability"
os.makedirs(data, exist_ok=True)  # Create directory if it doesn't exist

#### 2.2.2. Download Datasets using APIs

In [117]:
# Download the datasets with the Kaggle CLI into the project folder
!kaggle datasets download -d dorbicycle/world-foodfeed-production -p {data}
!kaggle datasets download -d sevgisarac/temperature-change -p {data}
!kaggle datasets download -d selfvivek/environment-impact-of-food-production -p {data}

Dataset URL: https://www.kaggle.com/datasets/dorbicycle/world-foodfeed-production
License(s): CC0-1.0
Downloading world-foodfeed-production.zip to .../Documents/GitHub/Food-Climate-Data-Driven-Sustainability
  0%|                                                | 0.00/874k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 874k/874k [00:00<00:00, 1.18GB/s]
Dataset URL: https://www.kaggle.com/datasets/sevgisarac/temperature-change
License(s): Attribution 3.0 IGO (CC BY 3.0 IGO)
Downloading temperature-change.zip to .../Documents/GitHub/Food-Climate-Data-Driven-Sustainability
  0%|                                               | 0.00/4.07M [00:00<?, ?B/s]
100%|██████████████████████████████████████| 4.07M/4.07M [00:00<00:00, 1.54GB/s]
Dataset URL: https://www.kaggle.com/datasets/selfvivek/environment-impact-of-food-production
License(s): DbCL-1.0
Downloading environment-impact-of-food-production.zip to .../Documents/GitHub/Food-Climate-Data-Driven-Sustainability

#### 2.2.3. Extract the files

In [118]:
# Extract the downloaded zip archives
import glob     # find all pathnames matching a specified pattern
import zipfile  # read and extract zip archives

zip_files = glob.glob(f"{data}/*.zip")
print(zip_files)  # Lists all .zip files found

for zip_file in zip_files:
    with zipfile.ZipFile(zip_file, 'r') as z:
        z.extractall(data)
    os.remove(zip_file)  # Cleanup after extraction

['.../Documents/GitHub/Food-Climate-Data-Driven-Sustainability/world-foodfeed-production.zip', '.../Documents/GitHub/Food-Climate-Data-Driven-Sustainability/temperature-change.zip', '.../Documents/GitHub/Food-Climate-Data-Driven-Sustainability/environment-impact-of-food-production.zip']


In [119]:
# check the folder again
print(os.listdir(data))

['FAOSTAT_data_en_11-1-2024.csv', 'Food_Production.csv', 'FAOSTAT_data_11-24-2020.csv', 'FAO.csv', 'FAOSTAT_data_1-10-2022.csv', 'Environment_Temperature_change_E_All_Data_NOFLAG.csv']


If the extraction cell cannot be run, unzip the files manually on your computer into the same folder.

#### 2.2.4. Load CSV files in Jupyter Notebook

In [126]:
# Define file paths

# Food production dataset from FAOSTAT
food_production_file = f"{data}/FAO.csv"  

# Temperature change datasets from FAOSTAT (Food and Agriculture Organization of the United Nations)
temperature_file_2024 = f"{data}/FAOSTAT_data_en_11-1-2024.csv"
temperature_file_2022 = f"{data}/FAOSTAT_data_1-10-2022.csv"
temperature_file_2020 = f"{data}/FAOSTAT_data_11-24-2020.csv"  # loaded below as the country ISO-3 code table
temperature_file = f"{data}/Environment_Temperature_change_E_All_Data_NOFLAG.csv" # temperature change dataset (one column per year)

# Environment Impact of Food Production dataset
environment_file = f"{data}/Food_Production.csv"

In [127]:
# Load the 'food_production - FAO' dataset

# Install chardet if not already installed
%pip install chardet

import chardet #chardet is a character encoding detector

# Detect encoding
with open(food_production_file, 'rb') as f:
    result = chardet.detect(f.read())
    print(result['encoding'])  # Prints the detected encoding

# Load the CSV with the detected encoding
food_production_df = pd.read_csv(food_production_file, encoding=result['encoding'])

# Create a copy of the DataFrame
food_production_df_copy = food_production_df.copy()

Note: you may need to restart the kernel to use updated packages.
ISO-8859-1


In [128]:
# Load the 'Environment Impact of Food Production' dataset

# Detect encoding
with open(environment_file, 'rb') as f:
    result = chardet.detect(f.read()) #'rb' mode is used to read the file in binary format
    print(result['encoding'])  # Prints the detected encoding

# Load the CSV with the detected encoding
environment_file_df = pd.read_csv(environment_file, encoding=result['encoding'])

# Create a copy of the DataFrame
environment_file_df_copy = environment_file_df.copy()

utf-8


In [129]:
# Load the 'temperature change' dataset

# Detect encoding for temperature datasets
with open(temperature_file, 'rb') as f:
    result = chardet.detect(f.read())
    print(result['encoding'])  # Prints the detected encoding

# chardet misdetects this file (it reports 'Johab'), so the encoding is set explicitly.
# 'latin1' and 'ISO-8859-1' are aliases of the same codec, so a single load is enough.
temperature_df = pd.read_csv(temperature_file, encoding="ISO-8859-1")

# Create a copy of the DataFrame
temperature_df_copy = temperature_df.copy()

Johab
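
Because automatic detection can guess wrong, as it does here, a simple alternative is to try a short list of candidate encodings in order. The sketch below is one way to do that with the standard library; the function name `pick_encoding` and the candidate list are assumptions, not part of the original notebook.

```python
def pick_encoding(raw: bytes, candidates=("utf-8", "ISO-8859-1")) -> str:
    """Return the first candidate encoding that decodes `raw` without errors."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

# The degree sign is a single byte (0xB0) in ISO-8859-1 but two bytes in UTF-8
latin_bytes = "Temperature change,°C".encode("ISO-8859-1")
utf8_bytes = "Temperature change,°C".encode("utf-8")
print(pick_encoding(latin_bytes))
print(pick_encoding(utf8_bytes))
```

ISO-8859-1 can decode any byte sequence, so it should stay last in the candidate list; the result can then be passed to `pd.read_csv(..., encoding=...)`.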


In [130]:
# Load the remaining 'temperature change' files and the country code table (pandas default encoding, UTF-8)
temperature_df_2024 = pd.read_csv(temperature_file_2024)
temperature_df_2022 = pd.read_csv(temperature_file_2022)
country_ISO3_df = pd.read_csv(temperature_file_2020)

# Create copies of the DataFrames
temperature_df_2024_copy = temperature_df_2024.copy()
temperature_df_2022_copy = temperature_df_2022.copy()
country_ISO3_df_copy = country_ISO3_df.copy() #country ISO codes dataset

## 3. Data Collection & Preparation
* 3.1. Sourcing Datasets
    * 3.1.1. Identifying datasets (food production, greenhouse gas emissions, climate anomalies)
        * 3.1.1.1. Food and feed production dataset (food_production_df)
        * 3.1.1.2. Greenhouse gas emissions from food production dataset (environment_file_df)
        * 3.1.1.3. Temperature change dataset (temperature_df)
            * 3.1.1.3.1 ISO-3 country code dataset

 * 3.2. Data cleaning and transformation
 * 3.3. Merging datasets
 * 3.4. Creating KPIs

#### 3.1. Sourcing Datasets. Identifying the datasets and previewing the data

* Food and feed production worldwide dataset (food_production_df)

In [131]:
# Preview the dataset food_production_df
food_production_df

Unnamed: 0,Area Abbreviation,Area Code,Area,Item Code,Item,Element Code,Element,Unit,latitude,longitude,...,Y2004,Y2005,Y2006,Y2007,Y2008,Y2009,Y2010,Y2011,Y2012,Y2013
0,AFG,2,Afghanistan,2511,Wheat and products,5142,Food,1000 tonnes,33.94,67.71,...,3249.0,3486.0,3704.0,4164.0,4252.0,4538.0,4605.0,4711.0,4810,4895
1,AFG,2,Afghanistan,2805,Rice (Milled Equivalent),5142,Food,1000 tonnes,33.94,67.71,...,419.0,445.0,546.0,455.0,490.0,415.0,442.0,476.0,425,422
2,AFG,2,Afghanistan,2513,Barley and products,5521,Feed,1000 tonnes,33.94,67.71,...,58.0,236.0,262.0,263.0,230.0,379.0,315.0,203.0,367,360
3,AFG,2,Afghanistan,2513,Barley and products,5142,Food,1000 tonnes,33.94,67.71,...,185.0,43.0,44.0,48.0,62.0,55.0,60.0,72.0,78,89
4,AFG,2,Afghanistan,2514,Maize and products,5521,Feed,1000 tonnes,33.94,67.71,...,120.0,208.0,233.0,249.0,247.0,195.0,178.0,191.0,200,200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21472,ZWE,181,Zimbabwe,2948,Milk - Excluding Butter,5142,Food,1000 tonnes,-19.02,29.15,...,373.0,357.0,359.0,356.0,341.0,385.0,418.0,457.0,426,451
21473,ZWE,181,Zimbabwe,2960,"Fish, Seafood",5521,Feed,1000 tonnes,-19.02,29.15,...,5.0,4.0,9.0,6.0,9.0,5.0,15.0,15.0,15,15
21474,ZWE,181,Zimbabwe,2960,"Fish, Seafood",5142,Food,1000 tonnes,-19.02,29.15,...,18.0,14.0,17.0,14.0,15.0,18.0,29.0,40.0,40,40
21475,ZWE,181,Zimbabwe,2961,"Aquatic Products, Other",5142,Food,1000 tonnes,-19.02,29.15,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0


Content:
The Food and Agriculture Organization of the United Nations (FAO) provides free access to food and agriculture data for over 245 countries and territories, from 1961 to the most recent update (which depends on the dataset). One dataset in the FAO database is the Food Balance Sheets, which present a comprehensive picture of the pattern of a country's food supply during a specified reference period. The last update loaded to the FAO database for this dataset was in 2013. For each food item, the food balance sheet shows the sources of supply and its utilization. This subset of the dataset focuses on two utilizations of each food item:

* Food - refers to the total amount of the food item available as human food during the reference period.
* Feed - refers to the quantity of the food item available for feeding to the livestock and poultry during the reference period.

Acknowledgements
This dataset was meticulously gathered, organized and published by the Food and Agriculture Organization of the United Nations.

The dataset has the following columns:
* Area Abbreviation: Country name abbreviation. |Object|
* Area code: Country code. |int64|
* Area: Country name |Object|
* Item code: Food item code. |int64|
* Item: Food item. |Object|
* Element code: Food or Feed code |int64|
* Element: Food or Feed |Object|
* Unit: Unit of measurement. |Object| |Unique value|
* Latitude: Latitude |float64|
* Longitude: Longitude |float64|
* Years from 1961 to 2011 |float64|
* Years 2012 to 2013 |int64|

* Greenhouse gas emissions from food production dataset (environment_file_df)

In [132]:
# Preview the dataset environment_file_df
environment_file_df

Unnamed: 0,Food product,Land use change,Animal Feed,Farm,Processing,Transport,Packging,Retail,Total_emissions,Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal),...,Freshwater withdrawals per 100g protein (liters per 100g protein),Freshwater withdrawals per kilogram (liters per kilogram),Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal),Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein),Land use per 1000kcal (m² per 1000kcal),Land use per kilogram (m² per kilogram),Land use per 100g protein (m² per 100g protein),Scarcity-weighted water use per kilogram (liters per kilogram),Scarcity-weighted water use per 100g protein (liters per 100g protein),Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories)
0,Wheat & Rye (Bread),0.1,0.0,0.8,0.2,0.1,0.1,0.1,1.4,,...,,,,,,,,,,
1,Maize (Meal),0.3,0.0,0.5,0.1,0.1,0.1,0.0,1.1,,...,,,,,,,,,,
2,Barley (Beer),0.0,0.0,0.2,0.1,0.0,0.5,0.3,1.1,,...,,,,,,,,,,
3,Oatmeal,0.0,0.0,1.4,0.0,0.1,0.1,0.0,1.6,4.281357,...,371.076923,482.4,0.945482,1.907692,2.897446,7.6,5.846154,18786.2,14450.92308,7162.104461
4,Rice,0.0,0.0,3.6,0.1,0.1,0.1,0.1,4.0,9.514379,...,3166.760563,2248.4,1.207271,6.267606,0.759631,2.8,3.943662,49576.3,69825.77465,13449.89148
5,Potatoes,0.0,0.0,0.2,0.0,0.1,0.0,0.0,0.3,4.754098,...,347.647059,59.1,0.628415,2.705882,1.202186,0.88,5.176471,2754.2,16201.17647,3762.568306
6,Cassava,0.6,0.0,0.2,0.0,0.1,0.0,0.0,0.9,0.708419,...,,0.0,1.355236,14.666667,1.858316,1.81,20.111111,0.0,,
7,Cane Sugar,1.2,0.0,0.5,0.0,0.8,0.1,0.0,2.6,4.820513,...,,620.1,0.911681,,0.581197,2.04,,16438.6,,4683.361823
8,Beet Sugar,0.0,0.0,0.5,0.2,0.6,0.1,0.0,1.4,1.541311,...,,217.7,0.51567,,0.521368,1.83,,9493.3,,2704.643875
9,Other Pulses,0.0,0.0,1.1,0.0,0.1,0.4,0.0,1.6,5.008798,...,203.503036,435.7,0.524927,0.836058,4.565982,15.57,7.272303,22477.4,10498.55208,


Environment Impact of Food Production dataset

Content:
This dataset covers 43 of the most common foods grown across the globe, with 23 columns describing their land use, water use, and carbon footprint.

Columns:

* Food product: Food product name |object|
* Land use change: kg CO₂-equivalents per kg of product |float64|
* Animal Feed: kg CO₂-equivalents per kg of product |float64|
* Farm: kg CO₂-equivalents per kg of product |float64|
* Processing: kg CO₂-equivalents per kg of product |float64|
* Transport: kg CO₂-equivalents per kg of product |float64|
* Packaging (spelled `Packging` in the file): kg CO₂-equivalents per kg of product |float64|
* Retail: kg CO₂-equivalents per kg of product |float64|

These columns give greenhouse gas emissions per kg of food product (kg CO₂-equivalents per kg of product) at each stage of the food production lifecycle.

Eutrophication – the pollution of water bodies and ecosystems with excess nutrients – is a major environmental problem. The runoff of nitrogen and other nutrients from agricultural production systems is a leading contributor.

Acknowledgements
https://ourworldindata.org



* Temperature change dataset (temperature_df)

In [133]:
# Preview the dataset temperature_df
temperature_df

Unnamed: 0,Area Code,Area,Months Code,Months,Element Code,Element,Unit,Y1961,Y1962,Y1963,...,Y2010,Y2011,Y2012,Y2013,Y2014,Y2015,Y2016,Y2017,Y2018,Y2019
0,2,Afghanistan,7001,January,7271,Temperature change,°C,0.777,0.062,2.744,...,3.601,1.179,-0.583,1.233,1.755,1.943,3.416,1.201,1.996,2.951
1,2,Afghanistan,7001,January,6078,Standard Deviation,°C,1.950,1.950,1.950,...,1.950,1.950,1.950,1.950,1.950,1.950,1.950,1.950,1.950,1.950
2,2,Afghanistan,7002,February,7271,Temperature change,°C,-1.743,2.465,3.919,...,1.212,0.321,-3.201,1.494,-3.187,2.699,2.251,-0.323,2.705,0.086
3,2,Afghanistan,7002,February,6078,Standard Deviation,°C,2.597,2.597,2.597,...,2.597,2.597,2.597,2.597,2.597,2.597,2.597,2.597,2.597,2.597
4,2,Afghanistan,7003,March,7271,Temperature change,°C,0.516,1.336,0.403,...,3.390,0.748,-0.527,2.246,-0.076,-0.497,2.296,0.834,4.418,0.234
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9651,5873,OECD,7018,JunJulAug,6078,Standard Deviation,°C,0.247,0.247,0.247,...,0.247,0.247,0.247,0.247,0.247,0.247,0.247,0.247,0.247,0.247
9652,5873,OECD,7019,SepOctNov,7271,Temperature change,°C,0.036,0.461,0.665,...,0.958,1.106,0.885,1.041,0.999,1.670,1.535,1.194,0.581,1.233
9653,5873,OECD,7019,SepOctNov,6078,Standard Deviation,°C,0.378,0.378,0.378,...,0.378,0.378,0.378,0.378,0.378,0.378,0.378,0.378,0.378,0.378
9654,5873,OECD,7020,Meteorological year,7271,Temperature change,°C,0.165,-0.009,0.134,...,1.246,0.805,1.274,0.991,0.811,1.282,1.850,1.349,1.088,1.297



Temperature change dataset

Context:

The FAOSTAT Temperature Change domain disseminates statistics of mean surface temperature change by country, with annual updates. The current dissemination covers the period 1961–2023. Statistics are available for monthly, seasonal and annual mean temperature anomalies, i.e., temperature change with respect to a baseline climatology, corresponding to the period 1951–1980. The standard deviation of the temperature change of the baseline methodology is also available. Data are based on the publicly available GISTEMP data, the Global Surface Temperature Change data distributed by the National Aeronautics and Space Administration Goddard Institute for Space Studies (NASA-GISS).

Content:
* Statistical standards: country and regional calculations employ a definition of “Land area” consistent with SEEA Land Use definitions 

Columns:
* Area Code: Numerical code of the area (integer).
* Area: Countries and territories (in 2019: 190 countries and 37 other territorial entities) (object).
* Months Code: Numerical code of the month, season, or meteorological year (integer).
* Months: Months, seasons, and meteorological year (object).
* Element Code: Numerical code of the element (integer).
* Element: 'Temperature change' or 'Standard Deviation' (object).
* Unit: Degrees Celsius, °C (object).
* Years: One column per year, from Y1961 to Y2019.
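
This file stores one column per year (wide format), whereas the 2022 and 2024 FAOSTAT extracts previewed below use `Year` and `Value` columns (long format). A minimal sketch of reshaping the wide layout with `DataFrame.melt` follows; the toy frame reuses three values from the Afghanistan/January preview row, and the variable names are assumptions.

```python
import pandas as pd

# Toy frame with the same layout as temperature_df
wide = pd.DataFrame({
    "Area": ["Afghanistan"],
    "Months": ["January"],
    "Element": ["Temperature change"],
    "Y1961": [0.777],
    "Y1962": [0.062],
    "Y1963": [2.744],
})

# Year columns are the ones named 'Y' followed by digits
year_cols = [c for c in wide.columns if c.startswith("Y") and c[1:].isdigit()]
long = wide.melt(
    id_vars=["Area", "Months", "Element"],
    value_vars=year_cols,
    var_name="Year",
    value_name="Value",
)
long["Year"] = long["Year"].str[1:].astype(int)  # "Y1961" -> 1961
print(long)
```

The long layout has one row per area, period, and year, which is usually easier to filter, group, and join on `Year`.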

In [134]:
# Preview the dataset temperature_df_2024
temperature_df_2024

Unnamed: 0,Domain Code,Domain,Area Code (M49),Area,Element Code,Element,Months Code,Months,Year Code,Year,Unit,Value,Flag,Flag Description
0,ET,Temperature change on land,4,Afghanistan,7271,Temperature change,7001,January,1961,1961,°c,0.745,E,Estimated value
1,ET,Temperature change on land,4,Afghanistan,7271,Temperature change,7001,January,1962,1962,°c,0.015,E,Estimated value
2,ET,Temperature change on land,4,Afghanistan,7271,Temperature change,7001,January,1963,1963,°c,2.706,E,Estimated value
3,ET,Temperature change on land,4,Afghanistan,7271,Temperature change,7001,January,1964,1964,°c,-5.250,E,Estimated value
4,ET,Temperature change on land,4,Afghanistan,7271,Temperature change,7001,January,1965,1965,°c,1.854,E,Estimated value
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241888,ET,Temperature change on land,716,Zimbabwe,7271,Temperature change,7020,Meteorological year,2019,2019,°c,1.199,E,Estimated value
241889,ET,Temperature change on land,716,Zimbabwe,7271,Temperature change,7020,Meteorological year,2020,2020,°c,0.581,E,Estimated value
241890,ET,Temperature change on land,716,Zimbabwe,7271,Temperature change,7020,Meteorological year,2021,2021,°c,0.109,E,Estimated value
241891,ET,Temperature change on land,716,Zimbabwe,7271,Temperature change,7020,Meteorological year,2022,2022,°c,-0.251,E,Estimated value


In [135]:
# Preview the dataset temperature_df_2022
temperature_df_2022

Unnamed: 0,Domain Code,Domain,Area Code (FAO),Area,Element Code,Element,Months Code,Months,Year Code,Year,Unit,Value,Flag,Flag Description
0,ET,Temperature change,2,Afghanistan,7271,Temperature change,7001,January,1961,1961,?C,0.746,Fc,Calculated data
1,ET,Temperature change,2,Afghanistan,7271,Temperature change,7001,January,1962,1962,?C,0.009,Fc,Calculated data
2,ET,Temperature change,2,Afghanistan,7271,Temperature change,7001,January,1963,1963,?C,2.695,Fc,Calculated data
3,ET,Temperature change,2,Afghanistan,7271,Temperature change,7001,January,1964,1964,?C,-5.277,Fc,Calculated data
4,ET,Temperature change,2,Afghanistan,7271,Temperature change,7001,January,1965,1965,?C,1.827,Fc,Calculated data
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
229920,ET,Temperature change,181,Zimbabwe,7271,Temperature change,7020,Meteorological year,2016,2016,?C,1.470,Fc,Calculated data
229921,ET,Temperature change,181,Zimbabwe,7271,Temperature change,7020,Meteorological year,2017,2017,?C,0.443,Fc,Calculated data
229922,ET,Temperature change,181,Zimbabwe,7271,Temperature change,7020,Meteorological year,2018,2018,?C,0.747,Fc,Calculated data
229923,ET,Temperature change,181,Zimbabwe,7271,Temperature change,7020,Meteorological year,2019,2019,?C,1.359,Fc,Calculated data


* Country ISO-3 dataset

In [136]:
# Preview the dataset country_ISO3_df
country_ISO3_df

Unnamed: 0,Country Code,Country,M49 Code,ISO2 Code,ISO3 Code,Start Year,End Year
0,2,Afghanistan,4.0,AF,AFG,,
1,5100,Africa,2.0,,X06,,
2,284,Åland Islands,248.0,,ALA,,
3,3,Albania,8.0,AL,ALB,,
4,4,Algeria,12.0,DZ,DZA,,
...,...,...,...,...,...,...,...
316,246,Yemen Ar Rp,886.0,,,,
317,247,Yemen Dem,720.0,,,,
318,248,Yugoslav SFR,890.0,,,,1991.0
319,251,Zambia,894.0,ZM,ZMB,,


FAOSTAT Country ISO3 dataset

Context

Definitions and standards used in FAOSTAT

Columns:
* Country Code: Numeric country code (FAOSTAT). |int64|
* Country: Country name. |object|
* M49 Code: Country code (M49). |float64|
* ISO2 Code: 2-letter abbreviation of the country name. |object|
* ISO3 Code: 3-letter abbreviation of the country name. |object|
* Start Year: First year the entry was in use. |float64|
* End Year: Last year the entry was in use. |float64|
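
The code table makes it possible to attach ISO-3 codes to the temperature data. The sketch below joins on the FAO numeric code, using a few rows copied from the previews above; the frame names and the choice of a left join are assumptions for illustration.

```python
import pandas as pd

# Rows copied from the previews above
country_codes = pd.DataFrame({
    "Country Code": [2, 3, 4],
    "Country": ["Afghanistan", "Albania", "Algeria"],
    "M49 Code": [4.0, 8.0, 12.0],
    "ISO3 Code": ["AFG", "ALB", "DZA"],
})
temperature = pd.DataFrame({
    "Area Code (FAO)": [2, 2],
    "Area": ["Afghanistan", "Afghanistan"],
    "Year": [1961, 1962],
    "Value": [0.746, 0.009],
})

merged = temperature.merge(
    country_codes[["Country Code", "ISO3 Code"]],
    left_on="Area Code (FAO)",
    right_on="Country Code",
    how="left",      # keep every temperature row, even without a match
    validate="m:1",  # fail if an FAO code appears more than once in the code table
)
print(merged[["Area", "Year", "Value", "ISO3 Code"]])
```

The 2024 extract identifies areas by M49 code instead, so it would need `M49 Code` as the join key.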


#### 3.2. Data cleaning and transformation

This section cleans, transforms, and prepares the datasets for analysis, using Exploratory Data Analysis (EDA) to guide the cleaning.

Create a function to preview the data: (preview_data)
* display(df.head(num_rows)) Shows a preview of the first rows of the DataFrame (20 by default)
* print(df.shape) Shows the shape of the data: rows and columns
* print(df.columns.tolist()) Gives the columns of the dataset in a list format
* print(df.dtypes) Shows the data types of the columns, used later to classify them as categorical or numerical variables
* print(df.isnull().sum()) Shows the number of NaN values in each column

In [137]:
# Define a function to preview the data
def preview_data(df, num_rows=20):
    """Preview the first few rows of a DataFrame."""
    display(df.head(num_rows))
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print(f"Data Types:\n{df.dtypes}")
    print(f"Missing Values:\n{df.isnull().sum()}")

Importing Functions for EDA and filling the NaN values (Functions_DA_DS)

In [138]:
# Import the helper functions for EDA from a local module

import sys
sys.path.append('/Users/ivanacaridad/Documents/GitHub/Funtions')

from Functions_DA_DS import *

##### 3.2.1. Food and feed production worldwide dataset (food_production_df)

In [139]:
# Preview food production worldwide dataset
preview_data(food_production_df)

Unnamed: 0,Area Abbreviation,Area Code,Area,Item Code,Item,Element Code,Element,Unit,latitude,longitude,...,Y2004,Y2005,Y2006,Y2007,Y2008,Y2009,Y2010,Y2011,Y2012,Y2013
0,AFG,2,Afghanistan,2511,Wheat and products,5142,Food,1000 tonnes,33.94,67.71,...,3249.0,3486.0,3704.0,4164.0,4252.0,4538.0,4605.0,4711.0,4810,4895
1,AFG,2,Afghanistan,2805,Rice (Milled Equivalent),5142,Food,1000 tonnes,33.94,67.71,...,419.0,445.0,546.0,455.0,490.0,415.0,442.0,476.0,425,422
2,AFG,2,Afghanistan,2513,Barley and products,5521,Feed,1000 tonnes,33.94,67.71,...,58.0,236.0,262.0,263.0,230.0,379.0,315.0,203.0,367,360
3,AFG,2,Afghanistan,2513,Barley and products,5142,Food,1000 tonnes,33.94,67.71,...,185.0,43.0,44.0,48.0,62.0,55.0,60.0,72.0,78,89
4,AFG,2,Afghanistan,2514,Maize and products,5521,Feed,1000 tonnes,33.94,67.71,...,120.0,208.0,233.0,249.0,247.0,195.0,178.0,191.0,200,200
5,AFG,2,Afghanistan,2514,Maize and products,5142,Food,1000 tonnes,33.94,67.71,...,231.0,67.0,82.0,67.0,69.0,71.0,82.0,73.0,77,76
6,AFG,2,Afghanistan,2517,Millet and products,5142,Food,1000 tonnes,33.94,67.71,...,15.0,21.0,11.0,19.0,21.0,18.0,14.0,14.0,14,12
7,AFG,2,Afghanistan,2520,"Cereals, Other",5142,Food,1000 tonnes,33.94,67.71,...,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0
8,AFG,2,Afghanistan,2531,Potatoes and products,5142,Food,1000 tonnes,33.94,67.71,...,276.0,294.0,294.0,260.0,242.0,250.0,192.0,169.0,196,230
9,AFG,2,Afghanistan,2536,Sugar cane,5521,Feed,1000 tonnes,33.94,67.71,...,50.0,29.0,61.0,65.0,54.0,114.0,83.0,83.0,69,81


Shape: (21477, 63)
Columns: ['Area Abbreviation', 'Area Code', 'Area', 'Item Code', 'Item', 'Element Code', 'Element', 'Unit', 'latitude', 'longitude', 'Y1961', 'Y1962', 'Y1963', 'Y1964', 'Y1965', 'Y1966', 'Y1967', 'Y1968', 'Y1969', 'Y1970', 'Y1971', 'Y1972', 'Y1973', 'Y1974', 'Y1975', 'Y1976', 'Y1977', 'Y1978', 'Y1979', 'Y1980', 'Y1981', 'Y1982', 'Y1983', 'Y1984', 'Y1985', 'Y1986', 'Y1987', 'Y1988', 'Y1989', 'Y1990', 'Y1991', 'Y1992', 'Y1993', 'Y1994', 'Y1995', 'Y1996', 'Y1997', 'Y1998', 'Y1999', 'Y2000', 'Y2001', 'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007', 'Y2008', 'Y2009', 'Y2010', 'Y2011', 'Y2012', 'Y2013']
Data Types:
Area Abbreviation     object
Area Code              int64
Area                  object
Item Code              int64
Item                  object
                      ...   
Y2009                float64
Y2010                float64
Y2011                float64
Y2012                  int64
Y2013                  int64
Length: 63, dtype: object
Missing Values

* Year columns such as Y2009, Y2010, and Y2011 contain NaN values (they are stored as float64, while Y2012 and Y2013 are int64)
* Data type conversion: Year columns should be standardized as INT
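
A plain `int` column cannot hold NaN, so one option is pandas' nullable `Int64` dtype, which keeps missing values as `<NA>`. The sketch below shows the idea on a toy frame; the NaN is inserted for illustration, and this is only one possible approach, not necessarily the one used later in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one float year column with a gap, one int year column
df = pd.DataFrame({"Y2011": [4711.0, np.nan, 203.0], "Y2013": [4895, 422, 360]})

# Year columns are the ones named 'Y' followed by digits
year_cols = [c for c in df.columns if c.startswith("Y") and c[1:].isdigit()]
df = df.astype({c: "Int64" for c in year_cols})  # nullable integer; NaN becomes <NA>

print(df.dtypes)
print(df)
```

The conversion only succeeds when every non-missing value is a whole number, which holds for the values shown in the previews.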

In [140]:
# Check the types of variables in the food production dataset
food_production_df.dtypes

Area Abbreviation     object
Area Code              int64
Area                  object
Item Code              int64
Item                  object
                      ...   
Y2009                float64
Y2010                float64
Y2011                float64
Y2012                  int64
Y2013                  int64
Length: 63, dtype: object

In [141]:
# We have categorical and numerical columns
# Check the number of unique values in each column
food_production_df.nunique()

Area Abbreviation     169
Area Code             174
Area                  174
Item Code             117
Item                  115
                     ... 
Y2009                2029
Y2010                2046
Y2011                2081
Y2012                2084
Y2013                2107
Length: 63, dtype: int64

##### 3.2.1.1. Standardizing column names (normalization)

In [142]:
# Remove the leading 'Y' from the year columns (e.g. 'Y1961' -> '1961')
food_production_df.rename(
    columns={x: x[1:] for x in food_production_df.columns if x.startswith('Y') and x[1:].isdigit()},
    inplace=True
)

# Delete unnecessary columns
food_production_df.drop(columns=['Area Code', 'Item Code', 'Element Code'], inplace=True)

# Change the column names for simplicity
# Note: the key 'Area Abreviation' does not match the actual column 'Area Abbreviation',
# so that column keeps its name (it becomes 'area abbreviation' after lowercasing below)
food_production_df.rename(columns={
    'Area': 'country',
    'Area Abreviation': 'country_code',
    'Item': 'food_item',
    'Element': 'element_type',
    'Unit': 'unit_of_measurement'
}, inplace=True)

# Change the column names to lowercase
food_production_df.rename(columns=lambda x: x.lower(), inplace=True)

##### 3.2.1.2. Missing values and quantitative variables

* Verify the percentage of null values in the df
* How many missing values are there?
* Is there a reason for this?

Step 1: Meet your data and check for null values (NaN)

In [143]:
# Check the shape of the dataset
print(f"The dataset has {food_production_df.shape[0]} rows and {food_production_df.shape[1]} columns.")

The dataset has 21477 rows and 60 columns.


Check the percentage of null values

In [144]:
help(percentage_nullValues)

Help on function percentage_nullValues in module Functions_DA_DS:

percentage_nullValues(data)
    Function that calculates the percentage of missing values in every column of your dataset
    input: data --> dataframe
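
The source of `percentage_nullValues` is not shown in this notebook. A minimal sketch of a function with the behavior described in the help text could look like the following; the name `null_percentage` is hypothetical, and this is not the author's implementation. The NaN values in the toy frame are inserted for illustration.

```python
import numpy as np
import pandas as pd

def null_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)

# Toy frame: two of the four values in column '1961' are missing
toy = pd.DataFrame({
    "1961": [1928.0, np.nan, 76.0, np.nan],
    "2013": [4895, 422, 360, 89],
})
pct = null_percentage(toy)
print(pct)
```

Taking the mean of the boolean mask gives the fraction of missing values directly, so no separate division by the row count is needed.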



In [145]:
preview_data(food_production_df)

Unnamed: 0,area abbreviation,country,food_item,element_type,unit_of_measurement,latitude,longitude,1961,1962,1963,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
0,AFG,Afghanistan,Wheat and products,Food,1000 tonnes,33.94,67.71,1928.0,1904.0,1666.0,...,3249.0,3486.0,3704.0,4164.0,4252.0,4538.0,4605.0,4711.0,4810,4895
1,AFG,Afghanistan,Rice (Milled Equivalent),Food,1000 tonnes,33.94,67.71,183.0,183.0,182.0,...,419.0,445.0,546.0,455.0,490.0,415.0,442.0,476.0,425,422
2,AFG,Afghanistan,Barley and products,Feed,1000 tonnes,33.94,67.71,76.0,76.0,76.0,...,58.0,236.0,262.0,263.0,230.0,379.0,315.0,203.0,367,360
3,AFG,Afghanistan,Barley and products,Food,1000 tonnes,33.94,67.71,237.0,237.0,237.0,...,185.0,43.0,44.0,48.0,62.0,55.0,60.0,72.0,78,89
4,AFG,Afghanistan,Maize and products,Feed,1000 tonnes,33.94,67.71,210.0,210.0,214.0,...,120.0,208.0,233.0,249.0,247.0,195.0,178.0,191.0,200,200
5,AFG,Afghanistan,Maize and products,Food,1000 tonnes,33.94,67.71,403.0,403.0,410.0,...,231.0,67.0,82.0,67.0,69.0,71.0,82.0,73.0,77,76
6,AFG,Afghanistan,Millet and products,Food,1000 tonnes,33.94,67.71,17.0,18.0,19.0,...,15.0,21.0,11.0,19.0,21.0,18.0,14.0,14.0,14,12
7,AFG,Afghanistan,"Cereals, Other",Food,1000 tonnes,33.94,67.71,0.0,0.0,0.0,...,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0
8,AFG,Afghanistan,Potatoes and products,Food,1000 tonnes,33.94,67.71,111.0,97.0,103.0,...,276.0,294.0,294.0,260.0,242.0,250.0,192.0,169.0,196,230
9,AFG,Afghanistan,Sugar cane,Feed,1000 tonnes,33.94,67.71,45.0,45.0,45.0,...,50.0,29.0,61.0,65.0,54.0,114.0,83.0,83.0,69,81


Shape: (21477, 60)
Columns: ['area abbreviation', 'country', 'food_item', 'element_type', 'unit_of_measurement', 'latitude', 'longitude', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
Data Types:
area abbreviation       object
country                 object
food_item               object
element_type            object
unit_of_measurement     object
latitude               float64
longitude              float64
1961                   float64
1962                   float64
1963                   float64
1964                   float64
1965                   float64
1966                   float64
1967                   

In [146]:
percentage_nullValues(food_production_df)

Unnamed: 0,Percentage_NaN
1984,16.5
1981,16.5
1974,16.5
1975,16.5
1976,16.5
1977,16.5
1978,16.5
1979,16.5
1982,16.5
1972,16.5


* The columns with the highest percentages of missing values are the years 1961 to 1991 (16.5% for 1961 to 1989 and 15.9% for 1990 and 1991).

The dataset documentation states:
    "The Food Balance sheet's data was relatively complete. A few countries that do not exist anymore, such as Czechoslovakia, were deleted from the database. Countries which were formed lately such as South Sudan were kept, even though they do not have all full data going back to 1961. In addition, data aggregation for the 7 different continents was available as well, but was not added to the dataset. Food and feed production by country and food item from 1961 to 2013, including geocoding. Y1961 - Y2011 are production years that show the amount of food item produced in 1000 tonnes."

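It therefore makes sense that data are unavailable for some countries: because of geopolitical changes, several of today's countries did not exist as separate states for the whole period.

* From 1992 to 2011 the share of missing values falls from about 4.6% (1992) to about 0.5% (2006 to 2011).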
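
As a quick check of these percentages, the arithmetic can be reproduced from the row count (21477) and the per-year missing-value counts printed in the isnull().sum() output further down in this notebook. This is only a minimal sketch: the function name pct_missing is an assumption, and a single representative year is used for each block of identical counts.

```python
# Row count and missing-value counts copied from the notebook outputs
total_rows = 21477
missing_counts = {
    "1961": 3539,  # same count for 1961 to 1989
    "1990": 3415,  # same count for 1990 and 1991
    "1992": 987,
    "1993": 612,   # same count for 1993 to 1999
    "2000": 349,   # same count for 2000 to 2005
    "2006": 104,   # same count for 2006 to 2011
}

def pct_missing(n_missing, n_rows):
    """Share of missing values in percent, rounded to one decimal."""
    return round(n_missing / n_rows * 100, 1)

percentages = {year: pct_missing(n, total_rows) for year, n in missing_counts.items()}
for year, pct in percentages.items():
    print(f"{year}: {pct}%")
```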

In [147]:
# Create a subset of the df with the rows that contain at least one missing value
food_production_null = food_production_df[food_production_df.isnull().any(axis=1)]

# Set a random seed for reproducibility
np.random.seed(0)  # ensures NumPy's random numbers are the same each time the code is run

# Select a random sample of 5 rows from the DataFrame (random_state fixes the sample)
food_production_null.sample(n=5, random_state=0)

Unnamed: 0,area abbreviation,country,food_item,element_type,unit_of_measurement,latitude,longitude,1961,1962,1963,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
17328,SVK,Slovakia,Millet and products,Feed,1000 tonnes,48.67,19.7,,,,...,2.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1,1
13020,MNE,Montenegro,Apples and products,Food,1000 tonnes,42.71,19.37,,,,...,,,11.0,19.0,34.0,38.0,33.0,27.0,14,13
17526,SVN,Slovenia,Pigmeat,Food,1000 tonnes,46.15,15.0,,,,...,88.0,88.0,89.0,84.0,84.0,80.0,81.0,76.0,69,58
11128,LVA,Latvia,Beer,Food,1000 tonnes,56.88,24.6,,,,...,124.0,144.0,147.0,158.0,156.0,146.0,164.0,172.0,163,156
16103,RUS,Russian Federation,Coconuts - Incl Copra,Food,1000 tonnes,61.52,105.32,,,,...,47.0,52.0,45.0,59.0,42.0,39.0,48.0,63.0,57,63


For the five randomly selected rows (Slovakia, Montenegro, Slovenia, Latvia and the Russian Federation), production data are missing from 1961 to 1991 (for Montenegro, until 2005).

Source: https://www.wikipedia.org/
* Slovakia: From 1948 to 1989, Slovakia was part of communist Czechoslovakia, with strict control over its western borders and alignment with the Soviet bloc. Around 600 people died trying to flee the country, and over 8,000 were sent to forced labor camps. In 1960, it became the Czechoslovak Socialist Republic. The 1968 Prague Spring, a short-lived liberalization led by Alexander Dubček, was crushed by a Warsaw Pact invasion, killing 137 civilians. In 1969, the country became a federation of Czech and Slovak republics. Czechoslovakia supported communist allies such as North Korea, North Vietnam, and Cuba. The Velvet Revolution in 1989 ended communist rule peacefully. Slovakia declared sovereignty in 1992, and Czechoslovakia dissolved peacefully in 1993. The early years of Slovak independence were marked by political instability, crime, and economic hardship. Reformist Prime Minister Mikuláš Dzurinda took office in 1998, initiating market reforms and leading Slovakia into NATO, the EU, and the OECD.

* Montenegro: Between 1961 and 2005, Montenegro evolved from a Socialist Republic within Yugoslavia into a quasi-independent member of a new state union with Serbia. The 1963 constitution renamed it the Socialist Republic of Montenegro, and the 1974 federal constitution granted increased autonomy. With the collapse of communism in the early 1990s, Montenegro transitioned into a dominant-party parliamentary republic, joining the Federal Republic of Yugoslavia in 1992 after a referendum. That same year, its capital reverted from Titograd to Podgorica, and a new tricolor flag was adopted in 1993. In 1996, under President Milo Đukanović, Montenegro began distancing its economic policy from Serbia, notably by adopting the Deutsche Mark and later the euro. In 2003, the Federal Republic of Yugoslavia restructured as the looser State Union of Serbia and Montenegro under the Belgrade Agreement and Constitutional Charter. By 2005, Montenegro was on a firm path toward full sovereignty, setting the stage for its 2006 referendum on independence.

* Slovenia: Between 1961 and 1991, Slovenia (then the Socialist Republic of Slovenia within Yugoslavia) underwent economic, political, and constitutional transformation. Economically, it was Yugoslavia’s most advanced republic, producing roughly one-fifth of its GDP and one-third of its exports. Politically, the 1974 federal constitution granted Slovenia greater autonomy, nurturing a more open socialist system. In the late 1980s, a growing civic and intellectual movement, sparked by calls for democratization, led to constitutional amendments in 1989, introducing parliamentary democracy. On 7 March 1990 the republic dropped “Socialist” from its name, and on 8 April 1990 Slovenia held its first multiparty elections, which ushered in the DEMOS coalition under Jože Pučnik. A decisive independence referendum on 23 December 1990 resulted in over 88% approval, leading to a declaration of independence on 25 June 1991 and a brief Ten-Day War with the Yugoslav People’s Army. After the Brijuni Agreement, Slovenia achieved full sovereignty, paving the way toward its future as an independent democratic state.

* Latvia: Between 1961 and 1991, Latvia was part of the Soviet Union as the Latvian Soviet Socialist Republic, enduring Russification and industrial expansion. In 1961, traditional Latvian customs such as Midsummer celebrations were banned under Party leader Arvīds Pelše, who also purged nationalist elements and saw an influx of Russian-speaking workers to staff new factories. During the mid-1980s, reforms under Gorbachev (perestroika and glasnost) ignited national revival, leading to the foundation of the Popular Front of Latvia and other civic groups. On 4 May 1990, Latvia’s Supreme Council declared restoration of independence and dropped the “Soviet” designation. In January 1991, during “The Barricades,” civilians resisted Soviet force in Riga, suffering several casualties. A decisive referendum on 3 March 1991 saw nearly 75% in favour of independence. Following the failed August Soviet coup, Latvia formally restored its independence on 21 August 1991, with widespread international recognition shortly thereafter.

* Russian Federation: Between 1961 and 1991, the Russian Soviet Federative Socialist Republic (RSFSR), the largest constituent of the USSR, saw dramatic political shifts. Following massive post-Stalin industrial and agricultural expansion, it functioned as a tightly controlled one-party state under the Communist Party. In June 1990, amidst growing demands for reform, the RSFSR declared state sovereignty, and on 12 June 1991, Boris Yeltsin was elected its first (and only) president, marking a pivotal break from Soviet rule. The failed August 1991 coup against Gorbachev weakened the Soviet structure, with Yeltsin emerging as the leading figure. On 8 December 1991, Russia, Ukraine, and Belarus signed the Belovezha Accords, effectively dismantling the USSR; it was formally dissolved on 26 December 1991. On 25 December, the RSFSR renamed itself the Russian Federation. As the USSR’s legal successor, Russia assumed its UN seat, nuclear arsenal, and international obligations.

The results are in line with the documentation, which explains the presence of missing values. The rows with missing values can therefore either be deleted from the df or filled with the mean.

Checking for Missing Values
data.isnull().sum()

| % Missing values | Take action | Watch out! |
| :- | :- | :- |
| < threshold | data.dropna() | Check the final number of rows that you get |
| > threshold | data.fillna() with data.variable.mode()[0] for categorical variables and data.variable.mean() for numerical variables | |
| > 50-60% | data.drop() | Check the final number of rows that you get |
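
The decision table above can be expressed as a small helper. The following is a minimal sketch, not the notebook's own code: the function name suggest_action and its default values are assumptions (30 mirrors the threshold used in Step 2 below, 50 is the lower end of the 50-60% range in the table), and a value exactly equal to a threshold is treated as belonging to the lower band.

```python
def suggest_action(pct_missing, threshold=30.0, drop_above=50.0):
    """Translate the decision table into the pandas call it recommends.

    pct_missing: percentage of missing values of one variable (0 to 100).
    threshold:   assumed boundary between dropping rows and filling values.
    drop_above:  assumed boundary above which the variable itself is dropped.
    """
    if pct_missing > drop_above:
        return "data.drop()"    # variable is too sparse to keep
    if pct_missing > threshold:
        return "data.fillna()"  # mode for categorical, mean for numerical
    return "data.dropna()"      # few missing values: drop the affected rows

for pct in (16.5, 40.0, 75.0):
    print(pct, "->", suggest_action(pct))
```

Whichever branch is taken, the table's reminder still applies: check the final number of rows afterwards.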

In [148]:
# check for missing values
missing_values = food_production_null.isnull().sum()
print(missing_values[missing_values > 0])

1961    3539
1962    3539
1963    3539
1964    3539
1965    3539
1966    3539
1967    3539
1968    3539
1969    3539
1970    3539
1971    3539
1972    3539
1973    3539
1974    3539
1975    3539
1976    3539
1977    3539
1978    3539
1979    3539
1980    3539
1981    3539
1982    3539
1983    3539
1984    3539
1985    3539
1986    3539
1987    3539
1988    3539
1989    3539
1990    3415
1991    3415
1992     987
1993     612
1994     612
1995     612
1996     612
1997     612
1998     612
1999     612
2000     349
2001     349
2002     349
2003     349
2004     349
2005     349
2006     104
2007     104
2008     104
2009     104
2010     104
2011     104
dtype: int64


Step 2: Drop the columns above the 30% missing-value threshold

In [149]:
# Show the docstring of select_threshold, the helper used to select columns by their percentage of missing values
help(select_threshold)

Help on function select_threshold in module Functions_DA_DS:

select_threshold(data, thr)
    Function that  calculates the percentage of missing values in every column of your dataset
    input: data --> dataframe



Since every column has less than 30% missing values, and my threshold for dropping a variable is more than 30% NaN, I have decided not to drop any of these columns.
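
The source of select_threshold is in the Functions_DA_DS module and is not shown here. The sketch below is therefore only an assumption about how such a filter could be written with pandas; the name keep_columns_below and the demo data are illustrative.

```python
import numpy as np
import pandas as pd

def keep_columns_below(data, thr):
    """Return only the columns whose percentage of NaN is below thr."""
    pct_nan = data.isnull().mean() * 100  # share of NaN per column, in percent
    return data.loc[:, pct_nan < thr]

# Tiny illustrative frame: 'a' has 25% NaN, 'b' has 75% NaN, 'c' has none
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", "y", "z", "w"],
})
kept = keep_columns_below(demo, 30)
print(list(kept.columns))
```

With a threshold of 30, column 'b' is removed and the other two are kept, which is the same rule that leaves all 60 columns of food_production_df in place.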

In [150]:
food_production_threshold = select_threshold(food_production_df, 30)
food_production_df.columns
food_production_threshold.columns
food_production_df.head(5)

Columns to keep: 60
Those columns have a percentage of NaN less than 30 :
['1984', '1981', '1974', '1975', '1976', '1977', '1978', '1979', '1982', '1972', '1983', '1985', '1986', '1987', '1988', '1989', '1973', '1980', '1971', '1965', '1961', '1962', '1964', '1963', '1966', '1967', '1968', '1969', '1970', '1990', '1991', '1992', '1996', '1995', '1994', '1993', '1997', '1999', '1998', '2000', '2001', '2003', '2004', '2002', '2005', '2011', '2010', '2009', '2008', '2007', '2006', '2012', 'area abbreviation', 'country', 'longitude', 'latitude', 'unit_of_measurement', 'element_type', 'food_item', '2013']


Unnamed: 0,area abbreviation,country,food_item,element_type,unit_of_measurement,latitude,longitude,1961,1962,1963,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
0,AFG,Afghanistan,Wheat and products,Food,1000 tonnes,33.94,67.71,1928.0,1904.0,1666.0,...,3249.0,3486.0,3704.0,4164.0,4252.0,4538.0,4605.0,4711.0,4810,4895
1,AFG,Afghanistan,Rice (Milled Equivalent),Food,1000 tonnes,33.94,67.71,183.0,183.0,182.0,...,419.0,445.0,546.0,455.0,490.0,415.0,442.0,476.0,425,422
2,AFG,Afghanistan,Barley and products,Feed,1000 tonnes,33.94,67.71,76.0,76.0,76.0,...,58.0,236.0,262.0,263.0,230.0,379.0,315.0,203.0,367,360
3,AFG,Afghanistan,Barley and products,Food,1000 tonnes,33.94,67.71,237.0,237.0,237.0,...,185.0,43.0,44.0,48.0,62.0,55.0,60.0,72.0,78,89
4,AFG,Afghanistan,Maize and products,Feed,1000 tonnes,33.94,67.71,210.0,210.0,214.0,...,120.0,208.0,233.0,249.0,247.0,195.0,178.0,191.0,200,200


So we keep all the columns

Step 3: Drop or replace the remaining missing values using fill_na


In [151]:
# Number of missing values in the food_production_df
food_production_df.isnull().sum()

area abbreviation         0
country                   0
food_item                 0
element_type              0
unit_of_measurement       0
latitude                  0
longitude                 0
1961                   3539
1962                   3539
1963                   3539
1964                   3539
1965                   3539
1966                   3539
1967                   3539
1968                   3539
1969                   3539
1970                   3539
1971                   3539
1972                   3539
1973                   3539
1974                   3539
1975                   3539
1976                   3539
1977                   3539
1978                   3539
1979                   3539
1980                   3539
1981                   3539
1982                   3539
1983                   3539
1984                   3539
1985                   3539
1986                   3539
1987                   3539
1988                   3539
1989                

In [152]:
# Function to fill missing values
help(fill_na)

Help on function fill_na in module Functions_DA_DS:

fill_na(data)
    Function to fill NaN with mode (categorical variabls) and mean (numerical variables)
    input: data -> df



In [153]:
# replacing the missing values with the mean of the column
food_production_df = fill_na(food_production_df)

Number of missing values on your dataset are

area abbreviation      0
country                0
food_item              0
element_type           0
unit_of_measurement    0
latitude               0
longitude              0
1961                   0
1962                   0
1963                   0
1964                   0
1965                   0
1966                   0
1967                   0
1968                   0
1969                   0
1970                   0
1971                   0
1972                   0
1973                   0
1974                   0
1975                   0
1976                   0
1977                   0
1978                   0
1979                   0
1980                   0
1981                   0
1982                   0
1983                   0
1984                   0
1985                   0
1986                   0
1987                   0
1988                   0
1989                   0
1990                   0
1991                   0
1992
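
The source of fill_na is not shown in this notebook. Based only on its docstring (mode for categorical variables, mean for numerical variables), a function with that behaviour might look like the sketch below. The name fill_mean_or_mode and the demo frame are assumptions.

```python
import numpy as np
import pandas as pd

def fill_mean_or_mode(data):
    """Fill NaN with the column mean (numerical) or the column mode (categorical)."""
    out = data.copy()  # leave the input frame untouched
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

demo = pd.DataFrame({
    "element_type": ["Food", None, "Food"],
    "1961": [10.0, np.nan, 30.0],
})
filled = fill_mean_or_mode(demo)
print(filled)
```

Note that a column mean is computed over all rows of the column, so every country with a missing year receives the same value for that year regardless of country or food item.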

In [154]:
food_production_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21477 entries, 0 to 21476
Data columns (total 60 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   area abbreviation    21477 non-null  object 
 1   country              21477 non-null  object 
 2   food_item            21477 non-null  object 
 3   element_type         21477 non-null  object 
 4   unit_of_measurement  21477 non-null  object 
 5   latitude             21477 non-null  float64
 6   longitude            21477 non-null  float64
 7   1961                 21477 non-null  float64
 8   1962                 21477 non-null  float64
 9   1963                 21477 non-null  float64
 10  1964                 21477 non-null  float64
 11  1965                 21477 non-null  float64
 12  1966                 21477 non-null  float64
 13  1967                 21477 non-null  float64
 14  1968                 21477 non-null  float64
 15  1969                 21477 non-null 

##### 3.1.2.1.3. Drop Duplicates

Duplicates are removed with data.drop_duplicates():

| Method | Information | What you should check |
| :- | :- | :- |
| data.drop_duplicates() | Drop duplicate rows, since repeated information is not relevant | The final size of your dataset: do you have enough rows? |

The first step is always to create a backup of the dataset. We already have one (food_production_df_copy).

In [155]:
# Look for any duplicate rows in the DataFrame
food_production_df.loc[food_production_df.duplicated()].sample(5, random_state=0)

Unnamed: 0,area abbreviation,country,food_item,element_type,unit_of_measurement,latitude,longitude,1961,1962,1963,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
7794,GHA,Ghana,Milk - Excluding Butter,Food,1000 tonnes,7.95,-1.02,53.0,55.0,48.0,...,168.0,159.0,197.0,192.0,222.0,140.0,192.0,234.0,202,235
10942,KGZ,Kyrgyzstan,Eggs,Food,1000 tonnes,41.2,74.77,195.262069,200.78225,205.4646,...,16.0,17.0,19.0,20.0,21.0,22.0,24.0,25.0,25,26
14723,OMN,Oman,Eggs,Food,1000 tonnes,21.51,55.92,195.262069,200.78225,205.4646,...,16.0,18.0,19.0,25.0,19.0,19.0,19.0,22.0,29,28
10477,KAZ,Kazakhstan,Milk - Excluding Butter,Food,1000 tonnes,48.02,66.92,195.262069,200.78225,205.4646,...,3455.0,3854.0,3961.0,3901.0,4138.0,4261.0,4517.0,4533.0,4608,4737
5774,DJI,Djibouti,Eggs,Food,1000 tonnes,11.83,42.59,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1


In [156]:
# Create a copy of the df after filling the missing values
food_production_cleaned_copy = food_production_df.copy()

In [158]:
# Delete the duplicates in the food_production_df
# Drop the duplicate rows and reset the index of the deduplicated DataFrame
food_prod = food_production_df.drop_duplicates().reset_index(drop=True)

In [159]:
# Now let's check both datasets
print('Shape of the original dataset:', food_production_cleaned_copy.shape)
print('Shape of the cleaned dataset:', food_prod.shape)
print('Number of duplicates in the original dataset:', food_production_cleaned_copy.duplicated().sum())
print('Number of duplicates in the cleaned dataset:', food_prod.duplicated().sum())

Shape of the original dataset: (21477, 60)
Shape of the cleaned dataset: (21018, 60)
Number of duplicates in the original dataset: 459
Number of duplicates in the cleaned dataset: 0
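
The shapes above are consistent: 21477 - 459 = 21018 rows. The same check can be sketched on a small, made-up frame (the demo rows are illustrative, not taken from the dataset).

```python
import pandas as pd

# Illustrative frame with one exact duplicate row
demo = pd.DataFrame({
    "country": ["Ghana", "Ghana", "Oman"],
    "food_item": ["Eggs", "Eggs", "Eggs"],
    "2013": [26, 26, 28],
})

n_duplicates = int(demo.duplicated().sum())  # rows that repeat an earlier row
deduplicated = demo.drop_duplicates().reset_index(drop=True)

# The cleaned frame should be shorter by exactly the number of duplicates
print(len(demo), n_duplicates, len(deduplicated))
```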


##### 3.1.2.1.4. Qualitative variables - Country names

Create a function with fuzzywuzzy, a string-matching tool for checking qualitative variables.
* With this function I can evaluate the similarity in nomenclature between different elements and detect whether they were encoded in an incongruous way (e.g. USA is also United States, etc.).
* Write a function to determine this.
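
To illustrate the idea independently of the fuzzywuzzy-based function that follows, here is a minimal sketch that uses difflib from the Python standard library as a stand-in for fuzzywuzzy. Sorting the lowercase tokens before comparing only imitates the spirit of a token-sort comparison and will not give the same scores. The function names, the 0.8 threshold, and the example names (including the deliberate misspelling 'Lativa') are all assumptions.

```python
from difflib import SequenceMatcher

def normalized(name):
    """Lowercase the name and sort its words so that word order does not matter."""
    return " ".join(sorted(name.lower().split()))

def similar_pairs(names, threshold=0.8):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold."""
    unique = sorted(set(names))
    pairs = []
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            score = SequenceMatcher(None, normalized(a), normalized(b)).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs

# Hypothetical country labels with one reordered and one misspelled variant
names = ["Russian Federation", "Federation Russian", "Latvia", "Lativa", "Ghana"]
pairs = similar_pairs(names)
for a, b, score in pairs:
    print(f"{a!r} ~ {b!r}: {score}")
```

Character-based similarity of this kind catches reordered words and spelling variants, but it cannot link abbreviations such as USA to United States; those still need an explicit mapping.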

In [None]:
# Create a function to evaluate the similarity between country names
def fuzz_finder(dictionary, test, target, treshold, first, last, show):
    "This function finds the closest match to a given string in a dictionary."
    for item in test:
        # extract() returns a list of tuples containing each match and its score
        matches = fuzzywuzzy.process.extract(item, target, limit=None, scorer=fuzzywuzzy.fuzz.token_sort_ratio) # fuzzywuzzy is a library for string matching
        if matches[1][1]>=treshold and first != last:
            key = item
            values = 
        

##### 3.2.2. Environment Impact of Food Production dataset (enviroment_file_pd)

#### 3.3. Merging datasets

#### 3.4. Creating KPIs

## 4. Exploratory Data Analysis (EDA)
* 4.1. Descriptive Statistics
* 4.2. Trends in Food Production & Emissions
* 4.3. Geographic & Temporal Breakdown

4.1. Descriptive Statistics

4.1.1. Meet your data

In [None]:
# Check the shape of the dataset
print(f"The dataset has {data.shape[0]} rows and {data.shape[1]} columns.")

In [None]:
# Meet your dataset with EDA
data.describe().T.style.background_gradient(cmap='Blues', low=0, high=1, axis=None).set_properties(**{'font-size': '12pt'})
# .T transposes the summary table; .style returns the Styler used to apply the gradient and font size

## 5. Hypothesis Testing & Statistical Analysis
* 5.1. Correlation Analysis (Pearson)
* 5.2. T-Test for Emission Differences by Food Type
* 5.3. ANOVA for Historical Emission Trends

## 6. Data Visualization & Insights
* 6.1. Food Production vs. Climate Trends
* 6.2. Regional Impact Analysis
* 6.3. Top Food Categories by Emissions

## 7. Findings & Interpretations
* 7.1. Summary of Key Insights
* 7.2. Policy & Sustainability Implications

## 8. Repository Setup & Documentation
* GitHub Organization
* File Structure & README.md
* Future Work