# **FAO Data Pipeline**

This notebook orchestrates an end-to-end data pipeline for acquiring datasets from the Food and Agriculture Organization (FAO).

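**Data Download and Extraction:**
Raw bulk data is fetched from FAOSTAT and stored in a directory named `Raw`. Eleven of the thirteen archives are the Americas regional extracts; Producer_Prices and Fertilizers_by_Product are downloaded as global (`All_Data`) archives.

**Datasets to be downloaded from FAOSTAT (bulk downloads):**
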
1. Production_Indices
2. Crops_and_Livestock_Products_Indicators
3. Producer_Prices
4. Fertilizers_by_Product
5. Macro_Indicators
6. Consumer_Price_Indices
7. Value_Shares_by_Industry_and_Primary_Factors
8. Land_Use
9. Credit_to_Agriculture
10. Emissions_Indicators
11. Emissions_totals
12. Pesticide_Use
13. Value_of_Agricultural_Production


## **1. Setup and Path Configuration**

In [1]:
from pathlib import Path
import requests
import zipfile
import fnmatch
import logging
from datetime import datetime
import os
import shutil

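# === Path Configuration ===

# Kaggle's writable working directory; change this when running elsewhere.
BASE_DATA_DIR_PATH = Path("/kaggle/working/")

# Pipeline root, with one subfolder for logs and one for the raw CSVs.
# Note: DATA_DIR (pipeline root) and data_dir (Raw folder) differ only in case.
DATA_DIR = BASE_DATA_DIR_PATH / "FAO_Data_Pipeline"
log_dir = DATA_DIR / "Logs"
data_dir = DATA_DIR / "Raw"

# Create both folders up front; exist_ok makes the cell safe to re-run.
log_dir.mkdir(parents=True, exist_ok=True)
data_dir.mkdir(parents=True, exist_ok=True)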

# === Logging Configuration ===
log_filename = datetime.now().strftime("download_log_%Y-%m-%d_%H-%M-%S.log")
log_filepath = log_dir / log_filename

logging.basicConfig(
    filename=str(log_filepath),
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
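
`logging.basicConfig(filename=...)` sends records only to the log file, which is why the pipeline functions below also call `print`. A minimal sketch of an alternative setup follows. It assumes a dedicated logger named `fao_pipeline` and a helper called `build_logger`; both are hypothetical names, not part of the notebook. Each record is written to the log file and to the notebook output.

```python
import logging
import sys
import tempfile
from pathlib import Path


def build_logger(log_path, name="fao_pipeline"):
    """Return a logger that writes to log_path and to stdout.

    The logger name is a hypothetical choice for this sketch.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:  # re-running the cell must not stack handlers
        fmt = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        for handler in (logging.FileHandler(log_path, encoding="utf-8"),
                        logging.StreamHandler(sys.stdout)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger


# Demonstration against a throwaway directory rather than the real Logs folder.
demo_log = Path(tempfile.mkdtemp()) / "demo.log"
logger = build_logger(demo_log)
logger.info("Processing dataset: %s", "Production_Indices")
```

With a logger like this, a single `logger.info(...)` call could stand in for a paired `print` and `logging.info`, at the cost of passing the logger around or keeping it as a module-level name.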

## **2. Dataset List and Download Functions**
Next, we define the list of datasets and the functions that handle the download, extraction, and cleanup steps.

In [2]:
# === Dataset List ===
# Dictionary containing dataset names and their download URLs.
datasets = {
    "Production_Indices": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Indices_E_Americas.zip",
    "Crops_and_Livestock_Products_Indicators": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Trade_CropsLivestockIndicators_E_Americas.zip",
    "Producer_Prices": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Prices_E_All_Data.zip",
    "Fertilizers_by_Product": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Inputs_FertilizersArchive_E_All_Data.zip",
    "Macro_Indicators": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Macro-Statistics_Key_Indicators_E_Americas.zip",
    "Consumer_Price_Indices": "https://fenixservices.fao.org/faostat/static/bulkdownloads/ConsumerPriceIndices_E_Americas.zip",
    "Value_Shares_by_Industry_and_Primary_Factors": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Value_shares_industry_primary_factors_E_Americas.zip",
    "Land_Use": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Inputs_LandUse_E_Americas.zip",
    "Credit_to_Agriculture": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Investment_CreditAgriculture_E_Americas.zip",
    "Emissions_Indicators": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Climate_change_Emissions_indicators_E_Americas.zip",
    "Emissions_totals": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Emissions_Totals_E_Americas.zip",
    "Pesticide_Use": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Inputs_Pesticides_Use_E_Americas.zip",
    "Value_of_Agricultural_Production": "https://fenixservices.fao.org/faostat/static/bulkdownloads/Value_of_Production_E_Americas.zip",
}

# === Pipeline Functions ===
def cleanup_folder(folder_path):
    """Removes unnecessary files after extraction."""
    keep_patterns = ["*_E_Americas_NOFLAG.csv", "*_E_All_Data_NOFLAG.csv"]
    kept_files = []
    for file in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file)
        if os.path.isfile(file_path):
            if any(fnmatch.fnmatch(file, pattern) for pattern in keep_patterns):
                kept_files.append(file_path)
            else:
                os.remove(file_path)
                logging.info(f"Deleted: {file_path}")
    return kept_files

def download_and_extract(name, url):
    """Downloads, extracts, and cleans a single dataset."""
    logging.info(f"Processing dataset: {name}")
    zip_path = data_dir / f"{name}.zip"
    temp_extract_path = data_dir / f"temp_{name}"
    try:
        # Without a timeout, requests waits indefinitely on a stalled
        # connection; 120 seconds is an arbitrary choice.
        r = requests.get(url, timeout=120)
        r.raise_for_status()
        with open(zip_path, "wb") as f:
            f.write(r.content)

        temp_extract_path.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(temp_extract_path)

        kept_files = cleanup_folder(temp_extract_path)
        if kept_files:
            # If several files match the keep patterns, only the first is used.
            src_file = kept_files[0]
            dest_file = data_dir / f"{name}.csv"
            shutil.move(src_file, dest_file)
            logging.info(f"Moved cleaned file to: {dest_file}")
            print(f"{name} - Completed successfully")
        else:
            logging.warning(f"No cleaned file found for {name}")
            print(f"{name} - No cleaned file found")
    except Exception as e:
        print(f"Error processing {name}: {e}")
        logging.error(f"Error processing {name}: {e}")
    finally:
        # Remove the archive and temp folder even when a step above failed.
        zip_path.unlink(missing_ok=True)
        shutil.rmtree(temp_extract_path, ignore_errors=True)
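
`download_and_extract` unpacks the whole archive and then deletes everything except the NOFLAG CSV. A possible alternative is to read the archive listing and extract only the matching member. The sketch below assumes the wanted file is stored in the archive under a name matching the same keep patterns. `extract_first_match` is a hypothetical helper, and the demo archive and its file names are made up for illustration.

```python
import fnmatch
import shutil
import tempfile
import zipfile
from pathlib import Path

KEEP_PATTERNS = ["*_E_Americas_NOFLAG.csv", "*_E_All_Data_NOFLAG.csv"]


def extract_first_match(zip_path, dest_file, patterns=KEEP_PATTERNS):
    """Copy the first archive member matching a keep pattern to dest_file.

    Returns the member name, or None when nothing matches.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for member in sorted(archive.namelist()):
            base = member.rsplit("/", 1)[-1]  # ignore folders inside the zip
            if any(fnmatch.fnmatch(base, p) for p in patterns):
                with archive.open(member) as src, open(dest_file, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return member
    return None


# Demonstration with a small synthetic archive (file names are illustrative).
work = Path(tempfile.mkdtemp())
demo_zip = work / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("Demo_E_Americas.csv", "Area,Flag\nChile,A\n")
    zf.writestr("Demo_E_Americas_NOFLAG.csv", "Area,Y2020\nChile,1.0\n")
    zf.writestr("Demo_E_Flags.csv", "Flag,Description\n")

found = extract_first_match(demo_zip, work / "Demo.csv")
print(found)
```

This avoids writing and then deleting the unwanted files, and it needs no temporary folder. The trade-off is that nothing else from the archive is available for inspection afterwards.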

## **3. Run the Pipeline**

In [3]:
def run_fao_download_pipeline():
    """Main function to run the download pipeline."""
    logging.info("Starting FAO download pipeline")
    for name, url in datasets.items():
        download_and_extract(name, url)
    logging.info("FAO download pipeline finished")
    print("All datasets downloaded and processed.")

run_fao_download_pipeline()

Production_Indices - Completed successfully
Crops_and_Livestock_Products_Indicators - Completed successfully
Producer_Prices - Completed successfully
Fertilizers_by_Product - Completed successfully
Macro_Indicators - Completed successfully
Consumer_Price_Indices - Completed successfully
Value_Shares_by_Industry_and_Primary_Factors - Completed successfully
Land_Use - Completed successfully
Credit_to_Agriculture - Completed successfully
Emissions_Indicators - Completed successfully
Emissions_totals - Completed successfully
Pesticide_Use - Completed successfully
Value_of_Agricultural_Production - Completed successfully
All datasets downloaded and processed.
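
A quick post-run check can confirm that every expected CSV actually exists in `Raw` and is not empty. The following is a minimal sketch; `missing_outputs` is a hypothetical helper, and the demo folder and files are created only for illustration.

```python
import tempfile
from pathlib import Path


def missing_outputs(raw_dir, names):
    """Return the dataset names whose CSV is absent or empty in raw_dir."""
    raw_dir = Path(raw_dir)
    missing = []
    for name in names:
        csv_path = raw_dir / f"{name}.csv"
        if not csv_path.is_file() or csv_path.stat().st_size == 0:
            missing.append(name)
    return missing


# Demonstration on a throwaway folder; in the notebook the call would be
# missing_outputs(data_dir, datasets).
demo_raw = Path(tempfile.mkdtemp())
(demo_raw / "Land_Use.csv").write_text("Area,Y2020\n", encoding="utf-8")
(demo_raw / "Pesticide_Use.csv").write_text("", encoding="utf-8")

result = missing_outputs(demo_raw, ["Land_Use", "Pesticide_Use", "Macro_Indicators"])
print(result)
```

An empty list means every dataset produced a non-empty file. Any names returned can be cross-checked against the log file in `Logs`.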
