# Table of Contents

-  [About the dataset](#about)<br>
-  [Load the data](#load_data)<br>

# Imports

In [46]:
# for package auto reload
%load_ext autoreload
%autoreload 2

# for better rendering of plots in jupyter notebook
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [47]:
# base modules
import os
import sys
import copy
import logging
from collections import OrderedDict

# for manipulating data
import numpy as np
import pandas as pd
import math
import dill

# for Machine Learning
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.inspection import permutation_importance
from scipy.cluster import hierarchy

# for visualization
from IPython.display import display
from matplotlib import pyplot as plt
import graphviz
import streamlit as st
# plotly
# seaborn
# altair


# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

In [48]:
# path to repo
path_to_repo = os.getcwd()
path_to_repo

'/Users/nicolas/Desktop/Etudes/3_emlyon/2021:2022/Machine_Learning/project'

# Introduction: Crop Yield Project

### Project Ideas

Several ideas might pop up looking at this dataset, for example:

**Exploratory Data Analysis**
- Visualization of the temporal evolution of selected crop yields per country
- Correlation between production & population

**Machine Learning**
- Predicting the CO2 emissions of crops using regression
- Clustering of crop-related CO2 emitters
- Crop yield forecasting through time-series analysis, using the correlation between production and population

# Datasets Overview

## 1. Agricultural Crop Production

### Summary

Crop statistics for 173 products in Africa, America, Asia, Europe and Oceania, collected from 1961 to 2019.

### Description

Data from the Food and Agriculture Organization of the United Nations (FAO)

Achieving food security for all is at the heart of FAO's efforts - ensuring that people have regular access to enough quality food to lead active and healthy lives. Our three main objectives are: the eradication of hunger, food insecurity and malnutrition; the eradication of poverty and the promotion of economic and social progress for all; and the sustainable management and use of natural resources, including land, water, air, climate and genetic resources, for the benefit of present and future generations.

Primary crops, fibre crops. Crop statistics are recorded for 173 commodities, covering the following categories: Primary crops, Primary fibre crops, Cereals, Secondary cereals, Citrus, Fruit, Jute and related fibres, Oilcake equivalent, Primary oilseeds, Dry vegetables, Roots and tubers, Green fruits and vegetables and Melons. Data are expressed in terms of area harvested, quantity produced, yield and quantity of seed. The aim is to provide comprehensive coverage of production of all primary crops for all countries and regions of the world.

Source: Food and Agriculture Organization of the United Nations (FAO)

## 2. Gas emissions Statistics

### Summary

The FAOSTAT domain Emissions Totals disseminates estimates of CH4, N2O and CO2 emissions/removals and their aggregates in CO2eq, in units of kilotonnes (kt, or 10^6 kg).

### Description

The FAOSTAT domain Emissions Totals summarizes the greenhouse gas (GHG) emissions disseminated in the FAOSTAT Climate Change Emissions domains, generated from agriculture and forest land. They consist of methane (CH4), nitrous oxide (N2O) and carbon dioxide (CO2) emissions from crop and livestock activities and forest management, and include land use and land use change processes. Data are computed at Tier 1 of the IPCC Guidelines for National greenhouse gas (GHG) Inventories (IPCC, 1996; 1997; 2000; 2002; 2006; 2014). Estimates are available by country, with global coverage for the period 1961–2019 (plus projections for 2030 and 2050) for some categories of emissions, and 1990–2019 for others. The database is updated annually.
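
As a rough illustration of how CH4 and N2O quantities can be aggregated into CO2eq, the sketch below multiplies each gas by a 100-year global warming potential. The factors used (28 for CH4, 265 for N2O) are the IPCC AR5 values without climate-carbon feedbacks; that they match the `(AR5)` series of this dataset is an assumption, and the names `GWP_AR5` and `to_co2eq` are hypothetical. The sample quantity is the 1961 CH4 value for enteric fermentation in Afghanistan, as shown in the emissions preview of the preprocessing section.

```python
# 100-year global warming potentials from IPCC AR5
# (assumed to be the factors behind the "(AR5)" series in the dataset)
GWP_AR5 = {"CO2": 1.0, "CH4": 28.0, "N2O": 265.0}


def to_co2eq(kilotonnes, gas):
    """Convert a quantity of a given gas (in kt) to kt of CO2 equivalent."""
    return kilotonnes * GWP_AR5[gas]


# Afghanistan, enteric fermentation, 1961: CH4 emissions in kt
ch4_kt = 240.6831
ch4_as_co2eq = to_co2eq(ch4_kt, "CH4")

# The dataset reports 6739.1279 kt CO2eq for the same row
print(ch4_as_co2eq)
```

The recomputed value agrees with the reported one to within rounding of the published CH4 figure, which supports the assumed CH4 factor.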

## 3. Population per country

The FAOSTAT Population module contains time series data on population, by sex and urban/rural. The series consist of both estimates and projections for different periods as available from the original sources, namely:
1. Population data refers to the World Population Prospects: The 2019 Revision from the UN Population Division.
2. Urban/rural population data refers to the World Urbanization Prospects: The 2018 Revision from the UN Population Division.

# Data Preprocessing

## 1. Preprocess the *Agricultural Crop data*

### Get a glimpse of one of the main datasets

In [51]:
# Loading Africa's dataset

url = 'https://raw.githubusercontent.com/nicoboou/ml_eml/main/data/agriculture-crop-production/Production_Crops_E_Africa.csv'
crops_africa = pd.read_csv(url, sep=',',encoding='latin-1')
crops_africa.head(10)

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2015,Y2015F,Y2016,Y2016F,Y2017,Y2017F,Y2018,Y2018F,Y2019,Y2019F
0,4,Algeria,221,"Almonds, with shell",5312,Area harvested,ha,13300.0,F,13300.0,...,40403.0,,49983.0,,50100.0,,43043.0,,35380.0,
1,4,Algeria,221,"Almonds, with shell",5419,Yield,hg/ha,4511.0,Fc,4511.0,...,18930.0,Fc,13223.0,Fc,12362.0,Fc,13292.0,Fc,20467.0,Fc
2,4,Algeria,221,"Almonds, with shell",5510,Production,tonnes,6000.0,,6000.0,...,76482.0,,66095.0,,61934.0,,57213.0,,72412.0,
3,4,Algeria,515,Apples,5312,Area harvested,ha,3400.0,F,3100.0,...,41011.0,,46070.0,,44620.0,,39034.0,,32989.0,
4,4,Algeria,515,Apples,5419,Yield,hg/ha,45294.0,Fc,45161.0,...,110086.0,Fc,108716.0,Fc,110766.0,Fc,124970.0,Fc,169399.0,Fc
5,4,Algeria,515,Apples,5510,Production,tonnes,15400.0,,14000.0,...,451472.0,,500855.0,,494239.0,,487808.0,,558830.0,
6,4,Algeria,526,Apricots,5312,Area harvested,ha,4200.0,F,4600.0,...,38857.0,,38239.0,,44307.0,,35500.0,,30861.0,
7,4,Algeria,526,Apricots,5419,Yield,hg/ha,30286.0,Fc,30000.0,...,75530.0,Fc,67149.0,Fc,57980.0,Fc,68237.0,Fc,67789.0,Fc
8,4,Algeria,526,Apricots,5510,Production,tonnes,12720.0,,13800.0,...,293486.0,,256771.0,,256890.0,,242243.0,,209204.0,
9,4,Algeria,366,Artichokes,5312,Area harvested,ha,5000.0,F,5000.0,...,4674.0,,5174.0,,5532.0,,5784.0,,5792.0,


In [52]:
crops_africa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9091 entries, 0 to 9090
Columns: 125 entries, Area Code to Y2019F
dtypes: float64(59), int64(3), object(63)
memory usage: 8.7+ MB


### Load all datasets

In [78]:
def open_all_datasets_df(item):
    """Load the crop production CSV of one continent and tag each row with its name."""
    base_url = 'https://raw.githubusercontent.com/nicoboou/ml_eml/main/data/agriculture-crop-production/'
    df = pd.read_csv(f'{base_url}Production_Crops_E_{item}.csv', low_memory=False, encoding='latin1')
    df['continent'] = item
    return df

In [79]:
continents = ['Africa','Americas','Asia','Europe','Oceania']

In [80]:
# DataFrame.append is deprecated (and removed in pandas 2.0), so concatenate all frames at once
crops_raw = pd.concat([open_all_datasets_df(continent) for continent in continents])



In [81]:
crops_raw

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1961F,Y1962,...,Y2015F,Y2016,Y2016F,Y2017,Y2017F,Y2018,Y2018F,Y2019,Y2019F,continent
0,4,Algeria,221,"Almonds, with shell",5312,Area harvested,ha,13300.0,F,13300.0,...,,49983.0,,50100.0,,43043.0,,35380.0,,Africa
1,4,Algeria,221,"Almonds, with shell",5419,Yield,hg/ha,4511.0,Fc,4511.0,...,Fc,13223.0,Fc,12362.0,Fc,13292.0,Fc,20467.0,Fc,Africa
2,4,Algeria,221,"Almonds, with shell",5510,Production,tonnes,6000.0,,6000.0,...,,66095.0,,61934.0,,57213.0,,72412.0,,Africa
3,4,Algeria,515,Apples,5312,Area harvested,ha,3400.0,F,3100.0,...,,46070.0,,44620.0,,39034.0,,32989.0,,Africa
4,4,Algeria,515,Apples,5419,Yield,hg/ha,45294.0,Fc,45161.0,...,Fc,108716.0,Fc,110766.0,Fc,124970.0,Fc,169399.0,Fc,Africa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1649,155,Vanuatu,1720,"Roots and Tubers, Total",5419,Yield,hg/ha,225455.0,Fc,227273.0,...,Fc,82306.0,Fc,82341.0,Fc,82374.0,Fc,82407.0,Fc,Oceania
1650,155,Vanuatu,1720,"Roots and Tubers, Total",5510,Production,tonnes,24800.0,A,25000.0,...,A,52314.0,A,52838.0,A,53362.0,A,53886.0,A,Oceania
1651,155,Vanuatu,1735,Vegetables Primary,5312,Area harvested,ha,200.0,A,210.0,...,A,804.0,A,810.0,A,817.0,A,824.0,A,Oceania
1652,155,Vanuatu,1735,Vegetables Primary,5419,Yield,hg/ha,150000.0,Fc,150000.0,...,Fc,165547.0,Fc,166358.0,Fc,166952.0,Fc,167524.0,Fc,Oceania


Each year column is accompanied by a flag column carrying the suffix "F" (for example `Y1961F`). Let's look at what these flags mean:

In [82]:
# utf-8-sig drops the leading byte-order mark, which would otherwise end up in the first column name
flags_df = pd.read_csv("https://raw.githubusercontent.com/nicoboou/ml_eml/main/data/agriculture-crop-production/flags.csv", encoding="utf-8-sig")
flags_df

Unnamed: 0,Flag,Flags
0,,Official data
1,*,Unofficial figure
2,A,"Aggregate, may include official, semi-official..."
3,Bk,Break in series
4,C,Calculated
5,Ce,Calculated data based on estimated data
6,Cv,Calculated through value
7,E,Expert sources from FAO (including other divis...
8,F,FAO estimate
9,Fb,Data obtained as a balance


Each year column is paired with a flag column stating the source of that year's figure (official data, FAO estimate, calculated value, etc.).
This is useful to know, but let's set the flag columns aside and keep a smaller DataFrame:

In [87]:
crops_all = crops_raw.copy()
# Drop the flag columns (Y1961F ... Y2019F); identifiers and yearly values are kept
crops_all = crops_all.loc[:, ~crops_all.columns.str.match(r'^Y\d{4}F$')]
crops_all

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,Y1961,Y1962,Y1963,...,Y2011,Y2012,Y2013,Y2014,Y2015,Y2016,Y2017,Y2018,Y2019,continent
0,4,Algeria,221,"Almonds, with shell",5312,Area harvested,ha,13300.0,13300.0,13300.0,...,52245.0,49975.0,49011.0,39050.0,40403.0,49983.0,50100.0,43043.0,35380.0,Africa
1,4,Algeria,221,"Almonds, with shell",5419,Yield,hg/ha,4511.0,4511.0,4511.0,...,9689.0,13304.0,12965.0,16601.0,18930.0,13223.0,12362.0,13292.0,20467.0,Africa
2,4,Algeria,221,"Almonds, with shell",5510,Production,tonnes,6000.0,6000.0,6000.0,...,50621.0,66487.0,63545.0,64827.0,76482.0,66095.0,61934.0,57213.0,72412.0,Africa
3,4,Algeria,515,Apples,5312,Area harvested,ha,3400.0,3100.0,2800.0,...,51080.0,48828.0,48064.0,40418.0,41011.0,46070.0,44620.0,39034.0,32989.0,Africa
4,4,Algeria,515,Apples,5419,Yield,hg/ha,45294.0,45161.0,46429.0,...,79112.0,81414.0,94860.0,114507.0,110086.0,108716.0,110766.0,124970.0,169399.0,Africa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1649,155,Vanuatu,1720,"Roots and Tubers, Total",5419,Yield,hg/ha,225455.0,227273.0,229091.0,...,80645.0,82258.0,82540.0,82518.0,82467.0,82306.0,82341.0,82374.0,82407.0,Oceania
1650,155,Vanuatu,1720,"Roots and Tubers, Total",5510,Production,tonnes,24800.0,25000.0,25200.0,...,50000.0,51000.0,52000.0,52374.0,52251.0,52314.0,52838.0,53362.0,53886.0,Oceania
1651,155,Vanuatu,1735,Vegetables Primary,5312,Area harvested,ha,200.0,210.0,220.0,...,762.0,750.0,779.0,791.0,801.0,804.0,810.0,817.0,824.0,Oceania
1652,155,Vanuatu,1735,Vegetables Primary,5419,Yield,hg/ha,150000.0,150000.0,150000.0,...,161444.0,166667.0,163248.0,163540.0,164819.0,165547.0,166358.0,166952.0,167524.0,Oceania
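
With one column per year, the table is in wide format. For time-series work it may be more convenient to reshape it into a long format with one row per (area, item, element, year). The following is a sketch on a two-row sample built from the Algeria production figures above; the names `wide`, `long_df`, `Year` and `Value` are illustrative choices, not part of the dataset.

```python
import re

import pandas as pd

# Two production rows copied from the table above, restricted to two years
wide = pd.DataFrame({
    "Area": ["Algeria", "Algeria"],
    "Item": ["Almonds, with shell", "Apples"],
    "Element": ["Production", "Production"],
    "Unit": ["tonnes", "tonnes"],
    "Y1961": [6000.0, 15400.0],
    "Y1962": [6000.0, 14000.0],
})

# Year columns are exactly "Y" followed by four digits; everything else identifies the row
year_cols = [c for c in wide.columns if re.fullmatch(r"Y\d{4}", c)]
id_cols = [c for c in wide.columns if c not in year_cols]

# One row per identifier combination and year
long_df = wide.melt(id_vars=id_cols, value_vars=year_cols,
                    var_name="Year", value_name="Value")

# Turn "Y1961" into the integer 1961
long_df["Year"] = long_df["Year"].str[1:].astype(int)

print(long_df)
```

The same steps should apply to the full `crops_all` frame, at the cost of a much longer table.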


In [89]:
crops_all["Element"].unique()

array(['Area harvested', 'Yield', 'Production'], dtype=object)

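This dataset covers three elements:
- Area harvested: the area from which the crop was gathered, in hectares (ha)
- Yield: the harvested production per unit of harvested area, in hectograms per hectare (hg/ha)
- Production: the quantity produced, in tonnes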
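
The three elements appear to be linked: given the units shown in the tables above (ha, hg/ha, tonnes), yield should equal production divided by area harvested, converted from tonnes to hectograms. A quick check on the 1961 Algeria rows, using a hand-made sample with hypothetical column names:

```python
import pandas as pd

# 1961 figures for Algeria, copied from the crops table
sample = pd.DataFrame({
    "Item": ["Almonds, with shell", "Apples"],
    "area_ha": [13300.0, 3400.0],
    "production_t": [6000.0, 15400.0],
    "yield_hg_per_ha": [4511.0, 45294.0],
})

HG_PER_TONNE = 10_000  # 1 tonne = 1,000 kg = 10,000 hectograms

# Recompute the yield from production and area, rounded to whole hg/ha
sample["yield_recomputed"] = (
    sample["production_t"] * HG_PER_TONNE / sample["area_ha"]
).round()

print(sample)
```

Both recomputed values match the reported yields, so for these rows the yield column carries no information beyond production and area.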

In [90]:
crops_all.describe(include = 'all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Area Code,38146.0,,,,130.297069,75.053504,1.0,63.0,126.0,196.0,299.0
Area,38146.0,210.0,"China, mainland",398.0,,,,,,,
Item Code,38146.0,,,,614.361663,547.910943,15.0,236.0,446.0,656.0,1841.0
Item,38146.0,175.0,"Roots and Tubers, Total",618.0,,,,,,,
Element Code,38146.0,,,,5414.258349,82.044592,5312.0,5312.0,5419.0,5510.0,5510.0
Element,38146.0,3.0,Production,13224.0,,,,,,,
Unit,38146.0,3.0,tonnes,13224.0,,,,,,,
Y1961,22639.0,,,,371148.078625,3506653.258574,0.0,3100.0,15417.0,82733.5,176340000.0
Y1962,22658.0,,,,378709.214891,3529416.335406,0.0,3200.0,15970.0,83803.0,164593333.0
Y1963,22656.0,,,,383382.502913,3429900.956439,0.0,3300.0,16235.0,86986.5,174812487.0


## 2. Preprocess the *Gas Emissions data*

In [93]:
# Loading Gas Emissions' dataset

url = 'https://raw.githubusercontent.com/nicoboou/ml_eml/main/data/emissions/emissions_full.csv'
emissions_df = pd.read_csv(url, sep=',',encoding='latin-1',low_memory=False)
emissions_df.head(10)

Unnamed: 0,Code zone,Zone,Code Produit,Produit,Code Élément,Élément,Code source,Source,Unité,Y1961,...,Y2019N,Y2020,Y2020F,Y2020N,Y2030,Y2030F,Y2030N,Y2050,Y2050F,Y2050N
0,2,Afghanistan,5058,Fermentation entérique,7225,Émissions (CH4),3050,FAO TIER 1,kilotonnes,240.6831,...,,,,,453.7474,Fc,,603.6185,Fc,
1,2,Afghanistan,5058,Fermentation entérique,7225,Émissions (CH4),3051,UNFCCC,kilotonnes,,...,,,,,,,,,,
2,2,Afghanistan,5058,Fermentation entérique,724413,Émissions (CO2eq) venant de CH4 (AR5),3050,FAO TIER 1,kilotonnes,6739.1279,...,,,,,12704.9283,Fc,,16901.3173,Fc,
3,2,Afghanistan,5058,Fermentation entérique,724413,Émissions (CO2eq) venant de CH4 (AR5),3051,UNFCCC,kilotonnes,,...,,,,,,,,,,
4,2,Afghanistan,5058,Fermentation entérique,723113,Émissions (CO2eq) (AR5),3050,FAO TIER 1,kilotonnes,6739.1279,...,,,,,12704.9283,Fc,,16901.3173,Fc,
5,2,Afghanistan,5058,Fermentation entérique,723113,Émissions (CO2eq) (AR5),3051,UNFCCC,kilotonnes,,...,,,,,,,,,,
6,2,Afghanistan,5059,Gestion du fumier,7225,Émissions (CH4),3050,FAO TIER 1,kilotonnes,11.6228,...,,,,,27.2114,Fc,,35.27,Fc,
7,2,Afghanistan,5059,Gestion du fumier,7225,Émissions (CH4),3051,UNFCCC,kilotonnes,,...,,,,,,,,,,
8,2,Afghanistan,5059,Gestion du fumier,7230,Émissions (N2O),3050,FAO TIER 1,kilotonnes,0.3992,...,,,,,0.577,Fc,,0.847,Fc,
9,2,Afghanistan,5059,Gestion du fumier,724413,Émissions (CO2eq) venant de CH4 (AR5),3050,FAO TIER 1,kilotonnes,325.4372,...,,,,,761.9188,Fc,,987.5598,Fc,
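
The preview shows each series twice, once per source (`FAO TIER 1` and `UNFCCC`), and each year followed by two extra columns with suffixes `F` and `N`. A possible next step, sketched here on a hand-made sample, is to keep one source and drop the suffixed columns, as was done for the crop data. Keeping `FAO TIER 1` is an assumption about what the analysis needs, and the names `sample`, `fao_only` and `emissions_small` are hypothetical.

```python
import pandas as pd

# Two rows mimicking the preview above, restricted to one projected year
sample = pd.DataFrame({
    "Zone": ["Afghanistan", "Afghanistan"],
    "Produit": ["Fermentation entérique", "Fermentation entérique"],
    "Source": ["FAO TIER 1", "UNFCCC"],
    "Y2030": [453.7474, None],
    "Y2030F": ["Fc", None],
    "Y2030N": [None, None],
})

# Keep a single source so that each series appears only once
fao_only = sample[sample["Source"] == "FAO TIER 1"]

# Drop the columns named "Y" + four digits + "F" or "N"
is_suffix_col = fao_only.columns.str.match(r"^Y\d{4}[FN]$")
emissions_small = fao_only.loc[:, ~is_suffix_col]

print(emissions_small)
```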


# Exploratory Data Analysis

### Visualization

# Machine Learning

## Supervised Learning

### Regression

### Classification

## Unsupervised Learning

### Clustering

## Time-series Analysis