# Table of Contents

- [About the dataset](#about)
- [Load the data](#load_data)

# Imports

In [None]:
# for package auto reload
%load_ext autoreload
%autoreload 2

# for better rendering of plots in jupyter notebook
%matplotlib inline

In [3]:
# base modules
import os
import sys
import copy
import logging
from collections import OrderedDict

# for manipulating data
import numpy as np
import pandas as pd
import math
import dill

# for Machine Learning
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.inspection import permutation_importance
from scipy.cluster import hierarchy

# for visualization
from IPython.display import display
from matplotlib import pyplot as plt
import graphviz
# plotly
# seaborn
# altair


# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

In [4]:
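# path to repo: parent of the current working directory
# ('__file__' is a plain string here, since notebooks do not define __file__)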
# path_to_repo = os.path.dirname(os.getcwd())
path_to_repo = os.path.dirname(os.path.dirname(os.path.realpath('__file__')))
path_to_repo

'/Users/nicolas/Desktop/Etudes/3_emlyon/2021:2022/Machine_Learning'

# Introduction: Crops Yield Project

### Project Ideas

Several ideas might pop up looking at this dataset, for example:

**Exploratory Data Analysis**
- Visualization of the temporal evolution of selected crop yields per country
- Correlation between production & population

**Machine Learning**
- Estimating the CO2 emissions of crops using regression
- Clustering of crop CO2 emitters
- Crop yield forecasting through time-series analysis, using the correlation between production and population
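
As a first step toward the production/population idea above, the correlation can be measured with a Pearson coefficient. The snippet below is only a minimal sketch: the column names, the helper `production_population_corr` and the figures are hypothetical placeholders, not FAO data.

```python
import pandas as pd

# Hypothetical yearly figures for one country (illustrative values, not FAO data)
toy = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2019],
    "production_tonnes": [100.0, 110.0, 125.0, 130.0, 150.0],
    "population_thousands": [1000.0, 1020.0, 1045.0, 1060.0, 1090.0],
})

def production_population_corr(df):
    """Pearson correlation between the production and population columns."""
    return df["production_tonnes"].corr(df["population_thousands"])

corr = production_population_corr(toy)
print(round(corr, 3))
```

A coefficient close to 1 on real data would only indicate that both series trend together over time; it would not by itself justify using population as a predictor.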

# Datasets Overview

## 1. Agricultural Crop Production

### Summary

Crop statistics for 173 products in Africa, America, Asia, Europe and Oceania, collected from 1961 to 2019.

### Description

Data from the Food and Agriculture Organization of the United Nations (FAO)

Achieving food security for all is at the heart of FAO's efforts: ensuring that people have regular access to enough quality food to lead active and healthy lives. FAO's three main objectives are: the eradication of hunger, food insecurity and malnutrition; the eradication of poverty and the promotion of economic and social progress for all; and the sustainable management and use of natural resources, including land, water, air, climate and genetic resources, for the benefit of present and future generations.

Crop statistics are recorded for 173 commodities, covering the following categories: Primary crops, Primary fibre crops, Cereals, Secondary cereals, Citrus, Fruit, Jute and related fibres, Oilcake equivalent, Primary oilseeds, Dry vegetables, Roots and tubers, Green fruits and vegetables, and Melons. Data are expressed in terms of area harvested, quantity produced, yield and quantity of seed. The aim is to provide comprehensive coverage of production of all primary crops for all countries and regions of the world.

Source: Food and Agriculture Organization of the United Nations (FAO)

## 2. Gas emissions Statistics

### Summary

The FAOSTAT domain Emissions Totals disseminates estimates of CH4, N2O and CO2 emissions/removals and their aggregates in CO2eq, in units of kilotonnes (kt, or 10^6 kg).

### Description

The FAOSTAT domain Emissions Totals summarizes the greenhouse gas (GHG) emissions disseminated in the FAOSTAT Climate Change Emissions domains, generated from agriculture and forest land. They consist of methane (CH4), nitrous oxide (N2O) and carbon dioxide (CO2) emissions from crop and livestock activities and forest management, and include land use and land use change processes. Data are computed at Tier 1 of the IPCC Guidelines for National Greenhouse Gas (GHG) Inventories (IPCC, 1996; 1997; 2000; 2002; 2006; 2014). Estimates are available by country, with global coverage: 1961–2019 (with projections for 2030 and 2050) for some categories of emissions, and 1990–2019 for others. The database is updated annually.
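
The CO2eq aggregate mentioned above can be thought of as a weighted sum of the individual gases. The sketch below uses the 100-year global warming potentials from the IPCC Fifth Assessment Report (CH4 = 28, N2O = 265). Whether FAOSTAT applies exactly this set of weights is an assumption here, and the function name `to_co2eq_kt` and the input figures are hypothetical.

```python
# 100-year global warming potentials from IPCC AR5
# (assumption: check which set of weights FAOSTAT actually applies)
GWP_100 = {"CO2": 1.0, "CH4": 28.0, "N2O": 265.0}

def to_co2eq_kt(emissions_kt, gwp=GWP_100):
    """Aggregate per-gas emissions (kt) into a single CO2eq figure (kt)."""
    return sum(gwp[gas] * amount for gas, amount in emissions_kt.items())

# Hypothetical emissions in kilotonnes (illustrative values, not FAO data)
example = {"CO2": 100.0, "CH4": 2.0, "N2O": 0.5}

total_kt = to_co2eq_kt(example)  # 100 + 28 * 2 + 265 * 0.5 = 288.5
total_kg = total_kt * 1e6        # 1 kt = 10^6 kg
print(total_kt, total_kg)
```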

## 3. Population per country

The FAOSTAT Population module contains time series data on population, by sex and urban/rural. The series consist of both estimates and projections for different periods as available from the original sources, namely:
1. Population data refers to the World Population Prospects: The 2019 Revision from the UN Population Division.
2. Urban/rural population data refers to the World Urbanization Prospects: The 2018 Revision from the UN Population Division.

# Data Preprocessing

### Get a glimpse of one of the main datasets

In [5]:
# Loading Africa's dataset

url = 'https://raw.githubusercontent.com/nicoboou/ml_eml/main/agriculture-crop-production/Production_Crops_E_Africa.csv?token=GHSAT0AAAAAABRXBUYTNUDPEVZTZBHGL3D6YR5RE5Q'
crops_africa = pd.read_csv(url, sep=',',encoding='latin-1')
crops_africa.head(10)

|  | Area Code | Area | Item Code | Item | Element Code | Element | Unit | Y1961 | Y1961F | Y1962 | ... | Y2015 | Y2015F | Y2016 | Y2016F | Y2017 | Y2017F | Y2018 | Y2018F | Y2019 | Y2019F |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4 | Algeria | 221 | Almonds, with shell | 5312 | Area harvested | ha | 13300.0 | F | 13300.0 | ... | 40403.0 |  | 49983.0 |  | 50100.0 |  | 43043.0 |  | 35380.0 |  |
| 1 | 4 | Algeria | 221 | Almonds, with shell | 5419 | Yield | hg/ha | 4511.0 | Fc | 4511.0 | ... | 18930.0 | Fc | 13223.0 | Fc | 12362.0 | Fc | 13292.0 | Fc | 20467.0 | Fc |
| 2 | 4 | Algeria | 221 | Almonds, with shell | 5510 | Production | tonnes | 6000.0 |  | 6000.0 | ... | 76482.0 |  | 66095.0 |  | 61934.0 |  | 57213.0 |  | 72412.0 |  |
| 3 | 4 | Algeria | 515 | Apples | 5312 | Area harvested | ha | 3400.0 | F | 3100.0 | ... | 41011.0 |  | 46070.0 |  | 44620.0 |  | 39034.0 |  | 32989.0 |  |
| 4 | 4 | Algeria | 515 | Apples | 5419 | Yield | hg/ha | 45294.0 | Fc | 45161.0 | ... | 110086.0 | Fc | 108716.0 | Fc | 110766.0 | Fc | 124970.0 | Fc | 169399.0 | Fc |
| 5 | 4 | Algeria | 515 | Apples | 5510 | Production | tonnes | 15400.0 |  | 14000.0 | ... | 451472.0 |  | 500855.0 |  | 494239.0 |  | 487808.0 |  | 558830.0 |  |
| 6 | 4 | Algeria | 526 | Apricots | 5312 | Area harvested | ha | 4200.0 | F | 4600.0 | ... | 38857.0 |  | 38239.0 |  | 44307.0 |  | 35500.0 |  | 30861.0 |  |
| 7 | 4 | Algeria | 526 | Apricots | 5419 | Yield | hg/ha | 30286.0 | Fc | 30000.0 | ... | 75530.0 | Fc | 67149.0 | Fc | 57980.0 | Fc | 68237.0 | Fc | 67789.0 | Fc |
| 8 | 4 | Algeria | 526 | Apricots | 5510 | Production | tonnes | 12720.0 |  | 13800.0 | ... | 293486.0 |  | 256771.0 |  | 256890.0 |  | 242243.0 |  | 209204.0 |  |
| 9 | 4 | Algeria | 366 | Artichokes | 5312 | Area harvested | ha | 5000.0 | F | 5000.0 | ... | 4674.0 |  | 5174.0 |  | 5532.0 |  | 5784.0 |  | 5792.0 |  |


In [7]:
crops_africa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9091 entries, 0 to 9090
Columns: 125 entries, Area Code to Y2019F
dtypes: float64(59), int64(3), object(63)
memory usage: 8.7+ MB
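
The dataset is in wide format, with one value column (`Y1961`, ..., `Y2019`) and one flag column (`Y1961F`, ..., `Y2019F`) per year. For time-based plots or models, a long format with one row per year is usually easier to handle. The sketch below shows one possible way to reshape it with `DataFrame.melt`. The helper `wide_to_long` and the tiny stand-in frame are assumptions made for illustration (the stand-in reuses a few values from the preview above), not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in with the same layout as crops_africa (values taken from the preview above)
wide = pd.DataFrame({
    "Area": ["Algeria", "Algeria"],
    "Item": ["Apples", "Apples"],
    "Element": ["Area harvested", "Production"],
    "Unit": ["ha", "tonnes"],
    "Y1961": [3400.0, 15400.0],
    "Y1961F": ["F", None],
    "Y2019": [32989.0, 558830.0],
    "Y2019F": [None, None],
})

def wide_to_long(df):
    """Melt the Y<year> value columns into (Year, Value) rows, dropping flag columns."""
    id_cols = ["Area", "Item", "Element", "Unit"]
    # Value columns look like 'Y1961'; flag columns such as 'Y1961F' are left out
    value_cols = [c for c in df.columns if c.startswith("Y") and c[1:].isdigit()]
    long_df = df.melt(id_vars=id_cols, value_vars=value_cols,
                      var_name="Year", value_name="Value")
    long_df["Year"] = long_df["Year"].str[1:].astype(int)
    return long_df

long_df = wide_to_long(wide)
print(long_df)
```

Dropping the flag columns is a simplification; if the flags matter for the analysis, they could be melted separately and merged back on the identifier columns and `Year`.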


In [9]:
crops_africa["Element"].unique()

array(['Area harvested', 'Yield', 'Production'], dtype=object)
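
The three elements appear to be linked: since 1 tonne = 10,000 hg, a yield in hg/ha can be recomputed as production (tonnes) × 10,000 / area harvested (ha). The sketch below checks this against a few rows of the preview above. The helper name `yield_hg_per_ha` is hypothetical, and the claim that the yield rows are derived this way is an inference from the numbers, not something stated by the source.

```python
def yield_hg_per_ha(production_tonnes, area_ha):
    """Yield in hectograms per hectare; 1 tonne = 10,000 hg."""
    return production_tonnes * 10_000 / area_ha

# Algeria, 1961 (production and area figures from the preview above)
almonds = yield_hg_per_ha(6000.0, 13300.0)   # reported yield: 4511.0
apples = yield_hg_per_ha(15400.0, 3400.0)    # reported yield: 45294.0
apricots = yield_hg_per_ha(12720.0, 4200.0)  # reported yield: 30286.0

print(round(almonds), round(apples), round(apricots))
# → 4511 45294 30286
```

Because yield can be recomputed from the other two elements, it may be safer to derive it after any aggregation (for example across countries) rather than averaging the reported yields directly.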

### Load all datasets

In [None]:
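def open_all_datasets_df(list_to_query):
  # list_to_query: (continent, url) pairs, e.g. [('Africa', url)] with the url defined above
  frames = []
  for continent, dataset_url in list_to_query:
    df = pd.read_csv(dataset_url, sep=',', encoding='latin-1', low_memory=False)
    df['continent'] = continent  # assumed intent: tag each row with its source continent
    frames.append(df)
  return pd.concat(frames, ignore_index=True)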

# Exploratory Data Analysis

### Visualization

# Machine Learning

## Supervised Learning

### Regression

### Classification

## Unsupervised Learning

### Clustering

## Time-series Analysis