# Milestone I - Final Project

## Problem formulation

We aim to explore the link between farm productivity and pesticide use, taking the effects of temperature and rainfall into account. We want to identify countries that have boosted crop yields without increasing pesticide use. By studying these countries, we seek to uncover methods for more sustainable farming. Our key question is: “Which countries have raised crop output while stabilizing or reducing pesticide use, and how do weather conditions factor in?”

Sustainable farming is crucial to being good shepherds of our planet's finite resources. The overuse of pesticides can degrade soil quality, harm wildlife, and contaminate water supplies, which can be a health hazard to humans. In an era where food security and environmental health are intertwined, it is important to find ways to reduce our dependence on pesticides. By answering our question, we hope to provide insights that can be applied globally: a roadmap for agricultural practices that nurture the land and its inhabitants. In doing so, we want to highlight sustainable farming practices.

In [1]:
import requests
from zipfile import ZipFile
from io import BytesIO
import pandas as pd
import numpy as np
import altair as alt
import geopandas as gpd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
from IPython.display import display, Markdown

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Load Datasets

The yield and pesticide datasets are downloaded from the following URLs. The temperature and rainfall datasets were previously downloaded from Kaggle and saved locally.

In [2]:
yield_url = 'https://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_Livestock_E_All_Data_(Normalized).zip'
pest_url = 'https://fenixservices.fao.org/faostat/static/bulkdownloads/Inputs_Pesticides_Use_E_All_Data_(Normalized).zip'

# Download the crop production archive and read its CSV straight from memory
response = requests.get(yield_url)
response.raise_for_status()  # stop here if the download failed
zip_file = ZipFile(BytesIO(response.content))
csv_file = 'Production_Crops_Livestock_E_All_Data_(Normalized).csv'
yield_df = pd.read_csv(zip_file.open(csv_file), encoding='ISO-8859-1')

# Same steps for the pesticide use archive
response = requests.get(pest_url)
response.raise_for_status()
zip_file = ZipFile(BytesIO(response.content))
csv_file = 'Inputs_Pesticides_Use_E_All_Data_(Normalized).csv'
pest_df = pd.read_csv(zip_file.open(csv_file), encoding='ISO-8859-1')

rain_df = pd.read_csv('./data/rainfall.csv')
temp_df = pd.read_csv('./data/temp.csv')
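
The two download steps above repeat the same "unzip in memory, then read one CSV" pattern. As a rough sketch, that pattern could be factored into a small helper. The name `read_csv_from_zip` is hypothetical and not part of the original notebook, and the 60-second timeout in the commented usage is an arbitrary assumption.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd


def read_csv_from_zip(zip_bytes, csv_name, encoding="ISO-8859-1"):
    """Read one CSV member out of a zip archive held in memory (hypothetical helper)."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        with archive.open(csv_name) as handle:
            return pd.read_csv(handle, encoding=encoding)


# Possible usage with the notebook's own variables (not executed here):
# response = requests.get(yield_url, timeout=60)
# response.raise_for_status()
# yield_df = read_csv_from_zip(
#     response.content,
#     'Production_Crops_Livestock_E_All_Data_(Normalized).csv',
# )
```

Keeping the network call outside the helper means the parsing step can be tried on a small local archive without downloading the full FAO files.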

# Explore Datasets

In [3]:
# Function to print various data attributes
def print_info(df, name):
    # display(Markdown(f'### **{name}**'))
    display(Markdown('**Dataframe info:**'))
    print(df.info())
    print('\n')
    display(Markdown('**Number of unique values per column:**'))
    print(df.nunique())
    print('\n')
    display(Markdown('**Summary statistics per column:**'))
    print(df.describe().round(0).astype(int))
    print('\n')

    display(Markdown('**Columns with missing values in df:**'))
    missing_values = df.isna().sum()
    print(missing_values[missing_values > 0].to_string(header=False))
    print('\n')

## Temperature

The dataset contains information about average annual temperature across different years and countries.

##### Attributes:
 - year: The year in which the data was collected.
 - country: The name of the country.
 - avg_temp: The average annual temperature, presumably in degrees Celsius (the unit is not explicitly stated).

##### Note:
There are some missing values in the avg_temp column. These will need to be addressed in the data preprocessing stage.

In [4]:
print_info(temp_df, 'temp_df')

**Dataframe info:**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71311 entries, 0 to 71310
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   year      71311 non-null  int64  
 1   country   71311 non-null  object 
 2   avg_temp  68764 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.6+ MB
None




**Number of unique values per column:**

year         271
country      137
avg_temp    3303
dtype: int64




**Summary statistics per column:**

        year  avg_temp
count  71311     68764
mean    1906        16
std       67         8
min     1743       -14
25%     1858        10
50%     1910        16
75%     1962        24
max     2013        31




**Columns with missing values in df:**

avg_temp    2547




## Rainfall

##### Attributes:
 - Area: The name of the country (the column name has a leading space in the raw file).
 - Year: The year in which the data was collected.
 - average_rain_fall_mm_per_year: The average annual rainfall in millimetres per year. It is stored as an object (string) column rather than a numeric one.

##### Note:
There are some missing values in the average_rain_fall_mm_per_year column. These will need to be addressed in the data preprocessing stage.

In [5]:
print_info(rain_df, 'rain_df')

**Dataframe info:**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6727 entries, 0 to 6726
Data columns (total 3 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0    Area                          6727 non-null   object
 1   Year                           6727 non-null   int64 
 2   average_rain_fall_mm_per_year  5953 non-null   object
dtypes: int64(1), object(2)
memory usage: 157.8+ KB
None




**Number of unique values per column:**

 Area                            217
Year                              31
average_rain_fall_mm_per_year    173
dtype: int64




**Summary statistics per column:**

       Year
count  6727
mean   2001
std      10
min    1985
25%    1993
50%    2001
75%    2010
max    2017




**Columns with missing values in df:**

average_rain_fall_mm_per_year    774
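
The `info()` output above reports `average_rain_fall_mm_per_year` as `object` rather than a numeric dtype, which suggests that at least some entries are stored as text. The following is a minimal sketch of one way to coerce such a column to numbers. It runs on toy data, and the `'..'` placeholder string is an assumption made for illustration, not something confirmed from the dataset.

```python
import pandas as pd

# Toy stand-in for rain_df; the '..' placeholder is hypothetical.
toy_rain = pd.DataFrame({
    " Area": ["A", "A", "B", "B"],
    "Year": [1985, 1986, 1985, 1986],
    "average_rain_fall_mm_per_year": ["327", "..", None, "1485"],
})

# Convert to numbers; anything that cannot be parsed becomes NaN.
toy_rain["average_rain_fall_mm_per_year"] = pd.to_numeric(
    toy_rain["average_rain_fall_mm_per_year"], errors="coerce"
)

print(toy_rain.dtypes)
print(toy_rain["average_rain_fall_mm_per_year"].isna().sum())
```

With `errors="coerce"`, unparseable entries are counted together with the values that were already missing, so the missing-value count may rise after conversion.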




## Pesticide Use

##### Attributes:
 - Area Code, Area Code (M49), and Area: The geographical area, which looks like the name of the country (254 unique areas).
 - Item Code and Item: The type of pesticide used. It is listed as 'Pesticides (total)' in the initial rows (47 unique items overall).
 - Element Code and Element: Specify what the data represents, for example 'Use' of pesticides (4 unique elements).
 - Year Code and Year: The year in which the data was collected.
 - Unit: The unit of measurement, for example 'tonnes of active ingredients' (4 unique units).
 - Value: The actual value, for example the amount of pesticides used.
 - Flag: A flag attached to each value (4 unique flags).

In [6]:
print_info(pest_df, 'pest_df')

**Dataframe info:**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112350 entries, 0 to 112349
Data columns (total 12 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Area Code        112350 non-null  int64  
 1   Area Code (M49)  112350 non-null  object 
 2   Area             112350 non-null  object 
 3   Item Code        112350 non-null  int64  
 4   Item             112350 non-null  object 
 5   Element Code     112350 non-null  int64  
 6   Element          112350 non-null  object 
 7   Year Code        112350 non-null  int64  
 8   Year             112350 non-null  int64  
 9   Unit             112350 non-null  object 
 10  Value            112350 non-null  float64
 11  Flag             112350 non-null  object 
dtypes: float64(1), int64(5), object(6)
memory usage: 10.3+ MB
None




**Number of unique values per column:**

Area Code            254
Area Code (M49)      254
Area                 254
Item Code             47
Item                  47
Element Code           4
Element                4
Year Code             32
Year                  32
Unit                   4
Value              33890
Flag                   4
dtype: int64




**Summary statistics per column:**

       Area Code  Item Code  Element Code  Year Code    Year    Value
count     112350     112350        112350     112350  112350   112350
mean         804       1338          5159       2006    2006     6569
std         1760         17             5          9       9    65038
min            1       1309          5157       1990    1990        0
25%           74       1321          5157       1998    1998        0
50%          147       1341          5157       2006    2006        8
75%          217       1357          5157       2014    2014      270
max         5817       1357          5173       2021    2021  3535375




**Columns with missing values in df:**

Series([], )




## Crop Yields
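The dataset contains 3,761,168 records and 13 columns.

##### Attributes:
 - Area Code, Area Code (M49), and Area: Represent the geographical location, with 245 unique areas.
 - Item Code, Item Code (CPC), and Item: Indicate the type of crop or livestock product, with 301 unique items.
 - Element Code and Element: Indicate what the value measures, with 18 unique codes and 9 unique element names.
 - Year Code and Year: Represent the year the data was collected, with 61 unique years.
 - Unit: The unit of measurement, with 12 unique units.
 - Value: The reported value for the given element (crop yield for the yield element), with 941,901 unique values.
 - Flag: A flag attached to each value, with 5 unique flags.

##### Notes:

The data types are mixed: integer types (Area Code, Item Code, Element Code, Year Code, Year), a float type (Value), and object types (Area Code (M49), Area, Item Code (CPC), Item, Element, Unit, Flag).

The maximum of Value in the summary statistics is displayed as -2147483648. This is most likely an integer overflow caused by the cast to int in print_info, not a real value in the data.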

In [7]:
print_info(yield_df, 'yield_df')

**Dataframe info:**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3761168 entries, 0 to 3761167
Data columns (total 13 columns):
 #   Column           Dtype  
---  ------           -----  
 0   Area Code        int64  
 1   Area Code (M49)  object 
 2   Area             object 
 3   Item Code        int64  
 4   Item Code (CPC)  object 
 5   Item             object 
 6   Element Code     int64  
 7   Element          object 
 8   Year Code        int64  
 9   Year             int64  
 10  Unit             object 
 11  Value            float64
 12  Flag             object 
dtypes: float64(1), int64(5), object(7)
memory usage: 373.0+ MB
None




**Number of unique values per column:**

Area Code             245
Area Code (M49)       245
Area                  245
Item Code             301
Item Code (CPC)       301
Item                  301
Element Code           18
Element                 9
Year Code              61
Year                   61
Unit                   12
Value              941901
Flag                    5
dtype: int64




**Summary statistics per column:**

       Area Code  Item Code  Element Code  Year Code     Year       Value
count    3761168    3761168       3761168    3761168  3761168     3761168
mean        1487        830          5409       1994     1994     2477943
std         2304       1165           103         17       17    27727895
min            1         15          5111       1961     1961           0
25%           89        339          5312       1979     1979        2600
50%          170        600          5419       1995     1995       23164
75%         5000       1058          5510       2008     2008      156153
max         5817      17530          5513       2021     2021 -2147483648




**Columns with missing values in df:**

Series([], )




## Year Ranges

In [13]:
def year_ranges():
    print(f"Year range in temp.csv {temp_df['Year'].min()} - {temp_df['Year'].max()}")
    print(f"Year range in rainfall.csv {rain_df['Year'].min()} - {rain_df['Year'].max()}")
    print(f"Year range in yield.csv {yield_df['Year'].min()} - {yield_df['Year'].max()}")
    print(f"Year range in pesticides.csv {pest_df['Year'].min()} - {pest_df['Year'].max()}")

year_ranges()

Year range in temp.csv 1743 - 2013
Year range in rainfall.csv 1985 - 2017
Year range in yield.csv 1961 - 2021
Year range in pesticides.csv 1990 - 2021
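
The ranges printed above differ per dataset, so any analysis that needs all four sources is limited to the years they share. A small sketch of that arithmetic follows, with the ranges hard-coded from the printed output. The dictionary and variable names are illustrative and do not come from the notebook.

```python
# (first year, last year) per dataset, copied from the output above.
year_ranges_by_dataset = {
    "temperature": (1743, 2013),
    "rainfall": (1985, 2017),
    "yield": (1961, 2021),
    "pesticides": (1990, 2021),
}

# The shared window starts at the latest first year
# and ends at the earliest last year.
overlap_start = max(start for start, _ in year_ranges_by_dataset.values())
overlap_end = min(end for _, end in year_ranges_by_dataset.values())

print(f"Common year range: {overlap_start} - {overlap_end}")
# Common year range: 1990 - 2013
```

The pesticide data sets the lower bound and the temperature data sets the upper bound.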


# Data Cleaning

## Columns (Data fields)

### Column Transformations
We will rename some columns so that they are consistent across the dataframes, which makes it easier to merge the data later on. We will also subset the dataframes to keep only the columns of interest.

In [8]:
temp_df.rename(columns = {'year':'Year','country':'Country','avg_temp':'Temperature'},inplace = True)
rain_df.rename(columns = {' Area':'Country','average_rain_fall_mm_per_year':'Rainfall'},inplace = True)
yield_df.rename(columns = {'Area':'Country','Value':'Yield'},inplace = True)
pest_df.rename(columns = {'Area':'Country','Value':'Pesticides'},inplace = True)

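# Keep only the columns of interest; .copy() avoids SettingWithCopyWarning on later assignments
pest_df = pest_df[['Country', 'Year', 'Pesticides']].copy()
yield_df = yield_df[['Country', 'Item', 'Year', 'Yield', 'Element Code']].copy()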

### Yield Transformations
The yield dataset contains information on individual crops. For our analysis we will combine these and sum the totals for each country by year. Based on the FAO metadata for the dataset, element code 5419 is the one most closely related to crop yields, so we will subset on it.

In [10]:
yield_df['Element Code'].unique()

array([5312, 5419, 5510, 5111, 5320, 5112, 5410, 5413, 5513, 5313, 5417,
       5424, 5321, 5420, 5318, 5114, 5314, 5422], dtype=int64)

In [11]:
yield_df = yield_df[yield_df['Element Code'] == 5419].groupby(['Country','Year'])['Yield'].sum().reset_index()

In [12]:
yield_df.head(3)

       Country  Year      Yield
0  Afghanistan  1961  1921425.0
1  Afghanistan  1962  1993718.0
2  Afghanistan  1963  1987768.0
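
To make the filter-then-aggregate step above easier to follow, here is a minimal sketch on toy data. The rows and values are invented for illustration. Only the column names and the element code 5419 come from the notebook.

```python
import pandas as pd

# Toy stand-in for yield_df before aggregation; all values are made up.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B"],
    "Item": ["Maize", "Wheat", "Maize", "Maize"],
    "Year": [2000, 2000, 2000, 2000],
    "Element Code": [5419, 5419, 5510, 5419],
    "Yield": [10.0, 20.0, 999.0, 5.0],
})

# Keep only rows with element code 5419, then sum across crops
# for each country and year.
is_yield = toy["Element Code"] == 5419
totals = (
    toy[is_yield]
    .groupby(["Country", "Year"], as_index=False)["Yield"]
    .sum()
)
print(totals)
```

The row with element code 5510 is excluded before summing, so country A ends up with a total of 30.0 for the year 2000.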


## Data Integrity Findings

### Country Names
The country names in the pesticides and yield datasets are not fully consistent. We can view the values that are present in the yield dataset but not in the pesticides dataset. Several of them should clearly have a match, such as Netherlands, United Kingdom, and Turkey. On review of the countries in the pesticides dataset, we note that they are listed there as 'Türkiye', 'United Kingdom of Great Britain and Northern Ireland', and 'Netherlands (Kingdom of the)'.

In [15]:
unique_values_yield = yield_df.loc[~yield_df['Country'].isin(pest_df['Country']), 'Country'].unique()
unique_values_yield

array(['Afghanistan', 'Dominica', 'Guadeloupe', 'Guyana',
       'Marshall Islands', 'Martinique', 'Micronesia', 'Netherlands',
       'Puerto Rico', 'Réunion', 'Singapore', 'South Sudan',
       'United Arab Emirates', 'Uzbekistan'], dtype=object)

In [16]:
replacements = {
    'Türkiye': 'Turkey',
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    'Netherlands (Kingdom of the)':'Netherlands'
}

pest_df['Country'] = pest_df['Country'].replace(replacements)
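
After a replacement like the one above, it can help to re-check the names in both directions, since a mismatch may also hide on the pesticides side. The sketch below does this with set differences on toy data. The toy rows are illustrative only, and the variable names are not from the notebook.

```python
import pandas as pd

# Toy stand-ins for yield_df and pest_df; the rows are illustrative only.
toy_yield = pd.DataFrame({"Country": ["Turkey", "Netherlands", "Kenya", "Guyana"]})
toy_pest = pd.DataFrame(
    {"Country": ["Türkiye", "Netherlands (Kingdom of the)", "Kenya"]}
)

replacements = {
    "Türkiye": "Turkey",
    "Netherlands (Kingdom of the)": "Netherlands",
}
toy_pest["Country"] = toy_pest["Country"].replace(replacements)

yield_names = set(toy_yield["Country"])
pest_names = set(toy_pest["Country"])

# Names left unmatched on each side after harmonising.
only_in_yield = sorted(yield_names - pest_names)
only_in_pest = sorted(pest_names - yield_names)

print("Only in yield:", only_in_yield)
print("Only in pesticides:", only_in_pest)
```

Names that remain in either list are either genuine gaps in one dataset or spellings that still need an entry in `replacements`.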

In [19]:
temp_df[temp_df["Year"] >= 1900].isnull().sum()

Year             0
Country          0
Temperature      0
Year_Bin       345
dtype: int64

In [None]:
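# Keep records from 1900 onward; drop=True avoids carrying the old index along as a column
temp_df_cleaned = temp_df[temp_df["Year"] >= 1900].copy().reset_index(drop=True)
temp_df_cleaned = temp_df_cleaned.drop(columns=['Year_Bin'])
temp_df_cleaned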