# CAN-mBio Project
An analysis of global crop productivity and its relationship to intellectual property rights for plant breeders.

## Stakeholders
[mBio](https://datascience.uchicago.edu/news/new-mbio-data-portal-brings-transparency-to-genetically-modified-crops-in-africa/) is a collaboration between researchers based at **The University of San Francisco**, **The University of Chicago**, and **The University of Cambridge**. The project exists to understand the effects of GMO crop development primarily on the African continent.

The project, funded by the [11th Hour Project](https://11thhourproject.org/), has successfully built a [database](https://mbioproject.org/about) of information about the use of GMOs and other related material by scraping public websites and using AI, ML, and other data techniques to source information. This database provides a unique perspective on who uses these products and how.

This project has resulted in both [popular press mentions](https://www.thenation.com/article/world/new-colonialist-food-economy/) and published [academic articles](https://nph.onlinelibrary.wiley.com/doi/full/10.1002/ppp3.10453).

The **[University of Chicago Data Science Institute (DSI)](https://datascience.uchicago.edu/)**, founded in 2021, executes the university's bold, innovative vision of Data Science as a new discipline. Through a grant received from the 11th Hour Project run by the Schmidt Family Foundation, its staff consults with social impact organizations to provide technical solutions at no cost.

## Background
Plant breeding is essential for achieving food security in the context of modern population growth and climate change. Improved plant fertility, pest and disease resistance, salt and drought tolerance, and nutrient absorption, among other adaptations, result in larger crop yields while minimizing harm to the natural environment. However, successful plant breeding requires expertise, time (10-15 years for many species), and to scale, significant investments in land and specialized equipment like greenhouses, growth chambers, and laboratories.

The <u>[International Union for the Protection of New Varieties of Plants (UPOV)](https://www.upov.int/portal/index.html.en)</u>, an intergovernmental organization founded in 1961 and based in Geneva, Switzerland, argues that plant breeders must be incentivized to continue their work given that a new plant variety can easily be reproduced and reused by others once discovered. To safeguard economic profits for breeders, UPOV has developed a "blueprint" regulatory framework of intellectual property rights (i.e., the UPOV Convention) that awards breeders monopolies on new plant varieties and requires other farmers to seek their authorization for marketing and sale of the varieties (e.g., through licensing fees). Today, a country or intergovernmental organization can become a member of UPOV by adapting the intellectual property laws in the UPOV Convention to their own legal jurisdictions. As of May 2023, UPOV had 78 members representing more than a third of the world’s countries.

However, this growth in membership has not been without controversy. Grassroots organizations, nonprofits, and other experts argue that UPOV favors seeds from “Big Ag” companies like Bayer (which acquired Monsanto), Syngenta, Corteva, and BASF over those from small farmers. These companies typically grow export crops using monoculture farming, an unsustainable practice that depletes the soil of nutrients over time and requires extensive irrigation, fertilizers, and pesticides that harm the environment. On June 1st, 2023, a coalition of farmers’ organizations, women’s organizations, trade activists, and consumer groups released a joint statement to express their concern over Benin’s potential membership in UPOV, urging the nation to protect its food sovereignty from foreign companies and promote the use of indigenous seeds better adapted to the local environment. The outcome of these efforts is still to be determined, but the coalition hopes to continue lobbying against UPOV.

## Problem
Researchers and other experts suspect that the growth of UPOV has not resulted in larger crop yields over time and would like to test this hypothesis by analyzing publicly available data from the [Food and Agriculture Organization (FAO) of the United Nations](https://www.fao.org/faostat/en/#data/QCL). This dataset records hectares of land area harvested and tons of output produced for different primary crops (e.g., almonds, papayas, avocados) each year from 1961 to 2021 for over 200 countries. It is supplemented by additional datasets listing standardized country codes, units of measurement, quality control flags, and countries' membership in UPOV over time.

An exploratory data analysis (EDA) could explore the following questions:

* How are crop types currently distributed across countries, and how has this changed over time, if at all?
* For each crop type, has there been an overall increase or decrease in the number of tons produced? In the area harvested? Are there certain countries where production or land area of the crop has dramatically increased or decreased? Explain.
* For each country, how has the total area harvested and tons of crops produced changed over time? Cross-reference these time series with current events to better understand why this might be the case.
* How are the quality flags (official figure, estimate value, imputed value, etc.) distributed across the dataset? Is data collected for certain years or geographic areas seemingly more robust?
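
The quality-flag question could be approached by counting flag values across the `Y****F` columns described in the Data Dictionary. The following is a minimal sketch under that assumption; the `sample` frame and the `flag_distribution` helper are hypothetical, and the values are invented for illustration rather than taken from FAOSTAT.

```python
import pandas as pd

# Hypothetical miniature of faostat_crops.csv (values invented for illustration).
sample = pd.DataFrame({
    "Area": ["Afghanistan", "Albania"],
    "Y1976": [5900, 1200],
    "Y1976F": ["E", "A"],
    "Y1977": [6000, 1250],
    "Y1977F": ["E", "E"],
})


def flag_distribution(crops: pd.DataFrame) -> pd.DataFrame:
    """Count how often each quality flag appears in each year."""
    # Flag columns are named like "Y1976F": a "Y", four digits, then "F".
    flags_wide = crops.filter(regex=r"^Y\d{4}F$")
    flags_long = flags_wide.melt(var_name="year", value_name="flag")
    flags_long["year"] = flags_long["year"].str[1:5].astype(int)
    # Rows are years, columns are flag values, cells are counts.
    return pd.crosstab(flags_long["year"], flags_long["flag"])


flag_counts = flag_distribution(sample)
print(flag_counts)
```

Dividing each row of the result by its row total would give the share of official, estimated, or imputed values per year, which is one way to compare robustness across time.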

Meanwhile, our main research questions are as follows:

* Is membership in UPOV positively correlated with, or predictive of, higher crop yields? Is the effect noticeable after a certain amount of time has passed since joining?
* What effect does joining UPOV have on the area harvested of specific crops? Are there crops that seem to have been more affected? If so, conduct a case study of one to two crops to better understand what factors may have contributed to that outcome.
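
As a starting point for the research questions, yield can be derived as production divided by area harvested and then compared before and after a country's UPOV entry year. The sketch below assumes a long-format crop table with hypothetical columns `country`, `year`, `area_ha`, and `production_t`; the numbers are invented, and `add_yield_and_membership` is an illustrative helper, not part of the project code.

```python
import pandas as pd

# Invented long-format records; real data would come from the cleaned crop dataset.
crops_long = pd.DataFrame({
    "country": ["Australia"] * 4,
    "year": [1987, 1988, 1990, 1991],
    "area_ha": [100.0, 100.0, 100.0, 100.0],
    "production_t": [200.0, 220.0, 250.0, 270.0],
})
# Same column names as upov_members.csv.
members = pd.DataFrame({"Country": ["Australia"], "Entry Year": [1989]})


def add_yield_and_membership(crops: pd.DataFrame, members: pd.DataFrame) -> pd.DataFrame:
    """Attach yield, a membership indicator, and years since UPOV entry."""
    out = crops.merge(members, how="left", left_on="country", right_on="Country")
    out["yield_t_per_ha"] = out["production_t"] / out["area_ha"]
    # Countries that never joined have a missing entry year, so this is False for them.
    out["is_member"] = out["year"] >= out["Entry Year"]
    out["years_since_entry"] = out["year"] - out["Entry Year"]
    return out.drop(columns="Country")


panel = add_yield_and_membership(crops_long, members)
mean_yield = panel.groupby("is_member")["yield_t_per_ha"].mean()
print(mean_yield)
```

A difference in group means like this is only descriptive. Yields tend to rise over time for many reasons, so any real analysis would need to account for time trends and compare against countries that did not join.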

## Expected Deliverables
Each team is expected to turn in:

* A Python script with functions to clean and standardize the crop dataset.
* An EDA that loads the data and then answers the questions above using one or more Jupyter notebooks.
* A Jupyter notebook that explores the correlation between UPOV membership and crop yields and determines the extent to which UPOV membership is predictive of yields.
* A Jupyter notebook that explores the effect joining UPOV had on the area harvested for different crops.
* A written (2-3 pages) or digital report that walks through your data, methodology, analysis, and conclusions and provides data visualizations.

## Working with the Data
This repository only contains the data for use by this project. All datasets are saved as comma-separated files and can be read by any analysis tool which opens standard CSV.

The data in this repository should be considered a starting point for this project. There are numerous directions that the analysis could go and leveraging additional datasets to support your conclusions is strongly encouraged.

## Data Dictionary
### **faostat_country_codes.csv**
This CSV file maps country names to United Nations country codes and ISO2 and ISO3 codes from the [International Organization for Standardization](https://www.iso.org/about-us.html). It was downloaded directly from the FAOSTAT website.

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Data Type</th>
<th>Description</th>
<th>Example Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Country Code</td>
<td>String</td>
<td>A likely auto-generated code.</td>
<td><code>"2"</code></td>
</tr>
<tr>
<td>Country</td>
<td>String</td>
<td>The country name.</td>
<td><code>"Afghanistan"</code></td>
</tr>
<tr>
<td>M49 Code</td>
<td>String</td>
<td>The United Nations M49 standard code.</td>
<td><code>"'004"</code></td>
</tr>
<tr>
<td>ISO2 Code</td>
<td>String</td>
<td>The ISO2 standard code.</td>
<td><code>"AF"</code></td>
</tr>
<tr>
<td>ISO3 Code</td>
<td>String</td>
<td>The ISO3 standard code.</td>
<td><code>"AFG"</code></td>
</tr>
<tr>
<td>Start Year</td>
<td>Integer</td>
<td>---</td>
<td>---</td>
</tr>
<tr>
<td>End Year</td>
<td>Integer</td>
<td>---</td>
<td>---</td>
</tr>
</tbody>
</table>

### **faostat_crops.csv**
This CSV file contains information on crop production and harvested area size over time for different countries. It was downloaded directly from the FAOSTAT website. Please note that data cleaning should drop irrelevant columns and only keep the Area harvested and Production metrics.

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Data Type</th>
<th>Description</th>
<th>Example Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Area Code</td>
<td>String</td>
<td>A unique identifier for the country. Likely automatically generated.</td>
<td><code>"2"</code></td>
</tr>
<tr>
<td>Area Code (M49)</td>
<td>String</td>
<td>A unique identifier for the country. Uses the United Nations' M49 standard.</td>
<td><code>"'004"</code></td>
</tr>
<tr>
<td>Area</td>
<td>String</td>
<td>The country name.</td>
<td><code>"Afghanistan"</code></td>
</tr>
<tr>
<td>Item Code</td>
<td>String</td>
<td>A unique identifier for the crop grown. Likely automatically generated.</td>
<td><code>"221"</code></td>
</tr>
<tr>
<td>Item Code (CPC)</td>
<td>String</td>
<td>A unique identifier for the crop grown. Uses the Central Product Classification System standard.</td>
<td><code>"'01371"</code></td>
</tr>
<tr>
<td>Item</td>
<td>String</td>
<td>The crop name.</td>
<td><code>"Almonds, in shell"</code></td>
</tr>
<tr>
<td>Element Code</td>
<td>String</td>
<td>A unique identifier for the metric observed.</td>
<td><code>"5312"</code></td>
</tr>
<tr>
<td>Element</td>
<td>String</td>
<td>The name of the metric observed.</td>
<td><code>"Area harvested"</code></td>
</tr>
<tr>
<td>Unit</td>
<td>String</td>
<td>The units used to measure the metric.</td>
<td><code>"ha"</code></td>
</tr>
<tr>
<td>Y****</td>
<td>String</td>
<td>The value of the metric for the given year. In the column name, the asterisks are replaced by a year (e.g., <code>Y1976</code>).</td>
<td><code>"5900"</code></td>
</tr>
<tr>
<td>Y****F</td>
<td>String</td>
<td>The quality control flag for the metric in the given year. In the column name, the asterisks are replaced by a year (e.g., <code>Y1976F</code>).</td>
<td><code>"E"</code></td>
</tr>
<tr>
<td>Y****N</td>
<td>String</td>
<td>Notes about the quality control flag used for the metric. In the column name, the asterisks are replaced by a year (e.g., <code>Y1976N</code>).</td>
<td><code>"Unofficial figure"</code></td>
</tr>
</tbody>
</table>
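
The cleaning step described for this file (keep only the Area harvested and Production metrics) pairs naturally with reshaping the wide `Y****` columns into one row per country, crop, and year. The sketch below assumes the column names in the table above; the `wide` frame holds invented values, and `tidy_crops` is a hypothetical helper name.

```python
import pandas as pd

# Invented miniature of the wide FAOSTAT layout (one row per metric).
wide = pd.DataFrame({
    "Area": ["Afghanistan"] * 3,
    "Area Code (M49)": ["'004"] * 3,
    "Item": ["Almonds, in shell"] * 3,
    "Element": ["Area harvested", "Production", "Yield"],
    "Unit": ["ha", "t", "100 g/ha"],
    "Y1976": [5900, 9800, 16610],
    "Y1976F": ["E", "E", "E"],
})


def tidy_crops(crops: pd.DataFrame) -> pd.DataFrame:
    """Keep the two metrics of interest and reshape year columns to rows."""
    kept = crops[crops["Element"].isin(["Area harvested", "Production"])].copy()
    # M49 codes carry a leading apostrophe (e.g. "'004"); strip it.
    kept["Area Code (M49)"] = kept["Area Code (M49)"].str.lstrip("'")
    id_cols = ["Area", "Area Code (M49)", "Item", "Element"]
    # Value columns are named like "Y1976" (no trailing F or N).
    value_cols = list(kept.filter(regex=r"^Y\d{4}$").columns)
    long = kept.melt(id_vars=id_cols, value_vars=value_cols,
                     var_name="Year", value_name="Value")
    long["Year"] = long["Year"].str[1:].astype(int)
    # One row per country, crop, and year, with one column per metric.
    tidy = long.pivot_table(
        index=["Area", "Area Code (M49)", "Item", "Year"],
        columns="Element", values="Value", aggfunc="first",
    ).reset_index()
    tidy.columns.name = None
    return tidy


tidy = tidy_crops(wide)
print(tidy)
```

The flag and note columns are dropped here for brevity; they could be reshaped the same way and joined back on country, crop, and year if the quality information is needed.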

### **faostat_flags.csv**
This CSV file describes the meaning of each quality control flag in the crop dataset. It was downloaded directly from the FAOSTAT website.
<table>
<thead>
<tr>
<th>Column Name</th>
<th>Data Type</th>
<th>Description</th>
<th>Example Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Flag</td>
<td>String</td>
<td>The flag value.</td>
<td><code>"Q"</code></td>
</tr>
<tr>
<td>Flags</td>
<td>String</td>
<td>The flag meaning/interpretation.</td>
<td><code>"Missing value; suppressed"</code></td>
</tr>
</tbody>
</table>

### **faostat_units.csv**
This CSV file describes the meaning of each unit of measurement in the crop dataset. It was downloaded directly from the FAOSTAT website.
<table>
<thead>
<tr>
<th>Column Name</th>
<th>Data Type</th>
<th>Description</th>
<th>Example Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Unit Name</td>
<td>String</td>
<td>The unit value.</td>
<td><code>"100 g/ha"</code></td>
</tr>
<tr>
<td>Description</td>
<td>String</td>
<td>The unit meaning/interpretation.</td>
<td><code>"hundred Grams per hectare"</code></td>
</tr>
</tbody>
</table>

### **upov_members.csv**
This CSV file contains information on when different countries joined the UPOV. This was collected and cleaned by the Data Science Institute staff.

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Data Type</th>
<th>Description</th>
<th>Example Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Country</td>
<td>String</td>
<td>The country name.</td>
<td><code>"Australia"</code></td>
</tr>
<tr>
<td>Entry Year</td>
<td>Integer</td>
<td>The year the country joined the UPOV.</td>
<td><code>1989</code></td>
</tr>
</tbody>
</table>

# Extract, Transform, Load

In [1]:
# Load the Dataset
# Mount the Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
# Import the libraries
import numpy as np                  # Scientific Computing
import pandas as pd                 # Data Analysis
import matplotlib.pyplot as plt     # Plotting
import seaborn as sns               # Statistical Data Visualization

# Let's make sure pandas returns all the rows and columns for the dataframe
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Force pandas to display full numbers instead of scientific notation
# pd.options.display.float_format = '{:.0f}'.format

# Library to suppress warnings
import warnings
warnings.filterwarnings('ignore')

#### Bash clone Repo

In [3]:
%%bash
# Clone the GitHub repo (the %%bash cell magic must be the first line of the cell)
# Change directory to the project folder
cd "/content/drive/MyDrive/Spring 2024/DATA 304 Applied Data Science for Social Impact/Group Project"
# Create a container for the data files
mkdir -p data
# Change directory to new container
cd data
# Clone the repo
git clone git@github.com:laketalkemp/DATA-302_SP24.git

Cloning into 'DATA-302_SP24'...
Host key verification failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.


CalledProcessError: Command 'b'# Change directory to the project folder\ncd "/content/drive/MyDrive/Spring 2024/DATA 304 Applied Data Science for Social Impact/Group Project"\n# Create a container for the data files\nmkdir -p data\n# Change directory to new container\ncd data\n# Clone the repo\ngit clone git@github.com:laketalkemp/DATA-302_SP24.git\n'' returned non-zero exit status 128.

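This method fails with `Host key verification failed`, which points to an SSH authentication problem rather than a missing repository: the `git@github.com:` URL uses SSH, and the Colab runtime likely has no SSH key or known-hosts entry for GitHub. The resulting process error only reports that the remote could not be read. Since the repository is public, an HTTPS link is the next thing to try; consider the sharing link if other options fail.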

In [None]:
# Change directory to the project folder
! cd "/content/drive/MyDrive/Spring 2024/DATA 304 Applied Data Science for Social Impact/Group Project"
# Create a container for the data files
! mkdir -p data
# Change directory to new container
! cd data
! ls

data  drive  sample_data
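
The listing above shows the runtime's top-level folder rather than the project folder. This is consistent with each `!` command running in its own subshell, so a `! cd` does not carry over to the next line. A minimal sketch of a directory change that persists, using Python's `os` module (IPython's `%cd` magic is an alternative); `project_dir` is a temporary stand-in here and would be replaced by the real Drive path.

```python
import os
import tempfile

# Stand-in for the Drive project folder; replace with the real path in Colab.
project_dir = tempfile.mkdtemp()

data_dir = os.path.join(project_dir, "data")
os.makedirs(data_dir, exist_ok=True)  # same effect as `mkdir -p data`
os.chdir(data_dir)                    # changes the notebook's own working directory

print(os.getcwd())
```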


In [None]:
# Clone the repo using the permalink to public repo
! git clone https://github.com/laketalkemp/DATA-302_SP24/tree/3890b4645546351a09c0bea221f7636312dfccdc/data

Cloning into 'data'...
fatal: repository 'https://github.com/laketalkemp/DATA-302_SP24/tree/3890b4645546351a09c0bea221f7636312dfccdc/data/' not found


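This attempt also fails, but for a different reason: the link points to a folder view on the GitHub website (`/tree/<commit>/data`), and `git clone` only accepts the URL of a whole repository. Consider trying the `wget` method, which downloads individual files from their raw URLs. The files are stored in the runtime's local storage, so use a hosted runtime to ensure performance without delays.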

In [None]:
! wget https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/faostat_crops.csv

--2024-02-12 05:54:22--  https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/faostat_crops.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45158996 (43M) [text/plain]
Saving to: ‘faostat_crops.csv’


2024-02-12 05:54:24 (271 MB/s) - ‘faostat_crops.csv’ saved [45158996/45158996]



In [None]:
! wget https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/faostat_country_codes.csv

--2024-02-12 05:54:29--  https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/faostat_country_codes.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12601 (12K) [text/plain]
Saving to: ‘faostat_country_codes.csv’


2024-02-12 05:54:29 (24.3 MB/s) - ‘faostat_country_codes.csv’ saved [12601/12601]



In [None]:
! wget https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/faostat_flags.csv

--2024-02-12 05:54:34--  https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/faostat_flags.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407 [text/plain]
Saving to: ‘faostat_flags.csv’


2024-02-12 05:54:34 (20.2 MB/s) - ‘faostat_flags.csv’ saved [407/407]



In [None]:
! wget https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/faostat_units.csv

--2024-02-12 05:54:40--  https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/faostat_units.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1857 (1.8K) [text/plain]
Saving to: ‘faostat_units.csv’


2024-02-12 05:54:40 (19.7 MB/s) - ‘faostat_units.csv’ saved [1857/1857]



In [None]:
! wget https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/upov_members.csv

--2024-02-12 05:54:48--  https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data/upov_members.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1323 (1.3K) [text/plain]
Saving to: ‘upov_members.csv’


2024-02-12 05:54:48 (55.8 MB/s) - ‘upov_members.csv’ saved [1323/1323]
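
The five downloads above follow the same pattern, so they could be collapsed into a loop. The sketch below uses only the standard library and the raw URLs already shown; `raw_url` and `download_all` are hypothetical helper names, and the download call is left commented out because it requires network access.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and filenames taken from the wget commands above.
BASE_URL = "https://raw.githubusercontent.com/laketalkemp/DATA-302_SP24/main/data"
FILENAMES = [
    "faostat_crops.csv",
    "faostat_country_codes.csv",
    "faostat_flags.csv",
    "faostat_units.csv",
    "upov_members.csv",
]


def raw_url(filename: str) -> str:
    """Build the raw download URL for one data file."""
    return f"{BASE_URL}/{filename}"


def download_all(dest_dir: str = ".") -> list:
    """Download every data file into dest_dir, skipping files already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in FILENAMES:
        target = dest / name
        if not target.exists():
            urlretrieve(raw_url(name), str(target))
        saved.append(target)
    return saved


# download_all("data")  # uncomment to fetch the files (requires network access)
```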



In [None]:
# Convert to dataframes
#faostat_crops = pd.read_csv("faostat_crops.csv")
faostat_country_codes = pd.read_csv("faostat_country_codes.csv")
faostat_flags = pd.read_csv("faostat_flags.csv")
faostat_units = pd.read_csv("faostat_units.csv")
upov_members = pd.read_csv("upov_members.csv")

This method works well, but loading `faostat_crops.csv` raises a Unicode error. It appears to be a specific error in a particular row, which suggests the file contains characters that are not valid in pandas' default UTF-8 encoding. Nothing in the data dictionary indicates that the data should not load, so the file may need examination in a visual editor such as Excel. For now, the problematic file is commented out.
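
One common cause of a Unicode error in `pd.read_csv` is a file saved in an encoding other than UTF-8. The sketch below tries a list of encodings in turn; `read_csv_with_fallback` is a hypothetical helper, and Latin-1 is only a guess, since the actual encoding of `faostat_crops.csv` is not stated anywhere in the project. The demonstration file is made up.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn and return (dataframe, encoding_used)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err
    raise last_error


# Demonstration on a small made-up file that is not valid UTF-8.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "wb") as f:
    f.write("Area,Y1976\nCôte d'Ivoire,5900\n".encode("latin-1"))

demo_df, used_encoding = read_csv_with_fallback(demo_path)
print(used_encoding)
print(demo_df)
```

Latin-1 can decode any byte sequence, so it never raises an error but may silently produce the wrong characters. It is worth spot-checking names with accents (for example, country names) after loading.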

In [None]:
# View the first 10 rows of the datasets
faostat_country_codes.head(10)

Unnamed: 0,Country Code,Country,M49 Code,ISO2 Code,ISO3 Code,Start Year,End Year
0,2,Afghanistan,4,AF,AFG,,
1,5100,Africa,2,F5100,X06,,
2,284,Åland Islands,248,F284,ALA,,
3,3,Albania,8,AL,ALB,,
4,4,Algeria,12,DZ,DZA,,
5,5,American Samoa,16,AS,ASM,,
6,5200,Americas,19,F5200,X21,,
7,6,Andorra,20,AD,AND,,
8,7,Angola,24,AO,AGO,,
9,258,Anguilla,660,AI,AIA,,


In [None]:
faostat_flags.head(10)

Unnamed: 0,Flag,Flags
0,A,Official figure
1,B,Time series break
2,C,"Aggregate, may include official, semi-official..."
3,E,Estimated value
4,F,Forecast value
5,I,Imputed value
6,M,"Missing value (data cannot exist, not applicable)"
7,N,Not significant (negligible)
8,O,Missing value
9,P,Provisional value


In [None]:
faostat_units.head(10)

Unnamed: 0,Unit Name,Description
0,%,Percent
1,%LSU,Percent of Total Livestock Units
2,°c,Degrees celsius
3,0.1 g/An,tenth Grams per animal
4,100 g,hundred Grams
5,100 g/An,hundred Grams per animal
6,100 g/ha,hundred Grams per hectare
7,100 g/t,hundred Grams per tonne
8,100 mg/An,hundred Milligrams per animal
9,1000 An,thousand Animals


In [None]:
upov_members.head(10)

Unnamed: 0,Country,Entry Year
0,Albania,2005
1,Argentina,1994
2,Australia,1989
3,Austria,1994
4,Azerbaijan,2004
5,Belarus,2003
6,Belgium,1976
7,Bolivia (Plurinational State of),1999
8,Bosnia and Herzegovina,2017
9,Bulgaria,1998


The datasets can be merged to create one table. At first glance:
* `upov_members` can be left joined with `faostat_country_codes` on `Country`.
* `faostat_flags` should be mapped to the flag columns of the `faostat_crops` dataset.
* `faostat_units` should be used as labels on any plots generated; otherwise the units are categorical data that would require encoding later. Consider using them to check that scaling is appropriate and that values are in the correct order of magnitude.
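
The first two points could be sketched as follows. The miniature frames reuse the column names shown in the outputs above, but their contents are partly invented: "Atlantis" is a deliberate non-match added to show how unmatched country names can be detected.

```python
import pandas as pd

# Invented miniatures of the two tables.
country_codes = pd.DataFrame({
    "Country": ["Albania", "Algeria", "Australia"],
    "ISO3 Code": ["ALB", "DZA", "AUS"],
})
members = pd.DataFrame({
    "Country": ["Albania", "Australia", "Atlantis"],  # "Atlantis" is a deliberate non-match
    "Entry Year": [2005, 1989, 2000],
})

# Left join keeps every UPOV member; the indicator column records whether a match was found.
merged = members.merge(country_codes, on="Country", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)

# Map flag codes to their descriptions using the flags table as a lookup.
flags = pd.DataFrame({
    "Flag": ["A", "E"],
    "Flags": ["Official figure", "Estimated value"],
})
flag_labels = flags.set_index("Flag")["Flags"]
flag_column = pd.Series(["E", "A", "E"])  # stand-in for one Y****F column
described = flag_column.map(flag_labels)
print(described.tolist())
```

Checking `unmatched` before any analysis is worthwhile because country names are often spelled differently across sources; joining on a standard code such as ISO3 would be more robust once the names are reconciled.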