In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt

from groclient import GroClient
import groclient.utils

# Useful to display dataframes within a cell - remove when converting to script
from IPython.display import display

In [2]:
ACCESS_TOKEN = os.environ['GROAPI_TOKEN']
client = GroClient('api.gro-intelligence.com', ACCESS_TOKEN)

# Crop weighted data series for world top producers
In this notebook we will:
1. Retrieve the world top producers for a given crop
2. Compute a crop-weighted data series for each of those countries

We use the soybean-weighted Gro Drought Index as an example, but any crop/data series pair will work.

# Preliminary
## Defining items and metrics
In this section we define everything needed to retrieve the data. See [Data Series Definition](https://developers.gro-intelligence.com/data-series-definition.html) for more information.  
There are **4** parts:  
1. Source for world production ranking - the data source used to retrieve the world's top soybean producers
2. Geo information - Defining the data granularity (geospatial level) for the final output: province level (`4`) or district level (`5`). See [Gro Ontology](https://developers.gro-intelligence.com/gro-ontology.html) for more information  
3. Main data series - [Gro Drought Index](https://app.gro-intelligence.com/dictionary/items/17388)  
4. Crop acreage used to compute weights - [Soybeans](https://app.gro-intelligence.com/dictionary/items/270) [land cover](https://app.gro-intelligence.com/dictionary/metrics/2120001)

Each data point is structured as described [here](https://developers.gro-intelligence.com/data-point-definition.html).

In [3]:
# 1. Source used to compute top crop producers
world_production_source = 'FAO'
# 2. Selecting the level of the data (granularity/resolution); 4 for province level, 5 for district level
input_level = 4

# 3. Data series that we want to weight
weighted_item = 'Gro Drought Index'  # Item that we want to weight
weighted_metric = 'Gro Drought Index'

# 4. Data series that we will use to compute the weighting
input_crop = 'soybeans'
# And a metric: we choose `land cover (area)`, but it could just as well be `area harvested` or `area planted`
input_metric = 'land cover (area)'

# Optional source selection (the source name, as a string). If None, the source is fetched automatically
weighted_source = None

The [search_for_entity](https://developers.gro-intelligence.com/api.html#groclient.GroClient.search_for_entity) method converts a name (e.g. Brazil) to a Gro ID (Brazil = 1029).

In [4]:
# 1. Source for the world production ranking
world_production_source_id = client.search_for_entity('sources', world_production_source)
# 2. Geospatial information
# Level of the data (granularity/resolution): 4 for province level, 5 for district level
region_level = input_level
# 3. Data series (drought indicator)
# item_id = the series we are interested in
item_id = client.search_for_entity('items', weighted_item)
# metric_id = the metric to use for our time series (here we want an index)
metric_id = client.search_for_entity('metrics', weighted_metric)
# 4. Crop acreage information (to compute weights)
# Now that we have our time series, we want to weight it using a specific crop.
# For that we need the crop_id; here we use soybeans (ID = 270)
crop_id = client.search_for_entity('items', input_crop)
# And a metric: `land cover (area)` here, but `area harvested` or `area planted` would also work
crop_metric = client.search_for_entity('metrics', input_metric)

## Retrieving the top 5 producers for the designated crop

We need to find the world's largest producers of the designated crop (soybeans in this example).  
The **Gro API** has a built-in method that does just that: [get_top](https://developers.gro-intelligence.com/api.html#groclient.GroClient.get_top)

In [5]:
# Gro ID for the production metric
production_metric = client.search_for_entity('metrics', "Production Quantity mass")
# Use of the `get_top` method
top_countries_df = pd.DataFrame(client.get_top('regions',  num_results=5, metric_id=production_metric, item_id=crop_id, frequency_id=9, source_id=world_production_source_id))
# We only retrieve the regionId of each country, as this is all we need for the next part
top_countries_id = top_countries_df['regionId']
# As an alternative to the last two lines (better suited when we don't need to visualize the output), we could have written:
# top_countries_id = [c['regionId'] for c in client.get_top('regions', num_results=5, metric_id=production_metric, item_id=crop_id, frequency_id=9, source_id=world_production_source_id)]

# Tabular Visualization of the output - This is optional
top_countries_df['country_name'] = top_countries_df['regionId'].apply(lambda x: client.lookup('regions', x)['name'])
top_countries_df['unit_name'] = top_countries_df['unitId'].apply(lambda x: client.lookup('units', x)['name'])
top_countries_df['item_name'] = input_crop
top_countries_df['source_name'] = client.lookup('sources', world_production_source_id)['name']
top_countries_df[['country_name', 'value', 'unit_name', 'source_name']]

Unnamed: 0,country_name,value,unit_name,source_name
0,United States,4627579028,tonne,FAO
1,Brazil,2849075654,tonne,FAO
2,"China, mainland",1436657134,tonne,FAO
3,Argentina,1386334920,tonne,FAO
4,India,375579720,tonne,FAO


## Defining necessary functions

1. `get_source` will be useful to retrieve the best `source` for our data series, from the **Gro API**. This function is derived from [rank_series_by_source](https://developers.gro-intelligence.com/api.html#groclient.GroClient.rank_series_by_source)
2. `get_data_points_wrapper` is a wrapper around the **Gro API** [get_data_points](https://developers.gro-intelligence.com/api.html#groclient.GroClient.get_data_points) function. It will automatically fetch the best source and return the data as a DataFrame. (Source can also be specified manually)
3. `compute_weights` will compute the `weights` from the `crop acreage` series
4. `compute_crop_weighted_series` will compute the final output - `Crop weighted drought index`

These functions work for any crop-weighted series; only the IDs defined above need to change.
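To make the weighting concrete, here is a minimal sketch in plain Python on made-up numbers (the region IDs and values are hypothetical, not Gro data). It mirrors what `compute_weights` and `compute_crop_weighted_series` are described as doing: a region's weight is its mean crop area divided by the sum of the mean areas, and each index value is multiplied by its region's weight.

```python
from statistics import mean

# Hypothetical crop-area observations per region (toy values, not Gro data)
crop_area = {
    101: [30.0, 30.0],  # region 101: mean area 30
    102: [10.0, 10.0],  # region 102: mean area 10
}
# Hypothetical drought index value per region for a single date
drought_index = {101: 2.0, 102: 4.0}

# Weight of a region = its mean crop area / sum of the mean areas
mean_area = {region: mean(values) for region, values in crop_area.items()}
total_area = sum(mean_area.values())
weights = {region: area / total_area for region, area in mean_area.items()}

# Weighted value of a region = weight * index value
weighted = {region: weights[region] * drought_index[region] for region in weights}

print(weights)   # {101: 0.75, 102: 0.25}
print(weighted)  # {101: 1.5, 102: 1.0}
```

Region 101 holds three quarters of the crop area, so its index value contributes three times as much as that of region 102.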

In [6]:
# We define the necessary functions
import itertools
import numpy as np

def indexOf(lst, elem, key=lambda x: x, missing_value_func=lambda x: len(x)):
    '''Returns the index of 'elem' in 'lst', or len(lst) if elem is missing'''
    return next((i for i, x in enumerate(lst) if key(x) == elem), missing_value_func(lst))

def get_source(client, queries):
    """Retrieve best source for a given level.
    Parameters
    ----------
    client : Client
        Gro Client.
    queries: List
        List of dictionaries containing Gro queries

    Returns
    -------
    Gro source id.
    """

    ranking = list(client.rank_series_by_source(queries))
    source_score = {s: 0 for s in set(dic.get("source_id") for dic in ranking)}
    
    if len(ranking) > 0:
        # Get the source ranking for each group of regions (each chunk has its own ranking, so potentially multiple sources)
        # _, v -> _ is the chunk of regions (we don't explicitly need it); v is the list of dicts containing the sources
        group_ranking = [
            [i['source_id'] for i in v] for _, v in itertools.groupby(ranking, lambda x: x.get('region_id'))]
        # For each group of regions, increment the score of each source by its position in the ranking
        # Lower score is better
        for g in group_ranking:
            for s in source_score.keys():
                source_score[s] += indexOf(g, s)
        return min(source_score, key=source_score.get)
    else:
        # No ranking returned: fail with a built-in exception (no `exceptions` module is imported in this notebook)
        raise ValueError("No source available for this query")

def get_data_points_wrapper(regions, metric_id, item_id, frequency_id, source_id=None, start_date=None):
    # `regions` is a list of Gro region IDs; `client` is the GroClient created at the top of the notebook.
    # Slicing our query in chunks is very useful when dealing with a lot of regions.
    # Creating queries
    queries = [{
            "metric_id": metric_id,
            "item_id": item_id,
            "start_date": start_date,
            "frequency_id": frequency_id,
            "region_id": r, } for r in groclient.utils.list_chunk(regions, chunk_size=200)]
    
    # Fetching source
    if source_id is None:
        source_id = get_source(client, queries)
    
    # Updating each query with the source
    queries = [{**q, 'source_id': source_id} for q in queries]
    # Downloading data
    # For each query in queries, download and convert to a pandas DataFrame
    output = pd.concat(
        [pd.DataFrame(client.get_data_points(**q))
         for q in queries]).reset_index(drop=True)
    output.loc[:, 'source_id'] = source_id
    return output

def compute_weights(df, regions):
    def mapper(region):
        return df[(df['region_id'] == region['id'])]['value'].mean(skipna=True)
    means = list(map(mapper, regions))
    # Normalize into weights
    total = np.nansum(means)
    if not np.isclose(total, 0.0):
        return [float(mean)/total for mean in means]
    return means

def compute_crop_weighted_series(crop_df, df, regions, weighting_func=lambda w, v: w*v):
    weights = compute_weights(crop_df, regions)
    series_list = []
    for (region, weight) in zip(regions, weights):
        series = df[(df['region_id'] == region['id'])].copy()
        series.loc[:, 'value'] = weighting_func(weight, series['value'])
        series_list.append(series)
    return pd.concat(series_list)
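
The scoring rule in `get_source` can be illustrated on a toy ranking. The sketch below is plain Python with made-up source IDs and rankings (hypothetical, not API output), and `index_of` is a simplified stand-in for `indexOf`. Each source accumulates its position in every chunk's ranking, a source missing from a ranking counts as that ranking's length, and the lowest total wins.

```python
def index_of(lst, elem):
    """Position of elem in lst, or len(lst) if elem is missing."""
    return lst.index(elem) if elem in lst else len(lst)

# Hypothetical rankings (best source first) for two chunks of regions
group_ranking = [
    [145, 2],     # chunk 1: source 145 first, source 2 second, source 7 absent
    [145, 7, 2],  # chunk 2: source 145 first, source 7 second, source 2 third
]

# Every source that appears in at least one ranking
sources = {s for ranking in group_ranking for s in ranking}

# Sum of positions across chunks; lower is better
score = {s: sum(index_of(ranking, s) for ranking in group_ranking) for s in sources}
best_source = min(score, key=score.get)

print(best_source)  # 145
```

Source 145 scores 0 + 0 = 0, source 2 scores 1 + 2 = 3, and source 7 scores 2 (absent from the first chunk) + 1 = 3.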

## Fetching data and Computing the final result

1. We fetch the two dataframes (`Gro Drought Index` and `soybeans land cover`) using the IDs and frequencies defined above for each country in the top 5
2. We compute the `crop weighted drought index`

### Steps by country
1. Retrieve the regions within the country at the selected level (first cell)
2. Define the series frequency (by default the highest available: if both daily and weekly are available, daily is selected)
3. Download the data
4. Compute the crop weighted series and concatenate the results
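
Step 4 yields one weighted value per region and date. If a single country-level value per date is wanted, one option (an assumption on our part, not something this notebook does) is to sum the weighted values of all regions for each date, which gives a weighted average whenever the weights sum to 1. A minimal sketch on hypothetical rows:

```python
from collections import defaultdict

# Hypothetical weighted rows: (end_date, region_id, weighted value)
weighted_rows = [
    ("2021-02-09", 101, 1.5),
    ("2021-02-09", 102, 1.0),
    ("2021-02-10", 101, 0.75),
    ("2021-02-10", 102, 0.5),
]

# Sum the weighted values of all regions for each date
totals = defaultdict(float)
for end_date, _region_id, value in weighted_rows:
    totals[end_date] += value
country_index = dict(totals)

print(country_index)  # {'2021-02-09': 2.5, '2021-02-10': 1.25}
```

With a pandas DataFrame shaped like the output of this notebook, the same aggregation could be written as `df.groupby('end_date')['value'].sum()`.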

In [10]:
result_list = []

for country_id in top_countries_id:
    # Fetching all subregions at level `region_level`.
    regions = client.get_descendant_regions(country_id, region_level, include_historical=False)
    region_ids = [r['id'] for r in regions] # Get only the 'id' field for each record (region)
    country_name = client.lookup('regions', country_id)['name']
    print("There are {} regions for {} at the selected geospatial level ({})"
          .format(len(regions), country_name, region_level))
    
    # Drought index
    # Three options are available: daily, weekly or monthly
    series_freq = client.get_available_timefrequency(**{"region_id":country_id, "item_id":item_id, "metric_id":metric_id})[0]['frequency_id']  # This will be daily
    # Crop Acreage
    # One option is available : Point in time 
    crop_freq = client.get_available_timefrequency(**{"region_id":country_id, "item_id":crop_id, "metric_id":crop_metric})[0]['frequency_id']  # This will be point in time

    crop_df = get_data_points_wrapper(**{
        "regions": region_ids,
        "metric_id": crop_metric,
        "item_id": crop_id,
        "frequency_id": crop_freq,
        "source_id":None,
        "start_date": None, })

    series_df = get_data_points_wrapper(**{
        "regions": region_ids,
        "metric_id": metric_id,
        "item_id": item_id,
        "frequency_id": series_freq,
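        # `weighted_source` is a source name, so it is converted to a Gro source ID before use
        "source_id": client.search_for_entity('sources', weighted_source) if weighted_source is not None else None,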
        "start_date": None, })
    
    crop_w_series = compute_crop_weighted_series(crop_df, series_df, regions)
    crop_w_series['country_id'] = country_id
    crop_w_series['country_name'] = country_name
    result_list.append(crop_w_series)
    
    # Displaying head and tail of the end result
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    display(pd.concat([crop_w_series.head(3), crop_w_series.tail(3)]))

There are 51 regions for United States at the selected geospatial level (4)


Unnamed: 0,start_date,end_date,value,unit_id,metadata,input_unit_id,input_unit_scale,reporting_date,metric_id,item_id,region_id,partner_region_id,frequency_id,source_id,country_id,country_name
0,2010-01-17T00:00:00.000Z,2010-01-17T00:00:00.000Z,2.4e-05,189,{},189,1,,15852252,17388,13051,0,1,145,1215,United States
1,2010-01-18T00:00:00.000Z,2010-01-18T00:00:00.000Z,2.8e-05,189,{},189,1,,15852252,17388,13051,0,1,145,1215,United States
2,2010-01-19T00:00:00.000Z,2010-01-19T00:00:00.000Z,1.7e-05,189,{},189,1,,15852252,17388,13051,0,1,145,1215,United States
205956,2021-02-08T00:00:00.000Z,2021-02-08T00:00:00.000Z,4e-06,189,{},189,1,,15852252,17388,13101,0,1,145,1215,United States
205957,2021-02-09T00:00:00.000Z,2021-02-09T00:00:00.000Z,4e-06,189,{},189,1,,15852252,17388,13101,0,1,145,1215,United States
205958,2021-02-10T00:00:00.000Z,2021-02-10T00:00:00.000Z,4e-06,189,{},189,1,,15852252,17388,13101,0,1,145,1215,United States


There are 27 regions for Brazil at the selected geospatial level (4)


Unnamed: 0,start_date,end_date,value,unit_id,metadata,input_unit_id,input_unit_scale,reporting_date,metric_id,item_id,region_id,partner_region_id,frequency_id,source_id,country_id,country_name
0,2010-01-17T00:00:00.000Z,2010-01-17T00:00:00.000Z,0.029023,189,{},189,1,,15852252,17388,10402,0,1,145,1029,Brazil
1,2010-01-18T00:00:00.000Z,2010-01-18T00:00:00.000Z,0.029176,189,{},189,1,,15852252,17388,10402,0,1,145,1029,Brazil
2,2010-01-19T00:00:00.000Z,2010-01-19T00:00:00.000Z,0.030052,189,{},189,1,,15852252,17388,10402,0,1,145,1029,Brazil
109077,2021-02-08T00:00:00.000Z,2021-02-08T00:00:00.000Z,0.020023,189,{},189,1,,15852252,17388,10428,0,1,145,1029,Brazil
109078,2021-02-09T00:00:00.000Z,2021-02-09T00:00:00.000Z,0.020154,189,{},189,1,,15852252,17388,10428,0,1,145,1029,Brazil
109079,2021-02-10T00:00:00.000Z,2021-02-10T00:00:00.000Z,0.020251,189,{},189,1,,15852252,17388,10428,0,1,145,1029,Brazil


There are 32 regions for China, mainland at the selected geospatial level (4)


Unnamed: 0,start_date,end_date,value,unit_id,metadata,input_unit_id,input_unit_scale,reporting_date,metric_id,item_id,region_id,partner_region_id,frequency_id,source_id,country_id,country_name
0,2010-01-17T00:00:00.000Z,2010-01-17T00:00:00.000Z,0.013018,189,{},189,1,,15852252,17388,13317,0,1,145,1231,"China, mainland"
1,2010-01-18T00:00:00.000Z,2010-01-18T00:00:00.000Z,0.011992,189,{},189,1,,15852252,17388,13317,0,1,145,1231,"China, mainland"
2,2010-01-19T00:00:00.000Z,2010-01-19T00:00:00.000Z,0.01286,189,{},189,1,,15852252,17388,13317,0,1,145,1231,"China, mainland"
121929,2021-02-08T00:00:00.000Z,2021-02-08T00:00:00.000Z,0.02572,189,{},189,1,,15852252,17388,13348,0,1,145,1231,"China, mainland"
121930,2021-02-09T00:00:00.000Z,2021-02-09T00:00:00.000Z,0.025125,189,{},189,1,,15852252,17388,13348,0,1,145,1231,"China, mainland"
121931,2021-02-10T00:00:00.000Z,2021-02-10T00:00:00.000Z,0.018624,189,{},189,1,,15852252,17388,13348,0,1,145,1231,"China, mainland"


There are 24 regions for Argentina at the selected geospatial level (4)


Unnamed: 0,start_date,end_date,value,unit_id,metadata,input_unit_id,input_unit_scale,reporting_date,metric_id,item_id,region_id,partner_region_id,frequency_id,source_id,country_id,country_name
0,2010-01-17T00:00:00.000Z,2010-01-17T00:00:00.000Z,0.699017,189,{},189,1,,15852252,17388,10136,0,1,145,1010,Argentina
1,2010-01-18T00:00:00.000Z,2010-01-18T00:00:00.000Z,0.694085,189,{},189,1,,15852252,17388,10136,0,1,145,1010,Argentina
2,2010-01-19T00:00:00.000Z,2010-01-19T00:00:00.000Z,0.689701,189,{},189,1,,15852252,17388,10136,0,1,145,1010,Argentina
96957,2021-02-08T00:00:00.000Z,2021-02-08T00:00:00.000Z,0.032793,189,{},189,1,,15852252,17388,10159,0,1,145,1010,Argentina
96958,2021-02-09T00:00:00.000Z,2021-02-09T00:00:00.000Z,0.031288,189,{},189,1,,15852252,17388,10159,0,1,145,1010,Argentina
96959,2021-02-10T00:00:00.000Z,2021-02-10T00:00:00.000Z,0.026157,189,{},189,1,,15852252,17388,10159,0,1,145,1010,Argentina


There are 36 regions for India at the selected geospatial level (4)


Unnamed: 0,start_date,end_date,value,unit_id,metadata,input_unit_id,input_unit_scale,reporting_date,metric_id,item_id,region_id,partner_region_id,frequency_id,source_id,country_id,country_name
0,2010-01-17T00:00:00.000Z,2010-01-17T00:00:00.000Z,0.000113,189,{},189,1,,15852252,17388,11174,0,1,145,1094,India
1,2010-01-18T00:00:00.000Z,2010-01-18T00:00:00.000Z,0.00012,189,{},189,1,,15852252,17388,11174,0,1,145,1094,India
2,2010-01-19T00:00:00.000Z,2010-01-19T00:00:00.000Z,0.000124,189,{},189,1,,15852252,17388,11174,0,1,145,1094,India
137268,2021-02-08T00:00:00.000Z,2021-02-08T00:00:00.000Z,0.00205,189,{},189,1,,15852252,17388,13475,0,1,145,1094,India
137269,2021-02-09T00:00:00.000Z,2021-02-09T00:00:00.000Z,0.002365,189,{},189,1,,15852252,17388,13475,0,1,145,1094,India
137270,2021-02-10T00:00:00.000Z,2021-02-10T00:00:00.000Z,0.002478,189,{},189,1,,15852252,17388,13475,0,1,145,1094,India


#### Exporting the data as CSV

In [8]:
pd.concat(result_list).reset_index().to_csv('top5_countries_{}_w_{}.csv'.format(input_crop, weighted_item.replace(' ', '_').lower()), index=False)