# Python Learn by Doing: ENSO, Your Turn! Option 2 Answer Key

**Developed By:** Dr. Kerrie Geil, Mississippi State University

**Original Development Date:** July 2024

**Package Requirements:** xarray, netcdf4, numpy, pandas, scipy, matplotlib, cartopy, jupyter, geopandas, shapely, xagg

**Links:** **[OSF project link](https://osf.io/zhpd5/)**, [link to this notebook on github](https://github.com/kerriegeil/MSU_py_training/blob/main/learn_by_doing/enso/assignments/enso_analysis_option2.ipynb)

---
**Assignment:**

This specific assignment was requested by a past Python Learn by Doing workshop participant. Thank you, it's great to have suggestions!

Using a shapefile with country boundaries, show a table (use a pandas dataframe) of the percent area of each country in South America where there is statistically significant anomalous temperature during strong El Nino events. Also show the mean anomaly by country.

Pay attention to the hints below, as this assignment uses data, packages, and coding techniques that weren't covered in the main notebook. As such, this is more of a learning assignment than a practice assignment. 


&emsp;Hints:
- Use World_Countries_Generalized.shp for country boundaries and subset to South America. Here's the code to download the shapefile

```python

import os
from urllib.request import urlretrieve

# create a folder for data downloads
if not os.path.exists('../data/World_Countries'):
    os.makedirs('../data/World_Countries')

# filenames to save data to and download urls
base_filename='../data/World_Countries/World_Countries'

shpfile_info=  {'.cpg':'https://osf.io/5xrgc/download',
                '.dbf':'https://osf.io/3a6rp/download',
                '.prj':'https://osf.io/43mnp/download',
                '.shp':'https://osf.io/r4dez/download',
                '.shp.xml':'https://osf.io/s4cvy/download',
                '.shx':'https://osf.io/kp6cm/download'}    
for ext,url in shpfile_info.items():
    filename=base_filename+ext
    print('downloading',filename)
    urlretrieve(url,filename) # download and save data

```
- Use geopandas to read the shapefile and subset the rows to South American countries with a `LAND_TYPE` of `Primary land`.
- Calculate the area of each country in square kilometers and add it to the geodataframe as a new column. The most accurate way to calculate areas of polygons spread across the globe is to do the calculation on an ellipsoid: keep the data in the geographic (unprojected) coordinate reference system epsg:4326 and use the corresponding ellipsoid parameters. An acceptable substitute here is to project the geodataframe to an equal-area projection (epsg:6933), use the geopandas `.area` attribute, which returns square meters in that projection, and divide by 1000^2 to get km2. Here is a function that does the calculation on an ellipsoid.
```python

import numpy as np
from shapely.geometry.polygon import orient

def area_calc(geodf):
    # input geodf is a geodataframe with crs epsg:4326

    # function that operates on individual geometry objects
    def get_area(geom,geod):        
        if geom.geom_type not in ['MultiPolygon','Polygon']:
            return np.nan
        
        # orient to ensure a counter-clockwise traversal. 
        # See https://pyproj4.github.io/pyproj/stable/api/geod.html
        # geometry_area_perimeter returns (area, perimeter)
        if geom.geom_type == 'Polygon':
            return geod.geometry_area_perimeter(orient(geom, 1))[0]/1E6
        # For MultiPolygon do each separately and sum
        if geom.geom_type == 'MultiPolygon':
            return np.sum([get_area(poly,geod) for poly in geom.geoms])

    # check presence of geographic crs and execute or raise error
    if geodf.crs and geodf.crs.is_geographic:
        # apply the get_area function to each country (row)
        geod = geodf.crs.get_geod()
        geodf['AREA_KM2']=geodf.apply(lambda row : get_area(row.geometry,geod),axis=1)
        return geodf
    else:
        raise TypeError('geodataframe should have geographic coordinate reference system') 
            
```

- From enso_analysis.ipynb, copy the appropriate data cleaning steps for the nino index `nino` and t anomaly `t_anom` data
- From enso_analysis.ipynb, copy the appropriate analysis steps from science questions 1 and 3 to get `t_nino_DJF_composite` and `t_nino_pval` 
- To get the area impacted, use the xagg package (after creating the weightmap, we recommend trying the remaining steps for one country first, then writing a function to apply them to all countries):
    - use `weightmap = xagg.pixel_overlaps(input, input, subset_bbox=False)` to compute the overlaps between all grid cell polygons and all country polygons
    - for each country, use the weightmap from above to create a country-specific pandas dataframe `country_df` where each row contains information for a single grid cell that overlaps the country polygon. The dataframe needs columns for lat and lon (pulled from `weightmap.agg['coords'][indexrow]`) and rel_area (pulled from `weightmap.agg['rel_area'][indexrow][0]`).
    - convert `t_nino_DJF_composite` and `t_nino_pval` to dataframes with columns for lat and lon (use `.reset_index(level=[0,1])`)
    - use pandas `.merge` to merge the temperature anomaly data and pvalue into the country_df (use parameters `how='left',on=['lat','lon']`)
    - use pandas `.loc` to sum rel_area where pval <= 0.1
    - if you haven't already, create a function that takes country-specific info from the weightmap and outputs area_impacted. Then, generate results for all countries using a for loop.
- To get the country-mean temperature anomaly use `xagg.core.aggregate()` along with the weightmap generated in the previous steps
- Optional: calculate the country-mean temperature anomaly only where pval<=0.1. Hint: input a set of weights to xagg.pixel_overlaps() where insignificant pixels have weight 0 and significant pixels have weight 1.
- Format your results into a pandas dataframe. Your final dataframe should have columns for `COUNTRY`, `T_PERCENT_AREA`, and `MEAN_T_ANOMALY`.

# Import packages and define workspace

In [None]:
import os
from urllib.request import urlretrieve

import geopandas as gpd
import pandas as pd
import numpy as np
from shapely.geometry.polygon import orient
import xarray as xr
import scipy.stats as ss
import xagg

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cf

In [None]:
# create a folder for data downloads
if not os.path.exists('../data/World_Countries'):
    os.makedirs('../data/World_Countries')

In [None]:
# filenames to save data to and download urls
base_filename='../data/World_Countries/World_Countries'

shpfile_info=  {'.cpg':'https://osf.io/5xrgc/download',
                '.dbf':'https://osf.io/3a6rp/download',
                '.prj':'https://osf.io/43mnp/download',
                '.shp':'https://osf.io/r4dez/download',
                '.shp.xml':'https://osf.io/s4cvy/download',
                '.shx':'https://osf.io/kp6cm/download'}

# files we've already downloaded
nino_f = '../data/nino34_anomalies_monthly_NOAA.txt'
t_f = '../data/tavg_monthly_BerkeleyEarth.nc'

# Obtaining the data

We'll use a shapefile of country boundaries called [World Countries, originally obtained from ESRI ArcGIS Hub](https://hub.arcgis.com/datasets/esri::world-countries/explore) in June 2024 and copied to the [enso component of the MSU_py_training OSF project](https://osf.io/e726y/). 

<br>
<font color="green"><b>
**NOTE: You only need to run the following urlretrieve cell once. The data files will then be located on your computer. Files total approximately 130MB in size.**
</b></font> 

In [None]:
# download shapefile

for ext,url in shpfile_info.items():
    filename=base_filename+ext
    print('downloading',filename)
    urlretrieve(url,filename) # download and save data

# Data Cleaning

We'll subset global country boundaries to South American countries and calculate country areas, as well as copy over the relevant data cleaning steps from enso_analysis.ipynb.  

In [None]:
# load shapefile of global country boundaries
countries=gpd.read_file(base_filename+'.shp')
countries

In [None]:
# check the crs
countries.crs

In [None]:
# subset to south america
countries=countries.loc[(countries['CONTINENT']=='South America')
                        &(countries['LAND_TYPE'].str.contains('Primary land'))].reset_index(drop=True)

# showing how to use equal area projection to get country areas
# for global data this may be less accurate than calculating areas with unprojected data on an ellipsoid
# but I'm showing it because it is much simpler
countries=countries.to_crs("epsg:6933")
countries['AREA_6933'] = countries.area/(1E6)

# reproject to a geographic crs
countries=countries.to_crs("epsg:4326")
countries

In [None]:
# calculate shape areas on an ellipsoid (geographic crs)

def area_calc(geodf):
    # input geodf is a geodataframe with crs epsg:4326
    # function that operates on individual geometry objects
    def get_area(geom,geod):        
        if geom.geom_type not in ['MultiPolygon','Polygon']:
            return np.nan
        
        # orient to ensure a counter-clockwise traversal. 
        # See https://pyproj4.github.io/pyproj/stable/api/geod.html
        # geometry_area_perimeter returns (area, perimeter)
        if geom.geom_type == 'Polygon':
            return geod.geometry_area_perimeter(orient(geom, 1))[0]/1E6
        # For MultiPolygon do each separately and sum
        if geom.geom_type == 'MultiPolygon':
            return np.sum([get_area(poly,geod) for poly in geom.geoms])

    # check presence of geographic crs and execute or raise error
    if geodf.crs and geodf.crs.is_geographic:
        # apply the get_area function to each country (row)
        geod = geodf.crs.get_geod()
        geodf['AREA_KM2']=geodf.apply(lambda row : get_area(row.geometry,geod),axis=1)
        return geodf
    else:
        raise TypeError('geodataframe should have geographic coordinate reference system')    

In [None]:
countries=area_calc(countries)
countries

In [None]:
# data cleaning copied from enso_analysis.ipynb

year_start = '1948'
year_end = '2023'
base_start = '1981'
base_end = '2010'

dates=pd.date_range('1870-01-01','2024-12-01',freq='MS')

# Nino 3.4 data
nino_raw=pd.read_csv(nino_f,sep=r'\s+',skiprows=1,skipfooter=7,header=None,index_col=0,na_values=-99.99,engine='python')
nino=nino_raw.to_numpy().flatten() 
nino=xr.DataArray(nino,name='nino',dims='time',coords={'time':dates}) 
nino.attrs['standard_name']='nino3.4 index'
nino.attrs['units']='C'
nino=nino.sel(time=slice(year_start,year_end))

# temperature data
ds=xr.open_dataset(t_f)
dates=pd.date_range('1750-01-01','2024-03-01',freq='MS')
ds['time']=dates
ds=ds.rename({'month_number':'month','latitude':'lat','longitude':'lon'})
t_anom_5180=ds.temperature
clim_5180=ds.climatology
t=t_anom_5180.groupby(t_anom_5180.time.dt.month)+clim_5180
t_base=t.sel(time=slice(base_start,base_end))  # subset in time
clim_8110 = t_base.groupby(t_base.time.dt.month).mean('time')  # long term means for each month
t_anom=t.groupby(t.time.dt.month)-clim_8110
t_anom=t_anom.sel(time=slice(year_start,year_end))
t_anom=t_anom.rename('tavg')
t_anom.attrs['standard_name']='T anomaly'
t_anom.attrs['units']='C'

# check first and last time is the same for all data
variables=[nino, t_anom] # list of arrays
for var in variables:
    print(var.name, var.time[0].data,var.time[-1].data)

# clean up
del ds, nino_raw, t, t_anom_5180, t_base

# Begin Main Analysis

First, we need to know when El Nino and La Nina events occurred, so we'll copy over code from question 1 in enso_analysis.ipynb (How many strong El Nino and La Nina events have occurred from 1948 to 2023?). We only need the code that creates the array `nino_events`.
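
As a rough illustration of the criterion used in the next cell (a 5-month centered rolling mean that stays beyond ±0.6 for at least 5 consecutive months), the sketch below applies the same idea to a short synthetic series. The helper name `flag_events` and the synthetic values are invented for this example; the notebook itself operates on the xarray `nino` array.

```python
import numpy as np
import pandas as pd

def flag_events(index, nmonths=5, thresh=0.6):
    """Flag runs of at least nmonths rolling-mean values beyond +/-thresh (hypothetical helper)."""
    # centered rolling mean; NaN where the window is incomplete
    roll = pd.Series(index).rolling(nmonths, center=True).mean().to_numpy()
    flags = np.full(roll.shape, np.nan)
    for i in range(len(roll) - nmonths + 1):
        window = roll[i:i + nmonths]
        if np.all(window > thresh):       # warm (El Nino-like) run
            flags[i:i + nmonths] = 1
        elif np.all(window < -thresh):    # cold (La Nina-like) run
            flags[i:i + nmonths] = -1
    return flags

# synthetic index: a 9-month warm spell and an isolated 1-month spike
series = np.zeros(24)
series[3:12] = 2.0
series[18] = 2.0

flags = flag_events(series)
print(flags)
```

The isolated spike is never flagged because its 5-month rolling mean stays below the threshold, while the long warm spell is flagged as one continuous event.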

In [None]:
# copied from question 1 in enso_analysis.ipynb

# constants based on our criteria
nmonths=5
event_thresh=0.6

# first calculate the 5-month rolling mean
nino_rollmean=nino.rolling(time=nmonths,center=True).mean()

# create an array to hold our results and initialize to nan
# this array is where we will fill values with +1,-1
nino_events=nino_rollmean.copy() 
nino_events[:]=np.nan

# now loop through months and fill +1, -1 for windows of 5 months that meet our criteria
for i,value in enumerate(nino_rollmean):
    # La Nina conditions
    if  value < -event_thresh:
        
        # possible La Nina conditions, look forward 4 more months
        window=nino_rollmean[i:i+nmonths]
        if all(window < -event_thresh):
            nino_events[i:i+nmonths] = -1

    # El Nino conditions
    if  value > event_thresh:
        # possible El Nino conditions, look forward 4 more months
        window=nino_rollmean[i:i+nmonths]
        if all(window > event_thresh):
            nino_events[i:i+nmonths]=1    

Next, we need to build our El Nino temperature anomaly composite and determine statistical significance. We'll copy over the relevant code from question 3 in enso_analysis.ipynb.
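
The significance test in the next cell, `ss.ttest_ind(..., axis=0, equal_var=False)`, is Welch's unequal-variance t-test applied independently at every grid cell along the time axis. As a sketch of what that involves, the following computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom with numpy on synthetic data. The array sizes, the random seed, and the injected signal are all invented here, and the p-value (which requires the t distribution) is left to scipy.

```python
import numpy as np

def welch_t(sample1, sample2):
    """Welch t statistic and degrees of freedom along axis 0 (time)."""
    n1, n2 = sample1.shape[0], sample2.shape[0]
    # squared standard error of each sample mean, per grid cell
    se1 = sample1.var(axis=0, ddof=1) / n1
    se2 = sample2.var(axis=0, ddof=1) / n2
    t = (sample1.mean(axis=0) - sample2.mean(axis=0)) / np.sqrt(se1 + se2)
    dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, dof

rng = np.random.default_rng(0)
nino_like = rng.normal(size=(30, 2, 3))     # (time, lat, lon), invented sizes
other_like = rng.normal(size=(100, 2, 3))
nino_like[:, 0, 0] += 3.0                   # inject a warm signal in one cell

t, dof = welch_t(nino_like, other_like)
print(t.shape, dof.shape)
print(np.unravel_index(np.abs(t).argmax(), t.shape))
```

Each output has one value per grid cell, which is why the notebook can wrap the resulting p-values back into a (lat, lon) data array.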

In [None]:
# copied from question 3 in enso_analysis.ipynb

# starting with el nino conditions, temperature
# get temperature anomalies only for times during strong el nino conditions
t_nino=t_anom.where(nino_events==1,drop=True)

# now separate out winter DJF months
# this is sample 1: winter months during strong el nino conditions
t_nino_DJF=t_nino.groupby(t_nino.time.dt.season)['DJF'] 

# make a composite
t_nino_DJF_composite=t_nino_DJF.mean('time',keep_attrs=True)

# create a t sample that includes all winter months DJF when there are not strong el nino conditions

# all months that don't fall in strong nino events
t_other=t_anom.where(nino_events!=1,drop=True) 

# pull out just DJF months
# this is sample 2: all winter months that are NOT during strong el nino conditions
t_other_DJF=t_other.groupby(t_other.time.dt.season)['DJF'] 

print('t nino and non-nino sample sizes:',t_nino_DJF.shape[0],t_other_DJF.shape[0]) 

# t-test for difference in means 
t_sigtest = ss.ttest_ind(t_nino_DJF, t_other_DJF, axis=0, equal_var=False)
# numpy --> xarray
t_nino_pval = xr.DataArray(t_sigtest.pvalue, coords={'lat':('lat',t_nino.coords['lat'].data),'lon':('lon',t_nino.coords['lon'].data)})


# plot el nino temperature anomalies where statistically significant at 90% level
pval=0.1

fig=plt.figure(figsize=(12,8))

ax=fig.add_subplot(111,projection=ccrs.PlateCarree())
ax.add_feature(cf.COASTLINE.with_scale("50m"),lw=0.3)
ax.add_feature(cf.BORDERS.with_scale("50m"),lw=0.3)
t_nino_DJF_composite.where(t_nino_pval<pval).plot(cmap='RdBu_r',cbar_kwargs={'shrink':0.9,'orientation':'horizontal','pad':0.05})
plt.title(f'winter mean temperature anomalies\n during strong El Nino conditions (p < {pval})')

plt.show()


Now we can use our shapefile to answer the question: what percent area of each country in South America experiences statistically significant anomalous temperature during strong winter El Nino events?

First, we use the xagg package to compute the overlaps between grid cell polygons and country polygons.

`xagg.pixel_overlaps()` computes the relative area of overlap for each grid cell polygon. It takes an xarray dataset and a geopandas dataframe as inputs.

See the xagg documentation for more info: [xagg.wrappers.pixel_overlaps](https://xagg.readthedocs.io/en/latest/xagg.html#xagg.wrappers.pixel_overlaps), [xagg.core.get_pixel_overlaps](https://xagg.readthedocs.io/en/latest/xagg.html#xagg.core.get_pixel_overlaps).

In [None]:
weightmap = xagg.pixel_overlaps(t_nino_DJF_composite,countries,subset_bbox=False)

`xagg.pixel_overlaps()` returns an object that contains:
1) a pandas dataframe with the grid cell polygon overlap information (access it with `.agg`),
2) a dictionary containing the xarray data array source grid info (access it with `.source_grid`),
3) a pandas series of geometry objects containing the geopandas source geometry info (access it with `.geometry`),
4) if weights are input to `pixel_overlaps`, a pandas series of source grid weights, otherwise the string 'nowghts' (access it with `.weights`).

In [None]:
# access the dataframe with .agg
weightmap.agg.head()

The relative area of a country polygon that each grid cell intersecting the country occupies is stored in the column `rel_area` and the corresponding centroid lat and lon of each grid cell is stored in the column `coords`.
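
To make `rel_area` concrete, here is a deliberately simplified sketch: a rectangular "country" laid over a grid of unit cells in flat x/y coordinates, with each cell's share of the rectangle's area computed by hand. This is not how xagg computes overlaps internally (it works with real polygons on geographic coordinates); the function name and the rectangle are invented for illustration.

```python
def overlap_fractions(rect, nx, ny):
    """Share of a rectangle's area falling in each unit grid cell (hypothetical helper).

    rect = (xmin, ymin, xmax, ymax); cell (i, j) spans [i, i+1] x [j, j+1].
    Returns {cell centroid: fraction of the rectangle's area}.
    """
    xmin, ymin, xmax, ymax = rect
    total = (xmax - xmin) * (ymax - ymin)
    fractions = {}
    for i in range(nx):
        for j in range(ny):
            width = min(xmax, i + 1) - max(xmin, i)
            height = min(ymax, j + 1) - max(ymin, j)
            if width > 0 and height > 0:      # keep only cells that overlap
                fractions[(i + 0.5, j + 0.5)] = width * height / total
    return fractions

fractions = overlap_fractions((0.5, 0.5, 2.5, 2.0), nx=4, ny=4)
for centroid, frac in fractions.items():
    print(centroid, round(frac, 3))
print('sum of fractions:', round(sum(fractions.values()), 3))
```

Cells fully inside the rectangle get the largest share, edge cells get less, and the shares sum to 1 when the grid covers the whole shape.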

Here's how you would access the relative area of each pixel intersecting the first polygon/multipolygon (Argentina) from the dataframe. This indexing returns a pandas series of 351 values, meaning that 351 grid cells of our xarray data array intersect the Argentina polygon.


In [None]:

weightmap.agg['rel_area'][0][0]

The coordinates (grid cell polygon centroids) that correspond to the `rel_area` values above would be indexed as follows. There are a total of 351 (lat,lon) tuples, which are returned as a list of tuples.

In [None]:
print(weightmap.agg['coords'][0][0:4])

print(len(weightmap.agg['coords'][0]),'total tuples in the list')

For each country we want to create a dataframe where each row contains data for a single grid cell. We want columns for lat, lon, relative area, data value, and p value. We'll merge data on grid cell latitude and longitude, since those are the fields that all our data has in common. We'll test this merge on one country first, using Bolivia (index row 1 in the `weightmap.agg` dataframe).
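
Converting a 2-D data array to a dataframe produces a (lat, lon) MultiIndex, and `.reset_index(level=[0,1])` moves both index levels into ordinary columns so they can serve as merge keys. A minimal pandas-only sketch of that step, using a hand-built stand-in with invented values rather than the real data:

```python
import pandas as pd

# stand-in for the result of DataArray.to_dataframe(): one value per (lat, lon) pair
idx = pd.MultiIndex.from_product([[-10.5, -9.5], [-60.5, -59.5]], names=["lat", "lon"])
tavg = pd.DataFrame({"tavg": [0.3, 0.5, 0.1, 0.8]}, index=idx)

# move both index levels into regular columns
tavg_df = tavg.reset_index(level=[0, 1])
print(tavg_df)
```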

In [None]:
# first, convert t_nino_DJF_composite and t_nino_pval to pandas dataframes with columns for lat and lon
t_nino_DJF_composite_df=t_nino_DJF_composite.to_dataframe().reset_index(level=[0,1])

t_nino_pval.name='pval' # data array needs a name to be successfully converted to dataframe
t_nino_pval_df=t_nino_pval.to_dataframe().reset_index(level=[0,1])

t_nino_pval_df

In [None]:
# now combine with the information from the weightmap for one country 
indexrow=1 # bolivia, index row 1

# put coords (lat, lon) of grid cells overlapping this country into dataframe columns
country_df=pd.DataFrame(weightmap.agg['coords'][indexrow],columns=['lat','lon'])

# add rel_area as a dataframe column
country_df['rel_area']=weightmap.agg['rel_area'][indexrow][0].reset_index(drop=True)

# join the tavg and pval info
country_df=country_df.merge(t_nino_DJF_composite_df, how='left',on=['lat','lon'])
country_df=country_df.merge(t_nino_pval_df, how='left',on=['lat','lon'])
country_df

We can see that the merge subsets the 64800 rows in `t_nino_pval_df` and `t_nino_DJF_composite_df` to just the 124 that overlap with the Bolivia polygon. This is because we chose to merge 'left' (merge into `country_df`). A 'right' merge (into `t_nino_DJF_composite_df`, for example) would have kept all 64800 rows (global grid cells) and inserted NaNs in the `rel_area` column outside of Bolivia.
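
The difference between the two merge directions can be seen on a toy example. The coordinates, `rel_area` values, and p-values below are invented; only the `how` and `on` arguments mirror the notebook.

```python
import pandas as pd

# invented stand-ins: 2 grid cells overlap the "country", 4 cells in the full grid
coords = [(-10.5, -60.5), (-9.5, -60.5)]
country_df = pd.DataFrame(coords, columns=["lat", "lon"])
country_df["rel_area"] = [0.4, 0.6]

grid_df = pd.DataFrame({
    "lat": [-10.5, -10.5, -9.5, -9.5],
    "lon": [-60.5, -59.5, -60.5, -59.5],
    "pval": [0.02, 0.50, 0.30, 0.01],
})

left = country_df.merge(grid_df, how="left", on=["lat", "lon"])    # keeps country cells only
right = country_df.merge(grid_df, how="right", on=["lat", "lon"])  # keeps every grid cell

print(len(left), len(right))
print(right["rel_area"].isna().sum(), "grid cells have no rel_area")
```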

If we want to see which grid cells overlap the Bolivia polygon and how much each cell contributes to the polygon area, xagg has a function for that. Find more details in the xagg documentation: [xagg.classes.weightmap](https://xagg.readthedocs.io/en/latest/xagg.html#xagg.classes.weightmap), [xagg.diag.diag_fig](https://xagg.readthedocs.io/en/latest/xagg.html#xagg.diag.diag_fig).

Notice how the darkest grid cells are the northernmost cells that fall completely inside the polygon, while at the south end of the country grid cells completely inside the polygon are slightly less dark colored. This is because grid cells with evenly spaced lat, lon bounds are larger (more area) closer to the equator and smaller closer to the poles. So this is what we expect to see.
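
The size of this latitude effect can be estimated with the spherical formula for the area of a lat-lon cell, R² · Δλ · (sin φ_north − sin φ_south). The sketch below assumes a spherical Earth with a mean radius of 6371 km and uses two illustrative 1-degree cells at latitudes roughly spanning Bolivia; an ellipsoidal calculation would differ slightly.

```python
import math

R_KM = 6371.0  # mean Earth radius, spherical approximation (assumption)

def cell_area_km2(lat_south, lat_north, dlon_deg=1.0):
    """Area of a lat-lon cell on a sphere: R^2 * dlon * (sin(lat_north) - sin(lat_south))."""
    dlon = math.radians(dlon_deg)
    band = math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    return R_KM ** 2 * dlon * band

north = cell_area_km2(-11.0, -10.0)   # 1-degree cell near 10S
south = cell_area_km2(-23.0, -22.0)   # 1-degree cell near 22S
print(round(north), round(south), round(south / north, 3))
```

The southern cell comes out a few percent smaller than the northern one, consistent with the slightly lighter shading of fully interior cells toward the south of the country.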

In [None]:
# show the relative grid cell weights in a figure
weightmap.diag_fig(indexrow,t_nino_DJF_composite.to_dataset())


To get the relative area of Bolivia where nino winter temperature anomalies meet the 90% confidence level, we need to sum the `rel_area` column where the `pval` column is <= 0.1

In [None]:
area_impacted=country_df.loc[country_df['pval']<=0.1].rel_area.sum()
area_impacted=round(area_impacted*100.) # fraction to percent and limit precision

print(area_impacted,'% of Bolivia experiences statistically significant temperature anomalies during strong winter el nino events')

To get the result for all countries we can write a function and call it in a loop. 

In [None]:
def percent_area_impacted(coords,rel_area,xr_pval):
    df=pd.DataFrame(coords,columns=['lat','lon'])
    df['rel_area']=rel_area.reset_index(drop=True)
    
    xr_pval.name='pval'
    pval_df=xr_pval.to_dataframe().reset_index(level=[0,1])
    df=df.merge(pval_df, how='left',on=['lat','lon'])

    area_impacted=round(df.loc[df['pval']<=0.1].rel_area.sum()*100.)
    return area_impacted

In [None]:
# loop through countries (rows of weightmap.agg dataframe)

results={} # empty dictionary

for index,row in weightmap.agg.iterrows():
    answer=percent_area_impacted(row.coords,row.rel_area[0],t_nino_pval)
    results[row.COUNTRY]=answer

# create new dataframe with the results
results_df=pd.DataFrame.from_dict(results,orient='index',columns=['T_PERCENT_AREA']).reset_index(names='COUNTRY')
results_df

Now find the mean temperature anomaly per country using `xagg.core.aggregate()`. 

The result will be the area-weighted mean of `t_nino_DJF_composite` at all grid cells that overlap each country (including values that are and are not statistically significant).

Here, the result appears in a column called `tavg` because the column name is based on the data variable name stored in the metadata of our xarray data array `t_nino_DJF_composite`.
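
Conceptually, an area-weighted mean multiplies each grid cell's anomaly by that cell's share of the country's area and sums over cells. A numpy sketch with invented numbers (xagg's own normalization and NaN handling may differ in detail):

```python
import numpy as np

rel_area = np.array([0.5, 0.3, 0.2])   # invented shares of a country's area
tavg = np.array([1.0, 0.5, -0.5])      # invented anomalies for the same cells

weighted_mean = np.average(tavg, weights=rel_area)
simple_mean = tavg.mean()
print(weighted_mean, simple_mean)
```

The two means differ because the cell covering half of the country counts for more than the cell covering a fifth of it.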

In [None]:
mean_anomaly=xagg.core.aggregate(t_nino_DJF_composite,weightmap)
mean_anomaly=mean_anomaly.to_dataframe()
mean_anomaly

In [None]:
# subset the dataframe to the relevant columns and adjust column names
mean_anomaly=mean_anomaly[['COUNTRY','tavg']] # drop all but two columns
mean_anomaly=mean_anomaly.rename(columns={'tavg':'MEAN_T_ANOMALY'})

# merge into our results_df dataframe on the country name
results_df=results_df.merge(mean_anomaly,how='left',on='COUNTRY')
results_df

Finding the mean per country over only the grid cells with statistically significant anomalies is easy to do since xagg.pixel_overlaps() allows input of a set of weights. In our case, these weights will simply be a 0 and 1 mask where values of 1 represent t_nino_pval<=0.1. 

We need to create the mask, generate a new weightmap with the mask, and aggregate the data based on the new weight map.
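
The effect of a 0/1 mask can be sketched with numpy. This assumes that each cell's final weight is its area share multiplied by the supplied weight and then renormalized, which is an assumption about how the weights combine; check the xagg documentation for specifics. All values are invented.

```python
import numpy as np

rel_area = np.array([0.5, 0.3, 0.2])    # invented shares of a country's area
tavg = np.array([1.0, 0.5, -0.5])       # invented anomalies
pval = np.array([0.02, 0.40, 0.05])     # invented p-values

mask = np.where(pval <= 0.1, 1, 0)      # 1 means statistically significant
combined = rel_area * mask              # insignificant cells drop out
mean_sig = np.sum(tavg * combined) / np.sum(combined)
print(mask, mean_sig)
```

If a country has no significant cells, the combined weights sum to zero and this simple version has no defined mean for it.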

In [None]:
# weights to input to xagg.pixel_overlaps
mask = xr.where(t_nino_pval<=0.1,1,0) # 1 means statistically significant

# generate new weightmap
weightmap_mod = xagg.pixel_overlaps(t_nino_DJF_composite,countries,subset_bbox=False,weights=mask)

# aggregate the t anomaly values only for grids with statistical significance
# and export result to dataframe
mean_anomaly_sigloc=xagg.core.aggregate(t_nino_DJF_composite,weightmap_mod)
mean_anomaly_sigloc=mean_anomaly_sigloc.to_dataframe()

# drop unnecessary columns
mean_anomaly_sigloc=mean_anomaly_sigloc[['COUNTRY','tavg']]

# rename the aggregated column
mean_anomaly_sigloc=mean_anomaly_sigloc.rename(columns={'tavg':'MEAN_T_ANOMALY_SIGLOC'})

# merge into our results dataframe
results_df=results_df.merge(mean_anomaly_sigloc,how='left',on='COUNTRY')
results_df

## Side note about calculating polygon areas from gridded/raster data

Here, our gridded data is NaN over ocean grid cells (or grid cells that are mostly ocean). This means that a grid cell with a value of NaN could slightly overlap land. We have to take that into consideration when interpreting the accuracy/precision of these results.

Take, for example, Trinidad and Tobago. The area we calculated is 98% instead of 100% because of the way the grid cells align with the coastline. This could be the case for any country with a coast.
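
The mechanism is easy to reproduce: a NaN p-value never satisfies `pval <= 0.1`, so the corresponding `rel_area` is left out of the sum. The numbers below are invented to mimic a small island covered by three grid cells, one of which is treated as ocean in the gridded data.

```python
import numpy as np
import pandas as pd

# invented values; the last cell is an "ocean" cell, so its p-value is NaN
df = pd.DataFrame({
    "rel_area": [0.60, 0.38, 0.02],
    "pval": [0.01, 0.03, np.nan],
})

significant = df["pval"] <= 0.1          # NaN <= 0.1 evaluates to False
percent_area = round(df.loc[significant, "rel_area"].sum() * 100.0)
print(percent_area)
```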

Let's take a closer look using xagg's diagnostic figure to see which grid cells overlap Trinidad and Tobago (row 11 of the `weightmap.agg` dataframe).

In [None]:
indexrow=11
weightmap.diag_fig(indexrow,t_nino_DJF_composite.to_dataset())

We'll now make a plot to look at which grid cells of our data are considered ocean (nan)

In [None]:
# set up figure with coastlines
fig=plt.figure(figsize=(6,6))
ax=fig.add_subplot(111,projection=ccrs.PlateCarree())
ax.add_feature(cf.COASTLINE.with_scale("50m"),lw=0.3)

# color nan grey
cmap = plt.colormaps.get_cmap("RdBu_r").copy()
cmap.set_bad('grey') 

# subset based on the above plot's lat and lon
latmin,latmax = 9.5, 11.5
lonmin,lonmax = -62,-60
t_nino_DJF_composite.sel(lat=slice(latmin,latmax),lon=slice(lonmin,lonmax)).plot(cmap=cmap)

As you can see above, a small part of the island is covered by an ocean grid where the data value is nan (colored grey). This is why our T_PERCENT_AREA is 98%. The nan grid cell accounts for the other 2% area of the country. It's just something to be aware of and to take into consideration when choosing a precision for presenting your results. It wouldn't make sense to present the results in this case to multiple decimal places because of this limitation of our data/analysis. 