# Bringing it all together: cleaning data, doing analyses, and putting those analyses in space

For the last portion of this workshop session, we will look at a very real example of how my students and I bring together exploratory data analysis and many different Python packages: mathematical packages like `numpy` and `scipy`, data analysis and visualization tools like `pandas` and `seaborn`, and geospatial tools in `geopandas`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

import geopandas as gpd
import seaborn as sns

import scipy as sp
from scipy import stats

Sade took this [fairly messy data](https://www2.gwu.edu/~calm/data/north.htm) and turned it into something great!

This `CALM_export.csv` is actually an export from the shapefile listed on the website above, which was the least messy format of the data. I exported it as a .csv so we would have an example of turning columns with location data into a `GeoDataFrame` :)

In [None]:
data = pd.read_csv('../arctic-data/CALM_export.csv')
data

# Major cleaning!!

The first thing Sade did was rename the columns that hold the yearly ALT (active layer thickness) measurements, so that each one is labeled with its year.

In [None]:
year_names = np.arange(1990, 2016, 1)

old_columns = data.columns[6:32]

mapping = {old_columns[i]: year_names[i] for i in range(len(old_columns))}

data = data.rename(columns=mapping)
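Here is a minimal sketch of the same renaming idea on a made-up table. The column names `ALT_a`, `ALT_b`, `ALT_c` and the three years are hypothetical, not the real CALM headers. It pairs the old names with the years using `zip`, and it first checks that the two lists have the same length, so a wrong slice fails loudly instead of quietly mislabeling years.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one ID column followed by three measurement columns
toy = pd.DataFrame({
    "Site_Name": ["A", "B"],
    "ALT_a": [40.0, 55.0],
    "ALT_b": [41.0, 56.0],
    "ALT_c": [42.0, 57.0],
})

toy_years = np.arange(1990, 1993, 1)  # 1990, 1991, 1992
toy_old_columns = toy.columns[1:4]    # the three measurement columns, by position

# Guard against an off-by-one slice before renaming anything
assert len(toy_old_columns) == len(toy_years)

# Pair each old column name with its year and rename
toy_mapping = dict(zip(toy_old_columns, toy_years))
toy = toy.rename(columns=toy_mapping)

print(list(toy.columns))
```

The length check is the useful part: if the slice and the year range ever disagree, you find out before any column gets the wrong label.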

Then, somewhere in the import/export process, the "no measurement" entries turned into zeros, which is bad because a zero would be read as a real measurement. Some of the values also had symbols attached to them (for example `>263`).

In [None]:
data = data.replace(r'^\s*$', np.nan, regex=True)  # blank or whitespace-only cells
data = data.replace(0.0, np.nan)  # zeros that stand in for "no measurement"
data = data.replace([">263", ">260", ">235"], np.nan)  # values that carry a ">" symbol

Finally, we want to make sure our year columns are being read as floats and not objects (strings).

In [None]:
data.iloc[:, 6:32].dtypes

# iloc is index location (selection by integer position)
# The : in the first half of the bracketed list means "all rows"
# and 6:32 means "column positions 6 through 31" (the end of a slice is excluded)

Oops! Let's fix that:

In [None]:
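year_cols = data.columns[6:32]  # assign by label; writing through .iloc can keep the object dtype in recent pandas
data[year_cols] = data[year_cols].apply(pd.to_numeric, errors='coerce')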

In [None]:
data.dtypes
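The cleaning and conversion steps above can be sketched on a tiny made-up table. The values and column names here are hypothetical. This version leans on `pd.to_numeric(..., errors='coerce')` to turn anything that is not a number (blanks, `>263`, and so on) into NaN in one go, then masks the zeros. Like the cells above, it throws away the `>` values rather than keeping them as lower bounds, which is a choice and not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical messy table: numbers stored as text, blanks, a zero, and a ">" value
toy = pd.DataFrame({
    "site": ["A", "B", "C", "D"],
    "1990": ["41", ">263", "0", ""],
    "1991": ["43.5", "55", " ", "60"],
})
toy_year_cols = ["1990", "1991"]

cleaned = toy.copy()

# Anything that cannot be parsed as a number becomes NaN
cleaned[toy_year_cols] = cleaned[toy_year_cols].apply(pd.to_numeric, errors="coerce")

# Zeros stand in for "no measurement" here, so blank them out as well
cleaned[toy_year_cols] = cleaned[toy_year_cols].mask(cleaned[toy_year_cols] == 0)

print(cleaned.dtypes)
print(cleaned)
```

After this, both year columns should report a float dtype, and only the genuine measurements remain.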

# How is the active layer changing every year?

We're going to have to get a little clever here, because each site has its own unique dataset issues - some sites are missing most years, some sites have data gaps... 

## Write an algorithm that grabs an x and y array for years and measurements

In [None]:
for site in range(len(data)):  # for every row (site) in our dataset

    # read the yearly ALT measurements for this site as floats
    y_floats = np.array(data.iloc[site, 6:32].values, dtype=float)

    valid = ~np.isnan(y_floats)  # True where that year's measurement is NOT a nan
    y = y_floats[valid]  # keep only the valid measurements
    x = year_names[valid]  # and the corresponding years

    if len(y) > 10:  # only fit sites with more than 10 valid measurements

        data.loc[site, "average"] = np.mean(y)  # grab the mean of those measurements

        # regress the measurements (y) against the years (x)
        res = stats.linregress(x, y)

        # res holds the results of the fit; store them in our data frame
        data.loc[site, "slope"] = res.slope
        data.loc[site, "intercept"] = res.intercept
        data.loc[site, "rvalue"] = res.rvalue
        data.loc[site, "pvalue"] = res.pvalue
        data.loc[site, "stderr"] = res.stderr
    # sites without enough valid data are skipped and keep NaN in these columns
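To see what the loop does for a single site, here is a minimal sketch with made-up numbers (not CALM data). The record is built to be perfectly linear at 1 cm/yr with two missing years, so the fitted slope should come out very close to 1.

```python
import numpy as np
from scipy import stats

# Made-up record for one site: thickness grows 1 cm per year, two years missing
toy_years = np.arange(1990, 2002)            # 12 years, 1990 through 2001
toy_alt = 40.0 + (toy_years - 1990) * 1.0    # perfectly linear, for illustration only
toy_alt[[1, 4]] = np.nan                     # pretend 1991 and 1994 were not measured

valid = ~np.isnan(toy_alt)                   # True where a measurement exists
x = toy_years[valid]                         # years that have data
y = toy_alt[valid]                           # the matching measurements

fit = stats.linregress(x, y)                 # least-squares line through the valid points
print(f"{valid.sum()} valid years, slope = {fit.slope:.2f} cm/yr")
```

Because the same boolean mask is applied to both arrays, the years and measurements stay aligned even when there are gaps.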

Now what do we have?

In [None]:
data.head()

# Now make it a map!

Any dataframe with latitude and longitude columns can be converted into a `GeoDataFrame` if we specify those columns as the geometry and give the appropriate CRS (lat/long will usually be WGS84, EPSG:4326).

In [None]:
gdf = gpd.GeoDataFrame(
    data, geometry=gpd.points_from_xy(data.Longitude, data.Latitude), crs='epsg:4326')
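Here is a small sketch of the same conversion on two made-up sites (the coordinates are hypothetical). The thing to watch is the argument order of `points_from_xy`: x is longitude and y is latitude.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical coordinates for two made-up sites
toy = pd.DataFrame({
    "Site_Name": ["A", "B"],
    "Latitude": [70.0, 65.5],
    "Longitude": [-150.0, 25.0],
})

# x is longitude and y is latitude: swapping them puts the points in the wrong place
toy_gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["Longitude"], toy["Latitude"]),
    crs="EPSG:4326",
)

print(toy_gdf.crs)
print(toy_gdf.geometry)
```

Printing the geometry column is a quick way to confirm that each point reads as (longitude, latitude).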

Let's see what this looks like

In [None]:
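# GeoPandas (before version 1.0) has a simple map of the Earth built in
# gpd.datasets was removed in 1.0, so newer versions need a Natural Earth file loaded from disk instead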
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

fig, ax = plt.subplots(figsize=(5,5),dpi=300)
im = world.plot(
    color='white', edgecolor='black', ax=ax)

gdf.plot(
        ax=ax,
        column='slope',
        vmin=-2,
        vmax=2,
        cmap='seismic',
        s=10, #size of point
        legend=True,
        legend_kwds={
            'label': "Change in Active Layer Thickness (cm/yr) from 1991 to 2015",
            'orientation': "horizontal"
            }
        )

ax.set_ylim(40,90)
plt.show()

I will leave [axis labeling](https://matplotlib.org/stable/api/axes_api.html) to you :)

## Let's export this as a shapefile for use later down the road

In [None]:
# A goofy little thing: the column names (our years are numbers) have to be strings to export lol
gdf.columns = gdf.columns.astype(str)

gdf.dropna(inplace=True, subset=['Site_Name'])  # drop rows that have no site name

gdf.to_file("CALM_points.shp")