## G489/589 Advanced Geospatial Data Analysis in Python: In-class Assessment 3
### Alternate Exam
**Indiana University  
Spring 2019  
Dr. Natasha MacBean**

### Part 1 (Question and Answer Section - 20 points)
Double-click on each markdown box to write the answer.

1. What python or numpy function would you use to create a numerical array that started with the number 1 and ended with the number 10 (with a step of 1 for each number)? **[1 point]**

ANSWER:  
range(1,11)  
np.arange(1,11)
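
A brief sketch contrasting the two options, assuming Python 3, where the built-in `range` yields a lazy range object rather than an array. The variable names are illustrative only.

```python
import numpy as np

# np.arange excludes the stop value, so stop=11 yields 1..10
arr = np.arange(1, 11)

# range() is not a NumPy array; wrap it in np.array to get one
arr_from_range = np.array(range(1, 11))

print(arr)
print(np.array_equal(arr, arr_from_range))
```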

2. Which library and method would you use to create a list containing all the .csv files you have in a given directory? **[1 point]**

ANSWER: glob.glob("\*.csv")
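
A minimal sketch of the pattern match in action. The temporary directory and the file names (`siteA.csv`, `siteB.csv`, `notes.txt`) are hypothetical and exist only so the example runs anywhere.

```python
import glob
import os
import tempfile

# Build a throwaway directory containing two .csv files and one other file
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ("siteA.csv", "siteB.csv", "notes.txt"):
        open(os.path.join(tmpdir, name), "w").close()

    # glob.glob returns a list of paths matching the wildcard pattern
    filelist = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))
    basenames = [os.path.basename(f) for f in filelist]

print(basenames)
```

Sorting is applied here because `glob.glob` does not guarantee any particular order.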

3. You have a numerical 3-dimensional array of floats called "data", which consists of the dimensions (ntsteps, nrows, ncols), where ntsteps is the number of timesteps, and nrows/ncols are the number of rows/columns. Write the numpy function you would use to calculate the standard deviation over the column dimension. **[1 point]**

ANSWER: np.std(data, axis=2)
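
To see which dimension is collapsed, here is a short sketch on a synthetic array. The sizes (5 timesteps, 3 rows, 4 columns) are made up for illustration.

```python
import numpy as np

# Synthetic array with shape (ntsteps, nrows, ncols) = (5, 3, 4)
rng = np.random.default_rng(0)
data = rng.random((5, 3, 4))

# axis=2 collapses the column dimension, leaving (ntsteps, nrows)
std_over_cols = np.std(data, axis=2)

print(std_over_cols.shape)
```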

4. What is the numpy function for calculating a sum over a *masked* array? **[1 point]**

ANSWER: np.ma.sum() 
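
A small sketch of the masked sum, assuming a hypothetical fill value of -999 marks the missing entries:

```python
import numpy as np

# Hypothetical data in which -999 flags a missing value
raw = np.array([1.0, 2.0, -999.0, 4.0])
masked = np.ma.masked_equal(raw, -999.0)

# Masked elements are skipped when summing
total = np.ma.sum(masked)

print(total)
```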

5. You have a 1-dimensional numerical array of length 200. What numpy function would you use to convert this array into a 2-dimensional array with 2 rows and 100 columns? (*For now, do not worry about the order in which the rows and columns are filled.*) **[1 point]**

ANSWER: array.reshape((2,100)) or array = np.reshape(array, (2,100))
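
As an illustration, the sketch below reshapes a stand-in array of the integers 0 to 199 and shows NumPy's default (row-major) fill order.

```python
import numpy as np

# Stand-in 1-D array of length 200
array = np.arange(200)

# Default C order fills row by row: row 0 holds 0..99, row 1 holds 100..199
reshaped = array.reshape((2, 100))

print(reshaped.shape)
print(reshaped[1, 0])
```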

6. You know that each of the .csv files in your filelist has the format "site_name.csv" (where "site_name" is the name of each site in the list). Write a loop that loops over the filelist and extracts the **site name for each site** into a new list called "sites" using the string.split method. **[2 points]**

ANSWER:   
sites = []  
for f in filelist:  
    sites.append(f.split('.')[0])
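
The answer above assumes each entry is a bare file name. If the entries carry a directory prefix, as `glob.glob` results often do, splitting on `'.'` alone can return the wrong piece (for `'../data/x.csv'` it returns an empty string). A hedged variant that strips the directory first is sketched below; the paths and site names are hypothetical.

```python
import os

# Hypothetical file list mixing prefixed and bare file names
filelist = ["../data/harvard.csv", "data/manaus.csv", "zotino.csv"]

sites = []
for f in filelist:
    # Drop any directory part, then split off the ".csv" extension
    sites.append(os.path.basename(f).split(".")[0])

print(sites)
```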

7. Which numpy function would you use to check the sizes of all dimensions of an array? **[1 point]**

ANSWER: np.shape()

8. You have two variables: a = 100 and b = 25. Write an **if** statement that checks whether b is less than or equal to a and prints "ok" if so; **else**, if the condition is not true, it prints "not ok". **[1 point]**

ANSWER:   
if b <= a:  
    print("ok")  
else:  
    print("not ok")  

9. Write out both the steps needed, and the code you would use, to apply the KMeans clustering algorithm to a set of data that contains 1000 observations (nsamples or nrows) and 10 features (nfeatures or ncols). Create 4 clusters and save your resultant cluster labels to an array called "labels4". Your input data are in a numerical floating point array in a file called "data.txt". There are no column headings in the data.txt file. Describe the steps you would take to complete this task from the beginning, including importing the library and reading in the data etc. For each step also write the code you would need. *Note: Do not execute the code, just write down the steps you would take and the syntax of code you would use.* **[10 points]**

ANSWER:
- *# load the libraries needed*   
from sklearn.cluster import KMeans  
import numpy as np  
- *# load the data using np.loadtxt as they are saved as a simple numerical array*  
data = np.loadtxt("data.txt")  
- *# first set-up the KMeans model*  
model = KMeans(n_clusters=4)  
- *# fit the data to the model*   
model.fit(data)  
- *# label the clusters in the model and save it to an array called "labels4"*   
labels4 = model.predict(data)  


10. Following on from the previous question, what would the length of the final array "labels4" be? **[1 point]**

ANSWER:
1000
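
The answers to questions 9 and 10 can be checked end to end with a runnable sketch. Since the real data.txt is not available, this version writes random numbers to a temporary file as a stand-in; `n_init` and `random_state` are set only to make the run reproducible and are not part of the exam answer.

```python
import os
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for data.txt: 1000 observations x 10 features
rng = np.random.default_rng(0)
synthetic = rng.random((1000, 10))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.txt")
    np.savetxt(path, synthetic)  # plain numbers, no column headings
    data = np.loadtxt(path)      # read back as a 2-D float array

# Set up the model, fit it, and label every observation
model = KMeans(n_clusters=4, n_init=10, random_state=0)
model.fit(data)
labels4 = model.predict(data)

print(data.shape, labels4.shape)
```

The label array has one entry per observation, so its length equals the number of rows in the input.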

### Part 2 (Writing Code - 30 points)

In this exercise we'll be working with GPP and SIF data from three separate sites: 1) temperate forest site in the US; 2) tropical forest site in Brazil; and 3) boreal forest site in Russia. The files are saved in the data.zip folder and are called:
1. US_temp_gpp_sif.csv
2. BR_trop_gpp_sif.csv
3. RU_boreal_gpp_sif.csv

The data are monthly values between 2007 and 2011 - therefore, the number of observations is 60 for each. GPP are in the 1st column and SIF are in the 2nd column.

The objective of this exercise is to work out which site has the highest correlation between GPP and SIF data. I.e. the task is to calculate the R value between the GPP and SIF timeseries.

You can calculate the correlation using any of the following methods:

- np.corrcoef()
- scipy.stats.pearsonr (see Exercise 11)
- or any of the linear regression methods we learned (Exercises 12 and 13)

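*Note: Beware of which linear regression methods give you R, and which directly give you R$^{2}$. If you calculate R$^{2}$ then you need to use np.sqrt to get the correlation. np.sqrt returns only the magnitude of R; for a simple linear regression, the sign of R is the sign of the fitted slope.*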

You will get the most marks for making your code as clean and efficient as possible. For example, your code will be "cleaner" if you use a for loop to loop over the sites.

**First, in the markdown box below write your logic in normal/plain English to explain the steps you will need to complete this exercise. Think of this as a plan for your script. [5 points out of 30]**

**LOGIC FOR YOUR CODE:** 
- import libraries
- create a filelist using glob.glob
- if using a loop, set up an "r" list to save the r for each site within the loop
- loop over sites
- read in data using pandas and check by printing head() that it's read in correctly
- set up X as GPP
- set up Y as SIF
- (optional) plot a linear regression plot using seaborn
- calculate R using numpy, scipy.stats pearsonr, or any of the linear regression methods, and save/print the r to answer the question below.

**Now in the code boxes below write the actual python script you need to complete this exercise. [20 points out of 30]** 

- Do not forget to comment your code **[3 points out of 30]**! (I have given you a head start in the box below).

*NOTE! After you complete the coding task, answer the final question in the markdown below based on your calculations **[2 points out of 30]**.*

In [15]:
# - import libraries
import pandas as pd
import numpy as np
from sklearn import linear_model 
import matplotlib.pyplot as plt
import seaborn as sns
import glob, sys

In [16]:
# - create filelist
filelist = glob.glob('../data/gpp_sif/data/*.csv')
print(filelist)

['../data/gpp_sif/data/RU_boreal_gpp_sif.csv', '../data/gpp_sif/data/US_temp_gpp_sif.csv', '../data/gpp_sif/data/BR_trop_gpp_sif.csv']


In [17]:
# - create empty lists to hold the correlation values and the site names
site_corr = []
site_corr_sklearn = []
site = []

In [18]:
# - loop over sites
for f in filelist:
    
    # - save site just for ease of understanding which site
    site.append(f.split('/')[-1].split('_')[0])
    
    # - read in data
    data = pd.read_csv(f)
    
    # - print data head to check
    print(data.head())
    
    # - use numpy to calculate correlation
    site_corr.append(np.corrcoef(data["GPP"].to_numpy(), data["SIF"].to_numpy())[0,1])
 
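    # - sklearn: fit a linear regression of SIF on GPP
    lm = linear_model.LinearRegression()
    X = data["GPP"].to_numpy().reshape(-1,1)
    y = data["SIF"].to_numpy().reshape(-1,1)
    model = lm.fit(X,y)
    # - model.score returns R^2, not R
    print(site[-1],model.score(X,y))
    # - np.sqrt(R^2) gives |R|, so the sign of a negative correlation is lost here
    site_corr_sklearn.append(np.sqrt(model.score(X,y)))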
    

        GPP       SIF
0  0.000000 -0.155889
1  0.000000 -0.035329
2  0.000000  0.211935
3  0.000012  0.090419
4  0.000078  0.624613
('RU', 0.9071430177197362)
            GPP       SIF
0  1.454290e-05  0.311468
1  1.177860e-07  0.455039
2  2.293960e-05  0.167898
3  1.109360e-04  0.917273
4  1.637100e-04  1.811897
('US', 0.3863954642715991)
        GPP       SIF
0  0.000126  1.458244
1  0.000122  2.908480
2  0.000123  2.251486
3  0.000118  1.946829
4  0.000115  2.425157
('BR', 0.20317746499310485)


In [14]:
# - print out sites and print out correlations
print(site)
print(site_corr)
print(site_corr_sklearn)

['RU', 'US', 'BR']
[0.9524405586280628, 0.6216071623393666, -0.4507521103590142]
[0.9524405586280628, 0.6216071623393662, 0.45075211035901414]
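
The two printed lists differ in sign for the BR site (-0.45 from np.corrcoef, 0.45 from the sklearn route). This is consistent with np.sqrt returning only the magnitude of R. A minimal sketch on synthetic data (not the exam data), assuming the sign is recovered from the fitted slope:

```python
import numpy as np

# Synthetic, negatively correlated series with 60 points
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 60)
y = -2.0 * x + rng.normal(scale=0.5, size=x.size)

# Reference value: Pearson r taken from the correlation matrix
r = np.corrcoef(x, y)[0, 1]

# Least-squares line and its coefficient of determination (R^2)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1.0 - residuals.var() / y.var()

# sqrt(R^2) is |r|; the sign of the slope restores the direction
r_from_r2 = np.sign(slope) * np.sqrt(r_squared)

print(r, np.sqrt(r_squared), r_from_r2)
```

With a scikit-learn model, the same correction could be applied by taking the sign from the fitted coefficient.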


QUESTION:

1) Based on the R (correlation) value, which site has the highest correlation between GPP and SIF?

ANSWER: Russian boreal forest site.

### Remember to save your answers and upload them to Canvas by the end of the class. CHECK you have uploaded the right Jupyter Notebook to Canvas. EMAIL yourself a copy of the Jupyter Notebook (or save it to USB) in case something goes wrong with the Canvas submission.