ex03.Rmd

---
title: "Exercise 3"
output: 
  html_document:
    fig_caption: no
    number_sections: no
    toc: no
    toc_float: false
    collapsed: no
    css: html-md-01.css
---

```{r set-options, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
<p><span style="color: #00cc00;">NOTE:  This page has been revised for Winter 2021, but may undergo further edits.</span></p>
**Geography 4/595:  Geographic Data Analysis**  
**Winter 2021**

**Exercise 3:  Bivariate Plots & Descriptive Statistics**  
**Finish by Friday, January 22**

**1.  Introduction**  

The purpose of this exercise is to introduce some bivariate plots and basic descriptive statistics. 

**Read through the exercise before attempting to complete it.**

Before starting the exercise, you may want to load the workspace contained in the geog495.RData file, which contains the individual .csv and shapefiles used in the exercises.  See Section 6 of the "packages-and-data" instructions:  [[packages-and-data.html]](https://pjbartlein.github.io/GeogDataAnalysis/packages-and-data.html)  Downloading this workspace eliminates having to read individual files, but the instructions for doing so are still included below.

Begin by starting RStudio.  Check to make sure that you're back in your working directory.  You can verify that R is looking at the correct folder by clicking on File > Change dir... on the RGUI menu (Windows) or Misc > Change Working Directory… (Mac).  If you're in the folder you created in the first exercise, fine, otherwise you could browse to it.


**2. Scatter Diagrams**

*Simple scatter diagrams*

Check to make sure the Summit Creek "data frame" (the data set after it has been read into R's workspace) is still in the workspace: 

```{r echo=TRUE, eval=FALSE}
ls()
```

If not, re-read the sumcr.csv file as in Section 8 of Exercise 1.  Also, use the `names()` function to get a reminder of the names of the variables in the data frame, and attach the data frame so you can refer to individual variables by their simple names (e.g. `WidthWS` as opposed to the complete full name `sumcr$WidthWS`):

```{r echo=TRUE, eval=FALSE}
names(sumcr)
attach(sumcr)
```

The scatter diagram (sometimes also referred to as a "scatter plot", "scattergram" or "X-Y plot" is the most basic of multivariate (more that one variable) plots, and is the workhorse among bivariate plots.  Produce a scatter diagram (with water-surface depth (`DepthWS`) as the Y-axis variable, and cumulative length (`CumLen`) as the X-axis variable:

```{r echo=TRUE, eval=FALSE}
plot(CumLen, DepthWS)   # X-axis variable first, Y-axis variable second
```

Produce a second scatter diagram using the above steps for water-surface width (`WidthWS`) versus bankfull width (`WidthBF`).  On Windows, before creating this second plot, click on the R graphics window to give it the focus, and then use the menu command History > Recording to start saving copies of the plots so that you can use PgUp and PgDn to review previous plots.

>Q1:  What do the overall relationships between `DepthWS` and `CumLen` and between `WidthWS` and `WidthBF` look like?  (Contrast them:  Which is stronger, which is weaker?  What is the sense of the relationships (positive or negative)?)

Detach the `sumcr` data frame before going on:

```{r echo=TRUE, eval=FALSE}
detach(sumcr)
```

*Annotated scatter diagrams*

Download the Oregon climate station data from this link:  [[orstationc.csv]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/orstationc.csv) and save it in your working directory.  Read the data into R's workspace using the following code:

	
```{r echo=TRUE, eval=FALSE}
orstationc <- read.csv("../data/orstationc.csv", as.is=T)
```
	
Note that you may have to modify the path to point to the particular folder where your data is.  The path in the code above assumes that the folder structure that you set up in exercise 1 looks like

	../geog495
	      /data
 … and that you downloaded the `orstationc.csv` dataset to `../geog495/data/`.

(The "`as.is=T`" argument in the `read.csv()` function tells R not to convert text information (station and Name) into group-membership factors, which is the default behavior.)

If you’ve accidentally downloaded the file to some different place than your working directory, you can search for it while loading the file using the `file.choose()` function:

```{r echo=TRUE, eval=FALSE}
orstationc <- read.csv(file.choose(), as.is=T)
```

(but don’t do this if you were successful the first time.)

Attach the data frame, and create a crude map by plotting `longitude` on the X-axis and `latitude` on the Y-axis:

```{r echo=TRUE, eval=FALSE}
attach(orstationc)
plot(lon, lat)
```

Label the points with the station-name abbreviations using the `text()` function

```{r echo=TRUE, eval=FALSE}
text(lon, lat, labels=station)
```

(Note that `text()` won't work unless there is a plot with the same X-axis and Y-axis variables already open.)

Now plot annual precipitation (`pann`) as a function of elevation (`elev`) (i.e. `elev` on the X-axis, `pann` on the Y-axis):

```{r echo=TRUE, eval=FALSE}
plot(elev, pann)
```

Although it might seem more intuitive to plot elevation on the vertical axis, this particular arrangement of variables is a strong convention in data analysis--elevation is the control and is plotted along the x-axis while precipitation is the response and plotted along the y-axis.  

>Q2:  Describe the relationship.  Are the data "well behaved" in the sense that most points fall in a distinct cloud of points, or are there outliers or unusual points?

Label individual points using the `identify()` function:

```{r echo=TRUE, eval=FALSE}
identify(elev, pann, labels=station)
```

To stop identifying points, hit Esc.  The behavior of `indentify()` in RStudio is a little different than in the individual Windows or MacOS GUIs-—individual points are indicated with a "drop pin"-type marker that appears as they are selected, but the labels don't appear until Esc is hit.  The labels that appear then are "sticky" in that they remain identified if the `identify()` function is used again.

>Q3:  Use the crude map of station names as well as simple inspection of the data to describe local variations in the basic relationship between elevation and annual precipitation.  It may be useful to create other scatter plots, say, of elevation and longitude, or latitude and longitude and annual precipitation to get further insight.  (A quick way of seeing lots of plots is to use the `plot()` function with a list of numeric variables, e.g., `plot(orstationc[,2:10])` -- the "construct" `[,2:10]` means all rows, and columns 2 through 10 of the dataframe).)  Unfortunately `identify()` only works on simple scatter plots.

Detach the `orstationc` data frame as above.

**3. Descriptive Statistics**

*Simple descriptive statistics*

This part of the exercise involves producing some descriptive statistics for the Summit Cr. data.  The easiest way to get a sense of what information is provided by a particular statistic (say, the standard deviation of water-surface width within a particular reach) is to compare descriptive plots for the variable with the values of a descriptive statistic.

Produce some descriptive statistics for the variable `WidthWS` (This is almost too easy!) :

```{r echo=TRUE, eval=FALSE}
attach(sumcr)
summary(WidthWS)
```

By themselves, the statistics don't mean much.  However, you can contrast the amount of information provided by the numerical and graphical descriptions, and the efficiency with which it is presented.  Get the summary statistics for bankfull width (`WidthBF`) as well.

>Q4:  Compare the two variables using the summary statistics and whatever descriptive plots that may seem useful.  Which variable has typically larger values, which varies more from one observation to another?

*Descriptive statistics by observation-group membership*

The calculation of descriptive statistics for subsets of a data set provides one means for investigating the influence of another variable on the variations within a data set of a variable of interest.

Remember that the Summit Cr. study area is divided into three sections: an upstream reach (reach A, represented by an A in column 2) in which cattle are permitted to graze, a middle reach (reach B, represented by a B) from which cattle have been excluded, and a downstream reach (reach C, represented by a C), in which cattle were again permitted to graze.

Use the `tapply()` function to get descriptive statistics for `WidthWS`, stratified by reach.

```{r echo=TRUE, eval=FALSE}
tapply(WidthWS, Reach, summary)
```

The arguments of the function are 1) the name of the variable (`WidthWS`), the name of the grouping variable used to stratify the data or assign to groups (`Reach`), and the function to apply to the stratified data (in this case, `summary()`).

You should see three blocks of summary statistics, one for each reach.  Before attempting to answer the following, also produce a univariate scatter diagram (see Section 1 of Exercise 2) that shows the values of `WidthWS` plotted versus observation number.

>Q5: Describe the variations in the location (e.g. mean) and dispersion (i.e. variance or standard deviation) of water-surface width across the reaches (i.e. from the upstream grazed reach, the middle ungrazed reach, and the downstream grazed reach). Note: This is a purposefully vague question. Think about what kind of information you will need to present to adequately describe each reach, and to compare the values of `WidthWS` from one reach to another. You might think about a one-page display that would contain a couple of plots and an abbreviated table.

You can leave the `sumcr` data frame attached, because there is no overlap among variable names in `sumcr` and `orstationc`.

**4.  Correlations**

Inferring the strength of a relationship (and sometimes even the sign) from a scatter plot is somewhat of a judgment call—e.g., what kind of pattern on a scatter plot indicates a "strong" relationship?  Consequently, it would be convenient to have some kind of more objective measure of the strength of a relationship than simple visual interpretation.  This is provided by the correlation coefficient—a single number description of the extent to which the relationship between two variables can be described by a straight line.

Attach the `orstationc` data frame again, and also plot `pann` "as a function of" `elev` (i.e. plot `pann` along the y-axis and `elev` on the x-axis.  Then get the correlation coefficient between annual precipitation and elevation:

```{r echo=TRUE, eval=FALSE}
cor(elev, pann)
```

By itself, a single correlation coefficient is difficult to interpret, so you may wish to calculate correlations for other pairs of variables, and compare these to their scatter plots.  Any easy way to do this to create a "correlation matrix" equivalent to the scatter plot matrix you created earlier:

```{r echo=TRUE, eval=FALSE}
plot(orstationc[,2:10])
cor(orstationc[,2:10])
```

Detach the `orstationsc` dataframe:

```{r echo=TRUE, eval=FALSE}
detach(orstationc)
```

>Q6:  Which pairs of variables are most highly correlated, and which less so?  Are the correlations consistent with your subjective interpretation of the scatter plots or not?

Next, attach the `sumcr` dataframe again, and produce a scatter plot for `CumLen` vs `WidthWS`, using the alternative "equation" form of specifying the variables to be plotted.

```{r echo=TRUE, eval=FALSE}
attach(sumcr)
plot(WidthWS ~ CumLen)
```

In this form of specifying variables, the tilde ("`~`") can be read as "as it varies with" or "as a function of".

Also get the correlation between these two variables. 

>Q7:  Describe the relationship evident in the scatter plot.  (Is is simply a jumble of points, or is there some pattern in the plot?) Is relationship strong or weak?  Now look at the correlation coefficient.  Is its value consistent with your inference of the strength of the relationship?  If so, does that make sense, if not, why not?

Now construct a smooth curve to summarize the relationship apparent on the scatter diagram using the `loess()` function:  (note--if for some reason the plot of `CumLen` vs `WidthWS` disappeared, regenerate it using the above example.)

```{r echo=TRUE, eval=FALSE}
loess.model <- loess(WidthWS ~ CumLen)
loess.model
hat <- predict(loess.model)
lines(CumLen[order(CumLen)], hat[order(CumLen)], col="red")
```

The `loess()` function finds the smooth curve and saves the details (goodness-of-fit-statistics and fitted values) in the object `loess.model`, typing `loess.model` prints out the summary statistics, the `predict()` function puts the fitted values into the variable `hat`, and the `lines()` function draws the line on the current scatter plot.

Construct a scatter diagram of `WidthWS` plotted as a function of `hat`, and get the correlation between those variables.

>Q8:  What does the relationship between `WidthWS` and the fitted values (`hat`) from the loess curve look like?  (Compare it with the scatter diagram of `WidthWS` and `CumLen`.) What does this `WidthWS` vs `hat` scatter diagram (and companion correlation coefficient) suggest now about the strength of the relationship between `WidthWS` and `CumLen`.?

**5.  What to hand in.**

Answers to the questions, and the scatter plots for Q7 and Q8.