---
title: "3: Data Exploration"
author: "Environmental Data Analytics | Kateri Salk"
date: "Spring 2023"
output: pdf_document
geometry: margin=2.54cm
fig_width: 5
fig_height: 2.5
editor_options:
  chunk_output_type: console
---
## Objectives
1. Import and explore datasets in R
2. Graphically explore datasets in R
3. Apply data exploration skills to a real-world example dataset

## Opening discussion: why do we explore our data?
Why is data exploration our first step in analyzing a dataset? What information do we gain? How does data exploration aid in our decision-making for data analysis steps further down the pipeline?

## Import data and view summaries
```{r, message = FALSE}
# 1. Check your working directory (getwd() reports it; it does not change it)
getwd()

# 2. Load packages
library(tidyverse)

# 3. Import datasets
USGS.flow.data <- read.csv("./Data/Processed/USGS_Site02085000_Flow_Processed.csv",
                           stringsAsFactors = TRUE)

#View(USGS.flow.data)
# Alternate option: click on data frame in Environment tab

colnames(USGS.flow.data)
str(USGS.flow.data)
dim(USGS.flow.data)

# Check our date column
class(USGS.flow.data$datetime)

USGS.flow.data$datetime <- as.Date(USGS.flow.data$datetime, format = "%Y-%m-%d")
class(USGS.flow.data$datetime)
```
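The date check and the summaries named in this section's title can be tried on a small stand-alone example. This is a minimal sketch using a made-up data frame (`toy.flow` and its values are hypothetical, not taken from the USGS file). It shows a date column that was read as a factor being converted to a `Date`, after which `summary()` reports a date range for that column and quartiles plus an `NA` count for the numeric column.

```r
# Hypothetical toy data standing in for the USGS file (values are made up)
toy.flow <- data.frame(
  datetime = c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"),
  discharge.mean = c(12.5, 48.0, NA, 20.1),
  stringsAsFactors = TRUE
)

# With stringsAsFactors = TRUE, the text date column is stored as a factor
class.before <- class(toy.flow$datetime)

# Convert to Date; the format string must match how the dates are written
toy.flow$datetime <- as.Date(toy.flow$datetime, format = "%Y-%m-%d")
class.after <- class(toy.flow$datetime)

print(class.before)
print(class.after)

# Column-by-column summary of the converted data frame
summary(toy.flow)
```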
## Visualization for Data Exploration
Although the `summary()` function is helpful for getting an idea of the spread of values in a numeric dataset, visual representations of the data can help you form hypotheses and direct downstream data analysis. Below is a summary of graph types that are useful for data exploration.

Note: each of these approaches uses the ggplot2 package. We will cover ggplot syntax in a later lesson; for now, familiarize yourself with what each command does.

### Bar Chart (function: geom_bar)
Visualize count data for categorical variables.
```{r, fig.height = 3, fig.width = 4}
ggplot(USGS.flow.data, aes(x = discharge.mean.approval)) +
  geom_bar()
```
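A bar chart of a categorical variable draws one bar per category, with the bar height equal to the number of rows in that category. As a rough base R sketch of the counts behind such a chart, the example below tabulates a made-up vector (`approval` and its codes are hypothetical placeholders, not necessarily the codes in the USGS data). Setting `useNA = "ifany"` keeps missing values visible as their own count.

```r
# Hypothetical approval codes, including one missing value
approval <- factor(c("A", "A", "P", "A", "P", "A", NA))

# Count the rows in each category; these counts are the bar heights
counts <- table(approval, useNA = "ifany")
print(counts)
```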
### Histogram (function: geom_histogram)
Visualize distributions of values for continuous numerical variables. What is happening in each line of code? Insert a comment above each line.
```{r, fig.height = 3, fig.width = 4}
#
ggplot(USGS.flow.data) +
  geom_histogram(aes(x = discharge.mean))

#
ggplot(USGS.flow.data) +
  geom_histogram(aes(x = discharge.mean), binwidth = 10)

#
ggplot(USGS.flow.data) +
  geom_histogram(aes(x = discharge.mean), bins = 20)

#
ggplot(USGS.flow.data, aes(x = discharge.mean)) +
  geom_histogram(binwidth = 10) +
  scale_x_continuous(limits = c(0, 500))

#
ggplot(USGS.flow.data) +
  geom_histogram(aes(x = gage.height.mean))
```
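The `binwidth` and `bins` arguments are two ways of specifying the same thing: where the bin edges fall. Below is a minimal base R sketch of that idea, using made-up values (`flow` is hypothetical) and `hist(..., plot = FALSE)` to return the counts without drawing anything. ggplot2 positions its bin edges somewhat differently, so treat this as an illustration of the concept and not a reproduction of the plots above.

```r
# Hypothetical discharge values (made up for illustration)
flow <- c(3, 8, 12, 15, 22, 27, 31, 38, 44, 95)

# Fixed bin width of 10: edges at 0, 10, ..., 100
by.width <- hist(flow, breaks = seq(0, 100, by = 10), plot = FALSE)

# Fixed number of bins (5): six equally spaced edges across the same range
by.count <- hist(flow, breaks = seq(0, 100, length.out = 6), plot = FALSE)

# Narrow bins show more detail; wide bins give a smoother outline
print(by.width$counts)
print(by.count$counts)
```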
### Frequency line graph (function: geom_freqpoly)
An alternative to a histogram is a frequency polygon, which also shows the distribution of values for a continuous numerical variable. Instead of bars, the count in each bin is displayed as a point on a line. This is advantageous if you want to display multiple variables or categories of variables at once.
```{r, fig.height = 3, fig.width = 4}
#
ggplot(USGS.flow.data) +
  geom_freqpoly(aes(x = gage.height.mean), bins = 50) +
  geom_freqpoly(aes(x = gage.height.min), bins = 50, color = "darkgray") +
  geom_freqpoly(aes(x = gage.height.max), bins = 50, lty = 2) +
  scale_x_continuous(limits = c(0, 10))

#
ggplot(USGS.flow.data) +
  geom_freqpoly(aes(x = gage.height.mean, color = gage.height.mean.approval), bins = 50) +
  scale_x_continuous(limits = c(0, 10)) +
  theme(legend.position = "top")
```
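A frequency polygon is built from the same kind of bin counts as a histogram, joined by a line, which is why several groups can share one set of axes. Below is a minimal base R sketch of the per-group counts, with made-up gage heights (`height` and `status` are hypothetical). Using one shared set of bin edges is what makes the groups directly comparable.

```r
# Hypothetical gage heights and approval status (made up for illustration)
height <- c(1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 3.7, 4.1, 4.6, 5.5)
status <- factor(c("A", "A", "A", "P", "A", "P", "P", "A", "P", "P"))

# Assign every value to a bin using the same edges for all groups
edges <- seq(1, 6, by = 1)
bin <- cut(height, breaks = edges, right = FALSE)

# One row per bin, one column per group; each column would be one line on the plot
counts.by.group <- table(bin, status)
print(counts.by.group)
```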
### Box-and-whisker plots (function: geom_boxplot, geom_violin)
A box-and-whisker plot is yet another alternative to a histogram for showing the distribution of values of a continuous numerical variable. These plots consist of:

* A box spanning the 25th to the 75th percentile of the data. The length of this box is the interquartile range (IQR).
* A bold line inside the box representing the median of the data. Whether the median sits in the center of the box or off to one side gives you an idea of the skewness of your data.
* Lines (whiskers) extending from each end of the box to the most extreme value that lies within 1.5 times the IQR of that end.
* Points representing outliers: values that lie more than 1.5 times the IQR beyond either end of the box.

An alternative option is a violin plot, which displays density distributions, somewhat like a hybrid of the box-and-whisker plot and the frequency polygon.
```{r, fig.height = 3, fig.width = 4}
#
ggplot(USGS.flow.data) +
  geom_boxplot(aes(x = gage.height.mean.approval, y = gage.height.mean))

#
ggplot(USGS.flow.data) +
  geom_boxplot(aes(x = gage.height.mean, y = discharge.mean,
                   group = cut_width(gage.height.mean, 1)))

#
ggplot(USGS.flow.data) +
  geom_violin(aes(x = gage.height.mean.approval, y = gage.height.mean),
              draw_quantiles = c(0.25, 0.5, 0.75))
```
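The parts of a box-and-whisker plot can be worked out by hand for a small vector. This is a minimal sketch with made-up values (`x` is hypothetical) and R's default `quantile()` method; other quantile definitions can give slightly different box edges. It computes the box, the 1.5 times IQR limits, the whisker ends, and the outliers.

```r
# Hypothetical values with one unusually large observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Box: 25th percentile, median, 75th percentile
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Limits at 1.5 times the IQR beyond each end of the box
lower.limit <- q[1] - 1.5 * iqr
upper.limit <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme values still inside the limits
inside <- x[x >= lower.limit & x <= upper.limit]
whiskers <- range(inside)

# Anything beyond the limits is drawn as an individual outlier point
outliers <- x[x < lower.limit | x > upper.limit]

print(q)
print(whiskers)
print(outliers)
```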
### Scatterplot (function: geom_point)
Visualize relationships between continuous numerical variables.
```{r, fig.height = 3, fig.width = 4}
ggplot(USGS.flow.data) +
  geom_point(aes(x = discharge.mean, y = gage.height.mean))

ggplot(USGS.flow.data) +
  geom_point(aes(x = datetime, y = discharge.mean))
```
Question: Under what circumstances would it be beneficial to use each of these graph types (bar plot, histogram, frequency polygon, box-and-whisker, violin, scatterplot)?

> Answer:

## Ending discussion
What did you learn about the USGS discharge dataset today? What separate insights did the different graph types offer?

> Answer:

How can multiple options for data exploration inform our understanding of our data?

> Answer:

Do you see any patterns in the USGS data for the Eno River? What might be responsible for those patterns and/or relationships?

> Answer: