# Assignment 2: Introduction to Reading Data

Assignment Objectives:

* choose the appropriate `tidyverse` `read_*` function and function arguments to load a given plain text tabular data set into R

* use `dplyr` functions to wrangle a data frame

* use `ggplot` to explore data

* _optional:_ scrape data from the web
  - read/scrape data from an internet URL using the `rvest` `html_nodes` and `html_text` functions
  - compare downloading tabular data from a plain text file (e.g. `*.csv`) from the web versus scraping data from a `.html` file

Any place you see `...`, you must fill in the function, variable, or data to complete the code.

In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(rvest)
library(stringr)
options(repr.matrix.max.rows = 6)

## 1. Happiness Report
As you might remember from `worksheet_reading`, we practised loading data from the *Sustainable Development Solutions Network's* [World Happiness Report](http://worldhappiness.report/). That data was the output of their analysis that calculated each country's happiness score and how much each variable contributed to it. In this assignment, we are going to look at the data at an earlier stage of the study: the aggregated/averaged values (per country and year) for many different social and health aspects that the researchers anticipated might contribute to happiness.

The goal for the assignment is to produce a plot of 2017's positive affect scores against healthy life expectancy at birth, with healthy life expectancy at birth on the x-axis and positive affect on the y-axis. For this study, positive affect was defined as the average of three positive affect measures: happiness, laughter and enjoyment. We would also like to convert the **positive affect score** from a scale of 0 - 1 to a scale from 0 - 10.

1. use `filter` to subset the rows where the year is equal to 2017
2. use `mutate` to convert the "Positive affect" score from a scale of 0 - 1 to a scale from 0 - 10
3. use `select` to choose the "Healthy life expectancy at birth" column and the scaled "Positive affect" column
4. use `ggplot` to create our plot of "Healthy life expectancy at birth" (x - axis) and scaled "Positive affect" (y - axis)
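
The four steps above follow a common wrangle-then-plot pattern. As a minimal sketch on a small made-up data frame (the `toy` object and its columns `city`, `year`, `score` and `height` are hypothetical and are not part of the happiness data), the same sequence of verbs might look like this:

```r
library(tidyverse)

# Hypothetical toy data, typed by hand for illustration only
toy <- tibble(
  city   = c("A", "B", "C", "D"),
  year   = c(2016, 2017, 2017, 2017),
  score  = c(0.2, 0.5, 0.7, 0.9),   # on a 0 - 1 scale
  height = c(10, 20, 30, 40)
)

# 1. keep only the rows for one year
toy_2017 <- filter(toy, year == 2017)

# 2. add a rescaled column (here 0 - 1 becomes 0 - 100)
toy_scaled <- mutate(toy_2017, score_pct = score * 100)

# 3. keep only the columns needed for the plot
toy_reduced <- select(toy_scaled, height, score_pct)

# 4. scatterplot with human-readable axis labels
toy_plot <- ggplot(toy_reduced, aes(x = height, y = score_pct)) +
    geom_point() +
    xlab("Height (hypothetical units)") +
    ylab("Score (out of 100)")
toy_plot
```

Saving each intermediate step to its own object makes it easy to inspect the data frame after every verb if something goes wrong.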

**Tips for success:** Try going through all of the steps on your own, but don't forget to discuss with others (classmates, or your instructor) if you get stuck. If something is wrong and you can't spot the issue, be sure to **read the error message carefully**. Since there are a lot of steps involved in working with data and modifying it, feel free to look back at `worksheet_reading`. 

**Question 1.1** Multiple Choice: 
<br> {points: 1}

What is the maximum value for the "Positive affect" score (in the original data file that you read into R)?

A. 100

B. 10 

C. 1

D. 0.1

E. 5


##### Answer: 

**Question 1.2** Multiple Choice: 
<br> {points: 1}

Which column's values will be used to filter the data?

A. `countries`

B. `generosity`

C. `positive affect`

D. `year`


##### Answer: 

**Question 1.3.0**
<br> {points: 1}

Use the appropriate `read_*` function to read in the `WHR2018Chapter2OnlineData` (ensure you use the correct relative path to read it in).

_Assign the data frame to an object called `happy_df`._

In [None]:
# your code here

happy_df

Look at the column names - they contain spaces!!! This is not a best practice and will make it difficult to use our tidyverse functions... Run the cell below to use the `make.names` function that will replace all the spaces with a `.` so we don't have this problem. The `colnames` function is also needed to access the data frame's column names.

In [None]:
### Run this cell before continuing. 
colnames(happy_df) <- make.names(colnames(happy_df))
happy_df
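
To see what `make.names` does before relying on it, here is a small sketch on a hand-typed character vector (the names in `messy_names` are made up for illustration and may not match the columns in your file):

```r
# Hypothetical column names, typed by hand for illustration
messy_names <- c("Positive affect", "Snow depth (cm)", "2nd reading", "year")

clean_names <- make.names(messy_names)
clean_names
# "Positive.affect" "Snow.depth..cm." "X2nd.reading" "year"
```

Notice that spaces are not the only characters replaced: parentheses also become `.`, and a name that starts with a digit gets an `X` in front so that it is a valid R name.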

**Question 1.3.1**
<br> {points: 1}

Using the scaffolding given in the cell below, `filter`, `mutate`, and `select` the `happy_df` data frame as needed to get it ready to create our desired scatterplot. Recall that we wanted to rescale the "Positive affect" scores so that they fall in the range 0-10 instead of 0-1. Call the new, re-scaled column `Positive.affect.scaled`.

_Assign the data frame containing only the columns we need to create our plot to an object called `reduced_happy_df`._

In [None]:
# happy_step1 <- ...(happy_df, year == ...)
# happy_step2 <- mutate(happy_step1, Positive.affect.scaled = ...)
# reduced_happy_df <- ...(happy_step2, ..., ...)

# your code here


reduced_happy_df

**Question 1.4** 
<br> {points: 1}

Using the modified data set, `reduced_happy_df`, generate the scatterplot described above and make sure to label the axes in proper written English.

_Assign your plot to an object called `happy_plot`._

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)

#... <- ggplot(reduced_happy_df, ...(x = ..., y = ...)) + 
#     geom_...() + 
#     ...("...") + 
#     ylab("Positive affect score (out of ...)")

# your code here

happy_plot

**Question 1.5** 
<br> {points: 3}

In a sentence or two, describe what you see in the scatterplot above. Does there appear to be a relationship between healthy life expectancy at birth and positive affect? If so, describe it.

##### Answer: 

**Question 1.6** 
<br> {points: 3}

Choose any variable (column) in the data set `happy_df` other than `Positive.affect` to plot against healthy life expectancy at birth. **You should NOT scale whichever variable you choose.** Ensure that healthy life expectancy at birth is on the x-axis and that you give your axes human-readable labels.

_Assign your plot to an object called `happy_plot_2`._

In [None]:
# your code here

happy_plot_2

**Question 1.7**
<br> {points: 3}

In a sentence or two, describe what you see in the scatterplot above. Does there appear to be a relationship between healthy life expectancy at birth and the other variable you plotted? If so, describe it.

##### Answer: 

## 2. Whistler Snow

Skiing and snowboarding are huge in British Columbia, and some of the best slopes for snow sports are quite close to Vancouver. In fact, the famous mountain resort of Whistler is just two hours north of the city. With cold weather and plenty of snowfall, Whistler is an ideal destination for winter sports fanatics.

One thing skiers and snowboarders want is fresh snow! When are they most likely to find this? In the `data` directory, we have two year-long data sets from [Environment Canada from the Whistler Roundhouse Station](http://climate.weather.gc.ca/historical_data/search_historic_data_stations_e.html?StationID=348&Year=2007&Month=3&Day=1&timeframe=2&type=bar&MeasTypeID=snow&searchType=stnProx&txtRadius=25&optProxType=navLink&txtLatDecDeg=50.128889166667&txtLongDecDeg=122.95483333333&optLimit=specDate&selRowPerPage=25&station=WHISTLER) (on Whistler Mountain). This weather station is located 1,835 m above sea level.

To answer the question "When are skiers and snowboarders most likely to find fresh snow at Whistler?", you will create a line plot with the date on the x-axis and the total snow per day in centimetres (the column named `Total Snow cm` in the data file) on the y-axis. Given that we have data for two years (2017 & 2018), we will create one plot for each year to see if there is a trend we can observe across the two years.

**Question 2.1** Multiple Choice: 
<br> {points: 1}

What are we going to plot on the y-axis?

A. total precipitation per day in centimetres

B. total snow on the ground in centimetres

C. total snow per day in centimetres

D. total rain per day in centimetres


##### Answer: 

**Question 2.2.0** 
<br> {points: 1}

Read in the file named `eng-daily-01012018-12312018.csv` from the `data` directory. **Make sure you preview the file to choose the correct `read_*` function and argument values to get the data into R.** 

_Assign your data frame to an object called `whistler_2018`._
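
One way to preview a file from within R is to read its first few raw lines before choosing a `read_*` function. The sketch below builds a small temporary file so that it runs on its own; the file contents, the `toy_path` and `toy_weather` names, and the number of metadata lines are all hypothetical, so what you find in the real file may differ:

```r
library(readr)

# Write a small hypothetical file: two lines of metadata, then a CSV table
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("Station Name,TOY STATION",
    "Province,NOWHERE",
    "Date,Snow",
    "2018-01-01,4.2",
    "2018-01-02,0"),
  toy_path
)

# Step 1: look at the raw lines before choosing a reader
read_lines(toy_path, n_max = 5)

# Step 2: the values are comma-separated and the table starts on line 3,
# so use read_csv and skip the two metadata lines
toy_weather <- read_csv(toy_path, skip = 2)
toy_weather
```

The preview tells you both the delimiter (which `read_*` function to use) and whether any lines above the header row need to be skipped.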

*Note: You'll see a lot of entries of the form `NA`. This is the symbol R uses to denote missing data. Interestingly, you can do math and make comparisons with `NA`: for example,* `NA + 1 = NA`, `NA * 3 = NA`, `NA > 3 = NA`. *Most operations on `NA` return `NA`. This may seem a bit weird, but it makes things much simpler in R since it removes the need to write any special code to handle missing data!*
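
You can check this behaviour yourself with a short sketch (the `snow` vector is made up for illustration). It also shows that summary functions such as `mean` return `NA` when any value is missing, unless you ask them to drop missing values with `na.rm = TRUE`:

```r
NA + 1   # NA
NA > 3   # NA

# Hypothetical daily snowfall values with one missing entry
snow <- c(2.5, NA, 4.0)

mean(snow)                 # NA, because one value is missing
mean(snow, na.rm = TRUE)   # 3.25, the mean of the non-missing values
sum(is.na(snow))           # 1, the number of missing values
```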

In [None]:
# your code here


whistler_2018

**Question 2.2.1** 
<br> {points: 1}

Looking at the column names of the `whistler_2018` data frame, you can see we have white space in our column names again. Use `make.names` to remove the whitespace to make it easier to use our `tidyverse` functions.

In [None]:
# your code here


colnames(whistler_2018)

**Question 2.3** 
<br> {points: 1}

Create a line plot with the date on the x-axis and the total snow per day (in cm) on the y-axis by filling in the `...` in the code below. Ensure you give your axes human-readable labels.

_Assign your plot to an object called `whistler_2018_plot`._

In [None]:
options(repr.plot.width = 12, repr.plot.height = 5)

# ... <- ggplot(..., aes(x = ..., y = ...)) + 
#     geom_line() +
#     xlab(...) +
#     ylab(...) +
#     scale_x_date(date_breaks = "1 month") + # labels every month
#     theme(axis.text.x = element_text(angle = 90, hjust = 1)) # rotates x axis labels to be vertical

# your code here

whistler_2018_plot

**Question 2.4** 
<br> {points: 3}

Looking at the line plot above, for 2018, of the months when it snowed, which 2 months had the **most** fresh snow?

##### Answer: 

**Question 2.5**
<br> {points: 3}

Repeat the data loading and plot creation using the file `eng-daily-01012017-12312017.csv` located in the `data` directory to visualize the same data for the year 2017. 

_Assign your plot to an object called `whistler_2017_plot`._

In [None]:
# whistler_2017 <- ...
# colnames(whistler_2017) <- colnames(whistler_2017) %>% make.names()

# ... <- ggplot(..., aes(x = ..., y = ...)) + 
#    geom_line() + 
#    xlab("...") + 
#    ylab("...") +
#    scale_x_date(date_breaks = "1 month") +
#    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
#    theme(text = element_text(size = 20))

# your code here


whistler_2017_plot

**Question 2.6**
<br> {points: 3}

Looking at the line plot above, for 2017, of the months when it snowed, which 2 months had the **most** fresh snow?

##### Answer: 

**Question 2.7**
<br> {points: 3}

Are the months with the most fresh snow the same in 2017 as they were in 2018? **Hint:** you might want to add a code cell where you plot the two plots right after each other so you can easily compare them in one screen view.

You can combine two plots, one atop the other, by using the `plot_grid` function from the `cowplot` package:

```
library(cowplot)
plot_grid(plot1, plot2, ncol = 1)
```
Is there any advantage to looking at two years' worth of data? Why or why not?

In [None]:
# your code here



##### Answer: 

## 4 (Optional). Reading Data from the Internet

**Question 4.0**
<br> {points: 0}

More practice scraping! To keep out of legal trouble, we will get more practice scraping data using a website that was created for that purpose: http://books.toscrape.com/

Your task here is to scrape the prices of the science fiction novels on [this page](http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html) and determine the maximum, minimum and average price of science fiction novels at this bookstore. Tidy up and nicely present your results by creating a data frame called `sci_fi_stats` that has 2 columns: one called `stats` that contains the words `max`, `min` and `mean`, and one called `value` that contains the calculated value for each of these.

The functions for maximum, minimum and average in R are listed in the table below:

| Calculation to perform | Function in R |
| ---------------------- | ------------- |
| maximum                | `max`         |
| minimum                | `min`         |
| average                | `mean`        |

Some other helpful hints:
- If you end up scraping some characters other than numbers you will have to use `str_replace_all` from the `stringr` library to remove them (similar to what we did with the commas in worksheet_02).
- Use `as.numeric` to convert your character type numbers to numeric type numbers before you pass them into the `max`, `min` and `mean` functions.
- If you have `NA` values in your objects that you need to pass into the `max`, `min` and `mean` functions, you will need to set the `na.rm` argument in these functions to `TRUE`.
- Use the function `c` to create the vectors that will go in your data frame. For example, to create a vector with the values 10, 16 and 13 named `ages`, we would type: `ages <- c(10, 16, 13)`.
- Use the function `tibble` to create the data frame from your vectors.
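
As a rough sketch of how the hints above fit together once you have a character vector of scraped text, here is an example on made-up strings. The `raw_prices` values and the `toy_stats` name are hypothetical and are not the prices from the website, and the pattern `"[^0-9.]"` (anything that is not a digit or a decimal point) is just one possible choice:

```r
library(stringr)
library(tibble)

# Hypothetical scraped text; these are not the prices from the website
raw_prices <- c("$12.50", "$7.25", "$30.00")

# Remove every character that is not a digit or a decimal point,
# then convert the cleaned strings to numbers
clean_prices <- as.numeric(str_replace_all(raw_prices, "[^0-9.]", ""))

# Collect the three summary values in a two-column data frame
toy_stats <- tibble(
  stats = c("max", "min", "mean"),
  value = c(max(clean_prices), min(clean_prices), mean(clean_prices))
)
toy_stats
```

If the text you scrape contains other stray characters, you may need to adjust the pattern, and if the conversion produces `NA` values, remember the `na.rm` hint above.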

In [None]:
# your code here


sci_fi_stats

In `worksheet_reading` you had practice scraping data from the web. Now that you have the ability, should you scrape that website you have been dreaming of harvesting data from? Maybe, maybe not... You should check the website's Terms of Service first and consider the application you have planned for the data after you scrape it.

Consider one or more websites you might be interested in scraping data from (for fun, profit, or research/education). For each website, search for their Terms of Service page. Take note if such a page exists, and if it does, try to determine if they allow web scraping of their website.

This assignment is adapted from materials associated with Data Science: A First Introduction by Tiffany Timbers, Trevor Campbell, and Melissa Lee, which is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.