 ## Assignment 2

 This assignment is intended to give you practice using Jupyter Notebooks and R, and interpreting statistical output based on real-world data. You will use the notebook below to generate the statistical output and then write up answers to specific questions about that output. The assignment is guided and much of the R code is provided for you, but you will be asked to interact with specific parts of the code and decide on appropriate values to include. Run the notebook in sequential order, starting from the first code cell. The beginning cells must be run so that the R packages needed for the assignment are loaded and the data are read in correctly.

 If you encounter an error and want to reset any code chunks to how they were when this document was submitted to IDAS, you can view this file on [GitHub](https://github.com/lebebr01/psqf_4143/tree/main/assignments). *Note: The file on GitHub has the same file name.*

After generating the statistical output, you will be asked to submit answers to questions on a Microsoft Forms quiz. These questions focus on interpreting the statistical output generated from this notebook.

 You may work in groups of up to 3 to complete the assignment. If you work in a group, please turn in one assignment in ICON with all group members' names on the submission.

 *Assignment 2 Due*: **Monday, October 10th, by 11:59 pm**

 ## Description of the Data

 These are weather data from Australian weather stations, recorded between 2007 and 2017.

 + **Date**: The date of weather observation.
 + **Location**: Location of weather observation.
 + **MinTemp**: Minimum temperature.
 + **MaxTemp**: Maximum temperature.
 + **Rainfall**: Daily rainfall (in mm).
 + **WindGustDir**: Wind gust direction.
 + **WindGustSpeed**: Wind gust speed (in km/h).
 + **WindDir9am**: Wind direction at 9am.
 + **WindDir3pm**: Wind direction at 3pm.
 + **WindSpeed9am**: Wind speed (in km/h) at 9am.
 + **WindSpeed3pm**: Wind speed (in km/h) at 3pm.
 + **Humidity9am**: Humidity (percent) at 9am.
 + **Humidity3pm**: Humidity (percent) at 3pm.
 + **Pressure9am**: Atmospheric pressure (hPa) at 9am.
 + **Pressure3pm**: Atmospheric pressure (hPa) at 3pm.
 + **Temp9am**: Temperature (in Celsius) at 9am.
 + **Temp3pm**: Temperature (in Celsius) at 3pm.
 + **RainToday**: Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0.
 + **RainTomorrow**: The target attribute. Did it rain the following day? Yes/No.
 + **year**: The year of observation.
 + **month**: The month of observation, represented as text labels.
 + **day**: The day of the observation within each month.
 + **day_labels**: The day of the week (e.g., Sun = Sunday).
 + **year_char**: A character version of the year of observation, useful for plotting or descriptive stats.
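
 The last five attributes in the list are derived from `Date`. As a minimal sketch of how date parts like these can be extracted with the lubridate package, the example below uses a small made-up vector of dates (not the assignment data). The names `example_dates` and `date_parts` are hypothetical and used only for illustration.

```r
library(lubridate)

# Made-up dates for illustration only (not taken from the assignment data)
example_dates <- as.Date(c("2017-01-01", "2017-06-15"))

date_parts <- data.frame(
  Date       = example_dates,
  year       = year(example_dates),                 # numeric year
  month      = month(example_dates, label = TRUE),  # abbreviated month label
  day        = day(example_dates),                  # day within the month
  day_labels = wday(example_dates, label = TRUE)    # abbreviated weekday label
)

# Character version of the year, handy for grouping in plots and summaries
date_parts$year_char <- as.character(date_parts$year)

date_parts
```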

 Please don't hesitate to reach out with any questions about the structure or interpretation of the attributes in the data.

 ## Assignment Setup

 **Run this cell first every time you open the notebook.** This cell loads the R packages and prepares the data for you.

In [None]:
library(tidyverse)
library(ggformula)
library(mosaic)
library(rpart)
library(rpart.plot)
library(rsample)
library(lubridate)

theme_set(theme_bw(base_size = 14))

# Read the data, drop unused attributes and incomplete rows, derive the date
# attributes, and keep four locations
aus_weather <- read_csv("https://raw.githubusercontent.com/lebebr01/statthink/master/data-raw/weatherAUS.csv",
                        guess_max = 100000) %>%
  select(-Evaporation, -Sunshine, -Cloud9am, -Cloud3pm, -RISK_MM) %>%
  na.omit() %>%  # remove rows with any missing values
  mutate(year = year(Date),
         month = month(Date, label = TRUE),
         day = day(Date),
         day_labels = wday(Date, label = TRUE),
         RainTomorrow_num = ifelse(RainTomorrow == 'Yes', 1, 0),  # 1 = rain, 0 = no rain
         year_char = as.character(year)) %>%
  filter(Location %in% c('Perth', 'Sale', 'Wollongong', 'GoldCoast'))

head(aus_weather)

 ## Question 1
 Explore the `MaxTemp` distribution visually using the code provided below.

 Complete the code by filling in the appropriate attribute in place of "^^". Then, fill in the visualization type you are most comfortable with where "??" is located (either `density` or `histogram`). Finally, replace each "$$" with an appropriate plot title and x-axis label that describe what the plot is showing.
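
 As a rough illustration of the same code pattern on a different data set, the sketch below draws a histogram and a density plot of `mpg` from R's built-in `mtcars` data. The data set, titles, and labels are stand-ins chosen for illustration and are not the answers to this question.

```r
library(ggformula)
library(magrittr)  # provides the %>% pipe used throughout the notebook

# Histogram of a single numeric attribute (built-in mtcars data, illustration only)
hist_plot <- gf_histogram(~ mpg, data = mtcars, bins = 10) %>%
  gf_labs(title = "Distribution of fuel efficiency",
          x = "Miles per gallon")

# Density plot of the same attribute
dens_plot <- gf_density(~ mpg, data = mtcars) %>%
  gf_labs(title = "Distribution of fuel efficiency",
          x = "Miles per gallon")

hist_plot
dens_plot
```

 Both plots describe the same distribution. The histogram shows counts within bins, while the density plot shows a smoothed curve.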

In [None]:
gf_??(~ ^^, data = aus_weather) %>%
  gf_labs(title = "$$", 
          x = "$$") 

 ## Question 2
 Compute conditional descriptive statistics for the `MaxTemp` attribute, grouped by an attribute you think may help to explain differences in the maximum temperature for that day. Pick one of the following attributes to explore whether the maximum temperature varies across its values: `Location`, `month`, `year_char`, or `day_labels`.

 Complete the code by filling in the appropriate outcome attribute in place of "^^", one of the four attributes identified above in place of "$$", and the descriptive functions where the "&&" is located in the code below. Functions that may be useful here include: `mean`, `median`, `sd`, `var`, `IQR`, `min`, `max`, `length`.
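
 To show the shape of a completed `df_stats()` call, here is a minimal sketch using R's built-in `mtcars` data instead of the weather data. The choice of `mpg`, `cyl`, and the three summary functions is only an illustration.

```r
library(mosaic)

# Mean, standard deviation, and group size of mpg for each number of cylinders
# (built-in mtcars data, used here only for illustration)
mpg_by_cyl <- df_stats(mpg ~ cyl, data = mtcars, mean, sd, length)

mpg_by_cyl
```

 The attribute to the left of `~` is summarized separately for each value of the attribute to the right, so the result has one row per group.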

In [None]:
aus_weather %>%
  df_stats(^^ ~ $$, &&)

## Question 3
Before predicting whether it is going to rain tomorrow based on the weather observations from the previous day, let's first explore how often it rains tomorrow.

Complete the code below to generate descriptive statistics on the numeric `RainTomorrow_num` attribute by filling in the descriptive functions where the "&&" are located in the code below. Functions that may be useful here could include: `mean`, `median`, `sd`, `var`, `IQR`, `min`, `max`, `length`.
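
As a small aid for reading these statistics, the sketch below uses a made-up vector of five outcomes (not the assignment data) to show what `mean`, `sum`, and `length` return for an attribute coded as 0 and 1. The names `rain_example` and `rain_num` are hypothetical.

```r
# Made-up outcomes for illustration only (not the assignment data)
rain_example <- c("No", "Yes", "No", "No", "Yes")

# Recode to 1 = "Yes" and 0 = "No", the same recoding used for RainTomorrow_num
rain_num <- ifelse(rain_example == "Yes", 1, 0)

mean(rain_num)    # proportion of "Yes" values: 2 / 5 = 0.4
sum(rain_num)     # number of "Yes" values: 2
length(rain_num)  # total number of observations: 5
```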

In [None]:
aus_weather %>%
  df_stats(~ RainTomorrow_num, &&)

  ## Question 4
  We are now going to explore which data attributes (i.e., weather observations) are most important in predicting whether it is going to rain tomorrow based on the weather observations from the previous day. To see which attributes are important, we will fit a classification tree to these data using the `rpart()` function. 
  
  ### Attributes to include 
  Select one attribute from each of the following groups of attributes to include in the classification tree: 

  + temperature: `MinTemp` or `MaxTemp`
  + humidity: `Humidity9am` or `Humidity3pm`
  + pressure: `Pressure9am` or `Pressure3pm`
  + month: `month`
  + day: `day_labels`
  + location: `Location`

  *Note*: Picking one attribute from each group gives a total of 6 attributes in the classification tree equation to the right of the `~` sign below (5 if you leave out `Location`, as recommended in the note that follows).
  
  Complete the code within the `rpart()` function below by replacing "$$" with the attributes that you think would be important for predicting whether it will rain tomorrow. These attributes are shown in the list above; again, please pick one attribute from each group in the list. 
  
  *Note*: Separate attributes with a `+` symbol. I recommend not using the `Location` attribute.
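
  For reference, here is a minimal sketch of the formula syntax, using the `kyphosis` data that ships with the rpart package as a stand-in for the weather data. The attributes `Age`, `Number`, and `Start` belong to that example data set, and `example_tree` is a hypothetical name.

```r
library(rpart)

# kyphosis ships with the rpart package; it is used here only for illustration
example_tree <- rpart(Kyphosis ~ Age + Number + Start,
                      method = "class", data = kyphosis)

# Printed splits, read from the root node downward
print(example_tree)

# Relative importance of each attribute that contributed to the tree
example_tree$variable.importance
```

  The outcome goes to the left of `~`, and each predictor on the right is joined with `+`. Not every attribute listed in the formula has to appear as a split in the fitted tree.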

In [None]:
set.seed(202)
aus_weather_split <- initial_split(aus_weather, prop = .7)
aus_weather_train <- training(aus_weather_split)
aus_weather_test <- testing(aus_weather_split)


# Fit classification tree
aus_weather_class <- rpart(RainTomorrow ~ $$, 
                    method = 'class', data = aus_weather_train)

rpart.plot(aus_weather_class, roundint = FALSE, type = 3, branch = .3)

  ## Question 5
  The code below uses the model fitted above to produce predicted values for the withheld test data. It then computes how often the predicted class matches the observed value of `RainTomorrow`.
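
  To help read the summary of the 0/1 match indicator, the sketch below works through the same calculation on six made-up observations (not the assignment data). The vectors `observed` and `predicted` are hypothetical.

```r
# Made-up observed and predicted classes for illustration only
observed  <- c("No", "No", "Yes", "No", "Yes", "Yes")
predicted <- c("No", "Yes", "Yes", "No", "No", "Yes")

# 1 when the prediction matches the observed class, 0 otherwise
same_class <- ifelse(observed == predicted, 1, 0)

mean(same_class)    # proportion classified correctly (accuracy): 4 / 6
sum(same_class)     # number classified correctly: 4
length(same_class)  # number of observations: 6
```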

In [None]:
aus_weather_test <- aus_weather_test %>%
  mutate(rain_tomorrow_predict = predict(aus_weather_class, 
                                    newdata = aus_weather_test, 
                                    type = 'class'))

aus_weather_test %>%
  mutate(same_class = ifelse(RainTomorrow == rain_tomorrow_predict, 1, 0)) %>%
  df_stats(~ same_class, mean, sum, length)

  ## Question 6
  The code below creates a bar chart that shows the accuracy of the predictions compared to the actual observed data.
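
  A filled bar chart of this kind displays, within each observed class, the share of each predicted class. As a minimal sketch of the numbers behind such a chart, the example below tabulates eight made-up observations (not the assignment data) with base R. The vectors and the names `counts` and `row_props` are hypothetical.

```r
# Made-up observed and predicted classes for illustration only
observed  <- c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")
predicted <- c("No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes")

# Counts of each observed/predicted combination
counts <- table(observed = observed, predicted = predicted)
counts

# Row proportions: within each observed class, the share of each prediction.
# Each row sums to 1, matching one full-height bar in a filled bar chart.
row_props <- prop.table(counts, margin = 1)
row_props
```

  In this made-up example, 3 of the 4 observed "No" days and 3 of the 4 observed "Yes" days are predicted correctly, so each bar would be 75% the matching color.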

In [None]:
gf_bar(~ RainTomorrow, fill = ~rain_tomorrow_predict, data = aus_weather_test, 
       position = 'fill') %>% 
       gf_labs(y = "Proportion", 
       x = "Did it really rain tomorrow?", 
       fill = "Model Predicted Rainfall")