   ## Assignment 1

   This assignment gives you practice using Jupyter Notebooks and R, and interpreting statistical output from real-world data. You will use the notebook below to generate the statistical output. The assignment is guided and much of the R code is provided for you, but you will be asked to complete specific parts of the code and decide on appropriate values to include. Run the notebook in sequential order, starting from the first code cell; the beginning cells load the R packages needed for the assignment and read in the data.

   After generating the statistical output, you will be asked to submit answers to questions on ICON. These questions focus on interpreting the statistical output generated from this notebook.

   You may work in groups of up to 3 to complete the assignment. If you work in a group, please turn in one assignment on ICON with all group members' names on the submission.

   *Assignment 1 Due*: **Sunday, October 3rd, by 11:59 pm**

  ## Description of the Data

The data used for this assignment are about board games published (i.e., released) between 2010 and 2016. As filtered in the setup code, the games have a minimum player count below 4, a maximum player count below 8, and a playing time of 120 minutes or less.

   The data were part of a [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12) example and came from the [Board Game Geek](http://boardgamegeek.com/) database. The database crowdsources game ratings and contains more than 90,000 games. The subsetted version used here contains data on 3,664 of those games. The attributes in the data used for the assignment are as follows.

   + **game_id**: The unique ID of the game
   + **max_players**: The maximum number of players for the game
   + **min_playtime**: Minimum playtime for the game
   + **min_age**: Minimum age for playing the game
   + **min_players**: Minimum number of players needed to play the game
   + **name**: Name of the game
   + **playing_time**: Playing time of the game
   + **year_published**: Year the game was published (i.e., able to be purchased or released)
   + **average_rating**: Average rating of the game from the crowdsourced database.
   + **play_time_groups**: The playing time split into 3 groups: less than 30 minutes, 30 to 60 minutes, and greater than 60 minutes (a playing time of exactly 60 minutes falls in the last group).
   + **category_group**: The primary category of the game, in 6 groups: Abstract Strategy, Adventure, Bluffing, Card Game, Dice Game, or Other.
   + **year_pub_char**: A character version of when the game was published, useful for visualization.
   + **min_players_char**: A character version of the minimum number of players, useful for visualization.
   + **max_players_char**: A character version of the maximum number of players, useful for visualization.


Please don't hesitate to reach out with any questions about the structure and interpretation of the variables in the data.

   ## Assignment Setup
   **Run this cell first upon opening the notebook. You will need to run this cell every time you leave and come back to the notebook.**

In [None]:
library(tidyverse)
library(ggformula)
library(mosaic)

theme_set(theme_bw(base_size = 16))

# Read the Tidy Tuesday board game data and keep the games used in this assignment
board_games <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-12/board_games.csv") %>%
  filter(year_published > 2009 & min_players < 4 & max_players < 8 & playing_time < 121) %>%
  select(game_id, max_players, min_playtime, min_age, min_players,
         name, playing_time, year_published,
         category, average_rating) %>%
  # Derive the grouping attributes and the character versions used for plotting
  mutate(play_time_groups = ifelse(playing_time < 30, 'less than 30 minutes',
                                   ifelse(playing_time >= 30 & playing_time < 60, 
                                          '30 to 60 minutes', 'greater than 60 minutes')),
         # Keep only the first listed category, then lump all but the 5 most common into "Other"
         category_group = gsub(",.+$", "", category),
         category_group = fct_lump_n(category_group, n = 5),
         year_pub_char = as.character(year_published),
         min_players_char = as.character(min_players),
         max_players_char = as.character(max_players)) %>%
  select(-category)

# Preview the first rows of the data
head(board_games)
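
The derived attributes are easier to interpret after seeing the recoding applied to a few values. The sketch below applies the same `ifelse()` and `gsub()` logic as the setup cell to a small data frame named `toy`, whose playing times and category strings are made up for illustration (they are not rows from the data). It uses only base R, so it does not depend on the packages loaded above.

```r
# Hypothetical toy values, invented for illustration (not rows from board_games)
toy <- data.frame(
  playing_time = c(15, 30, 45, 60, 90),
  category = c("Card Game,Fantasy", "Dice Game", "Abstract Strategy,Puzzle",
               "Bluffing,Party Game", "Adventure"),
  stringsAsFactors = FALSE
)

# Same nested ifelse() as the setup cell: exactly 60 minutes falls in the last group
toy$play_time_groups <- ifelse(toy$playing_time < 30, "less than 30 minutes",
                               ifelse(toy$playing_time >= 30 & toy$playing_time < 60,
                                      "30 to 60 minutes", "greater than 60 minutes"))

# Same gsub() pattern: drop everything from the first comma on, keeping the first category
toy$category_group <- gsub(",.+$", "", toy$category)

toy
```

Note that the boundary values 30 and 60 land in the "30 to 60 minutes" and "greater than 60 minutes" groups, respectively.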


   ## Question 1
   Explore the distribution of the `average_rating` attribute visually using the code provided below.

   Complete the code by replacing "^^" with the appropriate attribute and "??" with the visualization type you are most comfortable with. Finally, replace each "$$" with a descriptive plot title and x-axis label.

In [None]:
gf_??(~ ^^, data = board_games) %>%
  gf_labs(title = "$$", 
          x = "$$") 
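
As a hedged illustration of the one-sided formula syntax, the sketch below draws a histogram from a different data set, `mtcars`, which ships with R. The attribute `mpg`, the choice of `gf_histogram()`, the `bins` value, and the labels are stand-ins picked for this example only; they are not the answer to the question.

```r
library(ggformula)

# mtcars is a built-in R data set, used here only as a stand-in for board_games
demo_hist <- gf_histogram(~ mpg, data = mtcars, bins = 10) %>%
  gf_labs(title = "Distribution of fuel efficiency in mtcars",
          x = "Miles per gallon")
demo_hist
```

Other one-variable functions in ggformula, such as `gf_density()`, accept the same `~ attribute` formula.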


  ## Question 2
   Create a violin plot that explores the distribution of the `average_rating` attribute by any **one** of the following four attributes: `category_group`, `year_pub_char`, `min_players_char`, or `max_players_char`.

   Complete the code by filling in the attributes in the formula notation in place of "%%" and "^^", and finally include descriptive labels for the plot title, y-axis, and x-axis in place of the "$$".

In [None]:
p <- gf_violin(%% ~ ^^, data = board_games, fill = 'gray85', draw_quantiles = c(.1, .5, .9)) %>%
  gf_labs(y = '$$',
          x = '$$',
          title = '$$') %>%
  gf_refine(coord_flip())
p
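
To show how the two-sided formula `y ~ x` maps onto a violin plot, here is a minimal sketch on the built-in `iris` data rather than the board game data. The attributes `Sepal.Length` and `Species` and all labels are stand-ins chosen for illustration and are unrelated to the question's answer.

```r
library(ggformula)

# iris is a built-in R data set, used here only as a stand-in for board_games
# Left side of the formula: numeric attribute; right side: grouping attribute
demo_violin <- gf_violin(Sepal.Length ~ Species, data = iris, fill = "gray85") %>%
  gf_labs(y = "Sepal length (cm)",
          x = "Species",
          title = "Sepal length by iris species") %>%
  gf_refine(coord_flip())
demo_violin
```

The `y` and `x` labels still refer to the left and right sides of the formula, even though `coord_flip()` swaps the axes on which they are drawn.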


   ## Question 3
   Create another violin plot that builds on your figure from question 2 by splitting it into separate panels for each level of `play_time_groups`.
   
   **Note**: the cell in question 2 must be run before this cell, because this cell uses the plot object `p` created in question 2.

In [None]:
p %>%
  gf_facet_wrap(~ play_time_groups)
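
The dependency on question 2 comes from reusing a stored plot object. The sketch below shows the same two-step pattern on the built-in `mtcars` data; the names `demo`, `p_demo`, `p_facet`, `cyl_char`, and `am_char` are hypothetical and exist only in this example.

```r
library(ggformula)

# Stand-in data: mtcars with character versions of two columns,
# mirroring the *_char attributes in board_games
demo <- transform(mtcars,
                  cyl_char = as.character(cyl),
                  am_char = as.character(am))

# Step 1: build and store the plot (the role of `p` in question 2)
p_demo <- gf_violin(mpg ~ cyl_char, data = demo, fill = "gray85")

# Step 2: reuse the stored plot and split it into one panel per am_char value
p_facet <- p_demo %>%
  gf_facet_wrap(~ am_char)
p_facet
```

If step 1 has not been run in the current session, step 2 fails because `p_demo` does not exist, which is why the cells must be run in order.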
