<h1 style="text-align: center;">Data Visualization and Data Manipulation Activities</h1>

<p style="text-align: center;">July, 2024</p>


## Data Visualization Activities

In this activity, we'll be using the datasets `mpg`, `gapminder`, and `diamonds` to construct some visualizations. Run the code below to load the ggplot2 and gapminder R packages, as well as these datasets.


```r
### Run this cell before continuing.

library(ggplot2)
library(gapminder)
data(mpg)
data(gapminder)
data(diamonds)
```

### Activity 1

1. Generate a scatterplot to answer the following question:

    * `mpg`: How is city driving fuel consumption rate related to engine size and drive train?    



2. Generate a boxplot to answer the following question:

    * `mpg`: How is the drive train related to engine size and vehicle class?



3. Generate faceted histograms to answer the following questions:

    * `diamonds`: How does the price distribution vary by cut?
    
    * `mpg`: How does the distribution of engine size vary by vehicle class?
    


4. Complete the following code to plot different smoothers. Experiment with different values for the arguments `method` and `span` in the `geom_smooth` function:

```r
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method=..., span=...)
```


5. Generate scatterplots overlaid with a `geom_smooth` **line** to answer the following questions. Use `se=FALSE` for this exercise:

    * `gapminder`: How does the life expectancy of each country change over time? (Colour AND facet the countries by continent.)
    
    * `diamonds`: What are the relationships between the price of a diamond and its carat for each colour of diamond?


## Data Manipulation Activities

We'll be loading the `OJ` (Orange Juice) dataset from the ISLR R package. This dataset contains customer purchase data at five different stores for two brands of orange juice, Citrus Hill (CH) and Minute Maid (MM). Each observation corresponds to the purchase of one of these brands, but price data is provided for both brands at the time of purchase. See the help documentation of `OJ` for more information on what each variable represents.


```r
### Run this cell before continuing.

library(dplyr)
library(ISLR)
data(OJ)
```

### Data Manipulation Activity 1

Use the `dplyr` R package to complete the following activity items:

1. Create two datasets, one that **ONLY** contains variables for CH OJ and another one that **ONLY** contains variables for MM OJ. This can be done by selecting variables based on the ends of their names. Additionally, make sure the `Purchase` and `STORE` variables are also selected for both datasets.

2. Filter each dataset such that only their corresponding purchases are present in each dataset. For example, for our CH OJ dataset, make sure that the only observations in the dataset are customer purchases of CH.

3. For each dataset, arrange the observations based on the **sale** price (not the price) of the purchase. Break any ties using the discount variable.

4. Generate a new variable (for each dataset) that represents the sale price as a percentage of the original price. Refer to this variable as `Sale_Perc_Orig`. 

HINT: Think of this as the ratio of the sale price and the original price OR the complement percentage of the discount percentage. Try creating the variable both ways.

5. For each dataset, obtain the following summaries for the newly created variables from the previous step: (1) mean, (2) median and (3) standard deviation. Try doing this again, but perform group summaries for each of the stores using the `STORE` variable.
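
The hint in item 4 says the same quantity can be reached two ways. The sketch below checks that on a single made-up purchase. The numbers are hypothetical and not taken from `OJ`, and it assumes the sale price equals the original price minus the discount, and that the discount percentage is stored as a fraction of the original price.

```r
# Hypothetical single purchase (made-up numbers, not taken from OJ).
price    <- 2.00                  # original price
discount <- 0.40                  # discount in dollars
sale     <- price - discount      # assumed: sale price = price - discount
pct_disc <- discount / price      # assumed: discount as a fraction of the price

ratio_way      <- sale / price    # sale price over original price
complement_way <- 1 - pct_disc    # complement of the discount fraction

# Compare with a tolerance rather than `==`, because of floating-point rounding.
same <- isTRUE(all.equal(ratio_way, complement_way))
```

If your two versions of `Sale_Perc_Orig` disagree on the real data, a tolerance-based comparison such as `all.equal()` is a safer check than `==`.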


### Data Manipulation Activity 2 (BONUS CHALLENGE)

Use the `dplyr` R package once again to complete the following activity items:


1. Using the original OJ dataset, create a boolean variable (`TRUE/FALSE`) that indicates whether or not the customer purchased MM when the sale price was lower for MM compared to CH. You can interpret this variable as "Did the customer buy MM instead of CH when it was the cheaper choice?". 

HINT: Use the `Purchase` and the `PriceDiff` variables.



2. Make a boolean variable (TRUE/FALSE) like in the previous step, but this time do it for purchases of CH when the sale price was lower for CH compared to MM.



3. Group by the `STORE` variable and summarize the total number of `TRUE` values for each of the above two variables across the different stores. 

HINT: A boolean variable is also interpreted in R as a 0/1 variable, so consider using a summation approach.



4. Repeat the previous step, but this time compute the **proportions** of cheaper purchases for each of the two variables across the store groups (i.e. the percentage of all store-specific sales). 

HINT: This should only require a couple of small alterations to the code. Also note that `n()` provides the number of observations within a group. For example, `OJ %>% group_by(STORE) %>% summarise(Group_Nums = n())` gives you the number of observations in each store.
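
The two hints above (logical values summing as 0/1, and `n()` counting rows per group) can be tried out on a small toy table before moving to `OJ`. This is a minimal sketch; the data frame `toy` and its columns `shop` and `flag` are hypothetical names invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: two shops and one logical flag per sale.
toy <- data.frame(
  shop = c("A", "A", "A", "B", "B"),
  flag = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

toy_summary <- toy %>%
  group_by(shop) %>%
  summarise(
    n_true    = sum(flag),        # TRUE counts as 1, FALSE as 0
    n_rows    = n(),              # number of rows in the group
    prop_true = sum(flag) / n()   # equivalently, mean(flag)
  )
```

Shop A has two `TRUE` flags out of three rows and shop B has none out of two, so `prop_true` should come out as 2/3 and 0.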





## Solutions to Data Visualization Activity 1

### Part I

<details>
<summary>Click here for Solution Part I</summary>
    
`ggplot(mpg, aes(displ, cty, col=drv)) + geom_point()`
</details>

### Part II



<details>
  <summary>Click here for Solution Part II</summary>
    
  `ggplot(mpg, aes(drv, displ, fill=class)) + geom_boxplot()`
</details>

### Part III




<details>
  <summary>Click here for Solution Part III</summary>
    
  `ggplot(diamonds, aes(price)) + geom_histogram() + facet_wrap(~cut)`

  `ggplot(mpg, aes(displ)) + geom_histogram() + facet_wrap(~class)`
</details>

### Part IV


<details>
  <summary>Click here for Solution Part IV</summary>
    
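   You can try various methods, including "lm", "loess", and "gam". The `span` argument controls the amount of smoothing for the loess smoother and typically takes values between 0 and 1, as mentioned in the lecture; smaller values give a wigglier curve. It has no effect with methods such as "lm".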
    
  `ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method="loess", span=0.5)`

</details>
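
To see what `span` does with the loess smoother, it can help to build the same plot twice with a small and a large value and compare them side by side. The sketch below is one way to do that; the object names `p_wiggly` and `p_smooth` and the two span values are arbitrary choices for illustration.

```r
library(ggplot2)

# Shared base plot: engine displacement vs. highway miles per gallon.
base_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Small span: each local fit uses few neighbouring points, so the curve is wiggly.
p_wiggly <- base_plot +
  geom_smooth(method = "loess", span = 0.2, se = FALSE)

# Large span: each local fit uses most of the data, so the curve is smooth.
p_smooth <- base_plot +
  geom_smooth(method = "loess", span = 1, se = FALSE)
```

Printing `p_wiggly` and `p_smooth` in separate cells draws the two plots.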





### Part V




<details>
  <summary>Click here for Solution Part V</summary>
    
  `ggplot(gapminder, aes(year, lifeExp, group=country, colour=continent)) + geom_point() + geom_smooth(method="lm", se=FALSE) + facet_wrap(~continent)`
    
  `ggplot(diamonds, aes(carat, price, colour=color)) + geom_point() + geom_smooth(method="lm", se=FALSE)`

</details>




## Solutions to Data Manipulation Activity 1

### Part I


<details>
  <summary>Click here for Solution Part I</summary>
    
  `CH_OJ = select(OJ, Purchase, ends_with("CH"), STORE)`

  `MM_OJ = select(OJ, Purchase, ends_with("MM"), STORE)`
</details>
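
One detail worth knowing when selecting by name endings: `ends_with()` ignores case by default. That does not appear to cause trouble with the `OJ` column names, but the toy example below shows the difference. The data frame `toy` and its column names are hypothetical, made up purely to trigger the behaviour.

```r
library(dplyr)

# Hypothetical toy data; "Patchaa" ends in lower-case "aa".
toy <- data.frame(
  id      = 1:2,
  PriceAA = c(1.5, 2.0),
  Patchaa = c(0, 1),
  PriceBB = c(1.8, 2.1)
)

# Default: case-insensitive, so "Patchaa" is matched as well.
default_match <- names(select(toy, id, ends_with("AA")))

# Strict: only names ending in upper-case "AA" are kept.
strict_match <- names(select(toy, id, ends_with("AA", ignore.case = FALSE)))
```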

### Part II





<details>
  <summary>Click here for Solution Part II</summary>
    
  `CH_OJ = filter(CH_OJ, Purchase=="CH")`

  `MM_OJ = filter(MM_OJ, Purchase=="MM")`
</details>

### Part III



<details>
  <summary>Click here for Solution Part III</summary>

  `CH_OJ = arrange(CH_OJ, SalePriceCH, DiscCH)`

  `MM_OJ = arrange(MM_OJ, SalePriceMM, DiscMM)`
</details>


### Part IV



<details>
  <summary>Click here for Solution Part IV</summary>
  
  `CH_OJ = mutate(CH_OJ, Sale_Perc_Orig1 = SalePriceCH/PriceCH, Sale_Perc_Orig2 = 1-PctDiscCH)`

  `MM_OJ = mutate(MM_OJ, Sale_Perc_Orig1 = SalePriceMM/PriceMM, Sale_Perc_Orig2 = 1-PctDiscMM)`
</details>



### Part V



<details>
  <summary>Click here for Solution Part V</summary>

  `summarise(CH_OJ, Mean = mean(Sale_Perc_Orig1), Median = median(Sale_Perc_Orig1), SD = sd(Sale_Perc_Orig1))`

  `summarise(MM_OJ, Mean = mean(Sale_Perc_Orig1), Median = median(Sale_Perc_Orig1), SD = sd(Sale_Perc_Orig1))`


  `CH_OJ %>% group_by(STORE) %>% summarise(Mean = mean(Sale_Perc_Orig1), Median = median(Sale_Perc_Orig1), SD = sd(Sale_Perc_Orig1))`

  `MM_OJ %>% group_by(STORE) %>% summarise(Mean = mean(Sale_Perc_Orig1), Median = median(Sale_Perc_Orig1), SD = sd(Sale_Perc_Orig1))`

</details>


## Solutions to Data Manipulation Activity 2 




### Part I

<details>
  <summary>Click here for Solution Part I</summary>
  
  `OJ_New = mutate(OJ, Cheap_MM_Purchase = Purchase=="MM" & PriceDiff<0)`
</details>


### Part II



<details>
  <summary>Click here for Solution Part II</summary>
  
  `OJ_New = mutate(OJ_New, Cheap_CH_Purchase = Purchase=="CH" & PriceDiff>0)`
</details>


### Part III



<details>
  <summary>Click here for Solution Part III</summary>
  
  `OJ_New %>% group_by(STORE) %>% summarise(Total_Cheap_MM_Purchases = sum(Cheap_MM_Purchase), Total_Cheap_CH_Purchases = sum(Cheap_CH_Purchase))`
</details>


### Part IV



<details>
  <summary>Click here for Solution Part IV</summary>
  
  `OJ_New %>% group_by(STORE) %>% summarise(Perc_Cheap_MM_Purchases = sum(Cheap_MM_Purchase)/n(), Perc_Cheap_CH_Purchases = sum(Cheap_CH_Purchase)/n())`
</details>
