# Lab 2
This lab reviews the ggplot material covered over the past week, starts to explore whether the 2014 Afghanistan election shows signs of foul play, and then works through exercises that mainly review dplyr.

## Table of Contents
* [Review](#first)
* [Explore](#second)
* [Exercises](#third)

## Review <a class="anchor" id="first"></a>

### Step One - Don't forget to bring in R packages we'll need

In [None]:
library(tidyverse)

### Downloading my notebooks

Go to GitHub > Click on the notebook you want to download > right click on "Raw" and choose "Save Link As..." > Save the file as a Jupyter notebook (.ipynb)

### What if I am STILL having issues getting Jupyter working?

Professor Terhorst set up an online environment with Jupyter notebook installed and ready to go. All you need to do is log in with your Michigan credentials and you will see all of his lectures and problem sets. You can edit the problem set notebook directly, save your results, and then download it for submission once you are done.

https://jupyter.stats306.org/hub/login

### Major functions covered so far

#### Visualization (ggplot)

*Here are some of the key functions for plotting (sub-bullets are arguments within the function). This is not meant to be an exhaustive list:*

* ggplot(): Tells R that we want to plot something
* labs(): Defines the labels on a plot (e.g. labs(x="x label", y="y label", title="title", subtitle="subtitle"))
* geom_point(): Tells R that we want to use points to plot the data (e.g. scatter plot)
    * color: use this argument to set the color
    * size: use this argument to set the size of the points
    * position: normally you would set this to "jitter" if you want all the points to be viewable
* facet_grid(): Uses categorical variables in the "row ~ column" format to plot previous geoms for each category
* facet_wrap(): Very similar to facet_grid(), but only displays category combinations that have data, and lets you control the shape of the grid.
    * nrow, ncol: Defines the number of rows and column the plot should be
* geom_smooth(): Plots a smoothed trend line for the given data, with a confidence interval by default
    * linetype: Use a categorical variable to split into multiple lines, each drawn with a different line type
    * se: Boolean to define if the confidence interval should be displayed
    * group: Use a categorical variable to split into multiple lines that all look the same (use color to give each line its own color)
* geom_bar(): Plots a bar chart, default aggregation (or statistic) is going to be a count of each category of x
    * stat: Default here is "count", but if a y variable is to be used, change this to "identity"
    * fill: Tells R what color to fill the bar with (normally use a variable)
    * color: Tells R what color to outline the bars with
    * position: Tells R where to position the bars, "dodge" will make side by side bars, "fill" makes proportions
* geom_boxplot(): Plots a boxplot, note that the x variable has to be categorical and the y variable has to be quantitative.
* coord_flip(): Flips the x and y axis of whatever plot you create, helpful for making horizontal boxplots
* coord_polar(): Converts the plot into polar coordinates


*In general, all plots that you will need to create will follow this general format in ggplot:*

ggplot(data = < DATA>) +

< GEOM_FUNCTION>(mapping = aes(< MAPPINGS>),stat = < STAT>,position = < POSITION>) +

< COORDINATE_FUNCTION> +

< FACET_FUNCTION>
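
To make the template above concrete, here is one possible way to fill in every slot using the mpg data. The particular choices (a dodged bar chart of class, faceted by year) are only an illustration, not the one right answer.

```R
library(ggplot2)

# Each piece of the template is filled in with a concrete choice
p <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv),  # GEOM_FUNCTION and MAPPINGS
           stat = "count",                        # STAT
           position = "dodge") +                  # POSITION
  coord_flip() +                                  # COORDINATE_FUNCTION
  facet_wrap(~ year)                              # FACET_FUNCTION
p
```

Most plots only need a few of these slots, since stat, position, the coordinate system, and faceting all have defaults.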

#### Manipulation (dplyr)

*Here are some of the main functions we've covered in dplyr to manipulate data frames (or tibbles):*
* filter(): Takes a dataframe and filters it by some condition
* arrange(): Sorts the dataframe by certain columns
* select(): Picks out variables (or columns) in your dataframe you wish to keep
* mutate(): Creates new variables based on user defined functions
* summarize(): Creates a summary table based on conditions you define (e.g. average mpg by car class)
* %>% : Pipe operator (included in dplyr package), essentially lets you string together multiple commands, where the output of what is to the left of the pipe is put in the first argument position of the right function. For example:
```R
exp(diff(log(x))) 
#This is equivalent to this:
x %>% log() %>% diff() %>% exp()
```
* Operators: When you want to define a condition, here are the main operators you can use
    * <, >: less than, greater than
    * <=, >= : less than or equal to, greater than or equal to 
    * ==, !=: equal to, not equal to
    * !x: not x (e.g. if x is TRUE, !x is FALSE)
    * x|y: x or y
    * x & y: x and y
    * x %in% list: show me values of x that exist in the 'list'
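
As a quick sketch of how the operators above combine inside filter(), here are two made-up questions asked of the mpg data (the variable names efficient_small and not_front are just for illustration):

```R
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Compact or subcompact cars from 2008 that get at least 30 mpg on the highway
efficient_small <- mpg %>%
  filter(class %in% c("compact", "subcompact"), year == 2008 & hwy >= 30)

# Cars that are NOT front-wheel drive and have either 4 or 5 cylinders
not_front <- mpg %>%
  filter(!(drv == "f"), cyl == 4 | cyl == 5)
```

Separating conditions with a comma inside filter() behaves like &, so every condition has to be TRUE for a row to be kept.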

### What is a function?

Here is a simple function that adds a and b. We say "a" is the first argument and "b" is the second argument. Anytime you run "library(packagename)" you are essentially telling R to load hundreds or even thousands of functions that are all defined just like the one below.

In [None]:
addthese = function(a,b) {
    c = a + b
    return(c)
}

In [None]:
addthese(2,3)

You can name variables anything you want in these functions. You can also set a default value by adding "=something" after any argument.

In [None]:
gimmeascat = function(hereyougo) {
    nohereyougo = geom_point()
    return(hereyougo + nohereyougo)
}

In [None]:
data = mpg
plot = ggplot(data=data, mapping=aes(x=hwy, y=displ))
gimmeascat(hereyougo = plot)

In [None]:
#Would this work?
gimmeascat(ggplot(data=data, mapping=aes(x=hwy, y=displ)))
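
Default argument values were mentioned above but not shown. Here is a minimal sketch, using a made-up variant of addthese() called addthese_default():

```R
# Hypothetical function: b defaults to 10 when the caller leaves it out
addthese_default = function(a, b = 10) {
    return(a + b)
}

addthese_default(2)             # b falls back to its default, so this is 2 + 10
addthese_default(2, 3)          # an explicit b overrides the default
addthese_default(b = 1, a = 5)  # naming the arguments lets you pass them in any order
```

This is the same mechanism that lets you call head(mpg) without saying how many rows you want.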

### A couple helpful functions (not officially covered in class)

In [None]:
#Takes a dataframe (a table in R) and outputs top 6 rows by default
head(mpg)
#Or try head(mpg, 20)

In [None]:
#Takes a dataframe and outputs helpful stats about each variable (a column)
summary(mpg)

In [None]:
# Gives you the data types and a peek into the data for each variable
str(mpg)

In [None]:
#What if I think 'year' in mpg should be a factor?

#Solution
mpg$year_factor = factor(mpg$year)
str(mpg)

### Troubleshooting Practice!

Problem 1: Why is this throwing an error?

In [None]:
c(1,2,3,4)
mean(c)

# We never stored the vector in a variable, so "c" still refers to the built-in c() function, which combines values into a vector. Save the vector first (e.g. x = c(1,2,3,4)) and then call mean(x)

Problem 2: Why is this throwing an error?

In [None]:
aa = 26
bb = 20
cc = aaa + bb
mean(cc + 5)

# Take a look at when we define "cc", notice that "aaa" is never previously defined, it should have been "aa"

Problem 3: Why is this throwing an error?

In [None]:
#Run the function
hitme()

# you haven't defined the hitme() function yet

In [None]:
#Define the function
hitme = function() {
    print("~ Hit! ~")
}

# this should be run first

## Explore <a class="anchor" id="second"></a>

For this exploration, we will take a look at the 2014 Afghanistan election data and try to see if we can find anything fishy in this dataset. Notice that every time we explore a dataset we are generally following the data science flow shown in the image below.

**Note:** *This content uses some functions and techniques that are outside the scope of the course, it is meant to show you how what you're learning can be used in interesting problems.*
![Data Science Lifecycle](DS_Lifecycle.png)

In [None]:
#Load the data in from multiple csv's
fnl = read.csv("2014_afghanistan_preliminary_runoff_election_results.csv")
prelim = read.csv("2014_afghanistan_election_results.csv")

In [None]:
#Let's take a peek and make sure the data looks good
head(prelim)
head(fnl)

In [None]:
#It looks like the data is a wide format, which makes it hard to use, so let's convert the data into a better format
#Get all prelim candidate names instead of typing them out
nm = colnames(prelim)[5:15]
#Convert the prelim data to a long format
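#all_of() tells gather() that nm is a character vector of column names, not a column called "nm"
prelim = gather(prelim, candidate, votes, all_of(nm))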
#Convert the final data to a long format
fnl = gather(fnl, candidate, votes, c("Abdullah", "Ghani"))
#For this analysis we will just focus on the province, candidate, and votes
prelim = select(prelim,province,PC_number,candidate,votes)
fnl = select(fnl,Province,candidate,PC_number,votes)
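
If the wide-to-long step above feels abstract, here is a tiny sketch on made-up numbers (the provinces, candidates, and vote counts below are invented purely for illustration):

```R
library(tidyr)

# Made-up vote counts in wide format: one column per candidate
wide <- data.frame(
  province = c("North", "South"),
  CandidateA = c(10, 30),
  CandidateB = c(20, 40)
)

# gather(data, key, value, columns): stack the candidate columns into rows
long <- gather(wide, candidate, votes, c("CandidateA", "CandidateB"))
long
```

Each province now has one row per candidate, which is the shape group_by() and ggplot() work best with. Newer versions of tidyr also offer pivot_longer(wide, cols = c(CandidateA, CandidateB), names_to = "candidate", values_to = "votes"), which should give the same columns with the rows in a different order.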

In [None]:
#Let's make sure the data looks ok
#Sort first and take head() last, so we see the provinces with the fewest votes
prelim %>% group_by(province) %>% summarize(tot = sum(votes)) %>% arrange(tot) %>% head()

In [None]:
#It looks like there are blank province values in the data with very few votes, so let's filter them out
prelim = filter(prelim,province != "")
fnl = filter(fnl,Province != "")

In [None]:
#Let's see if there were any large differences in the number of votes between the two elections
#First, we'll aggregate the data by province for both the prelim and final data
tmp = prelim %>% group_by(province) %>% summarize(prelim_votes = sum(votes)) %>% arrange(desc(prelim_votes))
tmp2 = fnl %>% group_by(Province) %>% summarize(fnl_votes = sum(votes)) %>% arrange(desc(fnl_votes))
#Now we can take these two tibbles and join them on matches of Province
cmp_votes = inner_join(tmp,tmp2,by=c("province" = "Province"))
#Now we find the difference between the final and prelim votes
cmp_votes = mutate(cmp_votes, diff = fnl_votes - prelim_votes) %>% arrange(diff)
cmp_votes
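
The join above can be hard to picture, so here is a small sketch on made-up totals (the table names first_round and runoff and all of the numbers are invented). It assumes, like the real data, that the key column is spelled differently in each table.

```R
library(dplyr)

# Made-up totals; note the key column is spelled differently in each table
first_round <- tibble(province = c("North", "South", "East"),
                      prelim_votes = c(100, 200, 50))
runoff <- tibble(Province = c("North", "South", "West"),
                 fnl_votes = c(150, 180, 70))

# inner_join keeps only the provinces present in BOTH tables
joined <- inner_join(first_round, runoff, by = c("province" = "Province")) %>%
  mutate(diff = fnl_votes - prelim_votes) %>%
  arrange(diff)
joined
```

East and West drop out because each appears in only one table, which is worth remembering when a province seems to be missing from cmp_votes.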

In [None]:
#Let's try to make a heatmap to see if we can tell anything unusual about the election
#Remember that Ghani was the eventual winner of the election
prelim %>% group_by(province,candidate) %>% 
    summarize(totl = sum(votes)) %>% 
    ggplot() + geom_tile(aes(x=province, y=candidate, fill=totl)) +
    scale_fill_gradient(low = "white", high = "steelblue") + theme(axis.text.x = element_text(angle=90))

In [None]:
#Are there certain provinces that strongly favor one candidate?
group_by(fnl, Province,candidate) %>% summarize(tot=sum(votes)) %>% arrange(tot) %>% 
ggplot() + geom_bar(mapping = aes(x=Province, y=tot,fill=candidate),position="dodge", stat="identity") + theme(axis.text.x = element_text(angle=90))

In [None]:
#We could try to continue this analysis to the model phase by perhaps seeing if we could predict the winner of the election
# based on the prelim data only. This could be done by modifying the data further and using something like a
# Naive Bayes classifier in R to do the prediction.

## Exercises <a class="anchor" id="third"></a>

### Section 3.7 & 3.8 & 3.9

In [None]:
# What does geom_col() do? How is it different to geom_bar()?

#Solution:
#If you read through the documentation, you will see that geom_col() sets stat to "identity" while geom_bar() has stat set to "count"
?geom_col
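
To see the difference in practice, here is a rough sketch that should draw the same bar chart of car classes two ways (the names bar_counts, class_counts, and col_counts are just for illustration):

```R
library(dplyr)
library(ggplot2)

# geom_bar() counts the rows in each class for us
bar_counts <- ggplot(mpg) + geom_bar(aes(x = class))

# geom_col() expects the bar heights to already be in the data
class_counts <- count(mpg, class)  # one row per class, with the count in column n
col_counts <- ggplot(class_counts) + geom_col(aes(x = class, y = n))
```

Use geom_bar() when you have raw rows to count, and geom_col() when you have already summarized the data yourself.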

In [None]:
# What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point()

#Solution:
# Lots of the points are overlapping here, update the position to "jitter"
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point(position="jitter")

In [None]:
# What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

#Solution:
# Using mpg, the default position is "dodge2", which places the boxes side by side
ggplot(data = mpg, aes(x = drv, y = hwy, color = class)) + geom_boxplot()
# If we change this to "identity", they would overlap
ggplot(data = mpg, aes(x = drv, y = hwy, color = class)) + geom_boxplot(position="identity")

In [None]:
# Use the boxplot you created in the previous exercise, and make it horizontal with new x, y, and title labels

#Solution:
#Using the previous plot, we define a new plot
previous = ggplot(data = mpg, aes(x = drv, y = hwy, color = class)) + geom_boxplot()
newplot = previous + coord_flip() + labs(x="x", y="y", title="this is a title")
newplot

### Section 4

In [None]:
# Tweak each of the following R commands so that they run correctly:
ggplot(dota = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

fliter(mpg, cyl = 8)

filter(diamond, carat > 3)


#Solution:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

filter(mpg, cyl == 8)

filter(diamonds, carat > 3)

### Section 5

In [None]:
library(nycflights13) # provides the flights and airlines data frames used in this section

# Find all flights that:
# 1. Had an arrival delay of two or more hours
# 2. Flew to Houston (IAH or HOU)
# 3. Were operated by United, American, or Delta
# 4. Departed in summer (July, August, and September)
# 5. Arrived more than two hours late, but didn’t leave late
# 6. Were delayed by at least an hour, but made up over 30 minutes in flight
# 7. Departed between midnight and 6am (inclusive)

#Solution:
#1)
flights %>% filter(arr_delay >= 120)
#2)
flights %>% filter(dest %in% c("IAH", "HOU"))
#3) Start by looking up the carrier codes in the 'airlines' data set
filter(flights, carrier %in% c("AA", "DL", "UA"))
#4)
filter(flights, between(month, 7, 9))
#5)
filter(flights, !is.na(dep_delay), dep_delay <= 0, arr_delay > 120)
#6)
filter(flights, !is.na(dep_delay), dep_delay >= 60, dep_delay-arr_delay > 30)
#7)
filter(flights, dep_time <=600 | dep_time == 2400)

In [None]:
# How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

#Solution:
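#The result has one row per flight with a missing dep_time, so its row count answers the first question
filter(flights, is.na(dep_time))
#These rows are also missing dep_delay, arr_time, arr_delay and air_time, so they are most likely cancelled flights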

In [None]:
# Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

#Solution:
# Here are a few options
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with("dep_"), starts_with("arr_"))
select(flights, matches("^(dep|arr)_(time|delay)$"))

In [None]:
# What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")

#Solution:
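#one_of() selects every column whose name appears in a character vector, so the names can live in a variable like vars
select(flights, one_of(vars))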
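
Here is a small sketch of selecting columns from a character vector. It uses a made-up one-row stand-in for flights (flights_small) so it runs without nycflights13, and it also shows any_of(), a newer helper in dplyr that does a similar job.

```R
library(dplyr)

# Made-up stand-in with a few of the columns found in flights
flights_small <- tibble(year = 2013, month = 1, day = 1,
                        dep_delay = 2, arr_delay = 11, carrier = "UA")

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

picked <- select(flights_small, one_of(vars))

# any_of() quietly skips names that are not columns in the data
picked_any <- select(flights_small, any_of(c(vars, "not_a_column")))
```

Both calls should return the same five columns, and carrier is dropped because it is not in vars.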

In [None]:
# Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))

#Solution
#Notice that the select helpers ignore case by default, so "TIME" matches dep_time, arr_time, and so on, which is different from regular R behavior
#To change the default, pass ignore.case = FALSE to the helper
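
A quick sketch of the case behavior, using a made-up one-row tibble (toy, with an invented TIME_zone column) so the difference is easy to see:

```R
library(dplyr)

toy <- tibble(dep_time = 517, arr_time = 830, TIME_zone = "EST", carrier = "UA")

# Default: case is ignored, so all three time columns match
any_case <- names(select(toy, contains("TIME")))

# ignore.case = FALSE: only the column with upper-case TIME matches
exact_case <- names(select(toy, contains("TIME", ignore.case = FALSE)))
```

The same ignore.case argument is available on starts_with(), ends_with(), and matches().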

In [None]:
# Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

#Solution:
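#arr_time and dep_time are stored as HHMM clock times (e.g. 1530 means 3:30pm), not minutes, so subtracting them directly does not give minutes
#They are also in local time and do not take into account different time zones, which is another issue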
mutate(flights,
       air_time2 = arr_time - dep_time,
       air_time_diff = air_time2 - air_time) %>%
  filter(air_time_diff != 0) %>%
  select(air_time, air_time2, dep_time, arr_time, dest)
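
One way to handle the HHMM part of the fix is to convert both clock times to minutes since midnight before subtracting. The sketch below uses three made-up flights (toy_flights) and does not attempt the time zone correction.

```R
library(dplyr)

# Made-up rows in the same HHMM layout as dep_time and arr_time in flights
toy_flights <- tibble(dep_time = c(517, 1455, 2350),
                      arr_time = c(830, 1600, 45))

toy_flights <- toy_flights %>%
  mutate(
    # %/% is integer division (the hours), %% is the remainder (the minutes)
    dep_min = (dep_time %/% 100) * 60 + dep_time %% 100,
    arr_min = (arr_time %/% 100) * 60 + arr_time %% 100,
    # add a full day of minutes, then wrap, so overnight flights stay positive
    elapsed = (arr_min - dep_min + 1440) %% 1440
  )
toy_flights
```

Even after this conversion, the result on the real data will not match air_time exactly, because of time zones and because air_time does not include time spent taxiing.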

In [None]:
# Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

#Solution:
mutate(flights,
       dep_delay_rank = min_rank(-dep_delay)) %>%
  arrange(dep_delay_rank) %>% 
  filter(dep_delay_rank <= 10)
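
How ties are handled depends on which ranking function you pick. Here is a small sketch on four made-up delays, two of which are tied for the longest:

```R
library(dplyr)

delays <- c(30, 50, 50, 10)

r_min   <- min_rank(desc(delays))    # ties share the smallest rank, leaving a gap: 3 1 1 4
r_dense <- dense_rank(desc(delays))  # ties share a rank with no gap: 2 1 1 3
r_row   <- row_number(desc(delays))  # ties are broken by position: 3 1 2 4
```

With min_rank(), filtering on rank <= 10 can return more than 10 rows when there are ties at the cutoff, while row_number() always returns exactly 10.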

In [None]:
# What does 1:3 + 1:10 return? Why?

#Solution:
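#1:3 is shorthand for c(1,2,3). R adds vectors element by element and recycles the shorter one, so 1:3 is reused as 1,2,3,1,2,3,1,2,3,1
#The result is 2 4 6 5 7 9 8 10 12 11, along with a warning because 10 is not a multiple of 3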
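
To check the recycling behavior yourself, this sketch builds the recycled vector by hand and compares it to what R does automatically (short_vec, long_vec, and recycled are names chosen just for this example):

```R
short_vec <- 1:3
long_vec <- 1:10

# Repeat the shorter vector until it is as long as the longer one: 1 2 3 1 2 3 1 2 3 1
recycled <- rep(short_vec, length.out = length(long_vec))

# suppressWarnings() hides the "longer object length is not a multiple" warning
result <- suppressWarnings(short_vec + long_vec)
result

# TRUE: R did the same recycling behind the scenes
identical(result, recycled + long_vec)
```

Recycling is also why something like mpg$hwy + 1 works: the single number 1 is reused for every row.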

In [None]:
# Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

#Solution:
#First make a nice dataframe or tibble to organize the data we want to investigate
canceled_delayed = flights %>% mutate(canceled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(prop_canceled = mean(canceled),avg_dep_delay = mean(dep_delay, na.rm = TRUE))

#Now make a scatter to see if we notice a trend
ggplot(canceled_delayed, aes(x = avg_dep_delay, y = prop_canceled)) + geom_point() + geom_smooth()

In [None]:
# Which carrier has the worst delays? 

#Solution
#From this, it looks like Frontier (carrier code F9) is the worst for delays
flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))

In [None]:
# What time of day should you fly if you want to avoid delays as much as possible?

#Solution
#Let's see what hour has the shortest delays
flights %>%
  group_by(hour) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(arr_delay)

In [None]:
# For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

#Solution
flights %>% 
  filter(!is.na(arr_delay), arr_delay > 0) %>%  
  group_by(dest) %>%
  mutate(total_delay = sum(arr_delay),
         prop_delay = arr_delay / sum(arr_delay))

In [None]:
# Delays are typically temporally correlated: even once the problem that caused the initial delay has
# been resolved, later flights are delayed to allow earlier flights to leave. Using lag() explore how the
# delay of a flight is related to the delay of the immediately preceding flight.

#Solution
#Let's make a data set using the lag() function, then plot this data on a scatter
flights %>%
  group_by(year, month, day) %>% filter(!is.na(dep_delay)) %>% mutate(lag_delay = lag(dep_delay)) %>%
  filter(!is.na(lag_delay)) %>%
  ggplot(aes(x = dep_delay, y = lag_delay)) +
  geom_point() +
  geom_smooth()
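
If lag() is new to you, this tiny sketch on made-up delays shows what it does. It calls dplyr::lag() explicitly, since base R also has a different function named lag().

```R
library(dplyr)

delays <- c(5, 20, 0, 45)

# Shift every value one position later; the first flight has no predecessor, so it gets NA
previous_delay <- dplyr::lag(delays)
previous_delay
```

That leading NA is why the solution above filters out rows where lag_delay is missing before plotting.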

In [None]:
# Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent
# a potential data entry error). Compute the air time of a flight relative to the shortest flight to that
# destination. Which flights were most delayed in the air?

#Solution
flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  mutate(med_time = median(air_time),fast = (air_time - med_time) / med_time) %>%
  arrange(fast) %>%
  select(air_time, med_time, fast, dep_time, sched_dep_time, arr_time, sched_arr_time) %>%
  head(15)