# Lab 4

For this lab we will start by reviewing some core concepts around dates, ranking, and loops. Then we will explore speed dating data and end with exercises that let you explore cocoa rating data.

## Table of Contents
* [Review](#Review)
* [Explore](#Explore)
* [Exercises](#Exercises)

In [None]:
library(tidyverse)
library(nycflights13)

## Review

## How to date R

Dates in R, as in any other programming language, can be fairly complicated and need special handling. When you give R something like "10/02/2002", you may think it's obvious that this is a date. But R just views this as a regular string, in other words a piece of text like "taco cat". You have to explicitly tell R how to handle these strings so that they are stored properly. For instance, is the format of this date MM/DD/YYYY or DD/MM/YYYY? But once you tell R that this is a date and what format it has, fairly complex operations on it become much easier (e.g. what day of the week is 03/12/2045? What is the date for 02/10/2060 + 60 days?).
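
To make the ambiguity concrete, here is a minimal sketch in base R that parses the same text under both candidate formats and then answers the two example questions. The variable names (`txt`, `as_mdy`, `as_dmy`, `future`, `later`) are made up for illustration, and the example dates are assumed to be in MM/DD/YYYY.

```r
# The same text, parsed under two different assumed formats
txt <- "10/02/2002"
as_mdy <- as.Date(txt, format = "%m/%d/%Y")  # read as October 2, 2002
as_dmy <- as.Date(txt, format = "%d/%m/%Y")  # read as February 10, 2002
as_mdy == as_dmy  # FALSE: R needs the format to know which one you mean

# Once R knows it is a date, the questions above are one-liners
future <- as.Date("03/12/2045", format = "%m/%d/%Y")
weekdays(future)  # name of the weekday (printed in your locale's language)

later <- as.Date("02/10/2060", format = "%m/%d/%Y") + 60
later  # "2060-04-10"
```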

There are multiple functions and packages in R for handling dates; here are a couple of the common ways.

### Base R (No packages required)

In [None]:
#Here we create a date type variable, I used two different formats for the same value
dt = as.Date("10/02/2002", format="%m/%d/%Y")
dt2 = as.Date("October 2, 2002", format = "%B %d, %Y")

In [None]:
#How did I know what format to use? I looked at the documentation
?strptime

In [None]:
#If we look at the class here, we see dt is in fact a date
class(dt)

In [None]:
class("10/02/2002")

In [None]:
#Now we can do nifty things like add days to our date
dt + 10
dt2 + 10

In [None]:
#What if I want the difference between two dates in weeks?
dt2 = as.Date("12/22/2002", format="%m/%d/%Y")
#Later date first so the difference is positive; units must be "weeks" to match the question
difftime(dt2, dt, units = "weeks")

In [None]:
#You can also use dates in regular R functions now, let's say I want 6 weeks in a vector from the original date I set
seq(dt, length = 6, by = 7)

In [None]:
#You can also still use boolean comparisons
# dt2 = "12/22/2002"
# dt = "10/02/2002"
dt2 > dt

In [None]:
#Finally, we can also have datetimes, which are dates with the time also included
#In R, there are two main classes for this: POSIXct and POSIXlt

#Taking a look at the flight data notice that time_hour is in a datetime format
str(flights)

In [None]:
#Or just run class() to verify this
class(flights$time_hour)

In [None]:
#If I wanted to create a date field from this, here's all I would need to do
flights_new = flights %>% mutate(date = as.Date(time_hour))
str(flights_new)

### Using a Package (Lubridate)

Lubridate is a package that builds on R's underlying datetime classes (POSIXct and POSIXlt) to provide more user-friendly functions that make manipulating dates a lot easier.

In [None]:
library(lubridate)

In [None]:
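#First let's create a simple datetime. Note that ISOdate() is actually a base R function and returns a datetime (POSIXct, noon GMT by default)
#ymd_hms() is the lubridate parser for strings that include a time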
dt = ISOdate("2017", "02", "08")
dt2 = ymd_hms("2017-02-08 05:00:00")
dt

In [None]:
#Note that under the hood this variable is just a regular R datetime
class(dt)

In [None]:
#Now we have a whole arsenal of functions to play with these datetimes
# dt = "2017-02-08 12:00:00 GMT"
year(dt)
week(dt)
wday(dt)
hour(dt)
tz(dt)

In [None]:
#And there are multiple arguments for each function to get the output you want
wday(dt, label = TRUE)

In [None]:
#We can also add durations to each date
# dt = "2017-02-08 12:00:00 GMT"
dtnew = dt + ddays(5) + dminutes(20) + dhours(10) + dyears(1)
dtnew
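
As a complement to the cells above, here is a small sketch of lubridate's own date parsers, which are named after the order of the components in the text. It assumes the `lubridate` package is installed; the variable names are hypothetical. Unlike `ISOdate()`, these parsers return plain `Date` objects when the text carries no time of day.

```r
library(lubridate)

# Parsers are named after the component order in the text
d_mdy <- mdy("10/02/2002")  # month/day/year
d_dmy <- dmy("10/02/2002")  # day/month/year

class(d_mdy)  # "Date"
d_mdy
d_dmy

# Periods such as days() and years() follow the calendar
d_mdy + days(10)
ymd("2017-02-08") + years(1)
```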

## Ranking Rguments

Ranking fields in a data frame is a super common thing that comes up in many problems you'll tackle. Here I'll review some of the basics of doing this in R using the flights dataset.

In [None]:
#Let's first say we want to find the top 10 tail numbers based on the distance traveled
head(flights,1)

In [None]:
#Here we make a data frame that takes the top 10 values based on distance
flights_ranked = flights %>% arrange(desc(distance)) %>% top_n(10,distance)

In [None]:
#Huh? Why are there 342 rows? top_n() keeps ties, so every flight tied at the cutoff distance is included
nrow(flights_ranked)

In [None]:
#Let's go about it a different way
flights_ranked2 = flights %>% arrange(desc(distance)) %>% mutate(rnk = row_number()) %>% filter(rnk <= 10)

In [None]:
flights_ranked2

In [None]:
#Now what if I just want the unique tailnumbers?
unq_tails = unique(flights_ranked2$tailnum)

In [None]:
unq_tails
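
The tie behavior is easier to see on a tiny example. The sketch below uses a made-up data frame (`toy`, with hypothetical columns `tail` and `distance`) and compares `top_n()`, the `row_number()` approach from above, and `slice_max()` with `with_ties = FALSE`, which should be available in dplyr 1.0.0 and later.

```r
library(dplyr)

# Made-up data: three planes are tied at distance 400
toy <- tibble(
  tail = c("A", "B", "C", "D", "E"),
  distance = c(500, 400, 400, 400, 100)
)

# top_n() keeps every row tied at the cutoff, so we get more than 2 rows
with_ties <- toy %>% top_n(2, distance)
nrow(with_ties)  # 4

# row_number() breaks ties by row order, so we get exactly 2 rows
by_row_number <- toy %>%
  arrange(desc(distance)) %>%
  mutate(rnk = row_number()) %>%
  filter(rnk <= 2)
nrow(by_row_number)  # 2

# slice_max() can do the same in one step
by_slice <- toy %>% slice_max(distance, n = 2, with_ties = FALSE)
nrow(by_slice)  # 2
```

Which rows survive a tie is arbitrary in the last two approaches, so check whether ties matter for your question before dropping them.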

##  > LoopeR <

Some of you have been using loops to solve the HW problems, so I wanted to make sure the syntax and use of these is clear to everyone. In general, loops are SLOW, so you will only want to use them if you really need to. R provides a lot of optimized functions that don't require loops, so always try to use these first.

What are loops?
Basically, they are just a way to iterate through a vector and perform the same operation on each element. The two main types of loops you will encounter are `for` loops and `while` loops.

In [None]:
#Lets create two simple vectors
vct1 = c(2,4,5,6)
vct2 = c(1,2,3,4)

In [None]:
seq(1,length(vct1),1)

In [None]:
#Now we can cycle through each element of vct1
#A for loop runs once for each element of the vector to the right of "in" (here the positions 1 to length(vct1))
for(i in seq(1,length(vct1),1)) {
    print(i)
}

In [None]:
# We can define variables that use elements in these loops
x = 0
for(i in vct1) {
    x = x+i
}
#see what's in x
x

In [None]:
# More commonly we can also use multiple vectors to do things
#Let's say I want to compare each element of vct1 to vct2 and store this in a new vector
compare_vect = c()
for(i in seq(length(vct1))) {
    compare_vect[i] = vct1[i] >= vct2[i]
}

#Let's see what's in this vector
compare_vect

In [None]:
#While loops can be used if you want to continue to execute until some condition is met

#~~~~~~~~~~~~~~~ BE CAREFUL THOUGH! IF YOUR CONDITION IS NEVER MET THE LOOP WILL NOT STOP~~~~~~~~~~~~~~~#
cnt = 0
while(cnt < 10) {
    print("hi!")
    cnt = cnt + 1
}
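
Since loops should be a last resort, it is worth seeing that the `for` loops above have direct vectorized equivalents. This is a minimal sketch that redefines the same two vectors so it runs on its own; the result names (`total`, `compare_vect`, `idx`) are chosen for illustration.

```r
vct1 <- c(2, 4, 5, 6)
vct2 <- c(1, 2, 3, 4)

# Instead of accumulating x in a for loop
total <- sum(vct1)
total  # 17

# Instead of filling compare_vect one element at a time
compare_vect <- vct1 >= vct2
compare_vect  # TRUE TRUE TRUE TRUE

# If you do need the positions, seq_along() is a safe way to get them
# (it returns an empty sequence for an empty vector)
idx <- seq_along(vct1)
idx  # 1 2 3 4
```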

## Explore

The following analysis is based on data and work that can be found on Kaggle. This data was collected by Columbia to investigate gender differences in mate selection (their words not mine :) ). They essentially ran speed dating sessions from 2002 to 2004. Each participant was allowed 4 minutes with someone from the opposite sex. Afterwards, they were asked if they would like to see their date again and rated them on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests. For this exploration we'll try to spot differences in how men and women choose who they want to date!

**Note:** *This content uses some functions and techniques that are outside the scope of the course, it is meant to show you how what you're learning can be used in interesting problems.*

In [None]:
#Library needed for radar charts
library(fmsb)

In [None]:
#Load in the data
rawdat = read.csv('Speed Dating Data.csv', header = TRUE, stringsAsFactors = FALSE)

In [None]:
head(rawdat)

### What are speed daters looking for in their matches?

In [None]:
#Tons of cleanup code to get it ready for plotting
dat = rawdat %>% select(-id, -idg, -condtn, -round, -position, -positin1, -order, -partner, -tuition, -undergra, -mn_sat)
#Keep only rows with a known partner id (is.na() is the explicit test for missing values)
at00 = dat %>% select(iid, pid, dec, gender, attr, sinc, intel, fun, amb, shar, like, prob) %>% filter(!is.na(pid))
at00[is.na(at00)] = 1000
at00$total = rowSums(at00[,c("attr", "sinc", "intel", "fun", "amb", "shar")])
at00 = at00 %>% filter(!total == "6000")
at00[at00 == "1000"] = NA
at00$total = rowSums(at00[,c("attr", "sinc", "intel", "fun", "amb", "shar")], na.rm=TRUE)
at00 = at00 %>% filter(!total == "0")
at00 = at00 %>% mutate(pgender = ifelse(gender == 0, 1, 0))
at11 =dat %>%group_by(gender) %>%select(iid, gender, attr1_1, sinc1_1, intel1_1, fun1_1, amb1_1, shar1_1) %>% unique()
at11[is.na(at11)] = 0
at11$total = rowSums(at11[,c("attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1", "shar1_1")])
at11 = at11 %>% filter(!total == "0")
at11$attr1_1 = round(at11$attr1_1/at11$total*100, digits = 2)
at11$sinc1_1 = round(at11$sinc1_1/at11$total*100, digits = 2)
at11$intel1_1 = round(at11$intel1_1/at11$total*100, digits = 2)
at11$fun1_1 = round(at11$fun1_1/at11$total*100, digits = 2)
at11$amb1_1 = round(at11$amb1_1/at11$total*100, digits = 2)
at11$shar1_1 = round(at11$shar1_1/at11$total*100, digits = 2)
at11$total = rowSums(at11[,c("attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1", "shar1_1")])
at11$total = round(at11$total, digits = 0)
test1 = at11 %>%group_by(gender) %>% summarise(Attractive = mean(attr1_1), Sincere = mean(sinc1_1), Intelligent = mean(intel1_1), Fun = mean(fun1_1), Ambitious = mean(amb1_1), Interest = mean(shar1_1))
test1forplot = test1 %>% select(-gender)
maxmin = data.frame(Attractive = c(36, 0),Sincere = c(36, 0),Intelligent = c(36, 0),Fun = c(36, 0),Ambitious = c(36, 0),Interest = c(36, 0))
test11 = rbind(maxmin, test1forplot)
test11male = test11[c(1,2,4),]
test11female = test11[c(1,2,3),]

In [None]:
#Finally the fun part
radarchart(test11,pty = 32,axistype = 0,
           pcol = c(adjustcolor("hotpink1", 0.5), adjustcolor("cadetblue2", 0.5)),
           pfcol = c(adjustcolor("hotpink1", 0.5), adjustcolor("cadetblue2", 0.5)),
           plty = 1,
           plwd = 3,
           cglty = 1,
           cglcol = "gray88",
           centerzero = TRUE,
           seg = 5,
           vlcex = 0.75,
           palcex = 0.75)

legend("topleft", 
       c("Male", "Female"),
       fill = c(adjustcolor("cadetblue2", 0.5), adjustcolor("hotpink1", 0.5)))

### What do speed daters think their same-sex peers are looking for?

In [None]:
#Again get the data ready as above
at41= dat %>%group_by(gender) %>%select(iid, gender, attr4_1, sinc4_1, intel4_1, fun4_1, amb4_1, shar4_1) %>% unique()
at41[is.na(at41)] = 0
at41$total = rowSums(at41[,c("attr4_1", "sinc4_1", "intel4_1", "fun4_1", "amb4_1", "shar4_1")])
at41= at41 %>% filter(!total == "0")
at41$attr4_1 = round(at41$attr4_1/at41$total*100, digits = 2)
at41$sinc4_1 = round(at41$sinc4_1/at41$total*100, digits = 2)
at41$intel4_1 = round(at41$intel4_1/at41$total*100, digits = 2)
at41$fun4_1 = round(at41$fun4_1/at41$total*100, digits = 2)
at41$amb4_1 = round(at41$amb4_1/at41$total*100, digits = 2)
at41$shar4_1 = round(at41$shar4_1/at41$total*100, digits = 2)
at41$total = rowSums(at41[,c("attr4_1", "sinc4_1", "intel4_1", "fun4_1", "amb4_1", "shar4_1")])
at41$total = round(at41$total, digits = 0)
test4 = at41 %>%group_by(gender) %>% summarise(Attractive = mean(attr4_1), Sincere = mean(sinc4_1), Intelligent = mean(intel4_1), Fun = mean(fun4_1), Ambitious = mean(amb4_1), Interest = mean(shar4_1))
test4forplot =test4 %>% select(-gender)
test41 = rbind(maxmin, test4forplot)
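
Both cleanup cells repeat the same "divide by the row total and multiply by 100" step once per column. As a sketch only, and not the original author's method, the same idea could be written once with dplyr's `across()`. The data frame `toy` and the vector `cols` are made up for illustration, and the code assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Made-up scores that do not add up to 100 per person
toy <- tibble(
  iid = 1:2,
  attr = c(2, 5),
  sinc = c(3, 2.5),
  fun = c(5, 2.5)
)
cols <- c("attr", "sinc", "fun")

normalized <- toy %>%
  # Row total across the attribute columns
  mutate(total = rowSums(across(all_of(cols)))) %>%
  # Rescale every attribute column to a percentage of the row total
  mutate(across(all_of(cols), ~ round(.x / total * 100, digits = 2)))

normalized
```

The payoff is that adding or renaming an attribute only means changing `cols`, rather than editing six nearly identical lines.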

In [None]:
radarchart(test41,
           pty = 32,
           axistype = 0,
           pcol = c(adjustcolor("hotpink1", 0.5), adjustcolor("cadetblue2", 0.5)),
           pfcol = c(adjustcolor("hotpink1", 0.5), adjustcolor("cadetblue2", 0.5)),
           plty = 1,
           plwd = 3,
           cglty = 1,
           cglcol = "gray88",
           centerzero = TRUE,
           seg = 5,
           vlcex = 0.75,
           palcex = 0.75)

legend("topleft", 
       c("Male", "Female"),
       fill = c(adjustcolor("cadetblue2", 0.5), adjustcolor("hotpink1", 0.5)))

---

## Exercises

### Explore!

This exercise is more open-ended and lets you explore and find interesting things in a new data set. I added a few questions you can try to answer below, but feel free to explore more!

The data set has flavor profiles of cocoa beans from around the world.

In [None]:
#Load in the data
cocoa = read_csv("flavors_of_cacao.csv")
#Rename the fields to get rid of spaces
names(cocoa) = make.names(names(cocoa))
#Take a peek
head(cocoa)

### How many NA values are there? Remove them after you find them.

In [None]:
#Add code here

### Notice that many Bean Types are missing values, fill these missing values in with "Missing".
#### Hint: nchar() can be used to find the length of a string/character

In [None]:
#Add code here

### The Cocoa Percent field is formatted as a character, but we want to use it as a number. Convert this field by creating a new column called Cocoa.Percent.Int
#### Hint: You will need to use substr() and as.numeric()
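
To illustrate the hint without giving away the exercise, here is a sketch on a made-up character vector (`pct_text` is hypothetical and is not the cocoa data). It assumes each value ends in a single `%` character.

```r
# Made-up percentages stored as text
pct_text <- c("70%", "63.5%", "100%")

# Drop the last character (the "%"), then convert what is left to a number
pct_num <- as.numeric(substr(pct_text, 1, nchar(pct_text) - 1))
pct_num  # 70.0 63.5 100.0
class(pct_num)  # "numeric"
```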

In [None]:
library(stringr)

In [None]:
#Add code here

### Explore if there is a linear relationship between Rating and the newly created Cocoa Percent field (e.g. scatterplot!)

In [None]:
#Add code here

### Which company location has the highest average rating? How many cocoas were included in this rating?

In [None]:
#Add code here

### Perhaps the countries with lower ratings just have more cocoa ratings in general, which brings the average down. Try looking at the 6 countries that have the most ratings and see the distribution of the ratings for each.

In [None]:
#Add code here

### Investigate if certain Bean Types appear to be higher rated than others. How many ratings are there for each of these types? 

In [None]:
#Add code here

### Focusing on the bean type with the largest number of ratings, does it appear to be a normal distribution?

In [None]:
#Add code here