# Practicing Data Carpentry with `R`

Data carpentry covers a lot of ground. There is the concept of table structure, which we covered in the lab portion, and there is getting your variables into proper data types so that analysis and manipulation are simpler. This practice focuses on the latter and, more specifically, on datetime variables, because these are among the most difficult to work with.

We are moving away from the dummy data that we used in the lab and using a dataset of professional baseball players. Baseball data is abundant; it has been actively collected for over a century. The table structure of this particular dataset could be improved but, as mentioned above, we are going to focus on manipulating datetime objects.

We'll begin by importing our packages and reading in the data. Many of the functions we will be using today are from the `lubridate` package, which makes working with datetime objects a lot simpler (a lot, a lot). Read more about it [here](https://lubridate.tidyverse.org/).

In [1]:
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)

players <- read.csv('/dsa/data/all_datasets/baseball-databank/data/Master.csv')
head(players)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date



playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,⋯,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
<fct>,<int>,<int>,<int>,<fct>,<fct>,<fct>,<int>,<int>,<int>,⋯,<fct>,<fct>,<int>,<int>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
aardsda01,1981,12,27,USA,CO,Denver,,,,⋯,Aardsma,David Allan,220,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
aaronha01,1934,2,5,USA,AL,Mobile,,,,⋯,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,⋯,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
aasedo01,1954,9,8,USA,CA,Orange,,,,⋯,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01
abadan01,1972,8,25,USA,FL,Palm Beach,,,,⋯,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01
abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,⋯,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2015-10-03,abadf001,abadfe01


This dataset is rather wide, so much so that not all of the columns are rendered in JupyterHub. Fortunately, the variable names make it fairly clear what type of values they contain (e.g., `birthMonth` is the month in which the baseball player was born). Instead of viewing the contents of the dataset, let's just take a look at all of the column names to get a better idea of what we are working with.

In [2]:
names(players)

The `names()` function returns the column names of the data frame passed to it as a character vector. We are going to be looking at several of the date columns for this practice. Let's begin by looking at the data types of some of these columns.



In [3]:
head(players$finalGame)   # first 6 values (the default for head())
typeof(players$finalGame) # type of object

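Right now `typeof()` reports the values in the `finalGame` column as integers. That is because `read.csv()` loaded the column as a factor (note the `<fct>` tag in the output above), and a factor is stored as integer codes. In this format you can't do anything interesting with the values, like calculate the time between two dates or extract the day of the week. That's where `lubridate` comes in.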
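
To see why a factor reports an integer type, here is a minimal sketch using base R only. The `final_game` vector is a hypothetical stand-in for the real column, with values copied from the first rows shown above.

```r
# Toy stand-in for the finalGame column (not the real data)
final_game <- factor(c("2015-08-23", "1976-10-03", "1971-09-26"))

typeof(final_game)   # "integer": the storage mode of the level codes
class(final_game)    # "factor": what the object actually is

# Convert to character if you need the date text itself
final_game_chr <- as.character(final_game)
final_game_chr
```

The distinction matters because `typeof()` describes how the values are stored, while `class()` describes how R treats them.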

First, let's demonstrate how you can change something into an interpretable date object. The following three flexible functions make handling different date formats extremely simple.

In [4]:
dmy('01-08-1800') # August 1st, 1800
ymd('2016/9/12')  # September 12th, 2016
mdy('May 12th, 2012') # May 12th, 2012

As you can see, the three examples are meant to handle dates that are written in different orders. In each function name, d is for day, m is for month, and y is for year. Putting the three letters together in the order they appear in the input renders a date object with the proper date.
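
As a quick check of that idea, the sketch below writes one calendar day in three different orders and confirms that all three parsers agree. The variable names are only for illustration.

```r
library(lubridate)

# The same calendar day (September 12th, 2016) written three ways
a <- dmy("12-09-2016")
b <- ymd("2016/9/12")
c_date <- mdy("09/12/2016")

class(a)      # "Date"
a == b        # the day-first and year-first parses agree
b == c_date   # and so does the month-first parse
```

Choosing the wrong function for an ambiguous string such as `09/12/2016` would silently give a different date, so it is worth checking a few values against the source data.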

These functions can also be run over data frame columns.
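
Here is a minimal sketch of that on a hypothetical two-row data frame named `toy`, with values copied from the first rows of the dataset. The `careerDays` column is an invented name used only to show what becomes possible once both columns are dates.

```r
library(lubridate)

# Hypothetical toy data frame (not the real players table)
toy <- data.frame(
  debut     = c("2004-04-06", "1954-04-13"),
  finalGame = c("2015-08-23", "1976-10-03"),
  stringsAsFactors = FALSE
)

# Parse each text column into Date objects
toy$debut     <- ymd(toy$debut)
toy$finalGame <- ymd(toy$finalGame)

# Subtracting two Date columns gives the elapsed time in days
toy$careerDays <- as.numeric(toy$finalGame - toy$debut)
toy
```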

**Exercise 1**: *Overwrite the `players$finalGame` column with a date object in year-month-day format using a `lubridate` function.*

In [8]:
# Code for Exercise 1 goes here 
# -----------------------------
players$finalGame <- ymd(players$finalGame)
print(players$finalGame)

    [1] "2015-08-23" "1976-10-03" "1971-09-26" "1990-10-03" "2006-04-13"
    [6] "2015-10-03" "1875-06-10" "1910-09-15" "1896-09-23" "1897-08-19"
   [11] "1890-05-23" "1905-09-20" "1984-08-08" "2001-09-29" "1999-07-21"
   [16] "2001-04-13" "1996-08-24" "1910-10-15" "2004-08-07" "1957-09-11"
   [21] "1871-10-21" "2008-09-28" "1952-09-27" "2005-09-29" "1944-04-29"
   [26] "1972-09-30" "1947-04-17" "1949-05-09" "1911-05-05" "1992-10-03"
   [31] "1956-05-09" "1923-05-04" "1985-10-03" "2014-09-28" "1942-07-11"
   [36] "2015-10-03" "2011-09-27" "2014-07-28" "2009-07-25" "1910-06-02"
   [41] "2012-09-27" "2014-06-02" "2005-09-29" "2003-08-05" "2015-10-04"
   [46] "1992-06-14" "1959-09-20" "2015-10-04" "1964-05-09" "1975-05-05"
   [51] "1972-10-04" "1922-05-12" "2012-10-03" "1918-09-02" "1997-05-10"
   [56] NA           NA           "1970-05-03" "1931-09-07" "2015-10-04"
   [61] "1946-04-24" "2015-09-30" "1926-08-11" "1919-09-28" "1925-09-23"
   [66] "1932-09-17" "1959-04-22" "1977-09-29" "194

**Exercise 2**: *Now convert `players$debut` to a date object with the same year-month-day format using `lubridate`.*

In [9]:
# Code for Exercise 2 goes here 
# -----------------------------
players$debut <- ymd(players$debut)


Below is a good example of the functionality that `lubridate` provides for date objects. Suppose we want to know on which day of the week most players debuted. Once the column is a date object, we can run the `wday()` function on it to extract the day of the week. By default the numbers run from 1 (Sunday) to 7 (Saturday).

In [10]:
players$debut <- ymd(players$debut)
head(wday(players$debut), 10)
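
For a single value, the sketch below shows the numeric result next to the labelled one. The date is the debut from the first row of the data, and the numbering assumes `lubridate`'s default week start of Sunday.

```r
library(lubridate)

d <- ymd("2004-04-06")    # a Tuesday

wday(d)                   # 3, because Sunday = 1 by default
wday(d, label = TRUE)     # ordered factor with an abbreviated day name
```

The labelled form is often easier to read in summaries and plots, since it keeps the days in calendar order rather than alphabetical order.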

**Exercise 3**: *Create a column in the `players` data frame called `debutMonth` and assign the month name (as in 1 = Jan, 2 = Feb and so on...) of the `players$debut` column. (Hint: find a function similar to the `wday()` function and take a look at the `label` parameter)*

In [14]:
# Code for Exercise 3 goes here 
# -----------------------------
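# Note: this displays the new column but does not store it;
# assign the result back to `players` to keep debutMonth
players %>% mutate(debutMonth = month(debut, label = TRUE))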


playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,⋯,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,debutMonth
<fct>,<int>,<int>,<int>,<fct>,<fct>,<fct>,<int>,<int>,<int>,⋯,<fct>,<int>,<int>,<fct>,<fct>,<date>,<date>,<fct>,<fct>,<ord>
aardsda01,1981,12,27,USA,CO,Denver,,,,⋯,David Allan,220,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,Apr
aaronha01,1934,2,5,USA,AL,Mobile,,,,⋯,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01,Apr
aaronto01,1939,8,5,USA,AL,Mobile,1984,8,16,⋯,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01,Apr
aasedo01,1954,9,8,USA,CA,Orange,,,,⋯,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01,Jul
abadan01,1972,8,25,USA,FL,Palm Beach,,,,⋯,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01,Sep
abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,⋯,Fernando Antonio,220,73,L,L,2010-07-28,2015-10-03,abadf001,abadfe01,Jul
abadijo01,1854,11,4,USA,PA,Philadelphia,1905,5,17,⋯,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,Apr
abbated01,1877,4,15,USA,PA,Latrobe,1957,1,6,⋯,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01,Sep
abbeybe01,1869,11,11,USA,VT,Essex,1962,6,11,⋯,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01,Jun
abbeych01,1866,10,14,USA,NE,Falls City,1926,4,27,⋯,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01,Aug


Remember back to the lab, where we used the `unite()` function to combine three columns into one because together they formed a date? This data frame splits each player's birth date in a similar way: the `birthYear`, `birthMonth`, and `birthDay` variables are all separate columns. `lubridate` provides a function for building a date from separate inputs: `make_date()` takes a year, a month, and a day and constructs a date.

Let's try it out for leap day 2012.

In [15]:
make_date(2012,2,29)

And just for fun, let's try a leap day that doesn't exist...

In [16]:
make_date(2013,2,29)

`lubridate` keeps leap years in mind.
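
A related check, sketched here with two other `lubridate` functions rather than `make_date()` itself: `leap_year()` reports whether a year has a February 29th, and the `ymd()` parser returns `NA` for a day that does not exist.

```r
library(lubridate)

leap_year(2012)   # TRUE
leap_year(2013)   # FALSE

# Parsing a nonexistent day yields NA (quiet = TRUE hides the warning)
bad <- ymd("2013-02-29", quiet = TRUE)
is.na(bad)
```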

**Exercise 4**: *Create a column called `players$birthdate` using the `make_date()` function.*

In [23]:
# Code for Exercise 4 goes here 
# -----------------------------
players$birthdate <- make_date(players$birthYear, players$birthMonth,players$birthDay)
players

playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,⋯,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,birthdate
<fct>,<int>,<int>,<int>,<fct>,<fct>,<fct>,<int>,<int>,<int>,⋯,<fct>,<int>,<int>,<fct>,<fct>,<date>,<date>,<fct>,<fct>,<date>
aardsda01,1981,12,27,USA,CO,Denver,,,,⋯,David Allan,220,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,1981-12-27
aaronha01,1934,2,5,USA,AL,Mobile,,,,⋯,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01,1934-02-05
aaronto01,1939,8,5,USA,AL,Mobile,1984,8,16,⋯,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01,1939-08-05
aasedo01,1954,9,8,USA,CA,Orange,,,,⋯,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01,1954-09-08
abadan01,1972,8,25,USA,FL,Palm Beach,,,,⋯,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01,1972-08-25
abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,⋯,Fernando Antonio,220,73,L,L,2010-07-28,2015-10-03,abadf001,abadfe01,1985-12-17
abadijo01,1854,11,4,USA,PA,Philadelphia,1905,5,17,⋯,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,1854-11-04
abbated01,1877,4,15,USA,PA,Latrobe,1957,1,6,⋯,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01,1877-04-15
abbeybe01,1869,11,11,USA,VT,Essex,1962,6,11,⋯,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01,1869-11-11
abbeych01,1866,10,14,USA,NE,Falls City,1926,4,27,⋯,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01,1866-10-14


**Exercise 5**: *Now using the `players$birthdate` column you just created, find the month that most players were born in. (Hint: There are several ways to do this with `dplyr`)*

In [35]:
# Code for Exercise 5 goes here 
# -----------------------------
players %>%
  group_by(month(birthdate, label = TRUE)) %>%
  summarise(frequency = n()) %>%
  arrange(frequency) %>%
  tail(1)
paste("August is the most frequent birthmonth")

“Factor `month(birthdate, label = TRUE)` contains implicit NA, consider using `forcats::fct_explicit_na`”

"month(birthdate, label = TRUE)",frequency
<ord>,<int>
Aug,1793
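
The warning above most likely comes from rows whose birth date is missing. One alternative approach, sketched on a hypothetical toy data frame rather than the real `players` table, is to drop those rows first and then let `count()` do the grouping and sorting. The names `toy`, `birthMonth`, and `birth_counts` are assumptions made for this illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data; the NA mimics a player with no recorded birth date
toy <- data.frame(
  birthdate = as.Date(c("1981-12-27", "1934-02-05", "1939-08-05", "1972-08-25", NA))
)

birth_counts <- toy %>%
  filter(!is.na(birthdate)) %>%              # drop missing dates first
  mutate(birthMonth = month(birthdate)) %>%  # numeric month, 1 to 12
  count(birthMonth, sort = TRUE)             # most frequent month first

head(birth_counts, 1)
```

Sorting in descending order with `sort = TRUE` puts the answer in the first row, so `head()` can replace the `arrange()` and `tail()` pair.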


# Save your Notebook, then `File > Close and Halt`