# Lecture 4.2: More about Exploratory Data Analysis

<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* The covariation between two variables
* Case Studies on EDA
    
This lecture note corresponds to Chapter 7 of your book.
</div>


In [37]:
library(tidyverse)
library(nycflights13)

## More about Histogram

Let us try to plot a histogram for the variable `dep_delay` in our flights data set. 
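A minimal sketch of this first attempt. The bin width of 10 minutes and the name `p_raw` are assumptions made here for illustration. Because `dep_delay` contains missing values, ggplot2 drops those rows and prints a warning when the plot is drawn.

```r
library(tidyverse)
library(nycflights13)

# Histogram of departure delay in minutes; binwidth = 10 is an assumed choice.
# Rows with a missing dep_delay are dropped with a warning when the plot is drawn.
p_raw <- ggplot(flights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 10)
p_raw
```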

There are 8255 rows with a missing `dep_delay`, so let us remove those rows before plotting the histogram.
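One way to count and then drop the missing values, as a sketch. The name `flights_no_na` and the bin width of 10 are illustrative choices, not taken from the lecture.

```r
library(tidyverse)
library(nycflights13)

# Count the rows with a missing departure delay.
flights %>%
  summarize(n_missing = sum(is.na(dep_delay)))

# Keep only the flights with a recorded departure delay.
flights_no_na <- flights %>%
  filter(!is.na(dep_delay))

# Same histogram as before, now without the missing-value warning.
ggplot(flights_no_na, aes(x = dep_delay)) +
  geom_histogram(binwidth = 10)
```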

Since we have already manually removed all of the missing values, `ggplot` will not output a warning message for us now.

Let us zoom into the left part of the plot. Let us only look at flights with departure delays of less than an hour.
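Zooming in can be sketched by filtering before plotting. The name `short_delays` is hypothetical, and a bin width of 5 minutes is assumed here.

```r
library(tidyverse)
library(nycflights13)

# Flights with a recorded departure delay of less than 60 minutes.
short_delays <- flights %>%
  filter(!is.na(dep_delay), dep_delay < 60)

ggplot(short_delays, aes(x = dep_delay)) +
  geom_histogram(binwidth = 5)
```

Filtering the data is one option; `coord_cartesian(xlim = ...)` would instead zoom the axes without discarding rows.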

We can look at the underlying bins and their count by using the `cut_width` function in ggplot2.

The `cut_width` function assigns each observation to a bin of a given width (five minutes here). Counting the rows in each bin then gives the heights of the histogram bars.
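A sketch of the bin counts, pairing `cut_width()` with `count()`. The names `short_delays`, `bin_counts`, and `bin` are chosen here for illustration.

```r
library(tidyverse)
library(nycflights13)

short_delays <- flights %>%
  filter(!is.na(dep_delay), dep_delay < 60)

# cut_width() turns dep_delay into a factor of 5-minute intervals;
# count() then tallies the number of flights in each interval.
bin_counts <- short_delays %>%
  count(bin = cut_width(dep_delay, width = 5))
bin_counts
```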

#### Remark: 
The appearance of a histogram does depend on your choice of the bin width. It is a good idea to try several values to see if different choices reveal different patterns.

We can also bring in a third variable to our histogram just like we did for `geom_bar` and others.

Let us bring in the categorical variable **carrier** and map the color aesthetic to it.
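A sketch of the histogram with carrier added. It assumes that `fill` is the aesthetic wanted for bars, since in ggplot2 `color` only changes the bar outline. The name `p_carrier` is illustrative.

```r
library(tidyverse)
library(nycflights13)

# Stacked histogram of short delays, with bars filled by carrier.
p_carrier <- flights %>%
  filter(!is.na(dep_delay), dep_delay < 60) %>%
  ggplot(aes(x = dep_delay, fill = carrier)) +
  geom_histogram(binwidth = 5)
p_carrier
```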



Oops! The legend is a bit crowded. Let us see who the major carriers are by number of flights.

Maybe let us just plot the histogram for the top five carriers. Let us find out which carriers those are, using the tools that we have learned so far.
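One possible way to find the five largest carriers by number of flights, as a sketch. The name `top5` is hypothetical.

```r
library(tidyverse)
library(nycflights13)

# Count flights per carrier, largest first, and keep the first five rows.
top5 <- flights %>%
  count(carrier, sort = TRUE) %>%
  head(5)
top5
```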

Now we can additionally filter out rows that do not belong to the top 5 carriers.

Hmmm... Maybe it is not a good idea to stick with histograms here. The plot is still too crowded and it is hard to see what is going on. So let us try a new geometry, **freqpoly**, which is like a histogram but draws lines instead of bars. Overlapping lines are easier to see than overlapping bars.
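A sketch of the frequency polygon restricted to the top five carriers. The names `top5` and `p_freq` and the bin width of 5 are assumptions made for this example.

```r
library(tidyverse)
library(nycflights13)

# The five carriers with the most flights.
top5 <- flights %>%
  count(carrier, sort = TRUE) %>%
  head(5)

# One line per carrier; for lines, the color aesthetic is the right one.
p_freq <- flights %>%
  filter(!is.na(dep_delay), dep_delay < 60, carrier %in% top5$carrier) %>%
  ggplot(aes(x = dep_delay, color = carrier)) +
  geom_freqpoly(binwidth = 5)
p_freq
```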

## Case Studies 

### Who is the greatest batter of all time?
The `Lahman` dataset contains information on baseball players.


In [11]:
install.packages("Lahman")




In [12]:
library(Lahman)
bat <- as_tibble(Batting) %>% print

# A tibble: 108,789 × 22
   playerID  yearID stint teamID lgID      G    AB     R     H   X2B   X3B    HR
   <chr>      <int> <int> <fct>  <fct> <int> <int> <int> <int> <int> <int> <int>
 1 abercda01   1871     1 TRO    NA        1     4     0     0     0     0     0
 2 addybo01    1871     1 RC1    NA       25   118    30    32     6     0     0
 3 allisar01   1871     1 CL1    NA       29   137    28    40     4     5     0
 4 allisdo01   1871     1 WS3    NA       27   133    28    44    10     2     2
 5 ansonca01   1871     1 RC1    NA       25   120    29    39    11     3     0
 6 armstbo01   1871     1 FW1    NA       12    49     9

Each row of the above data is data per player and per year.  For instance, let's take a look at the second row of our data.

Let's try to compute the career batting average for Bob Addy, defined as hits divided by at-bats. The playerID for Bob Addy is addybo01. The variable for hits is `H` and the variable for at-bats is `AB`.

Bob Addy was active in the years 1871-1877. During that time he had $118+51+152+213+310+142+245=1231$ at-bats, and $32+16+54+51+80+40+68=341$ hits. Therefore his career batting average was $341/1231=0.277$.
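The same calculation can be sketched in code. The column names `total_AB`, `total_H`, and `bat_avg` are chosen here for illustration, and the totals may differ slightly from the figures in the text if the installed Lahman release has revised its early-season records.

```r
library(tidyverse)
library(Lahman)

# Sum at-bats and hits over all of Bob Addy's rows, then divide.
addy <- as_tibble(Batting) %>%
  filter(playerID == "addybo01") %>%
  summarize(
    total_AB = sum(AB, na.rm = TRUE),
    total_H  = sum(H, na.rm = TRUE),
    bat_avg  = total_H / total_AB
  )
addy
```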

### Exercise
By appropriately grouping and summarizing the data, add up all the hits and at-bats for each player across all the years they played, and compute their career batting average. 

Which player(s) has the highest career batting average?

### Always include counts
It is a good idea to include counts of each group when you do a summary. Some groups may have very low numbers of observations, resulting in high variance for the summary statistics. 
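A sketch of a per-player summary that carries a count alongside the average. Here `n` is the number of rows per player (one per year and stint); the names `career`, `total_AB`, `total_H`, and `bat_avg` are illustrative.

```r
library(tidyverse)
library(Lahman)

career <- as_tibble(Batting) %>%
  group_by(playerID) %>%
  summarize(
    n        = n(),                     # number of rows behind each average
    total_AB = sum(AB, na.rm = TRUE),
    total_H  = sum(H, na.rm = TRUE),
    bat_avg  = total_H / total_AB       # NaN for players with zero at-bats
  ) %>%
  arrange(desc(bat_avg))
career
```

Players at the top of this table typically have very few at-bats, which is exactly why the count columns are worth keeping next to the average.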




What happens if we restrict our batting average calculation to players with at least 100 at-bats, and sort the result from the highest batting average down?

What is wrong with the above? Why are there so many repeated rows for the player cobbty01?
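The original cell is not shown, so the following is only a guess at how such repeated rows can arise. It assumes the career average was added with a grouped `mutate()` rather than `summarize()`, and that the 100 at-bat cutoff was applied to individual rows. Because `mutate()` keeps every row, a player's career `bat_avg` is repeated on each of his rows. The column name `bat_avg_year` matches the one used in the exercises; `per_year` and `cobb` are hypothetical names.

```r
library(tidyverse)
library(Lahman)

per_year <- as_tibble(Batting) %>%
  filter(AB >= 100) %>%                 # assumed: cutoff applied per row
  group_by(playerID) %>%
  mutate(
    bat_avg      = sum(H) / sum(AB),    # career value, repeated on every row
    bat_avg_year = H / AB               # value for that row only
  ) %>%
  ungroup() %>%
  arrange(desc(bat_avg)) %>%
  select(playerID, yearID, AB, H, bat_avg_year, bat_avg)

# Many rows for one player, all sharing a single bat_avg.
cobb <- per_year %>% filter(playerID == "cobbty01")
cobb
```

Replacing `mutate()` with `summarize()` would collapse each player to a single row.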

### Exercise
Output all the years in which cobbty01's batting average for that year is less than his `bat_avg` across all years. That is, we want to know in which years cobbty01 underperformed.

### Exercise
Now output all rows in which a player underperformed, that is, rows where `bat_avg_year` is less than `bat_avg`.

In other words, we want to know in which specific years each player underperformed.

### Exercise
Write code to find the player with the highest `bat_avg` for each team. You may find the function `slice_max` useful.

## Names of baseball players

Let us think more about names. Naming frequencies change a lot over time. There are 19617 baseball players in this data set. How have their names changed over time?

We are going to extract the first name and last name in our data set `Lahman::Master` by linking the playerID.
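A sketch of the linking step, assuming a recent Lahman release in which the player table is named `People` (older releases call it `Master`). The names `players` and `bat_names` are illustrative.

```r
library(tidyverse)
library(Lahman)

bat <- as_tibble(Batting)

# One row per player, with name and birth-year columns.
players <- as_tibble(People) %>%
  select(playerID, nameFirst, nameLast, nameGiven, birthYear)

# Attach the names to every batting row by matching on playerID.
bat_names <- bat %>%
  left_join(players, by = "playerID")
bat_names %>% select(playerID, yearID, nameFirst, nameLast)
```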

### Exercise
What were the top five most common first names for players born before 1900? After 1980?

One thing we notice is that there are a lot of nicknames. It might make more sense to look at the "given name", which is usually the first and middle names. To do this, we will need to split up these names. There is a built-in command for doing this in R:
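The command the lecture has in mind is not shown here. As one possibility, tidyr's `separate()` can split the `nameGiven` column at the first space. The sketch assumes the `People` table of a recent Lahman release, and the column names `given_first` and `given_rest` are hypothetical.

```r
library(tidyverse)
library(Lahman)

given <- as_tibble(People) %>%
  select(playerID, nameFirst, nameGiven) %>%
  separate(
    nameGiven,
    into   = c("given_first", "given_rest"),
    sep    = " ",
    extra  = "merge",   # keep everything after the first space together
    fill   = "right",   # single-word names get NA in given_rest
    remove = FALSE      # keep the original nameGiven column
  )
given
```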

### Exercise
Using given names instead of nicknames, what were the top five most common first names for players born before 1900? After 1980?

## Finding distinct values

Here's an example: suppose I want to know how many distinct values a variable takes. The `n_distinct()` function takes a vector of values and returns the number of distinct values:
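A small illustration on a made-up vector (the values are invented for this example):

```r
library(dplyr)

# Five values, three of them distinct.
x <- c("a", "b", "a", "c", "b")
n_distinct(x)
#> [1] 3
```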

### Exercise
How many distinct names were there among players born before 1900? After 1980?

Are there more or fewer unique names now than there were in the past? Let's consider the number of distinct names seen in each year:
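A sketch of that per-year count, under two assumptions: "year" is taken to mean birth year, and the player table is `People` from a recent Lahman release. The names `names_by_year`, `n_players`, and `n_names` are illustrative.

```r
library(tidyverse)
library(Lahman)

names_by_year <- as_tibble(People) %>%
  filter(!is.na(birthYear), !is.na(nameFirst)) %>%
  group_by(birthYear) %>%
  summarize(
    n_players = n(),                   # always include counts
    n_names   = n_distinct(nameFirst)  # distinct first names in that year
  )

ggplot(names_by_year, aes(x = birthYear, y = n_names)) +
  geom_line()
```

Because the number of players per birth year also changes over time, the ratio `n_names / n_players` may be a fairer comparison than the raw count.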