In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   2.0.2     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



### The Dataset

* YERockfish.csv contains measurements of fish collected along the Oregon coast

  * Yelloweye Rockfish (Sebastes ruberrimus)

* Length (length; to the nearest cm)
* Age (years)
* Maturity state of female fish (Immature or Mature)

* Read in the data as a tibble


In [2]:
# Write your code here
df = read_csv("data/YERockfish.csv")

This data contains a date column. R provides an easy way to manipulate dates.
  * The field first needs to be converted to a format that R recognizes as a date.
    * The `as.POSIXct()` function does this conversion. In this exercise it needs two arguments:
      * The column to convert
      * The format of the date, which is required because dates come in various shapes (e.g. yyyy/m/d, mm/dd/yy, etc.)

* Use the function `as.POSIXct()` to convert the date. The approach is exactly the same as converting to a factor (`as.factor()`) or an integer (`as.integer()`)
  * Here you can pass the following format string for the parameter `format`
    * `%m/%d/%Y`

* Bonus: use a `dplyr` pipeline to mutate the column
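
As a minimal sketch of the conversion described above, the snippet below applies `as.POSIXct()` inside a `mutate()` call. It runs on a small toy tibble (`toy`, a hypothetical stand-in for the real data) that reuses three date strings shown in this notebook. The `tz = "UTC"` argument is an assumption added here so the result does not depend on the local time zone.

```r
library(tibble)
library(dplyr)

# Toy rows in the same m/d/Y layout as the date column of the dataset
toy <- tibble(date = c("9/2/2003", "10/7/2002", "7/18/2000"))

# Overwrite the character column with a date-time (dttm) column
toy_converted <- toy %>%
  mutate(date = as.POSIXct(date, format = "%m/%d/%Y", tz = "UTC"))

head(toy_converted)
```

The same `mutate()` call should work on the full tibble once `toy` is swapped for the data read from the CSV file.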

In [4]:
head(df)

date,length,age,maturity,stage
<chr>,<dbl>,<dbl>,<chr>,<chr>
9/2/2003,31,10,Immature,1
10/7/2002,32,6,Immature,1
7/18/2000,32,11,Immature,1
6/11/2001,32,11,Immature,2
8/8/2000,32,13,Immature,2
10/4/2003,33,9,Immature,1


* Use the function `head()` on your modified `tibble` to make sure the column is now of type `dttm` (date and time) instead of `chr`

In [None]:
* Count the number of rows in your tibble

In [18]:
nrow(df)

* Plot the count of observations per year.
  * Note that you can easily parse the year from the date using the `year()` function, which takes a column of `dttm` datatype
  * The function `year()` is part of the `lubridate` package, which you may need to load with `library()`
* Hint 1: try the function `year()` on your date column
* Hint 2: Use `group_by()` to group the data and `n()` to count the number of entries in each group

* Bonus: use a `dplyr` pipeline to answer this question.

In [21]:
# Try year() on the date column
library(lubridate)
year(df$date)

In [23]:
# answer question here
df %>% 
  group_by(year(date)) %>%
    summarise(count=n())

`summarise()` ungrouping output (override with `.groups` argument)



year(date),count
<dbl>,<int>
2000,37
2001,14
2002,61
2003,42
2004,3
2008,1


* Remove all entries that belong to a year for which there are fewer than 5 entries
  * E.g. there is only one entry for 2008, so we can remove it.
* Save the data to a new tibble

In [34]:
# Scratch check: flag the years with fewer than 5 entries (uses `data`, defined in the next cell)
(data["count"] < 5)[,1]

In [50]:
# Number of observations per year
data = df %>% 
  group_by(year(date)) %>%
  summarise(count = n())
# Years with fewer than 5 observations
bad_years = data %>%
  filter(count < 5) %>%
  pull(`year(date)`)

`summarise()` ungrouping output (override with `.groups` argument)



In [59]:
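# Keep only the well-sampled years and save them to a new tibble
# (`df_filtered` is an illustrative name), then display it
df_filtered = df %>% 
  filter(!(year(date) %in% bad_years))
df_filtered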

date,length,age,maturity,stage
<dttm>,<dbl>,<dbl>,<chr>,<chr>
2003-09-02,31,10,Immature,1
2002-10-07,32,6,Immature,1
2000-07-18,32,11,Immature,1
2001-06-11,32,11,Immature,2
2000-08-08,32,13,Immature,2
2003-10-04,33,9,Immature,1
2000-07-17,33,10,Immature,1
2002-08-18,34,8,Immature,1
2000-07-12,34,10,Immature,1
2000-07-25,34,11,Immature,1
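
As an alternative sketch, the same filtering can be done without the intermediate vector of sparse years by grouping on the year and keeping only groups with at least 5 rows. The toy tibble `toy` and the column name `yr` are assumptions made for illustration.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Toy data: five observations in 2000 and a single one in 2008
toy <- tibble(
  date = as.POSIXct(c(rep("2000-07-18", 5), "2008-06-01"), tz = "UTC")
)

# Keep only the years that have at least 5 observations
toy_kept <- toy %>%
  group_by(yr = year(date)) %>%
  filter(n() >= 5) %>%
  ungroup()

nrow(toy_kept)
```

Inside a grouped `filter()`, `n()` is evaluated per group, so every row of a sparse year is dropped at once.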


* Count the number of entries and make sure there are fewer observations than before

In [67]:
# Count the rows that remain after dropping the sparse years
nrow(filter(df, !(year(date) %in% bad_years)))

* Model the fish maturity using the fish length
  * i.e. predict maturity from the length data

* You can make any changes to the data needed to build this model

In [None]:
### Write your code here

In [64]:
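# glm() needs a factor response; the first level (Immature) is treated as failure,
# so the model estimates the probability of Mature
df$maturity = as.factor(df$maturity)

# Logistic regression of maturity on length
glm_model_1 = glm(maturity~length,data=df,family=binomial)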
summary(glm_model_1)


Call:
glm(formula = maturity ~ length, family = binomial, data = df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.99494   0.01042   0.18713   0.35637   1.71514  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -16.94826    3.23395  -5.241 1.60e-07 ***
length        0.43718    0.07985   5.475 4.37e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.221  on 147  degrees of freedom
Residual deviance:  74.001  on 146  degrees of freedom
  (10 observations deleted due to missingness)
AIC: 78.001

Number of Fisher Scoring iterations: 7


In [None]:
* Generate a plot to show the data and the fit of the model (sigmoid)

In [None]:
# Write your code here
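
One way to draw the data together with the fitted sigmoid is sketched below. Because the CSV file is not bundled with this text, the example fits the model to simulated data (`toy`), which is not the real rockfish measurements. The simulation parameters are loosely based on the coefficients printed in the model summary and are assumptions made for illustration.

```r
library(ggplot2)

# Simulated stand-in for the rockfish data (NOT the real measurements)
set.seed(1)
toy <- data.frame(length = runif(150, 30, 55))
toy$maturity <- factor(
  ifelse(runif(150) < plogis(-17 + 0.44 * toy$length), "Mature", "Immature"),
  levels = c("Immature", "Mature")
)

fit <- glm(maturity ~ length, data = toy, family = binomial)

# Predicted probability of "Mature" on a fine grid of lengths
grid <- data.frame(
  length = seq(min(toy$length), max(toy$length), length.out = 200)
)
grid$prob <- predict(fit, newdata = grid, type = "response")

# Observations as 0/1 points, fitted curve as a line
p <- ggplot(toy, aes(x = length, y = as.numeric(maturity == "Mature"))) +
  geom_point(alpha = 0.4) +
  geom_line(data = grid, aes(x = length, y = prob)) +
  labs(x = "Length (cm)", y = "P(Mature)")
print(p)
```

With the real data, `toy` and `fit` would be replaced by the notebook's tibble and fitted model. Rows with a missing maturity would need to be dropped before plotting.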

* What is the length at which the probability of picking a mature fish is 0.5?
  * You can eyeball it, or you can compute it formally from the logistic regression

In [65]:
# Write your answer here
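
A sketch of the formal computation: the logistic model is logit(p) = b0 + b1 * length, and logit(0.5) = 0, so the length at which p = 0.5 is -b0 / b1. The snippet below plugs in the two coefficients printed in the model summary, copied by hand rather than extracted from the model object. The names `b0`, `b1` and `l50` are illustrative.

```r
# Coefficients copied from the printed model summary
b0 <- -16.94826  # intercept
b1 <- 0.43718    # slope for length

# logit(0.5) = 0, so solve b0 + b1 * length = 0 for length
l50 <- -b0 / b1
l50  # approximately 38.77 (cm)
```

In the notebook, the same values can be read from the fitted model with `coef(glm_model_1)` instead of being retyped.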

* Add an era column to your dataset such that
  * era has the value "pre_2002" if the year of the observation is before 2002
  * era has the value "2002 and after" otherwise

* Hint: see `if_else()` in dplyr
  * https://dplyr.tidyverse.org/reference/if_else.html

* Bonus: implement this operation using a `pipeline` and `mutate`

In [None]:
### Write your code here
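
A minimal sketch of the `if_else()` approach inside a `mutate()` pipeline, shown on a toy tibble (`toy`, hypothetical). The two label strings used here are illustrative; substitute whichever labels the exercise calls for.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Toy dates, two on each side of the 2002 cut-off
toy <- tibble(
  date = as.POSIXct(
    c("2000-07-18", "2001-06-11", "2002-10-07", "2003-09-02"),
    tz = "UTC"
  )
)

# Label each observation by whether its year falls before 2002
toy_era <- toy %>%
  mutate(era = if_else(year(date) < 2002, "pre_2002", "2002 and after"))

toy_era
```

Unlike base `ifelse()`, `if_else()` checks that both outcomes have the same type, which helps catch mistakes early.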

* Suppose you suspect that a major environmental stressor that occurred in 2002 had an impact on fish length and maturity
* Build a logistic regression for maturity as an outcome using `era` and `length` as predictor variables
  * Make sure your formula accounts for interactions

* Use an ANOVA to test whether maturity is a function of both length and era
* Does the maturity differ between the two eras?
  * i.e. is the era model coefficient significant?
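
The mechanics of the interaction model and its test can be sketched as follows. The data are simulated (`toy`, not the real measurements), with maturity depending on length only, so the era terms are not expected to matter in this toy run. For a `glm`, `anova()` produces a sequential analysis-of-deviance table. Passing `test = "Chisq"` adds the p-values.

```r
# Simulated stand-in data (NOT the real measurements); era is assigned at random
set.seed(42)
n <- 200
toy <- data.frame(
  length = runif(n, 30, 55),
  era = factor(sample(c("pre_2002", "2002 and after"), n, replace = TRUE))
)
toy$maturity <- factor(
  ifelse(runif(n) < plogis(-17 + 0.44 * toy$length), "Mature", "Immature"),
  levels = c("Immature", "Mature")
)

# length * era expands to length + era + length:era
fit_int <- glm(maturity ~ length * era, data = toy, family = binomial)

# Sequential analysis of deviance: one row per term added to the model
aov_tab <- anova(fit_int, test = "Chisq")
aov_tab

# Wald tests for the individual coefficients
summary(fit_int)$coefficients
```

The `era` and `length:era` rows of the table indicate whether adding era, and letting the slope differ by era, improves the fit beyond length alone.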

### Final note:

This practical is inspired by:
http://derekogle.com/IFAR/supplements/maturity/index.html#fitting-the-logistic-regression-model

The link above contains most answers, so please do not read it until you are done with this practical.