Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New post: data science projects / github workflow #20

Merged
merged 5 commits into from Nov 8, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
173 changes: 173 additions & 0 deletions content/post/2020-07-31-thats-what-nicola-said.en-us.markdown
@@ -0,0 +1,173 @@
---
title: That's what Nicola said
author: Mine Cetinkaya-Rundel
date: '2020-07-31'
slug: thats-what-nicola-said
categories:
- musings
tags:
- web scraping
- sentiment analysis
- text mining
- tidyverse
editor_options:
markdown:
wrap: sentence
---

For the last four months, just about every week day at 12:30pm I have been watching the Fist Minister, Nicola Sturgeon, deliver her daily COVID-19 update.
If I've chatted with you about COVID during this time span, you have probably heard me say that I am *very* impressed by the way she delivers these updates daily.
I'll be honest, they are almost boring, but in the best possible way.
The last thing I want from a leader at this time is surprises, showmanship, or claims with no scientific basis.

<!-
-more-->

About a few weeks into the daily updates, I realized that the text for these speeches are published [on the Scottish Government website](https://www.gov.scot/collections/first-ministers-speeches/ "Texts of First Minister's COVID-19 speeches").
So, naturally, I downloaded the data and started analyzing it.
You can find the full analysis at [github.com/mine-cetinkaya-rundel/fm-speeches-covid19](https://github.com/mine-cetinkaya-rundel/fm-speeches-covid19 "GitHub repo analysing First Minister's COVID-19 speeches").
In this blog post I'll walk through parts of the analysis and show share some visualisations of the First Minister's speeches and how they compare to the speeches by the UK government.
The text of these speeches can be found [here](https://www.rev.com/blog/transcript-tag/united-kingdom-coronavirus-briefing-transcripts "Texts of UK Government COVID-19 speeches").

## Get First Minister's speeches

First, I checked to make sure we can, indeed, scrape these data with [`robotstxt::paths_allowed()`](https://docs.ropensci.org/robotstxt/reference/paths_allowed.html "robotstxt::paths_allowed()").


```r
robotstxt::paths_allowed("https://www.gov.scot/publications")
```

```
##
www.gov.scot
```

```
## [1] TRUE
```

For scraping the data and structuring it as a data frame, I used the [tidyverse](https://tidyverse.org/), [rvest](http://rvest.tidyverse.org/), and [lubridate](https://lubridate.tidyverse.org/) packages and the [here](https://here.r-lib.org/) package for proper file paths when saving the scraped data.


```r
library(tidyverse)
library(rvest)
library(lubridate)
library(here)
```

All speeches are linked from a single webpage.


```r
all_speeches_page_scot <- read_html("https://www.gov.scot/collections/first-ministers-speeches/")
```

Once we read this page, we can extract the URLs of the individual speeches using the [SelectorGadget](http://rvest.tidyverse.org/articles/selectorgadget.html) to locate the nodes and a little bit of string manipulation.


```r
covid_speech_urls_uk_scot <- all_speeches_page_scot %>%
html_nodes(".collections-list a") %>%
html_attr("href") %>%
str_subset("covid-19") %>%
str_c("https://www.gov.scot", .)

head(covid_speech_urls_uk_scot)
```

```
## [1] "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-28-august-2020/"
## [2] "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-27-august-2020/"
## [3] "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-25-august-2020/"
## [4] "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-24-august-2020/"
## [5] "https://www.gov.scot/publications/coronavirus-covid-19-first-ministers-speech-21-august-2020/"
## [6] "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-20-august-2020/"
```

Next, we define a function that we can use to scrape each page and save its contents as a tibble with columns for `title`, `date`, `location`, `abstract`, `text`, and `url`.


```r
scrape_speech_scot <- function(url){

speech_page <- read_html(url)

title <- speech_page %>%
html_node(".article-header__title") %>%
html_text()

date <- speech_page %>%
html_node(".content-data__list:nth-child(1) strong") %>%
html_text() %>%
dmy()

location <- speech_page %>%
html_node(".content-data__list+ .content-data__list strong") %>%
html_text()

abstract <- speech_page %>%
html_node(".leader--first-para p") %>%
html_text()

text <- speech_page %>%
html_nodes("#preamble p") %>%
html_text() %>%
list()

tibble(
title = title,
date = date,
location = location,
abstract = abstract,
text = text,
url = url
)

}
```

And finally, map the function over the elements of `covid_speech_urls_uk_scot` that we defined earlier.
I'm using `map_dfr()` here since we want the result to be a data frame.


```r
covid_speeches_scot <- map_dfr(
covid_speech_urls_uk_scot,
scrape_speech_scot
)
```

This can take a little bit of time to run, so each time I've done this analysis, I save the result as an RDS file.
Note the use of `here::here()` to save the data in a folder called data in my project.
If you don't use the here package already, I strongly recommend [this blog post](https://malco.io/2018/11/05/why-should-i-use-the-here-package-when-i-m-already-using-projects/ "Why should I use the here package when I'm already using projects?") by Malcolm Barrett to convince yourself that it's a good idea!


```r
write_rds(
covid_speeches_scot,
path = here::here("data", "covid-speeches-scot.rds")
)
```

You can find the data as of July 30, 2020 [here](https://github.com/mine-cetinkaya-rundel/fm-speeches-covid19/blob/master/data/covid-speeches-scot.rds "First Minister's speeches data as of July 30, 2020").

This is a recipe I use very regularly for web scraping many pages:

- write a function to scrape a single page

- map the function over the list of URLs for the many pages you want to scrape with `map_dfr()`

- end up with a tibble you can save and analyse

I also think that web scraping is a great motivation for introducing iteration and the purrr package to new learners.[^1]

[^1]: Mine Dogucu and I recently published a paper on webscraping where discuss this approach for teaching web scraping and iteration, with an example for scraping data from [OpenSecrets](https://www.opensecrets.org/ "OpenSecrets.org").
See [here](https://www.tandfonline.com/doi/full/10.1080/10691898.2020.1787116 "Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities") for the paper (open access) and [here](https://github.com/mdogucu/web-scrape "GitHub repo: Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities") for the repo with the code for that example.

## Analyze First Minister's speeches

Alright, now let's take a look at

## Get UK Government's speeches
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
136 changes: 136 additions & 0 deletions content/post/2020-11-08-data-science-project-proposals/index.Rmd
@@ -0,0 +1,136 @@
---
title: GitHub workflow for data science project proposals
author: Mine Cetinkaya-Rundel
date: '2020-11-08'
slug: index
categories:
- teaching
tags:
- rstats
- github
---

Over the past few years I've been working on moving from a mindset of end-of-semester project to semester-long project. Inevitably students end up doing lots of work as the deadline approaches at the end of the semester (and I can't blame them, that's how I work around deadlines too, and how just about anyone I know works), but creating opportunities for them to get started on their projects earlier in the semester is very important.

<!--more-->

Over the past few years I've been working on moving from a mindset of end-of-semester project to semester-long project. Inevitably students end up doing lots of work as the deadline approaches at the end of the semester (and I can't blame them, that's how I work around deadlines too, and how just about anyone I know works), but creating opportunities for them to get started on their projects earlier in the semester is very important.

A natural way of doing this is with a project proposal about mid-way through the semester. A challenging aspect of this is that students have to propose a project before they've learned all methods and techniques, but there is simply no way to do it any other way, and I find that providing feedback along the lines of *"and we'll also learn about X which you can apply here"* can help mitigate this issue.

This blog post is about how I implement project proposals in my Introduction to Data Science course. This is the second year I'm using this approach fully, and I feel like it's working quite well. Two important details about the course that relate to projects are:

- students learn R and turn in all assignment write-ups as R Markdown documents in GitHub repositories, and
- students work in teams on weekly computing labs as well as the semester-long project.

You can read about the curriculum [here](https://datasciencebox.org/), see the course webpage for this semester [here](https://introds.org/), and specifically the project instructions [here](https://www.introds.org/#project).

## Project proposal assignment

I create a GitHub repository for each team with the following structure.

```
.github/
|-- ISSUE_TEMPLATE/
|-- proposal-feedback.md
|-- workflows/
|-- allowed-files.yaml
|-- knit.yaml
data/
|-- README.md
presentation/
|-- presentation.Rmd
proposal/
|-- proposal.Rmd
README.md
README.Rmd
```

For this phase students are only expected to work in the `data/` and `proposal/` folders.

- Any data used for the project gets placed in the `data/` folder and the codebook for the dataset(s) are added to the `README.md` file in that folder.
- The proposal write-up goes in `proposal.Rmd` and ultimately students knit this document and push it to their repos along with the output, `proposal.md`.

## Project proposal review

The proposals are marked out of 10 points, since they make up 10% of the overall project score. A breakdown of the 10 points is outlined below. There is a lot more that goes into the evaluation than just these bullet points, especially because these are open-ended projects and we're encouraging students to be creative with their projects, but these were points I kept in mind to ensure consistency in grading.

- (3 pts) `data/`

- Contains dataset(s) to be used in the project
- Instructions removed from `README` and codebook added
- Metadata on the data clearly stated, e.g. "The dataset has/contains/is >comprised of/etc. R observations, each representing [...], and C columns."

- (5 pts) `proposal/`
- (1 pt) Structure:
- `.Rmd` file is updated and is in the repo
- `.md` file is updated and is in the repo
- Figures are visible in the `.md` file
- Uses section headings to organise each part
- Contents:
- (1 pt) Part 1: Introduction
- Research question(s) clear
- Cases are stated
- Variables to be used are explained
- (1 pt) Part 2: Data
- `glimpse()` or `skim()` output available
- (2 pts) Part 3: Data analysis plan
- Response and explanatory variable(s) clear
- Comparison groups clear
- Preliminary EDA results and interpretation
- Stats methods to be used outlined
- Hypothesized results stated and inline with / can reasonably be supported by the preliminary EDA

- (1 pt) Workflow:
- Data read in from `data/` folder
- Reasonable number of commits
- Meaningful commit messages (or at least not an abundance of not meaningful ones)
- No unexpected/disallowed files

- (1 pt) Teamwork: All members committed to the repo

After the proposal deadline I evaluated each of the proposals -- there are 65 teams in the course, so this was no small endeavour!

After reading through a few projects I took the time to write issue templates for common feedback and pushed the templates to teams' repos using [`ghclass::repo_add_file()`](https://rundel.github.io/ghclass/reference/repo_file.html), which improved my process for giving feedback and greatly reduced the time each review takes. In many cases I didn't use these issue templates verbatim, but it was nice to have starter text for things I wanted to point out like how to structure the codebook, reading in the data from the data folder (as opposed to replicating it in the proposal folder), structuring the research question, etc. Note that I also used a template for *the* issue containing marks and information on the next steps. This included instructions for how to close issues with commits, which students hadn't previously done in class.

I use issue templates for providing feedback for all assignments in the course -- it's a convenient place to add a rubric that all markers can follow and that can be transparently shared with the students along with their final marks. It also makes the markers' jobs easier and helps keep feedback consistent across multiple markers. See [here](https://docs.github.com/en/free-pro-team@latest/github/building-a-strong-community/configuring-issue-templates-for-your-repository) if you want to learn more about GitHub issue templates.

## Project re-proposal

When reviewing proposals, I like being as constructive as possible with my comments as students will be working on their projects for the remainder of the course and the project makes up 20% of their overall marks (so it's no small deal for them). The last thing I want is for students to assume they were on the right track, only to find out at the end that they weren't.

I also want to make sure students actually review the feedback and feel motivated to improve their projects based on the feedback. To address both of these goals, we also have a re-proposal stage. In this stage students are asked to address each of the issues on their proposal, close them with specific commits, and finally tag me on their repos to say they're ready for another look at their work.

<br>

```{r echo=FALSE, fig.cap = "Anonimized issue where the team lets me know their proposal is ready for another look."}
knitr::include_graphics("tagged.png")
```

This stage is optional, so teams who feel happy with the score they got can just take the feedback and not put in the work right away to improve their projects. But for teams who want to improve their score, it gives them an opportunity to do so. Their final proposal score is the average of the scores they receive in the two phases. Based on past experience, most teams get close to a 10/10 at the re-proposal stage.

From the instructor's perspective it's very fulfilling to see students deliberately work on specific improvements to their projects. And while I don't have student feedback on this (yet, I'll ask students about it at the end of the semester), students' commit messages suggest that they're, at least, not frustrated by the experience.

<br>

```{r echo=FALSE, fig.cap = "Anonimized student commits, each color represents a different student in the team.", out.width="80%"}
knitr::include_graphics("commits.png")
```

Students closing these issues one-by-one (roughly 5 issues per repo) generates a lot of emails for me, but that was honestly a nice distraction from election doomscrolling!

<br>

```{r echo=FALSE, fig.cap = "Anonimized issues, each color represents a different team's repo."}
knitr::include_graphics("issues.png")
```

I also used this week's code-along session to go over some of the common issues I observed in the proposals as well as to demo how to close issues with commits. Even though I wrote out instructions for this, it seemed worthwhile to demo the process since they hadn't done this before. I also demoed how to use [`here::here()`](https://here.r-lib.org/reference/here.html) to define the path to data. So far in the class we didn't really need to use this approach because student assignments were always structured with the `data/` folder at the same level as the R Markdown file they write in. The folder structure I outlined above created challenges when knitting the R Markdown document vs. running the code in the console. You can find a recording of the code-along session [here](https://youtu.be/ZQke97ewn8Q).^[I [tweeted before](https://twitter.com/minebocek/status/1308714732950630404?s=20) that my fake student account is named Florence Nightingale. The fake student team is called Florence and the Machine. I'm pretty proud of this, and it gets a chuckle from the students 😃.]

## What's next?

For me, I need to re-read a bunch of project proposals, which sounds draining, but I know students' projects will be better for it at the end. Knowing that it will help them (and, let's be frank, knowing that it will help generate better project presentations to listen to and mark at the end of the semester) is what motivates me to do it.

For the students, once they get their re-proposal feedback, they'll move on to making progress on their projects. After the next project milestone in a couple weeks, students will do peer review on each others' work. The final deliverables for the project are a 5 minute presentation (and a slide deck made with [**xaringan**](https://slides.yihui.org/xaringan/#1)) and an "executive summary" which goes in the README of their project repo. Before they finish their projects they will enable ghpages for their repos, giving them a landing page for their project with their summary and a link to their slides. And at the end of the semester teams will have the option to make their project repos public.^[Think of it as a souvenir from the course!]

This has been a nice distraction from marking those re-proposals. I better get back to it...