# Wrangling and Visualizing Musical Data

In [None]:
#To be able to run tests locally in the notebook
# you need to install the following:
# install.packages("devtools")
# install.packages("testthat")
# devtools::install_github('datacamp/IRkernel.testthat')

# This allows .... to be used as placeholder value in the sample code cells
.... <- NULL 

## 1. Introduction

How do musicians choose the chords they use in their songs? Do guitarists, pianists, and singers gravitate towards different kinds of harmony?

We can uncover trends in the kinds of chord progressions used by popular artists by analyzing the harmonic data provided in the [McGill Billboard Dataset](http://ddmal.music.mcgill.ca/research/billboard). This dataset includes professionally tagged chords for several hundred pop/rock songs representative of singles that made the Billboard Hot 100 list between 1958 and 1991. Using the data-wrangling tools available in the `dplyr` package, and the visualization tools available in the `ggplot2` package, we can explore the most common chords and chord progressions in these songs, and contrast the harmonies of some guitar-led and piano-led artists to see where the "affordances" of those instruments may affect the chord choices artists make.

Read in the McGill Billboard chord dataset.

- Load in the `tidyverse` package.
- Read in `'datasets/bb_chords.csv'` using `read_csv` and assign it to `bb`.
- Display the first rows of `bb`.

<hr>

The `tidyverse` package is a meta-package that includes both `dplyr` and `ggplot2`, which will be used throughout the project. It also includes the `readr` package, which provides the `read_csv` function.

Make sure to use `read_csv` (with an *underscore*) to read in the data. The `read.csv` function, which is built into R, has a number of problems which the new `read_csv` function avoids.
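
One concrete difference between the two functions can be seen with a tiny file. The sketch below writes a hypothetical two-row CSV to a temporary location (the column names and values are made up for illustration) and reads it both ways. It shows only one difference, column-name handling and the returned class, and is not a full list of the problems mentioned above.

```r
library(readr)

# Hypothetical two-row file, written to a temporary location for this demo
tmp <- tempfile(fileext = ".csv")
writeLines(c("chord quality,title", "maj,Song A", "min,Song A"), tmp)

base_df  <- read.csv(tmp)                     # base R
readr_df <- read_csv(tmp, col_types = cols()) # readr; cols() lets readr guess types quietly

names(base_df)   # base R rewrites "chord quality" as "chord.quality"
names(readr_df)  # readr keeps the original column name
class(readr_df)  # includes "tbl_df", the class the project's first test checks for
```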

## Good to know

This project assumes familiarity with standard tidyverse tools for R, such as `dplyr`, `ggplot2`, and the pipe operator (`%>%`). Before taking on this project, we recommend that you complete the following courses:
  - [Introduction to the Tidyverse](https://www.datacamp.com/courses/introduction-to-the-tidyverse)
  - [Data Visualization with ggplot2 (Part 2)](https://www.datacamp.com/courses/data-visualization-with-ggplot2-2)

RStudio has created some very helpful cheat sheets for working in the tidyverse, including two that will be helpful for this project: [Data Wrangling](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) and [Data Visualization with ggplot2](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf). If you're a serious data wrangler, you might even print them out and laminate them!

If you load in the `tidyverse` package:

```r
library(tidyverse)
```

You can use the `read_csv` function to read in data stored in `csv` files like this:

```r
my_data <- read_csv("path_to/my_data.csv")
```

To display the first rows of a data frame you can use the `head` function:

```r
head(my_data)
```
or `slice`:

```r
my_data %>%
  slice(1:10)
```

In [None]:
# Loading the tidyverse meta-package
# .... YOUR CODE FOR TASK 1 ....

# Reading in the McGill Billboard chord data
bb <- ....

# Taking a look at the first rows in bb
# .... YOUR CODE FOR TASK 1 ....

In [None]:
# Loading the tidyverse meta-package
library(tidyverse)

# Reading in the McGill Billboard chord data
bb <- read_csv("datasets/bb_chords.csv")

# Taking a look at the first rows in bb
head(bb)

In [None]:
# These packages need to be loaded in the first `@tests` cell. 
library(testthat) 
library(IRkernel.testthat)

run_tests({
    test_that("Read in data correctly.", {
        expect_is(bb, "tbl_df", 
            info = 'You should use read_csv (with an underscore) to read "datasets/bb_chords.csv" into bb')
    })
    
    test_that("Read in data correctly.", {
        bb_correct <- read_csv('datasets/bb_chords.csv')
        expect_equivalent(bb, bb_correct, 
            info = 'bb should contain the data in "datasets/bb_chords.csv"')
    })
})

## 2. The most common chords

As seen in the previous task, this is a *tidy* dataset: each row represents a single observation, and each column a particular variable or attribute of that observation. Note that the metadata for each song (title, artist, year) is repeated for each chord -- like "I Don't Mind" by James Brown, 1961 -- while the unique attributes of each chord (chord symbol, chord quality, and analytical designations like integer and Roman-numeral notation) are included once for each chord change.

A key element of the style of any popular musical artist is the kind of chords they use in their songs. But not all chords are created equal! In addition to differences in how they sound, some chords are simply easier to play than others. On top of that, some chords are easier to play on one instrument than they are on another. And while master musicians can play a wide variety of chords and progressions with ease, it's not a stretch to think that even the best musicians may choose more "idiomatic" chords and progressions for their instrument.

To start to explore that, let's look at the most common chords in the McGill Billboard Dataset.

Find the most common chords in the McGill Billboard Dataset.

- Count the number of occurrences of each raw `chord` type in the dataset (`bb`) using `count()`, and sort the results from most common (highest count) to least common (lowest count).
- Store the result in `bb_count`.
- Display the 20 most common chords.

<hr>

For readability (and to do things the tidyverse way!), try to write your code as a string of verb-based commands, one command per line, connected by `%>%`.

To count the results and sort in descending value, try:

```r
bb %>%
  count(chord, sort = TRUE)
```
To return the first `x` results, you can index by range: `bb_count[1:x,]`. Don't forget the comma!

To return the first `x` results without breaking the string of piped commands, try piping `bb_count` to `slice(1:x)`. Note the absence of a comma!
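
To see both approaches side by side, here is a minimal sketch on a made-up table (`toy` is a hypothetical stand-in for `bb`, with invented chord labels). It counts chords, sorts them, and keeps the first two rows in each of the two ways described above.

```r
library(dplyr)

# Hypothetical stand-in for `bb`: one row per chord change
toy <- tibble(chord = c("C:maj", "G:maj", "C:maj", "D:maj", "G:maj", "C:maj"))

toy_count <- toy %>%
  count(chord, sort = TRUE)   # C:maj 3, G:maj 2, D:maj 1

# Two ways to keep the first two rows
by_index <- toy_count[1:2, ]            # base-style indexing, note the comma
by_slice <- toy_count %>% slice(1:2)    # stays inside the pipe, no comma
```

Both objects hold the same two rows; `slice()` is simply easier to chain with further verbs.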


In [None]:
# Counting the most common chords
bb_count <- ....

# Displaying the top 20 chords
# .... YOUR CODE FOR TASK 2 ....

In [None]:
# Counting the most common chords
bb_count <- bb %>%
  count(chord, sort = TRUE)

# Displaying the top 20 chords
bb_count[1:20,]

In [None]:
run_tests({
    test_that("bb_count is correct", {
        correct_bb_count <- bb %>%
          count(chord, sort = TRUE)
        expect_equivalent(bb_count, correct_bb_count, 
            info = "bb_count should contain the count of each type of chord.")
    })
})

## 3. Visualizing the most common chords

Of course, it's easier to get a feel for just how common some of these chords are if we graph them and show the percentage of the total chord count represented by each chord.
Musicians may notice right away that the most common chords in this corpus are chords that are easy to play on both the guitar and the piano: C, G, A, and D major, and to an extent F and E major. (They also belong to keys, or scales, that are easy to play on most instruments, so they fit well with melodies and solos.) After that, there is a steep drop-off in how often individual chords appear.

To illustrate this, here is a short video demonstrating the relative ease (and difficulty) of some of the most common (and not-so-common) chords in the McGill Billboard dataset.
<br/><br/>
<a href="https://player.vimeo.com/video/251381886" target="_blank"><img style="max-width: 500px;" src="img/smaller_video_screenshot.jpeg"/></a>

Plot the top 20 chords as a flipped bar plot.

- Starting with the first 20 records from `bb_count`, use `mutate` to create a new column `share` with the percentage of how often each chord type occurs.
- Also using `mutate`, `reorder` the `chord` column according to the value in `share`.
- Pipe the results into `ggplot()` and make a column plot where the X axis represents `chord` and the Y axis represents `share`.
- Make your plot more readable by adding labels with `xlab()` and `ylab()`, and by flipping the plot using `coord_flip()`.

<hr>

Do your best to make your visualization look like this:

[![](img/chord_frequency_screenshot.png)](img/chord_frequency_screenshot.png)

A picture is worth a thousand words -- perhaps, even more, when visualizing data! That's why we're working so hard to make the visualizations as readable as possible -- using percentages, arranging values in descending order, etc.

You may also try adding a splash of color. (Remember that column plots require color to be added with `fill = chord` rather than `color`.) When color adds to the aesthetic, but not a new dimension of information, I recommend removing the color legend with `theme(legend.position='none')`.

As you're working through the above steps, think about what the plot would look like without some of these options. For example, what advantage does converting raw chords counts to percentages have for those reading the plot? How readable would the plot be without the axis labels? Without reordering columns? What value does `coord_flip()` add to this plot?

You can calculate the percentage of chords represented by each chord count with a single line: 

```r
mutate(share = n / sum(n))
```
Likewise, reordering chords so they appear in order by chord count (or share) can be done with a single `mutate` command:
```r
mutate(chord = reorder(chord, share))
```
Don't forget that `ggplot()` does not use `%>%` between lines. Once you've passed your wrangled data into `ggplot()`, use `+` to combine multiple commands together.

Once you've created your `ggplot` aesthetic with `aes()`, use `geom_col()` to create a column plot using `chord` and `share` as the X and Y axes, respectively.
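
The effect of the two `mutate` steps is easier to see on a small table. The sketch below uses a hypothetical three-row count table (invented numbers) and checks what `share` and the reordered `chord` factor look like before anything is plotted.

```r
library(dplyr)

# Hypothetical count table, already sorted from most to least common
toy_count <- tibble(chord = c("C:maj", "G:maj", "D:maj"),
                    n     = c(5L, 3L, 2L))

toy_share <- toy_count %>%
  mutate(share = n / sum(n),             # fraction of the rows present here
         chord = reorder(chord, share))  # factor levels ordered by share

toy_share$share          # 0.5 0.3 0.2
levels(toy_share$chord)  # "D:maj" "G:maj" "C:maj" (ascending by share)
```

Two things are worth noticing. First, `sum(n)` only covers the rows present at that point in the pipe, so when `mutate` follows `slice(1:20)` the shares are relative to those 20 rows. Second, `reorder()` sorts levels in ascending order; after `coord_flip()` the first level is drawn at the bottom, which puts the largest bar at the top.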

In [None]:
# Creating a bar plot from `bb_count`
bb_count %>%
  slice(....) %>%
  mutate(share = ....,
         chord = ....) %>%
  .... +
  coord_flip() +
  xlab(....) +
  ylab(....) 

In [None]:
# Creating a bar plot from `bb_count`
bb_count %>%
  slice(1:20) %>%
  mutate(share = n / sum(n),
         chord = reorder(chord, share)) %>%
  ggplot(aes(chord, share, fill = chord)) +
  geom_col() +
  coord_flip() +
  ylab('Share of total chords') +
  xlab('Chord') +
  theme(legend.position="none")

In [None]:
run_tests({
    test_that("bb_count has some data in it", {
    expect_true(nrow(bb_count) > 0, 
        info = "Looks like you're missing data in `bb_count`.")
    })
})

## 4. Chord "bigrams"

Just as some chords are more common and more idiomatic than others, not all chord *progressions* are created equal. To look for common patterns in the structuring of chord progressions, we can use many of the same modes of analysis used in text-mining to analyze phrases. A chord change is simply a *bigram* — a two-"word" phrase — composed of a starting chord and a following chord. Here are the most common two-chord "phrases" in the McGill Billboard dataset.
To help you get your ear around some of these common progressions, here's a short audio clip containing some of the most common chord bigrams.
<br/><br/>
<audio controls src="http://assets.datacamp.com/production/project_79/img/bigrams.mp3">
  Your browser does not support the audio tag.
</audio>

Create a count of chord _bigrams_.

- Use `mutate()` to add two new columns to `bb`: `next_chord` and `next_title`. These should contain the data from the `chord` and `title` columns, but shifted one row up. Use the `lead()` function inside your `mutate()` command to do this.
- Create a `bigram` column that concatenates `chord` with `next_chord`, with a space in between.
- Use `filter()` to remove any records in our new data frame where `title` and `next_title` are not identical.
- Count the number of occurrences of each bigram type and store the results in `bb_bigram_count`. 
- Display the 20 most common chord bigrams.

<hr>

There are natural language processing (NLP) tools that will _tokenize_ texts by _n-grams_ (phrases of _n_ words). However, our chord data is already in a tidy table, rather than in something that looks like paragraph form. Thankfully, `dplyr` contains functions like `lag()` and `lead()` that make it easy to access data from other rows in the data frame efficiently, and we can use them to construct our bigrams using `paste()` (or `str_c` from `stringr`). 

<!-- Once we have our bigrams as a column in our dataframe, we can perform just about any NLP analyses, just as if we had started with a corpus of sentences, paragraphs, or books. -->

Why we `filter` in step 3 might not be obvious, but it's incredibly important. The last chord of one song combined with the first chord of the next song is _not_ a bigram. Depending on the order of songs in the dataset, if we skip this step, we could end up with chord "progressions" connecting songs that occur perhaps 30 years apart in history!

<!-- When working with a new dataset, it's important to keep checking in on the data -- both through visualizations and through the data table -- to make sure we're not missing these kinds of potential problems. -->

You don't need to call `mutate()` more than once. `mutate()` evaluates its arguments in order, so a column created earlier in the call (such as `next_chord`) can be used by a later one (such as `bigram`). All three can be included together, separated by commas:

```r
bb %>%
  mutate(next_chord = ....,
         next_title = ....,
         bigram = ....) %>%
```
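
To make the role of `lead()` and the `filter` step concrete, here is a small worked example on a hypothetical two-song table (titles and chords are invented). It builds the bigrams exactly as outlined above and then shows which rows the filter removes.

```r
library(dplyr)

# Hypothetical two-song table, one row per chord
toy <- tibble(
  title = c("Song A", "Song A", "Song A", "Song B", "Song B"),
  chord = c("C:maj", "G:maj", "C:maj", "E:min", "A:maj")
)

toy_bigrams <- toy %>%
  mutate(next_chord = lead(chord),            # chord from the following row
         next_title = lead(title),            # title from the following row
         bigram = paste(chord, next_chord))   # uses next_chord defined just above

toy_bigrams$bigram
# "C:maj G:maj" "G:maj C:maj" "C:maj E:min" "E:min A:maj" "A:maj NA"

kept <- toy_bigrams %>%
  filter(title == next_title)

kept$bigram
# "C:maj G:maj" "G:maj C:maj" "E:min A:maj"
```

The filter drops two rows: the "C:maj E:min" bigram that straddles Song A and Song B, and the final row, where `lead()` returns `NA` because there is no following row.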

In [None]:
# Wrangling and counting bigrams
bb_bigram_count <- bb %>%
    # .... YOUR CODE FOR TASK 4 ....

# Displaying the first 20 rows of bb_bigram_count
# .... YOUR CODE FOR TASK 4 ....

In [None]:
# Wrangling and counting bigrams
bb_bigram_count <- bb %>%
  mutate(next_chord = lead(chord),
         next_title = lead(title),
         bigram = paste(chord, next_chord)) %>%
  filter(title == next_title) %>%
  count(bigram, sort = TRUE)

# Displaying the first 20 rows of bb_bigram_count
bb_bigram_count[1:20,]

In [None]:
run_tests({
    test_that("bb_bigram_count is correct", {
      correct_bb_bigram_count <- bb %>%
      mutate(next_chord = lead(chord),
             next_title = lead(title),
             bigram = paste(chord, next_chord)) %>%
      filter(title == next_title) %>%
      count(bigram, sort = TRUE)
    expect_equivalent(bb_bigram_count, correct_bb_bigram_count, 
        info = "`bb_bigram_count` should contain the count of each type of bigram. Don't forget to sort by bigram frequency!")
    })
})

## 5. Visualizing the most common chord progressions

We can get a better sense of just how popular some of these chord progressions are if we plot them on a bar graph. Note how the most common chord change, G major to D major, occurs more than twice as often as even some of the other top 20 chord bigrams.

Create a flipped bar plot that shows the 20 most common chord bigrams.

- Copy your code from Task 3, and modify it to work with `bb_bigram_count` instead of `bb_count`. 
- Adjust the plot labels to fit chord _changes_ instead of just chords.

<hr>

Copy-and-paste isn't cheating! In fact, knowing how to successfully copy, paste, and tweak existing code (yours, or someone else's -- with permission, of course) is an integral part of data science. It not only saves time and brain power, it also limits mistakes in your code when you use code you already know works. The iterative process of tweaking that code can also help you write more efficient code in the future.

Of course, if you copy-and-paste the same code several times, you may just want to write a custom function instead!
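
As one possible illustration of that idea, the sketch below wraps the wrangling and plotting steps in two small functions. The names `top_share()` and `plot_share()`, their arguments, and the toy data are all hypothetical and not part of the project; the project itself only asks you to copy and adapt your earlier code.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: keep the first `n_top` rows of a count table,
# add `share`, and store the reordered labels in a column called `item`
top_share <- function(counts, col, n_top = 20) {
  k <- min(n_top, nrow(counts))
  counts %>%
    slice(seq_len(k)) %>%
    mutate(share = n / sum(n),
           item = reorder(.data[[col]], share))
}

# Hypothetical helper: flipped column plot of the table returned above
plot_share <- function(shares, x_label) {
  ggplot(shares, aes(item, share, fill = item)) +
    geom_col() +
    coord_flip() +
    xlab(x_label) +
    ylab("Share") +
    theme(legend.position = "none")
}

# Invented count table standing in for `bb_bigram_count`
toy_count <- tibble(bigram = c("G:maj D:maj", "C:maj F:maj", "A:min C:maj"),
                    n      = c(6L, 3L, 1L))

toy_top <- top_share(toy_count, "bigram", n_top = 2)
p <- plot_share(toy_top, "Chord change")
```

Passing the column name as a string and looking it up with `.data[[col]]` is one of several ways to write such a function; it keeps the example short at the cost of a generic `item` column name.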

The same code as in Task 3 should work here, but be sure you replace `bb_count` with `bb_bigram_count` and replace _every_ occurrence of `chord` with `bigram`. Missing just one can cause errors.


In [None]:
# Creating a column plot from `bb_bigram_count`
bb_bigram_count %>%
  mutate(share =....,
         bigram = ....) %>%
  .... +
  coord_flip() +
  xlab(....) +
  ylab(....) 

In [None]:
# Creating a column plot from `bb_bigram_count`
bb_bigram_count %>%
  slice(1:20) %>%
  mutate(share = n / sum(n),
         bigram = reorder(bigram, share)) %>%
  ggplot(aes(bigram, share, fill = bigram)) +
  geom_col() +
  coord_flip() +
  ylab('Share of total chord changes') +
  xlab('Chord change') +
  theme(legend.position="none")

In [None]:
run_tests({
    test_that("bb_bigram_count has some data in it", {
    expect_true(nrow(bb_bigram_count) > 0, 
        info = "Looks like you're missing data in `bb_bigram_count`.")
    })
})

## 6. Finding the most common artists


As noted above, the most common chords (and chord bigrams) are those that are easy to play on both the guitar and the piano. If the degree to which these chords are idiomatic on guitar or piano (or both) *determines* how common they are, we would expect to find the more idiomatic guitar chords (C, G, D, A, and E major) to be more common in guitar-driven songs, but we would expect the more idiomatic piano chords (C, F, G, D, and B-flat major) to be more common in piano-driven songs. (Note that there is some overlap between these two instruments.)

The McGill Billboard dataset does not come with songs tagged as "piano-driven" or "guitar-driven," so to test this hypothesis, we'll have to do that manually. Rather than make this determination for every song in the corpus, let's focus on just a few to see if the hypothesis has some validity. If so, then we can think about tagging more artists in the corpus and testing the hypothesis more exhaustively.

Here are the 30 artists with the most songs in the corpus. From this list, we'll extract a few artists who are obviously heavy on guitar or piano to compare.

Find and display the 30 artists with the most songs in the McGill Billboard Dataset.

- Using `bb`, isolate the `artist` and `title` columns using `select()`.
- We still have one record per _chord_. Use `unique()` to remove duplicates and leave a single record per _song_.
- As in earlier tasks, use `count()` to find how many songs each artist has in the dataset, and sort the results in descending order.
- Display the first 30 records in the sorted table.

<hr>

In order to tag as many songs as possible _quickly_ in the next task, we can simply identify a small number of prolific artists whose songs we can tag all at once. By isolating the 30 most prolific artists in the dataset, we can look at the results and pick a few good candidates.

When used in a piped string of commands, `unique()` does not need to take any arguments, since each command treats the output of the previous command as its first argument.

After using `unique()` to reduce the data frame to one record per song, `count(artist, sort = TRUE)` will return the number of songs per artist in the dataset in descending order.

`slice(1:30)` will return the first 30 records in the sorted data frame without needing to break the pipe.
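
The difference between counting chords and counting songs is easy to check on a small table. In this sketch, `toy` is a hypothetical stand-in for `bb` with invented artists and titles.

```r
library(dplyr)

# Hypothetical table with one row per chord
toy <- tibble(
  artist = c("Artist X", "Artist X", "Artist X", "Artist Y", "Artist Y"),
  title  = c("Song 1", "Song 1", "Song 2", "Song 3", "Song 3"),
  chord  = c("C:maj", "G:maj", "D:maj", "A:min", "E:min")
)

# Counting rows directly counts chords, not songs
chords_per_artist <- toy %>%
  count(artist, sort = TRUE)   # Artist X: 3, Artist Y: 2

# Reducing to one row per song first gives the number of songs
songs_per_artist <- toy %>%
  select(artist, title) %>%
  unique() %>%
  count(artist, sort = TRUE)   # Artist X: 2, Artist Y: 1
```

`dplyr` also provides `distinct()`, which removes duplicate rows in the same way; `unique()` is used here to match the task instructions.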


In [None]:
# Finding and displaying the 30 artists with the most songs in the corpus
bb_30_artists <- bb %>%
    ....

bb_30_artists %>%
  slice(1:30)

In [None]:
# Finding and displaying the 30 artists with the most songs in the corpus
bb_30_artists <- bb %>%
  select(artist, title) %>%
  unique() %>%
  count(artist, sort = TRUE)

bb_30_artists %>%
  slice(1:30)

In [None]:
run_tests({
    test_that("bb artists counted and sorted", {
      correct_bb_30_artists <- bb %>%
        select(artist, title) %>%
        unique() %>%
        count(artist, sort = TRUE)
    expect_equivalent(bb_30_artists, correct_bb_30_artists, 
        info = "`bb_30_artists` should contain the number of songs (not chords) by each artist in the corpus. Don't forget to sort!")
    })
})

## 7. Tagging the corpus

There are relatively few artists in this list whose music is demonstrably "piano-driven," but we can identify a few that generally emphasize keyboards over guitar: Abba, Billy Joel, Elton John, and Stevie Wonder, totaling 17 songs in the corpus. There are many guitar-centered artists in this list, so for our test, we'll focus on three well-known, guitar-heavy artists with a similar number of songs in the corpus: The Rolling Stones, The Beatles, and Eric Clapton (18 songs).

Once we've subset the corpus to only songs by these seven artists and applied the "piano" and "guitar" tags, we can compare the chord content of piano-driven and guitar-driven songs.

Add a new column `instrument` to `bb`, including "piano" or "guitar" for piano- and guitar-driven songs.

- Use `inner_join()` with `tags` to attach an `instrument` column to `bb` and assign the result to `bb_tagged`.
- Display the new data frame `bb_tagged` to make sure the join was successful.

<hr>

When adding a custom column to an entire data frame based on data in another column, it is usually much faster to use the appropriate `join` operation than to write a looping function. `inner_join()` will even remove all rows in `bb` that do not correspond to the artists in `tags`. And in this case, since both `bb` and `tags` have an `artist` column, you do not need to specify a column by which to join.

Try it out with `left_join()` and `full_join()`, too. What are the differences? Do any produce the same results in this case? What would happen if you started with `tags` and applied a join operation to `bb`? Which join(s) would produce the desired results?

Since there is no need to define a column by which to join the two tibbles/data frames, a simple

```r
tibble_1 %>%
  inner_join(tibble_2)
```
will do the trick.
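
To compare the three joins mentioned above, here is a minimal sketch on two hypothetical tables (`toy_bb` and `toy_tags`, with invented artists). The `by = "artist"` argument is spelled out only to silence the message `dplyr` prints when it picks the join column itself; omitting it gives the same result here.

```r
library(dplyr)

# Hypothetical chord table: Artist Z has no tag
toy_bb <- tibble(
  artist = c("Artist X", "Artist X", "Artist Y", "Artist Z"),
  chord  = c("C:maj", "G:maj", "A:min", "D:maj")
)

# Hypothetical tag table: Artist W has no chords
toy_tags <- tibble(
  artist     = c("Artist X", "Artist Y", "Artist W"),
  instrument = c("guitar", "piano", "piano")
)

inner <- toy_bb %>% inner_join(toy_tags, by = "artist")  # 3 rows: only X and Y
left  <- toy_bb %>% left_join(toy_tags, by = "artist")   # 4 rows: Z kept, instrument is NA
full  <- toy_bb %>% full_join(toy_tags, by = "artist")   # 5 rows: W added, chord is NA
```

Only `inner_join()` leaves a table in which every row has both a chord and an instrument, which is what the comparison in the next task needs.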


In [None]:
tags <- tibble(
  artist = c('Abba', 'Billy Joel', 'Elton John', 'Stevie Wonder', 'The Rolling Stones', 'The Beatles', 'Eric Clapton'),
  instrument = c('piano', 'piano', 'piano', 'piano', 'guitar', 'guitar', 'guitar'))

# Creating a new dataframe `bb_tagged` that includes a new column `instrument` from `tags`
bb_tagged <- bb %>%
    ....
    
# Displaying the new dataframe
# .... YOUR CODE FOR TASK 7 ....

In [None]:
tags <- tibble(
  artist = c('Abba', 'Billy Joel', 'Elton John', 'Stevie Wonder', 'The Rolling Stones', 'The Beatles', 'Eric Clapton'),
  instrument = c('piano', 'piano', 'piano', 'piano', 'guitar', 'guitar', 'guitar'))

# Creating a new dataframe `bb_tagged` that includes a new column `instrument` from `tags`
bb_tagged <- bb %>%
  inner_join(tags)

# Displaying the new data frame
bb_tagged

In [None]:
run_tests({
    test_that("bb_tagged is correct", {
      correct_bb_tagged <- bb %>%
        inner_join(tags)
    expect_equivalent(bb_tagged, correct_bb_tagged, 
        info = "`bb_tagged` should be a successful join of `bb` and `tags` that only contains records contained in both dataframes.")
    })
})

## 8. Comparing chords in piano-driven and guitar-driven songs

Let's take a look at any difference in how common chords are in these two song groups. To clean things up, we'll just focus on the 20 chords most common in the McGill Billboard dataset overall.

While we want to be careful about drawing any conclusions from such a small set of songs, we can see that the chords easiest to play on the guitar *do* dominate the guitar-driven songs, especially G, D, E, and C major, as well as A major and minor. Similarly, "flat" chords (B-flat, E-flat, A-flat major) occur frequently in piano-driven songs, though they are nearly absent from the guitar-driven songs. In fact, the first and fourth most frequent piano chords are "flat" chords that occur rarely, if at all, in the guitar songs.

So with all the appropriate caveats, it seems like the instrument-based-harmony hypothesis does have some merit and is worth further examination.

Create a faceted plot that shows the frequency of the most common chords side-by-side for songs by piano- and guitar-driven artists.

- Starting with `bb_tagged`, use `filter()` to keep only the `top_20` chords.
- Use `count()` to find the number of times each `chord` occurs for each `instrument`, and sort the results.
- Pipe the results to `ggplot()` and make a bar plot, using `chord` as the X axis and `n` (the result of `count()`) as your Y axis. 
- Use `coord_flip()` for readability, and provide appropriate labels for the X and Y axes.
- Use `facet_grid()` to place guitar and piano plots side by side for comparison.

<hr>

If you like, add a splash of color with `fill` (and if so, set `theme(legend.position='none')`).

`facet_wrap()` and `facet_grid()` are incredibly powerful visualization tools. They allow you to add dimensions to your data visualization story without making things hard on your readers.

Try playing around with faceting a bit. What happens when you count `chord` and `artist` and pass `artist` to `facet_grid()`? What other parameters could you visualize in this way that tell a compelling story?


All `ggplot()` parameters (separated by `+` instead of `%>%`!) will be the same here as in Task 3, with the exception of adding `facet_grid(~instrument)`. Note that in this case, there is nothing before the `~`.

Here is some code to get you started with creating the plot:

```r
bb_tagged %>%
  filter(chord %in% top_20) %>%
  count(chord, instrument, sort = TRUE) %>%
  ggplot(....) +
  ....
```
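
Before plotting the real data, it can help to see what `count()` returns when given two columns and how `facet_grid()` splits the result. The sketch below uses a hypothetical tagged table (`toy_tagged`, with invented rows) rather than `bb_tagged`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tagged table: one row per chord, with an instrument label
toy_tagged <- tibble(
  chord      = c("C:maj", "C:maj", "G:maj", "C:maj", "Bb:maj", "Bb:maj"),
  instrument = c("guitar", "guitar", "guitar", "piano", "piano", "piano")
)

# One row per chord/instrument combination that actually occurs
toy_counts <- toy_tagged %>%
  count(chord, instrument, sort = TRUE)

# One panel per instrument, sharing the same chord axis
p <- ggplot(toy_counts, aes(chord, n, fill = chord)) +
  geom_col() +
  facet_grid(~instrument) +
  coord_flip() +
  theme(legend.position = "none")
```

A chord that never occurs for one instrument (here, G:maj for piano and Bb:maj for guitar) has no row in `toy_counts`, so its bar is simply missing from that panel.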

In [None]:
# The top 20 most common chords
top_20 <- bb_count$chord[1:20]

# Comparing the frequency of the 20 most common chords in piano- and guitar-driven songs
bb_tagged %>%
  filter(....) %>%
  count(....) %>%
  ....
  coord_flip() +
  xlab(....) +
  ylab(....) 


In [None]:
# The top 20 most common chords
top_20 <- bb_count$chord[1:20]

# Comparing the frequency of the 20 most common chords in piano- and guitar-driven songs
bb_tagged %>%
  filter(chord %in% top_20) %>%
  count(chord, instrument, sort = TRUE) %>%
  ggplot(aes(chord, n, fill = chord)) +
  geom_col() +
  facet_grid(~instrument) +
  coord_flip() +
  ylab('Total chords') +
  xlab('Chord') +
  theme(legend.position="none")

In [None]:
run_tests({
    test_that("bb_tagged has some data in it", {
    expect_true(nrow(bb_tagged) > 0, 
        info = "Looks like you're missing data in `bb_tagged`.")
    })
})

## 9. Comparing chord bigrams in piano-driven and guitar-driven songs

Since chord occurrence and chord bigram occurrence are naturally strongly tied to each other, it would not be a reach to expect that a difference in chord frequency would be reflected in a difference in chord bigram frequency. Indeed that is what we find.

Create the same faceted plot as in Task 8, but for chord bigrams.

- Copy and modify your code from Task 4 to add a `bigram` column, this time to `bb_tagged`.
- Copy and modify your code from Task 8 to produce a faceted plot of bigram frequency from the `top_20_bigram`s that compares guitar- and piano-driven songs.
- Remember to change all references to chords (including in the axis labels) to bigrams.

<hr>

Use the same `mutate` and `filter` commands as in Task 4 to add a `bigram` column to `bb_tagged`.

After that, the code will be identical to Task 8, with the exception of `bigram` replacing `chord` throughout.


In [None]:
# The top 20 most common bigrams
top_20_bigram <- bb_bigram_count$bigram[1:20]

# Creating a faceted plot comparing guitar- and piano-driven songs for bigram frequency
bb_tagged %>%
  # .... MODIFIED CODE FROM TASK 4 ....
  ....  %>% 
  filter(....) %>%
  .... +
  coord_flip() +
  xlab(....) +
  ylab(....) 

In [None]:
# The top 20 most common bigrams
top_20_bigram <- bb_bigram_count$bigram[1:20]

# Creating a faceted plot comparing guitar- and piano-driven songs for bigram frequency
bb_tagged %>%
  mutate(next_chord = lead(chord),
         next_title = lead(title),
         bigram = paste(chord, next_chord)) %>%
  filter(title == next_title) %>%
  count(bigram, instrument, sort = TRUE) %>%
  filter(bigram %in% top_20_bigram) %>%
  ggplot(aes(bigram, n, fill = bigram)) +
  geom_col() +
  facet_grid(~instrument) +
  coord_flip() +
  ylab('Total bigrams') +
  xlab('Bigram') +
  theme(legend.position="none")

In [None]:
run_tests({
    test_that("bb_bigram_count has some data in it", {
    expect_true(nrow(bb_bigram_count) > 0, 
        info = "Looks like you're missing data in `bb_bigram_count`.")
    })
})

## 10. Conclusion

We set out asking if the degree to which a chord is "idiomatic" on an instrument affects how frequently it is used by a songwriter. It seems that is indeed the case. In a large representative sample of pop/rock songs from the historical Billboard charts, the chords most often learned first by guitarists and pianists are the most common. In fact, chords commonly deemed *easy* or *beginner-friendly* on **both** piano and guitar are far and away the most common in the corpus.

We also examined a subset of 35 songs from seven piano- and guitar-heavy artists and found that guitarists and pianists tend to use different sets of chords for their songs. This was an extremely small (and likely not representative) sample, so we can do nothing more than hypothesize that this trend might carry over throughout the larger dataset. But it seems from this exploration that it's worth a closer look.

There are still more questions to explore with this dataset. What about band-driven genres like classic R&B and funk, where artists like James Brown and Chicago build chords from a large number of instruments each playing a single note? What about "progressive" bands like Yes and Genesis, where "easy" and "idiomatic" may be less of a concern during the songwriting process? And what if we compared this dataset to a collection of chords from classical songs, jazz charts, folk songs, liturgical songs?

There's only one way to find out!

Complete the project by confirming the validity of the hypothesis, as well as the need for further data analysis to draw a conclusion.

- Is the hypothesis that guitar-driven and piano-driven songs have different chord tendencies valid and worth deeper exploration? `TRUE` or `FALSE`? Set `hypothesis_valid` to reflect your answer.
- To draw a conclusion about this hypothesis, do we still need to explore more data? `TRUE` or `FALSE`? Set `more_data_needed` to reflect your answer.

<hr>

Great work! You've uncovered some interesting things about musical chord progressions *and* learned a little about how natural language processing (NLP) analysis techniques can be used to study musical symbolic data.

## Do you want to know more?

If this project got you hungry for more musical data analysis, check out my blog post, [What is computational musicology?](https://pushpullfork.com/computational-musicology/). There are links to academic studies, tools, and datasets for further exploration. The Million Song Dataset is especially cool.

And if you're also a Python fan, check out [music21](http://web.mit.edu/music21/), an advanced toolkit for computational musicology in Python.


No hint here. Just try checking the project!

In [None]:
# Set to TRUE or FALSE to reflect your answer.
hypothesis_valid <- ....

# Set to TRUE or FALSE to reflect your answer.
more_data_needed <- ....

In [None]:
# Set to TRUE or FALSE to reflect your answer
hypothesis_valid <- TRUE

# Set to TRUE or FALSE to reflect your answer
more_data_needed <- TRUE


In [None]:
run_tests({
    test_that("hypothesis is true", {
    expect_true(hypothesis_valid, 
        info = "Are you sure the hypothesis isn't valid?!")
    })
    test_that("more_data_needed is true", {
    expect_true(more_data_needed, 
        info = "Are you sure we don't need more data?!")
    })
})