# Bootcamp Practice Notebooks:  Music Album Rankings Analysis

## Notebook 1:  Introduction to the Topic and Data Sets

### This series of notebooks is provided for the students with three purposes in mind:

1. Give students practice problems that are put together in a similar manner to what they will encounter on MidTerm 1. While the notebooks are hosted on Colab and do not contain the "test case variables", the exercises are laid out in a manner that mirrors how the MidTerm exercises are constructed. Each exercise in notebooks 2-5 is set up to show a suggested exercise at that point value.

2. Provide the students with additional exercises to practice concepts that are typically tested on the first MidTerm.

3. Provide the students with opportunities to build on their knowledge of basic Python data structures, nested data structures, implementing mathematics, and string manipulation.

### There are 5 notebooks in the series.

#### The first notebook (this one) provides an introduction to the topic and the datasets that the students will be working with.

#### The other 4 notebooks are each provided as a 2-notebook set:

1. The student working notebook, with the code cells for students to solve the exercises.
2. The solution notebook, with code provided, for students to see one way of solving each exercise.

### The library of notebooks is as follows:

- NB 1: Introduction to the Topic and Data Sets (this notebook).
- NB 2: Example 1-point exercises (3 exercises).
- NB 3: Example 2-point exercises (2 exercises).
- NB 4: Example 2-point exercises (2 exercises).
- NB 5: Example 3-point exercises (2 exercises).

### Links to all of the notebooks are provided on the Bootcamp Schedule page.

# Overview: Music Album Rankings #

`Rolling Stone` is an American monthly magazine that focuses on music, politics, and popular culture. It was founded in San Francisco, California in 1967 and is still published today.

In 2003, the magazine released its `“500 Greatest Albums of All Time,”` placing the Beatles’ “Sgt. Pepper’s Lonely Hearts Club Band” in the top slot. It has since released two additional `"500 Greatest"` lists, in 2012 and 2020. While not necessary for this analysis, to gain a full understanding of these rankings, see this Wikipedia article:  https://en.wikipedia.org/wiki/Rolling_Stone%27s_500_Greatest_Albums_of_All_Time



#### Our analysis will focus on the changes in the three lists over time.

- The `2003` list was heavily criticized for being male-dominated, outmoded, and almost entirely Anglo-American in focus.

- The `2012` list was also heavily criticized in a similar manner, with one music critic noting that only one album in the top 10 was less than 40 years old.

- The `2020` list was much more diverse in its representation of different music genres, musicians, and time periods. Music critics were much more positive in their reviews of the list, noting the lesser representation of white male rock musicians, and the move to recognize more contemporary albums and a wider range of tastes.


#### The dataset itself has **two** main parts:
- The **voters data**, which shows the persons who voted for the 2003 and 2020 lists, along with some of their demographic data. The voter list for 2012 is not publicly available in a format for us to work with.
- The **albums data**, which shows the albums themselves, along with ranking, artist, demographic, and other metadata about each album. An album will be listed in the dataset once, with columns designating which list(s) it appeared in.

**Your overall task in the notebook series:** 

You will help clean up and analyze this data, culminating in a series of assessments, to understand the data behind the list commentary and criticisms.

**From this data and your calculations, you can ascertain if the criticisms are valid.**

**Here is your overall workflow for this problem:**
- **NB 2**: Data understanding, simple Exploratory Data Analysis (EDA)
- **NB 3**: Data understanding, complex Exploratory Data Analysis (EDA)
- **NB 4**: Data cleaning, initial/simple analysis
- **NB 5**: Complex analysis (these are difficult exercises)

# Setup and Data Load

To get started, run the following code cell. This will populate the `voters` and `albums` datasets that you will be working with.

In [None]:
!wget https://github.com/gt-cse-6040/bootcamp/raw/main/practice_exercises/voters.json
!wget https://github.com/gt-cse-6040/bootcamp/raw/main/practice_exercises/Rolling_Stone_500_public.json

In [None]:
import json

# The `with` statement closes each file automatically when its block ends,
# so no explicit close() call is needed.
with open("voters.json", "r") as read_file:
    voters = json.load(read_file)

with open("Rolling_Stone_500_public.json", "r") as read_file:
    albums = json.load(read_file)
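
In the loading cell, the `with` statement closes each file as soon as its block ends, and `json.load` returns ordinary Python lists and dictionaries. The sketch below illustrates both points on a small, hypothetical two-record file written to a temporary directory, so it runs without the downloaded data. The record contents and the `sample.json` file name are invented for illustration.

```python
import json
import os
import tempfile

# Hypothetical two-record sample (not taken from the real data files).
sample = [{"ID": "1", "Year": "2003"}, {"ID": "2", "Year": "2020"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")

    # Write the sample out as JSON.
    with open(path, "w") as write_file:
        json.dump(sample, write_file)

    # Read it back the same way the notebook reads voters.json.
    with open(path, "r") as read_file:
        records = json.load(read_file)

# The file object was closed when its `with` block ended.
file_was_closed = read_file.closed

print(type(records).__name__, len(records))  # list 2
print(type(records[0]).__name__)             # dict
print(file_was_closed)                       # True
```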

# About the voters data

The global variable `voters` is a list of dictionaries. 

Each dictionary represents an individual voter, with information about that voter.

Let's take a look at the dataset in detail.

In [None]:
display(list(voters[0].keys()))

The voters dictionary keys are defined as follows:

- `Year`:  The year that this person voted for the list.
- `ID`:  The unique ID for that year and voter.
- `General_Category`:  General categorization of the voter's role in the music industry.
- `Specific_Category`:  Voter's specific role, if known.
- `Voter`:  Name of the voter.
- `Additional_Info`:  Other information about the voter, if known.
- `Pronouns/Gender`:  Voter gender.
- `Image`:  URL of voter picture, if known.
- `Estimated_Birthyear`:  Voter's estimated birth year, when known.
- `Age_at_Vote`:  Voter age at time of vote. This is the difference of the vote year and birth year.
- `Teenage_Decade`:  Decade that the voter was a teenager, with emphasis on the latter teen years. For many persons, the teenage years represent the most formative years for the development of their music tastes. One supposition might be that albums released during voters’ teenage years would have an outsized impact on their choices, and we will use this information to validate/refute that supposition.

The primary key of the dataset is `ID`, which identifies a unique combination of `Voter` and `Year` (the **year in which they voted**).
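
Because each record is one voter in one year, a person who voted for more than one list appears in more than one record. A minimal sketch of how that could be checked is shown here, grouping records by `Voter`. The `sample_voters` records, including the names and `ID` values, are hypothetical and carry only the keys needed for the illustration.

```python
from collections import defaultdict

# Hypothetical records: names and IDs are invented for illustration.
sample_voters = [
    {"ID": "1", "Year": "2003", "Voter": "Voter A"},
    {"ID": "2", "Year": "2003", "Voter": "Voter B"},
    {"ID": "3", "Year": "2020", "Voter": "Voter A"},
]

# A primary key should never repeat across records.
ids = [record["ID"] for record in sample_voters]
ids_are_unique = len(set(ids)) == len(ids)

# Collect the vote years for each voter name.
years_by_voter = defaultdict(list)
for record in sample_voters:
    years_by_voter[record["Voter"]].append(record["Year"])

# Keep only the voters who appear in more than one record.
repeat_voters = {
    name: years for name, years in years_by_voter.items() if len(years) > 1
}

print(ids_are_unique)  # True
print(repeat_voters)   # {'Voter A': ['2003', '2020']}
```

Grouping on the name assumes that a voter's name is spelled identically in each year, which may not hold in the real data.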

Let's take a look at an example below. We can see that **Ms. Lucinda Williams** voted in both 2003 and 2020, so she is represented by two rows, with the `ID` numbers **599** and **171**.

In [None]:
display(voters[409])
display(voters[410])

Relevant observations about Ms. Williams are as follows:

- She is an `Artist, Songwriter, Producer`
- She is a `woman`
- She was born in the year `1953`
- She was `50` years old when she voted in 2003, and she was `67` years old when she voted in 2020.
- Her (latter) teen years are the decade of the `1970s`.

### All values in the dictionary are STRING values.
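
Since every value is a string, numeric fields need to be converted before doing arithmetic. The sketch below uses only the three fields involved and the values listed above for Ms. Williams to confirm that `Age_at_Vote` equals `Year` minus `Estimated_Birthyear`. It assumes the fields contain digit-only strings. The real records hold more keys, and some fields may be unpopulated.

```python
# Only the fields needed here, using the values listed for Ms. Williams.
williams_records = [
    {"Year": "2003", "Estimated_Birthyear": "1953", "Age_at_Vote": "50"},
    {"Year": "2020", "Estimated_Birthyear": "1953", "Age_at_Vote": "67"},
]

computed_ages = []
for record in williams_records:
    # Convert the strings to integers before subtracting.
    age = int(record["Year"]) - int(record["Estimated_Birthyear"])
    computed_ages.append(age)

# Compare against the stored (string) ages, converted the same way.
stored_ages = [int(record["Age_at_Vote"]) for record in williams_records]

print(computed_ages)                 # [50, 67]
print(computed_ages == stored_ages)  # True

# Without conversion, "+" concatenates strings instead of adding numbers.
print("1953" + "50")                 # 195350
```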

#### Finally, note that not all information is populated. We will clean the data to handle fields that are missing or populated with non-useful values.

# Understanding the **`albums`** dataset #

This exercise involves looking at the albums data; you don't need to write any code. However, you **do** need to run the test cell **and** submit to get the free point(s).

## About the albums data

The global variable `albums` is a list of dictionaries. 

Each dictionary represents an individual album, with information about that album.

Each album was voted into one (or more) of the **Top 500 Lists** published in 2003, 2012, and 2020.

Let's take a look at the dataset in detail.

In [None]:
display(list(albums[0].keys()))

The `albums` dictionary keys are defined as follows:

- `Sort_Name`:  Last name, first name of the artist. This value may differ from year to year, for the same artist.
- `Clean_Name`:  First name and last name of the artist. This will be the same from year to year, for the same artist.
- `Album`:  Album name.
- `2003_Rank_Old`:  NOT USED IN THIS ANALYSIS.
- `2003_Rank`:   Rank in the Top 500 list for 2003, if in the list for that year.
- `2012_Rank`:   Rank in the Top 500 list for 2012, if in the list for that year.
- `2020_Rank`:   Rank in the Top 500 list for 2020, if in the list for that year.
- `2020_2003_Differential`:  NOT USED IN THIS ANALYSIS.
- `Release_Year`:  Year that the album was initially released.
- `Album_Genre`:   Music genre of the album.
- `Album_Type`:   NOT USED IN THIS ANALYSIS.
- `Weeks_on_Billboard`:  NOT USED IN THIS ANALYSIS.
- `Peak_Billboard_Position`:  Highest position that the album achieved in the Billboard Top Albums listing.
- `Spotify_Popularity`:  NOT USED IN THIS ANALYSIS.
- `Spotify_URI`:  NOT USED IN THIS ANALYSIS.
- `Chartmetric_Link`:  NOT USED IN THIS ANALYSIS.
- `Artist_Member_Count`:  Number of members that the artist has. For individual artists, this number will be 1, and for artists with more than one member (groups), it will be the number of members in the group.
- `Artist_Gender`:  Gender of the artist. 
- `Artist_Birth_Year_Sum`:  The sum of all of the birth years of the artist. For a single artist, this is the artist's birth year. For groups, this is the sum of the birth years of all of the artists in the group.
- `Debut_Album_Release_Year`:  NOT USED IN THIS ANALYSIS.
- `Avg._Age_at_Top_500_Album`:  NOT USED IN THIS ANALYSIS.
- `Years_Between_Debut_and_Top_500_Album`:  NOT USED IN THIS ANALYSIS.
- `Album_ID`:  NOT USED IN THIS ANALYSIS.
- `Album_ID_Quoted`:   NOT USED IN THIS ANALYSIS.
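
Taken together, `Artist_Birth_Year_Sum` and `Artist_Member_Count` allow an average birth year to be computed for any artist, solo or group. The following is a minimal sketch. The helper name `average_birth_year` and the group values are hypothetical, and it assumes both fields are populated with numeric strings.

```python
def average_birth_year(album):
    """Return the mean birth year of the artist's members (hypothetical helper)."""
    birth_year_sum = float(album["Artist_Birth_Year_Sum"])
    member_count = float(album["Artist_Member_Count"])
    return birth_year_sum / member_count


# A solo artist: the sum is simply the artist's own birth year.
solo = {"Artist_Member_Count": "1", "Artist_Birth_Year_Sum": "1989"}

# A four-member group with invented birth years that add up to 7764.
group = {"Artist_Member_Count": "4", "Artist_Birth_Year_Sum": "7764"}

print(average_birth_year(solo))   # 1989.0
print(average_birth_year(group))  # 1941.0
```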

Let's take a look at an example below. We can see that this album is from the artist `Taylor Swift`, and the album name is `Red`.

In [None]:
display(list(albums[16].items()))

Some relevant observations about this album are as follows:

- The album was not ranked in the 2003 list.
- The album was not ranked in the 2012 list.
- The album ranked `99` in the 2020 list.
- The album's genre is `Country/Folk/Country Rock/Folk Rock`.
- The album was released in the year `2012`.
- The album peaked at `#1` on the Billboard Album popularity list.
- Taylor Swift was born in `1989`.
- Taylor Swift is of the `female` gender.
- Taylor Swift is a solo artist -- (`Artist_Member_Count` is **1**).

### All values in the dictionary are STRING values.

#### Finally, note that not all information is populated, and some fields may hold values that are not useful for the analysis. We will clean the data to handle fields that are missing or populated with non-useful values.
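
One way to cope with string values that may be missing is to convert defensively and treat anything unreadable as `None`. The sketch below applies that idea to the three rank fields to find which lists an album appeared in. How missing ranks are actually stored is not stated here, so the empty strings in `sample_album` are an assumption, and the helper names `to_int_or_none` and `lists_appeared_in` are hypothetical.

```python
RANK_KEYS = ["2003_Rank", "2012_Rank", "2020_Rank"]


def to_int_or_none(value):
    """Convert a value to int, or return None if it cannot be read as a number."""
    try:
        # Going through float() also accepts strings such as "99.0".
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def lists_appeared_in(album):
    """Return the list years (as strings) in which the album has a numeric rank."""
    return [
        key[:4] for key in RANK_KEYS if to_int_or_none(album.get(key)) is not None
    ]


# Assumed representation: unranked years stored as empty strings.
sample_album = {"2003_Rank": "", "2012_Rank": "", "2020_Rank": "99"}

print(to_int_or_none("99"))             # 99
print(to_int_or_none(""))               # None
print(lists_appeared_in(sample_album))  # ['2020']
```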

## That concludes our introduction to the data sets. 

## Follow-on notebooks will contain exercises using the data sets.

## Students can refer back to this notebook if they need to review the data elements.