# <center>Pitchfork review analysis</center>

The idea behind this analysis is to find out whether Pitchfork authors are biased towards any genre of music, and then to see whether the review data can support a model that predicts album ratings from a combination of the following two variables:

- **Name of the author**
- **Name of the music genre(s)**

In the process I will also look at the following questions:
- **How do the average review scores of different genres and artists compare over the years?**
- **Which of these are more likely to be labeled "Best New Music"?**
- **Which authors contribute the most?**

The dataset used for this notebook was fully scraped from **www.pitchfork.com** on 11.04.2020, and the scraping code is available here: **[github.com/glbshv/Pitchfork-project](https://github.com/glbshv/Pitchfork-project)**.

The list of reviews covers more than 20 years of work by Pitchfork authors and contributors. It initially contains 22.367 reviews published between the site's foundation on **05.01.1999** and **11.04.2020**.

The variables extracted to perform this analysis are the following:

- **Artist Name** 
- **Album Name**
- **Review Score**
- **Best New Music**
- **Genre**
- **Date Published**
- **Written By** 

They are located at the top of every Pitchfork review page, as can be seen in the [**Aphex Twin - Syro**](https://pitchfork.com/reviews/albums/19755-aphex-twin-syro/) review, where the mentioned variables are framed in red:
![image.png](attachment:image.png)

To complete our task we will need the help of the following libraries:

In [1]:
import pandas as pd #to do data manipulation and table construction
import datetime as dt #to access the date/time format 
from bs4 import BeautifulSoup #to be able to access html content if needed 
import urllib.request #to be able to open web-pages
import seaborn as sb #to visualize the data

Let's import the dataset of scraped Pitchfork pages and take a look at its shape and columns:

In [2]:
pitchfork = pd.read_csv('02_Pitchfork_reviews_11042020.csv', encoding='utf-8')
pitchfork.shape

(22367, 8)

The dataset has 22.367 scraped reviews with the following 8 columns:

In [3]:
pitchfork.head(20)

Unnamed: 0,Artist Name,Album Name,Review Score,Best New Music,Genre,Date Published,Written By,Review link
0,Laura Marling,Song for Our Daughter,7.6,,Folk/Country,11/04/2020,Owen Myers,http://pitchfork.com/reviews/albums/laura-marl...
1,Sun Araw,Rock Sutra,7.3,,Experimental,11/04/2020,Daniel Felsenthal,http://pitchfork.com/reviews/albums/sun-araw-r...
2,Joni Mitchell,Shine,8.0,,Rock,11/04/2020,Sam Sodomsky,http://pitchfork.com/reviews/albums/joni-mitch...
3,The Strokes,The New Abnormal,5.7,,Rock,10/04/2020,Sam Sodomsky,http://pitchfork.com/reviews/albums/the-stroke...
4,Everything Is Recorded,Friday Forever,6.1,,Electronic,10/04/2020,Aimee Cliff,http://pitchfork.com/reviews/albums/everything...
5,Mosses,T.V. Sun,7.6,,Folk/Country,10/04/2020,Dave Segal,http://pitchfork.com/reviews/albums/mosses-tv-...
6,Ghostie,Self Hate Wraith,7.2,,Rap,10/04/2020,Mano Sundaresan,http://pitchfork.com/reviews/albums/ghostie-se...
7,Nina Simone,Fodder on My Wings,8.3,Best new reissue,Jazz / Pop/R&B,09/04/2020,Sheldon Pearce,http://pitchfork.com/reviews/albums/nina-simon...
8,Sam Hunt,Southside,7.5,,Folk/Country,09/04/2020,Natalie Weiner,http://pitchfork.com/reviews/albums/sam-hunt-s...
9,Phish,Sigma Oasis,6.5,,Rock,09/04/2020,Sam Sodomsky,http://pitchfork.com/reviews/albums/phish-sigm...


However, I will be looking only at the following ones: **Artist Name**, **Album Name**, **Review Score**, **Best New Music**, **Genre**, **Date Published** and **Written By**.

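**Review link** was scraped as a precaution in case some of the pages have to be re-scraped or accessed manually later. Let's remove it now. Note that the slice below keeps only the first six columns, so **Written By** is also left out of `pitchfork_dataset` at this point: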

In [4]:
pitchfork_dataset = pitchfork.iloc[:,0:6].copy() #.copy() avoids SettingWithCopyWarning on later column assignments
pitchfork_dataset.head(20)

Unnamed: 0,Artist Name,Album Name,Review Score,Best New Music,Genre,Date Published
0,Laura Marling,Song for Our Daughter,7.6,,Folk/Country,11/04/2020
1,Sun Araw,Rock Sutra,7.3,,Experimental,11/04/2020
2,Joni Mitchell,Shine,8.0,,Rock,11/04/2020
3,The Strokes,The New Abnormal,5.7,,Rock,10/04/2020
4,Everything Is Recorded,Friday Forever,6.1,,Electronic,10/04/2020
5,Mosses,T.V. Sun,7.6,,Folk/Country,10/04/2020
6,Ghostie,Self Hate Wraith,7.2,,Rap,10/04/2020
7,Nina Simone,Fodder on My Wings,8.3,Best new reissue,Jazz / Pop/R&B,09/04/2020
8,Sam Hunt,Southside,7.5,,Folk/Country,09/04/2020
9,Phish,Sigma Oasis,6.5,,Rock,09/04/2020
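Positional slicing depends on the order of the columns, so it can be worth comparing it with a name-based selection. The snippet below is a minimal sketch on a two-row toy frame that reuses the column names and values shown above; the link values are placeholders and `by_position` / `by_name` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy frame with the same column names as the scraped dataset
reviews = pd.DataFrame({
    "Artist Name": ["Laura Marling", "Sun Araw"],
    "Album Name": ["Song for Our Daughter", "Rock Sutra"],
    "Review Score": [7.6, 7.3],
    "Best New Music": [None, None],
    "Genre": ["Folk/Country", "Experimental"],
    "Date Published": ["11/04/2020", "11/04/2020"],
    "Written By": ["Owen Myers", "Daniel Felsenthal"],
    "Review link": ["http://example.com/a", "http://example.com/b"],  # placeholders
})

# Positional selection: the first six columns only
by_position = reviews.iloc[:, 0:6]

# Name-based selection: drop just the column that is no longer needed
by_name = reviews.drop(columns=["Review link"])

print(list(by_position.columns))
print(list(by_name.columns))
```

The name-based variant keeps every column except the one listed, regardless of where it sits in the table.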


We need to inspect the data types of `pitchfork_dataset` to see if there are any errors:

In [5]:
pitchfork_dataset.dtypes

Artist Name        object
Album Name         object
Review Score      float64
Best New Music     object
Genre              object
Date Published     object
dtype: object

We expect **Artist Name**, **Album Name**, **Best New Music** and **Genre** to contain only strings, so the `object` type is correct for them.

**Best New Music** will be empty most of the time because only a few albums get this recognition.

However, **Date Published** was read in as plain text (`object`). Let's convert it with `pd.to_datetime()` into a proper date type, `datetime64[ns]`, so that each element of the date can be accessed individually:

In [6]:
pitchfork_dataset['Date Published'] = pd.to_datetime(pitchfork_dataset['Date Published'], format='%d/%m/%Y')
pitchfork_dataset.dtypes

Artist Name               object
Album Name                object
Review Score             float64
Best New Music            object
Genre                     object
Date Published    datetime64[ns]
dtype: object

Now all columns have correct data types!
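As a small illustration of what the conversion makes possible, the sketch below parses a few day-first date strings in the same `%d/%m/%Y` format and reads the year and month through the `.dt` accessor. The series is toy data and the variable names are illustrative only.

```python
import pandas as pd

# Toy day-first date strings in the same format as the dataset
dates = pd.Series(["11/04/2020", "09/04/2020", "05/01/1999"], name="Date Published")

# Parse with an explicit format so 11/04 is read as 11 April, not 4 November
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# Individual date elements are available through the .dt accessor
years = parsed.dt.year
months = parsed.dt.month

print(years.tolist())
print(months.tolist())
```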

Let's start the analysis by looking at the genres in `pitchfork_dataset` and checking whether the data can already be plotted. To do so I first need to remove the duplicates from the **Genre** column:

In [7]:
pitchfork_dataset[['Genre']].drop_duplicates()

Unnamed: 0,Genre
0,Folk/Country
1,Experimental
2,Rock
4,Electronic
6,Rap
...,...
20805,Electronic / Experimental / Rap / Rock
21114,Experimental / Global / Rock
21396,Jazz / Experimental / Folk/Country
21874,Electronic / Metal / Rap / Rock


There are 142 unique **Genre** values in our dataset because albums can be tagged with multiple genres, as shown by one of the examples above, row 20.805: **Electronic / Experimental / Rap / Rock**. This adds complexity that we'll need to deal with later in the analysis to get accurate figures.

I need to understand how Pitchfork arranges its genre tags and whether a list is available somewhere. Conveniently, the genre list is already available on several pages of the website, for example in this filter in the reviews section:

  ![image.png](attachment:image.png)

For ease of access I will use another link where they list all the same genres: **https://pitchfork.com/artists/**. 

Let's create the genre lookup table:

In [8]:
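#download and parse the artists page (the 'lxml' parser has to be installed)
pitchfork_link = BeautifulSoup(urllib.request.urlopen('https://pitchfork.com/artists/').read(), 'lxml')
#every genre is rendered as an <h1> heading with this class
extracted_genres = pitchfork_link.find_all('h1', class_="artist-group__heading")

#collect the heading texts into a plain list
genre_list = []
for genre in extracted_genres:
    genre_list.append(genre.text)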

pitchfork_genres = pd.DataFrame({"Genres of Pitchfork": genre_list})

pitchfork_genres

Unnamed: 0,Genres of Pitchfork
0,Electronic
1,Folk
2,Jazz
3,Pop/R&B
4,Rap/Hip-Hop
5,Experimental
6,Global
7,Metal
8,Rock


As observed above, Pitchfork staff use only 9 genres to organize all artists and albums, yet we have 142 distinct values because authors can tag an album with several genres.
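One detail visible in the two tables is that the scraped list spells two genres differently from the review tags: "Folk" versus "Folk/Country" and "Rap/Hip-Hop" versus "Rap". The sketch below shows one possible way to spot and align such differences. The `site_to_dataset` mapping is an assumption made for illustration, and the tag list is typed in by hand from the rows shown above rather than computed from the full dataset.

```python
# Genres as scraped from the artists page
lookup_genres = ["Electronic", "Folk", "Jazz", "Pop/R&B", "Rap/Hip-Hop",
                 "Experimental", "Global", "Metal", "Rock"]

# Single-genre tags that appear in the review rows shown above
dataset_tags = ["Folk/Country", "Experimental", "Rock", "Electronic", "Rap",
                "Jazz", "Pop/R&B", "Global", "Metal"]

# Tags used in reviews that have no exact match in the lookup list
unmatched = sorted(set(dataset_tags) - set(lookup_genres))
print(unmatched)

# Hypothetical mapping from the site spelling to the review-tag spelling
site_to_dataset = {"Folk": "Folk/Country", "Rap/Hip-Hop": "Rap"}
aligned = [site_to_dataset.get(genre, genre) for genre in lookup_genres]

# After alignment every review tag should have a counterpart
still_unmatched = sorted(set(dataset_tags) - set(aligned))
print(still_unmatched)
```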

This means that to create accurate frequency tables and calculate correct average ratings for each of the 9 genres, I will have to parse each **Genre** string individually, check which of the 9 genres it contains, and only then perform the calculation.
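One way this parsing could be approached is to split each tag string on the `" / "` separator (with spaces, so that "Pop/R&B" and "Folk/Country" stay intact) and give every genre its own row. The sketch below uses three rows copied from the tables in this notebook; it illustrates the idea and is not necessarily the method used later in the analysis.

```python
import pandas as pd

# Three reviews taken from the tables in this notebook
reviews = pd.DataFrame({
    "Album Name": ["Fodder on My Wings", "Let Us Replay!", "1999"],
    "Review Score": [8.3, 8.9, 4.8],
    "Genre": ["Jazz / Pop/R&B", "Electronic / Jazz", "Electronic"],
})

# Split on " / " and create one row per (album, genre) pair
exploded = (
    reviews.assign(Genre=reviews["Genre"].str.split(" / "))
    .explode("Genre")
)

# Frequency table and average score per individual genre
genre_counts = exploded["Genre"].value_counts()
genre_means = exploded.groupby("Genre")["Review Score"].mean()

print(genre_counts)
print(genre_means)
```

With this layout an album tagged with several genres contributes its score to each of them, which is a design choice worth keeping in mind when comparing averages.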

Before I proceed with this task I need to address one thing that I noticed while scraping the website: sometimes **Genre** is omitted in the review when the artist is listed as **Various Artists**.
It also seems to me that a missing genre can be caused by some internal agreement among Pitchfork staff, or by a procedural mistake that occurs when categorizing the artist/album is challenging for the author.

For example, none of the **[Wu-Tang](https://pitchfork.com/artists/29705-wu-tang/)** albums are marked with any genre, while all albums marked with **[Wu-Tang Clan](https://pitchfork.com/artists/4628-wu-tang-clan/)** are clearly tagged as **Rap**, even though it is the same artist.

This is the reason why I tagged all such records as "Not found" while scraping the website. I can access such rows through the code below to provide a broader example:

In [9]:
pitchfork_dataset.loc[pitchfork_dataset['Genre'] == 'Not found']

Unnamed: 0,Artist Name,Album Name,Review Score,Best New Music,Genre,Date Published
22,Minor Science,Second Language,7.6,,Not found,2020-04-06
42,Lilly Hiatt,Walking Proof,7.6,,Not found,2020-03-31
44,Cable Ties,Far Enough,6.4,,Not found,2020-03-30
166,Raspberry Bulbs,Before the Age of Mirrors,7.1,,Not found,2020-02-24
247,Chubby & the Gang,Speed Kills,8.0,,Not found,2020-01-29
...,...,...,...,...,...,...
22263,Various Artists,"Oh, Merge: 10 Year Anniversary Compilation",6.9,,Not found,1999-07-06
22264,Quannum,Spectrum,7.5,,Not found,1999-07-06
22272,Brokeback,Field Recordings from the Cook County Water Table,8.5,,Not found,1999-06-20
22274,Soundtrack,Run Lola Run,4.9,,Not found,1999-06-15


It seems reasonable to me to add "Not found" to `pitchfork_genres` and use it later for calculations on **Various Artists** in `pitchfork_dataset`, because their categorization follows a certain logic. The remaining records tagged "Not found" seem to be random and would reduce the precision of the analysis.

Let's complete these steps:

In [10]:
#adding 'Not found' to pitchfork_genres (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
pitchfork_genres = pd.concat([pitchfork_genres, pd.DataFrame({'Genres of Pitchfork': ['Not found']})], ignore_index=True)
pitchfork_genres

Unnamed: 0,Genres of Pitchfork
0,Electronic
1,Folk
2,Jazz
3,Pop/R&B
4,Rap/Hip-Hop
5,Experimental
6,Global
7,Metal
8,Rock
9,Not found


Now there are 10 values in `pitchfork_genres`.

**Various Artists** records tagged "Not found" occupy 776 rows of `pitchfork_dataset`:

In [11]:
pitchfork_dataset.loc[(pitchfork_dataset['Artist Name'] == 'Various Artists') & (pitchfork_dataset['Genre'] == 'Not found')]

Unnamed: 0,Artist Name,Album Name,Review Score,Best New Music,Genre,Date Published
280,Various Artists,Shall Not Fade - 4 Years of Service,7.4,,Not found,2020-01-18
291,Various Artists,Mogadisco - Dancing Mogadishu (Somalia 1972–1991),7.6,,Not found,2020-01-15
375,Various Artists,Until the End of the World (Original Motion Pi...,8.3,,Not found,2019-12-09
383,Various Artists,20 Years of Fabric,6.2,,Not found,2019-12-06
390,Various Artists,HyperSwim,7.8,,Not found,2019-12-04
...,...,...,...,...,...,...
22016,Various Artists,The Virgin Suicides,4.8,,Not found,2000-05-12
22081,Various Artists,High Fidelity OST,7.2,,Not found,2000-03-28
22118,Various Artists,Clicks and Cuts,5.0,,Not found,2000-02-08
22215,Various Artists,Everything is Nice,4.4,,Not found,1999-09-14


Other artists tagged "Not found" occupy 1563 rows:

In [12]:
pitchfork_dataset.loc[(pitchfork_dataset['Artist Name'] != 'Various Artists') & (pitchfork_dataset['Genre'] == 'Not found')]

Unnamed: 0,Artist Name,Album Name,Review Score,Best New Music,Genre,Date Published
22,Minor Science,Second Language,7.6,,Not found,2020-04-06
42,Lilly Hiatt,Walking Proof,7.6,,Not found,2020-03-31
44,Cable Ties,Far Enough,6.4,,Not found,2020-03-30
166,Raspberry Bulbs,Before the Age of Mirrors,7.1,,Not found,2020-02-24
247,Chubby & the Gang,Speed Kills,8.0,,Not found,2020-01-29
...,...,...,...,...,...,...
22241,Timeout Drawer,Record of Small Histories,7.1,,Not found,1999-08-03
22264,Quannum,Spectrum,7.5,,Not found,1999-07-06
22272,Brokeback,Field Recordings from the Cook County Water Table,8.5,,Not found,1999-06-20
22274,Soundtrack,Run Lola Run,4.9,,Not found,1999-06-15


This means that the correct size of `pitchfork_dataset` should be 22.367 - 1.563 = **20.804** once we remove the untagged artists/albums:

In [13]:
pitchfork_dataset = pitchfork_dataset.drop(
    pitchfork_dataset.loc[
        (pitchfork_dataset['Artist Name'] != 'Various Artists') & (pitchfork_dataset['Genre'] == 'Not found')
    ].index)
pitchfork_dataset

Unnamed: 0,Artist Name,Album Name,Review Score,Best New Music,Genre,Date Published
0,Laura Marling,Song for Our Daughter,7.6,,Folk/Country,2020-04-11
1,Sun Araw,Rock Sutra,7.3,,Experimental,2020-04-11
2,Joni Mitchell,Shine,8.0,,Rock,2020-04-11
3,The Strokes,The New Abnormal,5.7,,Rock,2020-04-10
4,Everything Is Recorded,Friday Forever,6.1,,Electronic,2020-04-10
...,...,...,...,...,...,...
22362,Cassius,1999,4.8,,Electronic,1999-01-26
22363,Coldcut,Let Us Replay!,8.9,,Electronic / Jazz,1999-01-26
22364,Don Caballero,"Singles Breaking Up, Vol. 1",7.2,,Experimental / Metal / Rock,1999-01-12
22365,Mojave 3,Out of Tune,6.3,,Rock,1999-01-12
