**Pitchfork review analysis**

The goal of this analysis is to find out whether Pitchfork authors are biased towards particular music genres, and then whether the review data can be used to build a model that predicts an album's rating from two variables:

- Name of the author
- Name of the music genre

The dataset used in this notebook was scraped from www.pitchfork.com on 11.04.2020. The script to do so is available here:  

In [3]:
import pandas as pd
import datetime as dt
from bs4 import BeautifulSoup
import urllib.request

Let's import the dataset of scraped Pitchfork pages and take a look at its shape:

In [4]:
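# Load the scraped reviews from the CSV file
csv_file = pd.read_csv('02_Pitchfork_reviews_11042020.csv', encoding='utf-8')
# pitchfork is a second name for the same DataFrame, not a copy
pitchfork = csv_file
pitchfork.shape  # (number of rows, number of columns)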

(22367, 8)

In [5]:
pitchfork.head()

| | Artist Name | Album Name | Review Score | Best New Music | Genre | Date Published | Written By | Review link |
|---|---|---|---|---|---|---|---|---|
| 0 | Laura Marling | Song for Our Daughter | 7.6 | | Folk/Country | 11/04/2020 | Owen Myers | http://pitchfork.com/reviews/albums/laura-marl... |
| 1 | Sun Araw | Rock Sutra | 7.3 | | Experimental | 11/04/2020 | Daniel Felsenthal | http://pitchfork.com/reviews/albums/sun-araw-r... |
| 2 | Joni Mitchell | Shine | 8.0 | | Rock | 11/04/2020 | Sam Sodomsky | http://pitchfork.com/reviews/albums/joni-mitch... |
| 3 | The Strokes | The New Abnormal | 5.7 | | Rock | 10/04/2020 | Sam Sodomsky | http://pitchfork.com/reviews/albums/the-stroke... |
| 4 | Everything Is Recorded | Friday Forever | 6.1 | | Electronic | 10/04/2020 | Aimee Cliff | http://pitchfork.com/reviews/albums/everything... |
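
The preview already contains the two variables the model is meant to use, **Written By** and **Genre**. As a rough illustration of what "predicting a rating from author and genre" could mean in its simplest form, the sketch below computes the mean score per (author, genre) pair on the five preview rows. The `predict` helper and the fallback to the overall mean are assumptions made for this example, not the model built in this notebook.

```python
import pandas as pd

# Five rows copied from the preview of the scraped dataset.
sample = pd.DataFrame({
    "Written By": ["Owen Myers", "Daniel Felsenthal", "Sam Sodomsky",
                   "Sam Sodomsky", "Aimee Cliff"],
    "Genre": ["Folk/Country", "Experimental", "Rock", "Rock", "Electronic"],
    "Review Score": [7.6, 7.3, 8.0, 5.7, 6.1],
})

# Baseline: the mean score for every (author, genre) pair seen in the data.
baseline = sample.groupby(["Written By", "Genre"])["Review Score"].mean()


def predict(author, genre):
    """Hypothetical helper: pair mean if the pair was seen, else the overall mean."""
    key = (author, genre)
    if key in baseline.index:
        return float(baseline.loc[key])
    return float(sample["Review Score"].mean())


print(predict("Sam Sodomsky", "Rock"))
print(predict("Unknown Author", "Jazz"))
```

With only five rows this says nothing about real bias; it only shows the shape of the computation.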


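We will be looking only at the following columns: **Artist Name**, **Album Name**, **Review Score**, **Best New Music**, **Genre**, **Date Published** and **Written By**.

Let's remove the **Review link**. Note that the positional slice below keeps only the first six columns, so **Written By** is left out of `pitchfork_dataset` as well; it remains available in the original `pitchfork` frame: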

In [6]:
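# Keep the first six columns. .copy() makes an independent frame, so the column
# assignment further down cannot raise SettingWithCopyWarning in older pandas versions
pitchfork_dataset = pitchfork.iloc[:, 0:6].copy()
pitchfork_dataset.head()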

| | Artist Name | Album Name | Review Score | Best New Music | Genre | Date Published |
|---|---|---|---|---|---|---|
| 0 | Laura Marling | Song for Our Daughter | 7.6 | | Folk/Country | 11/04/2020 |
| 1 | Sun Araw | Rock Sutra | 7.3 | | Experimental | 11/04/2020 |
| 2 | Joni Mitchell | Shine | 8.0 | | Rock | 11/04/2020 |
| 3 | The Strokes | The New Abnormal | 5.7 | | Rock | 10/04/2020 |
| 4 | Everything Is Recorded | Friday Forever | 6.1 | | Electronic | 10/04/2020 |
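
Selecting by position depends on the column order in the CSV file. If the aim is to remove exactly one column, dropping it by name is a possible alternative. The sketch below compares the two approaches on a single row taken from the preview; the variable names `demo`, `by_position` and `by_name` are only for illustration.

```python
import pandas as pd

# One row from the preview, with the same eight column names as the dataset.
demo = pd.DataFrame([{
    "Artist Name": "The Strokes",
    "Album Name": "The New Abnormal",
    "Review Score": 5.7,
    "Best New Music": None,
    "Genre": "Rock",
    "Date Published": "10/04/2020",
    "Written By": "Sam Sodomsky",
    "Review link": "http://pitchfork.com/reviews/albums/the-stroke...",
}])

by_position = demo.iloc[:, 0:6]               # first six columns only
by_name = demo.drop(columns=["Review link"])  # everything except the named column

print(list(by_position.columns))
print(list(by_name.columns))
```

The name-based version keeps **Written By**, which the positional slice leaves out.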


Let's inspect the data types in the Pitchfork dataset:

In [7]:
pitchfork_dataset.dtypes

Artist Name        object
Album Name         object
Review Score      float64
Best New Music     object
Genre              object
Date Published     object
dtype: object

As expected, **Artist Name**, **Album Name**, **Best New Music** and **Genre** are stored as `object`, the dtype pandas uses here for strings, and **Review Score** is a float. **Best New Music** will be empty most of the time, because only a few albums receive this recognition.
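
To check how sparse a column such as **Best New Music** is, one can count its missing values. The sketch below uses a small made-up CSV (the album names and the non-empty label are invented for illustration) to show that empty cells are read as missing values and how to measure their share.

```python
import io

import pandas as pd

# Made-up CSV text: three of the four rows have an empty "Best New Music" cell.
csv_text = (
    "Album Name,Best New Music\n"
    "A,\n"
    "B,Best New Music\n"
    "C,\n"
    "D,\n"
)
toy = pd.read_csv(io.StringIO(csv_text))

# isna() marks the empty cells; the mean of the boolean mask is their share.
missing_share = toy["Best New Music"].isna().mean()
print(missing_share)
```

The same expression applied to `pitchfork_dataset['Best New Music']` would give the share of reviews without the label.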

However, **Date Published** was also read in as `object`, that is, as plain strings rather than dates. Let's convert it to a proper date type, `datetime64[ns]`:

In [8]:
pitchfork_dataset['Date Published'] = pd.to_datetime(pitchfork_dataset['Date Published'], format='%d/%m/%Y')
pitchfork_dataset.dtypes

Artist Name               object
Album Name                object
Review Score             float64
Best New Music            object
Genre                     object
Date Published    datetime64[ns]
dtype: object

Now all columns have correct data types!
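
The explicit `format='%d/%m/%Y'` matters because a string such as `11/04/2020` is ambiguous. The sketch below parses the two date strings from the preview with a day-first and a month-first format to show how different the results are; the variable names are only for illustration.

```python
import pandas as pd

# Date strings as they appear in the dataset preview.
raw = pd.Series(["11/04/2020", "10/04/2020"])

day_first = pd.to_datetime(raw, format="%d/%m/%Y")    # the format used in this notebook
month_first = pd.to_datetime(raw, format="%m/%d/%Y")  # a US-style reading of the same text

print(day_first.dt.strftime("%Y-%m-%d").tolist())
print(month_first.dt.strftime("%Y-%m-%d").tolist())
```

The day-first reading gives April 2020, which matches the scrape date of 11.04.2020; the month-first reading would place the same reviews in October and November 2020.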

Next, let's fetch the list of genres that Pitchfork uses on its artists page, https://pitchfork.com/artists/:

In [10]:
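# Download the artists page and parse it; the with-block closes the connection afterwards
with urllib.request.urlopen('https://pitchfork.com/artists/') as response:
    pitchfork_link = BeautifulSoup(response.read(), 'lxml')

# Select the <h1> elements with class "artist-group__heading" and keep their text
extracted_genres = pitchfork_link.find_all('h1', class_="artist-group__heading")
genre_list = [genre.text for genre in extracted_genres]

pitchfork_genres = pd.DataFrame({"Genres of Pitchfork": genre_list})

pitchfork_genres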

| | Genres of Pitchfork |
|---|---|
| 0 | Electronic |
| 1 | Folk |
| 2 | Jazz |
| 3 | Pop/R&B |
| 4 | Rap/Hip-Hop |
| 5 | Experimental |
| 6 | Global |
| 7 | Metal |
| 8 | Rock |
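
A quick sanity check one might run at this point is whether the genre labels in the review data match the labels scraped from the artists page. The sketch below does this with plain Python sets, using only the labels visible in this notebook: the nine scraped genres and the genres of the five preview rows. It is an illustration on those few values, not a check of the full dataset.

```python
# Genre labels copied from the scraped list of Pitchfork genres.
scraped_genres = {
    "Electronic", "Folk", "Jazz", "Pop/R&B", "Rap/Hip-Hop",
    "Experimental", "Global", "Metal", "Rock",
}

# Genre labels that appear in the five-row preview of the review data.
preview_genres = {"Folk/Country", "Experimental", "Rock", "Electronic"}

# Labels used in the reviews that have no exact match in the scraped list.
unmatched = sorted(preview_genres - scraped_genres)
print(unmatched)
```

On these values the only mismatch is `Folk/Country`, which the artists page lists as `Folk`. On the full data the same comparison could be run with `set(pitchfork_dataset['Genre'].dropna())`.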
