# Step 2 - Fetch Reviews from Pitchfork's website

Now that we have a list of URLs, each corresponding to an individual review on pitchfork.com, we can write a function that fetches specific fields from each review's page on Pitchfork's website.

In [3]:
# Make necessary imports

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

**Read in our CSV**

In [62]:
# Read in the CSV of partial review URLs
partial_urls = pd.read_csv('./datasets/partial_pitchfork_urls.csv')

**Define Function**

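The function defined below takes a list of partial URLs from pitchfork.com and returns a dataframe with one row per review, holding the album, artist, score, genre, date, label, author, and review text.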

In [148]:
# Base URL of the Pitchfork website; each partial review URL is appended to it
pitchfork = 'https://pitchfork.com'

def scrape_pitchfork(list_of_urls):
    
    # Create a dataframe to store all the reviews
    df = pd.DataFrame(columns=['album', 'artist', 'score', 'genre', 'date', 'label', 'author', 'review'])
    
    for url in list_of_urls:

        # Request content from website
        # Use try/except in case the request fails
        try:
            res = requests.get(pitchfork + url)
        except:
            pass
        
        # Instantiate and get content from soup item
        soup = BeautifulSoup(res.content, 'lxml')
        
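        # Using try/except, capture the following information from the review page
        # Try/except lets this function keep running if errors come up,
        # i.e., if a review is missing information such as genre, label, or author.
        # Note: when that happens, `review` still holds the previous iteration's
        # values, so the previous row is appended again (see Duplicates below)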
        try:
            review = {

                'album' : soup.find('h1', {'class': 'single-album-tombstone__review-title'}).text,
                'artist' : soup.find('ul', {'class': 'artist-links artist-list single-album-tombstone__artist-links'}).text,
                'score' : soup.find('span', {'class': 'score'}).text,
                'genre' : soup.find('a', {'class': 'genre-list__link'}).text,
                'date' : soup.find('span', {'class': 'single-album-tombstone__meta-year'}).text[3:],
                'label' : soup.find('li', {'class': 'labels-list__item'}).text,
                'author' : soup.find('a', {'class': 'authors-detail__display-name'}).text,
                'review' : soup.find('div', {'class': 'contents dropcap'}).text.replace('\n', ' ')
            }
        
        except:
            pass
        
        # Add the newly fetched review to our dataframe
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        df = pd.concat([df, pd.DataFrame([review])], ignore_index=True)
        
    return df
    

In [156]:
# Call the function to scrape our list of URLs
reviews = scrape_pitchfork(partial_urls['url'])

**Check work**

How did the function perform?

In [157]:
# Check the shape -- how many reviews did we get?
reviews.shape

(21599, 8)

In [158]:
# Check the head of the dataframe
reviews.head()

| | album | artist | score | genre | date | label | author | review |
|---|---|---|---|---|---|---|---|---|
| 0 | Set My Heart on Fire Immediately | Perfume Genius | 9.0 | Pop/R&B | 2020 | Matador | Madison Bloom | Each Perfume Genius album is a metamorphosis. ... |
| 1 | WILL THIS MAKE ME GOOD | Nick Hakim | 6.0 | Rock | 2020 | ATO | Jonah Bromwich | Nick Hakim’s compulsively listenable debut alb... |
| 2 | Every Sun, Every Moon | I'm Glad It's You | 7.4 | Rock | 2020 | 6131 | Arielle Gordon | Grief casts a shadow over the past. It lends n... |
| 3 | The Quickening | Jim WhiteMarisa Anderson | 7.8 | Rock | 2020 | Thrill Jockey | Jesse Jarnow | The Quickening opens with an ecstatic swirl of... |
| 4 | Dare | The Human League | 9.1 | Electronic | 1981 | Virgin | Brad Nelson | In late 1980, the singer Philip Oakey was sche... |


In [159]:
# Check the tail of the dataframe
reviews.tail()

| | album | artist | score | genre | date | label | author | review |
|---|---|---|---|---|---|---|---|---|
| 21594 | Ten New Songs | Leonard Cohen | 8.0 | Rock | ¢ 2001 | Columbia | Dominique Leone | I should get one thing out of the way before t... |
| 21595 | | Triangle | 2.7 | Rock | ¢ 2001 | File-13 | Dan Kilian | In the fringes of obscurity, a new battle will... |
| 21596 | | Triangle | 2.7 | Rock | ¢ 2001 | File-13 | Dan Kilian | In the fringes of obscurity, a new battle will... |
| 21597 | Kekeland | Brigitte Fontaine | 8.0 | Pop/R&B | 2001 | Virgin | Andi Rowlands | With the recent rebirth of Paris and its arts,... |
| 21598 | Please Smile My Noise Bleed | Múm | 8.7 | Electronic | 2001 | Morr | Christopher F. Schiel | Here's the setup: Icelandic foursome, describe... |
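
Some dates in the tail come through as `¢ 2001`, which suggests the fixed `[3:]` slice in the scraper does not always strip the whole prefix. One possible cleanup is to pull out the four-digit year with a regular expression instead of relying on character positions. This is a minimal sketch; `extract_year` is a hypothetical helper, not part of the function above, and it assumes every date contains a year between 1900 and 2099.

```python
import re

def extract_year(raw):
    """Return the first four-digit year found in raw as a string, or None."""
    match = re.search(r'\b(?:19|20)\d{2}\b', raw)
    return match.group(0) if match else None

# Works whether or not stray prefix characters survive the slice
clean = [extract_year(value) for value in ['¢ 2001', '2020', ' 1981']]
```

Applied to the real data, this could be used as `reviews['date'].map(extract_year)`, assuming the column holds strings.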


**Duplicates**

The function creates duplicate rows, some of which are visible in the dataframe output above (for example, rows 21595 and 21596 in the tail). They most likely arise when the scraper hits an error: the try/except clause swallows the exception, `review` keeps its value from the previous iteration, and that row is appended a second time. I noticed this pattern while testing the function before pulling all reviews.

We'll drop duplicates before moving on to EDA and further data analysis.
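
The duplicates could also be avoided at the source by skipping a URL when something goes wrong, rather than falling through with the previous row. The sketch below shows only that control flow. `collect_rows`, `fake_fetch`, and `fake_parse` are hypothetical names: `fetch` stands in for the `requests.get` call and `parse` for the BeautifulSoup lookups, so the example runs without network access.

```python
def collect_rows(urls, fetch, parse):
    """Return (rows, failed): one parsed row per good URL, plus the URLs that failed."""
    rows, failed = [], []
    for url in urls:
        try:
            row = parse(fetch(url))
        except Exception:
            # Record the failure and move on, so no stale row is re-appended
            failed.append(url)
            continue
        rows.append(row)
    return rows, failed


# Toy stand-ins for the request and parsing steps
def fake_fetch(url):
    if url == '/bad-request':
        raise ConnectionError('request failed')
    return {'album': url.strip('/')}

def fake_parse(page):
    if page['album'] == 'missing-field':
        raise AttributeError('tag not found')
    return page

rows, failed = collect_rows(
    ['/a', '/bad-request', '/missing-field', '/b'], fake_fetch, fake_parse
)
```

Collecting dicts in a list and building the dataframe once at the end with `pd.DataFrame(rows)` also avoids growing a dataframe row by row. The `failed` list makes it possible to retry or inspect the problem URLs afterwards.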

In [161]:
# Drop duplicates
reviews = reviews.drop_duplicates()

In [162]:
# The cell above dropped 2,368 reviews
reviews.shape

(19231, 8)
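
The number of rows removed can be checked before dropping anything: `duplicated()` flags every repeat of an earlier row, so its sum equals the number of rows `drop_duplicates()` will remove. A small sketch on a toy dataframe (the `toy` frame is illustrative, with values borrowed from the tail above, not the full dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'album': ['Ten New Songs', '', '', 'Kekeland'],
    'artist': ['Leonard Cohen', 'Triangle', 'Triangle', 'Brigitte Fontaine'],
})

# Count rows that exactly repeat an earlier row
n_dupes = int(toy.duplicated().sum())

# Keep only the first occurrence of each row
deduped = toy.drop_duplicates()
```

On the real data, `reviews.duplicated().sum()` run before the drop should match the difference between the two shapes.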

In [164]:
# Look for any null values
reviews.isnull().sum()

album     0
artist    0
score     0
genre     0
date      0
label     0
author    0
review    0
dtype: int64
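
Note that `isnull()` only counts missing values such as `None` or `NaN`; an empty string is not null. The tail above shows a blank album for the Triangle review, so it may be worth checking for blank strings too. A minimal sketch on a toy dataframe, assuming the blank is stored as an empty string (an assumption; the `toy` frame and the 'Example Artist' row are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'album': ['Kekeland', '', None],
    'artist': ['Brigitte Fontaine', 'Triangle', 'Example Artist'],
})

# Missing values: counts None/NaN only
null_albums = int(toy['album'].isnull().sum())

# Blank values: empty or whitespace-only strings, which isnull() ignores
blank_albums = int(toy['album'].str.strip().eq('').sum())
```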

In [165]:
# Save final CSV to dataset folder
reviews.to_csv(r'./datasets/pitchfork_reviews.csv')