# Step 1 - Fetch URL components 

Pitchfork keeps all of its album reviews in one section of its website, https://pitchfork.com/reviews/albums/. To reach each individual review and scrape the information I needed, I first had to gather a list of every review's URL. By exploring Pitchfork's website, I found the URL components embedded in the site's HTML. The cells below find those URL components, scrape them from Pitchfork's website, save them to a dataframe, and export that dataframe from this notebook as a CSV file.

**Import necessary libraries**

In [1]:
# Make necessary imports

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

**Define a function to scrape www.pitchfork.com and find URL components**

In [25]:
# Create an empty list to append URLs to
urls = []

# Create a variable for the section of Pitchfork's website where reviews are kept
pitchfork = 'https://pitchfork.com/reviews/albums/?page='

# Define a function to fetch review URL components from each page of reviews
def fetch_urls(website):
    
    # Determine how many pages of reviews to scrape (pages 1 through 1799)
    nums = range(1, 1800)
    
    # Loop through each page number
    for num in nums:
        
        # Turn number to string so it can be included in URL
        num = str(num)
    
        # Request content from the page URL passed in as `website`
        res = requests.get(website + num)
    
        # Parse the response content with Beautiful Soup
        soup = BeautifulSoup(res.content, 'lxml')
    
        # Find all review links on the current page
        # Method found on Stack Overflow: https://stackoverflow.com/questions/5041008/how-to-find-elements-by-class
        revs = soup.find_all('a', {'class': 'review__link'})
    
        # Narrow down to just the href portion of each link
        revs = [rev.get('href') for rev in revs]
        
        # Append each href to the urls list defined above
        urls.extend(revs)
    
        # Pause between requests so as not to overburden Pitchfork's servers (currently commented out)
        #time.sleep(3)
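To make the class-based link filter easier to follow, here is a minimal offline sketch of the same idea. It swaps Beautiful Soup for the standard library's `html.parser` so it runs without a network connection or extra packages. The `ReviewLinkParser` class and the sample markup are hypothetical and invented for illustration; only the `review__link` class name comes from the function above.

```python
from html.parser import HTMLParser


class ReviewLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class list includes a target class."""

    def __init__(self, target_class="review__link"):
        super().__init__()
        self.target_class = target_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags can carry the review links
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated class names
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.target_class in classes and href:
            self.hrefs.append(href)


# Invented sample markup (an assumption, not copied from Pitchfork)
sample_html = """
<div class="review">
  <a class="review__link" href="/reviews/albums/example-artist-example-album/">Example</a>
  <a class="nav__link" href="/news/">News</a>
  <a class="review__link featured" href="/reviews/albums/another-artist-another-album/">Another</a>
</div>
"""

parser = ReviewLinkParser()
parser.feed(sample_html)
print(parser.hrefs)
```

Only the two anchors carrying the `review__link` class are kept; the navigation link is skipped.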

In [26]:
# Call the function on the predefined URL
fetch_urls(pitchfork)
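One possible variation on the loop, sketched under assumptions: instead of appending to a list outside the function, the function below returns its results, pauses between pages, and stops at the first page that yields no links. The names `collect_paths`, `fetch_page`, and `extract_paths` are hypothetical, and the fetcher is replaced by a stand-in dictionary so the sketch runs offline. This is not the function used to produce the results in this notebook.

```python
import time


def collect_paths(fetch_page, extract_paths, first_page=1, last_page=3, delay=0.0):
    """Walk numbered pages and stop at the first page that yields no paths."""
    collected = []
    for page in range(first_page, last_page + 1):
        html = fetch_page(page)
        paths = extract_paths(html)
        if not paths:
            # An empty page is treated as the end of the listing
            break
        collected.extend(paths)
        # Pause between requests; set delay to a few seconds for a real site
        time.sleep(delay)
    return collected


# Stand-ins for the network call and the HTML parser (assumptions for the demo)
fake_site = {1: "a,b", 2: "c", 3: ""}


def fake_fetch(page):
    return fake_site.get(page, "")


def fake_extract(html):
    return [part for part in html.split(",") if part]


result = collect_paths(fake_fetch, fake_extract, last_page=5)
print(result)
```

In real use, `fetch_page` could wrap `requests.get` and `extract_paths` could wrap the Beautiful Soup lookup from the function above.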

**Check work**

In [27]:
# How many URLs were compiled?
len(urls)

21599

In [36]:
# Check the first 5 to make sure they look as they should
urls[0:5]

['/reviews/albums/perfume-genius-set-my-heart-on-fire-immediately/',
 '/reviews/albums/nick-hakim-will-this-make-me-good/',
 '/reviews/albums/im-glad-its-you-every-sun-every-moon/',
 '/reviews/albums/jim-white-marisa-anderson-the-quickening/',
 '/reviews/albums/the-human-league-dare/']
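The scraped values are relative paths rather than full URLs. A minimal sketch of how they could later be turned into full review URLs, assuming the base domain `https://pitchfork.com` from the address given in the introduction; the duplicate entry is added on purpose to show an order-preserving de-duplication step, and the variable names are hypothetical.

```python
from urllib.parse import urljoin

# Base domain, assumed from the reviews URL in the introduction
BASE_URL = "https://pitchfork.com"

paths = [
    "/reviews/albums/perfume-genius-set-my-heart-on-fire-immediately/",
    "/reviews/albums/nick-hakim-will-this-make-me-good/",
    "/reviews/albums/nick-hakim-will-this-make-me-good/",  # deliberate duplicate for the demo
]

# Drop duplicates while keeping the original order
unique_paths = list(dict.fromkeys(paths))

# Join each relative path onto the base domain
full_urls = [urljoin(BASE_URL, path) for path in unique_paths]

for full_url in full_urls:
    print(full_url)
```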

**Save to dataframe and export as a CSV file**

In [30]:
# Create a dataframe with a single column in which to store the URLs
review_urls = pd.DataFrame(columns = ['url'])

In [32]:
# Add the gathered URLs to the column
review_urls['url'] = urls

In [33]:
# Check the final dataframe
review_urls

| | url |
|---|---|
| 0 | /reviews/albums/perfume-genius-set-my-heart-on... |
| 1 | /reviews/albums/nick-hakim-will-this-make-me-g... |
| 2 | /reviews/albums/im-glad-its-you-every-sun-ever... |
| 3 | /reviews/albums/jim-white-marisa-anderson-the-... |
| 4 | /reviews/albums/the-human-league-dare/ |
| ... | ... |
| 21594 | /reviews/albums/1527-ten-new-songs/ |
| 21595 | /reviews/albums/8150-triangle/ |
| 21596 | /reviews/albums/7166-born-into-trouble-as-the-... |
| 21597 | /reviews/albums/3143-kekeland/ |


In [35]:
# Export results as a CSV file
review_urls.to_csv(r'./datasets/partial_pitchfork_urls.csv')
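A note on reading the file back in, sketched with an in-memory buffer instead of the real file: `to_csv` writes the dataframe index by default, so a plain `pd.read_csv` brings it back as an extra `Unnamed: 0` column. The one-row `demo` dataframe and the variable names below are assumptions for illustration.

```python
import io

import pandas as pd

# One-row stand-in for the review_urls dataframe
demo = pd.DataFrame({"url": ["/reviews/albums/the-human-league-dare/"]})

# With no path argument, to_csv returns the CSV text (index included by default)
csv_with_index = demo.to_csv()

# Reading it back without options turns the saved index into a column
reloaded_default = pd.read_csv(io.StringIO(csv_with_index))

# Option 1: tell read_csv that the first column is the index
reloaded_indexed = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: leave the index out when writing
csv_without_index = demo.to_csv(index=False)
reloaded_plain = pd.read_csv(io.StringIO(csv_without_index))

print(list(reloaded_default.columns))
print(list(reloaded_indexed.columns))
print(list(reloaded_plain.columns))
```

Either option works; the choice mainly depends on whether the row numbers are worth keeping in the saved file.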