## Pitchfork Reviews

- Using pitchfork-api (https://github.com/tejassharma96/pitchfork_api/) to download reviews.
    - Unfortunately, it doesn't let you fetch a set number of reviews for data collection.
- There is a dataset of Pitchfork reviews on Kaggle from 2017 (https://www.kaggle.com/nolanbconaway/pitchfork-data/home, 18.4k reviews), but I want to look at music from 2018.
- I plan to eventually add this updated dataset to Kaggle.




In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pitchfork_api

import urllib.request  # urlopen lives in the request submodule
from bs4 import BeautifulSoup

In [2]:
p = pitchfork_api.search('kanye west', 'my beautiful') # the title is autocompleted

In [3]:
# Printing stats for album review
print("Score ", p.score())
print("Abstract, ", p.abstract())
# print("Editorial, ", p.editorial())
print("Full Text, ", p.full_text())
# # Ideally .cover() should get a link to cover art but this seems to not work?
# print("Cover Link, ", p.cover())
print("Artist, ", p.artist())
print("Album, ", p.album())
print("Label, ", p.label())
print("Year, ", p.year())
# Only 200 characters difference in length
print("Length editorial, ", len(p.editorial()))
print("Length full text, ", len(p.full_text()))

Score  10.0
Abstract,  Kanye's big year culminates in an LP that feels like an instant greatest hits, the ultimate realization of his strongest talents and divisive public persona.

Full Text,  Kanye's big year culminates in an LP that feels like an instant greatest hits, the ultimate realization of his strongest talents and divisive public persona.
Kanye West's 35-minute super-video, Runaway, peaks with a parade. Fireworks flash while red hoods march through a field. At the center of the spectacle is a huge, pale, cartoonish rendering of Michael Jackson's head. My Beautiful Dark Twisted Fantasy's gargantuan "All of the Lights" soundtracks the procession, with Kanye pleading, "Something wrong, I hold my head/ MJ gone, our nigga dead." The tribute marks another chapter in West's ongoing obsession with the King of Pop.
West's discography contains innumerable references and allusions to Jackson. His first hit as a producer, Jay-Z's "Izzo (H.O.V.A.)", sampled the Jackson 5's "I Want You Ba

## Data Collection / Web Scraping


In [4]:
# this page is infinite 
url = "https://pitchfork.com/reviews/albums/"
baseUrl = "https://pitchfork.com"
# https://pitchfork.com/reviews/albums/?page=1

# each review has artist-name-album-name in url
# "https://pitchfork.com/reviews/albums/avril-lavigne-let-go/"

# query the website and store the response in the variable `page`
page = urllib.request.urlopen(url)

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')



In [5]:
# Gets metadata for one review and returns it as a dictionary
def getReviewData(review):
    data = {}
    data['album'] = review.find("h2", {"class": "review__title-album"}).text
    data['artist'] = review.find("ul", {"class": "artist-list review__title-artist"}).li.text
    try:
        data['genre'] = review.find("li", {"class": "genre-list__item"}).text
    except AttributeError:
        # find() returned None: this review has no genre tag
        data['genre'] = ""
    data['link'] = review.find("a", {"class": "review__link"}).get('href')
    data['published'] = review.find("time", {"class": "pub-date"}).get('datetime')
    data['full text'] = ""
    data['score'] = 0
    data['year'] = 0
    data['abstract'] = ""
    return data

# Fetches n pages of review listings and returns the metadata as a list of dicts
def scanReviews(n=10):
    metadata = []
    # Loop through the first n listing pages
    for i in range(n):
        if i%100==0:
            print(i)
        tempURL = "https://pitchfork.com/reviews/albums/?page="+str(i+1)
        #print(tempURL)
        page = urllib.request.urlopen(tempURL)
        soup = BeautifulSoup(page, 'html.parser')
        reviews = soup.findAll("div", {"class": "review"})
        for review in reviews:
            metadata.append(getReviewData(review))
    print("n: ", n)
    print("reviews fetched: ", len(metadata))
    return metadata 
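
`scanReviews` issues one request per listing page, so a single network error part-way through would abort the whole run. Below is a minimal sketch of a retrying fetch helper, using only the standard library. The names `fetch_with_retry`, `opener`, `retries` and `delay` are hypothetical and not part of the original notebook, and the back-off values are arbitrary.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, opener=urllib.request.urlopen, retries=3, delay=1.0):
    """Return the response body for url, retrying on network errors.

    Hypothetical helper: opener is injectable so the function can be
    exercised without touching the network.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.URLError as err:  # HTTPError is a subclass
            last_error = err
            # wait a little longer after each failed attempt
            time.sleep(delay * (attempt + 1))
    raise last_error
```

Inside `scanReviews`, `page = fetch_with_retry(tempURL)` could stand in for the direct `urlopen` call, since `BeautifulSoup` also accepts the raw bytes this returns.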

In [6]:
# print(soup.prettify())
# need to get all divs with the class name review: <div class="review">
reviews = soup.findAll("div", {"class": "review"})
print(len(reviews))

# prints review metadata 
print(reviews[0].prettify())
metadata = []
for review in reviews:
    metadata.append(getReviewData(review))
    
metadata[0]

12
<div class="review">
 <a class="review__link" href="/reviews/albums/power-trip-opening-fire-2008-2014/">
  <div class="review__artwork artwork">
   <div class="">
    <img alt="Cover of Opening Fire 2008-2014" src="https://media.pitchfork.com/photos/5c11645b7cf9c54cdbe36ad7/1:1/w_160/power%20trip_opening%20fire.jpg"/>
   </div>
  </div>
  <div class="review__title">
   <ul class="artist-list review__title-artist">
    <li>
     Power Trip
    </li>
   </ul>
   <h2 class="review__title-album">
    Opening Fire: 2008-2014
   </h2>
  </div>
 </a>
 <div class="review__meta">
  <ul class="genre-list genre-list--inline review__genre-list">
   <li class="genre-list__item">
    <a class="genre-list__link" href="/reviews/albums/?genre=metal">
     Metal
    </a>
   </li>
  </ul>
  <ul class="authors">
   <li>
    <a class="linked display-name display-name--linked" href="/staff/andy-oconnor/">
     <span class="by">
      by:
     </span>
     Andy O'Connor
    </a>
   </li>
  </ul>
  <time c

{'album': 'Opening Fire: 2008-2014',
 'artist': 'Power Trip',
 'genre': 'Metal',
 'link': '/reviews/albums/power-trip-opening-fire-2008-2014/',
 'published': '2018-12-22T06:00:00',
 'full text': '',
 'score': 0,
 'year': 0,
 'abstract': ''}

In [7]:
# putting reviews fetched into a dataframe
%time df = pd.DataFrame(scanReviews(1730))

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
n:  1730
reviews fetched:  20758
CPU times: user 9min 24s, sys: 8.74 s, total: 9min 33s
Wall time: 29min 8s


In [10]:
df.head(10)
df.to_csv('without_reviews.csv')

### Adding reviews to review metadata

Each review tile on the latest-reviews page links to the full review. The next step is to fetch the review text and abstract from that page. The abstract may be useful when building the summarizer because it lets me try abstractive vs. extractive methods.
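
In the sample output for the Kanye West review earlier in this notebook, the text returned by `full_text()` opens with the abstract itself. Whether that holds for every review is an assumption, but where it does, the abstract would leak into the summarizer's input. A minimal sketch of removing it first; `strip_abstract` is a hypothetical helper, not part of pitchfork_api:

```python
def strip_abstract(full_text, abstract):
    """Return full_text without a leading copy of abstract.

    Hypothetical helper: if the text does not start with the abstract,
    it is returned unchanged.
    """
    text = full_text.lstrip()
    lead = abstract.strip()
    if lead and text.startswith(lead):
        # drop the abstract and the whitespace that separated it from the body
        return text[len(lead):].lstrip()
    return full_text
```

For a row of the dataframe this could be called as `strip_abstract(row['full text'], row['abstract'])`.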

In [11]:
# Searching with the full title raises an error, even though a review for this album exists
# p = pitchfork_api.search("H.E.R.", "I Used to Know Her: Part 2 EP")
# pitchfork_api also seems to be fairly slow
# Searching with a partial title relies on the API's autocomplete, and this works
p = pitchfork_api.search("H.E.R.", "I Used to")

In [17]:
def getReviewFeatures(row):
    if row.name%1000==0:
        print(row.name)
    try:
        # first try to access the review through pitchfork_api
        p = pitchfork_api.search(row.artist, row.album) # the title is autocompleted
        # add the missing review information to this row of the dataframe
        row['score'] = p.score()
        row['abstract'] = p.abstract()
        row['full text'] = p.full_text()
        row['year'] = p.year()
    except Exception:
        # the API lookup failed, so fall back to scraping the review page by its link
        # (Exception rather than a bare except, so Ctrl-C can still stop the run)
        baseUrl = "https://pitchfork.com"
        goTo = baseUrl + row.link
        try:
            page = urllib.request.urlopen(goTo)
            soup = BeautifulSoup(page, 'html.parser')
            #full = soup.find("div", {"class": "review-detail__text clearfix"})
            review = soup.findAll("p")
            row['year'] = soup.find(class_='single-album-tombstone__meta-year').get_text()[3:]
            row['abstract'] = soup.find("div", {"class": "review-detail__abstract"}).text
            row['score'] = soup.find("span", {"class": "score"}).text
            # join the paragraphs so 'full text' is a string, as in the API branch
            row['full text'] = "\n".join(x.text for x in review)
        except Exception:
            print("failed twice")

    return row


# sending a request for one specific review
# page = urllib.request.urlopen(fullUrl)
# soup = BeautifulSoup(page, 'html.parser')
# inspecting the HTML structure by printing the parsed page
# print(soup.prettify())


# pitchfork-api can also be used since we have artist and album name 
# apply function along rows
%time df = df.apply(getReviewFeatures, axis=1)

0
0
1000
2000
3000
4000
5000
6000
failed twice
7000
8000
9000
failed twice
10000
failed twice
11000
12000
failed twice
13000
14000
15000
16000
17000
18000
failed twice
19000
20000
CPU times: user 2h 12min 40s, sys: 2min 3s, total: 2h 14min 43s
Wall time: 8h 57min 52s
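
Rows filled by the fallback scraper hold raw page text (the score is read as a string and the year is sliced out of a longer label), while rows filled by pitchfork_api hold whatever types the library returns, so the columns may end up with mixed types. A minimal sketch of a clean-up step that could run before saving; `normalize_review` is a hypothetical helper, and mapping unparseable values to `None` is a design assumption.

```python
import re


def normalize_review(record):
    """Return a copy of record with score, year and 'full text' in uniform types."""
    clean = dict(record)

    # score: "8.4" or 8.4 becomes a float, anything unparseable becomes None
    try:
        clean['score'] = float(record.get('score'))
    except (TypeError, ValueError):
        clean['score'] = None

    # year: take the first four-digit number found in the value
    match = re.search(r"\d{4}", str(record.get('year', "")))
    clean['year'] = int(match.group()) if match else None

    # full text: join a list of paragraphs into a single string
    text = record.get('full text', "")
    if isinstance(text, (list, tuple)):
        text = "\n".join(text)
    clean['full text'] = text
    return clean
```

The function works on plain dictionaries such as those returned by `getReviewData`. Note that the placeholder score of 0 set there becomes 0.0 rather than `None`, so rows that printed "failed twice" would still need to be identified separately.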


In [18]:
df.to_csv('full.csv')