# AWS Lambda Function Walkthrough

## Importing Libraries

External libraries (BeautifulSoup, Requests) are not available in the Lambda runtime by default. They must be installed into a folder on a local machine using pip, then uploaded as a layer so that the Lambda function can import them.
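
As a rough illustration of the packaging step, the sketch below zips a locally installed package folder for upload as a layer. Lambda layers for Python expect packages under a top-level `python/` folder inside the archive. The function name `build_layer_zip` and the folder and file names are hypothetical and do not come from the original project.

```python
import pathlib
import zipfile


def build_layer_zip(package_dir, zip_path):
    """Zip everything under package_dir beneath a top-level 'python/' folder.

    Lambda puts a layer's 'python/' folder on the import path of Python
    runtimes, so the packages must sit under that prefix in the archive.
    """
    package_dir = pathlib.Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*")):
            if path.is_file():
                # e.g. packages/bs4/__init__.py -> python/bs4/__init__.py
                arcname = pathlib.Path("python") / path.relative_to(package_dir)
                archive.write(path, arcname.as_posix())
    return zip_path


# Hypothetical usage, after installing the libraries locally with:
#   pip install beautifulsoup4 requests --target packages
# build_layer_zip("packages", "pitchfork_layer.zip")
```

The resulting zip file can then be uploaded as a layer and attached to the function. Keep the output zip outside the package folder so it is not swept into the archive itself.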

In [None]:
# Import standard-library modules and boto3 (boto3 is included in the Lambda Python runtime)
import json
import time
import boto3
import datetime

# The following third-party libraries must be installed into a folder on a local machine,
# then added to the function as a layer
from bs4 import BeautifulSoup
import requests

## Load AWS Resources

In [None]:
# Load AWS Resources
dynamodb = boto3.resource('dynamodb')
db = dynamodb.Table('Pitchfork')

## Function for Scraping Reviews

The main scraping logic follows from "1. Data Scraping". The scraped reviews are stored in a DynamoDB table called 'Pitchfork', with the following attributes:
- url (Primary Key)
- album_name
- album_image
- artist
- genres
- rating
- tagline
- time_loaded
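
The source does not show how the table was created. As a minimal sketch, the attribute list above could map onto a table definition like the following. Only the key attribute (`url`) is declared up front, since DynamoDB does not require a schema for non-key attributes. The helper name `pitchfork_table_definition` and the on-demand billing mode are assumptions made for illustration.

```python
def pitchfork_table_definition(table_name="Pitchfork"):
    """Return keyword arguments suitable for DynamoDB's create_table call."""
    return {
        "TableName": table_name,
        # 'url' is the partition (hash) key and is stored as a string.
        "KeySchema": [{"AttributeName": "url", "KeyType": "HASH"}],
        "AttributeDefinitions": [{"AttributeName": "url", "AttributeType": "S"}],
        # Assumption: on-demand billing, so no capacity units need to be chosen.
        "BillingMode": "PAY_PER_REQUEST",
    }


# Hypothetical usage (requires AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("dynamodb").create_table(**pitchfork_table_definition())
```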

Reviews on the 'Reviews' page are listed from most recent to least recent, and that order does not change. The function therefore walks through the review urls in page order and looks each one up in the table by its key. As soon as a url already exists in the table, the function stops adding reviews, since every later review on the page is older and should already have been stored.
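
The stopping rule can be sketched in isolation, with an in-memory set standing in for the DynamoDB lookup. The function name `collect_new_urls` and the example urls are hypothetical and used only for illustration.

```python
def collect_new_urls(page_urls, already_stored):
    """Return the urls that appear before the first already-stored url.

    page_urls is ordered from most recent to least recent, and
    already_stored is a callable that reports whether a url is in the table.
    """
    new_urls = []
    for url in page_urls:
        if already_stored(url):
            # Everything after this point is older, so stop here.
            break
        new_urls.append(url)
    return new_urls


# Hypothetical data: the two newest reviews have not been stored yet.
page = [
    "https://pitchfork.com/reviews/albums/example-d/",
    "https://pitchfork.com/reviews/albums/example-c/",
    "https://pitchfork.com/reviews/albums/example-b/",
    "https://pitchfork.com/reviews/albums/example-a/",
]
stored = {
    "https://pitchfork.com/reviews/albums/example-b/",
    "https://pitchfork.com/reviews/albums/example-a/",
}
new_urls = collect_new_urls(page, stored.__contains__)
```

One consequence of this design is worth keeping in mind: because reviews are written newest first, a run that fails partway through leaves the newest reviews stored, and the next run would stop at the first of them without reaching the older ones that were missed.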

Once all new reviews have been added to the table, the send_email function is called to send an update email to the designated accounts.

In [None]:
def lambda_handler(event, context):
    
    # Get Reviews Page
    url = "https://pitchfork.com/reviews/albums/"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')  # Name the parser explicitly to avoid bs4's parser-guessing warning

    # Get all item reviews on Review Page
    table = soup.findAll('div', attrs = {'class':"SummaryItemWrapper-iwvBff gVhZsz summary-item summary-item--has-border summary-item--no-icon summary-item--text-align-left summary-item--layout-placement-text-below summary-item--layout-position-image-left summary-item--layout-proportions-50-50 summary-item--side-by-side-align-top summary-item--side-by-side-image-right-mobile-false summary-item--standard SummaryCollectionGridSummaryItem-WColm dzwmya"}) 
    new_reviews = []
    print(f"Found {len(table)} review items")  # Log a count rather than the full HTML

    # Loop through reviews
    for i in table:   
        # Get URL
        hyperlink = i.find('a', attrs = {'class': 'SummaryItemHedLink-civMjp PNQqc summary-item-tracking__hed-link summary-item__hed-link summary-item__hed-link--underline-disable'}, href=True)
        url = "https://pitchfork.com" + hyperlink['href']

        # Check whether the current review is already in the table; if so, stop and send the email
        check = db.get_item(Key={'url': url})

        if 'Item' in check:
            return {
                'statusCode': 200,
                'body': json.dumps(str(send_email(new_reviews)))
            }

        # Album Name
        album_name = i.find('h3', attrs = {'class': 'SummaryItemHedBase-hiFYpQ jwYeiM summary-item__hed'}).get_text()

        # Image URL
        album_cover = i.find('img', attrs = {'class': 'ResponsiveImageContainer-eybHBd fptoWY responsive-image__image'}, src=True)['src']

        # Artist Name
        try:
            artist = i.findAll('div', attrs = {'class': 'SummaryItemSubHedBase-gMyBBg bijetA summary-item__sub-hed'})[0].get_text()
        except IndexError:  # No artist element found
            artist = "Various Artists"

        # Get Genres
        try:
            genre_helper = i.findAll('div', attrs = {'class':'RubricWrapper-dKmCNX cStFUw rubric SummaryItemRubric-dguGKN lapGFj summary-item__rubric'})
            genres = genre_helper[0].findAll('a')[0].findAll('span')[0].get_text()
        except IndexError:  # One of the nested elements is missing
            genres = "Not Found"

        # Send Request for review page
        time.sleep(1)
        review_soup = BeautifulSoup(requests.get(url).content, 'html.parser')

        # Get Rating
        try:
            rating = review_soup.find('p', attrs = {'class': 'BaseWrap-sc-gjQpdd BaseText-ewhhUZ Rating-bkjebD iUEiRd bwCcXY imqiqZ'}).get_text()
        except AttributeError:  # find() returned None, try the alternate layout
            try:
                rating = review_soup.find('div', attrs = {'class': 'ScoreCircle-jAxRuP akdGf'}).get_text()
            except AttributeError:
                rating = "NOT FOUND"

        # Get Summary (tagline)
        try:
            tagline = review_soup.find('div', attrs = {'class': 'BaseWrap-sc-gjQpdd BaseText-ewhhUZ SplitScreenContentHeaderDekDown-csTFQR iUEiRd Byyns MVQMg'}).get_text()
        except AttributeError:  # find() returned None, try the multi-review layout
            try:
                tagline = review_soup.find('div', attrs = {'class': 'BaseWrap-sc-gjQpdd BaseText-ewhhUZ MultiReviewContentHeaderDek-dQARIe iUEiRd Byyns bBVtRE'}).get_text()
            except AttributeError:
                tagline = "NOT FOUND"

        # Create JSON for new review
        review_info = {
            'url': url,
            'album_name': album_name,
            'album_image': album_cover,
            'artist': artist,
            'genres': genres,
            'rating': rating,
            'tagline': tagline,
            'time_loaded': str(datetime.datetime.now())
        }

        # Add New Review to DB
        db.put_item(
            Item=review_info
        )

        # Add New Review to List
        new_reviews.append(review_info)

    return {
        'statusCode': 200,
        'body': json.dumps(str(send_email(new_reviews)))
    }

## Function For Sending Emails

send_email takes a list of dictionaries, one per new review. If the list is empty, the function returns a string indicating that there are no new reviews, and no email is sent. Otherwise, the function builds the body of the email as HTML and sends it to the designated accounts.
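
Since the scraped text is placed directly into HTML, one possible refinement is to escape each value before it is interpolated. The sketch below builds the same body layout as a standalone function. The name `build_email_body` is hypothetical, and escaping is an addition that the original function does not perform.

```python
import html


def build_email_body(reviews):
    """Build the HTML email body from a list of review dictionaries."""
    parts = ["<h1 align='center'>New Reviews From Pitchfork</h1>"]
    for review in reviews:
        # Escape every value so characters such as & or < cannot break the markup.
        safe = {key: html.escape(str(value), quote=True) for key, value in review.items()}
        parts.append(
            "<center><img src='{album_image}' width='300' height='300'><br>"
            "<a href='{url}'><b>{album_name}</b> by {artist}</a> "
            "({genres}, Rating: {rating}): {tagline}</center><hr>".format(**safe)
        )
    return "".join(parts)
```

Collecting the fragments in a list and joining them once also avoids repeated string concatenation inside the loop.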

The emails are sent using Amazon Simple Email Service (SES). As this project is for personal use, the SES account is left in the 'sandbox', which only allows emails to be sent from and to verified addresses.

In [None]:
def send_email(text):
    # Sends Email with new reviews

    # If there are new reviews, send an email
    if text:
        client = boto3.client("ses")
        subject = "Latest Reviews from Pitchfork"

        # Generate Email Body
        body = "<h1 align='center'>New Reviews From Pitchfork</h1>"
        for i in text:
            body = body + "<center><img src='%s' width='300' height='300'><br><a href='%s'><b>%s</b> by %s</a> (%s, Rating: %s): %s</center><hr>" % (i['album_image'],i['url'],i["album_name"], i["artist"], i["genres"], i['rating'], i['tagline'])

        # Send Email
        message = {"Subject": {"Data": subject}, "Body": {"Html": {"Data": str(body)}}}
        response = client.send_email(Source = "XXX",
                Destination = {"ToAddresses": ["XXX", "YYY"]}, Message = message)
        
        # Return Response Message
        return response
    else:
        return "No new Reviews"