## Yelp Reviews Scraper
##### This project extracts reviews of SoFi Stadium from Yelp and applies Natural Language Processing (NLP) techniques to them to gain insight into customer sentiment and opinion. The reviews are scraped, parsed, and cleaned to obtain the relevant information, then analyzed with techniques such as sentiment analysis, topic modeling, and text classification. The results will be used to identify the most frequently discussed topics, the overall sentiment toward the stadium, and areas for improvement. These insights are intended to help SoFi Stadium improve its services and better meet customer expectations.

In [5]:
#! conda install -c conda-forge requests -y

In [9]:
## Import required libraries
from bs4 import BeautifulSoup
import requests
import os
import pandas as pd
import sys
import numpy as np
import json

In [10]:
## Initialize header
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'}

## As of now, this can only get the 20 latest reviews
def parse_url(url='https://www.yelp.com/biz/sofi-stadium-inglewood?osq=sofi+stadium', out_file='yelp_reviews.csv'):
    '''
    Fetch the page at url, parse its HTML, and write each review's author, date,
    rating, and text to out_file, skipping reviews that are already in the file.
    '''
    data_dicts = []
    for i in range(0, 1):
        nextpage_url = url + '&start=' + str(i*10)  ## Next page url; '&' assumes url already has a query string
        nextpage = requests.get(nextpage_url, headers=headers)  ## Get data from the url
        soup = BeautifulSoup(nextpage.content, 'html.parser')  ## Parse the HTML content
        body = soup.find('div')
        data_tmp = body.find_all('script', type="application/ld+json")
        data_dicts.extend(json.loads(data_tmp[1].string)['review'])  ## Find information related to reviews
    df = pd.DataFrame(data_dicts)
    if os.path.isfile(out_file):  ## Read the data if the file already exists
        d = pd.read_csv(out_file, index_col=False)
        df = pd.concat([df, d])  ## Combine the old and new data
        df['reviewRating'] = df['reviewRating'].astype(str)  ## Cast so new dict ratings match the strings read back from the CSV
        df.drop_duplicates(ignore_index=True, inplace=True)  ## Remove duplicate entries from the df
    df.to_csv(out_file, mode='w', index=False)  ## Write df to out_file
    return df

df=parse_url()
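
The function above picks the review data by position (`data_tmp[1]`), which may break if the page changes the order of its JSON-LD script tags. A minimal sketch of a more defensive lookup is shown below. The helper name `find_reviews` and the sample payloads are hypothetical and are not taken from Yelp; the sketch assumes only that one of the JSON-LD blocks is an object with a `review` key, as the code above implies.

```python
import json


def find_reviews(json_ld_texts):
    """Return the 'review' list from the first JSON-LD block that has one.

    json_ld_texts is an iterable of raw JSON strings, for example the
    .string of each <script type="application/ld+json"> tag (assumption).
    """
    for text in json_ld_texts:
        if not text:
            continue  # script tag without text content
        try:
            block = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(block, dict) and 'review' in block:
            return block['review']
    return []  # no block carried reviews


# Made-up sample payloads for illustration, not real Yelp data
sample = [
    '{"@type": "BreadcrumbList"}',
    '{"@type": "LocalBusiness", "review": [{"author": "A. B.", "description": "Nice"}]}',
]
reviews = find_reviews(sample)
```

Inside `parse_url`, the same idea could be applied by passing the `.string` of every tag in `data_tmp` instead of indexing the second one.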

In [11]:
df

Unnamed: 0,author,datePublished,reviewRating,description
0,Curtis A.,2023-01-27,{'ratingValue': 5},Great experience going for a pre-season game! ...
1,Jordynn B.,2023-01-22,{'ratingValue': 4},Came for my little brothers flag football game...
2,Jazmine P.,2023-01-14,{'ratingValue': 4},I came here to see The Weeknd not once but twi...
3,Steve N.,2023-01-03,{'ratingValue': 5},Glad to finally witness SoFi Stadium in all of...
4,Angel B.,2022-12-21,{'ratingValue': 4},Absolutely beautiful. Absolutely stunning. Abs...
5,Tommy M.,2022-12-19,{'ratingValue': 2},"The is a very beautiful stadium, a state of t..."
6,Ray H.,2022-12-26,{'ratingValue': 4},Finally got a chance to visit sofi for the den...
7,Juan C.,2022-11-30,{'ratingValue': 5},My wife &amp; I came to the Los Angeles Charge...
8,Angela C.,2022-11-18,{'ratingValue': 3},This is a beautiful new stadium to enjoy watch...
9,Taylor O.,2022-12-12,{'ratingValue': 3},Ok coming here to review SoFi as a sports and ...
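
The `reviewRating` column above holds values such as `{'ratingValue': 5}`, either as a dict (fresh scrape) or as its string form (after a round trip through the CSV). For later analysis a plain number is usually easier to work with. The sketch below shows one way to extract it; the helper name `rating_value` is hypothetical and not part of the original notebook.

```python
import ast


def rating_value(cell):
    """Return the numeric rating from a dict or from its string form."""
    if isinstance(cell, str):
        # literal_eval parses Python literals only, so it is safe on text like "{'ratingValue': 5}"
        cell = ast.literal_eval(cell)
    return cell['ratingValue']


# Values shaped like the reviewRating column shown above
from_dict = rating_value({'ratingValue': 5})
from_text = rating_value("{'ratingValue': 2}")
```

With the DataFrame above, something like `df['reviewRating'].map(rating_value)` should produce a numeric rating column.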
