# Web Scraping 

In this notebook, I use sentiment analysis on Rotten Tomatoes movie reviews to predict users' star ratings. The first step is scraping the audience review pages on Rotten Tomatoes. For each review, I parse the user ID, review text and star rating, and add these features to a dataframe.

### Importing Modules

In [1]:
import pandas as pd
from collections import defaultdict

import requests
import urllib.request
from bs4 import BeautifulSoup

### Connecting to the URL

In [2]:
# Tester URL
url = 'https://www.rottentomatoes.com/m/black_panther_2018/reviews?type=user'
response = requests.get(url)
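
The request above assumes the server answers with a usable page. A minimal sketch of a more defensive fetch is shown below. The helper name `fetch_html` and its `get` parameter are hypothetical (they are not part of the notebook); `get` exists only so the function can be exercised without a network connection.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10.0):
    """Return the page body as text, raising if the request fails.

    `get` is a hypothetical injection point (defaults to requests.get).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `soup = BeautifulSoup(fetch_html(url), 'html.parser')` would replace the separate request and parsing steps, and a failed request would stop the notebook early instead of producing an empty result.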

### Brewing the soup

In [3]:
# Brewing the soup
soup = BeautifulSoup(response.text, 'html.parser')

### Isolating all the relevant user data

In [4]:
# One user
user_all = soup.find_all('li', {'class': 'audience-reviews__item'})
user = user_all[0]
print(user.prettify())

<li class="audience-reviews__item" data-qa="review-item">
 <div class="audience-reviews__user-wrap">
  <a href="/user/id/978824977">
   <span class="audience-review__default-image">
   </span>
  </a>
  <div class="audience-reviews__name-wrap">
   <a class="audience-reviews__name" data-qa="review-name" href="/user/id/978824977">
    Diego O
   </a>
  </div>
 </div>
 <div class="audience-reviews__review-wrap">
  <span class="audience-reviews__score">
   <span class="star-display" data-qa="star-display">
    <span class="star-display__filled">
    </span>
    <span class="star-display__filled">
    </span>
    <span class="star-display__filled">
    </span>
    <span class="star-display__filled">
    </span>
    <span class="star-display__half">
    </span>
   </span>
  </span>
  <span class="audience-reviews__duration" data-qa="review-duration">
   Jan 08, 2021
  </span>
  <p class="audience-reviews__review js-review-text clamp clamp-8 js-clamp" data-qa="review-text">
   Good but not tha

# Tester Code 

In [5]:
# Parsing star rating
full_stars  = user.find_all('span', {'class': 'star-display__filled'})
half_stars  = user.find_all('span', {'class': 'star-display__half'})
star_rating = len(full_stars)+(len(half_stars)/2)

print(star_rating)

4.5
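
The same counting logic can be checked on a standalone snippet, without hitting the website. The sketch below assumes the star markup matches the sample item printed above; `SAMPLE_HTML` and `parse_star_rating` are illustrative names, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hand-written snippet mirroring the star markup shown above (an assumption)
SAMPLE_HTML = """
<li class="audience-reviews__item">
  <span class="star-display">
    <span class="star-display__filled"></span>
    <span class="star-display__filled"></span>
    <span class="star-display__filled"></span>
    <span class="star-display__filled"></span>
    <span class="star-display__half"></span>
  </span>
</li>
"""


def parse_star_rating(item):
    """Count filled stars as 1 and half stars as 0.5."""
    full = item.find_all('span', class_='star-display__filled')
    half = item.find_all('span', class_='star-display__half')
    return len(full) + len(half) / 2


item = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('li', class_='audience-reviews__item')
rating = parse_star_rating(item)
print(rating)  # 4.5
```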


In [6]:
# Parsing review
review = user.find('p', {'class': 'audience-reviews__review js-review-text clamp clamp-8 js-clamp'}).text

print(review)

Good but not that surprising.
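
Matching on the full class string only works while the element carries exactly those classes in that order. As a possible alternative, the sketch below selects the paragraph by the `data-qa="review-text"` attribute visible in the sample item, and returns `None` if the paragraph is missing. `SAMPLE_HTML` is a hand-written stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hand-written snippet mirroring the review markup shown above (an assumption)
SAMPLE_HTML = """
<li class="audience-reviews__item">
  <p class="audience-reviews__review js-review-text clamp clamp-8 js-clamp" data-qa="review-text">
    Good but not that surprising.
  </p>
</li>
"""

item = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('li')

# Select by attribute rather than by the exact class string
review_tag = item.find('p', attrs={'data-qa': 'review-text'})

# get_text(strip=True) drops the surrounding whitespace and newlines
review_text = review_tag.get_text(strip=True) if review_tag else None
print(review_text)  # Good but not that surprising.
```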


In [7]:
# Parsing user ID
id_     = user.find('a')['href']
id_list = id_.split('/')
user_id = id_list[3]

print(id_)
print(id_list)
print(user_id)

/user/id/978824977
['', 'user', 'id', '978824977']
978824977
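
Indexing `id_list[3]` relies on the link always having the form `/user/id/<id>`. A small sketch of a slightly more tolerant version, which takes the last non-empty path segment instead, might look like this. The helper name `parse_user_id` is hypothetical.

```python
def parse_user_id(href):
    """Return the last non-empty path segment of a '/user/id/<id>' link."""
    segments = [part for part in href.split('/') if part]
    return segments[-1] if segments else None


print(parse_user_id('/user/id/978824977'))  # 978824977
```

This also handles a trailing slash, and returns `None` for an empty link rather than raising an `IndexError`.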


# Parsing the star rating, review and user id

In [31]:
# Empty data dictionary
data = defaultdict(list)

# Parsing star rating, review and user id
for user in soup.find_all('li', {'class': 'audience-reviews__item'}):
    # Extracting user ID
    id_     = user.find('a')['href']
    id_list = id_.split('/')
    user_id = id_list[3]

    # Review
    review = user.find('p', {'class': 'audience-reviews__review js-review-text clamp clamp-8 js-clamp'}).text
    
    # Star rating
    full_stars  = user.find_all('span', {'class': 'star-display__filled'})
    half_stars  = user.find_all('span', {'class': 'star-display__half'})
    star_rating = len(full_stars)+(len(half_stars)/2)

    # Appending values to data dictionary
    data['user_id'].append(user_id)
    data['review'].append(review)
    data['rating'].append(star_rating)

# Creating dataframe from data dictionary
df = pd.DataFrame(data).reset_index(drop=True)
df

|  | user_id | review | rating |
|---|---|---|---|
| 0 | 978824977 | Good but not that surprising. | 4.5 |
| 1 | 906471241 | It was exciting to see this in theaters with m... | 5.0 |
| 2 | 978925578 | I'm not a huge Marvel fan, but this movie is V... | 4.0 |
| 3 | 978898527 | It was absolutely appaling!!!!! I have never b... | 0.5 |
| 4 | 977911687 | Best movie of all time?  Best drama of all tim... | 1.5 |
| 5 | 978866570 | Black panther lived up to the height | 5.0 |
| 6 | 978071323 | Not sure what all the hype was about. This mov... | 2.0 |
| 7 | 978762264 | R.I.P. Chadwick Boseman | 5.0 |
| 8 | 978883340 | Average superhero movie. Nothing "super" speci... | 2.0 |
| 9 | 936458985 | A masterpiece and a cultural phenomenon.  I lo... | 5.0 |
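
The loop above raises an error if any review item lacks a profile link or a review paragraph. One way to guard against that, sketched here under the assumption that the markup matches the sample item shown earlier, is to wrap the parsing in a function that skips incomplete items. `parse_reviews` and `SAMPLE_HTML` are illustrative names, not part of the notebook.

```python
from bs4 import BeautifulSoup


def parse_reviews(html):
    """Return one dict per complete review item found in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.find_all('li', class_='audience-reviews__item'):
        link = item.find('a', href=True)
        review = item.find('p', attrs={'data-qa': 'review-text'})
        # Skip items that are missing the profile link or the review text
        if link is None or review is None:
            continue
        full = item.find_all('span', class_='star-display__filled')
        half = item.find_all('span', class_='star-display__half')
        rows.append({
            'user_id': link['href'].rstrip('/').split('/')[-1],
            'review': review.get_text(strip=True),
            'rating': len(full) + len(half) / 2,
        })
    return rows


# Hand-written sample: one complete item and one without a link or review
SAMPLE_HTML = """
<li class="audience-reviews__item">
  <a href="/user/id/978824977"></a>
  <span class="star-display__filled"></span>
  <span class="star-display__filled"></span>
  <span class="star-display__filled"></span>
  <span class="star-display__filled"></span>
  <span class="star-display__half"></span>
  <p data-qa="review-text">Good but not that surprising.</p>
</li>
<li class="audience-reviews__item">
  <span class="star-display__filled"></span>
</li>
"""

rows = parse_reviews(SAMPLE_HTML)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` would give the same `user_id`, `review` and `rating` columns as the dataframe above.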
