# <center>HW2: Web Scraping</center>

### Thanapoom Phatthanaphan <br>CWID: 20011296

**Instructions**: 
- Please read the problem description carefully
- Make sure to complete all requirements (shown as bullets). In general, it is much easier to complete the requirements in the order shown in the problem description.
- Follow the Submission Instruction to submit your assignment.
- Code of academic integrity:
    - **Each assignment needs to be completed independently. This is NOT a group assignment**. 
    - Never ever copy others' work (even with minor modifications, e.g. changing variable names).
    - If you generate code using large language models (although it is not encouraged), make sure to adapt the generated code so that it meets all requirements and is executable.
    - Anti-Plagiarism software will be used to check similarities between all submissions.
    - Check Syllabus for more details.

# Q1. Collecting Movie Reviews

Write a function `getReviews(url, webdriver = None)` to scrape all **reviews on the first page**, including:
- **title** (see (1) in Figure)
- **reviewer's name** (see (2) in Figure)
- **date** (see (3) in Figure)
- **rating** (see (4) in Figure)
- **review content** (see (5) in Figure)
    - For each review, if the full text is not shown, first click the expander icon (shown in (7)) to expand the review. 
    - Hint: you can first select all expander icons on the page and then click each of them. The expander icons can be selected with the CSS selector `div.ipl-expander div.expander-icon-wrapper`.
    - Then collect the **complete review text**.
- **helpful** (see (6) in Figure). 
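
The expander hint above could be sketched roughly as follows with Selenium. This is only a minimal sketch under stated assumptions: `expand_all_reviews` is a hypothetical helper name, `driver` is assumed to be an already-initialized Selenium WebDriver with the review page loaded, and the click is issued through JavaScript on the assumption that some icons may be hidden or covered by other elements.

```python
# Hypothetical helper; `driver` is assumed to be an initialized Selenium
# WebDriver that has already loaded the review page.
EXPANDER_SELECTOR = "div.ipl-expander div.expander-icon-wrapper"


def expand_all_reviews(driver, selector=EXPANDER_SELECTOR):
    """Click every expander icon and return how many clicks succeeded."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR,
    # written literally here so the sketch needs no imports.
    icons = driver.find_elements("css selector", selector)
    clicked = 0
    for icon in icons:
        try:
            # A JavaScript click is used on the assumption that some icons
            # are hidden or overlapped, where a native click could raise.
            driver.execute_script("arguments[0].click();", icon)
            clicked += 1
        except Exception:
            # Skip icons that cannot be clicked (e.g. stale elements).
            continue
    return clicked
```

After this call, the review text can be read from the page source as usual; whether the expansion is strictly needed depends on how the page delivers the full text.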


Requirements:
- `Function Input`:
    - `page URL`: the URL string
    - `web driver`: if you use Selenium or Playwright, pass the initialized web driver. In other words, your function should work with an initialized web driver of any web browser.
- `Function Output`: save all reviews as a DataFrame with the columns (`title, reviewer, rating, date, review, helpful`). For the given URL, you can get 25 reviews.
- If a field, e.g. rating, is missing, use `None` to indicate it. 
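
One way to satisfy the "use `None` for a missing field" requirement is a small lookup helper. The sketch below is an illustration only: `first_text` is a hypothetical name, and `node` is assumed to be a BeautifulSoup `Tag` (or any object with a compatible `select` method).

```python
def first_text(node, selector, strip=True):
    """Return the text of the first element matching `selector`, or None.

    `node` is assumed to behave like a BeautifulSoup Tag: `select` returns
    a list of matches and each match offers `get_text(strip=...)`.
    """
    matches = node.select(selector)
    if not matches:
        # Field is absent for this review, e.g. a review without a rating.
        return None
    return matches[0].get_text(strip=strip)
```

Each review row can then be built from one such call per field, and pandas treats the resulting `None` entries as missing values when the rows are turned into a DataFrame.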

    


![alt text](IMDB.png "IMDB")

In [1]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup  
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time

In [2]:
def getReviews(page_url, webdriver=None):
    """Scrape the reviews on the first page of `page_url`.

    The page is fetched with `requests`, so the optional `webdriver`
    argument is accepted for interface compatibility but is not used.
    Returns a list of (title, reviewer, rating, date, review, helpful)
    tuples; a missing field is stored as None.
    """
    reviews = []
    page = requests.get(page_url)
    # Parse the HTML source with Python's built-in html.parser
    soup = BeautifulSoup(page.content, 'html.parser')
    # Select the content block of every review on the page
    divs = soup.select("div#main section.article div.lister div.lister-list "
                       "div[class~=lister-item] div.review-container div.lister-item-content")

    # Extract the required fields from each review; None marks a missing field
    for div in divs:

        # Get title
        title_path = div.select("a.title")
        title = title_path[0].get_text(strip=True) if title_path else None

        # Get reviewer
        reviewer_path = div.select("div.display-name-date span.display-name-link a")
        reviewer_name = reviewer_path[0].get_text() if reviewer_path else None

        # Get date
        date_path = div.select("div.display-name-date span.review-date")
        date = date_path[0].get_text() if date_path else None

        # Get rating (the first inner span holds the score)
        rating_path = div.select("span.rating-other-user-rating span")
        rating = rating_path[0].get_text() if rating_path else None

        # Get review content
        review_path = div.select("div.content div[class~=text]")
        review_content = review_path[0].get_text() if review_path else None

        # Get helpful (the first child node holds the "x out of y" text)
        helpful_path = div.select("div.content div.actions.text-muted")
        helpful = helpful_path[0].contents[0].strip() if helpful_path else None

        reviews.append((title, reviewer_name, rating, date, review_content, helpful))

    return reviews

In [3]:
# Test the function for Question 1

# Website url
page_url = 'https://www.imdb.com/title/tt1745960/reviews?sort=totalVotes&dir=desc&ratingFilter=0'

# I don't use Selenium or Playwright for this question
# driver = webdriver.Chrome()
# driver.get(page_url)

# Get the information from the website
reviews = getReviews(page_url)
columns_name = ['Title', 'Reviewer', 'Rating', 'Date', 'Content', 'Helpful']
reviews_table = pd.DataFrame(reviews, columns=columns_name)
reviews_table

| | Title | Reviewer | Rating | Date | Content | Helpful |
|---|---|---|---|---|---|---|
| 0 | This is slightly different to the other review... | scottedwards-87359 | 10 | 26 May 2022 | If you were a late teen or in your early twent... | 5,329 out of 5,610 found this helpful. |
| 1 | The truly epic blockbuster we needed. | Top_Dawg_Critic | 10 | 23 May 2022 | Wow. The first Top Gun is a classic, and as we... | 2,736 out of 3,092 found this helpful. |
| 2 | Let me just say... | lovefalloutkindagamer | 10 | 26 May 2022 | I was reluctantly dragged into the theater, th... | 1,885 out of 2,086 found this helpful. |
| 3 | Best Sequel yet | GusherPop | 10 | 25 May 2022 | In one of the more memorable lines in the orig... | 1,223 out of 1,442 found this helpful. |
| 4 | The real cinema experience! | alexglimbergwindh | 10 | 30 May 2022 | If there's any movie that deserves to be seen ... | 1,071 out of 1,240 found this helpful. |
| 5 | This is why we go to the movies | dtucker86 | 10 | 27 May 2022 | This is one sequel that looked like it would n... | 983 out of 1,157 found this helpful. |
| 6 | Flying High | DarkVulcan29 | 10 | 27 May 2022 | Top Gun (1986) made Tom Cruise a star, and now... | 641 out of 828 found this helpful. |
| 7 | What an excellent sequel | r96sk | 9 | 25 May 2022 | What an excellent sequel - I, in fact, like it... | 658 out of 823 found this helpful. |
| 8 | Fake Imdb reviews artificially upping the rati... | imseeg | 5 | 29 May 2022 | Almost 90 percent of all the reviews have a 8,... | 143 out of 803 found this helpful. |
| 9 | Great Flight Sequences, Cliche-Ridden Plot | Stoshie | 7 | 18 June 2022 | I don't share everyone's unbridled enthusiasm ... | 577 out of 783 found this helpful. |


# Q2 (Bonus) Scrape Dynamic Content


- Expand your function defined in Q1 to include an argument `N` for the minimum number of reviews to be collected, i.e., `get_N_review(url, webdriver = None, N = 100)`. 
- When called, this function should scrape **at least N reviews** by repeatedly clicking the `Load More` button at the end of the page.
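
The "click `Load More` until at least N reviews are present" idea can be sketched as a loading step that is separate from parsing. This is a hedged sketch, not a reference solution: `load_at_least`, `pause`, and `max_clicks` are hypothetical names, both selectors are assumptions about the page's markup, and a fixed pause stands in for a proper explicit wait.

```python
import time

# Both selectors are assumptions about the review page's markup.
REVIEW_SELECTOR = "div.lister-item-content"
LOAD_MORE_SELECTOR = "button.ipl-load-more__button"


def load_at_least(driver, n, pause=2.0, max_clicks=100):
    """Click "Load More" until at least `n` reviews are on the page.

    Returns the number of reviews present when loading stops, which may be
    below `n` if the page runs out of reviews.
    """
    by_css = "css selector"  # string value of selenium's By.CSS_SELECTOR
    count = len(driver.find_elements(by_css, REVIEW_SELECTOR))
    for _ in range(max_clicks):
        if count >= n:
            break
        buttons = driver.find_elements(by_css, LOAD_MORE_SELECTOR)
        if not buttons or not buttons[0].is_displayed():
            break  # no button left, so there is nothing more to load
        buttons[0].click()
        time.sleep(pause)  # crude wait for the new reviews to be appended
        new_count = len(driver.find_elements(by_css, REVIEW_SELECTOR))
        if new_count == count:
            break  # nothing was added; stop rather than loop forever
        count = new_count
    return count
```

Loading first and parsing the page source once afterwards avoids re-reading reviews that were already on the page before each click.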

In [4]:
# for Bonus

def get_N_reviews(page_url, webdriver, N=100):
    """Collect N reviews by repeatedly clicking the "Load More" button.

    `webdriver` is an initialized Selenium driver instance (not the selenium
    module). Returns a list of (title, reviewer, rating, date, review, helpful)
    tuples; fewer than N are returned if the button stops appearing.
    """
    webdriver.get(page_url)
    reviews = []

    while True:

        # Parse the current state of the page
        soup = BeautifulSoup(webdriver.page_source, 'html.parser')
        divs = soup.select("div#main section.article div.lister div.lister-list "
                           "div[class~=lister-item] div.review-container div.lister-item-content")

        # The page source still contains the reviews parsed in earlier rounds,
        # so skip them to avoid collecting duplicates
        for div in divs[len(reviews):]:

            # Get title
            title_path = div.select("a.title")
            title = title_path[0].get_text(strip=True) if title_path else None

            # Get reviewer
            reviewer_path = div.select("div.display-name-date span.display-name-link a")
            reviewer_name = reviewer_path[0].get_text() if reviewer_path else None

            # Get date
            date_path = div.select("div.display-name-date span.review-date")
            date = date_path[0].get_text() if date_path else None

            # Get rating
            rating_path = div.select("span.rating-other-user-rating span")
            rating = rating_path[0].get_text() if rating_path else None

            # Get review content
            review_path = div.select("div.content div[class~=text]")
            review_content = review_path[0].get_text() if review_path else None

            # Get helpful
            helpful_path = div.select("div.content div.actions.text-muted")
            helpful = helpful_path[0].contents[0].strip() if helpful_path else None

            reviews.append((title, reviewer_name, rating, date, review_content, helpful))

            if len(reviews) >= N:
                break

        if len(reviews) >= N:
            break

        try:
            # Wait for the "Load More" button to be clickable
            load_more_button = WebDriverWait(webdriver, 30).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, "div.load-more-data div button.ipl-load-more__button")))
        except TimeoutException:
            # If the button is not clickable within the timeout, stop loading
            break

        # Click the button, then give the new reviews time to load before re-parsing
        load_more_button.click()
        time.sleep(20)

    return reviews

In [5]:
# Test the function for Question 2

# Initialize the web driver (get_N_reviews loads the page itself)
driver = webdriver.Chrome()
n_reviews = get_N_reviews(page_url, driver)
columns_name = ['Title', 'Reviewer', 'Rating', 'Date', 'Content', 'Helpful']
n_reviews_table = pd.DataFrame(n_reviews, columns=columns_name)
driver.quit()

In [6]:
n_reviews_table

| | Title | Reviewer | Rating | Date | Content | Helpful |
|---|---|---|---|---|---|---|
| 0 | This is slightly different to the other review... | scottedwards-87359 | 10 | 26 May 2022 | If you were a late teen or in your early twent... | |
| 1 | The truly epic blockbuster we needed. | Top_Dawg_Critic | 10 | 23 May 2022 | Wow. The first Top Gun is a classic, and as we... | |
| 2 | Let me just say... | lovefalloutkindagamer | 10 | 26 May 2022 | I was reluctantly dragged into the theater, th... | 1,885 out of 2,086 found this helpful. |
| 3 | Best Sequel yet | GusherPop | 10 | 25 May 2022 | In one of the more memorable lines in the orig... | |
| 4 | The real cinema experience! | alexglimbergwindh | 10 | 30 May 2022 | If there's any movie that deserves to be seen ... | 1,071 out of 1,240 found this helpful. |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Take note Hollywood this is how to do a blockb... | rockingruby | 9 | 29 May 2022 | I appreciate the first film but never been a m... | |
| 96 | Talk to me Goose | kosmasp | 10 | 30 May 2022 | It makes more than sense to have seen the firs... | |
| 97 | Are these reviews being bought by the military... | SwissCheeze | 1 | 23 July 2022 | Went with an open mind. Didn't expect anything... | 230 out of 452 found this helpful. |
| 98 | Cliched garbage | kristoferthompson | 1 | 27 May 2022 | Just as dull, crap, and boring as the first pi... | |
