# Lab 5.2 -- Scraping IMDB

Our goal is to scrape [IMDB](https://www.imdb.com) user reviews for *Borat Subsequent Moviefilm*.  Unfortunately, the user reviews page shows only a limited number of reviews, and there is no link to additional pages.  `selenium` to the rescue! In this lab, we will combine our two approaches to web scraping by

1. Using `selenium` to load the page and click the *Load More* button until we have all the reviews.
2. Creating a `BeautifulSoup` instance for the complete page and parsing the results.

### Task 1 -- Load the reviews.

Explore IMDB to find the web link for the user reviews for *Borat Subsequent Moviefilm* and load this page in Python with `selenium`.

In [62]:
# import requests
# from bs4 import BeautifulSoup

# s = requests.Session()
# r = s.get('https://www.imdb.com/')
# imdb = BeautifulSoup(r.content, 'html.parser')

In [63]:
# Composable helpers for parsing the IMDB reviews page
from composable import pipeable
from composable.strict import map, filter
from composablesoup import find, find_all, get_text, has_attr
from composablesoup.soup import find_parent, parents, children, find_previous_sibling, find_previous_siblings, find_next_sibling, find_next_siblings
from composable.sequence import to_list, head
from composable.string import strip
from composable import from_toolz as tlz

In [35]:
# For local machine
!pip install selenium



In [53]:
# For running locally (with a pop up browser)
from selenium import webdriver

DRIVER_PATH = '/mnt/c/Users/rp5626vi/Documents/chromedriver/chromedriver.exe'
url = 'https://www.imdb.com/title/tt13143964/reviews?ref_=tt_ql_3/'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get(url)

### Task 2 -- Figure out how to click the *Load More* button.

To load all of the user reviews, we need to click the *Load More* button multiple times.  First, find the corresponding WebElement and verify that clicking this button loads another page of results.

In [37]:
load_btn = driver.find_element_by_id('load-more-trigger')

### Task 3 -- Click *Load More* until you have all the results.

Now you need to write code that will keep clicking the *Load More* button when you find it.  **Hint:** We can think of this as an example of an *unfold* process, meaning you should use a `while` loop combined with a [try-and-except statement](https://pythonbasics.org/try-except/) to keep trying to click the button.  To make sure you don't get an infinite loop, use a variable to identify and hold the stopping condition/state.
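
The loop shape described in the hint can be sketched without a browser. In the minimal example below, `FakeButton` and `click_until_failure` are hypothetical names invented for illustration; `FakeButton` stands in for the real *Load More* WebElement and raises an error once it can no longer be clicked. The `keep_going` variable holds the stopping state, so the loop ends as soon as a click fails.

```python
class FakeButton:
    """Hypothetical stand-in for the Load More button: clickable a fixed number of times."""
    def __init__(self, clicks_left):
        self.clicks_left = clicks_left

    def click(self):
        if self.clicks_left == 0:
            raise RuntimeError('button is no longer clickable')
        self.clicks_left -= 1


def click_until_failure(button):
    """Keep clicking until click() raises; return the number of successful clicks."""
    clicks = 0
    keep_going = True           # the stopping state for the loop
    while keep_going:
        try:
            button.click()
            clicks += 1
        except RuntimeError:
            keep_going = False  # the failed click is the signal to stop
    return clicks


print(click_until_failure(FakeButton(3)))  # prints 3
```

With `selenium`, the body of the `try` would look up and click the real button, and the `except` clause would catch the exception `selenium` raises when the button is missing or not interactable.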

In [60]:
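import time
from selenium.common.exceptions import (ElementNotInteractableException,
                                        NoSuchElementException)

def click_till_disappear(Driver):
    """
    Args - 
        Driver - for the page
    Returns -
        The last Load More WebElement found (the fully loaded page stays in Driver)
    """
    load_btn = None
    while True:
        try:
            load_btn = Driver.find_element_by_id('load-more-trigger')
            load_btn.click()
            print('pressed load more')
            time.sleep(2)  # pause so the next batch of reviews can load
        except (NoSuchElementException, ElementNotInteractableException) as e:
            # The button is gone or hidden, so every review has been loaded
            print(type(e))
            print('Done clicking more!!')
            break
    return load_btn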

In [61]:
click_till_disappear(driver)

pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
pressed load more
<class 'selenium.common.exceptions.ElementNotInteractableException'>
Done clicking more!!


<selenium.webdriver.remote.webelement.WebElement (session="bd72a807d2144f8719044ede50e8145e", element="6759d98e-4a7e-47cd-87ee-c7fcbfa9c42a")>

### Task 4 -- Load the results in a `BeautifulSoup` object.

Since `bs4` has better tools for parsing html, we will now switch to using this module to parse the results.  Recall that you can access the content of the current page from the `selenium` driver using `driver.page_source`.  You can use this attribute to make a `soup` object for the page using 

> soup = BeautifulSoup(driver.page_source, 'html.parser')

In [64]:
from bs4 import BeautifulSoup

imbd = BeautifulSoup(driver.page_source, 'html.parser')

### Task 5 -- Extract the information

Now extract the following data to a csv file.

1. Title
2. Score
3. User
4. Date
5. Text (replace commas with semicolons!)
6. Two columns, X and Y, taken from `"X out of Y found this helpful"`
7. Permanent link to the review.
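
For item 6, one way to turn the helpfulness phrase into two columns is a regular expression with two capture groups. The sketch below is only an illustration: `HELPFUL` and `split_helpful` are hypothetical names, and the handling of thousands separators such as `1,234` is an assumption about how larger counts might be displayed.

```python
import re

# Two capture groups: X and Y in "X out of Y found this helpful".
# Allowing commas inside the numbers is an assumption about large counts.
HELPFUL = re.compile(r'(\d[\d,]*) out of (\d[\d,]*) found this helpful')

def split_helpful(text):
    """Return (X, Y) as ints, or (None, None) when the phrase is missing."""
    match = HELPFUL.search(text)
    if match is None:
        return (None, None)
    x, y = (int(group.replace(',', '')) for group in match.groups())
    return (x, y)

print(split_helpful('186 out of 286 found this helpful'))  # (186, 286)
print(split_helpful('Was this review helpful?'))           # (None, None)
```

Returning `(None, None)` for a missing phrase keeps one entry per review, so the X and Y columns stay aligned with the other columns.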


In [74]:
title = (imbd
 >> find('a', attrs = {'itemprop' : 'url'})
 >> get_text
)
title

'Borat Subsequent Moviefilm'

In [82]:
score = (imbd 
         >> find_all('svg', attrs = {'class' : 'ipl-icon ipl-star-icon'})
         >> map(find_next_sibling)
         >> map(get_text)
         )
score

['10',
 '10',
 '10',
 '10',
 '10',
 '9',
 '10',
 '10',
 '10',
 '6',
 '4',
 '4',
 '9',
 '8',
 '9',
 '8',
 '3',
 '10',
 '1',
 '5',
 '8',
 '8',
 '9',
 '8',
 '8',
 '10',
 '9',
 '9',
 '10',
 '1',
 '1',
 '3',
 '9',
 '9',
 '10',
 '10',
 '1',
 '7',
 '8',
 '3',
 '4',
 '10',
 '10',
 '1',
 '3',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '2',
 '10',
 '10',
 '1',
 '1',
 '1',
 '1',
 '1',
 '8',
 '8',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '8',
 '1',
 '1',
 '1',
 '2',
 '1',
 '1',
 '10',
 '1',
 '2',
 '10',
 '8',
 '2',
 '1',
 '1',
 '1',
 '2',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '3',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '4',
 '1',
 '9',
 '3',
 '1',
 '1',
 '1',
 '1',
 '10',
 '4',
 '5',
 '1',
 '2',
 '1',
 '8',
 '1',
 '4',
 '2',
 '1',
 '10',
 '8',
 '7',
 '3',
 '8',
 '1',
 '1',
 '10',
 '3',
 '10',
 '10',
 '2',
 '1',
 '1',
 '7',
 '1',
 '9',
 '1',
 '2',
 '1',
 '2',
 '3',
 '3'

In [86]:
import re
user_finder = re.compile('/user/.*')
user = (imbd 
        >> find_all('a', attrs = {'href' : user_finder})
        >> map(get_text)
        )
user

['MissCzarChasm',
 'YourSonsDad',
 'lvanka',
 'WindsOfWintergreen',
 'AnaAnaBanana',
 'cartsghammond',
 'Her-Excellency',
 'dylanmoogatt-28263',
 'diegosays',
 'matianero',
 'FixedYourEnding',
 'schroederagustavo',
 'McScared',
 'TopDawgCritic',
 'taseron-1',
 'djkaine',
 'pedopete-37355',
 'juujuuuujj',
 'efecctor',
 'rphanley',
 'ChaCha44',
 'fciocca',
 'Freakazoid13',
 'nyccma1974',
 'yamxt600',
 'bucky716',
 'robscott-1313',
 'rnixon-15663',
 'Wittercom',
 'stosh-96135',
 'internet-52971',
 'juanquaglia',
 'deloudelouvain',
 'rogiervanwegberg',
 'cathiyx',
 'jasperverbrugge',
 'JayTeamman',
 'efd-10467',
 'PyroSikTh',
 'Hazu29',
 'kr98664',
 'lisamicklewright',
 'TheAll-SeeingI',
 'aponteh',
 'amnsulh',
 'olliebenet',
 'henrybrown-terry',
 'sheedykieran',
 'garybest-57435',
 'prolead',
 'usdvp',
 'gurbuzfam',
 'filipmail',
 'dougmacdonaldburr',
 'reedraymondb',
 'mikeygirwin',
 'niseynisey',
 'catherineharding-739-558537',
 'ejedrysek',
 'mhg-26735',
 'justnobody88',
 'jakesteinber

In [89]:
date = (imbd
        >> find_all('span', attrs = {'class' : 'review-date'})
        >> map(get_text)
       )
date

['29 October 2020',
 '28 October 2020',
 '30 October 2020',
 '27 October 2020',
 '27 October 2020',
 '23 October 2020',
 '27 October 2020',
 '23 October 2020',
 '24 October 2020',
 '23 October 2020',
 '23 October 2020',
 '23 October 2020',
 '23 October 2020',
 '24 October 2020',
 '1 November 2020',
 '23 October 2020',
 '23 October 2020',
 '29 October 2020',
 '23 October 2020',
 '23 October 2020',
 '26 October 2020',
 '3 November 2020',
 '31 October 2020',
 '31 October 2020',
 '24 October 2020',
 '23 October 2020',
 '23 October 2020',
 '6 November 2020',
 '7 November 2020',
 '23 October 2020',
 '23 October 2020',
 '23 October 2020',
 '5 November 2020',
 '4 November 2020',
 '5 November 2020',
 '4 November 2020',
 '23 October 2020',
 '23 October 2020',
 '4 November 2020',
 '23 October 2020',
 '25 October 2020',
 '27 October 2020',
 '23 October 2020',
 '24 October 2020',
 '25 October 2020',
 '23 October 2020',
 '25 October 2020',
 '24 October 2020',
 '23 October 2020',
 '24 October 2020',


In [91]:
replace_comma = pipeable(lambda String: String.replace(',', ';'))
text = (imbd
        >> find_all('div', attrs = {'class' : 'text show-more__control'})
        >> map(get_text)
        >> map(replace_comma)
        )
text

['Borat Make a *Glorious* #2! Subsequent Moviefilm: Delivery of Prodigious Bribe to American Regime for Make Benefit Once Glorious Nation of Kazakhstan is very naiiice!America Mayor Rudolph Giuliani say he not like film.America Mayor Rudolph Giuliani say he very much LIE\ndown to fix pants like in nation of Kazakhstan where we not stand up to tuck the shirt. Much success.You watch.Chin qui',
 "What's even funnier than the movie itself; is seeing people cry about it. I mean; damn; Cohen just exposes the stuff; he doesn't make them say or do what they say and do. If you're mad at anyone; be mad at the buffoons caught on tape. Or on second thought; keep being mad here; cause its hella hilarious.Great movie. Had richer; deeper laughs than the first; though not as many. Plus; it's on Amazon; so you can basically watch it for what you already pay or get yourself a free trial.Great s#!^!",
 "For those saying Giuliani was just tucking in his shirt; why lay down on a bed to do so instead of STA

In [179]:
def get_review_finder(List):
    """
    Args - 
        List - list of strings that may contain 'X out of Y found this helpful'.
    Returns - 
        List with the matched phrase for each string, or 'N/A' when there is no match
    """
    # Raw string so that \d is read as a regex digit class, not a string escape
    review_finder = re.compile(r'\d+ out of \d+ found this helpful')
    new_list = []
    for item in List:
        result = review_finder.search(item)
        # search returns None when the phrase is missing
        new_list.append(result.group() if result else 'N/A')
    return new_list

In [180]:
two_columns = (imbd
               >> find_all('div', attrs = {'class' : 'actions text-muted'})
               >> map(get_text)
               >> pipeable(get_review_finder)
               )
two_columns

['186 out of 286 found this helpful',
 '152 out of 235 found this helpful',
 '130 out of 200 found this helpful',
 '206 out of 327 found this helpful',
 '181 out of 295 found this helpful',
 '408 out of 714 found this helpful',
 '177 out of 301 found this helpful',
 '391 out of 692 found this helpful',
 '307 out of 540 found this helpful',
 '317 out of 569 found this helpful',
 '444 out of 813 found this helpful',
 '459 out of 850 found this helpful',
 '218 out of 403 found this helpful',
 '119 out of 217 found this helpful',
 '40 out of 67 found this helpful',
 '265 out of 513 found this helpful',
 '415 out of 839 found this helpful',
 '46 out of 81 found this helpful',
 '474 out of 972 found this helpful',
 '280 out of 565 found this helpful',
 '88 out of 166 found this helpful',
 '10 out of 14 found this helpful',
 '70 out of 130 found this helpful',
 '55 out of 100 found this helpful',
 '74 out of 139 found this helpful',
 '241 out of 489 found this helpful',
 '214 out of 432 found

In [177]:
permalink_finder = re.compile('/review/.*?ref_')
add_link = pipeable(lambda String: 'https://www.imdb.com' + String)
permalink = (imbd 
             >> find_all('div', attrs = {'class' : 'actions text-muted'})
             >> map(find('a', attrs = {'href' : permalink_finder}))
             >> map(tlz.get('href'))
             >> map(add_link)
             )
permalink

['https://www.imdb.com/review/rw6217081/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6213611/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6219436/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6210276/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6211296/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6197576/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6209636/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6197615/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6201438/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6198186/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6199408/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6199405/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6197742/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6202147/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6227128/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6197667/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6198518/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6215329/?ref_=tt_urv',
 'https://

In [178]:
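# Split each 'X out of Y found this helpful' string into its two counts
helpful = [re.findall(r'\d+', s)[:2] if s != 'N/A' else ['N/A', 'N/A']
           for s in two_columns]

In [189]:
# One row per review; newlines inside the text would break a row, so flatten them
rows = [','.join([title, s, u, d, t.replace('\n', ' '), x, y, p])
        for s, u, d, t, (x, y), p in zip(score, user, date, text, helpful, permalink)]

In [191]:
# zip stops at the shortest list, so check that the columns have equal lengths first
with open('lab_5_2.csv', 'w') as f: 
    f.write('Title,Score,User,Date,Text,X,Y,Permalink\n')
    f.write('\n'.join(rows))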
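
The lab asks for commas to be replaced with semicolons so that the review text does not break the columns. As an alternative sketch, not the approach the lab requires, Python's standard `csv` module quotes any field that contains commas or newlines, so the text can be written unchanged. The row below is made-up sample data (`user_a` and the review text are hypothetical), and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical row standing in for the scraped lists
rows = [
    ('Borat Subsequent Moviefilm', '10', 'user_a', '29 October 2020',
     'Great, very nice!\nMuch success.', 186, 286),
]

# In-memory buffer; swap for open('lab_5_2.csv', 'w', newline='') to write a file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Title', 'Score', 'User', 'Date', 'Text', 'X', 'Y'])
writer.writerows(rows)

# Reading the data back recovers the text field, comma and newline included
buffer.seek(0)
for record in csv.reader(buffer):
    print(record)
```

Note that `csv.reader` returns every field as a string, so the X and Y counts would need to be converted back to integers after reading.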