# STA 141B Data & Web Technologies for Data Analysis

### Lecture 12, 11/16/23, Scraping

### Announcements

 - 


### Today's topics
- Scraping with JavaScript
- GraphQL
    
### Resources
- Mitchell: Scraping with Python, Chapters 9 and 10
- [GraphQL](https://www.mobilelive.ca/blog/graphql-vs-rest-what-you-didnt-know) (Attention: This is infotainment!)

### Scraping from `ratemyprofessors.com`

We are interested in retrieving information from the website `ratemyprofessors.com`. By navigating with our browser, we find that all professors at UCD can be retrieved as follows. 

In [1]:
import requests

In [2]:
endpoint = 'https://www.ratemyprofessors.com/search/professors/1073?'
params = {'q':'*'}

In [3]:
result=requests.get(endpoint, params)
result.raise_for_status()  # note the parentheses: without them the check is never executed

In [4]:
import lxml
from bs4 import BeautifulSoup

In [5]:
html = BeautifulSoup(result.text,'lxml')
print(html.prettify()) 

<!DOCTYPE html>
<!-- SSR -->
<html>
 <head>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="#000000" name="theme-color"/>
  <meta content="https://www.ratemyprofessors.com/build/thumbnail.svg" name="thumbnail"/>
  <link href="/build/manifest.json" rel="manifest"/>
  <link href="/static/css/main.1773c5b7.css" rel="stylesheet" type="text/css"/>
  <!-- Google Optimize Anti-flicker snippet -->
  <style>
   .async-hide { opacity: 0 !important}
  </style>
  <script>
   (function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
        h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
        (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
        })(window,document.documentElement,'async-hide','dataLayer',4000,
        {'OPT-MLW3VTZ':true});
  </script>
  <!-- Google Optimize -->
  <script async="" src="https://www.googleoptimize.com/optimize.js?id=OPT-MLW3VTZ">
  </script>
  <script async=""

In [6]:
result.text.find("Lynn")

50771

In [7]:
result.text.find("Stylianos")

-1

As we have already seen in the Wikipedia example, the HTML rendered by the browser does not coincide with the HTML returned by the request: "Lynn" is found in the response, but "Stylianos" is not (`find` returns `-1`). Apparently, some information is fetched only while the *browser* executes JS. 
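The two `find` calls can be generalized into a quick check of which names made it into the static HTML. The following is a minimal sketch: the helper name `split_by_presence` and the sample string are made up for illustration, and the sample string merely stands in for `result.text`.

```python
def split_by_presence(html_text, names):
    """Split names into those present in the raw HTML and those missing from it."""
    found = [name for name in names if name in html_text]
    missing = [name for name in names if name not in html_text]
    return found, missing

# Toy stand-in for result.text: only one professor is part of the static HTML.
static_html = "<div class='card'>Lynn</div><script>loadMore()</script>"

found, missing = split_by_presence(static_html, ["Lynn", "Stylianos"])
print(found)    # names already contained in the server response
print(missing)  # names that would only appear after JavaScript has run
```

Names that end up in `missing` are candidates for content that is loaded client-side.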

The running of scripts is a client-side operation run in the browser itself, rather
than on a web server. 

JavaScript is, by far, the most common and most well-supported client-side scripting
language on the Web today. It can be used to collect information for user tracking,
submit forms without reloading the page, embed multimedia, and even power entire
online games. Even deceptively simple-looking pages can often contain multiple
pieces of JavaScript. You can find it embedded between `<script>` tags in the page’s
source code.
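To get a feeling for how much JavaScript a page carries, one can count its `<script>` tags. The sketch below uses only the standard library; the class name `ScriptCounter` and the sample HTML are invented for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser

class ScriptCounter(HTMLParser):
    """Counts inline and external <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.inline = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            # A script with a src attribute is loaded from another URL.
            if dict(attrs).get("src"):
                self.external += 1
            else:
                self.inline += 1

def count_scripts(html_text):
    parser = ScriptCounter()
    parser.feed(html_text)
    return parser.inline, parser.external

sample = """<html><head>
<script>console.log('inline');</script>
<script async="" src="https://example.com/app.js"></script>
</head><body><p>Hello</p></body></html>"""

print(count_scripts(sample))  # (inline, external)
```

Passing `result.text` instead of `sample` would give the counts for the actual response.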

Since we are interested in the rendered HTML displayed by the browser, we have to artificially render it first, then return the rendered HTML as a string. This can be achieved with `Selenium`. 

Selenium is a powerful web scraping tool developed originally for website testing.
These days it’s also used when the accurate portrayal of websites—as they appear in a
browser—is required. Selenium works by automating browsers to load the website,
retrieve the required data, and even take screenshots or assert that certain actions
happen on the website.

In [13]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [9]:
url = result.url
url

'https://www.ratemyprofessors.com/search/professors/1073?q=%2A'
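The `%2A` in the URL is the percent-encoded form of `*` that `requests` produced from `params`. As a small illustration using only the standard library (not the code path `requests` uses internally), the same encoding and its inverse look like this:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {'q': '*'}

# Encode the parameters the way they appear in the final URL.
query_string = urlencode(params)
print(query_string)

# Going the other way: recover the parameters from a full URL.
url = 'https://www.ratemyprofessors.com/search/professors/1073?' + query_string
print(parse_qs(urlsplit(url).query))
```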

In [11]:
# driver.get(url)

We have already seen that it takes a while to load the page in the browser. We don't have time for this, so we set a page-load timeout and stop loading once it expires. 

In [15]:
from selenium.common.exceptions import TimeoutException
driver.set_page_load_timeout(10) # ten seconds should be enough

try:
    driver.get(url)
except TimeoutException:
    driver.execute_script("window.stop();")

Other professors are not displayed. For that, we have to hit the button `show more`, or, better, specify that we are only interested in stats professors. 

How do we navigate on this page? First, we need to get rid of the cookies banner. Using developer tools, we can find the 'close' button for the cookies banner: 

    "/html/body/div[5]/div/div/button"

See the [docs](https://www.selenium.dev/selenium/docs/api/py/index.html)!

In [17]:
button=driver.find_element("xpath", "/html/body/div[5]/div/div/button")
button.click()
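A positional path such as `/html/body/div[5]/div/div/button` stops working as soon as the page layout shifts. The sketch below illustrates this with the standard library's `xml.etree.ElementTree`, which supports a subset of XPath and serves here only as a stand-in for the browser. The miniature page and the `aria-label` attribute are invented; the real banner may look different.

```python
import xml.etree.ElementTree as ET

# Invented miniature page; the real site's markup is different.
page = """<html><body>
<div id="app"><button>Show More</button></div>
<div class="cookie-banner"><div><button aria-label="close">Close</button></div></div>
</body></html>"""
root = ET.fromstring(page)

positional = "body/div[2]/div/button"            # relies on the banner being the 2nd div
by_attribute = ".//button[@aria-label='close']"  # relies on an attribute instead

print(root.find(positional).text, root.find(by_attribute).text)

# Simulate a layout change: a new <div> appears at the top of <body>.
root.find("body").insert(0, ET.Element("div"))

print(root.find(positional))         # the positional path no longer matches
print(root.find(by_attribute).text)  # the attribute-based path still works
```

If the real button carries a stable attribute, an attribute-based expression passed to `driver.find_element("xpath", ...)` is likely to survive layout changes better than the positional one.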

Next, we should select the stats professors. To do so, we need to access the dropdown menu. From the developer tools, we find that it is coded as a `div` element rather than a `select` element, so we cannot use Selenium's built-in `Select` helper to access the dropdown. 

First, we need to find which `div` actually opens the dropdown. 

In [49]:
driver.find_element("xpath", '//div[@class=" css-1l6bn5c-control"]').click()

Let's see what the HTML of the dropdown menu looks like. 

In [52]:
import lxml.html as lx
from lxml import etree

html = lx.fromstring(driver.page_source)
dropdown = html.xpath('//div[@class=" css-1hwfws3"]')[0]

In [51]:
dropdown

<Element div at 0x7face8976a90>

In [53]:
print(BeautifulSoup(etree.tostring(dropdown),'lxml').prettify())

<html>
 <body>
  <div class="css-1hwfws3">
   <div class="css-1wa3eu0-placeholder">
    Select...
   </div>
   <input aria-autocomplete="list" class="css-62g3xt-dummyInput" id="react-select-3-input" readonly="" tabindex="0" value=""/>
  </div>
 </body>
</html>


We ought to select the element with `id="react-select-3-option-86"`. 

In [55]:
driver.find_element("xpath", '//div[@id="react-select-3-option-86"]').click()

We learn that there are 102 professors in the Statistics department, but only 8 are shown. Further investigation shows that the "Show More" button can be found via its class attribute, which contains `PaginationButton`. 

In [60]:
button=driver.find_element("xpath", "//button[contains(@class, 'PaginationButton')]")
button.click()

In [61]:
import time

In [63]:
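from selenium.common.exceptions import WebDriverException

# Keep clicking the button until it can no longer be found or clicked.
while True: 
    try: 
        time.sleep(0.2)  # give the page a moment to load the next batch
        button=driver.find_element("xpath", "//button[contains(@class, 'PaginationButton')]")
        button.click()
    except WebDriverException:  # base class of Selenium's exceptions
        break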
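A loop of this kind can run for a long time if the button never disappears. The following is a minimal sketch of a bounded variant that does not depend on Selenium itself: `click_until_gone`, `find_button`, `max_clicks` and the `FakePage` class are hypothetical names introduced for illustration. With a real driver one would pass a function that calls `driver.find_element(...)` and set `stop_on` to the relevant Selenium exception.

```python
import time

def click_until_gone(find_button, max_clicks=500, pause=0.2, stop_on=Exception):
    """Click the object returned by find_button() until it fails or max_clicks is hit.

    Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks:
        time.sleep(pause)  # give the page time to load the next batch
        try:
            find_button().click()
        except stop_on:
            break
        clicks += 1
    return clicks

class FakePage:
    """Stand-in for a browser page whose button vanishes after a few clicks."""

    def __init__(self, pages):
        self.remaining = pages

    def find_button(self):
        if self.remaining == 0:
            raise LookupError("no button")
        return self

    def click(self):
        self.remaining -= 1

page = FakePage(3)
print(click_until_gone(page.find_button, pause=0))
```

The cap `max_clicks` is a safety net: the loop ends even if the page keeps offering a button.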

In [65]:
html = lx.fromstring(driver.page_source)

We don't need the browser anymore. We can close it. 

In [66]:
driver.quit()

Since we do not need visual confirmation of what the browser does, we can run it in headless mode next time. 

In [None]:
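# from selenium.webdriver.chrome.options import Options
# chrome_options = Options()
# chrome_options.add_argument("--headless")
# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)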

Let's retrieve the name and link for now. Any further analysis can be performed as in our previous case studies. 

In [67]:
links = html.xpath('//a[@class = "TeacherCard__StyledTeacherCard-syjs0d-0 dLJIlx"]/@href')
links[1:10]

['/professor/220688',
 '/professor/346995',
 '/professor/525108',
 '/professor/770600',
 '/professor/776124',
 '/professor/780398',
 '/professor/810072',
 '/professor/810782',
 '/professor/1109751']
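Class names such as `dLJIlx` look auto-generated and may change when the site is rebuilt, so matching on the exact class string is fragile. A possible alternative is to select anchors by the shape of their `href`. The sketch below shows the idea with the standard library parser; `ProfessorLinkParser` and the sample HTML (including the second, made-up class suffix) are illustrative assumptions. With `lxml`, the analogous XPath would use `starts-with(@href, "/professor/")`.

```python
from html.parser import HTMLParser

class ProfessorLinkParser(HTMLParser):
    """Collects href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix="/professor/"):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href and href.startswith(self.prefix):
            self.links.append(href)

# Invented sample: two professor cards with different class suffixes, one other link.
sample = """
<a class="TeacherCard__StyledTeacherCard-syjs0d-0 dLJIlx" href="/professor/220688">card</a>
<a class="TeacherCard__StyledTeacherCard-syjs0d-0 XyZabc" href="/professor/346995">card</a>
<a href="/search/professors/1073">back to search</a>
"""

parser = ProfessorLinkParser()
parser.feed(sample)
print(parser.links)
```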

In [68]:
names = html.xpath('//div[@class = "CardName__StyledCardName-sc-1gyrgim-0 cJdVEK"]')
names = [name.text for name in names]
names[1:10]

['Abdolrahman Azari',
 'Prabir Burman',
 'Tamara Greasby',
 'Rahman Azari',
 'Clayton Schupp',
 'Greg Wall',
 'Lawrence C. Wang',
 'Soma Roychowdhury',
 'Seung Tae Yi']

In [69]:
import pandas as pd

df=pd.DataFrame({'name': names, 'link': links})
df

|     | name | link |
|-----|------|------|
| 0 | Francisco Samaniego | /professor/162840 |
| 1 | Abdolrahman Azari | /professor/220688 |
| 2 | Prabir Burman | /professor/346995 |
| 3 | Tamara Greasby | /professor/525108 |
| 4 | Rahman Azari | /professor/770600 |
| ... | ... | ... |
| 101 | Alicia Graziosi | /professor/556414 |
| 102 | Michael Klass | /professor/558447 |
| 103 | Yinglei Lai | /professor/560315 |
| 104 | Selvaratnam Sridharma | /professor/560318 |
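The links in the table are relative to the site root. To request the individual professor pages later, they have to be turned into absolute URLs. A minimal sketch with the standard library, assuming `https://www.ratemyprofessors.com` as the base:

```python
from urllib.parse import urljoin

base = 'https://www.ratemyprofessors.com'
relative_links = ['/professor/162840', '/professor/220688']

# urljoin resolves each relative link against the base URL.
full_links = [urljoin(base, link) for link in relative_links]
print(full_links)
```

The same list comprehension could be applied to the `link` column of `df`.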


So far so good. Next, we will see how these steps could have been achieved somewhat more easily. 

We have seen that the HTML was rendered after some JS code had been executed. However, the information we retrieved must have come from querying some database. To see which database was queried by which script, we can use the performance tab in the developer tools. 

As it turns out, the information is fetched via *GraphQL*. GraphQL is used to build APIs like the ones we have seen before, but it is not a REST API. Facebook developed it as an internal technology for its applications and later released it publicly as open source. Since then, it has become a popular choice in the software development community for building web services.

As a query language, GraphQL defines specifications of how a client application can request the needed data from a remote server. As a result, the server application returns a response to the requested client query. The exciting thing to notice here is that the client application can also query exactly what it needs, without relying on the server-side application to define a query. 

GraphQL has become fairly common. Its advantage is that, thanks to specific queries, it avoids some problems of REST APIs, namely 
 - Multiple roundtrips with REST
 - Over-fetching and Under-fetching Problems with REST
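To make "query exactly what it needs" concrete: a GraphQL request sent over HTTP is typically a JSON object with a `query` string and, optionally, a `variables` object. The sketch below only builds such a body and sends nothing. The schema (a `teacher` field taking an `id`) is hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical schema: a `teacher` field that takes an `id` argument.
# The client lists exactly the fields it wants back, here two names.
query = """
query TeacherName($id: ID!) {
  teacher(id: $id) {
    firstName
    lastName
  }
}
"""

payload = {"query": query, "variables": {"id": "42"}}
body = json.dumps(payload)  # this string would be sent as the POST body

decoded = json.loads(body)
print(sorted(decoded))       # top-level keys of the request
print(decoded["variables"])
```

Because the field list is part of the request, a single round trip returns only `firstName` and `lastName`, which is how over-fetching is avoided.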

Let's see how the GraphQL request is made. 

In [70]:
endpoint = 'https://www.ratemyprofessors.com/graphql'
headers = {
    "Authorization": "Basic dGVzdDp0ZXN0", 
}
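The value of the `Authorization` header follows the HTTP Basic scheme, in which the token is the Base64 encoding of `user:password`. As a quick check, the token can be decoded with the standard library:

```python
import base64

header_value = "Basic dGVzdDp0ZXN0"

# Split off the scheme, then decode the Base64 token into "user:password".
scheme, token = header_value.split(" ", 1)
user, password = base64.b64decode(token).decode("utf-8").split(":", 1)
print(scheme, user, password)

# Encoding the pair again reproduces the token from the header.
rebuilt = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
print(rebuilt == token)
```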

In [71]:
# first query
data = {
    "query":"query TeacherSearchResultsPageQuery(\n  $query: TeacherSearchQuery!\n  $schoolID: ID\n) {\n  search: newSearch {\n    ...TeacherSearchPagination_search_1ZLmLD\n  }\n  school: node(id: $schoolID) {\n    __typename\n    ... on School {\n      name\n    }\n    id\n  }\n}\n\nfragment TeacherSearchPagination_search_1ZLmLD on newSearch {\n  teachers(query: $query, first: 8, after: \"\") {\n    didFallback\n    edges {\n      cursor\n      node {\n        ...TeacherCard_teacher\n        id\n        __typename\n      }\n    }\n    pageInfo {\n      hasNextPage\n      endCursor\n    }\n    resultCount\n    filters {\n      field\n      options {\n        value\n        id\n      }\n    }\n  }\n}\n\nfragment TeacherCard_teacher on Teacher {\n  id\n  legacyId\n  avgRating\n  numRatings\n  ...CardFeedback_teacher\n  ...CardSchool_teacher\n  ...CardName_teacher\n  ...TeacherBookmark_teacher\n}\n\nfragment CardFeedback_teacher on Teacher {\n  wouldTakeAgainPercent\n  avgDifficulty\n}\n\nfragment CardSchool_teacher on Teacher {\n  department\n  school {\n    name\n    id\n  }\n}\n\nfragment CardName_teacher on Teacher {\n  firstName\n  lastName\n}\n\nfragment TeacherBookmark_teacher on Teacher {\n  id\n  isSaved\n}\n",
    "variables":{
        "query":{
            "text":"",
            "schoolID":"U2Nob29sLTEwNzM=",
            "fallback":True,
            "departmentID":"RGVwYXJ0bWVudC0xNDA=", 
        },
        "schoolID":"U2Nob29sLTEwNzM="
    }
} 

In [72]:
response = requests.post(endpoint, headers = headers, json=data)
response.raise_for_status()
result = response.json()
result

{'data': {'school': {'__typename': 'School',
   'id': 'U2Nob29sLTEwNzM=',
   'name': 'University of California Davis'},
  'search': {'teachers': {'didFallback': False,
    'edges': [{'cursor': 'YXJyYXljb25uZWN0aW9uOjA=',
      'node': {'__typename': 'Teacher',
       'avgDifficulty': 3.4,
       'avgRating': 3.6,
       'department': 'Statistics',
       'firstName': 'Francisco',
       'id': 'VGVhY2hlci0xNjI4NDA=',
       'isSaved': False,
       'lastName': 'Samaniego',
       'legacyId': 162840,
       'numRatings': 47,
       'school': {'id': 'U2Nob29sLTEwNzM=',
        'name': 'University of California Davis'},
       'wouldTakeAgainPercent': -1}},
     {'cursor': 'YXJyYXljb25uZWN0aW9uOjE=',
      'node': {'__typename': 'Teacher',
       'avgDifficulty': 3.5,
       'avgRating': 3.3,
       'department': 'Statistics',
       'firstName': 'Abdolrahman',
       'id': 'VGVhY2hlci0yMjA2ODg=',
       'isSaved': False,
       'lastName': 'Azari',
       'legacyId': 220688,
       'numRa

In [73]:
def fetch_info(dic): 
    name = dic['node']['firstName'] + " " + dic['node']['lastName']
    lid = "/professor?tid=" + str(dic['node']['legacyId'])
    return name, lid
    
prof_list = result['data']['search']['teachers']['edges']
    
[fetch_info(prof) for prof in prof_list]

[('Francisco Samaniego', '/professor?tid=162840'),
 ('Abdolrahman Azari', '/professor?tid=220688'),
 ('Prabir Burman', '/professor?tid=346995'),
 ('Tamara Greasby', '/professor?tid=525108'),
 ('Rahman Azari', '/professor?tid=770600'),
 ('Clayton Schupp', '/professor?tid=776124'),
 ('Greg Wall', '/professor?tid=780398'),
 ('Lawrence C. Wang', '/professor?tid=810072')]
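Each edge carries more than the name and the ID. The following is a sketch of how an edge could be flattened into a record suitable for a data frame. The function name `flatten_edge` is made up, the sample edge is copied from the response shown above, and treating `-1` in `wouldTakeAgainPercent` as "not available" is an assumption based on that output.

```python
def flatten_edge(edge):
    """Turn one GraphQL edge into a flat dictionary."""
    node = edge['node']
    would_take_again = node['wouldTakeAgainPercent']
    return {
        'name': node['firstName'] + " " + node['lastName'],
        'legacyId': node['legacyId'],
        'department': node['department'],
        'avgRating': node['avgRating'],
        'avgDifficulty': node['avgDifficulty'],
        'numRatings': node['numRatings'],
        # Assumption: -1 is a placeholder for a missing value.
        'wouldTakeAgainPercent': None if would_take_again == -1 else would_take_again,
    }

sample_edge = {
    'cursor': 'YXJyYXljb25uZWN0aW9uOjA=',
    'node': {'__typename': 'Teacher', 'avgDifficulty': 3.4, 'avgRating': 3.6,
             'department': 'Statistics', 'firstName': 'Francisco',
             'id': 'VGVhY2hlci0xNjI4NDA=', 'isSaved': False,
             'lastName': 'Samaniego', 'legacyId': 162840, 'numRatings': 47,
             'school': {'id': 'U2Nob29sLTEwNzM=',
                        'name': 'University of California Davis'},
             'wouldTakeAgainPercent': -1},
}

record = flatten_edge(sample_edge)
print(record)
```

A list of such records can be passed directly to `pd.DataFrame`.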

Using developer tools, we find that the subsequent requests use a different data layout. Watch out, the `query` value has changed! 

In [74]:
cursor = result['data']['search']['teachers']['pageInfo']['endCursor']
cursor

'YXJyYXljb25uZWN0aW9uOjc='
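The cursor and the IDs in `variables` look opaque, but they are plain Base64 strings. Decoding them, as sketched below, suggests that the cursor simply encodes the position of the last record received. How the server interprets these values internally is an assumption; `decode_id` and `encode_id` are helper names introduced here.

```python
import base64

def decode_id(value):
    """Decode a Base64 string into readable text."""
    return base64.b64decode(value).decode("utf-8")

def encode_id(text):
    """Encode readable text into its Base64 string form."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

for value in ["YXJyYXljb25uZWN0aW9uOjc=",   # endCursor of the first page
              "U2Nob29sLTEwNzM=",           # schoolID
              "RGVwYXJ0bWVudC0xNDA="]:      # departmentID
    print(value, "->", decode_id(value))
```

The decoded school ID matches the `1073` that appears in the search URL used earlier.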

In [75]:
def new_data(cursor):
    data = {
        "query":"query TeacherSearchPaginationQuery(\n  $count: Int!\n  $cursor: String\n  $query: TeacherSearchQuery!\n) {\n  search: newSearch {\n    ...TeacherSearchPagination_search_1jWD3d\n  }\n}\n\nfragment TeacherSearchPagination_search_1jWD3d on newSearch {\n  teachers(query: $query, first: $count, after: $cursor) {\n    didFallback\n    edges {\n      cursor\n      node {\n        ...TeacherCard_teacher\n        id\n        __typename\n      }\n    }\n    pageInfo {\n      hasNextPage\n      endCursor\n    }\n    resultCount\n    filters {\n      field\n      options {\n        value\n        id\n      }\n    }\n  }\n}\n\nfragment TeacherCard_teacher on Teacher {\n  id\n  legacyId\n  avgRating\n  numRatings\n  ...CardFeedback_teacher\n  ...CardSchool_teacher\n  ...CardName_teacher\n  ...TeacherBookmark_teacher\n}\n\nfragment CardFeedback_teacher on Teacher {\n  wouldTakeAgainPercent\n  avgDifficulty\n}\n\nfragment CardSchool_teacher on Teacher {\n  department\n  school {\n    name\n    id\n  }\n}\n\nfragment CardName_teacher on Teacher {\n  firstName\n  lastName\n}\n\nfragment TeacherBookmark_teacher on Teacher {\n  id\n  isSaved\n}\n",
        "variables":{
            "count":8,
            "cursor": cursor, 
            "query":{
                "text":"",
                "schoolID":"U2Nob29sLTEwNzM=",
                "fallback":True,
                "departmentID":"RGVwYXJ0bWVudC0xNDA=", 
            }
        }
    } 
    return data
data = new_data(cursor)

In [76]:
response = requests.post(endpoint, headers = headers, json=data)
response.raise_for_status()
result = response.json()
result

{'data': {'search': {'teachers': {'didFallback': False,
    'edges': [{'cursor': 'YXJyYXljb25uZWN0aW9uOjg=',
      'node': {'__typename': 'Teacher',
       'avgDifficulty': 1.4,
       'avgRating': 3.9,
       'department': 'Statistics',
       'firstName': 'Soma',
       'id': 'VGVhY2hlci04MTA3ODI=',
       'isSaved': False,
       'lastName': 'Roychowdhury',
       'legacyId': 810782,
       'numRatings': 24,
       'school': {'id': 'U2Nob29sLTEwNzM=',
        'name': 'University of California Davis'},
       'wouldTakeAgainPercent': -1}},
     {'cursor': 'YXJyYXljb25uZWN0aW9uOjk=',
      'node': {'__typename': 'Teacher',
       'avgDifficulty': 0,
       'avgRating': 0,
       'department': 'Statistics',
       'firstName': 'Seung Tae',
       'id': 'VGVhY2hlci0xMTA5NzUx',
       'isSaved': False,
       'lastName': 'Yi',
       'legacyId': 1109751,
       'numRatings': 0,
       'school': {'id': 'U2Nob29sLTEwNzM=',
        'name': 'University of California Davis'},
       'wouldTak

In [77]:
prof_list = result['data']['search']['teachers']['edges']
[fetch_info(prof) for prof in prof_list]

[('Soma Roychowdhury', '/professor?tid=810782'),
 ('Seung Tae Yi', '/professor?tid=1109751'),
 ('Mitali Das', '/professor?tid=1112661'),
 ('Peter Hall', '/professor?tid=1133672'),
 ('Rudolph Beran', '/professor?tid=1155956'),
 ('Travis Loux', '/professor?tid=1177487'),
 ('Azari Abdolrahman', '/professor?tid=1239201'),
 ('Erin Esp', '/professor?tid=1310060')]

In [78]:
cursor = result['data']['search']['teachers']['pageInfo']['endCursor']
cursor

'YXJyYXljb25uZWN0aW9uOjE1'

In [79]:
flag = result['data']['search']['teachers']['pageInfo']['hasNextPage']
flag

True

Let's formalize this. 

In [80]:
def fetch_profs(): 
    endpoint = 'https://www.ratemyprofessors.com/graphql'
    headers = {
        "Authorization": "Basic dGVzdDp0ZXN0", 
    }
    
    # first query
    data = {
        "query":"query TeacherSearchResultsPageQuery(\n  $query: TeacherSearchQuery!\n  $schoolID: ID\n) {\n  search: newSearch {\n    ...TeacherSearchPagination_search_1ZLmLD\n  }\n  school: node(id: $schoolID) {\n    __typename\n    ... on School {\n      name\n    }\n    id\n  }\n}\n\nfragment TeacherSearchPagination_search_1ZLmLD on newSearch {\n  teachers(query: $query, first: 8, after: \"\") {\n    didFallback\n    edges {\n      cursor\n      node {\n        ...TeacherCard_teacher\n        id\n        __typename\n      }\n    }\n    pageInfo {\n      hasNextPage\n      endCursor\n    }\n    resultCount\n    filters {\n      field\n      options {\n        value\n        id\n      }\n    }\n  }\n}\n\nfragment TeacherCard_teacher on Teacher {\n  id\n  legacyId\n  avgRating\n  numRatings\n  ...CardFeedback_teacher\n  ...CardSchool_teacher\n  ...CardName_teacher\n  ...TeacherBookmark_teacher\n}\n\nfragment CardFeedback_teacher on Teacher {\n  wouldTakeAgainPercent\n  avgDifficulty\n}\n\nfragment CardSchool_teacher on Teacher {\n  department\n  school {\n    name\n    id\n  }\n}\n\nfragment CardName_teacher on Teacher {\n  firstName\n  lastName\n}\n\nfragment TeacherBookmark_teacher on Teacher {\n  id\n  isSaved\n}\n",
        "variables":{
            "query":{
                "text":"",
                "schoolID":"U2Nob29sLTEwNzM=",
                "fallback":True,
                "departmentID":"RGVwYXJ0bWVudC0xNDA=", 
            },
            "schoolID":"U2Nob29sLTEwNzM="
        }
    } 
    
    response = requests.post(endpoint, headers = headers, json=data)
    response.raise_for_status()
    result = response.json()
    
    teachers = result['data']['search']['teachers']
    df = [fetch_info(prof) for prof in teachers['edges']]
    
    # subsequent queries: follow the cursor only while the server reports a next page
    while teachers['pageInfo']['hasNextPage']: 
        data = new_data(teachers['pageInfo']['endCursor'])
        response = requests.post(endpoint, headers = headers, json=data)
        response.raise_for_status()
        result = response.json()
            
        teachers = result['data']['search']['teachers']
        df.extend([fetch_info(prof) for prof in teachers['edges']])
        
    return df
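The pattern inside `fetch_profs` is cursor-based pagination: request a page, keep its items, and continue from the returned cursor while the server reports another page. The sketch below isolates that pattern from the network code. The generator `paginate` and the fake server `fake_fetch` are hypothetical and exist only to make the example runnable offline.

```python
def paginate(fetch_page, cursor=""):
    """Yield items from all pages.

    fetch_page(cursor) must return a tuple (items, end_cursor, has_next_page).
    """
    while True:
        items, cursor, has_next = fetch_page(cursor)
        yield from items
        if not has_next:
            break

# Fake server holding five records, served two at a time.
RECORDS = ["a", "b", "c", "d", "e"]

def fake_fetch(cursor, size=2):
    start = int(cursor) if cursor else 0
    end = min(start + size, len(RECORDS))
    return RECORDS[start:end], str(end), end < len(RECORDS)

print(list(paginate(fake_fetch)))
```

In the scraping setting, `fetch_page` would wrap the `requests.post` call and read `edges`, `endCursor` and `hasNextPage` from the response.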

In [81]:
df = fetch_profs()

In [82]:
pd.DataFrame(df)

|     | 0 | 1 |
|-----|---|---|
| 0 | Francisco Samaniego | /professor?tid=162840 |
| 1 | Abdolrahman Azari | /professor?tid=220688 |
| 2 | Prabir Burman | /professor?tid=346995 |
| 3 | Tamara Greasby | /professor?tid=525108 |
| 4 | Rahman Azari | /professor?tid=770600 |
| ... | ... | ... |
| 6147 | Paul Cartledge | /professor?tid=965738 |
| 6148 | Pulindu Ratnasekera | /professor?tid=2810585 |
| 6149 | John Stockton | /professor?tid=2485678 |
| 6150 | Tony Passero | /professor?tid=2791404 |


### Summary 

- `Selenium` is very useful to remote-control a browser
- Internally, information is usually handled via APIs anyway