Key points:

* Demonstrate an exploratory workflow for web scraping
* Use `selenium` with `chromedriver` to crawl the pages
* Export the data to CSV using `pandas.DataFrame.to_csv()`
* Use a nested loop to handle the utexas page: an outer loop over categories and an inner loop over subjects


# Libguides crawler

A libguide is a standard type of document, curated and maintained by librarians, that covers a particular topic. Topics range from how to use library resources to the structure of a specific academic subject. Libguides list databases, textbooks, exercises, and projects, which makes them a good starting point when you enter a new area.

### Final result

* [libguides/columbia.csv](libguides/columbia.csv) -- `(subject, link)`
* [libguides/utexas.csv](libguides/utexas.csv) -- `(category, subject, link)`

In [70]:
import pandas as pd

## First try: columbia

In [47]:
import requests
from bs4 import BeautifulSoup
from IPython.core.display import HTML

In [27]:
content = requests.get('https://library.columbia.edu/subject-guides.html').text
html = BeautifulSoup(content, 'lxml')

In [36]:
#[a.text for a in html.find_all('a')]

In [34]:
#a.text

''

In [38]:
open('libguides/columbia.html', 'w').write(content)

26523

In [39]:
HTML('''
<iframe width=100% src="libguides/columbia.html"></iframe>
'''
)

## First try: utexas

In [40]:
import requests
from bs4 import BeautifulSoup

In [41]:
content = requests.get('https://guides.lib.utexas.edu/?b=s').text

In [42]:
html = BeautifulSoup(content, 'lxml')

In [43]:
#html.find_all('a')

In [44]:
open('libguides/utexas.html', 'w').write(content)

36605

In [46]:
HTML('''
<iframe width=100% src="libguides/utexas.html"></iframe>
'''
)

## Observation

Both sites load their guide lists dynamically, so `requests` alone is not enough to retrieve the content. We need to drive a real browser instead.
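
One way to confirm this observation without eyeballing the saved HTML is to count how many anchors in the static response point at the guide host. The sketch below uses only the standard library parser rather than BeautifulSoup; `LinkCollector` and `count_guide_links` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a static HTML string."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def count_guide_links(html_text, guide_host):
    """Count anchors whose URL points at the given guide host."""
    collector = LinkCollector()
    collector.feed(html_text)
    return sum(
        1 for href in collector.hrefs if urlparse(href).netloc == guide_host
    )
```

For example, `count_guide_links(content, 'guides.library.columbia.edu')` on the `requests` response can be compared with the number of links the browser shows. A count near zero would suggest the list is filled in after the page loads.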

## Use Selenium: columbia

In [50]:
from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
import time  
import re 

In [51]:
browser = webdriver.Chrome()

In [52]:
browser.get('https://library.columbia.edu/subject-guides.html')

In [54]:
el_subjects = browser.find_elements_by_css_selector('div#s-lg-widget-1497288984288 ul li a')

In [56]:
e = el_subjects[0]

In [61]:
e.text, e.get_attribute('href')

('Active Audience: Course guide',
 'https://guides.library.columbia.edu/activeaudience')

In [63]:
subjects = [(e.text, e.get_attribute('href')) for e in el_subjects]
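
A note for readers on newer Selenium releases: the `find_elements_by_css_selector` helpers used in this notebook were deprecated in Selenium 4 and later removed, so these cells may raise `AttributeError` there. Below is a minimal sketch of a version-neutral helper, assuming only the generic `find_elements(by, value)` method. `extract_links` is a hypothetical name, and the string `"css selector"` is the value of Selenium's `By.CSS_SELECTOR`.

```python
CSS_SELECTOR = "css selector"  # same string as selenium's By.CSS_SELECTOR


def extract_links(root, css):
    """Return (text, href) pairs for every element under root matching css.

    root can be a WebDriver or a WebElement; both expose find_elements().
    """
    elements = root.find_elements(CSS_SELECTOR, css)
    return [(el.text, el.get_attribute("href")) for el in elements]
```

With it, the cell above could be written as `subjects = extract_links(browser, 'div#s-lg-widget-1497288984288 ul li a')`.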

In [66]:
len(subjects)

328

In [67]:
subjects[0]

('Active Audience: Course guide',
 'https://guides.library.columbia.edu/activeaudience')

In [68]:
subjects[-1]

('Zulu Language and Culture Acquisitions at Columbia',
 'https://guides.library.columbia.edu/zulu-language')

In [69]:
subjects[100]

('Decennial Census Information: 2000 Census',
 'https://guides.library.columbia.edu/2000')

In [73]:
df_columbia = pd.DataFrame(subjects, columns=['Subject', 'Link'])

In [76]:
df_columbia.head()

| | Subject | Link |
|---|---|---|
| 0 | Active Audience: Course guide | https://guides.library.columbia.edu/activeaudi... |
| 1 | Adriatic Romanticisms | https://guides.library.columbia.edu/romanticism |
| 2 | Advanced Investment Research: Course guide | https://guides.library.columbia.edu/AIR |
| 3 | Advanced Investment Research: Course guide | https://guides.library.columbia.edu/advinvestment |
| 4 | African Civilization (AFCV C1020) | https://guides.library.columbia.edu/AFCV-C1020... |


In [75]:
df_columbia.to_csv('libguides/columbia.csv')
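
By default `to_csv()` also writes the row index as an unnamed first column, so the file holds `(index, subject, link)` rather than just `(subject, link)`. If the index is not wanted, passing `index=False` drops it. The sketch below writes to in-memory buffers instead of files so the two headers can be compared; the sample row is copied from the output above.

```python
import io

import pandas as pd

subjects = [
    ("Active Audience: Course guide",
     "https://guides.library.columbia.edu/activeaudience"),
]
df = pd.DataFrame(subjects, columns=["Subject", "Link"])

# Default behaviour: the row index becomes an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False writes only the named columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]        # ",Subject,Link"
header_without_index = without_index.getvalue().splitlines()[0]  # "Subject,Link"
```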

## Use Selenium: utexas

In [130]:
browser.get('https://guides.lib.utexas.edu/?b=s')

In [131]:
el_category = browser.find_elements_by_css_selector('div.panel.panel-default')[0]

In [132]:
el_category.find_element_by_css_selector('div.bold').text

'Accounting'

In [133]:
el_category.click()

In [134]:
el_a = el_category.find_elements_by_css_selector('ul.s-lg-guide-list li a')[0]

In [135]:
el_a.text, el_a.get_attribute('href')

('"Business Research Center"', 'https://guides.lib.utexas.edu/BRC')

In [136]:
el_a.text

'"Business Research Center"'

In [138]:
subjects = []

el_categories = browser.find_elements_by_css_selector('div.panel.panel-default')
for el_category in el_categories:
    category = el_category.find_element_by_css_selector('div.bold').text
    
    el_category.click()
    time.sleep(1)
    
    el_as = el_category.find_elements_by_css_selector('ul.s-lg-guide-list li a')
    for el_a in el_as:
        subject = el_a.text
        link = el_a.get_attribute('href')
        subjects.append((category, subject, link))
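
The fixed `time.sleep(1)` assumes every category expands within one second. A hedged alternative is to poll until the links report non-empty text and give up after a timeout. The helper below is a generic sketch using only the standard library; `wait_until` and its parameters are hypothetical names, and Selenium's own `WebDriverWait` serves a similar purpose.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or timeout elapses.

    Returns the truthy value, or None if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Inside the loop, the sleep could become `wait_until(lambda: [a for a in el_category.find_elements_by_css_selector('ul.s-lg-guide-list li a') if a.text])`. Note that a category with no guides at all would then wait for the full timeout.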

In [139]:
#subjects

In [140]:
df_utexas = pd.DataFrame(subjects, columns=['Category', 'Subject', 'Link'])

In [142]:
df_utexas.head()

| | Category | Subject | Link |
|---|---|---|---|
| 0 | Accounting | "Business Research Center" | https://guides.lib.utexas.edu/BRC |
| 1 | Accounting | Accounting & Tax Research (ACC) | https://guides.lib.utexas.edu/ACC |
| 2 | Advertising | Advertising and Public Relations | https://guides.lib.utexas.edu/ADVPR |
| 3 | Aerospace Engineering | Aerospace Engineering | https://guides.lib.utexas.edu/aerospace |
| 4 | African American Studies | African and African American Studies | https://guides.lib.utexas.edu/AAAS |


In [143]:
df_utexas.tail()

| | Category | Subject | Link |
|---|---|---|---|
| 495 | Women's and Gender Studies | Disability Studies | https://guides.lib.utexas.edu/disabilitystudies |
| 496 | Women's and Gender Studies | LGBTQA+ Studies | https://guides.lib.utexas.edu/lgbtq |
| 497 | Women's and Gender Studies | Women's & Gender Studies | https://guides.lib.utexas.edu/wgs |
| 498 | Yiddish | Hebrew / Jewish / Israel Studies | https://guides.lib.utexas.edu/hebrew-jewish-is... |
| 499 | Youth Literature | Youth Literature | https://guides.lib.utexas.edu/youthlit |


In [144]:
df_utexas.to_csv('libguides/utexas.csv')

### Notes on the utexas case

There are two key points when crawling the utexas libguides.

First, the site uses "category" as the first level of navigation. You need to click a category name to expand the secondary menu, which contains the `(subject, link)` tuples we are looking for. For utexas, this gives a richer tuple: `(category, subject, link)`.

The next thing to watch out for is the following block.

```
    el_category.click()
    time.sleep(1)
```

You can try running the loop without these two lines. If we don't click the category, the secondary menu is not expanded, and querying `.text` on the anchor elements returns an empty string, even though the selector still finds the anchors. Due to time constraints, I am not digging into the root cause today; it is left to interested readers as an exercise.
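
As a starting point for that exercise: Selenium's `.text` reports only the text a user could see, so links inside a collapsed menu may come back empty. The sketch below may help check whether visibility is the cause. It falls back to the DOM `textContent` property when `.text` is empty. `element_text` is a hypothetical helper, and it has not been verified against the utexas page.

```python
def element_text(el):
    """Return the visible text if present, else the raw textContent."""
    text = el.text
    if text:
        return text
    # textContent includes text of elements that are not displayed.
    raw = el.get_attribute("textContent") or ""
    return raw.strip()
```

If `element_text(el_a)` returns the subject name before any click, the click and sleep would only be needed for visibility, not for loading the data.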

## Conclusion

Although these universities use the same libguides system as the backend, the frontends can be very different. To collect the full catalogue, we need to manually craft suitable logic for each site, mainly CSS selectors. Next steps:

1. Create a list of libguides root pages (of different universities)
2. Further check if some CSS selectors can be re-used
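
A rough sketch of how these two steps might be organised: keep the root pages and their selectors in one table, then look for selectors that appear under more than one site. The URLs and selectors are the ones used in this notebook; the `SITES` layout and the `shared_selectors` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Root pages and CSS selectors taken from the two crawls in this notebook.
SITES = {
    "columbia": {
        "url": "https://library.columbia.edu/subject-guides.html",
        "selectors": ["div#s-lg-widget-1497288984288 ul li a"],
    },
    "utexas": {
        "url": "https://guides.lib.utexas.edu/?b=s",
        "selectors": [
            "div.panel.panel-default",
            "div.bold",
            "ul.s-lg-guide-list li a",
        ],
    },
}


def shared_selectors(sites):
    """Map each selector used by more than one site to the sites using it."""
    usage = defaultdict(set)
    for name, config in sites.items():
        for selector in config["selectors"]:
            usage[selector].add(name)
    return {
        selector: sorted(names)
        for selector, names in usage.items()
        if len(names) > 1
    }
```

For the two sites crawled so far, `shared_selectors(SITES)` is empty, which matches the conclusion above. The check becomes more useful as more universities are added to the table.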