# Scrape HTML of trip descriptions
In this notebook we use `requests` to get the URLs to all accounts of LSD trips, and scrape the raw HTML for each.

In [96]:
from pathlib import Path  # needed for SAVE_PATH below

import requests

First, we scrape the HTML of the index page that lists all LSD trip accounts.

In [87]:
# Website top level URL
BASE_URL = 'https://www.erowid.org'
SAVE_PATH = Path('./artefacts')
# Path to index of accounts
EXPERIENCE_INDEX_URL = BASE_URL + '/experiences/exp.cgi'
LSD_INDEX_URL = BASE_URL + '/experiences/subs/exp_LSD.shtml'
LSD_GENERAL_INDEX_URL = BASE_URL + '/experiences/subs/exp_LSD_General.shtml'

At the time of analysis there were 657 general LSD experience reports. By default, only 100 are listed per page. To view them all on a single page, we need to build a different URL from the experience index URL and Erowid's query parameters.

- `S` specifies the drug ('2' for LSD)
- `C` specifies the type of experience ('1' for general)
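As a rough sketch of how these parameters end up in the URL, the standard library's `urllib.parse.urlencode` should produce the same query string that `requests` builds from a `params` dict. The `Start` and `Max` values are taken from the payload used in this notebook; their meaning (result offset and page size) is an assumption based on their names.

```python
from urllib.parse import urlencode

# Index endpoint, as defined earlier in the notebook
EXPERIENCE_INDEX_URL = 'https://www.erowid.org/experiences/exp.cgi'

# S=2 selects LSD, C=1 selects general reports.
# Start and Max are assumed to be the result offset and page size.
payload = {'S': 2, 'C': 1, 'Start': 0, 'Max': 1000}

index_url = EXPERIENCE_INDEX_URL + '?' + urlencode(payload)
print(index_url)
```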

We will define a helper function to achieve this task with any URL, as well as to save to file and avoid re-scraping pages.

In [217]:
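def get_html(url, params=None, save_file=None, force_overwrite=False):
    """
    Scrape HTML data from a URL. If the page was already scraped and saved
    to `save_file`, read it from disk instead of re-querying.

    Returns the cached text (str) on a cache hit, otherwise the
    `requests.Response` from the fresh query.
    """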
    # Store in the SAVE_PATH directory
    if save_file is not None and str(SAVE_PATH) not in save_file:
        save_file = str(SAVE_PATH / save_file)

    # If file already exists, simply read in
    if save_file is not None and Path(save_file).exists() and not force_overwrite:
        with open(save_file, 'r') as fhand:
            print("Read in", save_file)
            text = fhand.read()
            print(f"{len(text):,} characters")
            return text
    else:
        response = requests.get(url, params=params)
        print(f"Status code: {response.status_code} for URL {response.url}")
        print(f"{len(response.text):,} characters in {response.elapsed.seconds} seconds")
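        if save_file is not None:
            # Make sure the target directory exists before writing
            Path(save_file).parent.mkdir(parents=True, exist_ok=True)
            with open(save_file, 'w') as fhand:
                fhand.write(response.text)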
        return response
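The read-from-disk-or-fetch pattern can also be sketched separately from the network call, which makes it easy to check without hitting Erowid. The helper name `cached_text` and its `fetch` callback are hypothetical and not part of the notebook; unlike `get_html`, this version always returns a string.

```python
import tempfile
from pathlib import Path


def cached_text(path, fetch, force_overwrite=False):
    """Return the text stored at `path`, calling `fetch()` only on a cache miss."""
    path = Path(path)
    if path.exists() and not force_overwrite:
        return path.read_text(encoding='utf-8')
    text = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding='utf-8')
    return text


# Demo with a fake fetcher that records how often it is called
calls = []

def fake_fetch():
    calls.append(1)
    return '<html>stub page</html>'

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'pages' / 'stub.html'
    first = cached_text(target, fake_fetch)   # cache miss: calls fake_fetch
    second = cached_text(target, fake_fetch)  # cache hit: reads the file

print(first == second, len(calls))
```

Returning the same type on both paths would remove the need for the `isinstance` check after each call, at the cost of losing access to `response.url`.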

In [213]:
LSD_INDEX_SAVE = 'lsd_index_page.html'

# Will generate 'https://www.erowid.org/experiences/exp.cgi?S=2&C=1&Start=0&Max=1000'
payload = {'S': 2, 'C': 1, 'Start': 0, 'Max': 1000}
response = get_html(EXPERIENCE_INDEX_URL, params=payload,
                    save_file=LSD_INDEX_SAVE)
if isinstance(response, requests.models.Response):
    response = response.text

Read in artefacts/lsd_index_page.html
123,518 characters


Now that we have scraped the HTML of the index of all LSD reviews, we need to parse it to retrieve the hyperlink to each review. We will use `BeautifulSoup`.

In [98]:
from bs4 import BeautifulSoup, Tag, Comment, NavigableString

In [105]:
# soup = BeautifulSoup(html_doc, 'html.parser')
# `response` holds the index page HTML as a string (see the previous cell)
soup = BeautifulSoup(response, 'lxml')
print(soup.prettify()[:500])

<html>
 <head>
  <title>
   LSD Reports - General : Erowid Experience Vaults
  </title>
  <meta content="Erowid Experience Vaults: An Experience" name="description"/>
  <meta content="Experience Report Vaults, trip reports, stories, descriptions" name="keywords"/>
  <link href="/includes/general_default.css" rel="stylesheet" type="text/css"/>
  <link href="includes/exp.css" rel="stylesheet" type="text/css"/>
  <!-- Sperowider <noindex/> -->
 </head>
 <body alink="#008080" bgcolor="#000000" link=


The index page is structured as a table like so:

![title](erowid_index.png)

We will find all tables in the parse tree. Erowid's HTML is awkwardly laid out: the main table we want is actually the third table on the page, nested inside the second.

In [144]:
tables = soup.find_all('table')

# We want the third table (nested in the second)
index_table = tables[2]
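To see why a nested table shows up at index 2, here is a small sketch on a made-up page. The `id` values are invented for illustration, and `html.parser` is used so that no extra parser needs to be installed. `find_all` returns matches in document order, so a table nested inside the second table comes third.

```python
from bs4 import BeautifulSoup

# Toy page: two top-level tables, the second containing a nested one
html = """
<table id="nav"><tr><td>menu</td></tr></table>
<table id="layout"><tr><td>
  <table id="index"><tr><td>report row</td></tr></table>
</td></tr></table>
"""

toy_soup = BeautifulSoup(html, 'html.parser')
toy_tables = toy_soup.find_all('table')

# Collect the ids in the order find_all returned them
table_ids = [t['id'] for t in toy_tables]
# The third match sits inside the second table
parent_id = toy_tables[2].find_parent('table')['id']
print(table_ids, parent_id)
```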

In [156]:
for i, elem in enumerate(list(index_table.children)[:7]):
    print(f"{i + 1}. {type(elem)}: {elem}")

1. <class 'bs4.element.NavigableString'>: 

2. <class 'bs4.element.Tag'>: <tr height="10">
<th width="75"><input onclick="SortBy('RA');" type="Button" value="Rating"/><img src="/experiences/images/arrow_down.jpg"/></th>
<th width="230"><input onclick="SortBy('TA');" type="Button" value=" Title"/></th>
<th width="105"><input onclick="SortBy('AA');" type="Button" value=" Author "/></th>
<th width="150"><input onclick="SortBy('SA');" type="Button" value="Substance"/></th>
<th width="85"><input onclick="SortBy('PDD');" type="Button" value="Pub Date"/><img src="/experiences/images/arrow_down.jpg"/></th>
</tr>
3. <class 'bs4.element.NavigableString'>: 

4. <class 'bs4.element.Tag'>: <tr height="8"><th colspan="5"></th></tr>
5. <class 'bs4.element.NavigableString'>: 

6. <class 'bs4.element.Tag'>: <tr class=""><td> <img align="right" alt="Very Highly Recommended" border="0" src="images/exp_star_3.gif"/></td><td><a href="exp.php?ID=89042">Some Growing Up to Do</a></td><td>thingummajig</td><td>

This is a bit annoying: every second child of the index table is a blank `NavigableString`, the datatype `BeautifulSoup` uses for text within tags. These are simply the newlines between the `<tr>` tags in Erowid's source HTML, which the parser keeps as text nodes.

In either case, to avoid dealing with this, we'll pull out all the table rows and just work with these.
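A minimal sketch of the same effect on a made-up table: the newlines between rows come back as `NavigableString` children, while `find_all('tr')` returns only the tags. The toy HTML is illustrative and is parsed with `html.parser` rather than `lxml`.

```python
from bs4 import BeautifulSoup, Tag

toy_html = "<table>\n<tr><td>a</td></tr>\n<tr><td>b</td></tr>\n</table>"
toy_table = BeautifulSoup(toy_html, 'html.parser').table

# Direct children alternate between whitespace strings and <tr> tags
kinds = [type(child).__name__ for child in toy_table.children]

# Either filter on type, or ask for the rows directly
tag_children = [c for c in toy_table.children if isinstance(c, Tag)]
rows_only = toy_table.find_all('tr')
print(kinds, len(tag_children), len(rows_only))
```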

In [175]:
for i, elem in enumerate(index_table.find_all('tr')):
    print(f"{i + 1}. {elem}: \n")
    if i >= 4:
        break

1. <tr height="10">
<th width="75"><input onclick="SortBy('RA');" type="Button" value="Rating"/><img src="/experiences/images/arrow_down.jpg"/></th>
<th width="230"><input onclick="SortBy('TA');" type="Button" value=" Title"/></th>
<th width="105"><input onclick="SortBy('AA');" type="Button" value=" Author "/></th>
<th width="150"><input onclick="SortBy('SA');" type="Button" value="Substance"/></th>
<th width="85"><input onclick="SortBy('PDD');" type="Button" value="Pub Date"/><img src="/experiences/images/arrow_down.jpg"/></th>
</tr>: 

2. <tr height="8"><th colspan="5"></th></tr>: 

3. <tr class=""><td> <img align="right" alt="Very Highly Recommended" border="0" src="images/exp_star_3.gif"/></td><td><a href="exp.php?ID=89042">Some Growing Up to Do</a></td><td>thingummajig</td><td>MDMA, LSD &amp; Cannabis</td><td align="right">Feb 2 2012</td>
</tr>: 

4. <tr class=""><td> <img align="right" alt="Highly Recommended" border="0" src="images/exp_star_2.gif"/></td><td><a href="exp.php?ID=1

The first row has the headers, the second row is blank, then the review links start from the third row. Note that each review row has:
- an image indicating the rated number of stars
- a link to the write-up with a title
- a username
- a list of substances used
- a date

We'll go through the table, extracting all these fields into a list of dicts, then convert it into a `pandas.DataFrame`.

In [189]:
# Get all rows of this table
rows = index_table.find_all('tr')

In [166]:
# First row is the headers - we will retrieve the header values
headers = rows[0].find_all('th')
headers

[<th width="75"><input onclick="SortBy('RA');" type="Button" value="Rating"/><img src="/experiences/images/arrow_down.jpg"/></th>,
 <th width="230"><input onclick="SortBy('TA');" type="Button" value=" Title"/></th>,
 <th width="105"><input onclick="SortBy('AA');" type="Button" value=" Author "/></th>,
 <th width="150"><input onclick="SortBy('SA');" type="Button" value="Substance"/></th>,
 <th width="85"><input onclick="SortBy('PDD');" type="Button" value="Pub Date"/><img src="/experiences/images/arrow_down.jpg"/></th>]

In [167]:
col_names = [th.input['value'].strip() for th in headers]
col_names

['Rating', 'Title', 'Author', 'Substance', 'Pub Date']

In [190]:
experiences = []

# Reviews start from the 3rd row
for row in rows[2:]:
    entry = {}
    # Get all cells of the row
    for i, td in enumerate(row.find_all('td')):
        # The first column has an img tag of stars rating
        if td.find('img') is not None:
            entry[col_names[i]] = td.img['alt']
        # For all other columns, just want the text
        else:
            entry[col_names[i]] = td.text
        # If link included, add that to dict
        if td.find('a') is not None:
            entry['href'] = td.find('a').get('href')
    experiences.append(entry)

In [191]:
len(experiences)

657

In [192]:
experiences[:3]

[{'Rating': 'Very Highly Recommended',
  'Title': 'Some Growing Up to Do',
  'href': 'exp.php?ID=89042',
  'Author': 'thingummajig',
  'Substance': 'MDMA, LSD & Cannabis',
  'Pub Date': 'Feb 2 2012'},
 {'Rating': 'Highly Recommended',
  'Title': 'My Minidose Manifesto',
  'href': 'exp.php?ID=112505',
  'Author': 'Uncle Iroh',
  'Substance': 'LSD',
  'Pub Date': 'Oct 26 2018'},
 {'Rating': 'Highly Recommended',
  'Title': 'The Colossus',
  'href': 'exp.php?ID=112152',
  'Author': 'nervewing',
  'Substance': 'Memantine, 3-MEO-PCE, LSD, 4-AcO-MiPT, 4-HO-MET, 2C-C, Clonazepam & Aripiprazole',
  'Pub Date': 'Aug 4 2018'}]

Many reviews involve multiple drugs, e.g. the third review above lists memantine, 3-MEO-PCE and several other substances. We'll restrict to the pure LSD reviews.

In [193]:
lsd_only = [ent for ent in experiences if ent['Substance'] == 'LSD']

In [194]:
len(lsd_only)

438

Finally, we can convert this to a `pandas` `DataFrame` and save it to file as a CSV.

In [195]:
import pandas as pd

df = pd.DataFrame(lsd_only)
df = df[['Rating', 'Title', 'href', 'Author', 'Substance', 'Pub Date']]
df.head()

,Rating,Title,href,Author,Substance,Pub Date
0,Highly Recommended,My Minidose Manifesto,exp.php?ID=112505,Uncle Iroh,LSD,Oct 26 2018
1,Highly Recommended,Physics at the Edge of the Universe,exp.php?ID=69866,Spooky,LSD,Apr 19 2016
2,Highly Recommended,Somatic Vision and Cosmic Consciousness,exp.php?ID=77462,Lokapalas,LSD,Nov 28 2013
3,Highly Recommended,LSD Microdosing RCT,exp.php?ID=101638,Gwern.net,LSD,Oct 23 2013
4,Highly Recommended,It Can Be Whatever I Want It to Be,exp.php?ID=88486,triptacular,LSD,Dec 4 2012


In [196]:
df['Pub Date'] = pd.to_datetime(df['Pub Date'])
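Dates such as `Oct 26 2018` parse without help, but spelling out the format makes the conversion explicit and raises an error on unexpected strings. A small sketch on a few of the dates shown above, assuming every entry follows the month-name, day, year pattern:

```python
import pandas as pd

sample_dates = pd.Series(['Oct 26 2018', 'Apr 19 2016', 'Dec 4 2012'])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(sample_dates, format='%b %d %Y')
iso_dates = parsed.dt.strftime('%Y-%m-%d').tolist()
print(iso_dates)
```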

In [197]:
df.head()

,Rating,Title,href,Author,Substance,Pub Date
0,Highly Recommended,My Minidose Manifesto,exp.php?ID=112505,Uncle Iroh,LSD,2018-10-26
1,Highly Recommended,Physics at the Edge of the Universe,exp.php?ID=69866,Spooky,LSD,2016-04-19
2,Highly Recommended,Somatic Vision and Cosmic Consciousness,exp.php?ID=77462,Lokapalas,LSD,2013-11-28
3,Highly Recommended,LSD Microdosing RCT,exp.php?ID=101638,Gwern.net,LSD,2013-10-23
4,Highly Recommended,It Can Be Whatever I Want It to Be,exp.php?ID=88486,triptacular,LSD,2012-12-04


In [198]:
csv_output_path = str(SAVE_PATH / 'lsd_metadata.csv')
df.to_csv(csv_output_path, index=False)

We will now set up a loop to iterate through and scrape the HTML for each experience write-up. We'll wait 1 second between queries to be very nice to their servers.  

Note that some reviews redirect to an external site. Because the next stage depends on a predictably structured HTML tree, we'll check the final URL of each response and discard any that redirected outside Erowid's domain. Hopefully this is not very common.
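One way to make the redirect check a little stricter than a substring test is to compare the host name of the final URL. This is only a sketch: the helper `is_erowid_url` is hypothetical, and treating any `erowid.org` subdomain as internal is an assumption.

```python
from urllib.parse import urlparse


def is_erowid_url(url):
    """Return True if the URL's host is erowid.org or one of its subdomains."""
    host = urlparse(url).hostname or ''
    return host == 'erowid.org' or host.endswith('.erowid.org')


# Sample URLs taken from the scraping log in this notebook
internal = is_erowid_url('https://www.erowid.org/experiences/exp.php?ID=112505')
external = is_erowid_url('http://www.gwern.net/LSD%20microdosing')
print(internal, external)
```

In the scraping loop this would be applied to the `url` attribute of each response, which holds the final URL after any redirects.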

In [228]:
import time

redirects = []

# Make sure the output directory exists before writing to it
(SAVE_PATH / 'experiences').mkdir(parents=True, exist_ok=True)

for i, row in df.iterrows():
    # hrefs in the index are relative to /experiences/ (matches the URLs logged below)
    query_url = BASE_URL + '/experiences/' + row['href']
    # Extract the number from e.g. "exp.php?ID=101638"
    fname = str(SAVE_PATH / 'experiences' / (query_url.split('ID=')[-1] + '.html'))
    experience_res = get_html(query_url)

    # Check for redirect: response.url is the final URL after any redirects
    if not experience_res.url.startswith(BASE_URL):
        redirects.append(experience_res.url)
    else:
        with open(fname, 'w') as fhand:
            fhand.write(experience_res.text)
    time.sleep(1)

Status code: 200 for URL https://www.erowid.org/experiences/exp.php?ID=112505
18,218 characters in 0 seconds
Status code: 200 for URL https://www.erowid.org/experiences/exp.php?ID=69866
18,959 characters in 1 seconds
Status code: 200 for URL https://www.erowid.org/experiences/exp.php?ID=77462
42,875 characters in 0 seconds
Status code: 200 for URL http://www.gwern.net/LSD%20microdosing
330 characters in 0 seconds
Status code: 200 for URL https://www.erowid.org/experiences/exp.php?ID=88486
28,678 characters in 0 seconds
Status code: 200 for URL https://www.erowid.org/experiences/exp.php?ID=76778
21,377 characters in 0 seconds
Status code: 200 for URL https://www.erowid.org/experiences/exp.php?ID=93659
14,451 characters in 0 seconds
Status code: 200 for URL https://www.erowid.org/experiences/exp.php?ID=83544
26,379 characters in 0 seconds
Status code: 200 for URL https://www.erowid.org/experiences/exp.php?ID=73418
15,155 characters in 0 seconds
Status code: 200 for URL https://www.erowid
