# Extract and clean HTML to text
Here we will use `BeautifulSoup` to parse each scraped experience report, extracting the text along with metadata such as the dose, ingestion method and body weight.

We will then use `spaCy` to lemmatise and clean the texts, and save the results to file.

In [1]:
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup, Tag, Comment, NavigableString

In [None]:
SAVE_PATH = Path('./artefacts/experiences_html')

In [27]:
trip1 = BeautifulSoup(trip1_html, 'lxml')
print(trip1.prettify()[:500])

<html>
 <head>
  <title>
   LSD - Erowid Exp - 'My Minidose Manifesto'
  </title>
  <meta content="An Experience with LSD. 'My Minidose Manifesto' by Uncle Iroh" name="description"/>
  <meta content="Experience Report Vaults drug trip reports stories descriptions" name="keywords"/>
  <link href="/includes/general_default.css" rel="stylesheet" type="text/css"/>
  <link href="includes/exp_view.css" id="main_css" rel="stylesheet" type="text/css"/>
  <link href="includes/exp_view_light_on_dark.css" 


In [28]:
divs = trip1.find_all('div')
len(divs)

15

In [72]:
# The page's HTML is awkward: several <div>s with classes hold the key info,
# and one further <div> holds several tables plus the main text

trip_info = {}
for div in divs:
    if 'class' in div.attrs:
        # Returns class as list, to allow multi-classes
        if 'title' in div['class']:
            trip_info['title'] = div.text
        elif 'substance' in div['class']:
            trip_info['substance'] = div.text
        elif 'author' in div['class']:
            trip_info['author'] = div.a.text
        # Main body of report stored here
        elif 'report-text-surround' in div['class']:
            report = div

trip_info.keys()

dict_keys(['title', 'substance', 'author'])
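
The same lookup can also be written with CSS selectors rather than a loop over every `<div>`. The following is a minimal sketch, not the method used in this notebook: `sample_html` is a hypothetical, heavily trimmed stand-in for a report page, and `first_text` is a helper name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for a report page (not the real markup)
sample_html = """
<html><body>
<div class="title">My Minidose Manifesto</div>
<div class="substance">LSD</div>
<div class="author">by <a href="#">Uncle Iroh</a></div>
<div class="report-text-surround">Report body goes here</div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def first_text(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


info = {
    "title": first_text(soup, "div.title"),
    "substance": first_text(soup, "div.substance"),
    "author": first_text(soup, "div.author a"),
}
report_div = soup.select_one("div.report-text-surround")
print(info)
```

Returning `None` for a missing element makes it easier to spot pages that do not follow the expected layout when the extraction is run over many reports.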

In [73]:
assert lsd_only[0]['Title'] == trip_info['title']
assert lsd_only[0]['Author'] == trip_info['author']
assert lsd_only[0]['Substance'] == trip_info['substance']

Good, looks like we have the correct report.

In [74]:
tables = report.find_all('table')
len(tables)

4

In [75]:
# First table doesn't seem to have anything
tables[0]

<table align="right" border="0" cellpadding="0" cellspacing="0">
<tr><td></td><td width="15"> </td></tr>
</table>

In [76]:
# Second table has dose and other info
dose_table = tables[1]
assert dose_table['class'][0] == 'dosechart'
dose_table

<table border="2" bordercolor="#224422" cellpadding="4" cellspacing="0" class="dosechart">
<tr>
<td align="right" width="90">DOSE:<br/></td>
<td align="center" class="dosechart-amount" width="90">10-15 ug</td>
<td align="center" class="dosechart-method">oral</td>
<td class="dosechart-substance"><a href="/chemicals/lsd/">LSD</a></td>
<td class="dosechart-form"><b>(blotter / tab)</b></td>
</tr>
</table>

In [77]:
trip_info['dose_info'] = []

# The first cell is "DOSE:" so skip that
for td in dose_table.find_all('td')[1:]:
    trip_info['dose_info'].append(td.text)

In [78]:
trip_info['dose_info']

['10-15 ug', 'oral', 'LSD', '(blotter / tab)']
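
A positional list loses track of which value is which. As a sketch of an alternative, the `dosechart-*` class on each cell can serve as the dictionary key. The markup below is copied from the table shown above; `parse_dose_row` is a hypothetical helper, and handling one dictionary per `<tr>` assumes that some reports may list more than one dose row, which has not been checked here.

```python
from bs4 import BeautifulSoup

dose_html = """
<table class="dosechart">
<tr>
<td align="right" width="90">DOSE:<br/></td>
<td align="center" class="dosechart-amount" width="90">10-15 ug</td>
<td align="center" class="dosechart-method">oral</td>
<td class="dosechart-substance"><a href="/chemicals/lsd/">LSD</a></td>
<td class="dosechart-form"><b>(blotter / tab)</b></td>
</tr>
</table>
"""

PREFIX = "dosechart-"


def parse_dose_row(tr):
    """Map each cell's 'dosechart-<field>' class to its text."""
    row = {}
    for td in tr.find_all("td"):
        # Cells without a class (e.g. the "DOSE:" label) are skipped
        for cls in td.get("class", []):
            if cls.startswith(PREFIX):
                row[cls[len(PREFIX):]] = td.get_text(strip=True)
    return row


table = BeautifulSoup(dose_html, "html.parser").find("table", class_="dosechart")
doses = [parse_dose_row(tr) for tr in table.find_all("tr")]
print(doses)
```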

In [79]:
tables[2]

<table border="2" bordercolor="#444455" cellpadding="4" cellspacing="0" class="bodyweight">
<tr>
<td class="bodyweight-title" width="110">BODY WEIGHT:</td>
<td class="bodyweight-amount" width="80">180 lb</td>
</tr>
</table>

In [80]:
for td in tables[2].find_all('td'):
    if 'bodyweight-amount' in td['class']:
        trip_info['body_weight'] = td.text

In [81]:
trip_info

{'title': 'My Minidose Manifesto',
 'substance': 'LSD',
 'author': 'Uncle Iroh',
 'dose_info': ['10-15 ug', 'oral', 'LSD', '(blotter / tab)'],
 'body_weight': '180 lb'}
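
Strings such as `'10-15 ug'` and `'180 lb'` are awkward to analyse later. A minimal sketch of splitting them into numbers and a unit follows; `parse_quantity` and its regular expression are assumptions made for illustration and only cover the "number", "low-high" and trailing-unit shapes seen in this one report.

```python
import re

QUANTITY_RE = re.compile(
    r"^\s*(?P<low>\d+(?:\.\d+)?)"           # first number
    r"(?:\s*-\s*(?P<high>\d+(?:\.\d+)?))?"  # optional "-high" for a range
    r"\s*(?P<unit>.*?)\s*$"                 # whatever follows is the unit
)


def parse_quantity(text):
    """Return (low, high, unit), or None if the text does not start with a number."""
    match = QUANTITY_RE.match(text)
    if match is None:
        return None
    low = float(match.group("low"))
    high = float(match.group("high")) if match.group("high") else low
    return low, high, match.group("unit")


print(parse_quantity("10-15 ug"))
print(parse_quantity("180 lb"))
```

A single value is returned with `low == high`, so downstream code can treat every quantity as a range.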

In [82]:
tables[3]

<table border="0" cellpadding="5" cellspacing="0" class="footdata">
<tr><td width="700">Exp Year: 2018</td><td width="90">ExpID: 112505</td></tr>
<tr><td>Gender: Male</td><td> </td></tr>
<tr><td>Age at time of experience: 24</td><td> </td></tr>
<tr><td>Published: Oct 26, 2018</td><td>Views: 1,333</td></tr>
<tr><td align="center" colspan="2">[ <a href="exp.php?ID=112505&amp;format=pdf" type="text/pdf">View as PDF (for printing)</a> ] [ <a href="exp_pdf.php?ID=112505&amp;format=latex">View as LaTeX (for geeks)</a> ]
[ <a href="#" onclick="expChangeColors(); return false;">Switch Colors</a> ]
</td></tr>
<tr><td colspan="2">LSD (2) : Retrospective / Summary (11), Glowing Experiences (4), Performance Enhancement (50), General (1), Alone (16)</td></tr>
<!--  <img src="/images/new.gif" alt="May"> -->
</table>

In [108]:
for td in tables[3].find_all('td'):
    if 'gender' in td.text.lower():
        # Capture e.g. Gender: Male
        trip_info['gender'] = td.text.split(':')[1].strip()
    elif 'age' in td.text.lower():
        trip_info['age'] = int(td.text.split(':')[1].strip())
    elif 'published' in td.text.lower():
        trip_info['date'] = td.text.split(':')[1].strip()
    elif 'views' in td.text.lower():
        trip_info['views'] = int(td.text.split(':')[1].replace(',', '').strip())

In [109]:
trip_info

{'title': 'My Minidose Manifesto',
 'substance': 'LSD',
 'author': 'Uncle Iroh',
 'dose_info': ['10-15 ug', 'oral', 'LSD', '(blotter / tab)'],
 'body_weight': '180 lb',
 'gender': 'Male',
 'age': 24,
 'date': 'Oct 26, 2018',
 'views': 1333}
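
Substring checks such as `'age' in td.text.lower()` can match unrelated cells on other pages. One possible tightening, sketched below, splits each cell on its first colon and looks the label up exactly. The `cells` list mirrors the footer text shown above; `PARSERS` and `parse_footer` are hypothetical names, and the date format is an assumption based on this single report.

```python
from datetime import datetime

# Cell texts as they appear in the footer table above
cells = [
    "Exp Year: 2018", "ExpID: 112505",
    "Gender: Male", " ",
    "Age at time of experience: 24", " ",
    "Published: Oct 26, 2018", "Views: 1,333",
]

# label (lower-cased) -> (output key, converter)
PARSERS = {
    "gender": ("gender", str),
    "age at time of experience": ("age", int),
    # %b relies on English month abbreviations in the current locale
    "published": ("date", lambda s: datetime.strptime(s, "%b %d, %Y").date()),
    "views": ("views", lambda s: int(s.replace(",", ""))),
}


def parse_footer(cells):
    """Split each cell on its first colon and convert the known fields."""
    info = {}
    for cell in cells:
        label, sep, value = cell.partition(":")
        if not sep:
            continue  # no "label: value" structure in this cell
        key = label.strip().lower()
        if key in PARSERS:
            name, convert = PARSERS[key]
            info[name] = convert(value.strip())
    return info


footer = parse_footer(cells)
print(footer)
```

Converters raise if a value has an unexpected shape (for example a non-numeric age), which may be preferable to silently storing bad data.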

In [173]:
for i, elem in enumerate(report):
    if isinstance(elem, Comment):
        if 'start body' in elem.string.lower():
            start_idx = i
        elif 'end body' in elem.string.lower():
            end_idx = i
            break
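
The loop above leaves `start_idx` or `end_idx` undefined if a page lacks either comment. A sketch of the same idea wrapped in a function that fails loudly is shown below; the `sample` markup and the name `body_nodes` are hypothetical, and the exact comment wording is assumed from the `'start body'` and `'end body'` checks above.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Hypothetical, trimmed stand-in for the report <div>
sample = (
    '<div class="report-text-surround">'
    "<table><tr><td>DOSE:</td></tr></table>"
    "<!-- Start Body -->First paragraph.<br/><br/>Second <i>paragraph</i>."
    "<!-- End Body -->"
    "<table><tr><td>Exp Year: 2018</td></tr></table>"
    "</div>"
)


def body_nodes(container):
    """Return the child nodes between the start and end body comments."""
    start = end = None
    for i, node in enumerate(container.contents):
        if isinstance(node, Comment):
            label = node.strip().lower()
            if "start body" in label:
                start = i
            elif "end body" in label:
                end = i
                break
    if start is None or end is None:
        raise ValueError("body markers not found")
    return container.contents[start + 1:end]


div = BeautifulSoup(sample, "html.parser").find("div", class_="report-text-surround")
nodes = body_nodes(div)
body = "".join(n.get_text() if isinstance(n, Tag) else str(n) for n in nodes)
print(body)
```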

In [192]:
report_text = report.contents[start_idx + 1:end_idx]
report_text[:5]

['\nMy Minidose Manifesto\r',
 <br/>,
 '\n',
 <br/>,
 '\nI would like to preface this report with a note on the terminology of ingesting sub-perceptual doses of LSD. Technically speaking, a psychedelic microdose is a sub-threshold dose of the substance. This would lead one to believe that the effects of said amount would be unperceivable. There seems to be a contradiction here that I wish to resolve. Call me a drug nerd or a word nerd, but if a microdose is defined as sub-perceptual, then perceiving anything from a dose you took disqualifies it as a true microdose. Since my experiences with small amounts of LSD have somehow fallen between the sub-perceptual and threshold realms, I propose the term \x91minidose.\x92 It\x92s lower than a \x91museum dose,\x92 (One where effects are apparent beyond threshold levels to the user, but still appropriate for a public experience) but higher than a true microdose. Here\x92s a more appropriate word for those of us that felt something that wasn\x92

In [272]:
for i in range(10):
    print(report_text[i], report_text[i].name == 'i')


 False
<br/> False

 False
<br/> False

I would like to preface this report with a note on the terminology of ingesting sub-perceptual doses of LSD. Technically speaking, a psychedelic microdose is a sub-threshold dose of the substance. This would lead one to believe that the effects of said amount would be unperceivable. There seems to be a contradiction here that I wish to resolve. Call me a drug nerd or a word nerd, but if a microdose is defined as sub-perceptual, then perceiving anything from a dose you took disqualifies it as a true microdose. Since my experiences with small amounts of LSD have somehow fallen between the sub-perceptual and threshold realms, I propose the term minidose. Its lower than a museum dose, (One where effects are apparent beyond threshold levels to the user, but still appropriate for a public experience) but higher than a true microdose. Heres a more appropriate word for those of us that felt something that wasnt  False
<i>nothing</i> True
, but no

In [276]:
# The page is cp1252 but was decoded as latin1, so re-encode to latin1 and decode as cp1252
# See https://stackoverflow.com/questions/45292526/how-do-i-convert-unicode-string-with-cp1252-characters-into-utf-8-with-python
texts = []
for elem in report_text:
    if isinstance(elem, NavigableString):
        texts.append(elem.encode('latin1').decode('cp1252'))
    elif isinstance(elem, Tag) and elem.name == 'i':
        # get_text() also works if the <i> contains nested tags, where .string is None
        texts.append(elem.get_text().encode('latin1').decode('cp1252'))

<i>nothing</i>
<i>something</i>


In [283]:
text = "".join(texts)

In [284]:
print(text)


My Minidose Manifesto

I would like to preface this report with a note on the terminology of ingesting sub-perceptual doses of LSD. Technically speaking, a psychedelic microdose is a sub-threshold dose of the substance. This would lead one to believe that the effects of said amount would be unperceivable. There seems to be a contradiction here that I wish to resolve. Call me a drug nerd or a word nerd, but if a microdose is defined as sub-perceptual, then perceiving anything from a dose you took disqualifies it as a true microdose. Since my experiences with small amounts of LSD have somehow fallen between the sub-perceptual and threshold realms, I propose the term ‘minidose.’ It’s lower than a ‘museum dose,’ (One where effects are apparent beyond threshold levels to the user, but still appropriate for a public experience) but higher than a true microdose. Here’s a more appropriate word for those of us that felt something that wasn’t nothing, but nothing about it was really something, 
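
The encoding round trip can be seen in isolation on a short string. The sketch below wraps it in a hypothetical helper, `fix_mojibake`, that returns the input unchanged when the round trip is not possible, for instance when the text already contains proper Unicode quotes or a byte that cp1252 leaves undefined.

```python
def fix_mojibake(text):
    """Re-interpret text that was decoded as Latin-1 but was really cp1252."""
    try:
        return text.encode("latin1").decode("cp1252")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already valid Unicode, or contains a byte undefined in cp1252
        return text


broken = "I propose the term \x91minidose.\x92 It\x92s lower"
fixed = fix_mojibake(broken)
print(fixed)
```

If the raw bytes of the page are still available, passing them to `BeautifulSoup` with `from_encoding='cp1252'` may avoid the problem at the source, although that has not been tried in this notebook.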

In [314]:
trip_info['text'] = text

In [321]:
with open('./cleaned_texts/112505.txt', 'w', encoding='utf-8') as fhand:
    fhand.write(text)

In [317]:
trip_df = pd.DataFrame(trip_info)
trip_df.head()

Unnamed: 0,title,substance,author,dose_info,body_weight,gender,age,date,views,text
0,My Minidose Manifesto,LSD,Uncle Iroh,10-15 ug,180 lb,Male,24,"Oct 26, 2018",1333,\nMy Minidose Manifesto\r\n\nI would like to p...
1,My Minidose Manifesto,LSD,Uncle Iroh,oral,180 lb,Male,24,"Oct 26, 2018",1333,\nMy Minidose Manifesto\r\n\nI would like to p...
2,My Minidose Manifesto,LSD,Uncle Iroh,LSD,180 lb,Male,24,"Oct 26, 2018",1333,\nMy Minidose Manifesto\r\n\nI would like to p...
3,My Minidose Manifesto,LSD,Uncle Iroh,(blotter / tab),180 lb,Male,24,"Oct 26, 2018",1333,\nMy Minidose Manifesto\r\n\nI would like to p...
