remove references to clean_html, now replaced by BeautifulSoup
stevenbird committed Sep 9, 2015
1 parent 03f3e1d commit 1d83322
Showing 2 changed files with 1 addition and 9 deletions.
3 changes: 0 additions & 3 deletions book/ch03.rst
@@ -300,9 +300,6 @@ of a blog, as shown below:
 'was', 'being', 'au', 'courant', ',', 'I', 'mentioned', 'the', 'expression',
 'DUI4XIANG4', '\u5c0d\u8c61', '("', 'boy', '/', 'girl', 'friend', '"', ...]

-..
-    >>> word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))
-
 With some further work, we can write programs to create a small corpus of blog posts,
 and use this as the basis for our |NLP| work.
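
For reference, a minimal sketch of the BeautifulSoup-based replacement for the removed ``nltk.clean_html()`` call (the feed URL and parser choice here are assumptions, following the pattern used earlier in ch03):

    >>> import feedparser
    >>> from bs4 import BeautifulSoup
    >>> from nltk import word_tokenize
    >>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
    >>> html = llog.entries[2].content[0].value
    >>> raw = BeautifulSoup(html, "html.parser").get_text()   # stands in for nltk.clean_html(html)
    >>> tokens = word_tokenize(raw)

``get_text()`` drops the markup and keeps the element text, which is all ``clean_html()`` provided.
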
7 changes: 1 addition & 6 deletions book/ch11.rst
@@ -544,7 +544,7 @@ the original file *using the original word processor*.

 Once we know the data is correctly formatted, we
 can write other programs to convert the data into a different format.
-The program in code-html2csv_ strips out the HTML markup using ``nltk.clean_html()``,
+The program in code-html2csv_ strips out the HTML markup using the ``BeautifulSoup`` library,
 extracts the words and their pronunciations, and generates output
 in "comma-separated value" (CSV) format.

@@ -572,11 +572,6 @@ with gzip.open(fn+".gz","wb") as f_out:
 f_out.write(bytes(s, 'UTF-8'))


-.. note::
-    For more sophisticated processing of |HTML|, use the *Beautiful Soup* package,
-    available from ``http://www.crummy.com/software/BeautifulSoup/``
-
-
 Obtaining Data from Spreadsheets and Databases
 ----------------------------------------------
