HTML Table Extractor

Note: This is a re-release of yuanxu-li's html-table-extractor. It exists only because I waited too long for an upstream release that fixes the incorrect dependency (with the original version 1.4.0, pipenv refuses to install a newer version of BeautifulSoup). Changes are kept to a minimum: this notice, a setup.py fix to make the package PyPI friendly, and a new PyPI package name.

HTML Table Extractor is a Python library that uses Beautiful Soup to extract data from complicated and messy HTML tables.

Important links

Repository: https://github.com/yuanxu-li/html-table-extractor
Issues: https://github.com/yuanxu-li/html-table-extractor/issues

Installation

pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor

Usage

Example 1 - Simple

1	2
3	4

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

return_list() returns the following (the u prefix appears under Python 2 only):

[[u'1', u'2'], [u'3', u'4']]

Example 2 - Transformer

1	2
3	4

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()

return_list() returns:

[[1, 2], [3, 4]]
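
A transformer such as int raises ValueError as soon as one cell holds non-numeric text. The sketch below shows a more forgiving converter. It assumes that transformer accepts any one-argument callable that receives a cell's text, which the example above suggests but this README does not state. to_number is a hypothetical helper name, not part of the library.

```python
def to_number(text):
    """Convert cell text to int or float where possible; otherwise return the stripped text."""
    # Drop surrounding whitespace and thousands separators such as "1,200".
    cleaned = text.strip().replace(",", "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    # Not numeric: keep the text, minus surrounding whitespace.
    return text.strip()


# Sample cell strings standing in for text taken from <td> elements.
cells = [" 42 ", "3.5", "1,200", "n/a"]
converted = [to_number(cell) for cell in cells]
print(converted)
```

If the assumption holds, it could be passed as Extractor(table_doc, transformer=to_number).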

Example 3 - Pass BS4 Tag

1	2
3	4

from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()

return_list() returns:

[[u'1', u'2'], [u'3', u'4']]

Example 4 - Complex

1	2	3
	4
5

from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

return_list() returns:

[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]
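
The output above shows each spanned cell's text repeated in every position it covers. The following is a minimal standard-library sketch of that expansion, not the library's actual implementation. Cells are written as hypothetical (text, rowspan, colspan) tuples instead of being parsed from HTML, and expand_spans is an illustrative name.

```python
def expand_spans(rows):
    """Copy each (text, rowspan, colspan) cell into every grid position it covers."""
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip columns already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for rr in range(r, r + rowspan):
                for cc in range(c, c + colspan):
                    # Keep an existing value if two spans overlap.
                    grid.setdefault((rr, cc), text)
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions that no cell covers are left as None.
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# The table from Example 4, one list of cells per <tr>.
example_4 = [
    [("1", 2, 1), ("2", 1, 1), ("3", 1, 1)],
    [("4", 1, 2)],
    [("5", 1, 3)],
]
result = expand_spans(example_4)
print(result)  # [['1', '2', '3'], ['1', '4', '4'], ['5', '5', '5']]
```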

Example 5 - Conflicted

1	2	3
	4
5

from html_table_extractor.extractor import Extractor
table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

return_list() returns:

[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
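
In this table the cell 3 (rowspan=3) and the cell 4 (colspan=2) both cover row 1, column 2 (counting from 0). The output above keeps 3 there, which suggests that the cell appearing first in the markup wins. The sketch below is a hypothetical diagnostic, not part of the library, that lists such overlapping positions. Cells are given as (text, rowspan, colspan) tuples, and find_span_conflicts is an illustrative name.

```python
from collections import defaultdict


def find_span_conflicts(rows):
    """Return {(row, col): [texts]} for grid positions covered by more than one cell."""
    claims = defaultdict(list)
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # A cell starts at the first column of its row that nothing covers yet.
            while (r, c) in claims:
                c += 1
            # Record the cell against every position in its rectangle.
            for rr in range(r, r + rowspan):
                for cc in range(c, c + colspan):
                    claims[(rr, cc)].append(text)
            c += colspan
    return {pos: texts for pos, texts in claims.items() if len(texts) > 1}


# The table from Example 5, one list of cells per <tr>.
example_5 = [
    [("1", 2, 1), ("2", 1, 1), ("3", 3, 1)],
    [("4", 1, 2)],
    [("5", 1, 2)],
]
conflicts = find_span_conflicts(example_5)
print(conflicts)  # {(1, 2): ['3', '4']}
```

Each list is ordered by appearance in the markup, so its first entry is the value that the example output shows at that position.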

Example 6 - Write to file

1	2
3	4

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')

This creates a new CSV file called output.csv in the given directory, containing:

1,2
3,4
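
The example above only shows write_to_csv with a path argument. If a different file name or delimiter is needed, one option is to write the list from return_list() with Python's csv module directly. In this sketch write_rows is a hypothetical helper, and rows stands in for the value returned by extractor.return_list().

```python
import csv
import os
import tempfile


def write_rows(rows, path, filename="table.csv", delimiter=","):
    """Write a list of rows to path/filename and return the full file path."""
    target = os.path.join(path, filename)
    # newline="" lets the csv module control line endings itself.
    with open(target, "w", newline="") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(rows)
    return target


rows = [["1", "2"], ["3", "4"]]  # stand-in for extractor.return_list()
out_dir = tempfile.mkdtemp()
target = write_rows(rows, out_dir, filename="example.csv", delimiter=";")

# Read the file back to confirm the round trip.
with open(target, newline="") as handle:
    read_back = list(csv.reader(handle, delimiter=";"))
print(read_back)
```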

Team

@yuanxu-li

Errors/ Bugs

If something is not working correctly, or if you have any suggestions for improvement, report it on the issue tracker listed under Important links.

Copyright

Third-party copyright in this distribution is noted where applicable.
