## Robust HTML table parser for tables with rowspans and/or colspans
After scouring the web, I could not find, to the embarrassment of humanity, a parser that accurately parses tables like those found on Wikipedia, such as [Wisconsin_political_power](https://en.wikipedia.org/wiki/Political_party_strength_in_Wisconsin). I thus wrote my own.

In [1]:
## Libraries used:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import requests

### Parser functions

In [2]:
def pre_process_table(table):
    """
    INPUT:
        1. table - a bs4 element that contains the desired table: ie <table> ... </table>
    OUTPUT:
        a tuple of: 
            1. rows - a list of table rows ie: list of <tr>...</tr> elements
            2. num_rows - number of rows in the table
            3. num_cols - number of columns in the table
    """
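    rows = list(table.find_all('tr'))

    num_rows = len(rows)
    # Count <td> as well as <th> cells, and weight each cell by its colspan,
    # so that tables without <th> cells or with merged headers get the right width
    num_cols = max(sum(get_spans(c)[1] for c in x.find_all(['td', 'th'])) for x in rows)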
    
    return (rows, num_rows, num_cols)


def get_spans(cell):
    """
    INPUT:
        1. cell - a <td>...</td> or <th>...</th> element that contains a table cell entry
    OUTPUT:
        1. a tuple with the cell's row and col spans
    """
    # A missing rowspan/colspan attribute means the cell covers a single row/column
    rep_row = int(cell.get('rowspan', 1))
    rep_col = int(cell.get('colspan', 1))

    return (rep_row, rep_col)
 
def process_rows(rows, num_rows, num_cols):
    """
    INPUT:
        1. rows - a list of table rows ie <tr>...</tr> elements
        2. num_rows - number of rows in the table
        3. num_cols - number of columns in the table
    OUTPUT:
        1. data - a Pandas dataframe with the html data in it
    """
    # Start from an all-NaN frame; object dtype lets it hold the cell text
    data = pd.DataFrame(np.full((num_rows, num_cols), np.nan), dtype=object)
    for i, row in enumerate(rows):
        # Leftmost candidate column for the next cell of this row
        col_stat = 0

        for cell in row.find_all(['td', 'th']):
            rep_row, rep_col = get_spans(cell)

            # Skip columns already filled by a rowspan from an earlier row.
            # The bound check prevents an endless search when the row is full.
            while col_stat < num_cols and data.iloc[i, col_stat:col_stat+rep_col].notnull().any():
                col_stat += 1

            # Repeat the cell text over every row and column it spans
            data.iloc[i:i+rep_row, col_stat:col_stat+rep_col] = cell.getText()
            col_stat += rep_col

    return data
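
To see the span-filling idea in isolation, here is a minimal sketch that needs neither pandas nor BeautifulSoup. It is an illustration under assumptions, not the parser above: `SpanCollector`, `expand_spans`, and the `sample` table are hypothetical names and data made up for this example, it relies on the standard library's `html.parser`, and it ignores nested tables.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect (text, rowspan, colspan) for every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cells per <tr>
        self._cell = None  # [text_parts, rowspan, colspan] while inside a cell

    def _flush(self):
        # Store the cell being read, if any (also covers omitted </td> tags)
        if self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush()
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._flush()
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush()


def expand_spans(html):
    """Return a rectangular list of lists with every spanned cell repeated."""
    parser = SpanCollector()
    parser.feed(html)
    n_rows = len(parser.rows)
    grid = {}  # (row, col) -> cell text
    for i, cells in enumerate(parser.rows):
        col = 0
        for text, rowspan, colspan in cells:
            # Skip slots already claimed by a rowspan from an earlier row
            while (i, col) in grid:
                col += 1
            # Copy the text into every slot the cell covers (clamped to the table)
            for r in range(i, min(i + rowspan, n_rows)):
                for c in range(col, col + colspan):
                    grid[(r, c)] = text
            col += colspan
    n_cols = max((c for _, c in grid), default=-1) + 1
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# Made-up table: "Year" spans two rows, "Legislature" spans two columns
sample = """
<table>
  <tr><th rowspan="2">Year</th><th colspan="2">Legislature</th></tr>
  <tr><th>Senate</th><th>Assembly</th></tr>
  <tr><td>2000</td><td rowspan="2">17D, 16R</td><td>55R, 44D</td></tr>
  <tr><td>2001</td><td>56R, 43D</td></tr>
</table>
"""
grid = expand_spans(sample)
for row in grid:
    print(row)
```

Each spanned cell is copied into every slot it covers, so the last row reads `['2001', '17D, 16R', '56R, 43D']` even though its HTML row contains only two cells. Using a dictionary keyed by `(row, col)` avoids having to know the table width up front.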



### Example:
Here, I will parse the HTML table linked in the description. 

In [3]:
import requests

def fetch_website(url):
    """
    Download a page and return its raw content, or None if the request fails.
    To hide that the scraping is being done via Python, I change the user-agent to that
    of a Chrome browser so that the website believes a regular browser is accessing it.
    """
    user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.18 Safari/537.36'}
    try:
        # The request itself must sit inside the try block for the except to catch anything
        r = requests.get(url, headers=user_agent)
        return r.content
    except requests.exceptions.RequestException:
        print("Skipping this url")
        return None

In [4]:
## Go to the link and download the page's html code:
url = "https://en.wikipedia.org/wiki/Political_party_strength_in_Wisconsin"
site = fetch_website(url)
soup = bs(site, 'lxml')

In [5]:
## Find tables on the page and locate the desired one:
## Caveat: note that I explicitly search for a wikitable!
tables = soup.find_all("table", class_='wikitable')

## I want the table at index 3, the one that contains years 2000-2018
table = tables[3]

In [6]:
## run the above functions to extract the data
rows, num_rows, num_cols = pre_process_table(table)
df = process_rows(rows, num_rows, num_cols)

## print the result
df

,0,1,2,3,4,5,6,7,8,9,10,11
0,Year,Executive offices,Executive offices,Executive offices,Executive offices,Executive offices,State Legislature,State Legislature,United States Congress,United States Congress,United States Congress,Electoral College votes
1,Year,Governor,Lieutenant Governor,Secretary of State,Attorney General,Treasurer,State Senate,State Assembly,U.S. Senator (Class I),U.S. Senator (Class III),U.S. House,Electoral College votes
2,2000,Tommy Thompson (R),Scott McCallum (R),Doug La Follette (D),Jim Doyle (D),Jack Voight (R),"17D, 16R","55R, 44D",Herb Kohl (D),Russ Feingold (D),"5D, 4R",Gore/Lieberman (D) N
3,2001,Scott McCallum (R),Margaret Farrow (R),Doug La Follette (D),Jim Doyle (D),Jack Voight (R),"18D, 15R","56R, 43D",Herb Kohl (D),Russ Feingold (D),"5D, 4R",Gore/Lieberman (D) N
4,2002,Scott McCallum (R),Margaret Farrow (R),Doug La Follette (D),Jim Doyle (D),Jack Voight (R),"18D, 15R","56R, 43D",Herb Kohl (D),Russ Feingold (D),"5D, 4R",Gore/Lieberman (D) N
5,2003,Jim Doyle (D),Barbara Lawton (D),Doug La Follette (D),Peggy Lautenschlager (D),Jack Voight (R),"18R, 15D","58R, 41D",Herb Kohl (D),Russ Feingold (D),"4R, 4D",Gore/Lieberman (D) N
6,2004,Jim Doyle (D),Barbara Lawton (D),Doug La Follette (D),Peggy Lautenschlager (D),Jack Voight (R),"18R, 15D","58R, 41D",Herb Kohl (D),Russ Feingold (D),"4R, 4D",Kerry/Edwards (D) N
7,2005,Jim Doyle (D),Barbara Lawton (D),Doug La Follette (D),Peggy Lautenschlager (D),Jack Voight (R),"19R, 14D","60R, 39D",Herb Kohl (D),Russ Feingold (D),"4R, 4D",Kerry/Edwards (D) N
8,2006,Jim Doyle (D),Barbara Lawton (D),Doug La Follette (D),Peggy Lautenschlager (D),Jack Voight (R),"19R, 14D","60R, 39D",Herb Kohl (D),Russ Feingold (D),"4R, 4D",Kerry/Edwards (D) N
9,2007,Jim Doyle (D),Barbara Lawton (D),Doug La Follette (D),J. B. Van Hollen (R),Dawn Marie Sass (D),"18D, 15R","52R, 47D",Herb Kohl (D),Russ Feingold (D),"5D, 3R",Kerry/Edwards (D) N
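
Rows 0 and 1 of this output are the table's two header rows, with merged header cells repeated. One possible follow-up is to collapse them into a single label per column. The sketch below is an assumption about how one might do that, not part of the parser: `merge_header_rows` and its `sep` parameter are hypothetical, and it works on plain lists so that it runs without pandas.

```python
def merge_header_rows(top, bottom, sep=" / "):
    """Combine two header rows into one label per column.

    A header cell that spans both rows shows up with the same text in
    `top` and `bottom`; in that case the text is kept only once.
    """
    labels = []
    for upper, lower in zip(top, bottom):
        labels.append(upper if upper == lower else f"{upper}{sep}{lower}")
    return labels


# First three columns of the two header rows shown in the output
top = ["Year", "Executive offices", "Executive offices"]
bottom = ["Year", "Governor", "Lieutenant Governor"]

labels = merge_header_rows(top, bottom)
print(labels)
```

With the DataFrame above, the two inputs would be `list(df.iloc[0])` and `list(df.iloc[1])`; the resulting labels can be assigned to `df.columns` before dropping the first two rows with `df.iloc[2:]`.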
