## Scraping S&P 500 Components from Wikipedia 

### What we're going to do: 
1. Scrape Wikipedia for lists of: 
    - Current companies in the S&P 500 
    - Historical changes to the S&P 500 
2. Write a function to generate the list of companies in the index as of a given date 
<!-- TEASER_END --> 

### Background 
Getting the current list of companies in the S&P 500 is pretty easy so we're gonna tackle that first. Reconstructing the index historically isn't so easy. Since the index is regularly rebalanced, we need a list of all the companies added and removed from the index and the date the change occurred. 

Wikipedia is nice enough to make this data available, but as we'll see shortly, the format of the table of company changes is a little tricky and requires some web scraping gymnastics to get it into a useable format for analysis. Let's get to it. 

**Attribution:** Two of the big problems I ran into were solved by a fellow named Andy Roche, and he was nice enough to write a blog post with his approach and code. [Here's his post](https://roche.io/2016/05/scrape-wikipedia-with-python), so be sure to check that out for a more thorough approach to Wikipedia tables.

### Preliminaries 
First up, import libraries, get the site HTML with requests.get(), then extract the tables for further cleaning. 

In [1]:
import requests
from bs4 import BeautifulSoup
import datetime 
import re 

# wikipedia page with our target tables and the initial web request 
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
req = requests.get(WIKI_URL)
req.raise_for_status()

# here we search for all the tables on the web page and get them into a 
# beautiful soup result set  
soup = BeautifulSoup(req.content, 'lxml')
table_classes = {"class": ["sortable", "plainrowheaders"]}
wikitables = soup.find_all("table", table_classes)
type(wikitables)

bs4.element.ResultSet

### Parsing the table of current companies 
We're interested in the first two tables on the web page. The first table is a pretty clean HTML table that lists all the companies currently in the S&P 500. We're going to traverse the table, clean things up a bit, then store the results in a list for use later. Note the regular expression to strip out the Wikipedia footnotes. 
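
To see what that regular expression does on its own, here's a minimal sketch. The helper name `strip_footnotes` and the sample strings are made up for illustration; only the bracket-stripping pattern mirrors the one used in the loop.

```python
import re

def strip_footnotes(text):
    """Remove bracketed footnote markers such as '[3]' from a cell."""
    # non-greedy match so each [...] group is removed separately
    return re.sub(r"\[.*?\]", "", text)

# made-up sample cells
print(strip_footnotes("3M Company[3]"))       # 3M Company
print(strip_footnotes("1957-03-04[a][12]"))   # 1957-03-04
```

Writing the pattern as a raw string (`r"..."`) keeps Python from treating `\[` as a string escape.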

In [2]:
rows = wikitables[0].find_all("tr")

#parse data from table by extracting each table row ("tr" tags) 
current_companies_list = []   
for row_num, tr in enumerate(rows):
    if row_num == 0: 
        row_cells = [th.getText().strip() for th in tr.find_all('th') 
                        if th.getText().strip() != '']  
    else: 
        row_cells = (([tr.find('th').getText()] if tr.find('th') else []) 
                        + [td.getText().strip() for td in tr.find_all('td')])
    if len(row_cells) > 1: 
        # strip out brackets from reference links 
        for i, element in enumerate(row_cells): 
            if '[' in element: 
                row_cells[i] = re.sub(r"\[.*?\]", "", element)
        current_companies_list += [row_cells]
        
current_companies_list[:3]

[['Symbol',
  'Security',
  'SEC filings',
  'GICS Sector',
  'GICS Sub Industry',
  'Headquarters Location',
  'Date first added',
  'CIK',
  'Founded'],
 ['MMM',
  '3M Company',
  'reports',
  'Industrials',
  'Industrial Conglomerates',
  'St. Paul, Minnesota',
  '',
  '0000066740',
  '1902'],
 ['ABT',
  'Abbott Laboratories',
  'reports',
  'Health Care',
  'Health Care Equipment',
  'North Chicago, Illinois',
  '1964-03-31',
  '0000001800',
  '1888']]

Now we have a clean list where the first element is a list of headers and each element after it relates to a single company. We're mostly interested in the ticker symbol for each company, but it doesn't hurt to keep the additional reference info for now. Not too shabby! 
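
If all you want is the tickers, one way to pull them out of a list shaped like this is sketched below. The `sample_companies` list is a trimmed stand-in for `current_companies_list`, and the names `tickers` and `records` are just for illustration.

```python
# sample rows shaped like current_companies_list (header first, trimmed to 3 columns)
sample_companies = [
    ["Symbol", "Security", "GICS Sector"],
    ["MMM", "3M Company", "Industrials"],
    ["ABT", "Abbott Laboratories", "Health Care"],
]

header, *company_rows = sample_companies
symbol_col = header.index("Symbol")   # look the column up by name, not position
tickers = [row[symbol_col] for row in company_rows]

# optional: one dict per company, keyed by header name
records = [dict(zip(header, row)) for row in company_rows]

print(tickers)                  # ['MMM', 'ABT']
print(records[0]["Security"])   # 3M Company
```

Looking the column up by header name means the code keeps working if Wikipedia reorders the columns, as long as the "Symbol" header itself doesn't change.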

**Next comes the hard part:** the history of changes to the index components. The second table has the data we need but it also has lots of rows where a single data element spans multiple rows. This isn't good for data analysis so here's what we have to do: 
- The first column is the date a change occurred so we'll write a helper function to check if it's a date
    - If we find a date, we'll hold it in a temporary variable and repeat it for each row it spans in the original HTML table  
- Next we'll clean the data so we wind up with one list element per change, in the following format: 
    - [Date, Added ticker, Added name, Removed ticker, Removed name, Reason]
- We also need to explicitly keep blank cells (sometimes companies are added and none are removed or vice versa) 
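
Here's the carry-forward idea on a toy example, using plain lists of cell text instead of HTML. This is only a sketch: `is_date`, `fill_spanned_cells`, and the "AAA" / "Example Co" row are made up for illustration, and it assumes a spanned row is missing its date and possibly its reason.

```python
import datetime

def is_date(text):
    """True when text looks like 'January 2, 2019'."""
    try:
        datetime.datetime.strptime(text, "%B %d, %Y")
        return True
    except ValueError:
        return False

def fill_spanned_cells(raw_rows):
    """Repeat the date (and reason, if missing) on rows a rowspan left short."""
    filled, last_date, last_reason = [], "", ""
    for cells in raw_rows:
        cells = list(cells)              # copy so the input is untouched
        if cells and is_date(cells[0]):
            last_date, last_reason = cells[0], cells[-1]
        else:
            cells.insert(0, last_date)   # date cell was spanned from above
            if len(cells) == 5:          # reason cell was spanned too
                cells.append(last_reason)
        filled.append(cells)
    return filled

# the second row is made up: it lost its date and reason to a rowspan,
# and its blank strings keep the "nothing removed" columns in place
raw = [
    ["December 3, 2018", "LW", "Lamb Weston Holdings Inc",
     "COL", "Rockwell Collins Inc", "UTX acquires COL"],
    ["AAA", "Example Co", "", ""],
]
for row in fill_spanned_cells(raw):
    print(row)
```

After filling, both rows have six cells, and the blank "removed" cells are still there as empty strings.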

In [3]:
#get table of changes into bs4 result set 
row_chgs = wikitables[1].find_all("tr")

#function to check if first element is a date 
def date_check(date_text): 
    try: 
        datetime.datetime.strptime(date_text, '%B %d, %Y')
        return True 
    except ValueError: 
        return False 

# parse data as is
company_changes_list, date_holder, reason_holder = [], '', ''
for row_num, tr in enumerate(row_chgs):
    if row_num == 0: 
        row_cells = [th.getText().strip() for th in tr.find_all('th') 
                        if th.getText().strip() != '']  
    else: 
        row_cells = (([tr.find('th').getText()] if tr.find('th') else []) 
                        + [td.getText().strip() for td in tr.find_all('td')])
        # check if element is a date 
        if date_check(row_cells[0]): 
            date_holder = row_cells[0]
            reason_holder = row_cells[-1]
        else: 
            row_cells.insert(0, date_holder)
            if len(row_cells) == 5: 
                row_cells.append(reason_holder) 
    if len(row_cells) > 1: 
        # strip out brackets from reference links 
        if len(row_cells) == 6: 
            row_cells[5] = re.sub(r"\[.*?\]", "", row_cells[5])
        company_changes_list += [row_cells]

company_changes_list[:6]

[['Date', 'Added', 'Removed', 'Reason'],
 ['', 'Ticker'],
 ['January 2, 2019',
  'FRC',
  'First Republic Bank',
  'SCG',
  'SCANA',
  'Dominion Energy acquiring SCANA Corporation'],
 ['December 24, 2018',
  'CE',
  'Celanese Corp.',
  'ESRX',
  'Express Scripts',
  'S&P 500 constituent Cigna (NYSE: CI) acquired ESRX'],
 ['December 3, 2018',
  'LW',
  'Lamb Weston Holdings Inc',
  'COL',
  'Rockwell Collins Inc',
  'UTX acquires COL '],
 ['December 3, 2018',
  'MXIM',
  'Maxim Integrated Products Inc',
  'AET',
  'Aetna Inc',
  'CVS acquires Aetna']]

Ok we're good to go! Notice the second element of the list has a junk entry in it since the table headers aren't consistent - such is life when scraping data from the web! 

In [4]:
# final bit of cleaning - delete the junk sub-header row left in the second list element 
del company_changes_list[1]

company_changes_list[:6]

[['Date', 'Added', 'Removed', 'Reason'],
 ['January 2, 2019',
  'FRC',
  'First Republic Bank',
  'SCG',
  'SCANA',
  'Dominion Energy acquiring SCANA Corporation'],
 ['December 24, 2018',
  'CE',
  'Celanese Corp.',
  'ESRX',
  'Express Scripts',
  'S&P 500 constituent Cigna (NYSE: CI) acquired ESRX'],
 ['December 3, 2018',
  'LW',
  'Lamb Weston Holdings Inc',
  'COL',
  'Rockwell Collins Inc',
  'UTX acquires COL '],
 ['December 3, 2018',
  'MXIM',
  'Maxim Integrated Products Inc',
  'AET',
  'Aetna Inc',
  'CVS acquires Aetna'],
 ['December 3, 2018',
  'FANG',
  'Diamondback Energy Inc',
  'SRCL',
  'Stericycle Inc',
  'Market Capitalization change']]
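
With the changes in this shape, here's a rough sketch of how the "as of a given date" step from the plan could work: start from today's tickers and undo every change made after the target date. The function names `parse_change_date` and `components_as_of` are my own, the current ticker set is a made-up three-name stand-in, and the sketch assumes every change row has six cells with the header row already dropped. It also doesn't do anything special about the order of changes that share a date.

```python
import datetime

def parse_change_date(date_text):
    """Parse a date like 'January 2, 2019' into a datetime.date."""
    return datetime.datetime.strptime(date_text, "%B %d, %Y").date()

def components_as_of(current_tickers, changes, as_of):
    """Undo every change made after `as_of`, starting from today's members.

    Each row in `changes` is assumed to look like
    [date, added ticker, added name, removed ticker, removed name, reason].
    """
    members = set(current_tickers)
    # newest first, so changes are undone in reverse chronological order
    ordered = sorted(changes, key=lambda row: parse_change_date(row[0]),
                     reverse=True)
    for row in ordered:
        if parse_change_date(row[0]) <= as_of:
            break                        # everything older is already reflected
        added, removed = row[1], row[3]
        if added:
            members.discard(added)       # it wasn't in the index yet
        if removed:
            members.add(removed)         # it was still in the index
    return members

# two rows copied from the output above; the ticker set is hypothetical
sample_changes = [
    ["January 2, 2019", "FRC", "First Republic Bank", "SCG", "SCANA",
     "Dominion Energy acquiring SCANA Corporation"],
    ["December 24, 2018", "CE", "Celanese Corp.", "ESRX", "Express Scripts",
     "S&P 500 constituent Cigna (NYSE: CI) acquired ESRX"],
]
sample_current = {"MMM", "FRC", "CE"}

result = components_as_of(sample_current, sample_changes,
                          datetime.date(2018, 12, 31))
print(sorted(result))   # ['CE', 'MMM', 'SCG']
```

The blank-cell checks matter because some rows only add or only remove a company, which is why those empty strings were kept during parsing.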