The general idea behind web scraping is to retrieve data from a website and convert it into a format that can be used for analysis. In this tutorial, I will give a detailed but simple explanation of how to scrape data in Python using BeautifulSoup, and I will scrape Wikipedia to collect the ticker symbols of the S&P 400 companies listed at https://en.wikipedia.org/wiki/List_of_S%26P_400_companies. For a BeautifulSoup tutorial, refer to http://www.compjour.org/warmups/govt-text-releases/intro-to-bs4-lxml-parsing-wh-press-briefings/.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

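# requests.get(url).text sends an HTTP GET request and returns the page's HTML as a string.
# Note: the variable names in this notebook say "sp_500", but the URL is the S&P 400 list.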
get_sp_500_ticker_url = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_400_companies').text
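
The one-liner above returns whatever the server sends back, including error pages. A minimal sketch of a more defensive fetch is shown here; the helper name `fetch_html` and its `session` and `timeout` parameters are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML body of `url`, raising on HTTP errors.

    `fetch_html`, `session`, and `timeout` are hypothetical names
    introduced for this sketch.
    """
    http = session or requests              # a requests.Session can be passed in for reuse
    response = http.get(url, timeout=timeout)
    response.raise_for_status()             # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_S%26P_400_companies')
```

Setting a timeout keeps the script from hanging on a stalled connection, and `raise_for_status()` surfaces a failed download before any parsing starts.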

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. The prettify() method in BeautifulSoup lets us view how the tags are nested in the document.

In [8]:
# 'lxml' selects the lxml parser, which must be installed separately (pip install lxml).
detailed_info = BeautifulSoup(get_sp_500_ticker_url, 'lxml')
# print(detailed_info.prettify())
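
To see what the parse tree and prettify() look like without downloading anything, here is a small sketch on a hand-written HTML string. The sample markup and variable names are made up for illustration, and it uses Python's built-in 'html.parser' rather than 'lxml' so that it runs without an extra install.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document standing in for a downloaded page (illustrative only).
sample_html = (
    "<html><body><h1>Tickers</h1>"
    "<p class='note'>Two <b>bold</b> words</p></body></html>"
)

# 'html.parser' ships with Python; the tutorial itself uses 'lxml'.
soup = BeautifulSoup(sample_html, "html.parser")

print(soup.prettify())                     # indented view showing how the tags nest

heading = soup.h1.text                     # navigate the tree by tag name
note = soup.find("p", {"class": "note"})   # search by tag name and attribute
bold_text = note.b.text                    # descend into a nested tag
```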

Inspecting the HTML carefully shows that the table contents, i.e. the ticker names we intend to extract, sit inside a table with the class "wikitable sortable".

In [9]:
# Find the table with class 'wikitable sortable' in the HTML.
sp_table_500 = detailed_info.find('table', {'class': 'wikitable sortable'})
# print(sp_table_500)
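
One caveat with the lookup above: passing the two-word string 'wikitable sortable' matches the class attribute as an exact string, so it can miss a table whose classes appear in a different order or alongside extra classes. The sketch below, on hypothetical markup, contrasts that with a CSS selector via select_one(), which matches each class independently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes, but reordered and with an extra class.
sample_html = """
<table class="sortable wikitable extra"><tr><td>AAN</td></tr></table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string match on the class attribute: finds nothing for this markup.
exact = soup.find("table", {"class": "wikitable sortable"})

# CSS selector: requires both classes, in any order, and ignores additional ones.
by_selector = soup.select_one("table.wikitable.sortable")

print(exact)
print(by_selector.td.text)
```

If find() ever returns None on the live page, switching to the selector form is one thing to try.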

In [12]:
# Use find_all() to extract every <a> link with rel="nofollow" inside the table; other links are not of interest.
sp_500_link = sp_table_500.find_all('a', {'rel': 'nofollow'})
# sp_500_link

In [5]:
# Keep the ticker symbols and skip the links whose text is 'reports', which are not tickers.
sp_500_ticker = []
for ticker in sp_500_link:
    if ticker.text != 'reports':
        sp_500_ticker.append(ticker.text)
print(sp_500_ticker)

['AAN', 'ACHC', 'ACIW', 'ADNT', 'ATGE', 'ACM', 'ACC', 'AEO', 'AFG', 'AGCO', 'AHL', 'AKRX', 'ALE', 'ALEX', 'APY', 'ATI', 'AMCX', 'AN', 'ARW', 'ARRS', 'ASB', 'ASGN', 'ASH', 'ATO', 'ATR', 'AVNS', 'AVT', 'AYI', 'BBBY', 'BC', 'BCO', 'BDC', 'BID', 'BIG', 'BIO', 'BKH', 'BLKB', 'BMS', 'BOH', 'BRO', 'BXS', 'BYD', 'CABO', 'CAKE', 'CAR', 'CARS', 'CASY', 'CATY', 'CBSH', 'CBT', 'CC', 'CDK', 'CFR', 'CGNX', 'CHE', 'CHDN', 'CHFC', 'CHK', 'CIEN', 'CLB', 'CLGX', 'CLH', 'CLI', 'CMC', 'CMD', 'CMP', 'CNK', 'CNO', 'COHR', 'CONE', 'COR', 'CPE', 'CPT', 'CR', 'CREE', 'CRI', 'CRL', 'CRS', 'CRUS', 'CNX', 'CSL', 'CTLT', 'CUZ', 'CVLT', 'CXW', 'CW', 'CBRL', 'CY', 'DAN', 'DCI', 'DDS', 'DECK', 'DEI', 'DKS', 'DLPH', 'DLX', 'DNB', 'DNKN', 'DNOW', 'DO', 'DPZ', 'DRQ', 'DY', 'EAT', 'EGN', 'EHC', 'EME', 'ENR', 'ENS', 'EPC', 'EPR', 'ERI', 'ESL', 'ESV', 'EV', 'EVR', 'EWBC', 'EXEL', 'EXP', 'FAF', 'FDS', 'FHN', 'FICO', 'FII', 'FIVE', 'FLO', 'FR', 'FNB', 'FSLR', 'FULT', 'GATX', 'GEF', 'GEO', 'GGG', 'GHC', 'GME', 'GMED', 'GNTX',
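
The filtering loop can also be written as a list comprehension. The following is a sketch on a hypothetical two-row table that imitates the structure described above (a ticker link followed by a 'reports' link in each row); the example.org URLs are placeholders, and get_text(strip=True) is used to guard against stray whitespace around the link text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table; the real Wikipedia markup may differ.
sample_html = """
<table class="wikitable sortable">
  <tr><td><a rel="nofollow" href="https://example.org/aan">AAN</a></td>
      <td><a rel="nofollow" href="https://example.org/aan-filings">reports</a></td></tr>
  <tr><td><a rel="nofollow" href="https://example.org/achc"> ACHC </a></td>
      <td><a rel="nofollow" href="https://example.org/achc-filings">reports</a></td></tr>
</table>
"""
table = BeautifulSoup(sample_html, "html.parser").find("table")

# Collect the text of every rel="nofollow" link except the 'reports' links.
tickers = [
    link.get_text(strip=True)
    for link in table.find_all("a", {"rel": "nofollow"})
    if link.get_text(strip=True) != "reports"
]
print(tickers)
```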

In [6]:
# Convert the list of tickers into a pandas DataFrame for further work.
df = pd.DataFrame({'sp_500_ticker': sp_500_ticker})
df

   sp_500_ticker
0            AAN
1           ACHC
2           ACIW
3           ADNT
4           ATGE
5            ACM
6            ACC
7            AEO
8            AFG
9           AGCO


In [7]:
# Or export it to CSV. to_csv() returns None when given a path, so the result is not assigned.
df.to_csv(r'/Users/XXXX/Downloads/ticker_500.csv')
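
By default, to_csv() also writes the DataFrame's index as an unnamed first column, and pandas labels that column "Unnamed: 0" when the file is read back. The sketch below shows the difference that index=False makes; it writes to a temporary directory and uses a three-row stand-in DataFrame, both of which are assumptions made so the example is self-contained.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the ticker DataFrame built above.
df = pd.DataFrame({"sp_500_ticker": ["AAN", "ACHC", "ACIW"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    df.to_csv(with_index)                   # default: index written as the first column
    df.to_csv(without_index, index=False)   # tickers only

    # Read both files back to compare the resulting column names.
    columns_with_index = list(pd.read_csv(with_index).columns)
    columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```

Passing index=False is usually the simpler choice when the index is just a running row number.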