# Scraping S&P 500 Data Off the Web

I'd like to analyze stocks that are members of the S&P 500 index. There are lots of free resources on the web, but I prefer the free stock screener on the finviz website (http://www.finviz.com). Since the free screener has no download button, I'll scrape the data from the screener's webpages.

## Getting the list of S&P 500 Stocks

I'll start off by getting a list of all the S&P 500 stocks... Actually, I'll start by importing the Python libraries I'll be using.

For this project I'll need to use the Requests, Beautiful Soup and Pandas libraries.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

I'll start with the screener webpage that filters stocks that belong to the S&P 500 index.

Here, I'll request the webpage, then parse the content using BeautifulSoup.

In [2]:
# request the webpage and parse the content
response = requests.get("https://finviz.com/screener.ashx?v=111&f=idx_sp500")
content = response.content
parser = BeautifulSoup(content, 'html.parser')
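
The request-and-parse step comes up again for every page below, so here's a minimal sketch of it wrapped in a helper. The name `fetch_soup`, the `User-Agent` string and the 10-second timeout are my own assumptions, not anything finviz documents; some sites may reject the default Requests user agent, and `raise_for_status()` makes a failed request stop the notebook instead of silently parsing an error page.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical header value; adjust or drop it if the site does not need it.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; sp500-notebook)"}


def fetch_soup(url, session=None, timeout=10):
    """Request a page and return it parsed, raising on HTTP errors."""
    session = session or requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    # Raise an exception on 4xx/5xx responses rather than parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, the cell above would reduce to `parser = fetch_soup("https://finviz.com/screener.ashx?v=111&f=idx_sp500")`.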

Printing the parser shows the HTML code of the webpage.

In [3]:
print(parser)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>
<head>
<title>Stock Screener - Overview sp500 </title>
<meta content="Stock screener for investors and traders, financial visualizations." name="description"/>
<meta content="Stock Screener, Charts, Quotes, Maps, News, Financial Visualizations, Research, Trading Systems" name="keywords"/>
<meta content="noindex" name="robots"/>
<meta content="no" http-equiv="imagetoolbar"/>
<meta content="no-cache" http-equiv="pragma"/>
<meta content="no-cache" http-equiv="cache-control"/>
<meta content="-1" http-equiv="Expires"/>
<link href="//fonts.googleapis.com/css?family=Lato:400,700,900" rel="stylesheet" type="text/css"/><link href="/finviz.css?rev=114" rel="stylesheet" type="text/css"/>
<link href="/favicon_2x.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon.png" rel="icon" sizes="16x16" type="image/png"/>
<script src="/script/boxover.js?rev=128" type="text/javascript"></script>
<script src="/script/light

Now, I'm only interested in the part of the HTML code that shows the ticker symbols. 

Fortunately, the ticker symbols are the only HTML elements that have the CSS class "screener-link-primary", so it's relatively easy to select those elements and store their text in a list.

In [4]:
# selecting stocks/ticker symbols
stocks = parser.select(".screener-link-primary")
sp_list = [stock.get_text() for stock in stocks]

Here are the ticker symbols from this webpage.

In [5]:
print(sp_list)

['A', 'AA', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABMD', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADM', 'ADP', 'ADS', 'ADSK', 'AEE', 'AEP', 'AES', 'AFL']


I need to repeat this for the remaining 25 of the 26 webpages that list S&P 500 ticker symbols. I'll loop through those webpages, parse the content, get the ticker symbols, then append them to sp_list.

In [6]:
page = "https://finviz.com/screener.ashx?v=111&f=idx_sp500"

# pages 2 - 26 of search results end with "&r=" + some number
# in the set (21,41,...,501)
for i in range(21,521,20):
    next_page = page + "&r=" + str(i)
    response = requests.get(next_page)
    content = response.content
    parser = BeautifulSoup(content, 'html.parser')
    stocks = parser.select(".screener-link-primary")
    # add this page's ticker symbols to the running list
    sp_list.extend(stock.get_text() for stock in stocks)
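
To double-check the offsets used in the loop, here's a small sketch that builds the full list of page URLs. It assumes 20 rows per page (the first page returned 20 tickers) and 505 rows in total; `screener_urls` is a name I made up for illustration.

```python
BASE = "https://finviz.com/screener.ashx?v=111&f=idx_sp500"


def screener_urls(total_rows, per_page=20, base=BASE):
    """Build one URL per results page; page 1 has no "&r=" offset."""
    urls = [base]
    # Each later page starts at row 21, 41, 61, ... up to the last row.
    for first_row in range(per_page + 1, total_rows + 1, per_page):
        urls.append(base + "&r=" + str(first_row))
    return urls


urls = screener_urls(505)
print(len(urls))  # 26
print(urls[-1])   # the last page, ending in "&r=501"
```

Deriving the range from the row count means the loop doesn't need a hand-edited upper bound if the number of component stocks changes.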

Here are all the ticker symbols for S&P 500 stocks.

In [7]:
print(sp_list)

['A', 'AA', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABMD', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADM', 'ADP', 'ADS', 'ADSK', 'AEE', 'AEP', 'AES', 'AFL', 'AGN', 'AIG', 'AIV', 'AIZ', 'AJG', 'AKAM', 'ALB', 'ALGN', 'ALK', 'ALL', 'ALLE', 'ALXN', 'AMAT', 'AMD', 'AME', 'AMG', 'AMGN', 'AMP', 'AMT', 'AMZN', 'ANET', 'ANSS', 'ANTM', 'AON', 'AOS', 'APA', 'APC', 'APD', 'APH', 'ATVI', 'AVB', 'AVGO', 'AVY', 'AWK', 'AXP', 'AZO', 'BA', 'BAC', 'BAX', 'BBT', 'BBY', 'BDX', 'BEN', 'BF-B', 'BHF', 'BHGE', 'BIIB', 'BK', 'BKNG', 'BLK', 'BLL', 'BMY', 'BR', 'BRK-B', 'BSX', 'BWA', 'BXP', 'C', 'CAG', 'CAH', 'CAT', 'CBOE', 'CBRE', 'CBS', 'CCI', 'CCL', 'CDNS', 'CE', 'CELG', 'CERN', 'CF', 'CFG', 'CHD', 'CHRW', 'CHTR', 'CI', 'CINF', 'CL', 'CLX', 'CMA', 'CMCSA', 'CME', 'CMG', 'CMI', 'CMS', 'CNC', 'CNP', 'COF', 'COG', 'COO', 'COP', 'COST', 'CPB', 'CPRT', 'CRM', 'CSCO', 'CSX', 'CTAS', 'CTL', 'CTSH', 'CTXS', 'CVS', 'CVX', 'CXO', 'D', 'DAL', 'DE', 'DFS', 'DG', 'DGX', 'DHI', 'DHR', 'DIS', 'DISCA', 'DISCK', 'DISH', 'DLPH', 'DLR', 'DLT

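Despite the number 500 being in the name, there are actually 505 component stocks in the S&P 500 index at the time of this scrape, because some companies are represented by more than one share class (DISCA and DISCK both appear in the list above, for example).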

In [8]:
len(sp_list)

505

## Scraping the Columns

With the list of ticker symbols, I can search the screener's webpages for each individual stock.

Because I want a list of all the columns when I create a Pandas DataFrame, I'll search for one arbitrary stock, and get the fields in the relevant table.

In [9]:
page = "https://finviz.com/quote.ashx?t=AA"
response = requests.get(page)
content = response.content
parser = BeautifulSoup(content, 'html.parser')

Here's the HTML code for Alcoa's webpage.

In [10]:
print(parser)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>
<head>
<title>AA Alcoa Corporation Stock Quote</title>
<meta content="Stock screener for investors and traders, financial visualizations." name="description"/>
<meta content="Stock Screener, Charts, Quotes, Maps, News, Financial Visualizations, Research, Trading Systems" name="keywords"/>
<meta content="no" http-equiv="imagetoolbar"/>
<meta content="no-cache" http-equiv="pragma"/>
<meta content="no-cache" http-equiv="cache-control"/>
<meta content="-1" http-equiv="Expires"/>
<link href="//fonts.googleapis.com/css?family=Lato:400,700,900" rel="stylesheet" type="text/css"/><link href="/finviz.css?rev=114" rel="stylesheet" type="text/css"/>
<link href="/favicon_2x.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon.png" rel="icon" sizes="16x16" type="image/png"/>
<script src="/script/boxover.js?rev=128" type="text/javascript"></script>
<script src="/script/lightup.js?rev=128" type="text/javascript"></

The first HTML table with the "snapshot-table2" CSS class is the table with all the data.

Within that table, the elements with the "snapshot-td2-cp" CSS class are the names of various attributes.

In [11]:
columns = parser.select(".snapshot-table2")[0].select(".snapshot-td2-cp")

# collect the text of each label cell
table_columns = [column.get_text() for column in columns]

Here are all the names, which I'll be using as column names for the DataFrame.

In [12]:
print(table_columns)

['Index', 'P/E', 'EPS (ttm)', 'Insider Own', 'Shs Outstand', 'Perf Week', 'Market Cap', 'Forward P/E', 'EPS next Y', 'Insider Trans', 'Shs Float', 'Perf Month', 'Income', 'PEG', 'EPS next Q', 'Inst Own', 'Short Float', 'Perf Quarter', 'Sales', 'P/S', 'EPS this Y', 'Inst Trans', 'Short Ratio', 'Perf Half Y', 'Book/sh', 'P/B', 'EPS next Y', 'ROA', 'Target Price', 'Perf Year', 'Cash/sh', 'P/C', 'EPS next 5Y', 'ROE', '52W Range', 'Perf YTD', 'Dividend', 'P/FCF', 'EPS past 5Y', 'ROI', '52W High', 'Beta', 'Dividend %', 'Quick Ratio', 'Sales past 5Y', 'Gross Margin', '52W Low', 'ATR', 'Employees', 'Current Ratio', 'Sales Q/Q', 'Oper. Margin', 'RSI (14)', 'Volatility', 'Optionable', 'Debt/Eq', 'EPS Q/Q', 'Profit Margin', 'Rel Volume', 'Prev Close', 'Shortable', 'LT Debt/Eq', 'Earnings', 'Payout', 'Avg Volume', 'Price', 'Recom', 'SMA20', 'SMA50', 'SMA200', 'Volume', 'Change']
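
One thing worth noticing in the list above: 'EPS next Y' appears twice. Pandas accepts duplicate column labels, but selecting `sp['EPS next Y']` would then return both columns instead of a single Series. Here's a minimal sketch of one way to make the names unique; the `dedupe` helper and the ".1" suffix style are my own choices (the suffix mirrors what `pandas.read_csv` does with repeated headers), not something finviz provides.

```python
from collections import Counter


def dedupe(names):
    """Append .1, .2, ... to repeated names, leaving first occurrences unchanged."""
    seen = Counter()
    result = []
    for name in names:
        count = seen[name]
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] += 1
    return result


sample = ['EPS this Y', 'EPS next Y', 'ROA', 'EPS next Y']
print(dedupe(sample))
# ['EPS this Y', 'EPS next Y', 'ROA', 'EPS next Y.1']
```

Calling `dedupe(table_columns)` before building the DataFrame would keep every column addressable by a unique name.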


## Scraping All of the Data

With the ticker symbols and column names taken care of, all that remains is getting the data for all 505 component stocks.

Scraping the data is nearly identical to getting the table columns in the last step. Like last time, the first HTML table with the "snapshot-table2" CSS class is the correct table. The difference is that the data is contained in HTML elements with the CSS class "snapshot-td2".

In [13]:
# store data in a list of lists
data = []


for ticker in sp_list:
    page = "https://finviz.com/quote.ashx?t=" + ticker
    response = requests.get(page)
    content = response.content
    parser = BeautifulSoup(content, 'html.parser')
    raw_fields = parser.select(".snapshot-table2")[0].select(".snapshot-td2")
    # collect the text of each value cell
    fields = [field.get_text() for field in raw_fields]
    data.append(fields)
    print(ticker + " processed...")

A processed...
AA processed...
AAL processed...
AAP processed...
AAPL processed...
ABBV processed...
ABC processed...
ABMD processed...
ABT processed...
ACN processed...
ADBE processed...
ADI processed...
ADM processed...
ADP processed...
ADS processed...
ADSK processed...
AEE processed...
AEP processed...
AES processed...
AFL processed...
AGN processed...
AIG processed...
AIV processed...
AIZ processed...
AJG processed...
AKAM processed...
ALB processed...
ALGN processed...
ALK processed...
ALL processed...
ALLE processed...
ALXN processed...
AMAT processed...
AMD processed...
AME processed...
AMG processed...
AMGN processed...
AMP processed...
AMT processed...
AMZN processed...
ANET processed...
ANSS processed...
ANTM processed...
AON processed...
AOS processed...
APA processed...
APC processed...
APD processed...
APH processed...
ATVI processed...
AVB processed...
AVGO processed...
AVY processed...
AWK processed...
AXP processed...
AZO processed...
BA processed...
BAC processed...
B

WBA processed...
WCG processed...
WDC processed...
WEC processed...
WELL processed...
WFC processed...
WHR processed...
WLTW processed...
WM processed...
WMB processed...
WMT processed...
WRK processed...
WU processed...
WY processed...
WYNN processed...
XEC processed...
XEL processed...
XLNX processed...
XOM processed...
XRAY processed...
XRX processed...
XYL processed...
YUM processed...
ZBH processed...
ZION processed...
ZTS processed...
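
A loop of 505 requests has plenty of chances to fail partway through, so here's a hedged sketch of a more forgiving version. It assumes a `fetch_html(ticker)` callable that returns a page's HTML (for instance a thin wrapper around `requests.get`), pauses between requests, and records the tickers that fail instead of stopping. The names `parse_fields` and `scrape_all`, and the one-second delay, are assumptions of mine.

```python
import time

from bs4 import BeautifulSoup


def parse_fields(html):
    """Return the value cells of the first "snapshot-table2" table."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.select(".snapshot-table2")
    if not tables:
        raise ValueError("no snapshot-table2 table found")
    return [cell.get_text() for cell in tables[0].select(".snapshot-td2")]


def scrape_all(tickers, fetch_html, delay=1.0):
    """Scrape every ticker, collecting failures instead of raising."""
    data, failed = {}, []
    for ticker in tickers:
        try:
            data[ticker] = parse_fields(fetch_html(ticker))
        except Exception as error:  # network error, HTTP error, or changed layout
            failed.append((ticker, str(error)))
        time.sleep(delay)  # pause between requests to go easy on the server
    return data, failed
```

Because `data` is keyed by ticker, rows stay matched to the right symbols even when some requests fail; something like `pd.DataFrame.from_dict(data, orient="index", columns=table_columns)` could then build the DataFrame from the tickers that succeeded.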


## Storing the Data

I'll store the data in a Pandas DataFrame.

Here are the first 5 rows.

In [14]:
sp = pd.DataFrame(data,columns=table_columns, index=sp_list)

print(sp.head(5))

            Index     P/E EPS (ttm) Insider Own Shs Outstand Perf Week  \
A         S&P 500   24.42      2.68       0.30%      321.92M    -1.53%   
AA        S&P 500  977.24      0.03       0.10%      195.69M     4.34%   
AAL       S&P 500    6.94      4.62       0.20%      477.01M     0.00%   
AAP       S&P 500   34.34      4.63       0.10%       74.11M     2.66%   
AAPL  DJIA S&P500   12.21     12.15       0.07%        4.87B    -5.05%   

     Market Cap Forward P/E EPS next Y Insider Trans   ...      Earnings  \
A        21.07B       19.41       3.37        -5.17%   ...    Nov 19 AMC   
AA        5.55B        8.01       3.54         0.00%   ...    Jan 16 AMC   
AAL      15.28B        5.63       5.69         4.38%   ...    Jan 24 BMO   
AAP      11.77B       19.50       8.14         0.27%   ...    Feb 12 BMO   
AAPL    722.24B       10.13      14.64       -14.91%   ...    Jan 31 AMC   

      Payout Avg Volume   Price Recom   SMA20    SMA50   SMA200      Volume  \
A     58.10% 

Now, I'll write the DataFrame to a .csv file, then take a break.

In [15]:
sp.to_csv("S&P500.csv")
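
For picking the data up again later, here's a small sketch of reading such a file back. It uses a two-row stand-in DataFrame and a temporary file rather than the real dataset; the file name and sample values are only for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the scraped DataFrame: string values, tickers as the index.
frame = pd.DataFrame(
    [["24.42", "21.07B"], ["977.24", "5.55B"]],
    columns=["P/E", "Market Cap"],
    index=["A", "AA"],
)

path = os.path.join(tempfile.mkdtemp(), "sp500_sample.csv")
frame.to_csv(path)

# index_col=0 restores the tickers as the index; dtype=str keeps values as text.
restored = pd.read_csv(path, index_col=0, dtype=str)
print(restored.loc["AA", "P/E"])
```

Reading with `dtype=str` keeps values like "21.07B" and "0.30%" exactly as scraped, which leaves the numeric conversion as a deliberate cleaning step.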

## Conclusion

I've got 72 columns of data for each of the 505 S&P 500 component stocks.

Just as the table's CSS class contains the word "snapshot", the 72 columns of data are really just a snapshot in time. Still, I think it's good enough to get a grasp of the state of the S&P 500 stocks as of the snapshot (i.e. Sunday, January 6, 2019).

For my next project, I'll clean up this dataset.

Thanks for reading.