6/20 - I am in the process of recreating this project here from my "rough" version, which is complete but quite messy.

This project came from me reading about "all-time high" short ratios for SPY, the S&P 500 Index ETF. The article positioned this as an ominous sign, but it made me wonder:

"What tends to happen to future returns when a stock (or ETF) has a high (or low) short ratio?"

On the one hand, a high short ratio suggests a high level of negative sentiment towards that holding. But, on the other, shorts have to close out their positions at some point, so a high short ratio could also portend future demand, which could be a positive.

I thought this would be a great question to explore personally - and a great way to develop my data scraping/pandas manipulation skills.

My goal is to look at the stocks in the S&P 500 and observe any potential relationship(s) between short interest and future returns.

NOTES:
- I could not find a publicly available "short ratio" data set, so I used a proxy based on daily short volume. Namely, I used the trailing 30-day average of short volume as a percentage of total trading volume. The analysis would likely be more accurate with a data set that tracks shares sold short and still outstanding, not just daily trading volume.
- I am still learning the very basics of machine learning and other forms of predictive analysis, so right now this is really just an exercise in data gathering and manipulation. Predictive modeling would be a great next step.
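
The proxy described in the first note could be computed roughly as follows. This is a minimal sketch, not necessarily the code used later in this project: the column names `short_volume` and `total_volume`, the helper `short_volume_ratio`, and the sample numbers are all hypothetical. It also assumes that "30 day" means 30 trading-day rows, and that the average is taken over the daily percentages rather than by dividing summed short volume by summed total volume.

```python
import pandas as pd


def short_volume_ratio(daily: pd.DataFrame, window: int = 30) -> pd.Series:
    """Trailing average of daily short volume as a fraction of total volume.

    Assumes `daily` has one row per trading day, in date order, with the
    hypothetical columns 'short_volume' and 'total_volume'.
    """
    daily_pct = daily["short_volume"] / daily["total_volume"]
    # min_periods=window leaves NaN until a full window of history exists
    return daily_pct.rolling(window=window, min_periods=window).mean()


# Made-up data: 40 trading days where 25% of each day's volume is short
sample = pd.DataFrame(
    {"short_volume": [250] * 40, "total_volume": [1000] * 40},
    index=pd.bdate_range("2019-01-01", periods=40),
)
ratio = short_volume_ratio(sample)
print(ratio.tail())
```

With this setup the first 29 rows are NaN, because a full 30-row window is not yet available for them.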

In [2]:
# Scrape a list of the S&P 500 companies and their ticker symbols from Wikipedia

# Imports
import urllib.request
from bs4 import BeautifulSoup

# Specify url to scrape
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

# Open the URL and save the response
page = urllib.request.urlopen(url)

# Parse the HTML into BeautifulSoup
soup = BeautifulSoup(page, 'lxml')

# Create a variable to hold the HTML for the table in question (that holds the S&P 500 stocks and tickers)
# From exploring the HTML of the page I found that the table in question was a class 'wikitable sortable'
ticker_table = soup.find('table', class_='wikitable sortable')

# Initialize lists to "catch" the tickers and company names
tickers = []
companies = []

# Loop through the table rows and extract the first two cells of each row ('ticker' and 'company')
for row in ticker_table.find_all('tr'):
    cells = row.find_all('td')
    # The header row uses <th> cells, so it has no <td> cells and is skipped
    if len(cells) < 2:
        continue
    # get_text() also works when a cell contains nested tags, where .string can be None
    tickers.append(cells[0].get_text(strip=True))
    companies.append(cells[1].get_text(strip=True))

In [5]:
# Check my lists
display(tickers[0:5])
display(len(tickers))
display(companies[0:5])
display(len(companies))

['MMM', 'ABT', 'ABBV', 'ABMD', 'ACN']

505

['3M Company',
 'Abbott Laboratories',
 'AbbVie Inc.',
 'ABIOMED Inc',
 'Accenture plc']

505

The lists look right, but we have 505 companies/tickers, NOT 500. That looks like it could be an error, but the S&P 500 name is actually slightly misleading.

The "500" refers to the number of companies in the index. There are actually 505 different securities listed on the index, because some companies have more than one share class included. For example, Alphabet has both its Class A (GOOGL) and Class C (GOOG) shares listed.

This means our list lengths look good. We can consider this step a success and move on.
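
The same checks can also be made explicit rather than done by eye. The sketch below is one possible way to do it; `check_scraped_lists` is a hypothetical helper that is not part of the original notebook, and the sample lists are just the first five values displayed above.

```python
def check_scraped_lists(tickers, companies):
    """Return a list of problems found in the scraped lists (empty if none)."""
    problems = []
    if len(tickers) != len(companies):
        problems.append(
            f"length mismatch: {len(tickers)} tickers vs {len(companies)} companies"
        )
    if any(t is None for t in tickers) or any(c is None for c in companies):
        problems.append("missing values (None) found")
    if len(set(tickers)) != len(tickers):
        problems.append("duplicate tickers found")
    return problems


# Sample input: the first five scraped values shown in the output above
sample_tickers = ['MMM', 'ABT', 'ABBV', 'ABMD', 'ACN']
sample_companies = ['3M Company', 'Abbott Laboratories', 'AbbVie Inc.',
                    'ABIOMED Inc', 'Accenture plc']
print(check_scraped_lists(sample_tickers, sample_companies))
```

An empty list means none of the three checks failed. Running it on the full `tickers` and `companies` lists would flag a silent scraping problem, such as a cell that came back as `None`.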

Ultimately, all of this data will be stored in a dataframe, so let's set that up now:

In [8]:
# Import pandas and initialize dataframe for final data
import pandas as pd
final_df = pd.DataFrame(tickers, columns=['ticker'])
final_df['company'] = companies

# Preview the first five rows
final_df.head()

  ticker              company
0    MMM           3M Company
1    ABT  Abbott Laboratories
2   ABBV          AbbVie Inc.
3   ABMD          ABIOMED Inc
4    ACN        Accenture plc
