# A Deep Dive into the S&P 500: Predicting Stock Prices
Kanishk Chinnapapannagari, Aarav Naveen, Avyay Potarlanka, and Melvin Rajendran

## Introduction

In today’s evolving financial landscape, both investors and traders are constantly seeking an edge to make informed decisions. The stock market, which contains an intricate web of variables and is influenced by numerous factors, has proven to be a difficult environment to navigate.

In the past, investment-related decisions were often made based on analysis of historical trends. However, the advancement of data science and machine learning techniques has introduced a new opportunity to potentially predict future stock prices with reasonable accuracy and thus gain valuable insights.

This data science project delves into prediction of stock prices within the Standard & Poor’s 500 index, otherwise known as the S&P 500. This index contains 500 of the top companies in the United States, and it represents approximately 80% of the U.S. stock market’s total value. Hence, it serves as a strong indicator of the movement within the market. To learn more about the S&P 500 and other popular indices in the U.S., read this article: https://www.investopedia.com/insights/introduction-to-stock-market-indices/.

Throughout this project, we will follow a comprehensive data science approach that includes the following steps:
* Data collection
* Data processing
* Exploratory data analysis and data visualization
* Data analysis, hypothesis testing, and machine learning (ML)
* Insight formation

Our project aims to leverage predictive modeling techniques to provide insights to investors. The analysis herein aims to identify stocks that appear undervalued and are therefore likely to rise in price in the near future, meaning investors should consider buying or holding shares. Likewise, it aims to identify stocks that appear overvalued and are likely to fall in price, indicating that investors should consider selling their positions.

In [None]:
# Import necessary libraries
from bs4 import BeautifulSoup
from keras.layers import Dense, LSTM
from keras.models import Sequential
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import requests
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

## Data Collection

### Reading in a Kaggle Dataset

To gather information about the S&P 500 companies, we will be using the following dataset: https://www.kaggle.com/datasets/paultimothymooney/stock-market-data. This Kaggle dataset contains the date, volume, and prices for the NASDAQ, NYSE, and S&P 500. For the purposes of this project, we will only analyze the stock prices of companies in the S&P 500.

In [None]:
# Initialize an empty data frame to store the stock price data
price_data = pd.DataFrame()

# Initialize the path to the folder containing the data
folder_path = 'sp500-data'

# Iterate across each file in the folder by name
for file_name in os.listdir(folder_path):
    
    # Check if the current file is a CSV file
    if file_name.endswith('.csv'):
        
        # Read the current file into a temporary data frame
        temp = pd.read_csv(os.path.join(folder_path, file_name))
        
        # Extract the symbol from the current file's name
        symbol = file_name[0:-4]
        
        # Store the symbol in a new column in the temporary data frame 
        temp['Symbol'] = symbol
        
        # Concatenate the accumulating and temporary data frames
        price_data = pd.concat([price_data, temp], ignore_index = True)

# Print the last five rows of the price data frame
price_data.tail()

### Webscraping From Wikipedia

We noticed that the Kaggle dataset does not contain sector data. For this reason, we will supplement our existing data with the table on the following webpage: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies. By scraping this webpage's list of the S&P 500 companies, we can match each company in our existing data to its corresponding GICS sector and sub-industry. This will enable us to perform analysis by sector and/or sub-industry and thus reduce sector-related bias in our modeling.

In [None]:
# Headers for the HTTP request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'From': 'pleaseletmein@gmail.com'
}

# Make an HTTP request to the Wikipedia URL and store the response
response = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies', headers = headers)

# Parse the text from the webpage as HTML
soup = BeautifulSoup(response.text, 'html.parser')

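# Find the first table element on the page, which contains the data
table = soup.find('table')

# Read the HTML table into a data frame (wrapped in StringIO because newer
# pandas versions deprecate passing a literal HTML string)
from io import StringIO
sector_data = pd.read_html(StringIO(str(table)), flavor = 'html5lib')[0]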

# Print the last five rows of the sector data frame
sector_data.tail()

### Webscraping From Slickcharts

We would also like to focus our attention on the top companies of each sector, as these companies drive the movement within their respective sectors. Hence, we will scrape the data from the following webpage: https://www.slickcharts.com/sp500. This webpage contains a list of the S&P 500 companies by weight, where weight is equal to a company's market cap divided by the overall value of the S&P 500. Ultimately, we will select the top companies of each sector by weight.
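
To make the definition of weight concrete, here is a minimal sketch that computes index weights from market caps. The three tickers and their market caps are hypothetical values chosen for illustration; they are not taken from Slickcharts.

```python
import pandas as pd

# Hypothetical market caps (in billions of USD) for a three-company index
caps = pd.Series({'AAA': 600.0, 'BBB': 300.0, 'CCC': 100.0}, name='Market Cap')

# Weight = company market cap / total market cap of the index, as a percentage
weights = caps / caps.sum() * 100

print(weights)
```

By construction the weights sum to 100, so a company's weight can be read directly as its share of the index.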

In [None]:
# Make an HTTP request to the Slickcharts URL and store the response
response = requests.get('https://www.slickcharts.com/sp500', headers = headers)

# Parse the text from the webpage as HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find the first table element on the page, which contains the data
table = soup.find('table')

# Read the HTML table into a data frame (wrapped in StringIO because newer
# pandas versions deprecate passing a literal HTML string)
from io import StringIO
weight_data = pd.read_html(StringIO(str(table)), flavor = 'html5lib')[0]

# Print the last five rows of the weight data frame
weight_data.tail()

## Data Processing

At this point, we have three data frames containing data that was collected in the previous step. We will merge this data into a single data frame. Then, we will filter our data to include only the top five companies within each sector. As part of this process, we need to clean our data. Data cleaning will involve casting our data to the proper types, removing entries with missing values, and removing unnecessary columns.

### Cleaning the Sector Data

In [None]:
# Rename the sector and industry-related columns
sector_data = sector_data.rename(columns = {'GICS Sector': 'Sector', 'GICS Sub-Industry': 'Industry'})

# Drop unnecessary columns
sector_data = sector_data.drop(['Headquarters Location', 'Date added', 'CIK', 'Founded'], axis = 1)

# Print the last five rows of the data frame
sector_data.tail()

### Cleaning the Weight Data

In [None]:
# Drop all columns except Symbol and Weight
weight_data = weight_data.drop(['#', 'Company', 'Price', 'Chg', '% Chg'], axis = 1)

# Print the last five rows of the data frame
weight_data.tail()

### Merging the Three Data Frames

In [None]:
# Perform an inner join (merge) on all three data frames to create a single data frame
data = pd.merge(pd.merge(price_data, sector_data, on = 'Symbol'), weight_data, on = 'Symbol')

# Reindex the columns of the data frame
data = data.reindex(columns = ['Symbol', 'Security', 'Sector', 'Industry', 'Weight', 'Date', 'Open', 'High', 'Low', 'Close', 'Adjusted Close', 'Volume'])

# Cast the Date column's type to datetime
data['Date'] = pd.to_datetime(data['Date'], dayfirst = True)

# Print the last five rows of the resulting data frame
data.tail()
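
Because the merge above is an inner join, any symbol that is missing from one of the data frames is silently dropped. The sketch below shows this on two small, hypothetical data frames and uses the `indicator` option of `pd.merge` to list the symbols that would be lost.

```python
import pandas as pd

# Hypothetical stand-ins for the price and sector data frames
prices = pd.DataFrame({'Symbol': ['AAA', 'BBB'], 'Close': [10.0, 20.0]})
sectors = pd.DataFrame({'Symbol': ['BBB', 'CCC'], 'Sector': ['Energy', 'Utilities']})

# Inner join (the default): only symbols present in both frames survive
inner = pd.merge(prices, sectors, on='Symbol')

# Outer join with an indicator column to see which symbols failed to match
check = pd.merge(prices, sectors, on='Symbol', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', 'Symbol'].tolist()

print(inner)
print(unmatched)
```

Running a check like this on the real data frames is one way to confirm how many S&P 500 companies are lost in the join.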

### Filtering the Top 5 Companies Within Each Sector

In [None]:
# Initialize an empty data frame to contain the filtered data
top_data = pd.DataFrame()

# Iterate across a list of the unique sectors
for sector in data['Sector'].unique():
    
    # Filter the data by the current sector
    sector_data = data[data['Sector'] == sector]

    # Compile a list of the top five weights in the current sector
    top_five_weights = sorted(sector_data['Weight'].unique(), reverse = True)[:5]
    
    # Filter the data by the top five weights
    sector_data = sector_data[sector_data['Weight'].isin(top_five_weights)]
    
    # Concatenate the top five companies' data into the accumulating dataframe
    top_data = pd.concat([top_data, sector_data], ignore_index = True)

# Print the last five rows of the filtered data frame
top_data.tail()
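
As an alternative view of the same selection, the sketch below picks the top companies per sector at the company level using `groupby` and `head`. The toy data frame, its column values, and `top_n` are assumptions made for illustration, with one row per company rather than one row per trading day.

```python
import pandas as pd

# Hypothetical company-level table: one row per company
toy = pd.DataFrame({
    'Symbol': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Sector': ['Tech', 'Tech', 'Tech', 'Energy', 'Energy', 'Energy'],
    'Weight': [5.0, 3.0, 1.0, 2.0, 0.5, 1.5],
})

# Number of companies to keep per sector (hypothetical; the project uses five)
top_n = 2

# Sort by weight, then keep the first top_n rows within each sector
top_symbols = (
    toy.sort_values('Weight', ascending=False)
       .groupby('Sector')
       .head(top_n)['Symbol']
)
selected = sorted(top_symbols)

print(selected)
```

The resulting list of symbols could then be used to filter the full price data with `isin`.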

## Exploratory Data Analysis and Data Visualization

Before we fit a machine learning model to our data, we would like to visualize it by sector and preliminarily determine relationships between the data. In particular, we would like to analyze how strongly the stock prices of companies within the same sector are correlated.

For the remainder of our analysis, we will focus our attention on adjusted close price, which is explained in the following section.

### Plotting Adjusted Close Price vs. Date

Adjusted close price is the final price at which a security trades at the end of a trading day, adjusted for dividends, stock splits, and new offerings. Because it accounts for these corporate actions, it gives a more consistent picture of a company's stock price over time than the raw close price, and it is commonly used by investors and traders to track performance.
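
To illustrate why the adjustment matters, here is a simplified sketch that handles only a stock split and ignores dividends and new offerings. The prices, the 2-for-1 split ratio, and the split position are hypothetical; data providers use more involved procedures.

```python
import pandas as pd

# Hypothetical raw close prices around a 2-for-1 split
close = pd.Series([100.0, 102.0, 51.5, 52.0])
split_ratio = 2.0
split_day = 2  # position of the first post-split close

# Divide the pre-split prices by the split ratio so the series is comparable
adjusted = close.copy()
adjusted.iloc[:split_day] = close.iloc[:split_day] / split_ratio

print(adjusted)
```

Without the adjustment, the raw series would show an apparent drop of roughly 50% on the split day even though shareholders lost no value.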

In [None]:
# Generate a plot for the top five companies in each sector
for sector in top_data['Sector'].unique():
    
    # Filter the data for the current sector
    sector_data = top_data[top_data['Sector'] == sector]
    
    # Reshape the data for plotting purposes
    sec_as_row = sector_data.pivot(index = 'Date', columns = 'Symbol', values = 'Adjusted Close')
    
    # Generate plot
    sec_as_row.plot(title = f'{sector}: Adjusted Close Price vs. Date', legend = True, xlabel = 'Date', ylabel = 'Adjusted Close Price', figsize = (10, 5))

Above are 11 line plots of adjusted close price vs. date for the top five companies (by weight) in each sector.

In the Health Care sector, one company had a much higher adjusted close price, while the other four were closely correlated with one another. This pattern, in which the top company has a significantly greater price while the other four are much lower but close to each other, is visible in several of these graphs. In addition to the Health Care sector, it is present in the Financials, Consumer Staples, and Industrials sectors. Interestingly, in the Industrials sector, as the top company's adjusted close price began to fall, the other four companies' prices rose together rather than one company taking over and continuing the trend.

Other sectors have closer adjusted close prices among the top five companies. For example, in the Information Technology sector, ACN, MSFT, and CRM follow similar growth trends and maintain similar prices over the years, while CSCO and AAPL trail behind. Also, in the Energy sector, MPC, XOM, COP, and EOG follow essentially the same trend and price level, while SLB is consistently lower. Within this sector, four companies are equally competitive, in contrast to the single-company dominance seen in other sectors.

### Plotting Volume vs. Date

Volume traded is the number of shares that change hands between buyers and sellers during a trading day. This is an important metric for investors and traders to consider.

In [None]:
# Generate a plot for the top five companies in each sector
for sector in top_data['Sector'].unique():
    
    # Filter the data for the current sector
    sector_data = top_data[top_data['Sector'] == sector]
    
    # Reshape the data for plotting purposes
    sec_as_row = sector_data.pivot(index = 'Date', columns = 'Symbol', values = 'Volume')
    
    # Generate plot
    sec_as_row.plot(title = f'{sector}: Volume vs. Date', legend = True, xlabel = 'Date', ylabel = 'Volume', figsize = (10, 5))

Above are 11 line plots of volume traded vs. date for the top five companies (by weight) in each sector.

It is apparent that while there are correlations among the companies within each sector, one company often dominates the volume traded or has strong, isolated shifts. For example, in the Consumer Discretionary sector, Amazon (ticker AMZN) has had the greatest volume traded since approximately 2000, with volumes of up to 2 billion shares at certain points. Similarly, in the Financials sector, Bank of America (ticker BAC) has had the greatest volume traded since approximately 2010, with volumes of up to 1 billion shares at certain points, whereas its competitors have only reached about 600 million shares.

### Calculating Various Moving Averages

A moving average smooths the price of a stock by replacing each price with the average price over a fixed window of the most recent trading days, updated as each new day arrives. The window length is chosen in advance. We use two common short-term windows: 10 days and 20 days.
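
As a small worked example, the sketch below computes a 3-day moving average on a hypothetical five-day price series. The window of 3 is chosen only to keep the numbers easy to follow.

```python
import pandas as pd

# Hypothetical adjusted close prices for five consecutive trading days
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Each value is the mean of the current price and the two before it
ma3 = prices.rolling(3).mean()

print(ma3)
```

The first two entries are missing because a full 3-day window is not yet available; the same happens for the first 9 and 19 rows of the 10-day and 20-day averages.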

In [None]:
# Lengths of moving averages (in days) to calculate
moving_averages = [10, 20]

# Iterate across the moving averages
for ma in moving_averages:

    # Compute the moving average within each company so that a window
    # never spans the prices of two different companies
    top_data[f'{ma}-Day Moving Average'] = (
        top_data.groupby('Security')['Adjusted Close']
        .transform(lambda prices: prices.rolling(ma).mean())
    )
    
# Print the last five rows of the data frame
top_data.tail()

### Calculating Daily Returns

Daily return is the percentage change in the price of a stock from one trading day's close to the next. This will help us assess the risk of investing in a particular company.
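
The sketch below checks the definition on a hypothetical three-day price series by comparing pandas' `pct_change` with the formula written out by hand, (today's price - yesterday's price) / yesterday's price.

```python
import pandas as pd

# Hypothetical adjusted close prices for three consecutive trading days
prices = pd.Series([100.0, 110.0, 99.0])

# Daily return computed by pandas
returns = prices.pct_change()

# The same quantity written out explicitly
manual = (prices - prices.shift(1)) / prices.shift(1)

print(returns)
```

The first value is missing because there is no previous day to compare against.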

In [None]:
# Calculate the daily return (percent change in adjusted close price)
# within each company, so that the first row of one company is never
# compared against the last row of another
top_data['Daily Return'] = top_data.groupby('Security')['Adjusted Close'].pct_change()

# Print the last five rows of the top data frame
top_data.tail()

### Plotting and Comparing the Daily Returns of Various Stocks

Next, we will plot the daily returns of various stocks against one another. This will help us assess whether the stock prices of companies in the same sector are strongly correlated or not. We expect that they are linearly and positively correlated.

In [None]:
# Initialize a data frame to contain the formatted data for plotting
formatted_data = top_data[['Symbol', 'Date', 'Daily Return']]

# Pivots the ticker symbols from a column's entries to column headers
formatted_data = formatted_data.pivot(index = 'Date', columns = 'Symbol', values = 'Daily Return')

# Print the last five rows of the formatted data frame
formatted_data.tail()
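
One way to quantify how strongly daily returns move together is a Pearson correlation matrix. The sketch below uses synthetic returns, not the project's data: the tickers, the shared "market" component, and the noise levels are all assumptions chosen so that two series are related and a third is independent.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns: AAA and BBB share a common component, CCC does not
rng = np.random.default_rng(0)
market = rng.normal(0, 0.01, size=250)
returns = pd.DataFrame({
    'AAA': market + rng.normal(0, 0.005, size=250),
    'BBB': market + rng.normal(0, 0.005, size=250),
    'CCC': rng.normal(0, 0.01, size=250),
})

# Pairwise Pearson correlation between the columns
corr = returns.corr()

print(corr.round(2))
```

Applied to a date-by-symbol table of daily returns, the same `corr()` call would give one coefficient per pair of companies, which could then be drawn as a heatmap.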

In [None]:
# Graph here

## Data Analysis, Hypothesis Testing, and Machine Learning

FIGURE OUT THE 5 X 5 for the 11 sectors

Next, we extract the top company by weight in each sector. Because training a machine learning model is time-consuming and resource-intensive, and our computing resources are limited, we restrict the modeling to these companies.

In [None]:
# Initialize an empty data frame to contain the filtered data
topmost_data = pd.DataFrame()

# Iterate across a list of the unique sectors
for sector in top_data['Sector'].unique():
    
    # Filter the data by the current sector
    sector_data = top_data[top_data['Sector'] == sector]

    # Get top weight in the current sector
    top_weight = sorted(sector_data['Weight'].unique(), reverse = True)[:1]
    
    # Get company with the calculated top market cap
    sector_data = sector_data[sector_data['Weight'].isin(top_weight)]
    
    # Concatenate the company's data into the accumulating dataframe
    topmost_data = pd.concat([topmost_data, sector_data], ignore_index = True)

# Reassign top data
top_data = topmost_data

# Print the last five rows of the filtered data frame
top_data.tail()

In [None]:
# Get necessary columns
formatted_data = top_data[['Symbol', 'Date', 'Daily Return']]

# Pivot to change format such that symbols are columns
formatted_data = formatted_data.pivot(index='Date', columns='Symbol', values='Daily Return')

# Print last 5 rows
formatted_data.tail()

In [None]:
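# Select the close prices for a single company (Apple)
close_data = top_data.filter(['Close', 'Symbol'])
close_data = close_data[close_data['Symbol'] == 'AAPL']
close_data = close_data.drop(columns = ['Symbol'])

# Convert the close prices to a 2-D numpy array of shape (n, 1)
pre_train = close_data.values

# Use the first 95% of the rows for training
training_data_len = int(np.ceil(len(pre_train) * .95))

# Display the number of training rows
training_data_len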

In [None]:
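# Scale the close prices to the range [0, 1]; the scaler is fit on the
# training rows only so that the test rows do not influence the scaling
scaler = MinMaxScaler(feature_range = (0, 1))
scaler.fit(pre_train[:training_data_len])
scaled_data = scaler.transform(pre_train)

scaled_data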
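
For reference, min-max scaling to the range [0, 1] maps each value x to (x - min) / (max - min). The sketch below reproduces that formula with plain NumPy on three hypothetical prices and then inverts it, which is the same idea as the scaler's `inverse_transform`.

```python
import numpy as np

# Hypothetical close prices as a column vector of shape (n, 1)
prices = np.array([[10.0], [20.0], [30.0]])

# Minimum and maximum of the column used to define the scaling
lo, hi = prices.min(axis=0), prices.max(axis=0)

# Scale to [0, 1], then map back to the original price units
scaled = (prices - lo) / (hi - lo)
restored = scaled * (hi - lo) + lo

print(scaled.ravel())
```

Keeping the minimum and maximum around is what makes it possible to convert the model's scaled predictions back into prices later.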

In [None]:
# Select the scaled training rows
train_data = scaled_data[0:int(training_data_len), :]

# Build the training features (previous 60 prices) and targets (next price)
x_train = []
y_train = []

for i in range(60, len(train_data)):
    x_train.append(train_data[i-60:i, 0])
    y_train.append(train_data[i, 0])

    # Print the first two windows as a sanity check
    if i <= 61:
        print(x_train)
        print(y_train)
        print()

# Convert x_train and y_train to numpy arrays
x_train, y_train = np.array(x_train), np.array(y_train)

# Reshape the features to (samples, time steps, features), as the LSTM expects
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
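
The windowing step is easier to follow on a tiny example. The sketch below uses a hypothetical six-value series and a window of 3 instead of 60; each input row holds the previous three values and the target is the value that follows.

```python
import numpy as np

# Hypothetical scaled series: 0.0, 1.0, ..., 5.0
series = np.arange(6, dtype=float)
window = 3  # stands in for the 60-day window

# Each sample is the `window` values preceding position i; the target is series[i]
x = np.array([series[i - window:i] for i in range(window, len(series))])
y = series[window:]

# Add a trailing feature axis: (samples, time steps, features)
x = x.reshape(x.shape[0], x.shape[1], 1)

print(x.shape, y)
```

A series of length n therefore yields n - window training samples, which is why the first 60 rows of the training data never appear as targets.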

In [None]:
# Build the LSTM model
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape= (x_train.shape[1], 1)))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(25))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(x_train, y_train, batch_size=1, epochs=1)

In [None]:
# Create the testing data set
# Start 60 rows before the end of the training data so that the first test
# window has a full 60-day history
test_data = scaled_data[training_data_len - 60: , :]
# Create the data sets x_test and y_test
x_test = []
y_test = pre_train[training_data_len:, :]
for i in range(60, len(test_data)):
    x_test.append(test_data[i-60:i, 0])
    
# Convert the data to a numpy array
x_test = np.array(x_test)

# Reshape the data
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1 ))

# Get the model's predicted price values and undo the scaling
predictions = model.predict(x_test)
predictions = scaler.inverse_transform(predictions)

# Get the root mean squared error (RMSE)
rmse = np.sqrt(np.mean(((predictions - y_test) ** 2)))
rmse
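
As a sanity check on the metric, the sketch below computes RMSE by hand for three hypothetical predictions. Both arrays are kept as column vectors of the same shape; if one were flattened to shape (n,) while the other stayed (n, 1), NumPy broadcasting would produce an n-by-n difference matrix and a misleading result.

```python
import numpy as np

# Hypothetical actual and predicted prices, both of shape (3, 1)
actual = np.array([[100.0], [102.0], [104.0]])
predicted = np.array([[101.0], [100.0], [104.0]])

# Root mean squared error: square the errors, average them, take the root
rmse = np.sqrt(np.mean((predicted - actual) ** 2))

print(rmse)
```

RMSE is expressed in the same units as the prices, so it can be read as a typical prediction error in dollars.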

In [None]:
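# Split the AAPL rows into the training and validation portions
aapl_data = top_data[top_data['Symbol'] == 'AAPL']
train = aapl_data[:training_data_len]
valid = aapl_data[training_data_len:].copy()

# Attach the model's predictions to the validation rows
valid['Predictions'] = predictions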
# Visualize the data
plt.figure(figsize=(16,6))
plt.title('Model')
plt.xlabel('Date', fontsize=18)
plt.ylabel('Close Price USD ($)', fontsize=18)
plt.plot(train['Close'])
plt.plot(valid[['Close', 'Predictions']])
plt.legend(['Train', 'Val', 'Predictions'], loc='lower right')
plt.show()

## Insights