# A Deep Dive into the S&P 500: Predicting Stock Prices
Kanishk Chinnapapannagari, Aarav Naveen, Avyay Potarlanka, and Melvin Rajendran

## Introduction

In today’s evolving financial landscape, both investors and traders are constantly seeking an edge to make informed decisions. The stock market, which contains an intricate web of variables and is influenced by numerous factors, has proven to be a difficult environment to navigate.

In the past, investment-related decisions were often made based on analysis of historical trends. However, the advancement of data science and machine learning techniques has introduced a new opportunity to potentially predict future stock prices with reasonable accuracy and thus gain valuable insights.

This data science project delves into the prediction of stock prices within the Standard & Poor’s 500 index, otherwise known as the S&P 500. This index tracks roughly 500 of the largest publicly traded companies in the United States and represents approximately 80% of the U.S. stock market’s total value. Hence, it serves as a strong indicator of overall market movement. To learn more about the S&P 500 and other popular indices in the U.S., read this article: https://www.investopedia.com/insights/introduction-to-stock-market-indices/.

Throughout this project, we will follow a comprehensive data science approach that includes the following steps:
* Data collection
* Data processing
* Exploratory data analysis and data visualization
* Data analysis, hypothesis testing, and machine learning (ML)
* Insight formation

Our project aims to leverage predictive modeling techniques to provide insights to investors. The analysis herein aims to identify stocks that appear undervalued and may therefore increase in price in the near future, suggesting that investors should consider buying or holding shares. Likewise, it aims to identify stocks that appear overvalued and may soon decrease in price, suggesting that investors should consider selling their positions.

In [52]:
# Import necessary libraries
from bs4 import BeautifulSoup
import numpy as np
import os
import pandas as pd
import requests

## Data Collection

To gather information about the S&P 500 companies, we will be using the following dataset: https://www.kaggle.com/datasets/paultimothymooney/stock-market-data. This Kaggle dataset contains the date, trading volume, and prices for stocks in the NASDAQ, NYSE, and S&P 500. For the purposes of this project, we will only analyze the stock prices of companies in the S&P 500.

In [17]:
# Initialize the path to the folder containing the data
folder_path = 'sp500-data'

# Collect one data frame per CSV file
frames = []

# Iterate across each file in the folder by name
for file_name in os.listdir(folder_path):

    # Skip anything that is not a CSV file
    if not file_name.endswith('.csv'):
        continue

    # Read the current file into a temporary data frame
    temp = pd.read_csv(os.path.join(folder_path, file_name))

    # Extract the ticker from the file name (e.g. 'A.csv' -> 'A') and store it in a new column
    temp['Ticker'] = os.path.splitext(file_name)[0]

    frames.append(temp)

# Concatenate all frames once instead of re-copying the accumulated data on every iteration
price_data = pd.concat(frames, ignore_index = True)

# Print the first five rows of the price data frame
price_data.head()

| | Date | Low | Open | Volume | High | Close | Adjusted Close | Ticker |
|---|---|---|---|---|---|---|---|---|
| 0 | 18-11-1999 | 28.612303 | 32.546494 | 62546380.0 | 35.765381 | 31.473534 | 26.92976 | A |
| 1 | 19-11-1999 | 28.478184 | 30.713518 | 15234146.0 | 30.758226 | 28.880545 | 24.711119 | A |
| 2 | 22-11-1999 | 28.657009 | 29.551144 | 6577870.0 | 31.473534 | 31.473534 | 26.92976 | A |
| 3 | 23-11-1999 | 28.612303 | 30.400572 | 5975611.0 | 31.205294 | 28.612303 | 24.481602 | A |
| 4 | 24-11-1999 | 28.612303 | 28.701717 | 4843231.0 | 29.998213 | 29.372318 | 25.131901 | A |
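
The `Date` values in the output above are day-month-year strings, so any time-based analysis would first need them converted to real dates. The following is a minimal sketch of one way to do that, not necessarily the approach this project takes; the `sample` data frame is a hypothetical stand-in that copies a few values from the rows shown above.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of the price data (deliberately out of order)
sample = pd.DataFrame({
    'Date': ['19-11-1999', '18-11-1999', '22-11-1999'],
    'Close': [28.880545, 31.473534, 31.473534],
    'Ticker': ['A', 'A', 'A'],
})

# Parse the strings with an explicit day-month-year format so that a value
# such as 05-01-1970 is read as 5 January rather than 1 May
sample['Date'] = pd.to_datetime(sample['Date'], format='%d-%m-%Y')

# Order the rows chronologically within each ticker
sample = sample.sort_values(['Ticker', 'Date']).reset_index(drop=True)
```

Passing `format` explicitly avoids relying on pandas to guess the day/month order, which is ambiguous for dates early in a month.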


We noticed that the Kaggle dataset does not contain sector data. For this reason, we will supplement our existing data with the table on the following page: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies. By web scraping this page's list of the S&P 500 companies, we can match each company in our existing data to its corresponding GICS sector and sub-industry. This will enable us to perform analysis by sector and/or sub-industry, and thus reduce sector-related bias in our modeling.

In [14]:
# Headers for the HTTP request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'From': 'pleaseletmein@gmail.com'
}

# Make an HTTP request to the Wikipedia URL and store the response
response = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies', headers = headers)

# Stop early if the request failed
response.raise_for_status()

# Parse the text from the webpage as HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find the first table element on the page, which contains the list of companies
table = soup.find('table')

# Read the HTML table into a data frame; the markup is wrapped in StringIO
# because recent pandas versions deprecate passing literal HTML strings
from io import StringIO
sector_data = pd.read_html(StringIO(str(table)), flavor = 'html5lib')[0]

# Print the first five rows of the sector data frame
sector_data.head()

| | Symbol | Security | GICS Sector | GICS Sub-Industry | Headquarters Location | Date added | CIK | Founded |
|---|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1957-03-04 | 66740 | 1902 |
| 1 | AOS | A. O. Smith | Industrials | Building Products | Milwaukee, Wisconsin | 2017-07-26 | 91142 | 1916 |
| 2 | ABT | Abbott | Health Care | Health Care Equipment | North Chicago, Illinois | 1957-03-04 | 1800 | 1888 |
| 3 | ABBV | AbbVie | Health Care | Pharmaceuticals | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| 4 | ACN | Accenture | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |


## Data Processing

In [15]:
# Store only top 5 by market cap in each sector
topsect_df = pd.DataFrame()

# Get unique sectors in an array
sectors = sector_data['Sector'].unique()

# Traverse unique sectors
for sector in sectors:
    # Create a sub dataframe for the sector
    sub_df = sector_data[sector_data['Sector'] == sector]

    # Sort sub dataframe by market cap
    sub_df = sub_df.sort_values('Market Cap', ascending=False, key=lambda x: x.astype(int))

    # Merge into dataframe of all sectors' top 5
    topsect_df = pd.concat([topsect_df, sub_df.head(5)], ignore_index = True)

# Print 5 rows
topsect_df.head()

KeyError: 'Sector'
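
The `KeyError` above is consistent with the scraped table shown earlier, which names its sector column `GICS Sector` and has no `Market Cap` column at all. As a rough sketch of the intended selection, assuming market capitalization were obtained from some other source, the top companies per sector could be picked as follows. The helper `top_n_by_sector`, the `example` data frame, and its market-cap figures are all hypothetical placeholders rather than real data.

```python
import pandas as pd

def top_n_by_sector(df, n=5, sector_col='GICS Sector', cap_col='Market Cap'):
    """Return the n rows with the largest cap_col value within each sector."""
    # Sort the whole frame once, largest market cap first
    ranked = df.sort_values(cap_col, ascending=False)
    # Keep the first n rows of every sector group, preserving the sorted order
    return ranked.groupby(sector_col, sort=False).head(n).reset_index(drop=True)

# Hypothetical data: symbols and market-cap values are placeholders, not real figures
example = pd.DataFrame({
    'Symbol': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
    'GICS Sector': ['Energy', 'Energy', 'Energy', 'Health Care', 'Health Care'],
    'Market Cap': [300, 100, 200, 50, 75],
})

# Keep the two largest companies in each sector
top_two = top_n_by_sector(example, n=2)
```

Sorting once and then calling `groupby(...).head(n)` avoids building and concatenating a separate data frame for every sector.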

In [57]:
# Merge both data frames into a single data frame
data = pd.merge(price_data, sector_data, on = 'Ticker')

# Remove two unneeded columns from the data frame
data = data.drop(columns = ['Market Cap', 'PE Ratio'])

# Print the first five rows of the data frame
data.head()

| | Ticker | Company | Sector | Date | Open | High | Low | Close | Adjusted Close | Volume |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | XOM | Exxon Mobil Corporation | Energy | 02-01-1970 | 1.929688 | 1.9375 | 1.925781 | 1.9375 | 0.169896 | 1174400.0 |
| 1 | XOM | Exxon Mobil Corporation | Energy | 05-01-1970 | 1.9375 | 1.96875 | 1.933594 | 1.96875 | 0.172636 | 1881600.0 |
| 2 | XOM | Exxon Mobil Corporation | Energy | 06-01-1970 | 1.96875 | 1.972656 | 1.945313 | 1.964844 | 0.172294 | 1232000.0 |
| 3 | XOM | Exxon Mobil Corporation | Energy | 07-01-1970 | 1.964844 | 1.964844 | 1.949219 | 1.953125 | 0.171266 | 918400.0 |
| 4 | XOM | Exxon Mobil Corporation | Energy | 08-01-1970 | 1.953125 | 1.96875 | 1.945313 | 1.957031 | 0.171608 | 1075200.0 |
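
The merge above joins on a `Ticker` column, whereas the Wikipedia table shown earlier labels that column `Symbol` (and uses `Security` and `GICS Sector` for the company and sector). If the scraped table were used directly, the column names would need to be aligned first. The following is a minimal sketch under that assumption; the `prices` and `sectors` data frames are hypothetical miniatures, and the MMM price is a placeholder value.

```python
import pandas as pd

# Hypothetical miniature of the price data (the MMM row is a placeholder)
prices = pd.DataFrame({
    'Date': ['02-01-1970', '05-01-1970', '02-01-1970'],
    'Close': [1.9375, 1.96875, 1.0],
    'Ticker': ['XOM', 'XOM', 'MMM'],
})

# Hypothetical miniature of the scraped sector table, using its column names
sectors = pd.DataFrame({
    'Symbol': ['MMM', 'XOM'],
    'Security': ['3M', 'Exxon Mobil Corporation'],
    'GICS Sector': ['Industrials', 'Energy'],
})

# Align the column names on the sector side before merging
sectors = sectors.rename(columns={
    'Symbol': 'Ticker',
    'Security': 'Company',
    'GICS Sector': 'Sector',
})

# validate='many_to_one' raises an error if a ticker appears more than once
# in the sector table, which would otherwise silently duplicate price rows
merged = pd.merge(prices, sectors, on='Ticker', how='inner', validate='many_to_one')
```

An inner join drops any price rows whose ticker is missing from the sector table, so comparing `len(prices)` with `len(merged)` is a quick check for unmatched tickers.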


## Exploratory Data Analysis and Data Visualization

## Data Analysis, Hypothesis Testing, and Machine Learning

## Insights