# A Deep Dive into the S&P 500: Predicting Stock Prices
Kanishk Chinnapapannagari, Aarav Naveen, Avyay Potarlanka, and Melvin Rajendran

## Introduction

In today’s evolving financial landscape, both investors and traders are constantly seeking an edge to make informed decisions. The stock market, which contains an intricate web of variables and is influenced by numerous factors, has proven to be a difficult environment to navigate.

In the past, investment-related decisions were often made based on analysis of historical trends. However, the advancement of data science and machine learning techniques has introduced a new opportunity to potentially predict future stock prices with reasonable accuracy and thus gain valuable insights.

This data science project explores the prediction of stock prices within the Standard & Poor’s 500 index, otherwise known as the S&P 500. This index tracks approximately 500 of the largest publicly traded companies in the United States and represents approximately 80% of the U.S. stock market’s total value. Hence, it serves as a strong indicator of movement within the market. To learn more about the S&P 500 and other popular indices in the U.S., read this article: https://www.investopedia.com/insights/introduction-to-stock-market-indices/.

Throughout this project, we will follow a comprehensive data science approach that includes the following steps:
* Data collection
* Data processing
* Exploratory data analysis and data visualization
* Data analysis, hypothesis testing, and machine learning (ML)
* Insight formation

Our project aims to leverage predictive modeling techniques to provide insights to investors. The analysis herein aims to identify stocks that appear undervalued and are therefore likely to increase in price in the near future, which suggests that investors should consider buying or holding shares. Likewise, it aims to identify stocks that appear overvalued and are likely to decrease in price, which suggests that investors should consider selling their positions.

In [1]:
# Import necessary libraries
from bs4 import BeautifulSoup
import numpy as np
import os
import pandas as pd
import requests

## Data Collection

In [2]:
# Initialize a list to collect one data frame per ticker
frames = []

# Initialize the path to the folder containing the data
folder_path = 'sp500-data'

# Iterate across each file in the folder by name
for file_name in os.listdir(folder_path):
    # Check if the current file is a CSV file
    if file_name.endswith('.csv'):
        # Read the current file into a temporary data frame
        temp = pd.read_csv(os.path.join(folder_path, file_name))

        # Extract the ticker from the file's name and store it in a new column
        temp['Ticker'] = os.path.splitext(file_name)[0]

        # Collect the temporary data frame
        frames.append(temp)

# Concatenate all collected data frames in a single pass
data = pd.concat(frames, ignore_index = True) if frames else pd.DataFrame()

# Reindex the data frame's columns
data = data.reindex(columns = ['Ticker', 'Date', 'Open', 'High', 'Low', 'Close', 'Adjusted Close', 'Volume'])

# Print the first five rows of the data frame
data.head()

|   | Ticker | Date | Open | High | Low | Close | Adjusted Close | Volume |
|---|---|---|---|---|---|---|---|---|
| 0 | CSCO | 16-02-1990 | 0.0 | 0.079861 | 0.073785 | 0.077257 | 0.054862 | 940636800.0 |
| 1 | CSCO | 20-02-1990 | 0.0 | 0.079861 | 0.074653 | 0.079861 | 0.056712 | 151862400.0 |
| 2 | CSCO | 21-02-1990 | 0.0 | 0.078993 | 0.075521 | 0.078125 | 0.055479 | 70531200.0 |
| 3 | CSCO | 22-02-1990 | 0.0 | 0.081597 | 0.078993 | 0.078993 | 0.056095 | 45216000.0 |
| 4 | CSCO | 23-02-1990 | 0.0 | 0.079861 | 0.078125 | 0.078559 | 0.055787 | 44697600.0 |
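
The `Date` values in the output above appear in day-month-year form (for example, `16-02-1990`). The following is a minimal sketch of converting such strings to datetimes, assuming the `%d-%m-%Y` format holds for every file. The small `sample` frame is a stand-in for `data`, built from the rows shown above, and is not part of the original notebook.

```python
import pandas as pd

# Stand-in for a few rows of `data`, copied from the output above
sample = pd.DataFrame({
    'Ticker': ['CSCO', 'CSCO', 'CSCO'],
    'Date': ['16-02-1990', '20-02-1990', '21-02-1990'],
    'Close': [0.077257, 0.079861, 0.078125],
})

# Parse the day-first strings with an explicit format, so that ambiguous
# values such as 05-03-1990 are not interpreted as month-first
sample['Date'] = pd.to_datetime(sample['Date'], format='%d-%m-%Y')

# Sort chronologically within each ticker
sample = sample.sort_values(['Ticker', 'Date']).reset_index(drop=True)
```

Passing an explicit `format` also makes malformed dates raise an error instead of being silently misread, which may be preferable when combining many files.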


In [3]:
# Headers for the HTTP request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'From': 'pleaseletmein@gmail.com'
}

# Make an HTTP request to Liberated Stock Trader's URL and store the response
response = requests.get('https://www.liberatedstocktrader.com/sp-500-companies-list-by-sector-market-cap/', headers = headers)

# Stop early if the request failed, rather than parsing an error page
response.raise_for_status()

# Parse the text from the webpage as HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find all table elements and store the second one, which contains the data
table = soup.find_all('table')[1]

In [4]:
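# Wrap the HTML in StringIO, since recent pandas versions deprecate passing literal HTML strings
from io import StringIO

# Store all HTML data in a data frame
df = pd.read_html(StringIO(str(table)), flavor="html5lib")[0]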

# New column names
columns = {
    0: 'Ticker', 
    1: 'Company',
    2: 'Sector',
    3: 'Market Cap',
    4: 'PE Ratio',
}

# Rename columns
df.rename(columns=columns, inplace=True)

# Remove the first row, which contains the original column titles rather than data
df = df.drop(0)

df.head()

|   | Ticker | Company | Sector | Market Cap | PE Ratio |
|---|---|---|---|---|---|
| 1 | MRO | Marathon Oil Corporation | Energy Minerals | 17388156106 | 2262.602 |
| 2 | MO | Altria Group, Inc. | Consumer Non-Durables | 81364639112 | 522.1622 |
| 3 | ANET | Arista Networks, Inc. | Electronic Technology | 35807001092 | 471.4286 |
| 4 | MTB | M&T Bank Corporation | Finance | 25831948062 | 397.9184 |
| 5 | CI | Cigna Corporation | Health Services | 94846357195 | 332.8684 |
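
Because the title row was read as a data row before being dropped, the `Market Cap` and `PE Ratio` columns may hold strings rather than numbers. The following is a minimal sketch of converting them with `pd.to_numeric`. The `scraped` frame is a stand-in for `df`, built from the rows shown above, and the assumption that the values are stored as strings has not been verified against the live page.

```python
import pandas as pd

# Stand-in for the scraped table, assuming values arrived as strings
scraped = pd.DataFrame({
    'Ticker': ['MRO', 'MO', 'ANET'],
    'Market Cap': ['17388156106', '81364639112', '35807001092'],
    'PE Ratio': ['2262.602', '522.1622', '471.4286'],
})

# Convert the numeric columns; entries that cannot be parsed become NaN
# instead of raising an error
for column in ['Market Cap', 'PE Ratio']:
    scraped[column] = pd.to_numeric(scraped[column], errors='coerce')
```

Using `errors='coerce'` keeps the conversion from failing on placeholder values, at the cost of introducing missing values that need to be handled later.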


In [7]:
# Merge the price data with the company data on the shared Ticker column (inner join by default)
df2 = pd.merge(data, df, on='Ticker')

df2.head()

|   | Ticker | Date | Open | High | Low | Close | Adjusted Close | Volume | Company | Sector | Market Cap | PE Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CSCO | 16-02-1990 | 0.0 | 0.079861 | 0.073785 | 0.077257 | 0.054862 | 940636800.0 | Cisco Systems, Inc. | Electronic Technology | 201297048340 | 33.0431 |
| 1 | CSCO | 20-02-1990 | 0.0 | 0.079861 | 0.074653 | 0.079861 | 0.056712 | 151862400.0 | Cisco Systems, Inc. | Electronic Technology | 201297048340 | 33.0431 |
| 2 | CSCO | 21-02-1990 | 0.0 | 0.078993 | 0.075521 | 0.078125 | 0.055479 | 70531200.0 | Cisco Systems, Inc. | Electronic Technology | 201297048340 | 33.0431 |
| 3 | CSCO | 22-02-1990 | 0.0 | 0.081597 | 0.078993 | 0.078993 | 0.056095 | 45216000.0 | Cisco Systems, Inc. | Electronic Technology | 201297048340 | 33.0431 |
| 4 | CSCO | 23-02-1990 | 0.0 | 0.079861 | 0.078125 | 0.078559 | 0.055787 | 44697600.0 | Cisco Systems, Inc. | Electronic Technology | 201297048340 | 33.0431 |
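
`pd.merge` performs an inner join by default, so tickers that appear in only one of the two data frames are dropped without warning. The following is a minimal sketch of checking for such mismatches with an outer merge and `indicator=True`. The `prices` and `info` frames are hypothetical stand-ins for `data` and `df`, and the ticker `XYZ` is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the price data and the company data
prices = pd.DataFrame({
    'Ticker': ['CSCO', 'CSCO', 'XYZ'],
    'Close': [0.077257, 0.079861, 1.0],
})
info = pd.DataFrame({
    'Ticker': ['CSCO', 'MO'],
    'Sector': ['Electronic Technology', 'Consumer Non-Durables'],
})

# An outer merge with indicator=True adds a `_merge` column that labels
# each row as 'both', 'left_only', or 'right_only'
check = pd.merge(prices, info, on='Ticker', how='outer', indicator=True)

# Tickers that an inner join would drop from each side
only_prices = sorted(check.loc[check['_merge'] == 'left_only', 'Ticker'].unique())
only_info = sorted(check.loc[check['_merge'] == 'right_only', 'Ticker'].unique())
```

Here `only_prices` lists tickers with price history but no company row, and `only_info` lists the reverse. Reviewing both lists before relying on the merged frame may reveal naming differences between the two sources.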


## Data Processing

## Exploratory Data Analysis and Data Visualization

## Data Analysis, Hypothesis Testing, and Machine Learning

## Insights