# A Deep Dive into the S&P 500: Predicting Stock Prices
Kanishk Chinnapapannagari, Aarav Naveen, Avyay Potarlanka, and Melvin Rajendran

## Introduction

In today’s evolving financial landscape, both investors and traders are constantly seeking an edge to make informed decisions. The stock market, which contains an intricate web of variables and is influenced by numerous factors, has proven to be a difficult environment to navigate.

In the past, investment-related decisions were often made based on analysis of historical trends. However, the advancement of data science and machine learning techniques has introduced a new opportunity to potentially predict future stock prices with reasonable accuracy and thus gain valuable insights.

This data science project explores the prediction of stock prices within the Standard & Poor’s 500 index, otherwise known as the S&P 500. The index tracks 500 of the largest companies in the United States and represents approximately 80% of the U.S. stock market’s total value, so it serves as a strong indicator of overall market movement. To learn more about the S&P 500 and other popular indices in the U.S., read this article: https://www.investopedia.com/insights/introduction-to-stock-market-indices/.

Throughout this project, we will follow a comprehensive data science approach that includes the following steps:
* Data collection
* Data processing
* Exploratory data analysis and data visualization
* Data analysis, hypothesis testing, and machine learning (ML)
* Insight formation

Our project aims to leverage predictive modeling techniques to provide insights to investors. The analysis herein seeks to identify stocks that appear undervalued and are therefore likely to increase in price in the near future, meaning investors should consider buying or holding shares. Likewise, it seeks to identify stocks that appear overvalued and are likely to decrease in price soon, indicating that investors should consider selling their positions.

In [1]:
# Import necessary libraries
from bs4 import BeautifulSoup
import numpy as np
import os
import pandas as pd
import requests

## Data Collection

In [2]:
# Initialize an empty data frame to store the stock price data
price_data = pd.DataFrame()

# Initialize the path to the folder containing the data
folder_path = 'sp500-data'

# Iterate across each file in the folder by name
for file_name in os.listdir(folder_path):
    
    # Check if the current file is a CSV file
    if file_name.endswith('.csv'):
        
        # Read the current file into a temporary data frame
        temp = pd.read_csv(os.path.join(folder_path, file_name))
        
        # Extract the ticker from the current file's name
        ticker = file_name[0:-4]
        
        # Store the ticker in a new column in the temporary data frame 
        temp['Ticker'] = ticker
        
        # Concatenate the accumulating and temporary data frames
        price_data = pd.concat([price_data, temp], ignore_index = True)

# Reindex the data frame's columns
price_data = price_data.reindex(columns = ['Ticker', 'Date', 'Open', 'High', 'Low', 'Close', 'Adjusted Close', 'Volume'])

# Print the first five rows of the price data frame
price_data.head()

| | Ticker | Date | Open | High | Low | Close | Adjusted Close | Volume |
|---|---|---|---|---|---|---|---|---|
| 0 | CSCO | 16-02-1990 | 0.0 | 0.079861 | 0.073785 | 0.077257 | 0.054862 | 940636800.0 |
| 1 | CSCO | 20-02-1990 | 0.0 | 0.079861 | 0.074653 | 0.079861 | 0.056712 | 151862400.0 |
| 2 | CSCO | 21-02-1990 | 0.0 | 0.078993 | 0.075521 | 0.078125 | 0.055479 | 70531200.0 |
| 3 | CSCO | 22-02-1990 | 0.0 | 0.081597 | 0.078993 | 0.078993 | 0.056095 | 45216000.0 |
| 4 | CSCO | 23-02-1990 | 0.0 | 0.079861 | 0.078125 | 0.078559 | 0.055787 | 44697600.0 |
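
The `Date` column in the output above holds day-first strings such as `16-02-1990`. Below is a minimal sketch of converting such a column to a datetime type, assuming every row follows the `DD-MM-YYYY` pattern. The small frame `sample` is a hypothetical stand-in for `price_data`, built from a few of the rows shown above.

```python
import pandas as pd

# Hypothetical stand-in for price_data, using rows from the output above
sample = pd.DataFrame({
    'Ticker': ['CSCO', 'CSCO', 'CSCO'],
    'Date': ['20-02-1990', '16-02-1990', '21-02-1990'],
    'Close': [0.079861, 0.077257, 0.078125],
})

# Parse the day-first strings with an explicit format so that an
# ambiguous date such as 03-04-1990 is not read as March 4
sample['Date'] = pd.to_datetime(sample['Date'], format='%d-%m-%Y')

# Sort chronologically within each ticker
sample = sample.sort_values(['Ticker', 'Date']).reset_index(drop=True)

print(sample)
```

With a true datetime column, chronological sorting and date-based filtering behave as expected, which plain strings in day-first order would not guarantee.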


In [3]:
# Headers for the HTTP request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'From': 'pleaseletmein@gmail.com'
}

# Make an HTTP request to Liberated Stock Trader's URL and store the response
response = requests.get('https://www.liberatedstocktrader.com/sp-500-companies-list-by-sector-market-cap/', headers = headers)

# Parse the text from the webpage as HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find the second table element on the page, which contains the data
table = soup.find_all('table')[1]

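# Read the HTML table into a data frame (wrapped in StringIO because recent
# pandas versions deprecate passing a literal HTML string to read_html)
from io import StringIO
sector_data = pd.read_html(StringIO(str(table)), flavor = 'html5lib')[0]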

# Name the data frame's columns
sector_data.columns = ['Ticker', 'Company', 'Sector', 'Market Cap', 'PE Ratio']

# Drop the first row of the data frame because it contains the feature names
sector_data = sector_data.drop(0)

# Print the first five rows of the sector data frame
sector_data.head()

| | Ticker | Company | Sector | Market Cap | PE Ratio |
|---|---|---|---|---|---|
| 1 | MRO | Marathon Oil Corporation | Energy Minerals | 17388156106 | 2262.602 |
| 2 | MO | Altria Group, Inc. | Consumer Non-Durables | 81364639112 | 522.1622 |
| 3 | ANET | Arista Networks, Inc. | Electronic Technology | 35807001092 | 471.4286 |
| 4 | MTB | M&T Bank Corporation | Finance | 25831948062 | 397.9184 |
| 5 | CI | Cigna Corporation | Health Services | 94846357195 | 332.8684 |
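
Because the header row was read as data before being dropped, the scraped columns most likely hold strings rather than numbers. Below is a minimal sketch of converting the numeric columns once, up front, assuming they contain plain digits. `sample` is a hypothetical stand-in for `sector_data`, built from rows in the output above.

```python
import pandas as pd

# Hypothetical stand-in for sector_data, with values stored as strings
sample = pd.DataFrame({
    'Ticker': ['MRO', 'MO', 'ANET'],
    'Market Cap': ['17388156106', '81364639112', '35807001092'],
    'PE Ratio': ['2262.602', '522.1622', '471.4286'],
})

# errors='coerce' turns any unparseable entry into NaN instead of raising
for column in ['Market Cap', 'PE Ratio']:
    sample[column] = pd.to_numeric(sample[column], errors='coerce')

print(sample.dtypes)
```

Converting once keeps later sorting and arithmetic free of per-call casts, and coerced `NaN` values make malformed entries easy to spot with `isna()`.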


## Data Processing

The scraped article uses finer-grained sector labels than the S&P 500 does, so several of its labels fall under a single official sector. We therefore combine those labels into the 11 official S&P 500 sectors.

In [4]:
# Combine sectors into 11 official SP500 Sectors

# Energy
sector_data.loc[sector_data['Sector'] == 'Energy Minerals', 'Sector'] = 'Energy'

# Materials
sector_data.loc[sector_data['Sector'] == 'Non-Energy Minerals', 'Sector'] = 'Materials'

# Industrials
sector_data.loc[sector_data['Sector'] == 'Distribution Services', 'Sector'] = 'Industrials'
sector_data.loc[sector_data['Sector'] == 'Producer Manufacturing', 'Sector'] = 'Industrials'
sector_data.loc[sector_data['Sector'] == 'Process Industries', 'Sector'] = 'Industrials'
sector_data.loc[sector_data['Sector'] == 'Industrial Services', 'Sector'] = 'Industrials'
sector_data.loc[sector_data['Sector'] == '', 'Sector'] = 'Industrials'

# Consumer Discretionary
sector_data.loc[sector_data['Sector'] == 'Consumer Non-Durables', 'Sector'] = 'Consumer Discretionary'
sector_data.loc[sector_data['Sector'] == 'Consumer Services', 'Sector'] = 'Consumer Discretionary'
sector_data.loc[sector_data['Sector'] == 'Consumer Durables', 'Sector'] = 'Consumer Discretionary'
sector_data.loc[sector_data['Sector'] == 'Retail Trade', 'Sector'] = 'Consumer Discretionary'

# Information Technology
sector_data.loc[sector_data['Sector'] == 'Electronic Technology', 'Sector'] = 'Information Technology'
sector_data.loc[sector_data['Sector'] == 'Health Technology', 'Sector'] = 'Information Technology'
sector_data.loc[sector_data['Sector'] == 'Technology Services', 'Sector'] = 'Information Technology'

# Financials
sector_data.loc[sector_data['Sector'] == 'Finance', 'Sector'] = 'Financials'

# Health Care
sector_data.loc[sector_data['Sector'] == 'Health Services', 'Sector'] = 'Health Care'

# Communication Services
sector_data.loc[sector_data['Sector'] == 'Communications', 'Sector'] = 'Communication Services'

# Real Estate
sector_data.loc[sector_data['Sector'] == 'Commercial Services', 'Sector'] = 'Real Estate'

# Print the first five rows of the sector data frame
sector_data.head()

| | Ticker | Company | Sector | Market Cap | PE Ratio |
|---|---|---|---|---|---|
| 1 | MRO | Marathon Oil Corporation | Energy | 17388156106 | 2262.602 |
| 2 | MO | Altria Group, Inc. | Consumer Discretionary | 81364639112 | 522.1622 |
| 3 | ANET | Arista Networks, Inc. | Information Technology | 35807001092 | 471.4286 |
| 4 | MTB | M&T Bank Corporation | Financials | 25831948062 | 397.9184 |
| 5 | CI | Cigna Corporation | Health Care | 94846357195 | 332.8684 |
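
The repeated `.loc` assignments in the cell above can also be written as a single lookup dictionary. Below is a minimal sketch of that alternative, which is not the notebook's own method. `sample` and `sector_map` are hypothetical names, and the dictionary holds only a subset of the mappings used above.

```python
import pandas as pd

# Hypothetical stand-in for sector_data with the scraped sector labels
sample = pd.DataFrame({
    'Ticker': ['MRO', 'MO', 'ANET', 'MTB', 'CI'],
    'Sector': ['Energy Minerals', 'Consumer Non-Durables',
               'Electronic Technology', 'Finance', 'Health Services'],
})

# Subset of the scraped-label to official-sector mapping used in the cell above
sector_map = {
    'Energy Minerals': 'Energy',
    'Consumer Non-Durables': 'Consumer Discretionary',
    'Electronic Technology': 'Information Technology',
    'Finance': 'Financials',
    'Health Services': 'Health Care',
}

# replace() leaves any label that is not a dictionary key unchanged
sample['Sector'] = sample['Sector'].replace(sector_map)

# Labels that are still not official sector names were never mapped
unmapped = set(sample['Sector']) - set(sector_map.values())

print(sample)
print(unmapped)
```

Keeping the mapping in one dictionary makes it easier to review as a whole, and the `unmapped` check flags any scraped label that slipped through.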


In [39]:
# Store only top 5 by market cap in each sector
topsect_df = pd.DataFrame()

# Get unique sectors in an array
sectors = sector_data['Sector'].unique()

# Traverse unique sectors
for sector in sectors:
    # Create a sub dataframe for the sector
    sub_df = sector_data[sector_data['Sector'] == sector]

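    # Sort sub dataframe by market cap (cast to 64-bit integers, since values
    # above roughly 2.1 billion overflow on platforms where int is 32-bit)
    sub_df = sub_df.sort_values('Market Cap', ascending=False, key=lambda x: x.astype('int64'))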

    # Merge into dataframe of all sectors' top 5
    topsect_df = pd.concat([topsect_df, sub_df.head(5)], ignore_index = True)

topsect_df

| | Ticker | Company | Sector | Market Cap | PE Ratio |
|---|---|---|---|---|---|
| 0 | XOM | Exxon Mobil Corporation | Energy | 466273190232 | 14.27901 |
| 1 | CVX | Chevron Corporation | Energy | 342408717940 | 14.25038 |
| 2 | COP | ConocoPhillips | Energy | 149727915989 | 24.18134 |
| 3 | EOG | EOG Resources, Inc. | Energy | 76624961922 | 13.63957 |
| 4 | OXY | Occidental Petroleum Corporation | Energy | 59752012540 | 20.31586 |
| 5 | AMZN | Amazon.com, Inc. | Consumer Discretionary | 971911570617 | 19.82644 |
| 6 | WMT | Walmart Inc. | Consumer Discretionary | 390523608487 | 13.69223 |
| 7 | TSLA | Tesla, Inc. | Consumer Discretionary | 390171883951 | 0.0 |
| 8 | PG | Procter & Gamble Company (The) | Consumer Discretionary | 355004268428 | 36.60612 |
| 9 | HD | Home Depot, Inc. (The) | Consumer Discretionary | 336198872307 | 11.78849 |
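
The per-sector loop above can also be expressed with a single sort followed by `groupby(...).head(n)`. Below is a minimal sketch of that alternative, which is not the notebook's own method. `sample` is a hypothetical stand-in for `sector_data` that reuses a few rows from the output above, and `top_n` is set to 2 only to keep the example small.

```python
import pandas as pd

# Hypothetical stand-in for sector_data; market caps are strings, as scraped
sample = pd.DataFrame({
    'Ticker': ['XOM', 'CVX', 'COP', 'AMZN', 'WMT', 'TSLA'],
    'Sector': ['Energy', 'Energy', 'Energy',
               'Consumer Discretionary', 'Consumer Discretionary',
               'Consumer Discretionary'],
    'Market Cap': ['466273190232', '342408717940', '149727915989',
                   '971911570617', '390523608487', '390171883951'],
})

top_n = 2  # the notebook keeps 5 per sector

top = (
    sample
    # Cast to 64-bit integers so that sorting is numeric, not lexical
    .assign(**{'Market Cap': sample['Market Cap'].astype('int64')})
    # Sort all rows by market cap, largest first
    .sort_values('Market Cap', ascending=False)
    # head() on a groupby keeps the first top_n rows of each sector
    .groupby('Sector', sort=False)
    .head(top_n)
    # Group the result by sector for display
    .sort_values(['Sector', 'Market Cap'], ascending=[True, False])
    .reset_index(drop=True)
)

print(top)
```

This avoids growing a data frame inside a loop, at the cost of ordering sectors alphabetically rather than by first appearance.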


## Exploratory Data Analysis and Data Visualization

## Data Analysis, Hypothesis Testing, and Machine Learning

## Insights