<h1 style= "color:#9370DB;"> Stock Analysis </h1>

In [1]:
# 📚 Libraries 
import kagglehub
import pandas as pd
import numpy as np
import os

# 📊 Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as g

# 🤖 Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 

**Goals**

Investigate the performance of various sectors in the S&P 500. Analyze which sectors contribute the most to the index's overall performance and how they behave during market fluctuations. 

- Annual return by sector.
- Identify high- and low-performing stocks in the S&P 500.


<h2 style="color: #9370DB;"> 01 | Data Extraction </h2>

In [2]:
# Download latest version
path = kagglehub.dataset_download("andrewmvd/sp-500-stocks")

In [3]:
# Print all files in the dataset path
print(os.listdir(path))

['sp500_stocks.csv', 'sp500_companies.csv', 'sp500_index.csv']


In [4]:
csv_file_path = os.path.join(path, 'sp500_stocks.csv')
csv_file_path2 = os.path.join(path, 'sp500_companies.csv')
csv_file_path3 = os.path.join(path, 'sp500_index.csv')
data = pd.read_csv(csv_file_path)
df = pd.read_csv(csv_file_path2)
sp = pd.read_csv(csv_file_path3)

In [5]:
# Cleaning columns with snake_case
data.columns = [col.lower().replace(" ", "_") for col in data.columns]
df.columns = [col.lower().replace(" ", "_") for col in df.columns]
sp.columns = [col.lower().replace(" ", "_") for col in sp.columns]

<h3 style="color: #4169E1;">1.1 | Exploring the Data </h3>

In [6]:
data.sample(3)

Unnamed: 0,date,symbol,adj_close,close,high,low,open,volume
168414,2021-08-24,ANET,23.528126,23.528126,23.540625,23.160625,23.2325,6459200.0
234825,2016-07-14,TECH,26.548103,27.969999,28.625,27.93,28.625,1044400.0
1148365,2015-01-06,MA,78.130501,83.089996,83.779999,81.800003,83.660004,7690000.0


In [7]:
df.sample(3)

Unnamed: 0,exchange,symbol,shortname,longname,sector,industry,currentprice,marketcap,ebitda,revenuegrowth,city,state,country,fulltimeemployees,longbusinesssummary,weight
438,NMS,SWKS,"Skyworks Solutions, Inc.","Skyworks Solutions, Inc.",Technology,Semiconductors,89.41,14298536960,1136200000.0,-0.159,Irvine,CA,United States,9750.0,"Skyworks Solutions, Inc., together with its su...",0.000251
125,NYQ,DELL,Dell Technologies Inc.,Dell Technologies Inc.,Technology,Computer Hardware,118.69,83158605824,8726000000.0,0.091,Round Rock,TX,United States,120000.0,"Dell Technologies Inc. designs, develops, manu...",0.001458
93,NYQ,CB,Chubb Limited,Chubb Limited,Financial Services,Insurance - Property & Casualty,276.22,111343722496,11363000000.0,0.078,Zurich,,Switzerland,40000.0,Chubb Limited provides insurance and reinsuran...,0.001952


In [8]:
sp.sample(3)

Unnamed: 0,date,s&p500
907,2018-07-20,2801.83
379,2016-06-15,2071.5
980,2018-11-01,2740.37


### The Stock Analysis Dataset

- **Source**: the Kaggle dataset `andrewmvd/sp-500-stocks`, downloaded with `kagglehub`.
- **Files**:
  - `sp500_stocks.csv`: daily prices per symbol (`date`, `symbol`, `adj_close`, `close`, `high`, `low`, `open`, `volume`).
  - `sp500_companies.csv`: one row per company, with `exchange`, `symbol`, `shortname`, `longname`, `sector`, `industry`, `currentprice`, `marketcap`, `ebitda`, `revenuegrowth`, `fulltimeemployees`, `weight`, and descriptive fields such as `city`, `state`, `country`, and `longbusinesssummary`.
  - `sp500_index.csv`: the daily level of the index (`date`, `s&p500`).
- **Type**: a time series of prices plus a cross-sectional table of company attributes.
- **Use in this notebook**: compute annual returns per stock, then compare them across sectors.

In [9]:
df.isna().sum()

exchange                0
symbol                  0
shortname               0
longname                0
sector                  0
industry                0
currentprice            0
marketcap               0
ebitda                 29
revenuegrowth           3
city                    0
state                  20
country                 0
fulltimeemployees       9
longbusinesssummary     0
weight                  0
dtype: int64

In [10]:
df.sector.value_counts()

sector
Technology                82
Industrials               70
Financial Services        67
Healthcare                63
Consumer Cyclical         55
Consumer Defensive        37
Utilities                 32
Real Estate               31
Communication Services    22
Energy                    22
Basic Materials           22
Name: count, dtype: int64

In [None]:
df.industry.value_counts()

In [None]:
data.isna().sum()

In [None]:
sp

<h3 style="color: #4169E1;">1.2 | Copies</h3>

In [11]:
data2 = data.copy()

<h2 style="color: #9370DB;"> 02 | ⚒️ Data Cleaning </h2>

<h3 style="color: #4169E1;"> 2.1 | Dealing with Data types</h3>

In [None]:
data.dtypes

In [None]:
df.dtypes

In [None]:
sp.dtypes

<h3 style="color: #4169E1;"> 2.2 | Dealing with NaN values</h3>

In [12]:
data.isna().sum()

date              0
symbol            0
adj_close    101626
close        101626
high         101626
low          101626
open         101626
volume       101626
dtype: int64

In [13]:
# Delete NaN. 
data2.dropna(how='any', inplace=True)

In [14]:
data2.isna().sum()

date         0
symbol       0
adj_close    0
close        0
high         0
low          0
open         0
volume       0
dtype: int64
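Every price column reported the same number of missing values (101,626), which suggests that whole rows are empty rather than isolated fields. A minimal sketch of how one might confirm that, and see which symbols the gaps belong to, before dropping them. The `toy` frame and its values are invented for illustration; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of sp500_stocks.csv (values are invented).
toy = pd.DataFrame({
    "date": ["2010-01-04", "2010-01-05", "2010-01-04", "2010-01-05"],
    "symbol": ["AAA", "AAA", "BBB", "BBB"],
    "adj_close": [10.0, 10.5, np.nan, np.nan],
    "volume": [1000.0, 1200.0, np.nan, np.nan],
})

price_cols = ["adj_close", "volume"]

# True when a row is missing every price field at once.
all_missing = toy[price_cols].isna().all(axis=1)
# True when a row is missing at least one price field.
any_missing = toy[price_cols].isna().any(axis=1)

# If the two counts match, NaNs only occur as fully empty rows.
rows_fully_empty = int(all_missing.sum())
rows_partly_empty = int(any_missing.sum())

# Share of empty rows per symbol: 1.0 means the symbol has no prices here.
missing_share = all_missing.groupby(toy["symbol"]).mean()
print(rows_fully_empty, rows_partly_empty)
print(missing_share)
```

On the real data, a symbol with a high share of empty rows may simply have started trading after 2010, so dropping its empty rows loses nothing.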

In [None]:
sp.isna().sum()

In [None]:
data2.symbol.value_counts()

<h3 style="color: #4169E1;"> 2.3 | Dealing with Duplicates</h3>

In [15]:
data2.duplicated().sum()

0

In [None]:
df.duplicated().sum()

In [None]:
sp.duplicated().sum()

<h3 style="color: #4169E1;"> 2.5 | Dealing with outliers</h3>

<h3 style="color: #4169E1;"> 2.6 | Moving target to the right </h3>

<h3 style="color: #4169E1;"> 2.7 | Other Steps </h3>

In [16]:
# Delete Columns 
data2.drop(columns=['high', 'low', 'open','close'], inplace=True)

In [17]:
# Change to datetime. 
data2['date'] = pd.to_datetime(data2['date'])

In [18]:
data2['year'] = data2['date'].dt.year
data2['month'] = data2['date'].dt.month
data2['day'] = data2['date'].dt.day

In [19]:
cols = ['year', 'month', 'day', 'symbol', 'adj_close', 'volume']
data2 = data2[cols]
data2.head(3)

Unnamed: 0,year,month,day,symbol,adj_close,volume
0,2010,1,4,MMM,43.783875,3640265.0
1,2010,1,5,MMM,43.509624,3405012.0
2,2010,1,6,MMM,44.126659,6301126.0


In [20]:
# Drop rows from 2010 to 2013, because the S&P 500 index data we compare against is available from 2014 onward.
data2.drop(data2[(data2['year'] >= 2010) & (data2['year'] <= 2013)].index, inplace=True)

In [21]:
# Annual return per symbol and year: last adjusted close of the year over the first, minus 1.
# Selecting 'adj_close' before aggregating keeps the lambda off the grouping columns.
annual_returns = (
    data2.groupby(['symbol', 'year'])['adj_close']
    .agg(lambda s: (s.iloc[-1] / s.iloc[0]) - 1)
    .reset_index(name='annual_return')
    .round(4)
)
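The annual return here is the last adjusted close of the year divided by the first, minus one. Because `.iloc[0]` and `.iloc[-1]` select rows by position, the result is only correct when rows are in date order within each symbol. A small sketch on invented prices that sorts first as a precaution; `toy_prices` is hypothetical and deliberately shuffled.

```python
import pandas as pd

# Invented prices, deliberately out of date order.
toy_prices = pd.DataFrame({
    "date": pd.to_datetime(["2014-12-31", "2014-01-02", "2014-06-30",
                            "2015-01-02", "2015-12-31"]),
    "symbol": ["AAA"] * 5,
    "adj_close": [110.0, 100.0, 90.0, 110.0, 99.0],
})
toy_prices["year"] = toy_prices["date"].dt.year

# Sort so that first/last really mean earliest/latest trading day.
toy_prices = toy_prices.sort_values(["symbol", "date"])

grouped = toy_prices.groupby(["symbol", "year"])["adj_close"]
toy_returns = (
    (grouped.last() / grouped.first() - 1)
    .round(4)
    .reset_index(name="annual_return")
)
print(toy_returns)
```

For 2014 this gives 110 / 100 - 1 = 0.10, and for 2015 it gives 99 / 110 - 1 = -0.10. Without the sort, the 2014 figure would be computed from the wrong rows.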


In [22]:
pivoted_df = annual_returns.pivot(index='symbol', columns='year', values='annual_return')

In [23]:
pivoted_df.sample(3)

year,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
HPE,,-0.0782,0.5473,0.083,-0.0723,0.2145,-0.2343,0.407,0.021,0.0903,0.3209
PLD,0.2041,0.0264,0.3046,0.2592,-0.0552,0.6048,0.157,0.7883,-0.2989,0.2162,-0.1487
CSGP,0.0096,0.1474,-0.0458,0.5952,0.1366,0.7875,0.4892,-0.1172,-0.0198,0.1254,-0.0922


In [24]:
pivoted_df = pivoted_df.rename(columns={2014:'ar_2014',2015: 'ar_2015',2016:'ar_2016',2017: 'ar_2017', 
                                          2018:'ar_2018', 2019: 'ar_2019',2020: 'ar_2020', 2021: 'ar_2021', 2022:'ar_2022', 2023:'ar_2023',2024: 'ar_2024'})

In [25]:
definitive = pd.merge(df, pivoted_df, on='symbol')
definitive

Unnamed: 0,exchange,symbol,shortname,longname,sector,industry,currentprice,marketcap,ebitda,revenuegrowth,...,ar_2015,ar_2016,ar_2017,ar_2018,ar_2019,ar_2020,ar_2021,ar_2022,ar_2023,ar_2024
0,NMS,AAPL,Apple Inc.,Apple Inc.,Technology,Consumer Electronics,246.49,3725893566464,1.346610e+11,0.061,...,-0.0208,0.1238,0.4804,-0.0705,0.8874,0.7824,0.3765,-0.2861,0.5394,0.3278
1,NMS,NVDA,NVIDIA Corporation,NVIDIA Corporation,Technology,Semiconductors,139.31,3411701923840,6.118400e+10,1.224,...,0.6645,2.3292,0.9043,-0.3285,0.7341,1.1802,1.2448,-0.5144,2.4610,1.8930
2,NMS,MSFT,Microsoft Corporation,Microsoft Corporation,Technology,Software - Infrastructure,448.99,3338186784768,1.365520e+11,0.160,...,0.2188,0.1651,0.3974,0.2022,0.5826,0.3994,0.5521,-0.2836,0.5696,0.2106
3,NMS,AMZN,"Amazon.com, Inc.","Amazon.com, Inc.",Consumer Cyclical,Internet Retail,230.26,2421184004096,1.115830e+11,0.110,...,1.1907,0.1772,0.5517,0.2632,0.2006,0.7160,0.0464,-0.5071,0.7704,0.5358
4,NMS,GOOGL,Alphabet Inc.,Alphabet Inc.,Communication Services,Internet Content & Information,195.40,2399824510976,1.234700e+11,0.151,...,0.4692,0.0435,0.3037,-0.0263,0.2699,0.2805,0.6783,-0.3915,0.5674,0.4193
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,NYQ,HII,"Huntington Ingalls Industries,","Huntington Ingalls Industries, Inc.",Industrials,Aerospace & Defense,191.52,7494043648,1.071000e+09,-0.024,...,0.1491,0.5029,0.2615,-0.1545,0.3244,-0.3181,0.1651,0.2680,0.1633,-0.2480
499,NYQ,CE,Celanese Corporation,Celanese Corporation,Basic Materials,Chemicals,68.04,7437588480,1.851000e+09,-0.028,...,0.1425,0.2159,0.3718,-0.1405,0.3795,0.0976,0.3594,-0.3756,0.5564,-0.5515
500,NYQ,FMC,FMC Corporation,FMC Corporation,Basic Materials,Agricultural Inputs,56.45,7046992384,7.033000e+08,0.085,...,-0.3028,0.5106,0.6741,-0.2186,0.5627,0.1725,-0.0103,0.1542,-0.4807,-0.1044
501,NMS,QRVO,"Qorvo, Inc.","Qorvo, Inc.",Technology,Semiconductors,69.06,6528013824,6.731300e+08,-0.052,...,-0.2770,0.0396,0.2590,-0.1183,0.8998,0.4339,-0.0498,-0.4315,0.2633,-0.3659
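With company attributes and annual returns in one table, the first goal (annual return by sector) becomes a group-by. A hedged sketch of two possible aggregations: an equal-weighted mean, and a mean weighted by the `weight` column. The five rows are copied from outputs in this notebook; treating `weight` as the index weight is an assumption, and neither aggregation is necessarily the one the analysis will settle on.

```python
import pandas as pd

# Five rows taken from the notebook outputs (only a few columns kept).
sample = pd.DataFrame({
    "symbol": ["AAPL", "NVDA", "MSFT", "AMZN", "GOOGL"],
    "sector": ["Technology", "Technology", "Technology",
               "Consumer Cyclical", "Communication Services"],
    "weight": [0.065323, 0.059814, 0.058526, 0.042449, 0.042074],
    "ar_2023": [0.5394, 2.4610, 0.5696, 0.7704, 0.5674],
    "ar_2024": [0.3278, 1.8930, 0.2106, 0.5358, 0.4193],
})
ar_cols = [c for c in sample.columns if c.startswith("ar_")]

# Equal-weighted: every stock in a sector counts the same.
equal_weighted = sample.groupby("sector")[ar_cols].mean()

# Weighted by `weight`: larger constituents count more.
weighted_sum = (
    sample[ar_cols].mul(sample["weight"], axis=0)
    .groupby(sample["sector"]).sum()
)
weight_total = sample.groupby("sector")["weight"].sum()
index_weighted = weighted_sum.div(weight_total, axis=0)

print(equal_weighted.round(4))
print(index_weighted.round(4))
```

The two views answer different questions: the equal-weighted mean describes the typical stock in a sector, while the weighted mean is closer to what the sector contributes to the index.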


<h2 style="color: #9370DB;"> 03 | EDA (Exploratory Data Analysis) </h2>

<h3 style="color: #4169E1;"> Optional | Selecting Numerical </h3>

In [26]:
cat = definitive.select_dtypes(exclude='number')
cat.head(5)

Unnamed: 0,exchange,symbol,shortname,longname,sector,industry,city,state,country,longbusinesssummary
0,NMS,AAPL,Apple Inc.,Apple Inc.,Technology,Consumer Electronics,Cupertino,CA,United States,"Apple Inc. designs, manufactures, and markets ..."
1,NMS,NVDA,NVIDIA Corporation,NVIDIA Corporation,Technology,Semiconductors,Santa Clara,CA,United States,NVIDIA Corporation provides graphics and compu...
2,NMS,MSFT,Microsoft Corporation,Microsoft Corporation,Technology,Software - Infrastructure,Redmond,WA,United States,Microsoft Corporation develops and supports so...
3,NMS,AMZN,"Amazon.com, Inc.","Amazon.com, Inc.",Consumer Cyclical,Internet Retail,Seattle,WA,United States,"Amazon.com, Inc. engages in the retail sale of..."
4,NMS,GOOGL,Alphabet Inc.,Alphabet Inc.,Communication Services,Internet Content & Information,Mountain View,CA,United States,Alphabet Inc. offers various products and plat...


In [27]:
num = definitive.select_dtypes(include='number')
num.head(5)

Unnamed: 0,currentprice,marketcap,ebitda,revenuegrowth,fulltimeemployees,weight,ar_2014,ar_2015,ar_2016,ar_2017,ar_2018,ar_2019,ar_2020,ar_2021,ar_2022,ar_2023,ar_2024
0,246.49,3725893566464,134661000000.0,0.061,164000.0,0.065323,0.4263,-0.0208,0.1238,0.4804,-0.0705,0.8874,0.7824,0.3765,-0.2861,0.5394,0.3278
1,139.31,3411701923840,61184000000.0,1.224,29600.0,0.059814,0.2868,0.6645,2.3292,0.9043,-0.3285,0.7341,1.1802,1.2448,-0.5144,2.461,1.893
2,448.99,3338186784768,136552000000.0,0.16,228000.0,0.058526,0.2842,0.2188,0.1651,0.3974,0.2022,0.5826,0.3994,0.5521,-0.2836,0.5696,0.2106
3,230.26,2421184004096,111583000000.0,0.11,1551000.0,0.042449,-0.2202,1.1907,0.1772,0.5517,0.2632,0.2006,0.716,0.0464,-0.5071,0.7704,0.5358
4,195.4,2399824510976,123470000000.0,0.151,181269.0,0.042074,-0.0475,0.4692,0.0435,0.3037,-0.0263,0.2699,0.2805,0.6783,-0.3915,0.5674,0.4193


<h3 style="color: #4169E1;">3.1 | Descriptive Statistics </h3>

In [28]:
definitive.describe()

Unnamed: 0,currentprice,marketcap,ebitda,revenuegrowth,fulltimeemployees,weight,ar_2014,ar_2015,ar_2016,ar_2017,ar_2018,ar_2019,ar_2020,ar_2021,ar_2022,ar_2023,ar_2024
count,503.0,503.0,474.0,500.0,494.0,503.0,470.0,475.0,480.0,482.0,485.0,491.0,495.0,495.0,497.0,499.0,503.0
mean,227.857982,113395800000.0,7031397000.0,0.070484,57744.96,0.001988,0.2027,0.04468,0.197927,0.258513,-0.036073,0.351187,0.191706,0.328461,-0.083842,0.19722,0.202896
std,516.807881,347697100000.0,16227770000.0,0.180071,139469.3,0.006096,0.231231,0.252591,0.277881,0.260206,0.223571,0.314482,0.546766,0.29435,0.280172,0.3557,0.361134
min,9.84,5802753000.0,-3991000000.0,-0.602,28.0,0.000102,-0.3562,-0.753,-0.7106,-0.4308,-0.5762,-0.5433,-0.5803,-0.3505,-0.7107,-0.4807,-0.619
25%,71.615,20051990000.0,1623194000.0,0.002,10200.0,0.000352,0.0568,-0.1087,0.0484,0.094325,-0.1929,0.18555,-0.05095,0.1313,-0.2662,-0.0248,0.0042
50%,125.33,37834930000.0,2941705000.0,0.05,21595.0,0.000663,0.19405,0.0407,0.17865,0.22335,-0.0378,0.3279,0.1276,0.3069,-0.1146,0.13,0.1655
75%,235.495,82471960000.0,6017250000.0,0.109,54762.25,0.001446,0.31375,0.17345,0.309775,0.388,0.0931,0.4749,0.30255,0.47235,0.0607,0.3203,0.3548
max,8848.69,3725894000000.0,149547000000.0,1.632,2100000.0,0.065323,1.6721,1.7228,3.0939,1.4272,1.0659,4.311,7.2005,1.9002,1.0713,2.461,3.3733
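The summary shows wide ranges in every year (for `ar_2024`, from -0.619 to 3.3733) but not which stocks sit at the extremes. A minimal sketch for the second goal, using `nlargest` and `nsmallest` on one year's return. The nine rows are copied from the merged table shown earlier, so the ranking only covers that small sample.

```python
import pandas as pd

# Rows taken from the merged table preview (symbol and 2024 return only).
sample = pd.DataFrame({
    "symbol": ["AAPL", "NVDA", "MSFT", "AMZN", "GOOGL",
               "HII", "CE", "FMC", "QRVO"],
    "ar_2024": [0.3278, 1.8930, 0.2106, 0.5358, 0.4193,
                -0.2480, -0.5515, -0.1044, -0.3659],
})

# Three best and three worst performers for the year.
top3 = sample.nlargest(3, "ar_2024")
bottom3 = sample.nsmallest(3, "ar_2024")
print(top3)
print(bottom3)
```

On the full table the same two calls would apply to `definitive`, optionally after a `groupby('sector')` to rank within each sector.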


In [29]:
frequ = cat.sector.value_counts()
frequ

sector
Technology                82
Industrials               70
Financial Services        67
Healthcare                63
Consumer Cyclical         55
Consumer Defensive        37
Utilities                 32
Real Estate               31
Communication Services    22
Energy                    22
Basic Materials           22
Name: count, dtype: int64

In [30]:
table = cat.sector.value_counts(normalize=True).round(2)
table

sector
Technology                0.16
Industrials               0.14
Financial Services        0.13
Healthcare                0.13
Consumer Cyclical         0.11
Consumer Defensive        0.07
Utilities                 0.06
Real Estate               0.06
Communication Services    0.04
Energy                    0.04
Basic Materials           0.04
Name: proportion, dtype: float64

In [31]:
frequency_table = pd.concat([frequ,table], axis = 1)
frequency_table


Unnamed: 0_level_0,count,proportion
sector,Unnamed: 1_level_1,Unnamed: 2_level_1
Technology,82,0.16
Industrials,70,0.14
Financial Services,67,0.13
Healthcare,63,0.13
Consumer Cyclical,55,0.11
Consumer Defensive,37,0.07
Utilities,32,0.06
Real Estate,31,0.06
Communication Services,22,0.04
Energy,22,0.04
Basic Materials,22,0.04


In [None]:
# crosstab sector
pd.crosstab(index=definitive['sector'],
            columns='count')

<h3 style="color: #4169E1;"> 3.2 | Checking Distributions</h3>

<h3 style="color: #4169E1;"> 3.3 | Checking our target distribution</h3>

In [None]:
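# Pearson correlation of each numeric column with the current price
num.corrwith(definitive['currentprice'])

In [None]:
# Spearman (rank) correlation with the current price; top 5 columns
num.corrwith(definitive['currentprice'], method='spearman').sort_values(ascending=False)[:5]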
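The two cells use different correlation measures. Pearson captures linear association, while Spearman is Pearson applied to ranks, so it captures any monotonic relationship. A small sketch on invented data where the two disagree; Spearman is computed by hand from ranks here to make the definition explicit.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 11), dtype=float)
y = x ** 3  # monotonic, but strongly non-linear

# Pearson: linear association on the raw values.
pearson = x.corr(y)

# Spearman: Pearson correlation of the ranks.
spearman = x.rank().corr(y.rank())

print(round(pearson, 3), round(spearman, 3))
```

Spearman is 1 because the ordering of `y` matches the ordering of `x` exactly, while Pearson falls below 1 because the relationship is curved. For skewed columns such as `marketcap`, the rank-based measure is usually less dominated by a few very large values.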

<h3 style="color: #4169E1;">3.4 | Checking Outliers </h3>
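The summary statistics in section 3.1 show long right tails in several return columns (for `ar_2020`, the 75th percentile is about 0.30 while the maximum is 7.2005). One common rule of thumb for flagging such values is Tukey's 1.5 × IQR fences. The sketch below is an assumption about how this section could be filled in, not the method the notebook has chosen; `iqr_outliers` and `toy_returns` are hypothetical names with invented values.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside the k * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Invented returns with one extreme value.
toy_returns = pd.Series([0.05, 0.10, 0.12, 0.15, 0.20, 0.25, 7.20])
mask = iqr_outliers(toy_returns)
print(toy_returns[mask])
```

A flagged return is not necessarily an error: a 720% year can be a real price move, so it may be better to inspect flagged rows than to drop them automatically.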

<h3 style="color: #4169E1;">3.5 | Looking for Correlations </h3>

In [None]:
correlation_matrix = num.corr()
correlation_matrix

In [None]:
# Correlation Matrix-Heatmap Plot
mask = np.zeros_like(correlation_matrix)
mask[np.triu_indices_from(mask)] = True 
f, ax = plt.subplots(figsize=(20, 10))
sns.set(font_scale=1.5)

ax = sns.heatmap(correlation_matrix, mask=mask, annot=True, annot_kws={"size": 12}, linewidths=.5, cmap="BuPu", fmt=".2f", ax=ax) # round to 2 decimal places
ax.set_title("Correlation Heatmap", fontsize=20) 

In [None]:
# Plot each numerical column against 'currentprice' to visualize their relationships
# (skipping 'currentprice' itself, which would only plot the diagonal)
for col in num.columns.drop('currentprice'):
    plt.figure(figsize=(5, 5))
    plt.title('Scatter plot of price vs ' + col)
    sns.scatterplot(data=definitive, x=col, y='currentprice')
    plt.show()

<h2 style="color: #9370DB;"> 04 | Data Processing </h2>

<h3 style="color: #4169E1;"> 4.1 | X-Y Split</h3>
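The correlation and scatter-plot cells all compare features against `currentprice`, which suggests it is the intended target. Under that assumption, a minimal sketch of the X-y split. The rows are copied from the numeric preview earlier in the notebook, and only a subset of columns is kept.

```python
import pandas as pd

# Rows taken from the numeric preview (subset of columns).
num_sample = pd.DataFrame({
    "currentprice": [246.49, 139.31, 448.99, 230.26, 195.40],
    "revenuegrowth": [0.061, 1.224, 0.160, 0.110, 0.151],
    "ar_2023": [0.5394, 2.4610, 0.5696, 0.7704, 0.5674],
    "ar_2024": [0.3278, 1.8930, 0.2106, 0.5358, 0.4193],
})

target = "currentprice"  # assumed target column

# Features are every numeric column except the target.
X = num_sample.drop(columns=[target])
y = num_sample[target]
print(X.shape, y.shape)
```

On the full table, columns with missing values (`ebitda`, `revenuegrowth`, `fulltimeemployees`, and the earlier `ar_` years) would need to be imputed or dropped before fitting most scikit-learn models.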

<h3 style="color: #4169E1;"> 4.2 | Selecting the Model</h3>

<h4 style="color: #00BFFF;"> 4.2.1 | Selecting Model: Linear Regression </h4>
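The imports at the top of the notebook (`train_test_split`, `LinearRegression`, `r2_score`, `mean_absolute_error`) point to the following workflow. This is a sketch on synthetic data with a known linear relationship, so the scores say nothing about the stock data; the feature names `f1` and `f2` are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and a target built from them plus a little noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
y = 3.0 * X["f1"] - 2.0 * X["f2"] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate on the held-out rows only.
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R2={r2:.3f}  MAE={mae:.3f}")
```

Fixing `random_state` keeps the split reproducible, which matters when the same split is reused to compare the other models listed in this section.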

<h4 style="color: #00BFFF;"> 4.2.2 | Selecting Model: Ridge Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.3 | Selecting Model: Lasso Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.4 | Selecting Model: Decision Tree Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.5 | Selecting Model: KNN Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.6 | Selecting Model: XGBoost Regression </h4>

<h3 style="color: #4169E1;"> 4.3 | Final Comparison</h3>

<h2 style="color: #9370DB;"> 05 | Improving Model </h2>

<h3 style="color: #4169E1;"> 5.1 | Normalization with MinMaxScaler</h3>

<h3 style="color: #4169E1;"> 5.2 | Standardization with StandardScaler</h3>

<h3 style="color: #4169E1;"> 5.3 | Normalization with Log Transform</h3>
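Columns such as `marketcap` span several orders of magnitude (from about 5.8e9 to 3.7e12 in the summary statistics), which is where a log transform tends to help. A minimal sketch with `np.log1p` on five market caps copied from the notebook outputs; choosing `log1p` rather than a plain `log` is an assumption made here.

```python
import numpy as np
import pandas as pd

# Market caps taken from the notebook outputs.
marketcap = pd.Series(
    [3725893566464, 3411701923840, 83158605824, 14298536960, 7046992384],
    index=["AAPL", "NVDA", "DELL", "SWKS", "FMC"],
    dtype="float64",
)

# log1p(x) = log(1 + x); it compresses large values and keeps the ordering.
log_marketcap = np.log1p(marketcap)

# Spread between largest and smallest value, before and after.
ratio_raw = marketcap.max() / marketcap.min()
ratio_log = log_marketcap.max() / log_marketcap.min()
print(round(ratio_raw, 1), round(ratio_log, 3))
```

The transform only applies to non-negative values. `ebitda` has negative entries (its minimum is about -3.99e9), so it would need a different treatment, such as a signed log or standardization.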

<h3 style="color: #4169E1;"> 5.4 | Feature Engineering </h3>

<h2 style="color: #9370DB;"> 06 | Reporting </h2>