<h1 style="color:#9370DB;"> Stock Analysis </h1>

In [1]:
# 📚 Libraries 
import kagglehub
import pandas as pd
import numpy as np
import os

# New libraries.
import scipy.stats as st
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 📊 Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as g

# 🤖 Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 

### The Stock Analysis Dataset:


**First impressions:**
    
_____________

The **S&P 500** is a stock market index tracking the performance of the largest 500 publicly traded companies listed on U.S. stock exchanges.

Investors have long used the S&P 500 as a benchmark for their investments as it tends to signal overall market health. 
The index is a popular choice for long-term investors who wish to track growth over the coming decades. 

The dataset contains: 
- S&P 500 **Index**: Contains the daily price of the index, representing the overall performance of the 500 companies in the S&P 500.
- S&P 500 **Stocks**: Includes the daily stock prices for each company within the index, providing insights into individual stock movements. 
- S&P 500 **Companies**: Provides detailed information about each company, including metrics such as Name, Sector, Marketcap, Ebitda, Weight.

The data types are even: (13 int or float / 13 objects).

Our **project goal** is to identify the performance of various sectors in the S&P 500. After reading the [documentation](https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks) we will proceed with the following **strategy**:

1. The **target** of our dataset will be `currentprice`, the most recent price of each stock recorded in the dataset.
2. Through **Exploratory Data Analysis** we will identify the features that contribute to this prediction.


_____________

<h2 style="color: #9370DB;"> 01 | Data Extraction </h2>

In [2]:
data = pd.read_csv('sp500_stocks.csv')
df = pd.read_csv('sp500_companies.csv')
sp = pd.read_csv('sp500_index.csv')

In [3]:
# Cleaning column names: lowercase, with spaces replaced by underscores (snake_case)
data.columns = [col.lower().replace(" ", "_") for col in data.columns]
df.columns = [col.lower().replace(" ", "_") for col in df.columns]
sp.columns = [col.lower().replace(" ", "_") for col in sp.columns]

<h3 style="color: #4169E1;">1.1 | Exploring the Data </h3>

In [4]:
data.sample(3)

Unnamed: 0,date,symbol,adj_close,close,high,low,open,volume
1195928,2013-07-10,MSFT,28.582773,34.700001,34.810001,34.32,34.34,29658800.0
1052365,2010-07-02,KHC,,,,,,
178939,2019-03-14,T,14.737947,22.87009,23.028702,22.817221,22.87009,29249014.0


### Dataset Description: 

A brief analysis of each column. 
- `Date`: The specific date for which the stock data is recorded. 
- `Symbol`: A unique "ticker" code that identifies the company on the stock exchange. 
- `Adj_close`: The closing price of the stock after adjustments for dividends, splits, or other corporate actions. 
- `Close`: The unadjusted closing price of the stock on a given date.  
- `High`: The highest price at which the stock traded during the day.  
- `Low`: The lowest price at which the stock traded during the day. 
- `Open`: The price at which the stock started trading at the beginning of the day.
- `Volume`: The total number of shares traded during the day.

In [5]:
df.sample(3)

Unnamed: 0,exchange,symbol,shortname,longname,sector,industry,currentprice,marketcap,ebitda,revenuegrowth,city,state,country,fulltimeemployees,longbusinesssummary,weight
143,NMS,ORLY,"O'Reilly Automotive, Inc.","O'Reilly Automotive, Inc.",Consumer Cyclical,Specialty Retail,1257.78,72612519936,3685245000.0,0.038,Springfield,MO,United States,92709.0,"O'Reilly Automotive, Inc., together with its s...",0.001278
75,NYQ,UNP,Union Pacific Corporation,Union Pacific Corporation,Industrials,Railroads,233.57,141603454976,12030000000.0,0.025,Omaha,NE,United States,30518.0,"Union Pacific Corporation, through its subsidi...",0.002492
451,NYQ,SWK,"Stanley Black & Decker, Inc.","Stanley Black & Decker, Inc.",Industrials,Tools & Accessories,84.46,13020691456,1663700000.0,-0.051,New Britain,CT,United States,50500.0,"Stanley Black & Decker, Inc. provides hand too...",0.000229


### Dataset Description: 

A brief analysis of each column: 
- `Exchange`: A marketplace where stocks, bonds or other commodities are traded. (Example: NYSE, NASDAQ).
- `Symbol`: A unique "ticker" code that identifies the company on the stock exchange. 
- `Shortname`: The abbreviated name of the company. 
- `Longname`: The full name of the company. 
- `Sector`: The broader industry classification that the company belongs to, such as Technology, Healthcare, etc. 
- `Industry`: A more specific classification of the company's operations (e.g., Software, Pharmaceuticals).
- `Currentprice`: The most recent price at which the company's stock was sold or bought. 
- `Marketcap`: The total market value of the company's outstanding shares, calculated as: Current Price × Outstanding Shares. 
- `Ebitda`: (Earnings Before Interest, Taxes, Depreciation, and Amortization) Measures how profitable a company is before interest, taxes, depreciation, and amortization are deducted. 
- `Revenuegrowth`: The relative increase or decrease in revenue between periods, stored as a fraction (0.038 means 3.8%) and calculated as: (Current Period Revenue − Previous Period Revenue) / Previous Period Revenue. 
- `City`: The city where the company's headquarters is located. 
- `State`: The state where the company's headquarters is located.
- `Country`: The country of the company's origin. 
- `Fulltimeemployees`: The total number of full-time employees of the company. 
- `Longbusinesssummary`: A brief description and overview of the company's business activities. 
- `Weight`: Represents the weight of the company's market cap relative to the total market cap, used in index calculations in the S&P 500. 
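
As a rough illustration of the `Weight` definition above, the sketch below recomputes weights on a made-up three-company frame. The tickers and market caps are hypothetical, and it assumes the weight is simply each company's market cap divided by the sum of all market caps in the file; the dataset's own `weight` column may be derived differently.

```python
import pandas as pd

# Hypothetical toy data; the real values live in sp500_companies.csv
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "marketcap": [600.0, 300.0, 100.0],
})

# Assumed definition: company market cap / total market cap of all listed companies
toy["weight"] = toy["marketcap"] / toy["marketcap"].sum()
print(toy)
```

Under this assumption the weights always sum to 1, which is a quick sanity check to run against the real column.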


In [6]:
sp.sample(3)

Unnamed: 0,date,s&p500
2404,2024-06-28,5460.48
1352,2020-04-23,2797.8
288,2016-02-01,1939.38


### Dataset Description: 

A brief analysis of each column: 

- `Date`: The specific date for which the S&P 500 data is recorded. 
- `s&p500`: The closing price of the S&P 500 on a given date.  

<h3 style="color: #4169E1;">1.2 | Copies</h3>

In [8]:
data2 = data.copy()
df2 = df.copy()
sp2 = sp.copy()

<h2 style="color: #9370DB;"> 02 | ⚒️ Data Cleaning </h2>

<h3 style="color: #4169E1;"> 2.1 | Dealing with Data types</h3>

In [None]:
data.dtypes

In [None]:
df.dtypes

In [None]:
sp.dtypes

<h3 style="color: #4169E1;"> 2.2 | Dealing with NaN values</h3>

In [None]:
df.isna().sum()

In [None]:
sp.isna().sum()

In [None]:
data.isna().sum()

In [14]:
# Drop rows with any NaN: rows without price data (e.g. KHC on 2010-07-02 in the sample above) cannot be used to compute returns
data2.dropna(how='any', inplace=True)

In [None]:
data2.isna().sum()

In [None]:
# Reference: rows per symbol after dropping NaN, to compare with the counts before dropping them.
data2.symbol.value_counts()
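
To see how many rows each ticker loses when NaN rows are dropped, the per-symbol counts before and after can be lined up side by side. The sketch below does this on a small made-up frame (tickers and prices are hypothetical); with the real data, `data` and `data2` would take the place of `raw` and `clean`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ticker BBB has two days without prices
raw = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "adj_close": [10.0, 10.5, 11.0, np.nan, np.nan, 20.0],
})
clean = raw.dropna(how="any")

# Align the two value_counts on the symbol index
counts = pd.DataFrame({
    "rows_before": raw["symbol"].value_counts(),
    "rows_after": clean["symbol"].value_counts(),
}).fillna(0).astype(int)
counts["rows_dropped"] = counts["rows_before"] - counts["rows_after"]
print(counts)
```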

<h3 style="color: #4169E1;"> 2.3 | Dealing with Duplicates</h3>

In [None]:
data2.duplicated().sum()

In [None]:
df.duplicated().sum()

In [None]:
sp.duplicated().sum()

<h3 style="color: #4169E1;"> 2.5 | Dealing with outliers</h3>

In [None]:
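def outlier_slayer(data, columns=None): 
    """
    Remove rows falling outside the 1.5 * IQR fences (based on Q1 and Q3).
    If no columns are given, every numeric column is filtered in turn.
    """
    if columns is None:
        columns = data.select_dtypes(include=[np.number]).columns
    for column in columns:
        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    return data

In [None]:
# Filter outliers on the price column only ('currentprice' is the price column in df)
df = outlier_slayer(df, ["currentprice"])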

<h3 style="color: #4169E1;"> 2.6 | Moving target to the right </h3>

<h3 style="color: #4169E1;"> 2.7 | Other Steps </h3>

In [20]:
# Delete Columns 
data2.drop(columns=['high', 'low', 'open','close'], inplace=True)

In [21]:
# Change to datetime. 
data2['date'] = pd.to_datetime(data2['date'])

In [22]:
data2['year'] = data2['date'].dt.year
data2['month'] = data2['date'].dt.month
data2['day'] = data2['date'].dt.day

In [None]:
cols = ['year', 'month', 'day', 'symbol', 'adj_close', 'volume']
data2 = data2[cols]
data2.head(3)

In [24]:
# Drop rows from 2010 through 2014 so the stock data starts in 2015, matching the years compared against the S&P 500 index.
data2.drop(data2[(data2['year'] >= 2010) & (data2['year'] <= 2014)].index, inplace=True)

In [None]:
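# Annual return per symbol and year: last adj_close of the year / first adj_close of the year - 1.
# Assumes rows are in chronological order within each symbol. (Written with help from a chat assistant.)
annual_returns = (
    data2.groupby(['symbol', 'year'])
    .apply(lambda group: (group['adj_close'].iloc[-1] / group['adj_close'].iloc[0]) - 1)
    .reset_index(name='annual_return')
    .round(4)
)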
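
The annual return here is the last adjusted close of a year divided by the first, minus one. A small sketch on made-up prices (hypothetical ticker and values) shows the arithmetic. It sorts by date first, on the assumption that the first and last rows of each group should be the first and last trading day of the year, and it uses `first()`/`last()` rather than `apply`, which gives the same result when no prices are missing.

```python
import pandas as pd

# Hypothetical toy prices: +10% in 2015, -10% in 2016
toy = pd.DataFrame({
    "symbol": ["AAA"] * 4,
    "date": pd.to_datetime(["2015-01-02", "2015-12-31", "2016-01-04", "2016-12-30"]),
    "adj_close": [100.0, 110.0, 110.0, 99.0],
})
toy["year"] = toy["date"].dt.year
toy = toy.sort_values(["symbol", "date"])

# Last price of the year / first price of the year - 1
grouped = toy.groupby(["symbol", "year"])["adj_close"]
toy_returns = (
    (grouped.last() / grouped.first() - 1)
    .round(4)
    .reset_index(name="annual_return")
)
print(toy_returns)
```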

In [26]:
pivoted_df = annual_returns.pivot(index='symbol', columns='year', values='annual_return')

In [None]:
pivoted_df.sample(3)

In [28]:
pivoted_df = pivoted_df.rename(columns={2015: 'ar_2015',2016:'ar_2016',2017: 'ar_2017', 
                                          2018:'ar_2018', 2019: 'ar_2019',2020: 'ar_2020', 2021: 'ar_2021', 2022:'ar_2022', 2023:'ar_2023',2024: 'ar_2024'})

In [None]:
definitive = pd.merge(df, pivoted_df, on='symbol')
definitive

In [31]:
cols = ['exchange', 'symbol', 'shortname','longname','sector','industry',
        'marketcap','ebitda', 'revenuegrowth', 'city', 'state', 'country',   
        'fulltimeemployees', 'longbusinesssummary', 'weight', 'ar_2015', 'ar_2016', 'ar_2017', 'ar_2018', 'ar_2019',
        'ar_2020', 'ar_2021', 'ar_2022', 'ar_2023', 'ar_2024', 'currentprice']

In [32]:
definitive = definitive[cols]

In [None]:
df.nunique()

In [None]:
df.sector.value_counts()

<h2 style="color: #9370DB;"> 03 | EDA (Exploratory Data Analysis) </h2>

<h3 style="color: #4169E1;"> Optional | Selecting Numerical </h3>

In [None]:
cat = definitive.select_dtypes(exclude='number')
cat.head(5)

In [None]:
num = definitive.select_dtypes(include='number')
num.head(5)

<h3 style="color: #4169E1;">3.1 | Descriptive Statistics </h3>

In [None]:
definitive.describe()

In [None]:
frequ = cat.sector.value_counts()
frequ

In [None]:
table = cat.sector.value_counts(normalize=True).round(2)
table

In [None]:
frequency_table = pd.concat([frequ,table], axis = 1)
frequency_table


<h3 style="color: #4169E1;"> 3.2 | Checking Distributions</h3>

In [None]:
color = '#9370DB'

nrows, ncols = 5, 4 

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 16))

axes = axes.flatten()

for i, ax in enumerate(axes):
    if i >= len(num.columns):
        ax.set_visible(False)  # hide unused plots
        continue
    ax.hist(num.iloc[:, i], bins=30, color=color, edgecolor='black')
    ax.set_title(num.columns[i])

plt.tight_layout()
plt.show()

<h3 style="color: #4169E1;"> 3.3 | Checking our target distribution</h3>

In [None]:
# Distribution of the target without the 650.000 filter, after removing the outliers.
sns.histplot(definitive["currentprice"], color=color, kde=True);

In [None]:
# Pearson
num.corrwith(definitive['currentprice']).sort_values(ascending=False)

In [None]:
# Spearman (correlate against the same frame that `num` was built from, so the row indices align)
num.corrwith(definitive['currentprice'], method='spearman').sort_values(ascending=False)[:5]

<h3 style="color: #4169E1;">3.4 | Checking Outliers </h3>

In [None]:
color = '#9370DB'

# grid size
nrows, ncols = 5, 4 

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 16))

axes = axes.flatten()

for i, ax in enumerate(axes):
    if i >= len(num.columns):
        ax.set_visible(False)
        continue
    ax.boxplot(num.iloc[:, i].dropna(), vert=False, patch_artist=True, 
               boxprops=dict(facecolor=color, color='black'), 
               medianprops=dict(color='yellow'), whiskerprops=dict(color='black'), 
               capprops=dict(color='black'), flierprops=dict(marker='o', color='red', markersize=5))
    ax.set_title(num.columns[i], fontsize=10)
    ax.tick_params(axis='x', labelsize=8)

plt.tight_layout()
plt.show()

<h3 style="color: #4169E1;">3.5 | Looking for Correlations </h3>

In [None]:
num_corr = num.corr()
num_corr

In [None]:
# Correlation Matrix-Heatmap Plot
# Mask the upper triangle (including the diagonal) so each pair is shown once
mask = np.triu(np.ones_like(num_corr, dtype=bool))
f, ax = plt.subplots(figsize=(20, 10))
sns.set(font_scale=1.5)

ax = sns.heatmap(num_corr, mask=mask, annot=True, annot_kws={"size": 12}, linewidths=.5, cmap="BuPu", fmt=".2f", ax=ax) # round to 2 decimal places
ax.set_title("Correlation Heatmap", fontsize=20) 

In [None]:
# Plotting scatter plots for each numerical column against 'currentprice' to visualize their relationships
for col in num.columns:
    plt.figure(figsize=(5, 5))
    plt.title('Scatter plot of price vs ' + col)
    sns.scatterplot(data=definitive, x=col, y='currentprice')
    plt.show()

We will use **one-way ANOVA** to determine if there is a statistically significant difference in **stock price** based on **sector**.

#### Define Hypotheses
- **Null Hypothesis (H₀)**: There is no difference in mean stock price between sectors (for example, **Technology**, **Industrials**, and **Financial Services**).
- **Alternative Hypothesis (H₁)**: At least one group mean is different.
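
For reference, the one-way ANOVA statistic compares the variance between sector means with the variance within sectors. The symbols below are introduced here for illustration only: $k$ sectors, $n_i$ companies in sector $i$, $N$ companies in total, $y_{ij}$ the price of company $j$ in sector $i$, $\bar{y}_i$ the sector mean, and $\bar{y}$ the overall mean.

$$
F = \frac{\sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2 \,/\, (k - 1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \,/\, (N - k)}
$$

A large $F$ (and correspondingly small p-value) suggests that at least one sector mean differs from the others.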

In [None]:
# Extract current prices for each sector (one group per sector)
sector_prices = [
    group["currentprice"].dropna()
    for _, group in definitive.groupby("sector")
]

In [None]:
# Perform One-Way ANOVA
f_stat, p_value = st.f_oneway(*sector_prices)
print(f"F-Statistic: {f_stat:.2f}")
print(f"P-Value: {p_value:.4f}")
print()

# Significance level
alpha = 0.05

# Decision-Making
if p_value > alpha:
    print("Fail to Reject the Null Hypothesis: No significant difference in mean stock price between sectors was detected.")
else:
    print("Reject the Null Hypothesis: There is a significant difference in mean stock price between sectors.")

<h2 style="color: #9370DB;"> 04 | Data Processing </h2>

<h3 style="color: #4169E1;"> 4.1 | X-Y Split</h3>

<h3 style="color: #4169E1;"> 4.2 | Selecting the Model</h3>

<h4 style="color: #00BFFF;"> 4.2.1 | Selecting Model: Linear Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.2 | Selecting Model: Ridge Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.3 | Selecting Model: Lasso Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.4 | Selecting Model: Decision Tree Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.5 | Selecting Model: KNN Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.6 | Selecting Model: XGBoost Regression </h4>

<h3 style="color: #4169E1;"> 4.3 | Final Comparison</h3>

<h2 style="color: #9370DB;"> 05 | Improving Model </h2>

<h3 style="color: #4169E1;"> 5.1 | Normalization with MinMaxScaler</h3>

<h3 style="color: #4169E1;"> 5.2 | Standardization with StandardScaler</h3>

<h3 style="color: #4169E1;"> 5.3 | Normalization with Log Transform</h3>

<h3 style="color: #4169E1;"> 5.4 | Feature Engineering </h3>

<h2 style="color: #9370DB;"> 06 | Reporting </h2>