# Machine Learning Project: Stock Prediction using Machine Learning

In this project we will predict the value of PayPal stock using Machine learning

# Importing the Paypal Stock information

## Imports and Data Cleaning

In [37]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
import sklearn.svm as svc
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

Now we download the Standard & Poor 500 dataset (or keep it updated)

In [38]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("yash16jr/s-and-p500-daily-update-dataset")

print("Path to dataset files:", path)

Path to dataset files: /Users/guilhermealves/.cache/kagglehub/datasets/yash16jr/s-and-p500-daily-update-dataset/versions/297


In [39]:

import os

path = "/Users/guilhermealves/.cache/kagglehub/datasets/yash16jr/s-and-p500-daily-update-dataset/versions/275"

print(os.listdir(path))

['SnP_daily_update.csv']


In [40]:

df = pd.read_csv(
    # Use o caminho correto para o seu ficheiro .csv
    '/Users/guilhermealves/.cache/kagglehub/datasets/yash16jr/s-and-p500-daily-update-dataset/versions/275/SnP_daily_update.csv',
    
    header=[0, 1],         # Mantém o MultiIndex para os nomes das ações
    index_col=0,          # <--- CORREÇÃO 1: Usa a primeira coluna ('Date') como índice
    parse_dates=True      # <--- CORREÇÃO 2: Converte a coluna de índice para datetime
)

print(df.head())


Price           Close                                                      \
Ticker              A      AAPL ABBV ABNB        ABT      ACGL        ACN   
Date                                                                        
2010-01-04  19.891678  6.424604  NaN  NaN  18.414782  7.601905  31.492178   
2010-01-05  19.675604  6.435713  NaN  NaN  18.266010  7.576549  31.686796   
2010-01-06  19.605700  6.333346  NaN  NaN  18.367451  7.543795  32.023655   
2010-01-07  19.580280  6.321635  NaN  NaN  18.519613  7.499420  31.993717   
2010-01-08  19.573919  6.363664  NaN  NaN  18.614286  7.484628  31.866457   

Price                                        ...   Volume                    \
Ticker           ADBE        ADI        ADM  ...       WY     WYNN      XEL   
Date                                         ...                              
2010-01-04  37.090000  21.975159  20.614487  ...  1832400  4741400  2670400   
2010-01-05  37.700001  21.940468  20.725853  ...  1724500  5644300 

In [42]:
df_pypl_completo = df.xs('PYPL', level=1, axis=1)

In [43]:
df_pypl_completo.columns = [col.lower() for col in df_pypl_completo.columns]

if 'price' in df_pypl_completo.columns:
    df_pypl_completo = df_pypl_completo.drop(columns=['price'])

df_pypl_completo.dropna(subset=['close'], inplace=True)

df_pypl = df_pypl_completo.copy()

print("DataFrame da PayPal (PYPL) Limpo e Pronto:")
print(df_pypl.head())
print(f"Número de colunas: {df_pypl.shape[1]}")
print(f"Colunas existentes: {list(df_pypl.columns)}")


DataFrame da PayPal (PYPL) Limpo e Pronto:
                close       high        low       open     volume
Date                                                             
2015-07-06  36.709999  39.750000  36.000000  38.000000  5866600.0
2015-07-07  36.619999  37.810001  36.000000  37.720001  7359000.0
2015-07-08  34.700001  36.360001  34.529999  36.340000  5387700.0
2015-07-09  34.500000  35.520000  33.990002  35.099998  3760100.0
2015-07-10  34.689999  35.189999  33.980000  34.660000  4472800.0
Número de colunas: 5
Colunas existentes: ['close', 'high', 'low', 'open', 'volume']


# Time Series Analysis


Stock market predicition must be preceded by a rigorous Time Series Analysis. Financial data is unique because observations are not independent and identifcally distributed, they exibit autocorrelation and non-stationarity.


Time Series Analysis serves two critical functions:

- Risk Management: Quantifying extreme events. This will provide context for interpreting machine learning models.

- Validation and feature engineering: It allow us to confirm the statistical properties of data (Stationarity and other)

## Discrete Analysis

### Logarithmic Return 

### Histogram of returns

## Stationary Test

## Time Dependecy Analysis

# PayPal Stock Prediction using a Support Vector Machine 


The idea of Support Vector Machine is by finding a hyperplane to divide the data into groups, this will classify if the stock is going up or down based on historic data.

This only shows history up to 2015-07-06 because PayPal only turned public in July of 2015 (The S&P500 file starts at 2010)

Since the Support Vector Machine is a classifier we need to define the target

- 1 if tomorrow's price is BIGGER than today
- 0 if tomorrow's price is SMALLER than today

In [44]:
df_pypl["Target"]= np.where(df_pypl["close"].shift(-1) > df_pypl["close"], 1, 0)

In [45]:
df_pypl["SMA_20"]= df_pypl["close"].rolling(window=20).mean()

In [46]:
df_pypl["SMA_50"]=df_pypl["close"].rolling(window=50).mean()

In [47]:
df_ml_final = df_pypl.dropna()

In [48]:
features = ["open", "high", "low", "close", "volume", "SMA_20", "SMA_50"]
X = df_ml_final [features]
y = df_ml_final["Target"]

In [49]:
print( "DataFrame da Paypal com features e Target")
print(df_ml_final[features + ["Target"]].tail())
print(f"\nNúmero de amostras prontas para o SVM (X e y): {X.shape[0]}")
print(f"Número de Features (X): {X.shape[1]}")

DataFrame da Paypal com features e Target
                 open       high        low      close      volume   SMA_20  \
Date                                                                          
2025-10-13  70.720001  70.930000  68.162003  68.860001  20036800.0  69.3580   
2025-10-14  67.410004  69.709999  66.769997  69.150002  12968500.0  69.4730   
2025-10-15  69.385002  69.875000  67.839996  67.980003  11247300.0  69.4410   
2025-10-16  68.129997  68.599998  65.419998  66.050003  19258500.0  69.3175   
2025-10-17  65.535004  67.709999  65.449997  67.410004  11141800.0  69.2770   

             SMA_50  Target  
Date                         
2025-10-13  68.8970       1  
2025-10-14  68.9198       0  
2025-10-15  68.9228       0  
2025-10-16  68.8554       1  
2025-10-17  68.8392       0  

Número de amostras prontas para o SVM (X e y): 2540
Número de Features (X): 7


In [50]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, shuffle=False)

In [51]:
np.random.seed(42)
num_samples= X.shape[0]

scaler = StandardScaler()
X_train_scaled= scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [53]:
model = SVC (kernel= "rbf", C=1.0, gamma = "scale", random_state=42)

In [54]:
model.fit(X_train_scaled, y_train)

In [55]:
y_pred = model.predict(X_test_scaled)

In [56]:
print(accuracy_score(y_test, y_pred))

0.5511811023622047


In [57]:
cm= confusion_matrix(y_test, y_pred)
print(cm)

[[ 29 209]
 [ 19 251]]


In [58]:
print(classification_report(y_test, y_pred, target_names=['Descida (0)', 'Subida (1)']))

              precision    recall  f1-score   support

 Descida (0)       0.60      0.12      0.20       238
  Subida (1)       0.55      0.93      0.69       270

    accuracy                           0.55       508
   macro avg       0.57      0.53      0.45       508
weighted avg       0.57      0.55      0.46       508

