Made by Rob Verbeek
# Scraping stock data

### Dependencies

To scrape the data I'm using Yahoo Finance, which requires installing the yfinance library.

### Defining the time period

To define the time period we want to scrape, we take the current time as the end point. For the start date we use a timedelta to subtract the number of days we want to go back from the current time.
Since I'm looking at the data of the last 10 years, I used a timedelta of 3652 days (10 × 365 days plus 2 leap days, so roughly 10 years).
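
As a small sketch of this date arithmetic: the fixed end date below is a hypothetical value chosen only to make the example reproducible, whereas the notebook itself uses datetime.now().

```python
from datetime import datetime, timedelta

# Hypothetical fixed end date so the result is reproducible;
# the notebook uses datetime.now() instead.
end_date = datetime(2024, 6, 28)
start_date = end_date - timedelta(days=3652)

# Format both dates as 'YYYY-MM-DD' strings, as done in the download cell
start_str = start_date.strftime('%Y-%m-%d')
end_str = end_date.strftime('%Y-%m-%d')
print(start_str, end_str)
```

Because a 10-year window can contain either 2 or 3 leap days, a fixed 3652-day offset can land one day away from the exact calendar date 10 years earlier, which is harmless for this purpose.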

### Downloading and saving the data

To download the correct stocks we have to use the ticker that each one goes by. The stocks we want to scrape and their tickers are stored in a dictionary.

This dictionary is then used in a loop that downloads each stock, renames the columns to all lowercase, replaces spaces with underscores, and finally saves the result in a csv file.
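
The renaming step can be tried on its own with a toy dataframe. This is only a sketch: the column names imitate Yahoo-style headers and the values are made up.

```python
import pandas as pd

# Toy frame with Yahoo-style column names (values are made up)
toy = pd.DataFrame(
    {"Open": [1.0, 2.0], "Adj Close": [1.5, 2.5], "Volume": [100, 200]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
)

# Lowercase every column name, then replace spaces with underscores
toy.columns = toy.columns.str.lower().str.replace(" ", "_")
print(list(toy.columns))
```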



In [None]:
#%pip install yfinance

In [None]:
import yfinance as yf
import time
from datetime import datetime, timedelta
import pandas as pd

In [None]:
# Define the list of MAMAA stock tickers
mamaa_stocks = {
    'Meta': 'META',
    'Apple': 'AAPL',
    'Microsoft': 'MSFT',
    'Amazon': 'AMZN',
    'Alphabet': 'GOOGL',
    'SP500' : '^GSPC'
}

# Get today's date and the date from 10 years ago
end_date = datetime.now()
start_date = end_date - timedelta(days=3652)

# Loop through each stock ticker and download historical data
for company, ticker in mamaa_stocks.items():
    try:
        print(f"Fetching data for {company} ({ticker})...")
        
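        # Download daily stock data for the last 10 years.
        # auto_adjust=False keeps the separate 'Adj Close' column (newer yfinance releases default to True).
        stock_data = yf.download(ticker, start=start_date.strftime('%Y-%m-%d'), end=end_date.strftime('%Y-%m-%d'), auto_adjust=False)
        # Newer yfinance releases may return (field, ticker) MultiIndex columns; keep only the field names
        if isinstance(stock_data.columns, pd.MultiIndex):
            stock_data.columns = stock_data.columns.get_level_values(0)
        # Lowercase all column names and replace spaces with underscores
        stock_data.columns = stock_data.columns.str.lower().str.replace(" ", "_")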
        print(stock_data.columns)
        
        # Save data to CSV file
        csv_filename = f"{ticker}_last_decade.csv"
        stock_data.to_csv(csv_filename)
        
        print(f"Data for {company} ({ticker}) saved as {csv_filename}")
        
        # Add a short delay to avoid overwhelming the server
        time.sleep(2)  # 2-second pause between requests
        
    except Exception as e:
        print(f"Error fetching data for {company} ({ticker}): {e}")

print("Done fetching MAMAA stock data!")


# Cleaning

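Even though I already used the 'lower' function in the download, the date column was still capitalized. The reason is that yfinance returns the date as the index of the dataframe rather than as a column, so renaming the columns does not touch it. After saving to csv and loading the data again, the date is an ordinary column, so lowering the columns once more finally returns the date column without capital letters.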
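
The behaviour can be reproduced without downloading anything. The sketch below uses a made-up two-row frame whose dates sit in an index named 'Date', and an in-memory buffer in place of a real csv file.

```python
import io
import pandas as pd

# Toy frame with the dates in the index, not in a column (values are made up)
toy = pd.DataFrame(
    {"Close": [10.0, 11.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
)
toy.index.name = "Date"

# Lowercasing the columns leaves the index name untouched
toy.columns = toy.columns.str.lower()
print(list(toy.columns), toy.index.name)

# Round trip through csv: the index is written out as a regular 'Date' column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Now the date is a column, so lowercasing reaches it as well
reloaded.columns = reloaded.columns.str.lower()
print(list(reloaded.columns))
```

An alternative that avoids the second pass would be to rename the index directly before saving (for example by setting the index name to 'date'), but that is not what this notebook does.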

**This is only done on the SP500 data as this is the dataset that was used to train and predict.**

In [None]:
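# Load the S&P 500 data (assumes the csv written by the download loop was moved to Data/ and renamed)
csv_filename = 'Data/SP500_last_decade.csv'
df = pd.read_csv(csv_filename)
# After the csv round trip the date is an ordinary column, so it is lowercased too
df.columns = df.columns.str.lower()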

# Preparing data for Training

Since time-series models in PyCaret can use a datetime as the index of the dataframe, I set the date column as the index.

Next I drop the columns which I don't need for training, which is every column except 'close'. The reason is that the open, high and low columns have values very similar to the 'close' column that we are trying to predict, so keeping them would make the model look far more accurate than it would be in reality.

I also decided to leave out the 'adj_close' column because this is essentially the same value as the 'close' column, but with dividends taken into account.
That leaves us with a datetime index and our target, the 'close' column.
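
A minimal sketch of this preparation on a made-up two-row frame with the same column names as the cleaned data. Selecting the 'close' column directly is used here as an equivalent alternative to dropping the other columns one by one.

```python
import pandas as pd

# Toy frame with the same column names as the cleaned csv (values are made up)
toy = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03"],
    "open": [99.0, 100.5],
    "high": [101.0, 102.0],
    "low": [98.5, 100.0],
    "close": [100.0, 101.5],
    "adj_close": [99.8, 101.3],
    "volume": [1000, 1200],
})

# Parse the dates and move them into the index
toy["date"] = pd.to_datetime(toy["date"])
toy = toy.set_index("date")

# Keep only the target column
toy = toy[["close"]]
print(toy)
```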

## Time Series Consistency

Because the time-series models need a regular structure, I have to set the frequency of the data. Since we are using stock data, this means every business day ('B').
This ensures that the dataset aligns with a business day calendar (Monday to Friday), which is the standard in financial markets. Weekdays that are missing from the data (e.g., market holidays) will be added with NaN values. Weekends are not part of the business day calendar, so they are not added.
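
To illustrate what the business-day frequency does, here is a sketch on made-up prices in which one weekday (Wednesday 2024-01-03) is missing, standing in for a market holiday.

```python
import pandas as pd

# Made-up closing prices: Tue, Thu, Fri and the following Mon.
# Wednesday 2024-01-03 is missing, as it would be on a market holiday.
close = pd.DataFrame(
    {"close": [100.0, 101.0, 102.0, 103.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-04", "2024-01-05", "2024-01-08"]),
)

# Reindex to every business day between the first and last date
regular = close.asfreq("B")
print(regular)
```

The result has a row for the missing Wednesday holding NaN, while Saturday and Sunday do not appear at all.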

## Missing Values

Since the dataset does seem to be missing a few days, I am forward-filling the data. I chose to forward fill because stock prices don't change on non-trading days, which makes carrying the last known closing price forward a reasonable approach for imputation.
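
A short sketch of forward-filling on a made-up series of three business days with a gap in the middle:

```python
import pandas as pd

# Made-up closing prices on three consecutive business days; the middle one is missing
prices = pd.Series(
    [100.0, None, 101.0],
    index=pd.bdate_range("2024-01-02", periods=3),
    name="close",
)

# Carry the last known closing price forward into the gap
filled = prices.ffill()
print(filled)
```

Note that forward-filling cannot fill a gap at the very start of a series, since there is no earlier value to carry forward.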

In [None]:
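# Parse the dates and use them as the index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
# Keep only the target column 'close'
df = df.drop(columns=['adj_close', 'open', 'high', 'low', 'volume'])
# Reindex to business-day frequency; missing weekdays become NaN rows
df = df.asfreq('B')
# Forward-fill the gaps (fillna(method='ffill') is deprecated in recent pandas versions)
df['close'] = df['close'].ffill()
df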