# Retrieving the Dataset from the `yfinance` Python API

We mainly use the following Yahoo Finance tickers to extract futures price data:
- `CL=F` -> Priority (West Texas Intermediate crude oil futures)
- `GC=F` -> Priority (Gold Futures)

We use `yf.Ticker` to inspect the available columns. To build **regression models** on daily data, we want to retrieve the following fields:
* `Ticker:` Market quotation symbol for the future. In our case these are `CL=F` and `GC=F`, plus any other tickers that return gold and oil futures prices.
* `Commodity:` Name of the asset being analyzed. We have two: oil and gold. This is the key used when partitioning in AWS Athena for statistical arbitrage and momentum trading.
* `Date:` The date the data was recorded. Format `YYYY-MM-DD`.
* `Open:` Market opening price.
* `High:` Highest price during the trading day.
* `Low:` Lowest price during the trading day.
* `Close:` Market closing price.
* `Volume:` Number of contracts traded during the day.
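
The field list above can be sketched as a target schema. This is a minimal illustration, not the notebook's own code: the `COMMODITY_BY_TICKER` lookup, the `TARGET_COLUMNS` ordering, and the `to_target_schema` helper are hypothetical names introduced here, and the labels `"oil"` and `"gold"` are assumed values for the `Commodity` column. The two sample rows reuse the rounded `CL=F` values from the output further down.

```python
import pandas as pd

# Hypothetical lookup (assumption): maps each ticker to its commodity label.
COMMODITY_BY_TICKER = {"CL=F": "oil", "GC=F": "gold"}

# Column order taken from the field list above.
TARGET_COLUMNS = ["Ticker", "Commodity", "Date", "Open", "High", "Low", "Close", "Volume"]


def to_target_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Add the Commodity column, format Date as YYYY-MM-DD, and reorder columns."""
    out = df.copy()
    out["Commodity"] = out["Ticker"].map(COMMODITY_BY_TICKER)
    out["Date"] = pd.to_datetime(out["Date"]).dt.strftime("%Y-%m-%d")
    return out[TARGET_COLUMNS]


# Two CL=F rows, rounded from the notebook output below.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-04", "2010-01-05"]),
    "Close": [81.51, 81.77],
    "High": [81.68, 82.00],
    "Low": [79.63, 80.95],
    "Open": [79.63, 81.63],
    "Volume": [263542, 258887],
    "Ticker": ["CL=F", "CL=F"],
})

result = to_target_schema(sample)
print(result)
```

Keeping the ticker-to-commodity mapping in one dictionary means a new future only needs one extra entry before it can be partitioned by `Commodity`.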

In [2]:
# Import necessary modules
import yfinance as yf
import pandas as pd
from datetime import datetime

In [None]:
from typing import List, Union, Dict

# Let's create the different tickers for NYMEX oil and COMEX gold
future_tickers = ["CL=F","GC=F"]

def assert_yyyy_mm_dd(ticker_date: str) -> None:
    '''
    ChatGPT generated: Function that checks whether a ticker_date is
    in the format YYYY-MM-DD or not.
    Args:
        ticker_date (str): String containing the ticker_date in format
        YYYY-MM-DD.
    Raises:
        AssertionError: If ticker_date is not in the format YYYY-MM-DD.
    '''
    try:
        datetime.strptime(ticker_date, "%Y-%m-%d")
    except ValueError as e:
        print(e)
        raise AssertionError(f"Invalid date format, expected YYYY-MM-DD: {ticker_date}") from e

def get_stocks_data_yahoo(
    tickers: List[str],
    start_date: str,
    end_date: str) -> List[pd.DataFrame]:

    '''
    Function to download the data for multiple desired tickers using the Yahoo
    API through the yfinance module in Python.
    Args:
        - tickers (List[str]): A list of strings containing all of the necessary tickers.
        - start_date (str): Start of the date range, in the format YYYY-MM-DD.
        - end_date (str): End of the date range, in the format YYYY-MM-DD.
    Returns:
        List[pd.DataFrame]: List of pd.DataFrames, one per ticker, containing the
        following columns per DataFrame: Date, Close, High, Low, Open, Volume, Ticker.

    '''

    # Assert that the input is a list
    assert isinstance(tickers, list), f"Please use a list of proper tickers, current type is: {type(tickers)}"

    # Make sure both dates are in the correct format
    assert_yyyy_mm_dd(start_date)
    assert_yyyy_mm_dd(end_date)

    # Returns a pd.DataFrame whose columns are a MultiIndex of (col_name, ticker_name)
    data = yf.download(tickers=tickers, start=start_date, end=end_date)

    # Collect one DataFrame per ticker
    df_tickers_splitted = []

    for ticker in tickers:

        # Create a dict to hold this ticker's columns
        ticker_data = {}

        # Start with the original datetime index
        ticker_data["Date"] = data.index
        total_rows = len(ticker_data["Date"])

        # Iterate across each of the (col_name, ticker_name) column labels
        for col in data.columns:
            col_name, ticker_name = col[0], col[1]

            # Keep only the columns that belong to the current ticker
            if ticker_name == ticker:
                # Assign the proper ticker data to col_name
                ticker_data[col_name] = data[col].to_list()

        # At the very end of the loop, add a new column called 'Ticker'
        ticker_data["Ticker"] = [ticker] * total_rows

        # Build the per-ticker DataFrame and store it
        df_to_pass = pd.DataFrame(ticker_data)
        df_tickers_splitted.append(df_to_pass)
    return df_tickers_splitted
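
The per-ticker split above can also be sketched with `DataFrame.xs`, which selects one ticker from the second column level without looping over every column. This is only an illustration under stated assumptions: `split_by_ticker` is a hypothetical helper, and the `synthetic` frame is a small stand-in that mimics the two-level `(price field, ticker)` column layout assumed for a multi-ticker `yf.download` result, so the example runs without network access. The `CL=F` numbers are rounded from the output below; the `GC=F` numbers are placeholders, not real quotes.

```python
import pandas as pd


def split_by_ticker(data: pd.DataFrame) -> dict:
    """Split a frame with (price field, ticker) columns into one frame per ticker."""
    frames = {}
    for ticker in data.columns.get_level_values(1).unique():
        # xs drops the ticker level, leaving plain price-field columns
        frame = data.xs(ticker, axis=1, level=1).reset_index()
        frame["Ticker"] = ticker
        frames[ticker] = frame
    return frames


# Synthetic stand-in for the downloaded data (assumed column layout).
index = pd.DatetimeIndex(["2010-01-04", "2010-01-05"], name="Date")
columns = pd.MultiIndex.from_product(
    [["Close", "Open"], ["CL=F", "GC=F"]], names=["Price", "Ticker"]
)
synthetic = pd.DataFrame(
    [
        [81.51, 1000.0, 79.63, 1001.0],  # GC=F values are placeholders
        [81.77, 1002.0, 81.63, 1003.0],
    ],
    index=index,
    columns=columns,
)

frames = split_by_ticker(synthetic)
print(frames["CL=F"])
```

Returning a dictionary keyed by ticker, rather than a list, lets callers look up `frames["GC=F"]` directly instead of relying on list position.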

In [7]:
# Download West Texas Intermediate crude oil and gold futures data
data = get_stocks_data_yahoo(
    tickers=future_tickers, start_date="2010-01-01", end_date="2026-01-29")

[*********************100%***********************]  2 of 2 completed


In [8]:
data

[           Date      Close       High        Low       Open  Volume Ticker
 0    2010-01-04  81.510002  81.680000  79.629997  79.629997  263542   CL=F
 1    2010-01-05  81.769997  82.000000  80.949997  81.629997  258887   CL=F
 2    2010-01-06  83.180000  83.519997  80.849998  81.430000  370059   CL=F
 3    2010-01-07  82.660004  83.360001  82.260002  83.199997  246632   CL=F
 4    2010-01-08  82.750000  83.470001  81.800003  82.650002  310377   CL=F
 ...         ...        ...        ...        ...        ...     ...    ...
 4037 2026-01-22  59.360001  60.820000  58.959999  60.680000  324349   CL=F
 4038 2026-01-23  61.070000  61.360001  59.520000  59.660000  283419   CL=F
 4039 2026-01-26  60.630001  61.709999  60.320000  61.220001  281062   CL=F
 4040 2026-01-27  62.389999  62.630001  60.139999  60.779999  360208   CL=F
 4041 2026-01-28  63.209999  63.570000  62.070000  62.580002  360208   CL=F
 
 [4042 rows x 7 columns],
            Date        Close         High          Low     

In [41]:
data.index

DatetimeIndex(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
               '2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
               '2010-01-14', '2010-01-15',
               ...
               '2026-01-14', '2026-01-15', '2026-01-16', '2026-01-20',
               '2026-01-21', '2026-01-22', '2026-01-23', '2026-01-26',
               '2026-01-27', '2026-01-28'],
              dtype='datetime64[ns]', name='Date', length=4042, freq=None)