Extract and Transform Notebook

Description:
This notebook downloads the [Stock Market Dataset](https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset/data) from Kaggle (historical daily prices of Nasdaq-traded stocks and ETFs) and transforms the data into a usable format.


In [2]:
# Import the necessary dependencies.
import pandas as pd
import os
import zipfile
import shutil

In [3]:
# Download the dataset from Kaggle
# !kaggle datasets download -d jacksoncrow/stock-market-dataset -p "C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\raw_data"

In [7]:
# Download the metadata
# !kaggle datasets metadata jacksoncrow/stock-market-dataset -p "C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\raw_data"

Downloaded metadata to C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\raw_data\dataset-metadata.json


In [8]:
# Path to the downloaded zip file and the folder to extract the files to.
zipped_file = r"C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\raw_data\stock-market-dataset.zip"
extracted = r"C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\extracted"
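The absolute paths above only work on the machine they were written on. As a minimal sketch, the same locations could be derived from a single project root with `pathlib`. The helper name `project_paths` is hypothetical; the folder names (`data/raw_data`, `data/extracted`, `data/merged`) are taken from the paths used in this notebook.

```python
from pathlib import Path


def project_paths(project_root):
    """Build the data paths used in this notebook relative to one root folder.

    `project_paths` is an illustrative helper, not part of the original notebook.
    """
    data = Path(project_root) / "data"
    return {
        "zipped_file": data / "raw_data" / "stock-market-dataset.zip",
        "extracted": data / "extracted",
        "merged": data / "merged",
    }


# Assumes the notebook is run from the project root; adjust as needed.
paths = project_paths(Path.cwd())
print(paths["zipped_file"])
```

`Path` objects can be passed directly to `zipfile.ZipFile`, `os.makedirs`, and `pd.read_csv`, so the rest of the notebook would not need to change.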

In [9]:
# Extracting the zip file
if not os.path.exists(extracted):
    os.makedirs(extracted)
    print(f"Directory created: '{extracted}'")
else:
    print(f"Directory '{extracted}' already exists.")

# Extract files to the destination directory
with zipfile.ZipFile(zipped_file, 'r') as zip_ref:
    zip_ref.extractall(path=extracted)

print(f"All files extracted to: '{extracted}'")

Directory 'C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\extracted' already exists.


FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\USER\\Desktop\\Projects\\Data-Profiling-and-Quality-Testing\\data\\extracted\\etfs\\PRN.csv'

Inspecting the zip file manually shows that the error above occurs because PRN is a reserved device name on Windows (it refers to the printer), so a file named PRN.csv cannot be created there.
To extract the archive, PRN.csv is therefore renamed to PRN_FILE.csv during extraction.
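PRN is not the only reserved name, so other tickers could trigger the same error. The sketch below generalises the rename to the commonly documented set of Windows device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9). The helper name `safe_member_name` and the `_FILE` suffix default are illustrative choices that mirror the PRN_FILE.csv rename above.

```python
# Device names Windows reserves regardless of file extension
# (commonly documented set; treated here as an assumption).
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def safe_member_name(member_name, suffix="_FILE"):
    """Return a zip member name whose base name is not a reserved device name.

    Hypothetical helper: 'etfs/PRN.csv' becomes 'etfs/PRN_FILE.csv'.
    """
    # Zip member names always use forward slashes as separators.
    directory, sep, base = member_name.rpartition("/")
    # Windows checks the part of the name before the first dot.
    stem = base.split(".", 1)[0]
    if stem.upper() in WINDOWS_RESERVED:
        base = stem + suffix + base[len(stem):]
    return directory + sep + base


print(safe_member_name("etfs/PRN.csv"))  # etfs/PRN_FILE.csv
print(safe_member_name("etfs/SPY.csv"))  # etfs/SPY.csv
```

Unlike a plain `str.replace('PRN.csv', ...)`, this only touches members whose whole base name is reserved, so a file such as `XPRN.csv` would be left unchanged.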

In [11]:
# Extract the zip file
with zipfile.ZipFile(zipped_file, 'r') as zip_ref:
    for file in zip_ref.infolist():
        original_filename = file.filename
        # Rename PRN.csv, whose base name is reserved on Windows, so it can be written to disk
        filename = original_filename.replace('PRN.csv', 'PRN_FILE.csv')
        
        # Define the full path for the extracted file
        path = os.path.join(extracted, filename)

        # Ensure the directory the file will be extracted into exists.
        directory = os.path.dirname(path)
        os.makedirs(directory, exist_ok=True)
    
        # Copy the member to path, overwriting any existing file.
        with zip_ref.open(file) as source, open(path, 'wb') as target:
            shutil.copyfileobj(source, target)

print(f"All files extracted to: {extracted}")


All files extracted to: C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\extracted
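Because the extraction is done by hand rather than with `extractall`, it can be worth confirming that every member of the archive actually landed on disk. The following is a minimal sketch: the function name `find_missing_members` and its `rename` argument are hypothetical, and the rename is applied with `str.replace` in the same way as the extraction cell above.

```python
import os
import zipfile


def find_missing_members(zip_path, dest, rename=None):
    """Return the zip members that have no same-sized file under dest.

    rename maps substrings of member names to their replacements,
    for example {'PRN.csv': 'PRN_FILE.csv'}.
    """
    rename = rename or {}
    missing = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no data to compare
            name = info.filename
            for old, new in rename.items():
                name = name.replace(old, new)
            target = os.path.join(dest, name)
            # info.file_size is the uncompressed size recorded in the archive.
            if not os.path.isfile(target) or os.path.getsize(target) != info.file_size:
                missing.append(info.filename)
    return missing
```

Calling `find_missing_members(zipped_file, extracted, {'PRN.csv': 'PRN_FILE.csv'})` should return an empty list if the extraction above completed.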


In [15]:
# Merge the stock files into one
def merge_stock_files(input_path, output_file):
    """
    Merge multiple stock CSV files in a directory into a single CSV file.
    A new column 'Stock' is added to the merged CSV, containing the stock name derived from each file name.
    
    Args:
    input_path: The path to the directory containing all the CSV files to be merged.
    output_file: The path where the merged CSV file will be saved.
    """

    # Empty list to collect one dataframe per stock file.
    stocks = []

    # Iterate over each file in the input path.
    for filename in os.listdir(input_path):
        if filename.endswith('.csv'):
            # Define the full path to the file
            file_path = os.path.join(input_path, filename)

            # Read the CSV file into a dataframe
            df = pd.read_csv(file_path)

            # Extract the stock name from the filename
            stock_name = filename.replace('.csv', '')

            # Add a new column with the stock name
            df['Stock'] = stock_name

            # Append the dataframe to the list
            stocks.append(df)

    # Concatenate all dataframes into one, renumbering rows so the index is unique
    combined_df = pd.concat(stocks, ignore_index=True)

    # Save the combined dataframe to a CSV file without the index column
    combined_df.to_csv(output_file, index=False)

    print(f"Combined CSV file created at '{output_file}'")
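The function above holds every file in memory before writing. If that becomes a problem, one alternative is to stream rows with the standard `csv` module instead of pandas. This is a sketch of a different technique, not the notebook's method: the name `merge_csv_streaming` is hypothetical, and it assumes every input file has the same header and that the output file is not inside `input_path`.

```python
import csv
import os


def merge_csv_streaming(input_path, output_file, label_column="Stock"):
    """Append every CSV in input_path to output_file row by row.

    A label column holding the file name (without '.csv') is added to each row.
    Assumes all input files share the header of the first file read.
    """
    out_dir = os.path.dirname(output_file)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    header_written = False
    with open(output_file, "w", newline="") as target:
        writer = csv.writer(target)
        for filename in sorted(os.listdir(input_path)):
            if not filename.endswith(".csv"):
                continue
            label = filename[: -len(".csv")]
            with open(os.path.join(input_path, filename), newline="") as source:
                reader = csv.reader(source)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header + [label_column])
                    header_written = True
                # Only one row is held in memory at a time.
                for row in reader:
                    writer.writerow(row + [label])
```

The trade-off is that no type parsing or validation happens during the merge; values are copied through as text.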


In [17]:
# Merge the ETF CSV files
input_path = r"C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\extracted\etfs"
output_file = r"C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\merged\etfs\etfs.csv"
merge_stock_files(input_path, output_file)

Combined CSV file created at 'C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\merged\etfs\etfs.csv'


In [18]:
# Merge the stock CSV files
input_path = r"C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\extracted\stocks"
output_file = r"C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\merged\stocks\stocks.csv"
merge_stock_files(input_path, output_file)

Combined CSV file created at 'C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\merged\stocks\stocks.csv'
