# 01_data_exploration.ipynb

## Project: Bitcoin in a 3-Asset Portfolio (BTC, S&P 500, Gold)

### Main Objective:
The objective of this notebook is to build and prepare two datasets that will be used in the replication of Philipp Schottler's study on Bitcoin in a diversified portfolio.

This notebook is focused on data import, cleaning, transformation, and aggregation to a monthly frequency.

---

## Study Replication:
Philipp Schottler analyzes the role of Bitcoin within a diversified portfolio consisting of Bitcoin, the S&P 500, and Gold. His study focuses on the period between September 2014 and November 2021.

---

## This notebook will create two datasets:

1. `final_df_study`: Dataset restricted to the original study period → from September 2014 to November 2021.

2. `final_df_extended`: Extended dataset including all available data → from September 2014 to February 2025.
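
As a quick sanity check on the two periods above, the following sketch counts how many monthly rows each dataset should contain. It assumes both endpoints are inclusive and uses zero-padded `yyyy-MM` labels; the helper name `month_labels` is hypothetical and not part of the notebook.

```python
def month_labels(start: str, end: str) -> list[str]:
    """Return the inclusive list of 'yyyy-MM' labels from start to end."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    labels = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:  # roll over to January of the next year
            year, month = year + 1, 1
    return labels


study_months = month_labels("2014-09", "2021-11")
extended_months = month_labels("2014-09", "2025-02")

# Number of monthly rows expected in each dataset
print(len(study_months), len(extended_months))
```

Because the labels are zero-padded, sorting them as plain text also sorts them chronologically, so simple string comparisons such as `"2014-09" <= month <= "2021-11"` are enough to select a period.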

---

## This notebook covers:
- Import and cleaning of data for:
  - Bitcoin (BTC)
  - S&P 500
  - Gold (XAU/USD)
  - CPI (Inflation Index)

- Transformation of the data to monthly frequency.

- Creation of the final datasets ready for:
  - Study Replication
  - Extended Analysis (including recent data)
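
The monthly transformation listed above can be sketched without any library, just to make the rule explicit: the monthly close is the close of the last observation in the month, the monthly high is the maximum high, and the monthly low is the minimum low. The rows in `daily_bars` are toy values invented for illustration, and `to_monthly` is a hypothetical helper, not a function from this notebook.

```python
from collections import defaultdict

# Hypothetical daily bars: (date 'yyyy-MM-dd', high, low, close). Toy values, not real prices.
daily_bars = [
    ("2020-01-02", 10.5, 9.8, 10.0),
    ("2020-01-31", 11.2, 10.1, 11.0),
    ("2020-02-03", 11.5, 10.7, 11.3),
    ("2020-02-28", 12.0, 10.9, 11.8),
]


def to_monthly(bars):
    """Aggregate bars to one record per 'yyyy-MM' month."""
    by_month = defaultdict(list)
    for date, high, low, close in bars:
        by_month[date[:7]].append((date, high, low, close))

    monthly = {}
    for month, rows in sorted(by_month.items()):
        rows.sort(key=lambda row: row[0])  # chronological order within the month
        monthly[month] = {
            "close": rows[-1][3],                # close of the last bar in the month
            "high": max(row[1] for row in rows),  # highest high of the month
            "low": min(row[2] for row in rows),   # lowest low of the month
        }
    return monthly


monthly = to_monthly(daily_bars)
for month, values in monthly.items():
    print(month, values)
```

Sorting inside each month before taking the last row matters: without an explicit order, "last" depends on how the rows happen to arrive.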

---

## The full analysis will be developed in the next notebook:

> `02_study_replication.ipynb`

---


In [7]:
import os
import time
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [8]:
# Create a temporary folder for Spark
temp_path = os.path.join(os.getcwd(), 'spark-temp')
os.makedirs(temp_path, exist_ok=True)

# Set environment variables (conda on Windows keeps the JDK under <env>\Library)
os.environ['JAVA_HOME'] = os.path.join(os.environ['CONDA_PREFIX'], 'Library')
os.environ['SPARK_LOCAL_DIRS'] = temp_path

print('JAVA_HOME:', os.environ.get('JAVA_HOME'))
print('SPARK_LOCAL_DIRS:', os.environ.get('SPARK_LOCAL_DIRS'))

JAVA_HOME: C:\anaconda\envs\pyspark_env1\Library
SPARK_LOCAL_DIRS: C:\Users\TESTER\Desktop\Laboral\GIT\btc-3-asset-portfolio-extension\notebooks\spark-temp


In [9]:
# Create the Spark session and time it
start_time = time.time()

spark = (
    SparkSession.builder
    .master("local")
    .config('spark.local.dir', temp_path)
    .appName('btcproject')
    .getOrCreate()
)

# Optional memory settings (disabled):
# .config('spark.driver.memory', '8g')
# .config('spark.executor.memory', '8g')

spark.sparkContext.setLogLevel('ERROR')

end_time = time.time()
print('Spark Version:', spark.version)
print(f'Total time to create SparkSession: {round(end_time - start_time, 2)} seconds')

Spark Version: 3.5.5
Total time to create SparkSession: 0.04 seconds


# Data Cleaning
## BTC Dataset

In [10]:
df_btc = spark.read \
.option("header", True) \
.option("sep", ";") \
.option("inferSchema", True) \
.csv('../data/BTC_All_graph_coinmarketcap.csv')



df_btc.printSchema()
df_btc.select(
    F.min("timestamp").alias("Fecha Minima"),
    F.max("timestamp").alias("Fecha Maxima")
).show()

root
 |-- timeOpen: timestamp (nullable = true)
 |-- timeClose: timestamp (nullable = true)
 |-- timeHigh: timestamp (nullable = true)
 |-- timeLow: timestamp (nullable = true)
 |-- name: integer (nullable = true)
 |-- open: double (nullable = true)
 |-- high: double (nullable = true)
 |-- low: double (nullable = true)
 |-- close: double (nullable = true)
 |-- volume: double (nullable = true)
 |-- marketCap: double (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+--------------------+--------------------+
|        Fecha Minima|        Fecha Maxima|
+--------------------+--------------------+
|2014-10-01 01:59:...|2025-04-01 01:59:...|
+--------------------+--------------------+



In [11]:
# Limit in Spark first so only 10 rows are sent to pandas
df_btc.orderBy(F.col("timeOpen").asc()).limit(10).toPandas()

Unnamed: 0,timeOpen,timeClose,timeHigh,timeLow,name,open,high,low,close,volume,marketCap,timestamp
0,2014-09-01 02:00:00,2014-10-01 01:59:59.999,2014-09-04 17:19:00,2014-09-29 09:03:59,2781,477.786987,493.928009,372.23999,386.944,34707300.0,5158621000.0,2014-10-01 01:59:59.999
1,2014-10-01 02:00:00,2014-11-01 00:59:59.999,2014-10-14 08:04:00,2014-10-05 18:34:00,2781,387.427002,411.697998,289.29599,338.321014,12545400.0,4549893000.0,2014-11-01 00:59:59.999
2,2014-11-01 01:00:00,2014-12-01 00:59:59.999,2014-11-13 06:19:00,2014-11-02 12:59:00,2781,338.649994,457.092987,320.626007,378.046997,9194440.0,5125958000.0,2014-12-01 00:59:59.999
3,2014-12-01 01:00:00,2015-01-01 00:59:59.999,2014-12-02 10:54:00,2014-12-18 09:09:00,2781,378.248993,384.037994,304.231995,320.192993,13942900.0,4377511000.0,2015-01-01 00:59:59.999
4,2015-01-01 01:00:00,2015-02-01 00:59:59.999,2015-01-01 01:04:00,2015-01-14 23:54:01,2781,320.434998,320.434998,171.509995,217.464005,23348200.0,2997692000.0,2015-02-01 00:59:59.999
5,2015-02-01 01:00:00,2015-03-01 00:59:59.999,2015-02-15 10:49:21,2015-02-01 15:34:01,2781,216.867004,265.610992,212.014999,254.263,13949300.0,3531777000.0,2015-03-01 00:59:59.999
6,2015-03-01 01:00:00,2015-04-01 01:59:59.999,2015-03-10 16:04:19,2015-03-25 08:34:21,2781,254.283005,300.044006,236.514999,244.223999,22672000.0,3420113000.0,2015-04-01 01:59:59.999
7,2015-04-01 02:00:00,2015-05-01 01:59:59.999,2015-04-06 04:34:22,2015-04-26 18:04:23,2781,244.223007,261.798004,214.873993,236.145004,33818600.0,3332095000.0,2015-05-01 01:59:59.999
8,2015-05-01 02:00:00,2015-06-01 01:59:59.999,2015-05-09 09:59:19,2015-05-07 09:19:19,2781,235.938995,247.804001,228.572998,230.190002,14730800.0,3273756000.0,2015-06-01 01:59:59.999
9,2015-06-01 02:00:00,2015-07-01 01:59:59.999,2015-06-30 17:49:22,2015-06-01 23:59:21,2781,230.233002,267.867004,221.296005,263.071991,44533800.0,3770230000.0,2015-07-01 01:59:59.999


In [12]:
from pyspark.sql.window import Window

In [13]:
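# Label each row with its calendar month (taken from the bar's opening timestamp)
df_btc = df_btc.withColumn("month", F.date_format(F.col("timeOpen"), "yyyy-MM"))

# The source file already holds one row per month, so each group contains a single row.
# Note: F.last has no guaranteed order inside groupBy; with several rows per month,
# a window ordered by timestamp would be needed instead.
monthly_df_btc = (
    df_btc.groupBy("month")
    .agg(
        F.last("close").alias("close_btc"),
        F.max("high").alias("high_btc"),
        F.min("low").alias("low_btc")
    )
    .orderBy("month")
)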

monthly_df_btc.show()

monthly_df_btc.select(
    F.min("month").alias("min_month"),
    F.max("month").alias("max_month")
).show()


+-------+--------------+--------------+--------------+
|  month|     close_btc|      high_btc|       low_btc|
+-------+--------------+--------------+--------------+
|2014-09|386.9440002441|493.9280090332|372.2399902344|
|2014-10|338.3210144043|411.6979980469|289.2959899902|
|2014-11|378.0469970703|457.0929870605|320.6260070801|
|2014-12|320.1929931641|384.0379943848|304.2319946289|
|2015-01|217.4640045166|320.4349975586|171.5099945068|
|2015-02|254.2630004883|265.6109924316|212.0149993896|
|2015-03|244.2239990234|300.0440063477|236.5149993896|
|2015-04|236.1450042725|261.7980041504|214.8739929199|
|2015-05|230.1900024414|247.8040008545|228.5729980469|
|2015-06|263.0719909668|267.8670043945| 221.296005249|
|2015-07|284.6499938965|314.3940124512|253.5050048828|
|2015-08|230.0559997559|285.7149963379|199.5670013428|
|2015-09|236.0599975586|259.1820068359|225.1170043945|
|2015-10|314.1659851074|334.1690063477|235.6159973145|
|2015-11|377.3210144043|495.5620117188|300.9970092773|
|2015-12| 

In [14]:
monthly_df_btc_study = monthly_df_btc.filter(
    (F.col("month") >= "2014-09") & (F.col("month") <= "2021-11")
)

monthly_df_btc_extended = monthly_df_btc.filter(
    (F.col("month") >= "2014-09") & (F.col("month") <= "2025-03")
)

monthly_df_btc_new = monthly_df_btc.filter(
    (F.col("month") >= "2021-11") & (F.col("month") <= "2025-03")
)

monthly_df_btc_study.show(1)
monthly_df_btc_study.orderBy(F.col("month").desc()).limit(1).show()

+-------+--------------+--------------+--------------+
|  month|     close_btc|      high_btc|       low_btc|
+-------+--------------+--------------+--------------+
|2014-09|386.9440002441|493.9280090332|372.2399902344|
+-------+--------------+--------------+--------------+
only showing top 1 row

+-------+----------------+----------------+----------------+
|  month|       close_btc|        high_btc|         low_btc|
+-------+----------------+----------------+----------------+
|2021-11|57005.4254735998|68789.6259389221|53569.7639630968|
+-------+----------------+----------------+----------------+



# Bitcoin Price CAGR (2014 - 2021)

## Data Used

- Initial Date: October 1, 2014 (close of the September 2014 monthly bar)  
- Initial Price: 386.94 USD  

- Final Date: December 1, 2021 (close of the November 2021 monthly bar)  
- Final Price: 57,005.43 USD  

- Period: 86 months ≈ 7.17 years  

---

## CAGR Formula

CAGR = (Final Value / Initial Value) ^ (1 / Number of years) - 1

Applying the values, with the period expressed as 86 / 12 years:

CAGR = (57,005.43 / 386.94) ^ (1 / (86 / 12)) - 1

Result:

CAGR ≈ 1.00701 → 100.70% per year

---

## Interpretation

The CAGR of Bitcoin between the September 2014 close and the November 2021 close was approximately:

### 100.70% per year

This means that, on average, Bitcoin roughly doubled in value every year during this period.

That is an extraordinary return, very uncommon in traditional markets.
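
The calculation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `cagr` is hypothetical, the prices are the unrounded monthly closes shown in the study-period output, and the period is assumed to be 86 months (September 2014 close to November 2021 close).

```python
def cagr(initial_value: float, final_value: float, years: float) -> float:
    """Compound annual growth rate: (final / initial) ** (1 / years) - 1."""
    if initial_value <= 0 or years <= 0:
        raise ValueError("initial_value and years must be positive")
    return (final_value / initial_value) ** (1 / years) - 1


months = 86  # assumed holding period in months
btc_cagr = cagr(386.9440002441, 57005.4254735998, months / 12)
print(f"BTC CAGR: {btc_cagr:.2%}")
```

The same helper applies unchanged to the S&P 500 and gold closes, which keeps the three figures comparable because they all use the same period length.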


## S&P 500 Dataset

In [15]:
df_sp500 = spark.read \
.option("header", True) \
.option("inferSchema", True) \
.csv('../data/VUSA ETF Stock Price History.csv')

df_sp500.printSchema()
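# Note: "Date" is still a string (MM/dd/yyyy) here, so min/max compare text
# lexicographically and do not give the true chronological range.
df_sp500.select(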
    F.min("Date").alias("Fecha Minima"),
    F.max("Date").alias("Fecha Maxima")
).show()

root
 |-- Date: string (nullable = true)
 |-- Price: double (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Vol.: string (nullable = true)
 |-- Change %: string (nullable = true)

+------------+------------+
|Fecha Minima|Fecha Maxima|
+------------+------------+
|  01/01/2013|  12/01/2024|
+------------+------------+



In [16]:
df_sp500 = df_sp500.withColumn("Date", F.to_date("Date", "MM/dd/yyyy"))

# Limit in Spark first so only 10 rows are sent to pandas
df_sp500.orderBy(F.col("Date").asc()).limit(10).toPandas()

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Change %
0,2012-06-01,16.41,16.14,16.44,15.94,11.11K,2.02%
1,2012-07-01,16.81,16.52,16.89,16.42,0.56K,2.44%
2,2012-08-01,16.93,16.76,17.23,16.68,23.62K,0.77%
3,2012-09-01,16.92,16.89,17.24,16.83,26.52K,-0.09%
4,2012-10-01,16.61,17.02,17.3,16.56,52.18K,-1.80%
5,2012-11-01,16.83,16.6,17.06,16.12,22.20K,1.29%
6,2012-12-01,16.36,16.86,17.23,16.37,19.67K,-2.79%
7,2013-01-01,17.98,16.84,18.26,16.66,29.32K,9.89%
8,2013-02-01,19.1,18.03,19.25,18.02,64.96K,6.23%
9,2013-03-01,19.6,19.01,19.99,19.01,139.10K,2.62%


In [17]:
# Check: rows are stamped on the first day of each month, so a month-end date returns nothing
df_sp500.filter(F.col("Date") == "2014-01-31").show()

+----+-----+----+----+---+----+--------+
|Date|Price|Open|High|Low|Vol.|Change %|
+----+-----+----+----+---+----+--------+
+----+-----+----+----+---+----+--------+



In [18]:
# Convert volume strings such as "11.11K" to numbers (only the "K" suffix is handled here)
df_sp500 = df_sp500.withColumnRenamed("Vol.", "Volume")
df_sp500 = df_sp500.withColumn("Volume", F.regexp_replace("Volume", "K", "").cast("double") * 1000)

# Month label used as the grouping key
df_sp500 = df_sp500.withColumn("month", F.date_format(F.col("Date"), "yyyy-MM"))

# Rank the rows of each month from latest to earliest date
windowSpec_close = Window.partitionBy("month").orderBy(F.col("Date").desc())

# Monthly close = price of the latest row in each month
last_close_df = df_sp500.withColumn("row_number", F.row_number().over(windowSpec_close)) \
                        .filter(F.col("row_number") == 1) \
                        .select("month", F.col("Price").alias("close_sp500"))

# Monthly high and low
agg_df = df_sp500.groupBy("month").agg(
    F.max("High").alias("high_sp500"),
    F.min("Low").alias("low_sp500")
)

# Combine close, high and low into one row per month
monthly_df_sp500 = last_close_df.join(agg_df, on="month").orderBy("month")

# Show result
monthly_df_sp500.show()


+-------+-----------+----------+---------+
|  month|close_sp500|high_sp500|low_sp500|
+-------+-----------+----------+---------+
|2012-06|      16.41|     16.44|    15.94|
|2012-07|      16.81|     16.89|    16.42|
|2012-08|      16.93|     17.23|    16.68|
|2012-09|      16.92|     17.24|    16.83|
|2012-10|      16.61|      17.3|    16.56|
|2012-11|      16.83|     17.06|    16.12|
|2012-12|      16.36|     17.23|    16.37|
|2013-01|      17.98|     18.26|    16.66|
|2013-02|       19.1|     19.25|    18.02|
|2013-03|       19.6|     19.99|    19.01|
|2013-04|      19.47|      19.8|    18.58|
|2013-05|      20.82|     21.29|    19.36|
|2013-06|      20.15|     20.55|    19.26|
|2013-07|      21.23|     21.31|    19.99|
|2013-08|      20.12|     21.49|    19.99|
|2013-09|      19.75|     20.58|    19.73|
|2013-10|      20.89|     21.07|    19.65|
|2013-11|      21.07|     21.38|    20.73|
|2013-12|      21.16|     21.48|    20.55|
|2014-01|      20.67|     21.53|    20.41|
+-------+--

In [19]:
monthly_df_sp500_study = monthly_df_sp500.filter(
    (F.col("month") >= "2014-09") & (F.col("month") <= "2021-11")
)

monthly_df_sp500_extended = monthly_df_sp500.filter(
    (F.col("month") >= "2014-09") & (F.col("month") <= "2025-02")
)

monthly_df_sp500_new = monthly_df_sp500.filter(
    (F.col("month") >= "2021-11") & (F.col("month") <= "2025-02")
)


monthly_df_sp500_study.show(1)
monthly_df_sp500_study.orderBy(F.col("month").desc()).limit(1).show()


+-------+-----------+----------+---------+
|  month|close_sp500|high_sp500|low_sp500|
+-------+-----------+----------+---------+
|2014-09|      23.26|     23.75|    22.91|
+-------+-----------+----------+---------+
only showing top 1 row

+-------+-----------+----------+---------+
|  month|close_sp500|high_sp500|low_sp500|
+-------+-----------+----------+---------+
|2021-11|      65.81|     67.24|    63.55|
+-------+-----------+----------+---------+



### S&P 500 CAGR from 2014-09 to 2021-11

- Initial price: 23.26
- Final price: 65.81
- Years: 7.17
- CAGR: 15.6% per year


In [20]:
# Get the first and last month of the study period
min_max_month = monthly_df_sp500_study.agg(
    F.min("month").alias("start_month"),
    F.max("month").alias("end_month")
).collect()[0]

start_month = min_max_month["start_month"]
end_month = min_max_month["end_month"]

# Get the closing prices for those two months
start_price = monthly_df_sp500_study.filter(F.col("month") == start_month).select("close_sp500").collect()[0][0]
end_price = monthly_df_sp500_study.filter(F.col("month") == end_month).select("close_sp500").collect()[0][0]

# Compute the number of years between the two months
from dateutil import relativedelta
import datetime

start_date = datetime.datetime.strptime(start_month, "%Y-%m")
end_date = datetime.datetime.strptime(end_month, "%Y-%m")

diff = relativedelta.relativedelta(end_date, start_date)
years = diff.years + diff.months / 12

# Compute the CAGR
CAGR = (end_price / start_price) ** (1 / years) - 1

print(f"S&P 500 CAGR from {start_month} to {end_month}: {CAGR:.4%}")


S&P 500 CAGR from 2014-09 to 2021-11: 15.6180%


## Gold (XAU/USD) Dataset

In [21]:
df_gold = spark.read \
.option("header", True) \
.option("sep", ";") \
.option("inferSchema", True) \
.csv('../data/XAU_USD Historical Data.csv')


df_gold.printSchema()
df_gold.select(
    F.min("Date").alias("Fecha Minima"),
    F.max("Date").alias("Fecha Maxima")
).show()

root
 |-- Date: string (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)

+----------------+----------------+
|    Fecha Minima|    Fecha Maxima|
+----------------+----------------+
|2004.06.11 07:15|2025.02.03 12:30|
+----------------+----------------+



In [22]:
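# Limit in Spark first: this file holds 15-minute bars, so converting everything to pandas is costly
df_gold.orderBy(F.col("Date").asc()).limit(10).toPandas()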

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2004.06.11 07:15,384.0,384.3,383.8,384.3,12
1,2004.06.11 07:30,383.8,384.3,383.6,383.8,12
2,2004.06.11 07:45,383.3,383.8,383.3,383.8,20
3,2004.06.11 08:00,383.8,384.1,383.6,383.6,8
4,2004.06.11 08:15,383.6,384.3,383.5,383.5,20
5,2004.06.11 08:30,383.3,383.5,383.3,383.5,7
6,2004.06.11 08:45,383.3,383.3,383.1,383.1,6
7,2004.06.11 09:00,383.1,384.1,383.1,383.6,9
8,2004.06.11 09:15,383.8,384.0,382.8,383.0,15
9,2004.06.11 09:30,383.1,383.3,382.8,383.0,20


In [23]:
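# Parse the 15-minute bar timestamp
df_gold = df_gold.withColumn("timestamp", F.to_timestamp("Date", "yyyy.MM.dd HH:mm"))

# Calendar day of each bar
df_gold = df_gold.withColumn("Date_only", F.to_date("timestamp"))

# Rank the bars of each day from latest to earliest
windowSpec_gold = Window.partitionBy("Date_only").orderBy(F.col("timestamp").desc())

df_gold = df_gold.withColumn("row_number", F.row_number().over(windowSpec_gold))

# Keep only the last bar of each day.
# Caveat: Open/High/Low/Volume then describe that single 15-minute bar, not the whole day.
df_gold = (
    df_gold.filter(F.col("row_number") == 1)
           .drop("row_number", "timestamp", "Date")
)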


df_gold.show(10)
df_gold.printSchema()

+-----+-----+-----+-----+------+----------+
| Open| High|  Low|Close|Volume| Date_only|
+-----+-----+-----+-----+------+----------+
|384.1|384.1|384.1|384.1|     1|2004-06-11|
|382.6|382.8|382.6|382.8|     4|2004-06-14|
|388.6|388.6|387.1|388.6|     5|2004-06-15|
|383.6|383.8|383.6|383.8|     4|2004-06-16|
|389.1|389.1|387.3|387.6|     4|2004-06-17|
|394.3|394.3|394.3|394.3|     2|2004-06-18|
|393.0|393.1|393.0|393.1|     5|2004-06-21|
|394.3|394.3|394.1|394.1|     2|2004-06-22|
|395.8|395.8|395.6|395.6|     2|2004-06-23|
|401.1|401.1|401.1|401.1|     1|2004-06-24|
+-----+-----+-----+-----+------+----------+
only showing top 10 rows

root
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Date_only: date (nullable = true)



In [24]:
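# Month label used as the grouping key
df_gold = df_gold.withColumn("month", F.date_format(F.col("Date_only"), "yyyy-MM"))

# Rank the days of each month from latest to earliest
windowSpec_close_gold = Window.partitionBy("month").orderBy(F.col("Date_only").desc())

# Monthly close = close of the last available day in each month
last_close_gold = df_gold.withColumn("row_number", F.row_number().over(windowSpec_close_gold)) \
                         .filter(F.col("row_number") == 1) \
                         .select("month", F.col("Close").alias("close_gold"))

# Monthly high and low, taken over the end-of-day bars kept above
# (so they can understate the true intraday range)
agg_gold = df_gold.groupBy("month").agg(
    F.max("High").alias("high_gold"),
    F.min("Low").alias("low_gold")
)

# Combine close, high and low into one row per month
monthly_df_gold = last_close_gold.join(agg_gold, on="month").orderBy("month")

# Show the five most recent months
monthly_df_gold.orderBy(F.col("month").desc()).limit(5).toPandas()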

Unnamed: 0,month,close_gold,high_gold,low_gold
0,2025-02,2794.43,2795.01,2794.21
1,2025-01,2799.23,2802.11,2635.62
2,2024-12,2624.61,2718.74,2583.62
3,2024-11,2650.33,2744.36,2560.94
4,2024-10,2743.77,2788.08,2607.57


In [25]:
monthly_df_gold_study = monthly_df_gold.filter(
    (F.col("month") >= "2014-09") & (F.col("month") <= "2021-11")
)

monthly_df_gold_extended = monthly_df_gold.filter(
    (F.col("month") >= "2014-09") & (F.col("month") <= "2025-02")
)

monthly_df_gold_new = monthly_df_gold.filter(
    (F.col("month") >= "2021-11") & (F.col("month") <= "2025-02")
)


monthly_df_gold_study.show(1)
monthly_df_gold_study.orderBy(F.col("month").desc()).limit(1).show()



+-------+----------+---------+--------+
|  month|close_gold|high_gold|low_gold|
+-------+----------+---------+--------+
|2014-09|   1207.83|  1286.02| 1207.83|
+-------+----------+---------+--------+
only showing top 1 row

+-------+----------+---------+--------+
|  month|close_gold|high_gold|low_gold|
+-------+----------+---------+--------+
|2021-11|   1774.43|  1867.46| 1769.87|
+-------+----------+---------+--------+



### Gold CAGR from 2014-09 to 2021-11

- Initial price: 1207.83 USD
- Final price: 1774.43 USD
- Years: 7.17
- CAGR: 5.51% per year


## CPI Dataset

This DataFrame has two columns:

1. `observation_date`: the date of the observation.
2. `CPIAUCSL_NBD20140101`: the CPI rebased so that January 1, 2014 = 100.

For example, a value of 100.11008 on 2014-02-01 means prices increased by about 0.11% since 2014-01-01.
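
To make the rebased index concrete, the sketch below turns an index value into cumulative inflation since January 2014 and, as one common use of such an index, deflates a nominal price into January 2014 dollars. The helper names `cumulative_inflation` and `to_real` are hypothetical, and whether the follow-up notebook adjusts prices this way is an assumption; the index values and the BTC close are taken from the outputs in this notebook.

```python
BASE_INDEX = 100.0  # the rebased CPI equals 100 on 2014-01-01


def cumulative_inflation(index_value: float, base: float = BASE_INDEX) -> float:
    """Fractional price increase since the base date."""
    return index_value / base - 1


def to_real(nominal: float, index_value: float, base: float = BASE_INDEX) -> float:
    """Express a nominal amount in base-date (January 2014) dollars."""
    return nominal * base / index_value


feb_2014 = cumulative_inflation(100.11008)   # index value for 2014-02
nov_2021 = cumulative_inflation(118.50328)   # index value for 2021-11
real_btc_close = to_real(57005.43, 118.50328)  # November 2021 BTC close, deflated

print(f"Inflation since 2014-01 (as of 2014-02): {feb_2014:.5%}")
print(f"Inflation since 2014-01 (as of 2021-11): {nov_2021:.2%}")
print(f"BTC close for 2021-11 in January 2014 dollars: {real_btc_close:,.2f}")
```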



In [26]:
df_cpi = spark.read \
.option("header", True) \
.option("inferSchema", True) \
.csv('../data/CPI_US.csv')

df_cpi.show(5)
df_cpi.printSchema()
df_cpi.select(
    F.min("observation_date").alias("Fecha Minima"),
    F.max("observation_date").alias("Fecha Maxima")
).show()

+----------------+--------------------+
|observation_date|CPIAUCSL_NBD20140101|
+----------------+--------------------+
|      2014-01-01|               100.0|
|      2014-02-01|           100.11008|
|      2014-03-01|           100.31451|
|      2014-04-01|           100.50151|
|      2014-05-01|           100.69277|
+----------------+--------------------+
only showing top 5 rows

root
 |-- observation_date: date (nullable = true)
 |-- CPIAUCSL_NBD20140101: double (nullable = true)

+------------+------------+
|Fecha Minima|Fecha Maxima|
+------------+------------+
|  2014-01-01|  2025-03-01|
+------------+------------+



In [27]:
monthly_df_cpi = (
    df_cpi
    .withColumn("month", F.date_format(F.col("observation_date"), "yyyy-MM"))
    .withColumnRenamed("CPIAUCSL_NBD20140101", "avg_cpi")
    .drop("observation_date")
    .orderBy(F.desc("month"))
)

monthly_df_cpi_study = monthly_df_cpi.filter(
    (F.col("month") >= "2014-09") & (F.col("month") <= "2021-11")
)

monthly_df_cpi_extended = monthly_df_cpi.filter(
    (F.col("month") >= "2014-09") & (F.col("month") <= "2025-02")
)

monthly_df_cpi_new = monthly_df_cpi.filter(
    (F.col("month") >= "2021-11") & (F.col("month") <= "2025-02")
)

monthly_df_cpi_study.show(5)


+---------+-------+
|  avg_cpi|  month|
+---------+-------+
|118.50328|2021-11|
|117.52746|2021-10|
|116.42838|2021-09|
|115.92261|2021-08|
|115.58813|2021-07|
+---------+-------+
only showing top 5 rows



# Final Join

In [28]:
# Final Dataset - Study Period (2014-09 to 2021-11)
final_df_study = (
    monthly_df_btc_study
    .join(monthly_df_sp500_study, on="month", how="inner")
    .join(monthly_df_gold_study, on="month", how="inner")
    .join(monthly_df_cpi_study, on="month", how="inner")
    .orderBy("month")
)

# Final Dataset - Extended Period (2014-09 to 2025-02)
final_df_extended = (
    monthly_df_btc_extended
    .join(monthly_df_sp500_extended, on="month", how="inner")
    .join(monthly_df_gold_extended, on="month", how="inner")
    .join(monthly_df_cpi_extended, on="month", how="inner")
    .orderBy("month")
)


final_df_study.select(
    F.min("month").alias("Fecha Minima"),
    F.max("month").alias("Fecha Maxima")
).show()

final_df_extended.select(
    F.min("month").alias("Fecha Minima"),
    F.max("month").alias("Fecha Maxima")
).show()

final_df_study.toPandas().to_csv("../data/final_df_study.csv", index=False)
final_df_extended.toPandas().to_csv("../data/final_df_extended.csv", index=False)


+------------+------------+
|Fecha Minima|Fecha Maxima|
+------------+------------+
|     2014-09|     2021-11|
+------------+------------+

+------------+------------+
|Fecha Minima|Fecha Maxima|
+------------+------------+
|     2014-09|     2025-02|
+------------+------------+



In [29]:
print("Study Dataset (2014-09 to 2021-11)")
final_df_study.toPandas()

Study Dataset (2014-09 to 2021-11)


Unnamed: 0,month,close_btc,high_btc,low_btc,close_sp500,high_sp500,low_sp500,close_gold,high_gold,low_gold,avg_cpi
0,2014-09,386.944000,493.928009,372.239990,23.26,23.75,22.91,1207.83,1286.02,1207.83,100.93035
1,2014-10,338.321014,411.697998,289.295990,23.95,24.10,21.70,1171.52,1248.55,1170.79,100.91037
2,2014-11,378.046997,457.092987,320.626007,25.28,25.29,23.80,1165.88,1201.13,1140.71,100.72039
3,2014-12,320.192993,384.037994,304.231995,25.41,25.70,23.85,1186.83,1230.94,1173.87,100.40971
4,2015-01,217.464005,320.434998,171.509995,25.40,26.29,24.76,1283.56,1301.79,1188.11,99.77007
...,...,...,...,...,...,...,...,...,...,...,...
82,2021-07,41626.195676,42235.547709,29360.955838,60.01,60.78,58.74,1814.28,1829.79,1776.21,115.58813
83,2021-08,47166.687945,50482.076408,37458.003993,62.53,62.79,59.71,1814.26,1818.28,1728.22,115.92261
84,2021-09,43790.895625,52853.763796,39787.609798,61.14,62.77,60.31,1757.16,1828.00,1725.69,116.42838
85,2021-10,61318.957767,66930.387271,43320.022979,63.65,63.71,59.66,1782.98,1807.50,1753.85,117.52746


In [30]:

print("Extended Dataset (2014-09 to 2025-02)")
final_df_extended.toPandas()

Extended Dataset (2014-09 to 2025-02)


Unnamed: 0,month,close_btc,high_btc,low_btc,close_sp500,high_sp500,low_sp500,close_gold,high_gold,low_gold,avg_cpi
0,2014-09,386.944000,493.928009,372.239990,23.26,23.75,22.91,1207.83,1286.02,1207.83,100.93035
1,2014-10,338.321014,411.697998,289.295990,23.95,24.10,21.70,1171.52,1248.55,1170.79,100.91037
2,2014-11,378.046997,457.092987,320.626007,25.28,25.29,23.80,1165.88,1201.13,1140.71,100.72039
3,2014-12,320.192993,384.037994,304.231995,25.41,25.70,23.85,1186.83,1230.94,1173.87,100.40971
4,2015-01,217.464005,320.434998,171.509995,25.40,26.29,24.76,1283.56,1301.79,1188.11,99.77007
...,...,...,...,...,...,...,...,...,...,...,...
121,2024-10,70215.185633,73577.209658,58895.207808,84.47,85.85,81.06,2743.77,2788.08,2607.57,134.11819
122,2024-11,96449.055813,99655.501079,66803.649996,90.21,90.91,83.46,2650.33,2744.36,2560.94,134.49432
123,2024-12,93429.202811,108268.447080,91317.135460,89.64,91.23,87.86,2624.61,2718.74,2583.62,134.98478
124,2025-01,102405.027084,109114.884834,89260.100189,93.18,93.83,89.51,2799.23,2802.11,2635.62,135.61508


In [31]:
# Final step: stop the Spark session
spark.stop()
print("SparkSession closed successfully.")


SparkSession closed successfully.
