In [0]:
%sh
# To keep it simple, download and extract the dataset with standard bash commands
# Install 7zip to extract the archive
apt-get install -y p7zip-full

# Start from a clean working directory
rm -rf /tmp/quant || true
mkdir -p /tmp/quant
cd /tmp/quant
# Download and extract the quant archive
curl -L https://archive.org/download/stackexchange/quant.stackexchange.com.7z -o quant.7z
7z x quant.7z
# Copy Posts.xml to the raw dataset folder on DBFS
rm -rf /dbfs/dbdemos/product/llm/quant/raw || true
mkdir -p /dbfs/dbdemos/product/llm/quant/raw
cp -f Posts.xml /dbfs/dbdemos/product/llm/quant/raw

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  p7zip
Suggested packages:
  p7zip-rar
The following NEW packages will be installed:
  p7zip p7zip-full
0 upgraded, 2 newly installed, 0 to remove and 42 not upgraded.
Need to get 1549 kB of archives.
After this operation, 5847 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 p7zip amd64 16.02+dfsg-8 [363 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 p7zip-full amd64 16.02+dfsg-8 [1186 kB]


debconf: delaying package configuration, since apt-utils is not installed


Fetched 1549 kB in 3s (473 kB/s)
Selecting previously unselected package p7zip.
(Reading database ... 100130 files and directories currently installed.)
Preparing to unpack .../p7zip_16.02+dfsg-8_amd64.deb ...
Unpacking p7zip (16.02+dfsg-8) ...
Selecting previously unselected package p7zip-full.
Preparing to unpack .../p7zip-full_16.02+dfsg-8_amd64.deb ...
Unpacking p7zip-full (16.02+dfsg-8) ...
Setting up p7zip (16.02+dfsg-8) ...
Setting up p7z



7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,32 CPUs AMD EPYC 7R32 (830F10),ASM,AES-NI)

Scanning the drive for archives:
1 file, 48416494 bytes (47 MiB)

Extracting archive: quant.7z
--
Path = quant.7z
Type = 7z
Physical Size = 48416494
Headers Size = 333
Method = BZip2
Solid = -
Blocks = 8

Everything is Ok

Files: 8
Size:       266808589
Compressed: 48416494


In [0]:
%fs ls /dbdemos/product/llm/quant/raw

path,name,size,modificationTime
dbfs:/dbdemos/product/llm/quant/raw/Posts.xml,Posts.xml,73662724,1687046644000


In [0]:
quant_raw_path = "/dbdemos/product/llm/quant/raw"
print(f"loading raw xml dataset under {quant_raw_path}")
raw_quant = spark.read.format("xml").option("rowTag", "row").load(f"{quant_raw_path}/Posts.xml")
raw_quant.show(10)

loading raw xml dataset under /dbdemos/product/llm/quant/raw
+-----------------+------------+--------------------+--------------------+-------------+--------------------+---------------+--------------------+--------------+---+--------------------+--------------------+----------------------+-----------------+-----------------+------------+---------+-----------+------+--------------------+--------------------+----------+
|_AcceptedAnswerId|_AnswerCount|               _Body|         _ClosedDate|_CommentCount| _CommunityOwnedDate|_ContentLicense|       _CreationDate|_FavoriteCount|_Id|   _LastActivityDate|       _LastEditDate|_LastEditorDisplayName|_LastEditorUserId|_OwnerDisplayName|_OwnerUserId|_ParentId|_PostTypeId|_Score|               _Tags|              _Title|_ViewCount|
+-----------------+------------+--------------------+--------------------+-------------+--------------------+---------------+--------------------+--------------+---+--------------------+--------------------+--------

In [0]:
from bs4 import BeautifulSoup
from pyspark.sql.functions import col, length, pandas_udf

# Pandas UDF that converts HTML content to plain text
@pandas_udf("string")
def html_to_text(html):
    # Name the parser explicitly so the result does not depend on which parsers are installed
    return html.apply(lambda x: BeautifulSoup(x, "html.parser").get_text())

quant_df = (raw_quant
            .filter("_Score >= 5")  # keep only well-rated questions and answers
            .filter(length("_Body") <= 1000)  # drop posts longer than 1000 characters
            .withColumn("text", html_to_text("_Body"))  # convert HTML to text
            .withColumnsRenamed({"_Id": "id", "_ParentId": "parent_id"})
            .select("id", "text", "parent_id"))

quant_df.show(10)

+---+--------------------+---------+
| id|                text|parent_id|
+---+--------------------+---------+
|  1|To get the ball r...|     null|
|  2|I like Statistics...|        1|
|  3|I want to start l...|     null|
|  4|John C. Hull's "O...|        1|
|  5|How do you model ...|     null|
|  6|This may be too b...|        1|
|  8|I like the follow...|        1|
|  9|It seems that VIX...|     null|
| 10|Clark,\nThis is o...|        1|
| 11|Options, Futures,...|        1|
+---+--------------------+---------+
only showing top 10 rows
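
The HTML-to-text step above can be checked on a single string without a Spark cluster. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's `html.parser`, and the names `_TextExtractor` and `strip_html` are hypothetical helpers that do not appear in the notebook.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML fragment, or None for missing input."""
    if html is None:
        return None
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = "<p>I like <b>Statistics</b> &amp; finance.</p>"
print(strip_html(sample))
```

Results may differ from BeautifulSoup on malformed markup, so treat this as a quick sanity check and not as a drop-in replacement for the UDF.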



In [0]:
docs_df = quant_df.withColumn('text_length', length(col('text')))\
                    .orderBy(col('text_length').desc()).limit(10)\
                    .select('text','text_length')
docs_df.show(10)

+--------------------+-----------+
|                text|text_length|
+--------------------+-----------+
|Co-integration is...|        986|
|Well, what you fi...|        983|
|VIX is calculated...|        980|
|The answer to you...|        977|
|The estimation of...|        975|
|Recently I've rea...|        970|
|According to my u...|        967|
|Buy copies of Bre...|        966|
|There have been n...|        965|
|Assuming you avoi...|        963|
+--------------------+-----------+



In [0]:
from typing import Iterator
import pandas as pd
from transformers import pipeline
import torch

# Free GPU memory left over from previous runs
try:
    torch.cuda.empty_cache()
    from numba import cuda
    cuda.get_current_device().reset()
except Exception as e:
    print(f"Couldn't clean the memory: {e}")

@pandas_udf("string")
def summarize(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the summarization model once per task, not once per batch
    torch.cuda.empty_cache()
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device="cuda:0")

    def summarize_txt(text):
        return summarizer(text)[0]['summary_text']

    for series in iterator:
        # Summarize each row of the batch
        yield series.apply(summarize_txt)

# A single partition means a single task, so the model is loaded only once
docs_df = docs_df.repartition(1)\
                 .withColumn("summary", summarize("text"))\
                 .withColumn('summary_length', length(col('summary')))
display(docs_df)



text,text_length,summary,summary_length
"Co-integration is a measure / indicator of the long running relationship between 2 or more time series.  A short answer to how you can use it, is the pairs trading strategy or in Econometrics can be used to formulate a regression.  using the classic example, you can use 2 stocks like Coke (C) and Pepsi (P) (or commodities such as Gold and Silver) in a pairs trading strategy. Your strategy will involve first finding out if the stocks are co-integrated; if they are then you will need to have a strat like:  - If the spread (C - P) > threshold then sell C and buy P - If the spread (C - P) < threshold then buy C and sell P  The idea here is that if the spread widens say C increases then eventually P will also increase or C will eventually revert to some long running value.  The key challenge here is to determine when the spread is at its optimal value, so that you know when to enter / exit the trade  Very quick and basic, there are tons of info on this strat on the web.",986,Co-integration is a measure / indicator of the long running relationship between 2 or more time series . The idea here is that if the spread widens say C increases then eventually P will also increase or C will eventually revert to some long running value . The key challenge is to determine when the spread is at its optimal value .,334
"Well, what you find is that the introduction of stochastic vol changes the delta of your options. So what does this mean? If the new delta reduces the variance of your hedged portfolio versus the pure local vol model , then it means that the introduction of stochastic vol has resulted in a better description of market dynamics versus the pure local vol model. Secondly, what you also find is that you can have different models all of which reprice the vanilla options, but that some exotic options have very different prices in the different models. For example , the introduction of stochastic vol can be done in a way that preserves the vanilla option prices , but it lowers the value of forward implied volatilities in the model versus a simple local vol model. Thus, exotics that depend on forward vols ( cliques, Bermudan etc) are priced very differently. Hence another reason to introduce stochastic vol is to improve the pricing of exotics, given the vanilla market.",983,"Stochastic vol has resulted in a better description of market dynamics versus the pure local vol model . Exotics that depend on forward vols ( cliques, Bermudan etc) are priced very differently . Another reason to introduce stochastic vol is to improve the pricing of exotics, given the vanilla market .",304
"VIX is calculated from a basket of SPX options, and VIX futures expire into following expiration, e.g. September VIX futures that will expire next Wednesday will use SPX October options chain to calculate settlement value. If $B$ is the value of the basket then VIX value at expiration is $\sqrt{ B }$. Then VIX futures price is the expectation of the basket $VIX _{F} = E[\sqrt{ B }]$. Delta of the VIX futures price with respect to the basket would be $$\ \frac{\partial VIX _{F}}{\partial B} = \frac{\partial E[\sqrt{ B }]}{\partial B}$$ As you can see that taking that expectation is not simple, since there is no simple connection between VIX futures greeks and SPX options greeks because of the expectation and square root. So ""use the chain rule and linearity of the derivative"" approach would not get you anywhere. But that does not mean that such derivative is 0. Such derivative can be calculated in Malliavin sense, but that is probably not what you're looking for.",980,"VIX is calculated from a basket of SPX options, and VIX futures expire into following expiration . If $B$ is the value of the basket then VIX value at expiration is $\sqrt{ B }$. If VIX price is the expectation of basket $VIX _{F} = E[\sqrt { B }]$. Delta of the Vix futures price with respect to the basket would be$$\ \frac{\partial VIX _ {F]{\partial B} = \frac{{ B}$$",372
"The answer to your first four questions is affirmative. Option-adjusting the spread makes an equivalence between everything theoretically possible, but the quality of results depends significantly on the quality of your interest rate model and its calibration. My personal opinion, though, is that the results need to be treated carefully because the OAS model does not (typically) include stochastic credit spreads and potential capital structure changes, and therefore tends to underprice the embedded options. For a bond with a single call date, Delta would be the risk-neutral exercise probability, but that situation is nearly nonexistent. Since the interest rate model used for OAS can easily compute the exercise probability alongside valuation, you should just use the model to get it. If you are not computing OAS yourself, you are probably working with pretty pathetic numbers because most commercial sources are poorly calibrated (I'm looking at you, Bloomberg).",977,Option-adjusting the spread makes an equivalence between everything theoretically possible . The quality of the model depends significantly on the quality of your model and its calibration . The OAS model does not (typically) include credit spreads and potential capital structure changes and therefore tends to underprice the embedded options .,346
"The estimation of a covariance matrix is unstable unless the number of historical observations $T$ is greater than the number of securities $N$ (5000 in your example). Consider that 10 years of data represents only 120 monthly observations and about 2500 daily observations. Depending on the application, using data dating farther back than 10 years may be impractical and undesirable for many reasons -- de-listed stocks, regime changes, etc. In fact, risk management applications often require covariance estimations over recent periods of time (1-3 years). Computational applications ranging from portfolio construction to Monte Carlo simulation generally require that the estimated covariance matrix is non-singular and positive definite. If N is greater than T, then the estimated covariance matrix will be singular. Furthermore a variety of small sample problems persist until the number of observations is an order of magnitude larger than the number of securities.",975,The estimation of a covariance matrix is unstable unless the number of historical observations $T$ is greater than number of securities $N$ (5000 in your example) Consider that 10 years of data represents only 120 monthly observations and about 2500 daily observations . Risk management applications often require covariance estimations over recent periods of time .,367
"Recently I've read some books about quantative approach to fundamental investing: - What works on Wall Street - James O'Shaughnessy - Quantitative Value - Wesley Gray, Tobias Carlisle - Quantitative Strategies - Richard Tortoriello Basically, their research methodology, can be summarized as, we have a set of indicators: - value (E/P, EBIT/TEV, S/P, ...) - momentum (RSI, ...) - quality (Piotroski score,...) - growth (PEG, ...) We rank stocks and assign to deciles. We decide how often we rebalance portfolio (rather low frequency) and which strategy to apply. We calculate return,cagr, sharpe etc. for every decile/strategy. I'm looking for free/open-source framework/library to reproduce similar research. I can't use yahoo data (non-yahoo stock exchange), so I need to load my own data. I consider to use python pandas for this, but maybe a better solution exists. Unfortunately, I've only found libraries for pair trading and technical analysis for single stock.",970,"Recently I've read some books about quantative approach to fundamental investing . I'm looking for free/open-source framework/library to reproduce similar research . I can't use yahoo data (non-yahoo stock exchange) so I need to load my own data . I consider to use python pandas for this, but maybe a better solution exists .",327
"According to my understanding, synthetic CDOs are essentially credit default swaps (CDS) for a bunch of loans, stored in a special purpose vehicle (SPV). Here, the investor (the one who buys the synthetic CDO) is essentially buying insurance against defaults of loans that the investor doesn't hold. Therefore, if the loan defaults the investor gets paid by the issuer of the CDSs, and until that happens the investor pays a premium to the issuer of the CDSs (somehow?). My understanding is mostly based on the assumption that 1) buying synthetic CDO (i.e. being an investor) means betting against the loans, 2) synthetic CDOs don't contain any loans. I wouldn't be surprised if I'm wrong, so please correct me where I'm wrong. In particular, it doesn't make sense to me why is there even CDO in the name of this synthetic CDO, as it is just a basket of CDSs and there are no loans in it. Also, how can the investor payout the premium to the issuer if it is bundled?",967,"Synthetic CDOs are essentially credit default swaps (CDS) for a bunch of loans stored in a special purpose vehicle . Here, the investor (the one who buys the synthetic CDO) is essentially buying insurance against defaults of loans that the investor doesn't hold . If the loan defaults the investor gets paid by the issuer of the CDSs .",336
"Buy copies of Brent Oksendal's ""Stochastic Differential Equations An Introduction with Applications"" and Thomas Bjork's ""Arbitrage Theory in Continuous Time."" These are well written graduate level textbooks. I can't promise it will be painless, but if you want to understand continuous time derivative pricing models these are a place to start. Another option is to not worry about continuous time models and get a copy of Stanley R. Pliska's ""Introduction to Mathematical Finance."" It is a graduate textbook covering discrete time models. To use these models all you need to know is linear algebra and how to optimize linear equations using the simplex method. (not to be confused with the simplex numerical optimization algorithm.) Bluntly put: Ito Integration can be viewed two ways. 1) As an incomplete Riemann Stieltjes Integral 2) An extended Lebesgue Integral. If you have no idea what either of the above two things are, go with the descrete time models.",966,"Buy copies of Brent Oksendal's ""Stochastic Differential Equations An Introduction with Applications"" and Thomas Bjork's ""Arbitrage Theory in Continuous Time"" These are well written graduate level textbooks . To use these models all you need to know is linear algebra and how to optimize linear equations using the simplex method .",331
"There have been numerous exotic trading desk blow ups lately, related to various reasons. However, in particular, one bank had some issues where they were pricing autocallable notes with Local Volatility and not producing a Delta ""true up"" using Stochastic Volatility that is common among other banks. In other words, Delta of the autocallable notes is higher in magnitude under Local Volatility compared to Stochastic Volatility. Since the bank thought it was holding more (negative) Delta as a result of the Local Volatility model, they bought too much stock to hedge and had large losses when the market declined. Can someone provide an intuitive explanation of why Delta is higher in autocallable products under Local Volatility compared to Stochastic Volatility? The price of the product is different under the two volatility models on account of vol-of-vol differences, but it's not entirely clear to me why the Delta difference is in this direction. Thanks.",965,"Bank had some issues where they were pricing autocallable notes with Local Volatility and not producing a Delta ""true up"" using Stochastic Volatility . Bank thought it was holding more (negative) Delta as a result of the local Volatility model, they bought too much stock to hedge and had large losses when the market declined .",329
"Assuming you avoid data-snooping bias and all the potential pitfalls of using the past to predict the future, trusting genetic algorithms to find the ""right"" solution pretty much boils down to the same bet you make when you actively manage a portfolio, whether quantitatively or discretionary. If you believe in market efficiency then increasing your transaction costs from active management is illogical. If, however you believe there are structural & psychological patterns or ""flaws"" to be exploited and the payoff is worth the time and money for researching and implementing a strategy the logical choice is active management. Running a GA derived strategy is an implicit bet against market efficiency. You're basically saying ""I think there are mis-valuations that occur from some reason"" (masses of irrational people, mutual funds herding because of mis-aligned incentives, etc.) and ""running this GA can sort this mass of data out way quicker than I can.""",963,"Running a GA derived strategy is an implicit bet against market efficiency . You're basically saying ""I think there are mis-valuations that occur from some reason"" (masses of irrational people, mutual funds herding because of mis-aligned incentives) and ""running this GA can sort this mass of data out way quicker than I can""",326
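
To gauge how much the summarizer shortens each post, the two length columns above can be compared directly. This is a minimal sketch in plain Python: the pairs are copied by hand from the `text_length` and `summary_length` columns of the table, and the variable names are illustrative only.

```python
# (text_length, summary_length) pairs copied from the table above
lengths = [
    (986, 334), (983, 304), (980, 372), (977, 346), (975, 367),
    (970, 327), (967, 336), (966, 331), (965, 329), (963, 326),
]

# Per-row compression ratio: summary length divided by original length
ratios = [summary / text for text, summary in lengths]

# Overall ratio across the ten posts
overall = sum(summary for _, summary in lengths) / sum(text for text, _ in lengths)

print(f"min={min(ratios):.3f} max={max(ratios):.3f} overall={overall:.3f}")
```

On a larger sample, the same comparison could be done in Spark by dividing `summary_length` by `text_length` in a new column.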
