# Descriptive Statistics

Descriptive statistics summarize the main features of a dataset collected in a given study, giving a compact description of the data sample and its measurements.

## yfinance Basic Usage

In [36]:
import yfinance as yf

In [37]:
# Create a Ticker object for Apple (AAPL)
ticker_aapl = yf.Ticker("AAPL")
# Get general information about Apple
# print(ticker_aapl.info)
# ticker_aapl.info['address1'] # print addressline 1
ticker_aapl.info['longBusinessSummary'] # print longBusinessSummary

'Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, and HomePod. It also provides AppleCare support and cloud services; and operates various platforms, including the App Store that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. In addition, the company offers various services, such as Apple Arcade, a game subscription service; Apple Fitness+, a personalized fitness service; Apple Music, which offers users a curated listening experience with on-demand radio stations; Apple News+, a subscription news and magazine service; Apple TV+, which offers exclusive original content; Apple Card, a co-branded credit card; and Apple P

In [38]:
# Download historical market data for the past year
data_aapl_1y = ticker_aapl.history(period="1y")
print(data_aapl_1y.head())

                                 Open        High         Low       Close  \
Date                                                                        
2023-09-13 00:00:00-04:00  175.611353  176.397339  173.094234  173.323074   
2023-09-14 00:00:00-04:00  173.114174  175.203490  172.696314  174.845322   
2023-09-15 00:00:00-04:00  175.581521  175.601423  172.935074  174.119003   
2023-09-18 00:00:00-04:00  175.581530  178.466775  175.273110  177.063950   
2023-09-19 00:00:00-04:00  176.616244  178.715502  176.228230  178.158356   

                              Volume  Dividends  Stock Splits  
Date                                                           
2023-09-13 00:00:00-04:00   84267900        0.0           0.0  
2023-09-14 00:00:00-04:00   60895800        0.0           0.0  
2023-09-15 00:00:00-04:00  109205100        0.0           0.0  
2023-09-18 00:00:00-04:00   67257600        0.0           0.0  
2023-09-19 00:00:00-04:00   51826900        0.0           0.0  


## Google Stock Price Example

In [39]:
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

In [40]:
# let's import some data
df = yf.download("GOOG")
print(df.head()) # let's see what we have in here

[*********************100%***********************]  1 of 1 completed

                Open      High       Low     Close  Adj Close     Volume
Date                                                                    
2004-08-19  2.490664  2.591785  2.390042  2.499133   2.493011  897427216
2004-08-20  2.515820  2.716817  2.503118  2.697639   2.691030  458857488
2004-08-23  2.758411  2.826406  2.716070  2.724787   2.718112  366857939
2004-08-24  2.770615  2.779581  2.579581  2.611960   2.605561  306396159
2004-08-25  2.614201  2.689918  2.587302  2.640104   2.633636  184645512





In [41]:
# let's say we only want the Adjusted Close
# The adjusted closing price amends a stock's closing price to reflect that stock's value after accounting for any corporate actions
# The closing price is the raw price, which is just the cash value of the last transacted price before the market closes (investopedia.com)
df_close = yf.download("GOOG")["Adj Close"]
df_close.head()

[*********************100%***********************]  1 of 1 completed


Date
2004-08-19    2.493011
2004-08-20    2.691030
2004-08-23    2.718112
2004-08-24    2.605561
2004-08-25    2.633636
Name: Adj Close, dtype: float64

In [42]:
# same data, but keeping only the daily variation (returns)
# computing the percentage change from one day to the next with pct_change(1)
# and dropping the missing first value with dropna()
df_close_pct = yf.download("GOOG")["Adj Close"].pct_change(1).dropna() # returns put different assets on the same scale
df_close_pct

[*********************100%***********************]  1 of 1 completed


Date
2004-08-20    0.079430
2004-08-23    0.010064
2004-08-24   -0.041408
2004-08-25    0.010775
2004-08-26    0.018019
                ...   
2024-09-06   -0.040794
2024-09-09   -0.015731
2024-09-10    0.003143
2024-09-11    0.014266
2024-09-12    0.022281
Name: Adj Close, Length: 5050, dtype: float64

In [43]:
df_close_pct.head()

Date
2004-08-20    0.079430
2004-08-23    0.010064
2004-08-24   -0.041408
2004-08-25    0.010775
2004-08-26    0.018019
Name: Adj Close, dtype: float64
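As a minimal sketch of what `pct_change(1)` computes, the toy example below uses three made-up prices (not market data) and compares the pandas result with the formula `price_t / price_(t-1) - 1` written out by hand.

```python
import pandas as pd

# Hypothetical prices, chosen only for illustration
prices = pd.Series([100.0, 102.0, 99.96])

# Percentage change from one observation to the next;
# the first value has no predecessor, so it is NaN and gets dropped
returns = prices.pct_change(1).dropna()

# Same computation written out by hand
manual_returns = (prices / prices.shift(1) - 1).dropna()

print(returns)
print(manual_returns)
```

Both series hold two values because the first observation is lost when computing the change.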

## Central Tendency Measure

### Mean

The mean (average) of a data set is found by adding all numbers in the data set and then dividing by the number of values in the set.
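The definition can be checked on a small example. The sketch below uses made-up daily returns (an assumption, not data from this notebook) and compares the hand-computed mean with `np.mean`.

```python
import numpy as np

# Hypothetical daily returns in decimal form (illustrative values only)
returns = np.array([0.01, -0.02, 0.03, 0.005, -0.005])

# Mean by hand: sum of the values divided by the number of values
manual_mean = returns.sum() / len(returns)

# Same result with NumPy
numpy_mean = np.mean(returns)

print(f"Manual mean: {manual_mean * 100:.2f} %")
print(f"NumPy mean:  {numpy_mean * 100:.2f} %")
```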

In [44]:
# mean with numpy, working with df_close_pct from above
# axis specifies the direction of the computation; only necessary if there is more than one column
# axis=0 -> the mean is computed down the rows, giving one value per column
mean = np.mean(df_close_pct, axis=0) * 100 # multiplying by 100 to have the value in percentage
print(f"Daily mean: {'%.2f' % mean} %")

Daily mean: 0.10 %


In [45]:
# annualization of the mean return
annual_mean = mean * 252 # 252 is the approximate number of trading days per year
print(f"Annual mean: {'%.2f' % annual_mean} %")

Annual mean: 25.30 %


In [46]:
# daily mean return -> monthly mean return
monthly_mean = mean * 21 # roughly 21 (or 20) trading days per month
print(f"Monthly mean: {'%.2f' % monthly_mean} %")

Monthly mean: 2.11 %
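The annual and monthly figures above scale the daily mean linearly, which ignores compounding. As a hedged comparison, the sketch below contrasts that arithmetic scaling with a compounded version, `(1 + r) ** 252 - 1`. The daily mean of 0.10 % and the 252-day convention are assumptions used for illustration only.

```python
# Assumed daily mean return (0.10 %), in decimal form
daily_mean = 0.001
# Common convention for the number of trading days per year
trading_days = 252

# Arithmetic scaling: daily mean times the number of days
arithmetic_annual = daily_mean * trading_days

# Compounded scaling: the daily return is reinvested every day
compounded_annual = (1 + daily_mean) ** trading_days - 1

print(f"Arithmetic annual mean: {arithmetic_annual * 100:.2f} %")
print(f"Compounded annual mean: {compounded_annual * 100:.2f} %")
```

For a positive daily mean the compounded figure is higher than the arithmetic one; which convention is appropriate depends on how the number is used.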


### Median

The median is the value in the middle of a sorted data set: 50% of data points are smaller than or equal to the median and 50% are greater than or equal to it. Unlike the mean, which can be distorted by extreme values, the median is robust to outliers.
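To illustrate how extreme values affect the two measures differently, the sketch below uses made-up returns in which the last observation is a deliberate outlier.

```python
import numpy as np

# Hypothetical daily returns; the last value is an extreme outlier
returns = np.array([0.01, 0.012, 0.008, 0.011, 0.009, -0.30])

mean_value = np.mean(returns)
median_value = np.median(returns)

print(f"Mean:   {mean_value * 100:.2f} %")
print(f"Median: {median_value * 100:.2f} %")
```

The single outlier pulls the mean below zero, while the median stays close to the typical daily return.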

In [47]:
# median with numpy, working with df_close_pct from above
# axis specifies the direction of the computation; only necessary if there is more than one column
# axis=0 -> the median is computed down the rows, giving one value per column
median = np.median(df_close_pct, axis=0) * 100 # multiplying by 100 to have the value in percentage
print(f"Daily median: {'%.2f' % median} %") # result shows that there are more values positive than negative

Daily median: 0.08 %


### The Percentile

**Percentile:** a number denoting the position of a data point within a numeric dataset by indicating the percentage of the dataset with a lesser value. For example, a data point that falls at the 80th percentile has a value greater than 80 percent of the data points within the dataset. Values need to be sorted ascending.
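A small sketch of the definition, using the integers 1 to 100 as stand-in data (an assumption for illustration). `np.quantile` takes the level as a decimal, `np.percentile` takes it in percent, and both use linear interpolation by default.

```python
import numpy as np

# Stand-in data: the integers 1, 2, ..., 100
values = np.arange(1, 101)

# 10th percentile, expressed as a decimal level and as a percent level
q10 = np.quantile(values, 0.10)
p10 = np.percentile(values, 10)

# Share of the data strictly below the 10th percentile
share_below = np.mean(values < q10)

print(q10, p10, share_below)
```

NumPy sorts the data internally, so the input does not need to be sorted beforehand.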

In [48]:
# percentile with numpy, working with df_close_pct from above
# axis specifies the direction of the computation; only necessary if there is more than one column
# axis=0 -> the quantile is computed down the rows, giving one value per column
# we need to specify the percentile we want -> 2nd param: percentile as a decimal, 10% = 0.1
percentile_10 = np.quantile(df_close_pct, 0.1, axis=0) * 100 # multiplying by 100 to have the value in percentage
print(f"Percentile 10%: {'%.2f' % percentile_10} %") # 10% of the daily returns are below this value

Percentile 10%: -1.93 %


In [49]:
percentile_50 = np.quantile(df_close_pct, 0.5, axis=0) * 100 # multiplying by 100 to have the value in percentage
print(f"Percentile 50%: {'%.2f' % percentile_50} %") # the 50th percentile is the median

Percentile 50%: 0.08 %


In [50]:
percentile_99 = np.quantile(df_close_pct, 0.99, axis=0) * 100 # multiplying by 100 to have the value in percentage
print(f"Percentile 99%: {'%.2f' % percentile_99} %") # only 1% of the daily returns are above this value

Percentile 99%: 5.70 %


## Standard Dispersion Measurement

### Variance

The variance is a dispersion metric. The higher the variance, the more the values are dispersed around the mean. For a trading strategy, a high variance of returns means a risky strategy. For example, a strategy with a return of 10% per year and a variance close to zero is a very good strategy.

- Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value (Wikipedia).

- Variance is a measurement of the spread between numbers in a data set. Investors use the variance equation to evaluate a portfolio's asset allocation (Investopedia).
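The definitions above can be written out directly as the average squared deviation from the mean. The sketch below uses made-up returns and shows both the population variance (divide by `n`, NumPy's default `ddof=0`) and the sample variance (divide by `n - 1`, `ddof=1`).

```python
import numpy as np

# Hypothetical daily returns in decimal form (illustrative values only)
returns = np.array([0.01, -0.02, 0.03, 0.005, -0.005])

# Deviation of each value from the mean
deviations = returns - returns.mean()

# Population variance: mean of the squared deviations
population_var = np.sum(deviations ** 2) / len(returns)

# Sample variance: same sum divided by n - 1
sample_var = np.sum(deviations ** 2) / (len(returns) - 1)

print(population_var, np.var(returns))        # ddof=0 is the NumPy default
print(sample_var, np.var(returns, ddof=1))
```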

In [51]:
# variance with numpy
var = np.var(df_close_pct, axis=0) * 100
print(f"Daily variance: {'%.2f' % var} %")
# the variance is expressed in squared units, so this value (and its % label) is hard to interpret directly

Daily variance: 0.04 %


### Standard Deviation

The variance is hard to interpret because it is expressed in squared units. The standard deviation, which is the square root of the variance, is expressed in the same units as the data and is therefore interpretable.

The standard deviation of an asset's returns (the variation of its prices) is its **volatility**. It is a common indicator for quantifying the risk of a strategy.

- A standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. Low, or small, standard deviation indicates data are clustered tightly around the mean, and high, or large, standard deviation indicates data are more spread out (National Library of Medicine).

- In statistics, the standard deviation is a measure of the amount of variation of the values of a variable about its mean. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range (Wikipedia).
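A minimal sketch of the link between variance and standard deviation, and of the square-root-of-time scaling used to annualize volatility. The returns are made up, and the scaling assumes daily returns are uncorrelated with constant variance, which is a simplification.

```python
import numpy as np

# Hypothetical daily returns in decimal form (illustrative values only)
returns = np.array([0.01, -0.02, 0.03, 0.005, -0.005])

daily_var = np.var(returns)
# Standard deviation is the square root of the variance
daily_std = np.sqrt(daily_var)

# Annualize by scaling the variance by 252 and taking the square root,
# which is the same as scaling the standard deviation by sqrt(252)
annual_std_from_var = np.sqrt(daily_var * 252)
annual_std = daily_std * np.sqrt(252)

print(f"Daily volatility:  {daily_std * 100:.2f} %")
print(f"Annual volatility: {annual_std * 100:.2f} %")
```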

In [52]:
# standard deviation with numpy

# annualization var: var_annual = var_daily * 252
# annualization std: sqrt(var_annual) = sqrt(var_daily * 252) = sqrt(var_daily) * sqrt(252)
# and sqrt(var_daily) = std_daily

std = np.std(df_close_pct, axis=0) * 100
print(f"Daily volatility: {'%.2f' % std} %")

Daily volatility: 1.93 %


In [53]:
# annualization of the volatility
annual_std = std * np.sqrt(252)
print(f"Yearly volatility: {'%.2f' % annual_std} %")

Yearly volatility: 30.65 %


In [54]:
# daily volatility -> monthly volatility
monthly_std = std * np.sqrt(21)
print(f"Monthly volatility: {'%.2f' % monthly_std} %")

Monthly volatility: 8.85 %


## Relationship Measure

### Covariance and Covariance Matrix

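The covariance allows us to understand the relationship between two sets of values. A positive covariance means the two tend to move in the same direction, and a negative covariance means they tend to move in opposite directions. Its magnitude depends on the scale of the data, so it does not directly measure the strength of the relationship.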

- Covariance in probability theory and statistics is a measure of the joint variability of two random variables. The sign of the covariance, therefore, shows the tendency in the linear relationship between the variables (Wikipedia).
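As a sketch of the computation, the example below derives the sample covariance of two made-up return series by hand and compares it with `np.cov`, which divides by `n - 1` by default. It also rescales one series to show that the covariance changes with the units of the data.

```python
import numpy as np

# Two hypothetical return series (illustrative values only)
x = np.array([0.01, -0.02, 0.03, 0.005, -0.005])
y = np.array([0.02, -0.01, 0.025, 0.0, -0.015])

# Sample covariance by hand: average product of the deviations, with n - 1
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov with two 1-D inputs returns the 2x2 variance-covariance matrix
cov_matrix = np.cov(x, y)

# Expressing x in percent multiplies the covariance by 100
scaled_cov = np.cov(100 * x, y)[0, 1]

print(manual_cov, cov_matrix[0, 1], scaled_cov)
```

The diagonal of `cov_matrix` holds the sample variances of `x` and `y`, and the off-diagonal entries hold their covariance.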

In [55]:
# compute a variance-covariance matrix between two assets
# import several assets
df = yf.download(["GOOG", "EURUSD=X"])["Adj Close"].dropna().pct_change(1).dropna()
df

[*********************100%***********************]  2 of 2 completed


Ticker                     EURUSD=X      GOOG
Date                                         
2004-08-20 00:00:00+00:00 -0.004042  0.079430
2004-08-23 00:00:00+00:00 -0.013793  0.010064
2004-08-24 00:00:00+00:00 -0.005763 -0.041408
2004-08-25 00:00:00+00:00  0.000338  0.010775
2004-08-26 00:00:00+00:00  0.001077  0.018019
...                             ...       ...
2024-09-06 00:00:00+00:00  0.002767 -0.040794
2024-09-09 00:00:00+00:00 -0.002151 -0.015731
2024-09-10 00:00:00+00:00 -0.004481  0.003143
2024-09-11 00:00:00+00:00 -0.001499  0.014266


In [56]:
# variance-covariance matrix with numpy
# np.cov returns the full matrix; a single covariance is read from it afterwards
# default rowvar=True means each row is a variable (symbol)
# in our case each column is a symbol so we set rowvar=False
mat = np.cov(df, rowvar=False)
mat
# in matrix:
# variance EURUSD // covariance EURUSD-GOOG
# covariance GOOG-EURUSD // variance GOOG

array([[5.21720807e-05, 6.59933130e-06],
       [6.59933130e-06, 3.73997279e-04]])

In [57]:
# extracting the covariance from the matrix
# covariance EURUSD-GOOG (identical to GOOG-EURUSD, the matrix is symmetric)
mat[0, 1]

np.float64(6.59933130338434e-06)

In [58]:
# another example, importing several assets
df = yf.download(["GOOG", "EURUSD=X", "MSFT", "AMZN", "TSLA"])["Adj Close"].dropna().pct_change(1).dropna()
df

[*********************100%***********************]  5 of 5 completed


Ticker                         AMZN  EURUSD=X      GOOG      MSFT      TSLA
Date                                                                       
2010-06-30 00:00:00+00:00  0.005985  0.003584 -0.020495 -0.012870 -0.002511
2010-07-01 00:00:00+00:00  0.015559  0.022689 -0.012271  0.006519 -0.078473
2010-07-02 00:00:00+00:00 -0.016402  0.005079 -0.006690  0.004750 -0.125683
2010-07-06 00:00:00+00:00  0.008430  0.004216 -0.001100  0.023636 -0.160937
2010-07-07 00:00:00+00:00  0.030620  0.000720  0.032403  0.020151 -0.019243
...                             ...       ...       ...       ...       ...
2024-09-06 00:00:00+00:00 -0.036539  0.002767 -0.040794 -0.016381 -0.084459
2024-09-09 00:00:00+00:00  0.023397 -0.002151 -0.015731  0.010007  0.026290
2024-09-10 00:00:00+00:00  0.023660 -0.004481  0.003143  0.020901  0.045776
2024-09-11 00:00:00+00:00  0.027680 -0.001499  0.014266  0.021342  0.008666


In [59]:
mat = np.cov(df, rowvar=False)
pd.DataFrame(mat, columns=df.columns, index=df.columns) # putting in data frame for better visualization

Ticker        AMZN       EURUSD=X           GOOG           MSFT      TSLA
Ticker                                                                   
AMZN      0.000426   3.960943e-06   0.0002129728   0.0001900333  0.000261
EURUSD=X     4e-06   2.933636e-05  -1.428352e-07  -4.050031e-07    -2e-06
GOOG      0.000213  -1.428352e-07   0.0002971436   0.0001791337  0.000203
MSFT       0.00019  -4.050031e-07   0.0001791337   0.0002628376  0.000204
TSLA      0.000261  -1.825147e-06   0.0002032272   0.0002037929  0.001291


### Correlation

The correlation is the standardized value of the covariance. As the covariance is hard to interpret on its own, we use the correlation to understand the strength of the relationship.

- Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables. A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the change in the values of the other variable (Australian Bureau of Statistics).

- In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data (Wikipedia).

- Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate) (JMP Statistical Software).

**Interpretation:**

- -1 ≤ corr < 0: negative relationship
- corr = 0: no linear relationship
- 0 < corr ≤ 1: positive relationship
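The standardization can be sketched as follows: the covariance is divided by the product of the two standard deviations, which bounds the result between -1 and 1. The two return series are made up for illustration, and the result is compared with `np.corrcoef`.

```python
import numpy as np

# Two hypothetical return series (illustrative values only)
x = np.array([0.01, -0.02, 0.03, 0.005, -0.005])
y = np.array([0.02, -0.01, 0.025, 0.0, -0.015])

# 2x2 variance-covariance matrix (sample version, n - 1)
cov_matrix = np.cov(x, y)

# Standard deviations are the square roots of the diagonal variances
std_x = np.sqrt(cov_matrix[0, 0])
std_y = np.sqrt(cov_matrix[1, 1])

# Correlation: covariance divided by the product of the standard deviations
manual_corr = cov_matrix[0, 1] / (std_x * std_y)

# Same value from NumPy's correlation matrix
numpy_corr = np.corrcoef(x, y)[0, 1]

print(manual_corr, numpy_corr)
```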

In [60]:
# correlation matrix with pandas
# the correlation is the covariance divided by the product of the two assets' standard deviations
df.corr()

Ticker        AMZN  EURUSD=X      GOOG      MSFT      TSLA
Ticker                                                    
AMZN           1.0  0.035446  0.598837  0.568138   0.35173
EURUSD=X  0.035446       1.0  -0.00153 -0.004612  -0.00938
GOOG      0.598837  -0.00153       1.0  0.640989  0.328181
MSFT      0.568138 -0.004612  0.640989       1.0  0.349913
TSLA       0.35173  -0.00938  0.328181  0.349913       1.0
