# **US financial market | Linear algebra**
*Original dataset source: Content property of Economática, financial information platform*
<br>*Used dataset source: **[us_2022q1_service_industries.csv](https://github.com/myrosandrade89/IA95022/tree/main/Statistics/Reto/dataset)***
<br>*Author: Myroslava Sánchez Andrade A01730712*
<br>*Creation date: 05/10/2022*
<br>*Last updated:*

---
## **Overview**
The purpose of this repository is to analyze, using linear algebra, the first-quarter 2022 financial statements of all US public service-industry companies listed on the New York Stock Exchange and NASDAQ. The explanatory (independent) variables are `['epsp', 'medium firm', 'big firm', 'profit margin', 'book/market', 'short leverage']` and the dependent variable is `['f1 stock return']`.

---
## **Configuration**

In [16]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [17]:
# Importing the dataset and dropping every row that contains a NaN or an infinite value
us_2022q1_service_industries = pd.read_csv('data/us_2022q1_service_industries.csv')
us_2022q1_service_industries = us_2022q1_service_industries[~us_2022q1_service_industries.isin([np.nan, np.inf, -np.inf]).any(axis=1)].reset_index()
us_2022q1_service_industries

Unnamed: 0,index,firm,q,medium firm,big firm,epsp,profit margin,book/market,short leverage,f1 stock return
0,1,AAWW,2022q1,1,0,0.032303,0.078591,0.090121,0.104785,-0.093966
1,2,ABM,2022q1,1,0,0.024686,0.039252,-0.620570,0.013874,-0.008219
2,3,ABNB,2022q1,0,1,-0.000173,-0.012454,-3.134384,0.003722,-0.418310
3,5,ACCD,2022q1,1,0,-0.029390,-0.368574,-0.317665,0.005126,-0.863745
4,6,ACHC,2022q1,0,1,0.010327,0.098657,-0.827813,0.009602,0.077769
...,...,...,...,...,...,...,...,...,...,...
644,784,ZNGA,2022q1,0,1,-0.002341,-0.035446,-1.281077,0.003055,-0.230480
645,785,ZS,2022q1,0,1,-0.002950,-0.392936,-3.922008,0.008119,-0.308016
646,786,ZUO,2022q1,1,0,-0.018348,-0.387928,-2.419384,0.029738,-0.481159
647,787,ZVO,2022q1,0,0,-0.266288,-0.120666,-0.907766,0.000000,-0.625444


In [18]:
# Defining the dataset of the variables
us_2022q1_service_industries_variables = us_2022q1_service_industries[['medium firm', 'big firm', 'epsp', 'profit margin', 'book/market', 'short leverage', 'f1 stock return']]
us_2022q1_service_industries_variables

Unnamed: 0,medium firm,big firm,epsp,profit margin,book/market,short leverage,f1 stock return
0,1,0,0.032303,0.078591,0.090121,0.104785,-0.093966
1,1,0,0.024686,0.039252,-0.620570,0.013874,-0.008219
2,0,1,-0.000173,-0.012454,-3.134384,0.003722,-0.418310
3,1,0,-0.029390,-0.368574,-0.317665,0.005126,-0.863745
4,0,1,0.010327,0.098657,-0.827813,0.009602,0.077769
...,...,...,...,...,...,...,...
644,0,1,-0.002341,-0.035446,-1.281077,0.003055,-0.230480
645,0,1,-0.002950,-0.392936,-3.922008,0.008119,-0.308016
646,1,0,-0.018348,-0.387928,-2.419384,0.029738,-0.481159
647,0,0,-0.266288,-0.120666,-0.907766,0.000000,-0.625444


---
## **Exploratory analysis**

#### ***Variance-Covariance matrix***

In [5]:
# Defining the global variables
x = us_2022q1_service_industries_variables
x_transpose = x.T
n = x.shape[0]
matrix_one = np.full((n, 1), 1)

In [7]:
# Calculating variance-covariance matrix
var_cov_matrix = (1 / (n - 1)) * (x_transpose.dot(x) - (1 / n) * (x_transpose.dot(matrix_one)).dot(x_transpose.dot(matrix_one).T))
var_cov_matrix

Unnamed: 0,medium firm,big firm,epsp,profit margin,book/market,short leverage,f1 stock return
medium firm,0.218113,-0.112271,0.005541,0.010994,0.006237,-0.001835,-0.00165
big firm,-0.112271,0.227782,0.004596,0.057885,-0.22872,-0.000607,0.013645
epsp,0.005541,0.004596,0.009382,0.038666,-0.013437,-0.000278,0.010968
profit margin,0.010994,0.057885,0.038666,0.705664,-0.093393,-0.001302,0.106076
book/market,0.006237,-0.22872,-0.013437,-0.093393,1.099332,-0.000392,-0.026442
short leverage,-0.001835,-0.000607,-0.000278,-0.001302,-0.000392,0.002085,0.001129
f1 stock return,-0.00165,0.013645,0.010968,0.106076,-0.026442,0.001129,0.141972


In [19]:
# Checking the manual calculation against the pandas built-in covariance
x.cov()

Unnamed: 0,medium firm,big firm,epsp,profit margin,book/market,short leverage,f1 stock return
medium firm,0.218113,-0.112271,0.005541,0.010994,0.006237,-0.001835,-0.00165
big firm,-0.112271,0.227782,0.004596,0.057885,-0.22872,-0.000607,0.013645
epsp,0.005541,0.004596,0.009382,0.038666,-0.013437,-0.000278,0.010968
profit margin,0.010994,0.057885,0.038666,0.705664,-0.093393,-0.001302,0.106076
book/market,0.006237,-0.22872,-0.013437,-0.093393,1.099332,-0.000392,-0.026442
short leverage,-0.001835,-0.000607,-0.000278,-0.001302,-0.000392,0.002085,0.001129
f1 stock return,-0.00165,0.013645,0.010968,0.106076,-0.026442,0.001129,0.141972


**Variance:** the variance of a variable X is the average of the squared deviations of each individual value Xi from the mean of X.

**Covariance:** it measures the joint variability of two variables. It is the average of the products of the deviations of a variable X and a variable Y from their corresponding means. Its magnitude depends on the units of both variables, so it is hard to interpret directly; its sign tells us whether the two variables tend to move in the same or in opposite directions.

In the above calculation we can observe the variances on the diagonal and the covariances in the off-diagonal entries.
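The matrix formula used above, $S = \frac{1}{n-1}\left(X'X - \frac{1}{n}(X'\mathbf{1})(X'\mathbf{1})'\right)$, is equivalent to centering each column and taking the cross-product. The following is a minimal sketch on synthetic data (the array `X` and its dimensions are assumptions for illustration, not the notebook's dataset) that compares both forms with `np.cov`:

```python
import numpy as np

# Synthetic data (hypothetical): 50 observations of 3 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
n = X.shape[0]
ones = np.ones((n, 1))

# Form 1: S = (1 / (n - 1)) * (X'X - (1 / n) * (X'1)(X'1)')
col_sums = X.T @ ones  # column totals, shape (3, 1)
S = (X.T @ X - (col_sums @ col_sums.T) / n) / (n - 1)

# Form 2: subtract the column means, then take the cross-product
X_centered = X - X.mean(axis=0)
S_centered = (X_centered.T @ X_centered) / (n - 1)

# Both forms should agree with NumPy's sample covariance
print(np.allclose(S, S_centered))
print(np.allclose(S, np.cov(X, rowvar=False)))
```

The centered form is usually preferable numerically, since subtracting two large cross-products can lose precision when the means are far from zero.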

#### ***Correlation matrix***

In [26]:
# Defining the global variables
variance = np.diag(var_cov_matrix).reshape(1, 7)  # variances as a 1 x 7 row vector
denominator = np.sqrt(variance) * np.sqrt(variance.T)  # 7 x 7 matrix of products of standard deviations

In [28]:
# Calculating the correlation matrix
corr_matrix = var_cov_matrix / denominator
corr_matrix

Unnamed: 0,medium firm,big firm,epsp,profit margin,book/market,short leverage,f1 stock return
medium firm,1.0,-0.503697,0.122485,0.028022,0.012737,-0.086047,-0.009375
big firm,-0.503697,1.0,0.099425,0.144381,-0.457068,-0.027844,0.075877
epsp,0.122485,0.099425,1.0,0.475211,-0.132308,-0.062771,0.300534
profit margin,0.028022,0.144381,0.475211,1.0,-0.106035,-0.03394,0.335132
book/market,0.012737,-0.457068,-0.132308,-0.106035,1.0,-0.008192,-0.06693
short leverage,-0.086047,-0.027844,-0.062771,-0.03394,-0.008192,1.0,0.065621
f1 stock return,-0.009375,0.075877,0.300534,0.335132,-0.06693,0.065621,1.0


In [9]:
# Checking the manual calculation against the pandas built-in correlation
x.corr()

Unnamed: 0,medium firm,big firm,epsp,profit margin,book/market,short leverage,f1 stock return
medium firm,1.0,-0.503697,0.122485,0.028022,0.012737,-0.086047,-0.009375
big firm,-0.503697,1.0,0.099425,0.144381,-0.457068,-0.027844,0.075877
epsp,0.122485,0.099425,1.0,0.475211,-0.132308,-0.062771,0.300534
profit margin,0.028022,0.144381,0.475211,1.0,-0.106035,-0.03394,0.335132
book/market,0.012737,-0.457068,-0.132308,-0.106035,1.0,-0.008192,-0.06693
short leverage,-0.086047,-0.027844,-0.062771,-0.03394,-0.008192,1.0,0.065621
f1 stock return,-0.009375,0.075877,0.300534,0.335132,-0.06693,0.065621,1.0


**Correlation:** the strength and direction of the linear relationship between two variables. It is the covariance scaled by the product of the two standard deviations, so it always lies between -1 and 1.

In the above calculation we can see that the diagonal of the matrix is full of 1s, which makes sense since the correlation of a variable with itself is 100%. Off-diagonal values close to 1 indicate a strong positive relation between two variables (when variable x goes up, variable y tends to go up), and values close to -1 indicate a strong negative relation (when variable x goes up, variable y tends to go down). Off-diagonal values close to 0 mean that there is little linear correlation between the two variables.

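Since we are looking for new variables to predict the stock returns, a set of independent variables with very high mutual correlation could cause an unreliable estimation of the beta coefficients (multicollinearity). In this case, the highest positive correlation among the independent variables is `0.4752`, between `'profit margin'` and `'epsp'`, and the largest in absolute value is `-0.5037`, between the indicator variables `'medium firm'` and `'big firm'`. Neither is high enough to suggest severe multicollinearity.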
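The multicollinearity check described above amounts to scanning the off-diagonal entries of the correlation matrix for the largest absolute value. A minimal sketch, assuming a small synthetic DataFrame (the columns `a`, `b`, `c` are hypothetical and not part of the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic predictors (hypothetical): b is built to be strongly related to a
rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 0.8 * a + 0.2 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

# Correlation = covariance scaled by the product of the standard deviations
cov = df.cov()
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

# Zero out the diagonal, then locate the largest absolute correlation
abs_corr = corr.abs().to_numpy().copy()
np.fill_diagonal(abs_corr, 0.0)
i, j = np.unravel_index(abs_corr.argmax(), abs_corr.shape)
pair = (corr.index[i], corr.columns[j])

print(pair, round(corr.iloc[i, j], 4))
```

Taking the absolute value matters here: a strongly negative pair is as relevant for multicollinearity as a strongly positive one.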

#### ***Detection of leverage points***

In [31]:
# Defining the global variables
x_variables = x[['medium firm', 'big firm', 'epsp', 'profit margin', 'book/market', 'short leverage']]
x_variables_transpose = x_variables.T

In [37]:
# Calculating the hat matrix
hat_matrix = x_variables.dot(np.linalg.inv(x_variables_transpose.dot(x_variables)).dot(x_variables_transpose))
hat_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,639,640,641,642,643,644,645,646,647,648
0,0.014433,0.006338,-0.004864,0.006173,0.002234,0.005019,0.000116,0.002756,-0.002361,0.011604,...,-0.002125,-0.002529,-0.002981,0.000509,-0.001052,0.000383,-0.006976,0.001832,-0.004208,0.003327
1,0.006338,0.005545,-0.001269,0.005685,0.001191,0.005027,0.000550,0.004524,-0.000786,-0.001127,...,-0.000761,-0.001991,-0.000517,0.003437,0.000246,0.000665,-0.002177,0.003504,-0.001426,-0.001358
2,-0.004864,-0.001269,0.007575,-0.001746,0.002277,-0.000244,0.003706,0.000904,0.006329,0.002145,...,0.001887,0.007614,0.006043,0.003200,0.004721,0.003434,0.009193,0.002408,0.002355,-0.002853
3,0.006173,0.005685,-0.001746,0.006649,0.001738,0.005052,0.000926,0.004751,-0.001403,-0.003417,...,-0.001570,-0.003127,-0.000632,0.002707,0.000124,0.001184,-0.002861,0.003146,0.000053,0.002848
4,0.002234,0.001191,0.002277,0.001738,0.006459,0.000340,0.005411,-0.000334,0.002941,0.001621,...,-0.001498,0.000873,0.003577,-0.002400,0.004733,0.005645,0.000814,-0.002148,-0.001687,-0.000302
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
644,0.000383,0.000665,0.003434,0.001184,0.005645,0.000205,0.005145,0.000045,0.003566,0.000747,...,-0.000912,0.002089,0.004165,-0.001273,0.004723,0.005297,0.002654,-0.001190,-0.000548,0.000165
645,-0.006976,-0.002177,0.009193,-0.002861,0.000814,-0.000643,0.003072,0.001287,0.007126,0.002849,...,0.002770,0.010035,0.006711,0.004781,0.004239,0.002654,0.012163,0.004017,0.002587,0.000748
646,0.001832,0.003504,0.002408,0.003146,-0.002148,0.004445,-0.000964,0.005341,0.001523,0.001612,...,0.001496,0.003564,0.001023,0.007401,-0.000307,-0.001190,0.004017,0.007148,0.002380,0.001156
647,-0.004208,-0.001426,0.002355,0.000053,-0.001687,0.000461,-0.000557,0.001257,0.002031,-0.001193,...,0.000839,0.002239,0.001343,0.002953,0.000868,-0.000548,0.002587,0.002380,0.016053,-0.004063


In [118]:
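# Defining the leverages (diagonal of the hat matrix) and their mean
leverages = np.diagonal(hat_matrix)
# x_variables has no intercept column, so the diagonal sums to 6; adding 1 gives (k + 1) / n with k = 6
leverages_mean = (leverages.sum() + 1) / leverages.shape[0]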
leverages_mean

0.010785824345146378

In [124]:
# Storing unusual Xs indexes in an array
unusual_x = np.nonzero(leverages > (2 * leverages_mean))
unusual_x

(array([  9,  20,  26,  32,  48,  49,  55,  61, 139, 146, 182, 236, 266,
        343, 372, 409, 412, 418, 420, 422, 454, 455, 465, 484, 488, 516,
        542, 549, 560, 598, 607, 624, 626, 648], dtype=int64),)

The **leverage** $h_{ii}$ is a number between 0 and 1 that measures how far the $x_i$ data point lies from the mean of the $x$ values of all $n$ data points. The sum of the $h_{ii}$ equals $k + 1$, the number of parameters including the intercept.

The **hat matrix** (an $n \times n$ matrix) was calculated using the formula $H = X(X'X)^{-1}X'$. Its diagonal contains the leverages, which allow us to determine whether the $x$ values are extreme and therefore potentially influential in the regression model analysis.

To determine extreme $x$ values, a common rule flags observations whose leverage value ($h_{ii}$) is more than 2 times the mean leverage value, which is the threshold used in the code above: $\text{unusual } X: \; h_{ii} > 2\left(\frac{\sum_{i=1}^n h_{ii}}{n}\right)$

An $x_i$ data point with high leverage may or may not be influential. A data point has large influence only if it noticeably changes the estimated regression function.
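As an illustration of the leverage rule, the sketch below computes only the diagonal of $H = X(X'X)^{-1}X'$ instead of the full $n \times n$ matrix. It is a sketch under assumptions: the data are synthetic, one extreme row is planted on purpose, and, unlike the cell above, an explicit column of ones is added to the design matrix so that the leverages sum to $k + 1$ directly.

```python
import numpy as np

# Synthetic predictors (hypothetical): n observations, k variables
rng = np.random.default_rng(2)
n, k = 100, 3
X = rng.normal(size=(n, k))
X[0] = [8.0, -8.0, 8.0]  # one deliberately extreme observation

# Design matrix with an explicit intercept column
X_design = np.column_stack([np.ones(n), X])

# h_ii = x_i' (X'X)^{-1} x_i, computed row by row without building H
XtX_inv = np.linalg.inv(X_design.T @ X_design)
leverages = np.einsum("ij,jk,ik->i", X_design, XtX_inv, X_design)

# The mean leverage equals (k + 1) / n; flag rows above twice the mean
mean_leverage = leverages.sum() / n
flagged = np.nonzero(leverages > 2 * mean_leverage)[0]

print(round(leverages.sum(), 6))
print(flagged)
```

Avoiding the full hat matrix keeps memory use proportional to $n$ rather than $n^2$, which matters little for 649 rows but helps on larger samples.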

#### ***Detection of outliers***

In [108]:
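# Defining the global variables
y_variable = x['f1 stock return']
y_predictions = hat_matrix.dot(y_variable)  # fitted values: y_hat = H y
# Mean squared error (divides by n; the unbiased estimate divides by n minus the number of parameters)
mse = np.square(y_variable - y_predictions).sum() / y_variable.shape[0]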

In [109]:
# Calculating the residuals
residuals = y_variable - y_predictions
residuals

0      0.135975
1      0.231742
2     -0.098095
3     -0.532457
4      0.268839
         ...   
644    0.009668
645    0.105686
646   -0.051697
647   -0.311473
648    0.051012
Length: 649, dtype: float64

In [122]:
# Calculating the standardized residuals: e_i / sqrt(MSE * (1 - h_ii))
standardize_residuals = residuals / np.sqrt(mse * (1 - leverages)) 
standardize_residuals = np.array(standardize_residuals)

In [123]:
# Storing outliers indexes in an array
outliers = np.nonzero(np.absolute(standardize_residuals) > 2)
outliers

(array([ 49,  54,  56,  65, 131, 173, 197, 204, 235, 266, 270, 320, 376,
        447, 461, 488, 534, 544, 585, 624, 639], dtype=int64),)
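The outlier rule applied above flags observations whose standardized residual $r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$ exceeds 2 in absolute value. The following is a minimal sketch on synthetic data, under two assumptions that differ from the cells above: the design matrix includes an intercept column, and the MSE divides by $n - p$ (with $p$ the number of estimated parameters) rather than by $n$. The coefficients, noise level, and planted outlier are hypothetical.

```python
import numpy as np

# Synthetic regression data (hypothetical) with one planted outlier
rng = np.random.default_rng(3)
n, k = 120, 2
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([0.5, -0.3]) + 0.2 * rng.normal(size=n)
y[5] += 3.0  # gross error in the response of observation 5

# Least-squares fit with an intercept
X_design = np.column_stack([np.ones(n), X])
p = X_design.shape[1]
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
residuals = y - X_design @ beta

# Leverages h_ii from the diagonal of the hat matrix
XtX_inv = np.linalg.inv(X_design.T @ X_design)
leverages = np.einsum("ij,jk,ik->i", X_design, XtX_inv, X_design)

# Standardized residuals and the |r_i| > 2 rule
mse = (residuals ** 2).sum() / (n - p)
standardized = residuals / np.sqrt(mse * (1 - leverages))
outliers = np.nonzero(np.abs(standardized) > 2)[0]

print(outliers)
```

With several hundred observations, dividing by $n$ or by $n - p$ changes the standardized residuals only slightly, though observations very close to the threshold of 2 could fall on either side.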