-
Notifications
You must be signed in to change notification settings - Fork 10
/
econometrics-cheatsheet-en.tex
625 lines (464 loc) · 27.3 KB
/
econometrics-cheatsheet-en.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
\documentclass[10pt, a4paper, landscape]{extarticle}
% ----- packages -----
\usepackage{amsmath, amsfonts, amssymb} % better math
\usepackage{enumitem} % better lists
\usepackage{geometry} % margins
\usepackage{graphicx} % scaling
\usepackage{hyperref} % hyperlinks
\usepackage{multicol} % multiple columns
\usepackage{parskip} % paragraph spacing
\usepackage{scrlayer-scrpage} % page foot
\usepackage{tikz} % plots
\usepackage{titlesec} % titles
% ----- random seed -----
\pgfmathsetseed{12}
% ----- custom commands -----
\newcommand{\E}{\mathrm{E}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\se}{\mathrm{se}}
\newcommand{\Cov}{\mathrm{Cov}}
\newcommand{\Corr}{\mathrm{Corr}}
\newcommand{\SSR}{\mathrm{SSR}}
\newcommand{\SSE}{\mathrm{SSE}}
\newcommand{\SST}{\mathrm{SST}}
\newcommand{\tr}{\mathsf{T}}
% ----- page customization -----
\geometry{margin=1cm} % margins config
\pagenumbering{gobble} % remove page numeration
\setlength{\parskip}{0cm} % paragraph spacing
% title spacing
\titlespacing{\section}{0pt}{2ex}{1ex}
\titlespacing{\subsection}{0pt}{1ex}{0ex}
\titlespacing{\subsubsection}{0pt}{0.5ex}{0ex}
% ----- document -----
\begin{document}
\cfoot{\href{https://github.com/marcelomijas/econometrics-cheatsheet}{\normalfont \footnotesize CS-24.2-EN - github.com/marcelomijas/econometrics-cheatsheet - CC-BY-4.0 license}}
\setlength{\footskip}{12pt}
\begin{multicols}{3}
\begin{center}
\textbf{\LARGE \href{https://github.com/marcelomijas/econometrics-cheatsheet}{Econometrics Cheat Sheet}}
{\footnotesize By Marcelo Moreno - Universidad Rey Juan Carlos}
{\footnotesize The Econometrics Cheat Sheet Project}
\end{center}
\section*{Basic concepts}
\subsection*{Definitions}
\textbf{Econometrics} - is a social science discipline with the objective of quantify the relationships between economic agents, test economic theories and evaluate and implement government and business policies.
\textbf{Econometric model} - is a simplified representation of the reality to explain economic phenomena.
\textbf{\textsl{Ceteris paribus}} - if all the other relevant factors remain constant.
\subsection*{Data types}
\textbf{Cross section} - data taken at a given moment in time, an static \textsl{photo}. Order doesn't matter.
\textbf{Time series} - observation of variables across time. Order does matter.
\textbf{Panel data} - consist of a time series for each observation of a cross section.
\textbf{Pooled cross sections} - combines cross section from different time periods.
\subsection*{Phases of an econometric model}
\begin{enumerate}[leftmargin=*]
\setlength{\multicolsep}{0pt}
\begin{multicols}{2}
\item Specification.
\item Estimation.
\columnbreak
\item Validation.
\item Utilization.
\end{multicols}
\end{enumerate}
\subsection*{Regression analysis}
Study and predict the mean value of a variable (dependent variable, $y$) regarding the base of fixed values of other variables (independent variables, $x$'s). In econometrics it is common to use Ordinary Least Squares (OLS) for regression analysis.
\subsection*{Correlation analysis}
Correlation analysis don't distinguish between dependent and independent variables.
\begin{itemize}[leftmargin=*]
\item Simple correlation measures the grade of linear association between two variables.
\begin{center}
$r = \frac{\Cov(x, y)}{\sigma_x \cdot \sigma_y} = \frac{\sum_{i=1}^n ((x_i - \overline{x}) \cdot (y_i - \overline{y}))}{\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2 \cdot \sum_{i=1}^n (y_i - \overline{y})^2}}$
\end{center}
\item Partial correlation measures the grade of linear association between two variables controlling a third.
\end{itemize}
\columnbreak
\section*{Assumptions and properties}
\subsection*{Econometric model assumptions}
Under this assumptions, the OLS estimator will present good properties. \textbf{Gauss-Markov} assumptions:
\begin{enumerate}[leftmargin=*]
\item \textbf{Parameters linearity} (and weak dependence in time series). $y$ must be a linear function of the $\beta$'s.
\item \textbf{Random sampling}. The sample from the population has been randomly taken. (Only when cross section)
\item \textbf{No perfect collinearity}.
\begin{itemize}[leftmargin=*]
\item There are no independent variables that are constant: $\Var(x_j) \neq 0, \; \forall j = 1, \ldots, k$.
\item There isn't an exact linear relation between independent variables.
\end{itemize}
\item \textbf{Conditional mean zero and correlation zero}.
\begin{enumerate}[leftmargin=*, label=\alph*.]
\item There aren't systematic errors: $\E(u \mid x_1, \ldots, x_k) = \E(u) = 0 \rightarrow$ \textbf{strong exogeneity} (a implies b).
\item There are no relevant variables left out of the model: $\Cov(x_j, u) = 0, \; \forall j = 1, \ldots, k \rightarrow$ \textbf{weak exogeneity}.
\end{enumerate}
\item \textbf{Homoscedasticity}. The variability of the residuals is the same for all levels of $x$: \\ $\Var(u \mid x_1, \ldots, x_k) = \sigma^2_u$
\item \textbf{No auto-correlation}. Residuals don't contain information about any other residuals: \\ $\Corr(u_t, u_s \mid x_1, \ldots, x_k) = 0, \; \forall t \neq s$.
\item \textbf{Normality}. Residuals are independent and identically distributed: $u \sim \mathcal{N} (0, \sigma^2_u)$
\item \textbf{Data size}. The number of observations available must be greater than $(k + 1)$ parameters to estimate. (It is already satisfied under asymptotic situations)
\end{enumerate}
\subsection*{Asymptotic properties of OLS}
Under the econometric model assumptions and the Central Limit Theorem (CLT):
\begin{itemize}[leftmargin=*]
\item Hold 1 to 4a: OLS is \textbf{unbiased}. $\E(\hat{\beta}_j) = \beta_j$
\item Hold 1 to 4: OLS is \textbf{consistent}. $\mathrm{plim}(\hat{\beta}_j) = \beta_j$ (to 4b left out 4a, weak exogeneity, biased but consistent)
\item Hold 1 to 5: \textbf{asymptotic normality} of OLS (then, 7 is necessarily satisfied): $u \underset{a}{\sim} \mathcal{N} (0, \sigma^2_u)$
\item Hold 1 to 6: \textbf{unbiased estimate} of $\sigma^2_u$. $\E(\hat{\sigma}^2_u) = \sigma^2_u$
\item Hold 1 to 6: OLS is \textcolor{blue}{BLUE} (Best Linear Unbiased Estimator) or \textbf{efficient}.
\item Hold 1 to 7: hypothesis testing and confidence intervals can be done reliably.
\end{itemize}
\section*{Ordinary Least Squares}
\textbf{Objective} - minimize the Sum of Squared Residuals (SSR):
\begin{center}
$\min \sum_{i=1}^n \hat{u}_i^2$, where $\hat{u}_i = y_i - \hat{y}_i$
\end{center}
\subsection*{Simple regression model}
\setlength{\multicolsep}{2pt}
\setlength{\columnsep}{-40pt}
\begin{multicols}{2}
\begin{tikzpicture}[scale=0.15]
% \draw [step=2, gray, very thin] (-10, -10) grid (10, 10); % background grid
\draw [thick, <->] (-10, 10) node [anchor=south] {$y$} -- (-10, -10) -- (10, -10) node [anchor=north] {$x$}; %axis
\draw [red, thick] plot [domain=-10:10] (\x, {1 + 0.5*\x}); % regression line
\draw plot [only marks, mark=*, mark size=6, domain=-8:8, samples=20] (\x, {rnd*5 - 1.5 + 0.5*\x}); % data points
\draw (-9.3, -9.6) -- (-9.5, -9.6) -- (-9.5, -4.3) -- (-9.3, -4.3) node [anchor=north west] {$\beta_0$}; % beta0
\draw (-2, 0) -- (1.5, 0) arc (0:25:3.5); % beta1 arc
\draw (3, -0.5) node {$\beta_1$}; % beta1
\end{tikzpicture}
\columnbreak
Equation:
\begin{center}
$y_i = \beta_0 + \beta_1 x_i + u_i$
\end{center}
Estimation:
\begin{center}
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
\end{center}
where:
\begin{center}
$\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}$
$\hat{\beta}_1 = \frac{\Cov(y, x)}{\Var(x)}$
\end{center}
\end{multicols}
\subsection*{Multiple regression model}
\setlength{\multicolsep}{2pt}
\setlength{\columnsep}{-40pt}
\begin{multicols}{2}
\begin{tikzpicture}[scale=0.15]
\draw [thick, ->] (-10, -10) -- (-3, -4) node [anchor=north west] {$x_2$}; % x2 axis
\draw [thick, <->] (-10, 10) node [anchor=south] {$y$} -- (-10, -10) -- (10, -10) node [anchor=north] {$x_1$}; % y and x1 axis
% regression grid
\draw [red, thick] (-10, -4) -- (-5, 3);
\draw [red, thick] (-8.5, -3.5) -- (-3.5, 3.5);
\draw [red, thick] (-7, -3) -- (-2, 4);
\draw [red, thick] (-5.5, -2.5) -- (-0.5, 4.5);
\draw [red, thick] (-4, -2) -- (1, 5);
\draw [red, thick] (-2.5, -1.5) -- (2.5, 5.5);
\draw [red, thick] (-1, -1) -- (4, 6);
\draw [red, thick] (0.5, -0.5) -- (5.5, 6.5);
\draw [red, thick] (2, 0) -- (7, 7);
\draw [red, thick] (3.5, 0.5) -- (8.5, 7.5);
\draw [red, thick] (5, 1) -- (10, 8);
\draw [red, thick] (-10, -4) -- (5, 1);
\draw [red, thick] (-8.75, -2.25) -- (6.25, 2.75);
\draw [red, thick] (-7.5, -0.5) -- (7.5, 4.5);
\draw [red, thick] (-6.25, 1.25) -- (8.75, 6.25);
\draw [red, thick] (-5, 3) -- (10, 8);
\draw plot [only marks, mark=*, mark size=6, domain=-8:8, samples=20] (\x, {rnd*6.5 - 1.5 + 0.5*\x}); % data points
\draw (-9.3, -9.6) -- (-9.5, -9.6) -- (-9.5, -4.3) -- (-9.3, -4.3) node [anchor=north west] {$\beta_0$}; % beta0
\end{tikzpicture}
\columnbreak
Equation:
\begin{center}
$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + u_i$
\end{center}
Estimation:
\begin{center}
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \cdots + \hat{\beta}_k x_{ki}$
\end{center}
where:
\begin{center}
$\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}_1 - \cdots - \hat{\beta}_k \overline{x}_k$
$\hat{\beta}_j = \frac{\Cov(y, \text{resid } x_j)}{\Var(\text{resid } x_j)}$
\end{center}
Matrix: $\hat{\beta} = (X^\tr X)^{-1} (X^\tr y)$
\end{multicols}
\subsection*{Interpretation of coefficients}
\begin{center}
\scalebox{0.85}{
\begin{tabular}{ c c c c }
Model & Dependent & Independent & $\beta_1$ interpretation \\ \hline
Level-level & $y$ & $x$ & $\Delta y = \beta_1 \Delta x$ \\
Level-log & $y$ & $\log(x)$ & $\Delta y \approx (\beta_1/100) (\% \Delta x$) \\
Log-level & $\log(y)$ & $x$ & $\% \Delta y \approx (100 \beta_1) \Delta x$ \\
Log-log & $\log(y)$ & $\log(x)$ & $\% \Delta y \approx \beta_1 (\% \Delta x$) \\
Quadratic & $y$ & $x + x^2$ & $\Delta y = (\beta_1 + 2 \beta_2 x) \Delta x$
\end{tabular}
}
\end{center}
\subsection*{Error measurements}
Sum of Sq. Residuals: \hfill $\SSR = \sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$
Explained Sum of Squares: \hfill $\SSE = \sum_{i=1}^n (\hat{y}_i - \overline{y})^2$
Total Sum of Sq.: \hfill $\SST = \SSE + \SSR = \sum_{i=1}^n (y_i - \overline{y})^2$
Standard Error of the Regression: \hfill $\hat{\sigma}_u = \sqrt{\frac{\SSR}{n - k - 1}}$
Standard Error of the $\hat{\beta}$'s: \hfill $\se(\hat{\beta}) = \sqrt{\hat{\sigma}^2_u \cdot (X^\tr X)^{-1}}$
Root Mean Squared Error: \hfill $\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}}$
Absolute Mean Error: \hfill $\mathrm{AME} = \frac{\sum_{i=1}^n \lvert y_i - \hat{y}_i \rvert}{n}$
Mean Percentage Error: \hfill $\mathrm{MPE} = \frac{\sum_{i=1}^n \lvert \hat{u}_i / y_i \rvert}{n} \cdot 100$
\columnbreak
\section*{R-squared}
Is a measure of the \textbf{goodness of the fit}, how the regression fits to the data:
\begin{center}
$R^2 = \frac{\SSE}{\SST} = 1 - \frac{\SSR}{\SST}$
\end{center}
\begin{itemize}[leftmargin=*]
\item Measures the \textbf{percentage of variation} of $y$ that is linearly \textbf{explained} by the variations of $x$'s.
\item Takes values \textbf{between 0} (no linear explanation of the variations of $y$) \textbf{and 1} (total explanation of the variations of $y$).
\end{itemize}
When the number of regressors increment, the value of the R-squared increments as well, whatever the new variables are relevant or not. To solve this problem, there is an \textbf{adjusted R-squared} by degrees of freedom (or corrected R-squared):
\begin{center}
$\overline{R}^2 = 1 - \frac{n - 1}{n - k - 1} \cdot \frac{\SSR}{\SST} = 1 - \frac{n - 1}{n - k - 1} \cdot (1 - R^2)$
\end{center}
For big sample sizes: $\overline{R}^2 \approx R^2$
\section*{Hypothesis testing}
\subsection*{Definitions}
An hypothesis test is a rule designed to explain from a sample, if exist \textbf{evidence or not to reject an hypothesis} that is made about one or more population parameters.
Elements of an hypothesis test:
\begin{itemize}[leftmargin=*]
\item \textbf{Null hypothesis} ($H_0$) - is the hypothesis to be tested.
\item \textbf{Alternative hypothesis} ($H_1$) - is the hypothesis that cannot be rejected when the null hypothesis is rejected.
\item \textbf{Test statistic} - is a random variable whose probability distribution is known under the null hypothesis.
\item \textbf{Critic value} - is the value against which the test statistic is compared to determine if the null hypothesis is rejected or not. Is the value that makes the frontier between the regions of acceptance and rejection of the null hypothesis.
\item \textbf{Significance level} ($\alpha$) - is the probability of rejecting the null hypothesis being true (Type I Error). Is chosen by who conduct the test. Commonly is 0.10, 0.05 or 0.01.
\item \textbf{p-value} - is the highest level of significance by which the null hypothesis cannot be rejected ($H_0$).
\end{itemize}
\textbf{The rule is}: if the p-value is \textbf{less} than $\alpha$, there is evidence to \textbf{reject} the \textbf{null hypothesis} at that given $\alpha$ (there is evidence to accept the alternative hypothesis).
\subsection*{Individual tests}
Tests if a parameter is significantly different from a given value, $\vartheta$.
\begin{itemize}[leftmargin=*]
\item $H_0: \beta_j = \vartheta$
\item $H_1: \beta_j \neq \vartheta$
\end{itemize}
\begin{center}
Under $H_0$: \quad $t = \frac{\hat{\beta}_j - \vartheta}{\se(\hat{\beta}_j)} \sim t_{n - k - 1, \alpha/2}$
\end{center}
If $\lvert t \rvert > \lvert t_{n - k - 1, \alpha/2} \rvert$, there is evidence to reject $H_0$.
\textbf{Individual significance test} - tests if a parameter is significantly \textbf{different from zero}.
\begin{itemize}[leftmargin=*]
\item $H_0: \beta_j = 0$
\item $H_1: \beta_j \neq 0$
\end{itemize}
\begin{center}
Under $H_0$: \quad $t = \frac{\hat{\beta}_j}{\se(\hat{\beta}_j)} \sim t_{n - k - 1, \alpha/2}$
\end{center}
If $\lvert t \rvert > \lvert t_{n - k - 1, \alpha/2} \rvert$, there is evidence to reject $H_0$.
\subsection*{The F test}
Simultaneously tests multiple (linear) hypothesis about the parameters. It makes use of a non restricted model and a restricted model:
\begin{itemize}[leftmargin=*]
\item \textbf{Non restricted model} - is the model on which we want to test the hypothesis.
\item \textbf{Restricted model} - is the model on which the hypothesis that we want to test have been imposed.
\end{itemize}
Then, looking at the errors, there are:
\begin{itemize}[leftmargin=*]
\item \textbf{$\SSR_\mathrm{UR}$} - is the $\SSR$ of the non restricted model.
\item \textbf{$\SSR_\mathrm{R}$} - is the $\SSR$ of the restricted model.
\end{itemize}
\begin{center}
Under $H_0$: \quad $F = \frac{\SSR_\mathrm{R} - \SSR_\mathrm{UR}}{\SSR_\mathrm{UR}} \cdot \frac{n - k - 1}{q} \sim F_{q, n - k - 1}$
\end{center}
where $k$ is the number of parameters of the non restricted model and $q$ is the number of linear hypothesis tested.
If $F_{q, n - k - 1} < F$, there is evidence to reject $H_0$.
\textbf{Global significance test} - tests if all the parameters associated to $x$'s are \textbf{simultaneously equal to zero}.
\begin{itemize}[leftmargin=*]
\item $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
\item $H_1: \beta_1 \neq 0$ and/or $\beta_2 \neq 0 \ldots$ and/or $\beta_k \neq 0$
\end{itemize}
In this case, we can simplify the formula for the $F$ statistic.
\begin{center}
Under $H_0$: \quad $F = \frac{R^2}{1 - R^2} \cdot \frac{n - k - 1}{k} \sim F_{k, n - k - 1}$
\end{center}
If $F_{k, n - k - 1} < F$, there is evidence to reject $H_0$.
\section*{Confidence intervals}
The confidence intervals at ($1 - \alpha$) confidence level can be calculated:
\begin{center}
$\hat{\beta}_j \mp t_{n - k - 1, \alpha/2} \cdot \se(\hat{\beta}_j)$
\end{center}
\section*{Dummy variables}
Dummy (or binary) variables are used for qualitative information like sex, civil state, country, etc.
\begin{itemize}[leftmargin=*]
\item Takes the \textbf{value 1} in a given category and \textbf{0 in the rest}.
\item Are used to analyze and modeling \textbf{structural changes} in the model parameters.
\end{itemize}
If a qualitative variable have $m$ categories, we only have to include ($m - 1$) dummy variables.
\subsection*{Structural change}
Structural change refers to changes in the values of the parameters of the econometric model produced by the effect of different sub-populations. Structural change can be included in the model through dummy variables.
The location of the dummy variables ($D$) matters:
\begin{itemize}[leftmargin=*]
\item \textbf{On the intercept} (additive effect) - represents the mean difference between the values produced by the structural change.
\begin{center}
$y = \beta_0 + \delta_1 D + \beta_1 x_1 + u$
\end{center}
\item \textbf{On the slope} (multiplicative effect) - represents the effect (slope) difference between the values produced by the structural change.
\begin{center}
$y = \beta_0 + \beta_1 x_1 + \delta_1 D \cdot x_1 + u$
\end{center}
\end{itemize}
\textbf{Chow's structural test} - is used when we want to analyze the existence of structural changes in all the model parameters, it's a particular expression of the F test, where the null hypothesis is: $H_0$: No structural change (all $\delta = 0$).
\section*{Changes of scale}
Changes in the \textbf{measurement units} of the variables:
\begin{itemize}[leftmargin=*]
\item In the \textbf{endogenous} variable, $y^* = y \cdot \lambda$ - affects all model parameters, $\beta_j^* = \beta_j \cdot \lambda, \; \forall j = 1, \ldots, k$
\item In an \textbf{exogenous} variable, $x_j^* = x_j \cdot \lambda$ - only affect the parameter linked to said exogenous variable, $\beta_j^* = \beta_j \cdot \lambda$
\item Same scale change on endogenous and exogenous - only affects the intercept, $\beta_0^* = \beta_0 \cdot \lambda$
\end{itemize}
\section*{Changes of origin}
Changes in the \textbf{measurement origin} of the variables (endogenous or exogenous), $y^* = y + \lambda$ - only affects the model's intercept, $\beta_0^* = \beta_0 + \lambda$
\columnbreak
\section*{Multicollinearity}
\begin{itemize}[leftmargin=*]
\item \textbf{Perfect multicollinearity} - there are independent variables that are constant and/or there is an exact linear relation between independent variables. Is the \textbf{breaking of the third (3) econometric} model \textbf{assumption}.
\item \textbf{Approximate multicollinearity} - there are independent variables that are approximately constant and/or there is an approximately linear relation between independent variables. It \textbf{does not break any econometric} model \textbf{assumption}, but has an effect on OLS.
\end{itemize}
\subsection*{Consequences}
\begin{itemize}[leftmargin=*]
\item \textbf{Perfect multicollinearity} - the equation system of OLS cannot be solved due to infinite solutions.
\item \textbf{Approximate multicollinearity}
\begin{itemize}[leftmargin=*]
\item Small sample variations can induce to big variations in the OLS estimations.
\item The variance of the OLS estimators of the $x$'s that are collinear, increments, thus the inference of the parameter is affected. The estimation of the parameter is very imprecise (big confidence interval).
\end{itemize}
\end{itemize}
\subsection*{Detection}
\begin{itemize}[leftmargin=*]
\item \textbf{Correlation analysis} - look for high correlations between independent variables, $\lvert r \rvert > 0.7$.
\item \textbf{Variance Inflation Factor (VIF)} - indicates the increment of $\Var(\hat{\beta}_j)$ because of the multicollinearity.
\begin{center}
$\mathrm{VIF} (\hat{\beta}_j) = \frac{1}{1 - R_j^2}$
\end{center}
where $R^2_j$ denotes the R-squared from a regression between $x_j$ and all the other $x$'s.
\begin{itemize}[leftmargin=*]
\item Values between 4 to 10 suggest that it is advisable to analyze in more depth if there might be multicollinearity problems.
\item Values bigger than 10 indicates that there are multicollinearity problems.
\end{itemize}
\end{itemize}
One typical characteristic of multicollinearity is that the regression coefficients of the model aren't individually different from zero (due to high variances), but jointly they are different from zero.
\subsection*{Correction}
\begin{itemize}[leftmargin=*]
\item Delete one of the collinear variables.
\item Perform factorial analysis (or any other dimension reduction technique) on the collinear variables.
\item Interpret coefficients with multicollinearity jointly.
\end{itemize}
\columnbreak
\section*{Heteroscedasticity}
The residuals $u_i$ of the population regression function do not have the same variance $\sigma^2_u$:
\begin{center}
$\Var(u \mid x_1, \ldots, x_k) = \Var(u) \neq \sigma^2_u$
\end{center}
Is the \textbf{breaking of the fifth (5) econometric} model \textbf{assumption}.
\subsection*{Consequences}
\begin{itemize}[leftmargin=*]
\item OLS estimators still are unbiased.
\item OLS estimators still are consistent.
\item OLS is \textbf{not efficient} anymore, but still a LUE (Linear Unbiased Estimator).
\item \textbf{Variance estimations} of the estimators are \textbf{biased}: the construction of confidence intervals and the hypothesis testing is not reliable.
\end{itemize}
\subsection*{Detection}
\begin{itemize}[leftmargin=*]
\setlength{\multicolsep}{0pt}
\setlength{\columnsep}{20pt}
\begin{multicols}{3}
\item \textbf{Graphs} - look for scatter patterns on $x$ vs. $u$ or $x$ vs. $y$ plots.
\columnbreak
\begin{tikzpicture}[scale=0.108]
% \draw [step=2, gray, very thin] (-10, -10) grid (10, 10); % grid
\draw [thick, ->] (-10, 0) -- (10, 0) node [anchor=north] {$x$}; % x axis
\draw [thick, -] (-10, -10) -- (-10, 10) node [anchor=west] {$u$}; % u axis
\draw plot [only marks, mark=*, mark size=6, domain=0:17, samples=50] ({\x - 9}, {-0.5*rand*\x - 1}); % data points
\draw [thick, dashed, red, -latex] plot [domain=1:17] ({\x - 10}, {-0.5*\x - 1}); % lower red arrow
\draw [thick, dashed, red, -latex] plot [domain=1:17] ({\x - 10}, {0.5*\x - 1}); % upper red arrow
\end{tikzpicture}
\columnbreak
\begin{tikzpicture}[scale=0.108]
% \draw [step=2, gray, very thin] (-10, -10) grid (10, 10); % grid
\draw [thick, <->] (-10, 10) node [anchor=west] {$y$} -- (-10, -10) -- (10, -10) node[anchor=south] {$x$}; % axis
\draw plot [only marks, mark=*, mark size=6, domain=0:13, samples=50] ({\x - 9}, {(-0.65*rand*\x) + 0.6*\x - 8}); % data points
\draw [thick, dashed, red, -latex] plot [domain=0:12] ({\x - 9}, {-0.06*\x - 8.5}); % lower red arrow
\draw [thick, dashed, red, -latex] plot [domain=0:12] ({\x -9}, {1.25*\x - 7.2}); % upper red arrow
\end{tikzpicture}
\end{multicols}
\item \textbf{Formal tests} - White, Bartlett, Breusch-Pagan, etc. Commonly, the null hypothesis: $H_0$: Homoscedasticity.
\end{itemize}
\subsection*{Correction}
\begin{itemize}[leftmargin=*]
\item Use OLS with a variance-covariance matrix estimator robust to heteroscedasticity (HC), for example, the one proposed by White.
\item If the variance structure is known, make use of Weighted Least Squares (WLS) or Generalized Least Squares (GLS):
\begin{itemize}[leftmargin=*]
\item Supposing that $\Var(u) = \sigma^2_u \cdot x_i$, divide the model variables by the square root of $x_i$ and apply OLS.
\item Supposing that $\Var(u) = \sigma^2_u \cdot x_i^2$, divide the model variables by $x_i$ (the square root of $x_i^2$) and apply OLS.
\end{itemize}
\item If the variance structure is not known, make use of Feasible Weighted Least Squared (FWLS), that estimates a possible variance, divides the model variables by it and then apply OLS.
\item Make a new model specification, for example, logarithmic transformation (lower variance).
\end{itemize}
\columnbreak
\section*{Auto-correlation}
The residual of any observation, $u_t$, is correlated with the residual of any other observation. The observations are not independent.
\begin{center}
$\Corr(u_t, u_s \mid x_1, \ldots, x_k) = \Corr(u_t, u_s) \neq 0, \quad \forall t \neq s$
\end{center}
The ``natural" context of this phenomena is time series. Is the \textbf{breaking of the sixth (6) econometric} model \textbf{assumption}.
\subsection*{Consequences}
\begin{itemize}[leftmargin=*]
\item OLS estimators still are unbiased.
\item OLS estimators still are consistent.
\item OLS is \textbf{not efficient} anymore, but still a LUE (Linear Unbiased Estimator).
\item \textbf{Variance estimations} of the estimators are \textbf{biased}: the construction of confidence intervals and the hypothesis testing is not reliable.
\end{itemize}
\subsection*{Detection}
\begin{itemize}[leftmargin=*]
\item \textbf{Graphs} - look for scatter patterns on $u_{t - 1}$ vs. $u_t$ or make use of a correlogram.
\setlength{\multicolsep}{0pt}
\setlength{\columnsep}{6pt}
\begin{multicols}{3}
\begin{center}
\begin{tikzpicture}[scale=0.11]
\node at (0, 13) {\textbf{Ac.}};
% \draw [step=2, gray, very thin] (-10, -10) grid (10, 10); % grid
\draw [thick, ->] (-10, 0) -- (10, 0) node [anchor=south] {$u_{t - 1}$}; % ut-1 axis
\draw [thick, -] (-10, -10) -- (-10, 10) node [anchor=west] {$u_t$}; % ut axis
\draw plot [only marks, mark=*, mark size=6, domain=-8:8, samples=50] (\x, {rnd*6 + (-2*(\x)^2 + 40)*0.1}); % data points
\draw [thick, dashed, red, -latex] plot [domain=-8:8] (\x, {3 + (-2*(\x)^2 + 40)*0.1}); % red arrow
\end{tikzpicture}
\end{center}
\columnbreak
\begin{center}
\begin{tikzpicture}[scale=0.11]
\node at (0, 13) {\textbf{Ac.(+)}};
% \draw [step=2, gray, very thin] (-10, -10) grid (10, 10); % grid
\draw [thick, ->] (-10, 0) -- (10, 0) node [anchor=north] {$u_{t - 1}$}; % ut-1 axis
\draw [thick, -] (-10, -10) -- (-10, 10) node [anchor=west] {$u_t$}; % ut axis
\draw plot [only marks, mark=*, mark size=6, domain=-8:8, samples=25] (\x, {rnd*6 + 0.5*\x - 3}); % data points
\draw [thick, dashed, red, -latex] plot [domain=-8:8] (\x, {3 + 0.5*\x - 3}); % red arrow
\end{tikzpicture}
\end{center}
\columnbreak
\begin{center}
\begin{tikzpicture}[scale=0.11]
\node at (0, 13) {\textbf{Ac.(-)}};
% \draw [step=2, gray, very thin] (-10, -10) grid (10, 10); % grid
\draw [thick, ->] (-10, 0) -- (10, 0) node [anchor=south] {$u_{t - 1}$}; % ut-1 axis
\draw [thick, -] (-10, -10) -- (-10, 10) node [anchor=west] {$u_t$}; % ut axis
\draw plot [only marks, mark=*, mark size=6, domain=-8:8, samples=25] (\x, {rnd*6 - 0.5*\x - 3}); % data points
\draw [thick, dashed, red, -latex] plot [domain=-8:8] (\x, {3 - 0.5*\x - 3}); % red arrow
\end{tikzpicture}
\end{center}
\end{multicols}
\item \textbf{Formal tests} - Durbin-Watson, Breusch-Godfrey, etc. Commonly, the null hypothesis: $H_0$: No auto-correlation.
\end{itemize}
\subsection*{Correction}
\begin{itemize}[leftmargin=*]
\item Use OLS with a variance-covariance matrix estimator robust to heterocedasticity and auto-correlation (HAC), for example, the one proposed by Newey-West.
\item Use Generalized Least Squares. Supposing $y_t = \beta_0 + \beta_1 x_t + u_t$, with $u_t = \rho u_{t - 1} + \varepsilon_t$, where $\lvert \rho \rvert < 1$ and $\varepsilon_t$ is white noise.
\begin{itemize}[leftmargin=*]
\item If $\rho$ is known, create a quasi-differentiated model where $u_t$ is white noise and estimate it by OLS.
\item If $\rho$ is not known, estimate it by -for example- the Cochrane-Orcutt method, create a quasi-differentiated model where $u_t$ is white noise and estimate it by OLS.
\end{itemize}
\end{itemize}
\end{multicols}
\end{document}