# Econometrics seminar

### Wooldridge
We are going to need data sets from the `wooldridge` package. [Click here](https://pypi.org/project/wooldridge/) for installation instructions.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import wooldridge

In [2]:
wooldridge.data()

  J.M. Wooldridge (2019) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93    

1. The data in 401K.RAW are a subset of data analyzed by Papke (1995) to study the relationship between participation in a 401(k) pension plan and the generosity of the plan. The variable *prate* is the percentage of eligible workers with an active account; this is the variable we would like to explain. The measure of generosity is the plan match rate, *mrate*. This variable gives the average amount the firm contributes to each worker’s plan for each 100¢ contribution by the worker. For example, if mrate = 0.50, then a 100¢ contribution by the worker is matched by a 50¢ contribution by the firm.
    1. Find the average participation rate and the average match rate in the sample of plans.
    2. Now, estimate the simple regression equation $\widehat{prate} = \widehat{\beta_0} + \widehat{\beta_1} mrate$, and report the results along with the sample size and R-squared.
    3. Interpret the intercept in your equation. Interpret the coefficient on mrate.
    4. Find the predicted prate when $mrate = 3.5$. Is this a reasonable prediction? Explain what is happening here.
    5. How much of the variation in prate is explained by mrate? Is this a lot in your opinion?

<p id="average"><b>1</b></p>
The data in 401K.RAW

In [3]:
wooldridge.data('401k', description=True)

name of dataset: 401k
no of variables: 8
no of observations: 1534

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| prate    | participation rate, percent     |
| mrate    | 401k plan match rate            |
| totpart  | total 401k participants         |
| totelg   | total eligible for 401k plan    |
| age      | age of 401k plan                |
| totemp   | total number of firm employees  |
| sole     | = 1 if 401k is firm's sole plan |
| ltotemp  | log of totemp                   |
+----------+---------------------------------+

L.E. Papke (1995), “Participation in and Contributions to 401(k)
Pension Plans:Evidence from Plan Data,” Journal of Human Resources 30,
311-325. Professor Papke kindly provided these data. She gathered them
from the Internal Revenue Service’s Form 5500 tapes.


In [4]:
df = wooldridge.data('401k')
df

Unnamed: 0,prate,mrate,totpart,totelg,age,totemp,sole,ltotemp
0,26.100000,0.21,1653.0,6322.0,8,8709.0,0,9.072112
1,100.000000,1.42,262.0,262.0,6,315.0,1,5.752573
2,97.599998,0.91,166.0,170.0,10,275.0,1,5.616771
3,100.000000,0.42,257.0,257.0,7,500.0,0,6.214608
4,82.500000,0.53,591.0,716.0,28,933.0,1,6.838405
...,...,...,...,...,...,...,...,...
1529,85.099998,0.33,553.0,650.0,24,907.0,0,6.810143
1530,100.000000,2.52,142.0,142.0,17,197.0,1,5.283204
1531,100.000000,2.27,1928.0,1928.0,35,2171.0,0,7.682943
1532,100.000000,0.58,166.0,166.0,8,931.0,1,6.836259


<p id="average"><b>1.A</b></p>
Find the average participation rate and the average match rate in the sample of plans.

In [5]:
print(f'The average participation rate is {df.prate.mean().round(3)}')

The average participation rate is 87.363


In [6]:
print(f'The average match rate is {df.mrate.mean().round(3)}')

The average match rate is 0.732


<p id="average"><b>1.B</b></p>
Now, estimate the simple regression equation $$\widehat{prate} =\widehat{ \beta_0} + \widehat{ \beta_1}mrate ,$$ and report the results along with the sample size and $R$-squared.

In [7]:
import statsmodels.formula.api as smf
mod = smf.ols(formula='prate ~ mrate', data=df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                  prate   R-squared:                       0.075
Model:                            OLS   Adj. R-squared:                  0.074
Method:                 Least Squares   F-statistic:                     123.7
Date:                Fri, 13 Jun 2025   Prob (F-statistic):           1.10e-27
Time:                        12:43:59   Log-Likelihood:                -6437.0
No. Observations:                1534   AIC:                         1.288e+04
Df Residuals:                    1532   BIC:                         1.289e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     83.0755      0.563    147.484      0.0

<p id="average"><b>1.C</b></p>
Interpret the intercept in your equation. Interpret the coefficient on mrate.
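
One way to read the two coefficients, sketched with numbers this notebook prints. The intercept 83.0755 comes from the summary table in 1.B. The slope row is cut off in that table, so it is backed out here from the fitted value at $mrate = 3.5$ reported in 1.D, on the assumption that both numbers come from the same fitted model. The variable names are illustrative.

```python
# Intercept as printed in the regression summary (1.B).
intercept = 83.0755
# Fitted value at mrate = 3.5 as printed in 1.D.
fitted_at_3_5 = 103.589233

# A fitted line satisfies fitted = intercept + slope * mrate, so the slope
# can be recovered from a single fitted value (assumption: both printed
# numbers belong to the same regression).
slope = (fitted_at_3_5 - intercept) / 3.5

print(f"Implied slope on mrate: {slope:.3f}")
print(f"Predicted prate with no match (mrate = 0): {intercept:.1f} percent")
print(f"Change in predicted prate per 1.0 increase in mrate: {slope:.2f} points")
```

Read this way, the intercept is the predicted participation rate for a plan with no employer match, about 83.1 percent, and a one-unit increase in mrate goes with a rise of roughly 5.86 percentage points in predicted participation.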

<p id="average"><b>1.D</b></p>
Find the predicted *prate* when $mrate = 3.5$. Is this a reasonable prediction?
Explain what is happening here.

In [8]:
res.predict(df.loc[df['mrate']==3.5])

730    103.589233
dtype: float64
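
The prediction above exceeds 100, which a participation percentage cannot do. A rough check, using only the printed intercept and this prediction, shows where the fitted line leaves the feasible range. Treat it as a sketch: the slope is inferred from the two printed numbers rather than read from the (truncated) summary table.

```python
intercept = 83.0755          # printed in the 1.B summary
fitted_at_3_5 = 103.589233   # printed by res.predict above
slope = (fitted_at_3_5 - intercept) / 3.5

# Match rate at which the fitted line reaches the 100 percent ceiling.
mrate_at_100 = (100 - intercept) / slope
print(f"Fitted prate reaches 100 at mrate = {mrate_at_100:.2f}")

# A linear fit does not know that prate is bounded. Clipping is shown only
# to make the bound explicit; it is not part of OLS.
clipped = min(fitted_at_3_5, 100.0)
print(f"Prediction clipped to the feasible range: {clipped:.1f}")
```

A match rate of 3.5 is well above the sample mean of 0.732, so the line is being extrapolated, and nothing in a simple linear regression keeps fitted values inside [0, 100].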

<p id="average"><b>1.E</b></p>
How much of the variation in prate is explained by mrate? Is this a lot in your opinion?

In [9]:
print(f'The share of the variation in prate explained by mrate (R-squared) is {res.rsquared.round(3)}')

The share of the variation in prate explained by mrate (R-squared) is 0.075
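
In a simple regression the R-squared and the F-statistic carry the same information, $R^2 = F / (F + n - 2)$, so the printed summary can be cross-checked by hand. A small sketch using the values from the summary table in 1.B:

```python
f_stat = 123.7    # F-statistic from the summary table
df_resid = 1532   # Df Residuals = n - 2

# With one regressor, F = R^2 / ((1 - R^2) / (n - 2)); solving for R^2:
r_squared = f_stat / (f_stat + df_resid)

print(f"R-squared implied by F: {r_squared:.3f}")
print(f"Share of variation left unexplained: {1 - r_squared:.1%}")
```

About 7.5 percent of the variation in prate is explained by mrate, so more than 92 percent is left to factors outside the model.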


2. The data set in CEOSAL2.RAW contains information on chief executive officers for U.S. corporations. The variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as company CEO.
    1. Find the average salary and the average tenure in the sample.
    2. How many CEOs are in their first year as CEO (that is, $ceoten = 0$)? What is the longest tenure as a CEO?
    3. Estimate the simple regression model $$\log(salary) =  \beta_0 + \beta_1 ceoten + u,$$ and report your results in the usual form. What is the (approximate) predicted percentage increase in salary given one more year as a CEO?


In [10]:
wooldridge.data('ceosal2', description=True )

name of dataset: ceosal2
no of variables: 15
no of observations: 177

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| salary   | 1990 compensation, $1000s      |
| age      | in years                       |
| college  | =1 if attended college         |
| grad     | =1 if attended graduate school |
| comten   | years with company             |
| ceoten   | years as ceo with company      |
| sales    | 1990 firm sales, millions      |
| profits  | 1990 profits, millions         |
| mktval   | market value, end 1990, mills. |
| lsalary  | log(salary)                    |
| lsales   | log(sales)                     |
| lmktval  | log(mktval)                    |
| comtensq | comten^2                       |
| ceotensq | ceoten^2                       |
| profmarg | profits as % of sales          |
+----------+--------------------------------+

See CEOSAL1.RAW


In [11]:
df = wooldridge.data('ceosal2')
df

Unnamed: 0,salary,age,college,grad,comten,ceoten,sales,profits,mktval,lsalary,lsales,lmktval,comtensq,ceotensq,profmarg
0,1161,49,1,1,9,2,6200.0,966,23200.0,7.057037,8.732305,10.051908,81,4,15.580646
1,600,43,1,1,10,10,283.0,48,1100.0,6.396930,5.645447,7.003066,100,100,16.961130
2,379,51,1,1,9,3,169.0,40,1100.0,5.937536,5.129899,7.003066,81,9,23.668638
3,651,55,1,0,22,22,1100.0,-54,1000.0,6.478509,7.003066,6.907755,484,484,-4.909091
4,497,44,1,1,8,6,351.0,28,387.0,6.208590,5.860786,5.958425,64,36,7.977208
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
172,264,63,1,0,42,3,334.0,43,480.0,5.575949,5.811141,6.173786,1764,9,12.874251
173,185,58,1,0,39,1,766.0,49,560.0,5.220356,6.641182,6.327937,1521,1,6.396867
174,387,71,1,1,32,13,432.0,28,477.0,5.958425,6.068426,6.167517,1024,169,6.481482
175,2220,63,1,1,18,18,277.0,-80,540.0,7.705263,5.624018,6.291569,324,324,-28.880867


<p id="average"><b>2.A</b></p>
Find the average salary and the average tenure in the sample.

In [12]:
print(f'The average salary is {df.salary.mean().round(3)}')

The average salary is 865.864


In [13]:
print(f'The average tenure is {df.ceoten.mean().round(3)}')

The average tenure is 7.955


<p id="average"><b>2.B</b></p>
How many CEOs are in their first year as CEO (that is, $ceoten=0$)? What is the longest tenure as a CEO?

In [14]:
print(f"There are {len(df.loc[df['ceoten']==0])} CEOs in their first year as CEO ")

There are 5 CEOs in their first year as CEO 


In [15]:
print(f"The longest tenure as a CEO is {df['ceoten'].max()}")

The longest tenure as a CEO is 37


<p id="average"><b>2.C</b></p>
Estimate the simple regression model $$\log(salary) =  \beta_0 + \beta_1 ceoten + u,$$ and report your results in the usual form. What is the (approximate) predicted percentage increase in salary given one more year as a CEO?

In [16]:
import statsmodels.formula.api as smf
mod = smf.ols(formula='np.log(salary)~ceoten', data=df)
res = mod.fit()

In [17]:
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:         np.log(salary)   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.334
Date:                Fri, 13 Jun 2025   Prob (F-statistic):              0.128
Time:                        12:43:59   Log-Likelihood:                -160.84
No. Observations:                 177   AIC:                             325.7
Df Residuals:                     175   BIC:                             332.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      6.5055      0.068     95.682      0.0
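
For a log-level model such as this one, the coefficient on ceoten is read as a proportional change: one more year as CEO changes predicted salary by roughly $100 \cdot \hat{\beta}_1$ percent. The slope row is cut off in the output above, so the sketch below uses a hypothetical placeholder `b1 = 0.01` purely to show the arithmetic; in the notebook, `res.params['ceoten']` would supply the actual estimate.

```python
import math

# Hypothetical slope for illustration, NOT the estimate from the regression above.
b1 = 0.01

approx_pct = 100 * b1                  # usual log-level approximation
exact_pct = 100 * (math.exp(b1) - 1)   # exact change implied by the log model

print(f"Approximate % change per extra year: {approx_pct:.3f}")
print(f"Exact % change per extra year:       {exact_pct:.3f}")
```

The two figures agree closely when the coefficient is small, which is why the approximation is the one usually reported.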

# C3, C4, C5, C6, C7- HW

C8

8. [To complete](#complete) this exercise you need a software package that allows you to generate data from the uniform and normal distributions.
    1. Start by generating 500 observations $x_i$ – the explanatory variable – from the [uniform](#uniform) distribution with range [0,10]. (Most statistical packages have a command for the Uniform[0,1] distribution; just multiply those observations by 10.) What are the sample mean and sample standard deviation of the $x_i$?
    2. Randomly generate $500$ [errors](#errors), $u_i$, from the Normal[0,36] distribution. (If you generate a Normal[0,1], as is commonly available, simply multiply the outcomes by six.) Is the sample average of the $u_i$ exactly zero? Why or why not? What is the sample standard deviation of the $u_i$?
    3. Now generate the $y_i$ as $$y_i = 1+2x_i + u_i = \beta_0 + \beta_1x_i + u_i;$$ that is, the population intercept is one and the population slope is two. Use the data to run the [regression](#regression) of $y_i$ on $x_i$. What are your estimates of the intercept and slope? Are they equal to the population values in the above equation? Explain.
    4. Obtain the OLS residuals, $\hat{u}_i$, and verify that equation (2.60) holds (subject to rounding error).
    5. Compute the same quantities in equation (2.60) but use the errors $u_i$ in place of the residuals. Now what do you conclude?
    6. Repeat parts (A), (B), and (C) with a new sample of data, starting with generating the $x_i$. Now what do you obtain for $\hat{\beta}_0$ and $\hat{\beta}_1$? Why are these different from what you obtained in part (C)?

<p id="complete"><b>8.</b></p>
8. To complete this exercise you need a software package that allows you to generate data from the uniform and normal distributions.

In [27]:
import random

<p id="uniform"><b>8.A</b></p>

Start by generating 500 observations $x_i$ – the explanatory variable – from the uniform distribution with range [0,10]. (Most statistical packages have a command for the Uniform[0,1] distribution; just multiply those observations by 10.) What are the sample mean and sample standard deviation of the $x_i$?

In [35]:
X = np.array([random.uniform(0, 10) for _ in range(500)])
print(f'The 500 observations are \n {X}')

The 500 observations are 
 [6.15115736 9.70598963 7.35992974 1.45512676 1.72563485 8.54078226
 6.51626368 2.25601864 6.36900215 4.54682281 8.76385123 8.93903509
 9.49804097 4.25413673 4.92101739 2.1003527  0.78387166 9.38111996
 4.02665203 5.27003356 4.13880847 3.50746339 9.86465868 7.26905983
 4.24860384 3.24184541 7.85679311 3.09168468 8.00640514 6.34326397
 1.54967064 1.88755858 3.42522173 7.44956521 6.30509615 2.56720047
 4.73319681 6.37810795 6.01264135 1.7043559  5.49062266 4.64991656
 0.03269739 0.45139701 3.14180668 4.49749872 2.25812736 1.81097221
 2.96245582 5.97299105 2.66880997 4.74408538 7.53938996 3.99665418
 8.29395289 5.81918272 8.75141026 2.71936381 2.02149888 3.94612216
 4.6419907  2.94685029 5.29185388 5.0727382  4.40247286 3.9430885
 2.03300651 9.75871133 6.6517058  7.51154704 0.02941241 7.86386667
 8.07495463 2.94363897 6.75366093 5.35759945 1.43249264 4.93667023
 4.16871657 0.04446921 7.26344814 5.2593858  4.7980903  4.77329944
 6.68177548 1.15719899 7.65975256 9.85965

In [37]:
print(f"Sample mean of X is {X.mean().round(3)}")

Sample mean of X is 5.023


In [39]:
# X.std() divides by n; pass ddof=1 for the n - 1 sample formula
print(f"Standard deviation of X is {X.std().round(3)}")

Standard deviation of X is 2.846
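
For reference, a Uniform[0,10] variable has population mean 5 and standard deviation $10/\sqrt{12} \approx 2.887$, so the sample values above are close to what theory predicts. The sketch below also contrasts the two standard deviation conventions. It uses the standard library rather than NumPy, and the seed is an arbitrary choice made only for reproducibility.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed, only for reproducibility
x = [random.uniform(0, 10) for _ in range(500)]

# Population moments of Uniform[a, b]: mean (a + b) / 2, sd (b - a) / sqrt(12).
pop_mean = (0 + 10) / 2
pop_sd = (10 - 0) / math.sqrt(12)

print(f"Population mean {pop_mean:.3f}, sample mean {statistics.fmean(x):.3f}")
print(f"Population sd   {pop_sd:.3f}")
print(f"sd dividing by n     (like X.std()):       {statistics.pstdev(x):.3f}")
print(f"sd dividing by n - 1 (like X.std(ddof=1)): {statistics.stdev(x):.3f}")
```

With 500 observations the two conventions differ only in the third decimal place, but the $n - 1$ version is the one the textbook calls the sample standard deviation.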


<p id="errors"><b>8.B</b></p>
2. Randomly generate $500$ errors, $u_i$, from the Normal[0,36] distribution. (If you generate a Normal[0,1], as is commonly available, simply multiply the outcomes by six.) Is the sample average of the $u_i$ exactly zero? Why or why not? What is the sample standard deviation of the $u_i$?

In [42]:
mean = 0  # mean of the normal distribution
sigma = 6  # standard deviation of the normal distribution (variance 36)
U = np.array([random.normalvariate(mean, sigma) for _ in range(500)])
print(f'The 500 errors are \n {U}')

The 500 errors are 
 [ 3.32547504e+00  1.04955608e+01 -4.47913322e-01  1.36926155e+00
  3.21553892e+00  3.67437461e+00 -1.81638338e+00 -7.01322476e+00
 -3.87294578e+00  8.35494254e+00  5.97482366e+00 -3.56620164e-01
  6.93025594e+00  7.74082338e+00 -1.00781729e+01 -5.24183269e+00
 -2.60243657e-01  3.40763921e+00  5.14308897e+00 -2.44444163e+00
  9.08887420e+00 -1.00312112e+00 -9.07230273e+00  2.38297243e+00
 -3.80514994e+00  6.06125600e-01 -5.00897465e+00 -2.31120902e+00
  1.35793133e+01 -1.02384763e+01  6.12549068e+00 -4.56172165e+00
  3.80300888e+00 -2.66438834e+00 -1.07541627e+01 -8.38922804e+00
 -7.03635337e+00  1.80598407e+00  3.18408448e+00  2.33355547e+00
  2.67589011e+00  3.10435751e-01  3.86145287e+00 -2.32593928e+00
  9.51243157e-01 -3.72683480e-01 -1.22984110e+01 -1.19916466e+01
  1.10751782e+01  4.43247278e+00 -2.50075091e+00 -4.32146741e-01
 -4.07206512e+00  3.37819142e+00 -7.11986630e-01  6.51481447e-02
  5.68230051e+00 -7.49101617e+00  6.55502225e+00 -1.08308659e+00
  7.15036

In [44]:
print(f"Sample mean of U is {U.mean().round(3)}")

Sample mean of U is 0.082


In [45]:
print(f"Standard deviation of U is {U.std().round(3)}")

Standard deviation of U is 5.875
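
The sample average of the $u_i$ is not exactly zero because it is itself a random variable: with 500 independent draws of standard deviation 6, its standard error is $6/\sqrt{500}$. A quick calculation, using the 0.082 printed above:

```python
import math

sigma = 6            # population standard deviation of the errors
n = 500              # number of draws
sample_mean = 0.082  # value printed above

std_error = sigma / math.sqrt(n)
z = sample_mean / std_error

print(f"Standard error of the sample mean: {std_error:.3f}")
print(f"Observed mean in standard-error units: {z:.2f}")
```

An average this close to zero, well under one standard error away, is what sampling variation alone would produce.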


<p id="regression"><b>8.C</b></p>
Now generate the $y_i$ as $$y_i = 1+2x_i + u_i = \beta_0 + \beta_1x_i + u_i;$$ that is, the population intercept is one and the population slope is two. Use the data to run the regression of $y_i$ on $x_i$. What are your estimates of the intercept and slope? Are they equal to the population values in the above equation? Explain.
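
A minimal sketch of parts C through E, assuming equation (2.60) refers to the two OLS first-order conditions $\sum \hat{u}_i = 0$ and $\sum x_i \hat{u}_i = 0$. It uses only the standard library and the textbook formulas for the simple regression slope and intercept rather than `statsmodels`; the seed and variable names are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # arbitrary seed so the run is reproducible
n = 500
x = [random.uniform(0, 10) for _ in range(n)]
u = [random.normalvariate(0, 6) for _ in range(n)]   # sd 6, variance 36
y = [1 + 2 * xi + ui for xi, ui in zip(x, u)]        # beta_0 = 1, beta_1 = 2

# OLS for a simple regression:
#   slope     = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   intercept = ybar - slope * xbar
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1_hat = sxy / sxx
b0_hat = y_bar - b1_hat * x_bar
print(f"intercept estimate {b0_hat:.3f}, slope estimate {b1_hat:.3f}")

# The residuals satisfy the first-order conditions up to rounding error ...
resid = [yi - b0_hat - b1_hat * xi for xi, yi in zip(x, y)]
sum_resid = sum(resid)
sum_x_resid = sum(xi * ri for xi, ri in zip(x, resid))
print(f"sum of residuals     {sum_resid:.2e}")
print(f"sum of x * residuals {sum_x_resid:.2e}")

# ... while the true errors generally do not.
print(f"sum of errors        {sum(u):.3f}")
print(f"sum of x * errors    {sum(xi * ui for xi, ui in zip(x, u)):.3f}")
```

The estimates differ from 1 and 2 because they are computed from one random sample; changing the seed, as part F asks, gives a different pair of estimates scattered around the population values.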