In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

### Generate X: an array of random observations from a standard normal distribution. The array consists of 10000 columns each with 20 rows.

### Also generate y: a list of 20 standard normal observations. 

Notes: 
* Any column of X should have no relationship with y: they are both generated randomly!
* The ```np.random.seed(0)``` command simply seeds the random number generator so that we all get the same numbers.

In [2]:
np.random.seed(0)
X=np.random.randn(20,10000)
y=np.random.randn(20)

In [3]:
# Confirm the shapes: X should be 20 x 10000 and y should have 20 entries
print(X.shape,y.shape)

(20, 10000) (20,)


In [4]:
# A single column of X has the same length as y
X[:,0].shape

(20,)

### Now choose the three columns of X most correlated with y:

In [5]:
corr=[]

In [6]:
# Pearson correlation between each column of X and y
for i in range(X.shape[1]):
    corr.append(np.corrcoef(X[:, i], y)[0, 1])

In [7]:
# Indices of the three largest (most positive) correlations, in ascending order
np.array(corr).argsort()[-3:]

array([3681, 1697, 9249])
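
The loop above can also be written without an explicit Python loop. The following is a minimal sketch, assuming the same seed and data as in the cells above. The names `corr_vec`, `top3`, and `top3_abs` are illustrative and not part of the original notebook. It also shows an alternative reading of "most correlated" that ranks by absolute value, so that strong negative correlations would count as well.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(20, 10000)
y = np.random.randn(20)

# Center every column of X and center y, then apply the Pearson formula
# to all 10000 columns at once.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr_vec = Xc.T @ yc / np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())

# Three largest positive correlations, matching the argsort()[-3:] selection.
top3 = corr_vec.argsort()[-3:]

# Alternative (an assumption about intent): three largest in absolute value.
top3_abs = np.abs(corr_vec).argsort()[-3:]

print(top3, corr_vec[top3])
print(top3_abs, corr_vec[top3_abs])
```

The vectorized result should agree with `np.corrcoef` column by column up to floating-point rounding.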

### Put those 3 columns, along with y, in a data frame...

In [8]:
df=pd.DataFrame(X[:,np.array(corr).argsort()[-3:]],columns=['Col1',"Col2","Col3"])

In [9]:
df['y']=y

In [10]:
df.head()

,Col1,Col2,Col3,y
0,1.520004,0.171244,-0.14693,0.03951
1,0.118086,0.643488,0.239815,0.338378
2,-0.455489,-1.229948,0.058524,-0.842183
3,0.786997,-0.255133,0.337896,-0.049632
4,-0.070668,-0.753288,-2.140309,-1.230245


### And build a regression model predicting y using the 3 columns:

In [11]:
reg = ols(formula='y~Col1+Col2+Col3', data=df)
reg_mod = reg.fit()
reg_mod.summary()

Dep. Variable:,y,R-squared:,0.801
Model:,OLS,Adj. R-squared:,0.764
Method:,Least Squares,F-statistic:,21.51
Date:,"Tue, 18 May 2021",Prob (F-statistic):,7.36e-06
Time:,00:43:15,Log-Likelihood:,-10.741
No. Observations:,20,AIC:,29.48
Df Residuals:,16,BIC:,33.47
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.1644,0.108,-1.520,0.148,-0.394,0.065
Col1,0.3145,0.098,3.208,0.005,0.107,0.522
Col2,0.3633,0.120,3.036,0.008,0.110,0.617
Col3,0.3288,0.143,2.305,0.035,0.026,0.631

Omnibus:,9.939,Durbin-Watson:,1.484
Prob(Omnibus):,0.007,Jarque-Bera (JB):,7.409
Skew:,-1.173,Prob(JB):,0.0246
Kurtosis:,4.84,Cond. No.,2.36


### As you can see, the model explains approximately 80% of the variation in y. Moreover, the p-values associated with each of the three predictor columns are less than 0.05, indicating statistical significance at the $\alpha=0.05$ level!

### Since the columns I used in the model were themselves just random numbers, the fact that I was able to use them to predict the random numbers y is amazing!

## Response

At the $\alpha=0.05$ level, a test of a single column chosen in advance has a $5\%$ chance of declaring a relationship with y when none exists. That false positive is a Type I error. Here, however, the three columns were not chosen in advance: they were picked out of 10,000 random columns precisely because they had the largest sample correlations with y. With that many candidates, roughly $5\%$ of them (about 500 columns) are expected to look significant purely by chance, and the top three are guaranteed to be among the most extreme. The high R-squared and the small p-values are therefore a product of the selection step, not evidence that the columns can actually predict y.
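
To gauge how often a purely random column looks significant on its own, the sketch below counts the columns of X whose correlation with y would pass a two-sided test at the 0.05 level. It assumes the same seed as the notebook and uses the table value 2.101 for the critical t with 18 degrees of freedom. The names `r_crit` and `n_significant` are illustrative.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(20, 10000)
y = np.random.randn(20)
n = X.shape[0]

# Sample correlation of every column of X with y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = Xc.T @ yc / np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())

# Two-sided 5% critical value of t with n - 2 = 18 degrees of freedom
# (taken from a t table; an assumption of this sketch).
t_crit = 2.101

# A correlation is significant at that level when |r| > t / sqrt(t^2 + n - 2).
r_crit = t_crit / np.sqrt(t_crit**2 + n - 2)

n_significant = int((np.abs(r) > r_crit).sum())
print(f"critical |r| = {r_crit:.3f}")
print(f"columns individually significant at 0.05: {n_significant} of {X.shape[1]}")
```

Since every column is unrelated to y, each of these "significant" columns is a false positive, and the count should land near 5% of 10,000.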

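A Type II error is the opposite mistake: failing to detect a relationship that really exists. Its probability is not fixed at $5\%$; it depends on the sample size and on the strength of the true relationship. Since y has no true relationship with any column of X, a Type II error cannot occur in this experiment, and every "significant" result here is a Type I error.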
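
As a rough illustration of a Type II error, the sketch below simulates data in which y really does depend on a predictor, then checks how often a two-sided test at the 0.05 level misses that relationship with 20 observations. The slope `beta = 0.5`, the number of trials, and the seed are hypothetical choices made only for this example and do not come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 5000
beta = 0.5  # hypothetical true slope, chosen only for illustration

# Each row is one simulated data set in which y truly depends on x.
x = rng.standard_normal((trials, n))
y = beta * x + rng.standard_normal((trials, n))

# Row-wise sample correlation between x and y.
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
r = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

# Two-sided 5% cutoff for n = 20 (t table value with 18 degrees of freedom).
t_crit = 2.101
r_crit = t_crit / np.sqrt(t_crit**2 + n - 2)

# Type II error: a real effect exists, but the test fails to reject.
type2_rate = float((np.abs(r) <= r_crit).mean())
print(f"estimated Type II error rate: {type2_rate:.2f}")
```

Changing `beta` or `n` changes the estimated rate, which is the sense in which the Type II error probability is not pinned to the significance level.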

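The p-values in the summary assume that the three predictors were fixed before looking at the data, so they do not account for the search over 10,000 columns. Selecting the columns by their correlation with y places them in the rejection region by construction. A fair check would correct for the number of columns screened, or evaluate the fitted model on new data, where the apparent predictive power should disappear.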
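
One way to check whether the fitted model has any real predictive power is to score it on data it has never seen. The sketch below refits the same three-column model with plain NumPy least squares, then evaluates it on fresh draws from the same random process. The hold-out size of 1000 rows and the helper `r_squared` are assumptions of this example, not part of the original notebook.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(20, 10000)
y = np.random.randn(20)

# Select the three columns with the largest sample correlation, as in the notebook.
corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
top3 = corr.argsort()[-3:]

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Ordinary least squares with an intercept, fitted on the original 20 rows.
A = np.column_stack([np.ones(len(y)), X[:, top3]])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r2_in = r_squared(y, A @ coef)

# Fresh, independent draws for the three predictors and for y
# (hold-out size of 1000 is an arbitrary choice for this sketch).
X_new = np.random.randn(1000, 3)
y_new = np.random.randn(1000)
A_new = np.column_stack([np.ones(len(y_new)), X_new])
r2_out = r_squared(y_new, A_new @ coef)

print(f"in-sample R^2:     {r2_in:.3f}")
print(f"out-of-sample R^2: {r2_out:.3f}")
```

The in-sample value is expected to reproduce the R-squared in the regression summary, while the out-of-sample value should be near or below zero, meaning the model does no better than predicting the mean of the new y.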