# Advances in Machine Learning with Big Data

### (part 1 of 2) 
### Trinity 2020 Weeks 1 - 4
### Dr Jeremy Large
#### jeremy.large@economics.ox.ac.uk


© Jeremy Large; shared under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)

## 2. Being an econometrician *and* a data scientist

## Contents Weeks 1-4:

1. Introducing this course's dataset

1. **Being an econometrician _and_ a data scientist**

1. Data abundance and 'jaggedness' -> regularization and the problem of overfit

1. Regularization through resampling methods (bootstrap etc.)

1. Regularization through predictor/feature selection (Lasso etc.)

1. Moving from linear regression to the perceptron

1. Moving from linear regression to the random forest (and similar)

In [1]:
%load_ext autoreload
%autoreload 2
%pylab inline

import sys, os
import logging
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.INFO)

# point at library; I need some lessons on doing good PYTHONPATHs:
REPO_DIR = os.path.dirname(os.getcwd())

UCI_LIB = os.path.join(REPO_DIR, 'lib')
UCI_DATA = os.path.join(REPO_DIR, 'data') 

sys.path.append(UCI_LIB)

UCI_DATA_FILE = os.path.join(UCI_DATA, 'raw.csv') 

from uci_retail_data import stock_codes, uci_files 

Populating the interactive namespace from numpy and matplotlib


### Pull in and prepare our data

In [2]:
if os.path.exists(UCI_DATA_FILE):
    df = uci_files.load_uci_file(UCI_DATA_FILE, uci_files.SHEET_NAME)
else:
    df = uci_files.load_uci_file(uci_files.REMOTE_FILE, uci_files.SHEET_NAME)
    df.to_csv(UCI_DATA_FILE)
    logging.info('Saving a copy to ' + UCI_DATA_FILE)

2020-04-06 14:51:53,410 INFO:Loading C:\Users\Jeremy Large\Documents\work\Oxford\SBS\MLBD\ox-sbs-ml-bd\data\raw.csv , sheet Year 2009-2010
2020-04-06 14:51:57,490 INFO:Loaded C:\Users\Jeremy Large\Documents\work\Oxford\SBS\MLBD\ox-sbs-ml-bd\data\raw.csv , sheet number one, obviously


Clean data:

In [3]:
# Here, I call the irrelevant lines 'invalids':
invalids = stock_codes.invalid_series(df)

Aggregate into invoices:

In [4]:
invoices = stock_codes.invoice_df(df, invalid_series=invalids)
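
The internals of `stock_codes.invoice_df` are not shown in these slides. As a rough illustration of what aggregating line items into invoices can look like, here is a minimal sketch using a plain pandas `groupby` on a toy frame. The column names (`Invoice`, `StockCode`, `Quantity`, `Price`) and the aggregation rules are assumptions made for this example, not the course library's implementation.

```python
import pandas as pd

# Toy line-item data; the column names are assumed for illustration only.
lines = pd.DataFrame({
    'Invoice':   [1, 1, 1, 2, 2],
    'StockCode': ['A', 'B', 'A', 'C', 'D'],
    'Quantity':  [2, 1, 3, 10, 5],
    'Price':     [1.50, 4.00, 1.50, 0.20, 2.00],
})

# Spend on each line is quantity times unit price.
lines['line_spend'] = lines['Quantity'] * lines['Price']

# One row per invoice: distinct stock codes, total items, total spend.
toy_invoices = lines.groupby('Invoice').agg(
    codes_in_invoice=('StockCode', 'nunique'),
    items_in_invoice=('Quantity', 'sum'),
    invoice_spend=('line_spend', 'sum'),
)
print(toy_invoices)
```

The result is indexed by invoice number, which is the same shape as the `invoices` frame used in the rest of this section.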

In [5]:
invoices.head(2)

| Invoice | customer | codes_in_invoice | items_in_invoice | invoice_spend | hour | month | words | country | words_per_item |
|---|---|---|---|---|---|---|---|---|---|
| 489434 | 13085.0 | 8 | 166 | 505.3 | 7 | 200912 | {STRAWBERRY, DOUGHNUT, 7", SAVE, FRAME, SIZE, ... | United Kingdom | 3.625 |
| 489435 | 13085.0 | 4 | 60 | 145.8 | 7 | 200912 | {BOWL, DOG, LUNCHBOX, CHASING, ,, BALL, DESIGN... | United Kingdom | 4.0 |


In [6]:
invoices.tail(2)

| Invoice | customer | codes_in_invoice | items_in_invoice | invoice_spend | hour | month | words | country | words_per_item |
|---|---|---|---|---|---|---|---|---|---|
| 538170 | 13969.0 | 25 | 133 | 317.59 | 19 | 201012 | {CARDS, SET, TRAY, LARGE, JAM, STOPMETAL, 10, ... | United Kingdom | 2.92 |
| 538171 | 17530.0 | 65 | 194 | 300.64 | 20 | 201012 | {AM, OVER, BUTTER, SETTING, STORAGE, SPOT, NOT... | United Kingdom | 2.2 |


In [7]:
invoices.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20577 entries, 489434 to 538171
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer          18969 non-null  float64
 1   codes_in_invoice  20577 non-null  int64  
 2   items_in_invoice  20577 non-null  int64  
 3   invoice_spend     20577 non-null  float64
 4   hour              20577 non-null  int64  
 5   month             20577 non-null  int64  
 6   words             20577 non-null  object 
 7   country           20577 non-null  object 
 8   words_per_item    20577 non-null  float64
dtypes: float64(3), int64(4), object(2)
memory usage: 1.6+ MB


### Set a prediction problem:

**Given the time, date, and complexity of an invoice, what's its expected spend?**

First, we'll attack this in a simple-minded way: *linear regression*. However, we'll take time to compare two suitable `python` libraries for this.

### 1. [`StatsModels`](https://www.statsmodels.org/stable/index.html)

* package for established statistics

* has some of the feel of R

* funded at [Google Summer of Code (GSOC) 2009-2017](https://summerofcode.withgoogle.com/) and by hedge fund [AQR](https://www.aqr.com)      

### 2. [`scikit-learn`](https://scikit-learn.org/stable/index.html)

* package for machine learning

* *Scikit-learn: Machine Learning in Python*, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

They both do linear regression.

#### Let's import statsmodels / scikit-learn and packages that they use:

In [39]:
import numpy as np    # fast handling of matrices of real numbers, and similar

import pandas as pd     # great tool for wielding rectangular datasets



import statsmodels    # econometrics-centric statistical package

import sklearn   # scikit-learn - machine-learning-centric statistical package


#### Reminder of linear regression:

We have an i.i.d. sequence of observations, $\{(y_i, x_i), i=0, 1, ...\}$ where we are interested in moments of the R.V. $y_i$, conditional on the multivariate R.V.  $x_i$ (of length, say, $p$). 

We postulate a linear relationship of the following form:

\begin{equation}
y_i = x_i ' \beta + \epsilon_i,
\end{equation}
where $\beta$ is a constant vector of length $p$, and the i.i.d. sequence of random variables $\{\epsilon_i\}$ is independent of the regressors $\{x_i\}$.
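
As a hedged illustration of this setup, the sketch below simulates data from the linear model and recovers $\beta$ by least squares, i.e. $\hat\beta = (X'X)^{-1}X'y$ with the $x_i'$ stacked as the rows of $X$. The sample size, the 'true' coefficients and the normal noise are arbitrary assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

n, p = 5000, 3                           # assumed sample size and number of regressors
beta_true = np.array([1.0, -2.0, 0.5])   # assumed 'true' coefficients

X = rng.normal(size=(n, p))      # regressors x_i, stacked as rows
eps = rng.normal(size=n)         # errors, drawn independently of X
y = X @ beta_true + eps          # y_i = x_i' beta + eps_i

# Least-squares estimate; lstsq avoids explicitly inverting X'X.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

With a sample this large the estimate should land close to `beta_true`; shrinking `n` makes the sampling error visible.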

In [35]:
# we must specifically import the bit of statsmodels that we need:
import statsmodels.formula.api as smf

In [38]:
# specify the model; it is not fitted until fit() is called
formula_string = 'invoice_spend ~ codes_in_invoice + items_in_invoice + hour + month + words_per_item'

lm1 = smf.ols(formula=formula_string, data=invoices)

The next step is a common one: calling `fit()`:

In [38]:
# fit() returns a results object; keep it, because summary() lives on the results
lm1 = lm1.fit()

In [38]:
lm1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | invoice_spend | R-squared: | 0.328 |
| Model: | OLS | Adj. R-squared: | 0.328 |
| Method: | Least Squares | F-statistic: | 2008.0 |
| Date: | Mon, 06 Apr 2020 | Prob (F-statistic): | 0.0 |
| Time: | 15:14:43 | Log-Likelihood: | -168730.0 |
| No. Observations: | 20577 | AIC: | 337500.0 |
| Df Residuals: | 20571 | BIC: | 337500.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.444e+04 | 4.74e+04 | 0.304 | 0.761 | -7.85e+04 | 1.07e+05 |
| codes_in_invoice | 7.6562 | 0.176 | 43.403 | 0.000 | 7.310 | 8.002 |
| items_in_invoice | 0.3613 | 0.004 | 80.823 | 0.000 | 0.353 | 0.370 |
| hour | -13.3661 | 2.568 | -5.206 | 0.000 | -18.399 | -8.333 |
| month | -0.0694 | 0.236 | -0.294 | 0.769 | -0.532 | 0.393 |
| words_per_item | -45.4932 | 8.314 | -5.472 | 0.000 | -61.790 | -29.197 |

| | | | |
|---|---|---|---|
| Omnibus: | 44696.804 | Durbin-Watson: | 1.94 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 576492998.889 |
| Skew: | 19.212 | Prob(JB): | 0.0 |
| Kurtosis: | 822.095 | Cond. No. | 1550000000.0 |
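
To make the coefficient table less of a black box, here is a minimal sketch of how the `coef`, `std err` and `t` columns arise under the 'nonrobust' (homoskedastic) covariance. It uses simulated data with assumed coefficients, not the invoices, so its numbers will not match the table above.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 2000
# Design matrix with an intercept column plus two simulated regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([3.0, 1.5, 0.0])    # assumed; the last regressor is irrelevant
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y             # the 'coef' column

resid = y - X @ beta_hat
df_resid = n - X.shape[1]                # 'Df Residuals' in the summary
sigma2_hat = resid @ resid / df_resid    # unbiased estimate of the error variance

std_err = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # the 'std err' column
t_stats = beta_hat / std_err                       # t = coef / std err

for name, b, se, t in zip(['Intercept', 'x1', 'x2'], beta_hat, std_err, t_stats):
    print(f'{name:>9}  coef={b:8.4f}  se={se:6.4f}  t={t:8.2f}')
```

The same bookkeeping applies to the invoice regression: with 20577 observations and 6 estimated parameters (5 regressors plus the intercept), the residual degrees of freedom are 20571.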


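The 'condition number' relates to $X'X$: it indicates how stably we can expect to invert that matrix. A value as large as the one reported here (about $1.55 \times 10^9$) points to badly scaled or nearly collinear regressors.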
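
A small sketch of why a regressor like `month`, coded as a six-digit number such as 200912, can inflate the condition number. The data are simulated, and the definition used (ratio of the largest to the smallest singular value of $X$, equivalently the square root of the ratio of extreme eigenvalues of $X'X$) is assumed here to correspond to what the summary reports.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

hour = rng.integers(7, 21, size=n).astype(float)
# Months coded as YYYYMM numbers: large level, tiny relative variation.
month = rng.choice([200912.0, 201001.0, 201006.0, 201012.0], size=n)

def cond_number(X):
    """Ratio of largest to smallest singular value of X.

    The singular values of X are the square roots of the eigenvalues of X'X.
    """
    s = np.linalg.svd(X, compute_uv=False)   # sorted in descending order
    return s[0] / s[-1]

# Intercept, hour and month on their raw scales.
X_raw = np.column_stack([np.ones(n), hour, month])

# Same regressors, but centred and scaled to unit variance.
Z = np.column_stack([hour, month])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
X_std = np.column_stack([np.ones(n), Z])

print(f'raw:          {cond_number(X_raw):.3g}')
print(f'standardized: {cond_number(X_std):.3g}')
```

On the raw scale the `month` column is almost a multiple of the intercept column, which is what drives the condition number up; after standardizing, the columns are close to orthogonal.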

#### Comments

* Thorough implementation of linear regression, including
    * t-stats & p-values
    * significance tests
    * specification tests