# Advances in Machine Learning with Big Data

### (part 1 of 2) 
### Trinity 2020 Weeks 1 - 4
### Dr Jeremy Large
#### jeremy.large@economics.ox.ac.uk


&#169; Jeremy Large ; shared under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)

## 2. Being an econometrician *and* a data scientist

## Contents Weeks 1-4:

1. Introducing this course's dataset

1. **Being an econometrician _and_ a data scientist**

1. Data abundance and 'jaggedness' -> regularization and the problem of overfit

1. Regularization through resampling methods (bootstrap etc.)

1. Regularization through predictor/feature selection (Lasso etc.)

1. Moving from linear regression to the perceptron

1. Moving from linear regression to the random forest (and similar)

In [1]:
%load_ext autoreload
%autoreload 2
%pylab inline

import sys, os
import logging
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.INFO)

# point at library; I need some lessons on doing good PYTHONPATHs:
REPO_DIR = os.path.dirname(os.getcwd())

UCI_LIB = os.path.join(REPO_DIR, 'lib')
UCI_DATA = os.path.join(REPO_DIR, 'data') 

sys.path.append(UCI_LIB)

UCI_DATA_FILE = os.path.join(UCI_DATA, 'raw.csv') 

from uci_retail_data import stock_codes, uci_files 

Populating the interactive namespace from numpy and matplotlib


### Pull in and prepare our data

In [2]:
if os.path.exists(UCI_DATA_FILE):
    df = uci_files.load_uci_file(UCI_DATA_FILE, uci_files.SHEET_NAME)
else:
    df = uci_files.load_uci_file(uci_files.REMOTE_FILE, uci_files.SHEET_NAME)
    df.to_csv(UCI_DATA_FILE)
    logging.info('Saving a copy to ' + UCI_DATA_FILE)

2020-03-31 16:10:54,319 INFO:Loading C:\Users\Jeremy Large\Documents\professional\Oxford Work\ML Lectures\ox-sbs-ml-bd\data\raw.csv , sheet Year 2009-2010
2020-03-31 16:10:59,143 INFO:Loaded C:\Users\Jeremy Large\Documents\professional\Oxford Work\ML Lectures\ox-sbs-ml-bd\data\raw.csv , sheet number one, obviously


Clean data:

In [3]:
# Here, I call the irrelevant lines 'invalids':
invalids = stock_codes.invalid_series(df)

Aggregate into invoices:

In [4]:
invoices = stock_codes.invoice_df(df, invalid_series=invalids)

In [10]:
invoices.head(2)

| Invoice | customer | codes_in_invoice | items_in_invoice | invoice_spend | hour | month | words | country | words_per_item |
|---|---|---|---|---|---|---|---|---|---|
| 489434 | 13085.0 | 8 | 166 | 505.3 | 7 | 200912 | {BOX, FANCY, MUG, SINGLE, 20, 7", 15CM, STRAWB... | United Kingdom | 3.625 |
| 489435 | 13085.0 | 4 | 60 | 145.8 | 7 | 200912 | {CHASING, CUTLERY, WITH, MEASURING, CAT, DOG, ... | United Kingdom | 4.0 |


In [9]:
invoices.tail(2)

| Invoice | customer | codes_in_invoice | items_in_invoice | invoice_spend | hour | month | words | country | words_per_item |
|---|---|---|---|---|---|---|---|---|---|
| 538170 | 13969.0 | 25 | 133 | 317.59 | 19 | 201012 | {ORANGE, HOTTIE, MEASURING, PRINTED, HOT, SEWI... | United Kingdom | 2.92 |
| 538171 | 17530.0 | 65 | 194 | 300.64 | 20 | 201012 | {TIER, CAKE, GARDEN, SO, SET, SWEET, PLACE, FE... | United Kingdom | 2.2 |


In [6]:
invoices.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20577 entries, 489434 to 538171
Data columns (total 9 columns):
customer            18969 non-null float64
codes_in_invoice    20577 non-null int64
items_in_invoice    20577 non-null int64
invoice_spend       20577 non-null float64
hour                20577 non-null int64
month               20577 non-null int64
words               20577 non-null object
country             20577 non-null object
words_per_item      20577 non-null float64
dtypes: float64(3), int64(4), object(2)
memory usage: 1.6+ MB


### Set a prediction problem:

**Given the time, date, and complexity of an invoice, what's its expected spend?**

 * Let's attack this in a familiar way: linear regression

 * We'll use a signature library, described in this paper:

     *  *Scikit-learn: Machine Learning in Python*, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

In [11]:
# so, we:
import sklearn
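
To make the prediction problem concrete, here is a minimal sketch of how it might be cast as a feature matrix `X` and a target `y`. It uses only the four invoices displayed by `invoices.head(2)` and `invoices.tail(2)` above. The mapping of "time, date, and complexity" onto the columns `hour`, `month` and `words_per_item` is an assumption made for illustration, and the names `sample` and `feature_cols` are hypothetical.

```python
import pandas as pd

# The four invoices shown in the head/tail output above (a subset of columns)
sample = pd.DataFrame(
    {
        "codes_in_invoice": [8, 4, 25, 65],
        "invoice_spend": [505.3, 145.8, 317.59, 300.64],
        "hour": [7, 7, 19, 20],
        "month": [200912, 200912, 201012, 201012],
        "words_per_item": [3.625, 4.0, 2.92, 2.2],
    },
    index=pd.Index([489434, 489435, 538170, 538171], name="Invoice"),
)

# Assumed mapping: time -> hour, date -> month, complexity -> words_per_item
feature_cols = ["hour", "month", "words_per_item"]

X = sample[feature_cols]      # regressors, one row per invoice
y = sample["invoice_spend"]   # the quantity we want to predict

print(X.shape, y.shape)
```

On the full `invoices` frame the same two selection lines would apply unchanged; other columns, such as `codes_in_invoice`, could equally serve as a measure of complexity.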

#### Reminder of linear regression:

We have an i.i.d. sequence of observations, $\{(y_i, x_i), i=0, 1, ...\}$ where we are interested in moments of the R.V. $y_i$, conditional on the multivariate R.V.  $x_i$ (of length, say, $p$). 

We postulate a linear relationship of the following form:

\begin{equation}
y_i = x_i ' \beta + \epsilon_i,
\end{equation}
where the i.i.d. sequence of random variables $\{\epsilon_i\}$ is independent of the regressors $\{x_i\}$.
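
As a minimal sketch of the model above, the following simulates data from $y_i = x_i ' \beta + \epsilon_i$ and recovers $\beta$ with scikit-learn. The sample size, the value of `beta`, and the normal distributions are all assumptions chosen for illustration; they are not taken from the course dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

n, p = 5000, 3
beta = np.array([1.5, -2.0, 0.5])   # hypothetical "true" coefficients

X = rng.normal(size=(n, p))         # regressors x_i, one row per observation
eps = rng.normal(size=n)            # errors, drawn independently of X
y = X @ beta + eps                  # the postulated linear relationship

# fit_intercept=False matches the equation above, which has no constant term
ols = LinearRegression(fit_intercept=False).fit(X, y)
beta_hat = ols.coef_

# Cross-check against the normal equations: (X'X) b = X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)
print(beta_normal)
```

The two estimates should agree up to numerical precision, and both should lie close to `beta` because the errors here are independent of the regressors by construction.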