# Some things to read about machine learning

* Susan Athey, "The Impact of Machine Learning in Economics," https://www.gsb.stanford.edu/faculty-research/publications/impact-machine-learning-economics

* Marcos Lopez de Prado, "Beyond Econometrics: A Roadmap towards Financial Machine Learning," https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2670703

* Fieberg, Hesse, and Loy, "Machine Learning in Accounting Research," https://link.springer.com/chapter/10.1007/978-3-031-04063-4_6

# Pre-Processing for Machine Learning

### BUSI 520 - Python for Business Research
### Kerry Back, JGSB, Rice University

## Issues and methods

* Deal with outliers and skewness
* Scale features so that all are on the same scale
* Create dummy variables for categorical features
* Reduce dimension through principal components or other methods
* Expand dimension through polynomial features
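
The last two bullets can be sketched with scikit-learn's PCA and PolynomialFeatures. The data here are synthetic, and the names `X_demo`, `pca`, and `poly` are illustrative assumptions rather than part of the course material.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Three synthetic features; the third is almost a copy of the first
X_demo = rng.normal(size=(200, 3))
X_demo[:, 2] = X_demo[:, 0] + 0.01 * rng.normal(size=200)

# Dimension reduction: keep the two leading principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_demo)

# Dimension expansion: add squares and the cross product of two features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X_demo[:, :2])

print(X_reduced.shape, X_expanded.shape)
print("variance explained by two components:", pca.explained_variance_ratio_.sum())
```

Because the third feature is nearly redundant, two components retain almost all of the variance. Two features expand to five columns at degree 2: the two originals, their squares, and their product.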

### An alternative to logs for dealing with skewness

* For any sample $x$, there exists a monotone transformation $f$ such that $y = f(x)$ is (approximately) distributed as a standard normal sample.
* This is true more generally for continuous random variables $x$.
* Let $F$ denote the cdf of a random variable $x$ (taken to be continuous and strictly increasing, so that $F^{-1}$ exists) and $N$ the standard normal cdf.
* We want an increasing $f$ such that, for every $a$,
$$\begin{aligned}
\text{prob}(f(x) < a) = N(a) \quad&\Leftrightarrow\quad \text{prob}(x < f^{-1}(a)) = N(a) \\
&\Leftrightarrow\quad F(f^{-1}(a)) = N(a) \\
&\Leftrightarrow\quad f^{-1} = F^{-1} \circ N \\
&\Leftrightarrow\quad f = N^{-1} \circ F
\end{aligned}$$
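
The formula $f = N^{-1} \circ F$ can be tried by hand on a sample, using the empirical cdf in place of $F$. This is a minimal sketch on synthetic lognormal data. Dividing the ranks by $n + 1$ is one common convention, chosen here as an assumption so that the empirical cdf stays strictly between 0 and 1.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(0)
x = rng.lognormal(size=1000)  # a heavily right-skewed sample

# Empirical cdf at each sample point; n + 1 keeps values inside (0, 1)
F_hat = rankdata(x) / (len(x) + 1)

# Apply the inverse standard normal cdf: y = N^{-1}(F(x))
y = norm.ppf(F_hat)

print("mean of y:", y.mean())
print("std of y:", y.std())
```

The transformation is increasing, so the ordering of the sample is preserved while the skewness is removed.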


### Scikit-learn quantile transformer

    from sklearn.preprocessing import QuantileTransformer
    transformer = QuantileTransformer(output_distribution="normal")
    transformer.fit_transform(X_train)
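
A small sketch of the usual workflow, on synthetic data: fit the transformer on the training sample only, then apply the fitted transformer to the test sample. The names `X_train_demo`, `X_test_demo`, and `qt_demo` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X_train_demo = rng.lognormal(size=(2000, 1))
X_test_demo = rng.lognormal(size=(500, 1))

qt_demo = QuantileTransformer(output_distribution="normal", n_quantiles=1000)

# Learn the quantiles from the training data only
Z_train = qt_demo.fit_transform(X_train_demo)

# Reuse the training quantiles on the test data
Z_test = qt_demo.transform(X_test_demo)

print("skewness before:", skew(X_train_demo[:, 0]))
print("skewness after: ", skew(Z_train[:, 0]))
print("largest transformed training value:", Z_train.max())
```

With the normal output distribution, scikit-learn clips the transformed values, so the training maximum maps to about 5.2 rather than to infinity.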

### Scaling

* Subtract the mean and divide by the standard deviation
* An alternative to a log or quantile transformation, but it does not fix outliers, which remain extreme after rescaling
* Scikit-learn's standard scaler:

      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      scaler.fit_transform(X_train)
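
To see why scaling does not fix outliers, here is a sketch on synthetic data with a single extreme value. The sample and the outlier value of 100 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 999 standard normal draws plus one extreme observation
x = np.append(rng.normal(size=999), 100.0).reshape(-1, 1)

z = StandardScaler().fit_transform(x)

print("mean:", z.mean(), "std:", z.std())
print("largest standardized value:", z.max())
```

The scaled feature has mean zero and unit standard deviation, but the outlier is still many standard deviations from the rest of the sample.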

### Pipelines

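Combine the pre-processing and the model in a pipeline, then fit and predict in one step. The pipeline fits the transformer on the training data only and reuses the fitted transformer when predicting.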

    from sklearn.pipeline import Pipeline
    transformer = ...
    model = ...
    pipe = Pipeline(
        steps = [("transformer", transformer), ("model", model)]
    )
    pipe.fit(X_train, y_train)
    pipe.predict(X_test)
    pipe.score(X_test, y_test)

We can also pass pipe to GridSearchCV, addressing a step's parameters as step__parameter (for example, model__alpha).

### Dummy variables

* Transform categorical features into dummy variables with OneHotEncoder
* Can add this to the pipeline too

## A Kaggle dataset

In [1]:
import pandas as pd

df = pd.read_csv("housing.data")
y = df.median_house_value
X = df.drop(columns=["median_house_value"])
X.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | NEAR BAY |


In [2]:
X.ocean_proximity.unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [3]:
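# Note: X[:-1] drops the last row (hence count = 20639), not the last column;
# describe() already skips the non-numeric ocean_proximity column.
X[:-1].describe()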

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| count | 20639.0 | 20639.0 | 20639.0 | 20639.0 | 20432.0 | 20639.0 | 20639.0 | 20639.0 |
| mean | -119.569624 | 35.63168 | 28.640099 | 2635.755851 | 537.866729 | 1425.478608 | 499.538204 | 3.870743 |
| std | 2.003547 | 2.135846 | 12.585555 | 2181.667858 | 421.395028 | 1132.489526 | 382.338957 | 1.89984 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.5 | 296.0 | 787.0 | 280.0 | 2.5638 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5349 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.7434 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 |


### Drop rows with missing values

In [4]:
df = df.dropna()
y = df.median_house_value
X = df.drop(columns=["median_house_value"])

### Train-test split

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=0
)

### Transformer

In [6]:
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

qt = QuantileTransformer(output_distribution="normal")
dummies = OneHotEncoder()
transformer = make_column_transformer(
    (qt, X.columns[:-1]),
    (dummies, [X.columns[-1]])
)

### See what the transformer does

In [7]:
pd.DataFrame(transformer.fit_transform(X)).head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.152175 | 0.899534 | 0.912774 | -1.355132 | -1.694923 | -1.697037 | -1.639539 | 1.919214 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -1.137677 | 0.882739 | -0.479432 | 1.813497 | 1.466763 | 1.193058 | 1.623705 | 1.914455 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -1.169401 | 0.875354 | 5.199338 | -0.65493 | -1.334066 | -1.317949 | -1.324958 | 1.629912 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -1.189522 | 0.875354 | 5.199338 | -0.874727 | -1.059682 | -1.17689 | -1.068522 | 1.083588 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -1.189522 | 0.875354 | 5.199338 | -0.471004 | -0.77314 | -1.164444 | -0.81091 | 0.191633 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


### Pipeline with lasso

In [8]:

from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# No intercept: the full set of dummy variables already spans a constant
model = Lasso(alpha=10, fit_intercept=False)
pipe = Pipeline(
    steps = [("transformer", transformer), ("model", model)]
)

### Train and test

In [9]:
pipe.fit(X_train, y_train)
score_train = pipe.score(X_train, y_train)
score_test = pipe.score(X_test, y_test)
print("R-squared on training data is", score_train)
print("R-squared on test data is", score_test)

R-squared on training data is 0.617063711730554
R-squared on test data is 0.6264072535855956
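
After fitting, the lasso coefficients can be read from the model step of the pipeline through `named_steps`. The sketch below uses synthetic data and a StandardScaler in place of the column transformer, both assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Only the first two of five features matter for the target
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

pipe_demo = Pipeline(
    steps=[("transformer", StandardScaler()), ("model", Lasso(alpha=0.5))]
)
pipe_demo.fit(X_demo, y_demo)

# Access the fitted step by the name given in the pipeline
coefs = pipe_demo.named_steps["model"].coef_
print(coefs)
print("number of zero coefficients:", int((coefs == 0).sum()))
```

The penalty shrinks the two relevant coefficients toward zero and sets the coefficients on the irrelevant features exactly to zero.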


## GridSearchCV

In [10]:
from sklearn.model_selection import GridSearchCV

alphas = (0.1, 1, 10, 100)

cv = GridSearchCV(
    pipe,
    param_grid = {"model__alpha": alphas}
)
cv.fit(X_train, y_train)
print(f"best alpha is {cv.best_params_}")
print(f"score on the test data is {cv.score(X_test, y_test)}")

best alpha is {'model__alpha': 10}
score on the test data is 0.6264072535855956
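
By default, GridSearchCV uses 5-fold cross-validation and then refits the best pipeline on the full training set, so the test score here matches the earlier fit with alpha=10. The validation score for every candidate is stored in `cv_results_`. The sketch below shows how to tabulate it on synthetic data; the data, the grid, and the names `pipe_demo` and `search` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

pipe_demo = Pipeline(
    steps=[("transformer", StandardScaler()), ("model", Lasso())]
)
search = GridSearchCV(
    pipe_demo,
    param_grid={"model__alpha": [0.01, 0.1, 1.0]},
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate alpha, with the mean validation R-squared
results = pd.DataFrame(search.cv_results_)[
    ["param_model__alpha", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)
print("best alpha:", search.best_params_["model__alpha"])
```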
