# Feature Engineering

Notebook supporting the [**Do we know our data, as good as we know our tools** talk](https://devoxxuk19.confinabox.com/talk/VEM-8021/Do_we_know_our_data_as_good_as_we_know_our_tools%3F) at [Devoxx UK 2019](http://twitter.com/@DevoxxUK).

The contents of this notebook are inspired by many sources.


### High-level steps covered:

- Find hidden information
  - feature extraction
  - applying statistical functions
  - applying physics functions
- Deal with too much data
  - dimensionality reduction
  - feature selection
- Statistical Inference 


### Resources

- [Basic Feature Engineering With Time Series Data in Python](http://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/)
- [Zillow Prize - EDA, Data Cleaning & Feature Engineering](https://www.kaggle.com/lauracozma/eda-data-cleaning-feature-engineering)
- [Feature-wise transformations](https://distill.pub/2018/feature-wise-transformations)
- [tsfresh - extracts characteristics from time series](https://tsfresh.readthedocs.io/en/latest/text/introduction.html)
- [featuretools - an open source Python framework for automated feature engineering](https://github.com/featuretools/featuretools/)
- [Synthetic features and outliers notebook](https://colab.research.google.com/notebooks/mlcc/synthetic_features_and_outliers.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=syntheticfeatures-colab&hl=en#scrollTo=jnKgkN5fHbGy)


Please refer to the [Slides](http://bit.ly/do-we-know-our-data) for the steps that follow.

#### Load Your Data

In [0]:
import pandas
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# The CSV has no header row, so the column names are supplied explicitly.
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv"
names = ["crim","zn","indus","chas","nox","rm","age","dis","rad","tax","ptratio","b","lstat","medv"]
data = pandas.read_csv(url, names=names)

# Download the field descriptions, replacing any copy left by a previous run.
!rm -f housing.names
!wget -q https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.names
print("Names and descriptions of the fields of the Boston Housing dataset can be found at")
print("https://github.com/jbrownlee/Datasets/blob/master/housing.names")
print("")
!cat housing.names

Names and descriptions of the fields of the Boston Housing dataset can be found at
https://github.com/jbrownlee/Datasets/blob/master/housing.names

1. Title: Boston Housing Data

2. Sources:
   (a) Origin:  This dataset was taken from the StatLib library which is
                maintained at Carnegie Mellon University.
   (b) Creator:  Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the 
                 demand for clean air', J. Environ. Economics & Management,
                 vol.5, 81-102, 1978.
   (c) Date: July 7, 1993

3. Past Usage:
   -   Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 
       1980.   N.B. Various transformations are used in the table on
       pages 244-261.
    -  Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning.
       In Proceedings on the Tenth International Conference of Machine 
       Learning, 236-243, University of Massachusetts, Amherst. Morgan
       Kaufmann.

4. Relevant Information:

  

### Find hidden information

- feature extraction
- applying statistical functions
- applying physics functions

#### Feature extraction
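
As a minimal sketch of feature extraction, the snippet below derives two new columns from existing ones. It assumes a DataFrame that uses the column names from the load cell; the tiny inline frame, the derived names `tax_per_room` and `age_band`, and the bin edges are illustrative assumptions rather than anything taken from the talk.

```python
import pandas as pd

# Tiny stand-in frame using column names from the housing data (values are made up).
sample = pd.DataFrame({
    "rm":  [6.5, 5.9, 7.2, 6.0],
    "tax": [296.0, 242.0, 222.0, 311.0],
    "age": [65.2, 78.9, 45.8, 100.0],
})

def extract_features(df):
    """Return a copy of df with two hypothetical derived features."""
    out = df.copy()
    # Ratio feature: property tax rate relative to average rooms per dwelling.
    out["tax_per_room"] = out["tax"] / out["rm"]
    # Binned feature: coarse bands for the share of older housing units.
    out["age_band"] = pd.cut(
        out["age"],
        bins=[0, 50, 80, 100],
        labels=["low", "mid", "high"],
        include_lowest=True,
    )
    return out

features = extract_features(sample)
print(features)
```

Working on a copy keeps the original columns untouched, so derived features can be added or dropped without reloading the data.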

#### Applying statistical functions
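
One possible reading of "applying statistical functions" is transforming a skewed column and then standardising it. The sketch below does that with a log transform followed by a z-score; the values are made up and merely stand in for a right-skewed column such as `crim`.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a right-skewed column such as "crim".
crim = pd.Series([0.006, 0.027, 0.09, 1.2, 8.9], name="crim")

# log1p compresses the long right tail while keeping zero mapped to zero.
crim_log = np.log1p(crim)

# z-score: centre on the mean and scale by the population standard deviation.
crim_z = (crim_log - crim_log.mean()) / crim_log.std(ddof=0)

print(crim_z.round(3))
```

After the transform the column has mean 0 and standard deviation 1, which puts it on a comparable scale with other standardised features.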

#### Applying physics functions

### Deal with too many features / too much data

- dimensionality reduction
- feature selection

#### Dimensionality reduction
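
Principal component analysis (PCA) is one common dimensionality reduction technique; whether the talk uses it is an assumption here. The sketch below implements it with NumPy's SVD on made-up data in which one column is nearly a copy of another. The helper name `pca` is hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Project X (rows = samples) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    # Centre each column so the components describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Share of total variance captured by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained[:n_components]

# Made-up data: the third column is almost a copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=100)])

reduced, explained = pca(X, n_components=2)
print(reduced.shape, explained.sum())
```

Because the third column adds almost no new information, two components retain nearly all of the variance. On real data with mixed units, standardising the columns first is usually advisable.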

#### Feature selection
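
A simple filter-style approach to feature selection ranks features by their absolute correlation with the target. The sketch below assumes `medv` is the target, as in the housing data; the helper name `select_by_correlation` and the synthetic values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def select_by_correlation(df, target, k):
    """Return the k feature names most correlated (in absolute value) with target."""
    features = df.drop(columns=[target])
    # Pearson correlation of every feature column with the target column.
    corr = features.corrwith(df[target]).abs()
    # Constant columns yield NaN correlations; drop them before ranking.
    return corr.dropna().sort_values(ascending=False).head(k).index.tolist()

# Made-up data: "medv" depends on "rm" and "lstat"; "noise" is unrelated.
rng = np.random.default_rng(1)
rm = rng.normal(6.0, 0.7, size=200)
lstat = rng.normal(12.0, 7.0, size=200)
frame = pd.DataFrame({
    "rm": rm,
    "lstat": lstat,
    "noise": rng.normal(size=200),
    "medv": 9.0 * rm - 0.6 * lstat + rng.normal(scale=1.0, size=200),
})

selected = select_by_correlation(frame, target="medv", k=2)
print(selected)
```

Correlation only captures linear, one-feature-at-a-time relationships, so this is a quick first filter rather than a complete selection strategy.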

### Statistical Inference

- Understanding statistical inference [video]
- [Four ideas of Statistical Inference](http://www.bristol.ac.uk/medical-school/media/rms/red/4_ideas_of_statistical_inference.html)
- [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) [book]
- [Statistical Inference](https://www.coursera.org/learn/statistical-inference) [course]
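
As a small illustration of statistical inference, the sketch below estimates a percentile bootstrap confidence interval for a sample mean. The helper name `bootstrap_mean_ci`, its defaults, and the synthetic sample (standing in for a column such as `medv`) are assumptions made for this example.

```python
import numpy as np

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    # The interval is read off the empirical quantiles of those means.
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Made-up sample standing in for a column such as "medv".
sample = np.random.default_rng(42).normal(loc=22.5, scale=9.0, size=300)
low, high = bootstrap_mean_ci(sample)
print(f"sample mean = {sample.mean():.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The interval expresses how much the estimated mean would vary across repeated samples of the same size, which is the kind of statement about a population that inference aims for.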


### Please refer to the [Slides](http://bit.ly/do-you-know-your-data) for the steps that follow.