## Data preparation
The goal of this phase is to deliver the data in the representation best suited for consumption by more advanced data mining and machine learning methods. This includes transforming the data, defining and extracting characteristic features, and selecting the most relevant ones.

### Data transformation

> Based on the exploration activities, you have retained a number of interesting attributes and instances. You should now apply some standard transformations to them, such as:
> - normalization to make the attributes comparable: using min-max or z-score
> - discretization to convert numerical into categorical attributes
> - one-hot encoding to convert categorical into numerical attributes
> - resampling to change the sampling frequency (e.g. of a time series)
> - ...
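
As an illustration of the transformations listed above, here is a minimal sketch using only the Python standard library. The function names (`min_max`, `z_score`, `discretize`, `one_hot`, `resample_hourly_mean`) and the choice of hourly means for resampling are assumptions made for this example. In practice, libraries such as pandas and scikit-learn offer equivalent, more complete functionality.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre values on 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def discretize(values, edges, labels):
    """Map each number to a category; needs len(labels) == len(edges) + 1.

    A value equal to an edge falls into the bin above that edge.
    """
    return [labels[bisect_right(edges, v)] for v in values]


def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def resample_hourly_mean(samples):
    """Downsample (datetime, value) pairs to one mean value per hour."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}
```

For example, `min_max([10, 20, 30])` returns `[0.0, 0.5, 1.0]`, and `discretize([17, 18, 70], [18, 65], ["young", "adult", "senior"])` returns `["young", "adult", "senior"]`.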

### Feature construction & extraction
Feature engineering is often the most creative part of the data science process. It aims to turn the raw (but preprocessed) data into higher-level concepts that are valuable for the modelling step.

> Consider some of the common features for the different types of data:
> - for time series data (using a windowing approach if appropriate):
>     - mean & standard deviation
>     - mean or zero crossing rate: the number of times the series crosses the mean/zero axis
>     - peak distance: the time between two peaks
>     - peak mean: the mean of the peaks
>     - count above/below mean
>     - first/last location of maximum/minimum
>     - number of occurrences of minimum/maximum/any value
>     - number of peaks
>     - longest streak (consecutive run) above/below mean
>     - mean (absolute) change
>     - average autocorrelation
>     - number of observed values within interval
>     - ... 
> - for location data (using geohashing when relevant):  
>     - time spent at particular location
>     - number of items passing a particular location
>     - time spent between different locations
>     - min-max-mean speed between different locations
>     - ... 
> - for log data: 
>     - time-based features: duration of a sequence, timestamp at which most events occur, ...
>     - frequency-based features: number of events in sequence, presence or absence of specific events, ...
>     - density-based features: average time between occurrences of events, number of events at the timestamp at which most events occur, ...
>     - rank-based features: rank of the event in the sequence in chronological order, rank of the event in the sequence sorted by its number of occurrences, ...
> - for text data: 
>     - term and document frequency, tf-idf
>     - word stems
>     - part-of-speech tags
>     - n-grams
>     - ...
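
To make a few of the time series features above concrete, here is a minimal sketch in plain Python. The helper names (`windows`, `mean_crossings`, `longest_run_above_mean`, `number_of_peaks`, `mean_abs_change`, `extract_features`) are hypothetical. The definition of a peak (a sample strictly greater than both neighbours) is an assumption for this example, and other definitions are possible.

```python
from statistics import mean, pstdev


def windows(x, size, step):
    """Yield consecutive windows of `size` samples, advancing by `step`."""
    for start in range(0, len(x) - size + 1, step):
        yield x[start:start + size]


def mean_crossings(x):
    """Number of times consecutive samples lie on opposite sides of the mean."""
    mu = mean(x)
    above = [v > mu for v in x]
    return sum(a != b for a, b in zip(above, above[1:]))


def longest_run_above_mean(x):
    """Length of the longest consecutive run of samples above the mean."""
    mu = mean(x)
    best = current = 0
    for v in x:
        current = current + 1 if v > mu else 0
        best = max(best, current)
    return best


def number_of_peaks(x):
    """Count samples strictly greater than both of their neighbours."""
    return sum(x[i - 1] < x[i] > x[i + 1] for i in range(1, len(x) - 1))


def mean_abs_change(x):
    """Mean absolute difference between consecutive samples."""
    return mean([abs(b - a) for a, b in zip(x, x[1:])])


def extract_features(x):
    """Collect a small feature dictionary for one series or one window."""
    mu = mean(x)
    return {
        "mean": mu,
        "std": pstdev(x),
        "mean_crossings": mean_crossings(x),
        "count_above_mean": sum(v > mu for v in x),
        "first_location_of_maximum": x.index(max(x)),
        "number_of_peaks": number_of_peaks(x),
        "longest_run_above_mean": longest_run_above_mean(x),
        "mean_abs_change": mean_abs_change(x),
    }
```

With a windowing approach, the same extractor can be applied per window, for instance `[extract_features(w) for w in windows(series, size=60, step=30)]`, where `series` is a list of numeric samples and the window parameters are illustrative.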

### Feature selection
Feature selection is the activity of selecting the most suitable/optimal set of features for the proposed data science solution.
> You should consider different approaches:
> - filter: select features independently of the model, based only on their general characteristics, such as their correlation with the variable to predict, their ability to discriminate between outcomes (e.g. using entropy or variance), or their interaction with other features (e.g. mutual correlation, to remove redundant ones). PCA can complement this step, but note that it constructs new components rather than selecting existing features.
> - wrapper: run the chosen data analysis method on a representative dataset with different subsets of potential features and evaluate the results
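
The difference between the two approaches can be sketched in plain Python. The code below is an illustration under stated assumptions, not a recommended implementation: the filter ranks features by absolute Pearson correlation with the target, and the wrapper exhaustively scores small feature subsets using leave-one-out accuracy of a 1-nearest-neighbour classifier as a stand-in for "the chosen data analysis method". All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def filter_select(rows, target, k):
    """Filter: keep the k feature indices with the highest |correlation|."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], target))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def loo_accuracy(rows, target, subset):
    """Leave-one-out accuracy of 1-nearest-neighbour using only `subset`."""
    hits = 0
    for i, row in enumerate(rows):
        def dist(j):
            # Squared Euclidean distance restricted to the chosen features.
            return sum((row[f] - rows[j][f]) ** 2 for f in subset)
        nearest = min((j for j in range(len(rows)) if j != i), key=dist)
        hits += target[nearest] == target[i]
    return hits / len(rows)


def wrapper_select(rows, target, max_size):
    """Wrapper: evaluate every subset up to max_size and keep the best one.

    Ties are resolved in favour of the subset tried first, so smaller
    subsets win. Exhaustive search is only feasible for few features.
    """
    n_features = len(rows[0])
    candidates = (c for size in range(1, max_size + 1)
                  for c in combinations(range(n_features), size))
    return max(candidates, key=lambda c: loo_accuracy(rows, target, c))
```

Here `rows` is a list of equal-length numeric feature vectors and `target` a list of labels of the same length. The filter is cheap but ignores the model, whereas the wrapper reflects the model's behaviour at the cost of many more evaluations.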




