# Features

Prof. Dr. Georgios K. Ouzounis<br/>
[georgios.ouzounis@go.kauko.lt](mailto:georgios.ouzounis@go.kauko.lt)

## Contents

- features - definition 
- feature impact
- feature selection
- feature engineering
- feature vectors

## Features - Definition

A feature is an individual measurable property or characteristic of a phenomenon being observed ([Wikipedia](https://en.wikipedia.org/wiki/Feature_(machine_learning))).

Choosing informative, discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification and regression. 

Features are usually numeric, but structural features such as strings and graphs exist too.


In datasets, features (or variables / attributes) appear as columns:


<img src="https://3gp10c1vpy442j63me73gy3s-wpengine.netdna-ssl.com/wp-content/uploads/2018/03/Screen-Shot-2018-03-22-at-8.29.31-AM.png"/>

The image above shows a snippet of a public dataset with information about the passengers on the Titanic’s maiden voyage. Each feature, or column, represents a measurable piece of data that can be used for analysis: Name, Age, Sex, Fare, and so on. [Source](https://www.datarobot.com/wiki/feature/)
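As a minimal sketch of the "features as columns" idea, the snippet below stores a few rows as dictionaries and pulls out one feature as a column. The rows are made up for illustration and are not records from the actual dataset; only the column names Name, Age, Sex and Fare come from the text above.

```python
# Made-up rows in the spirit of the passenger table; not real records.
passengers = [
    {"Name": "Passenger A", "Age": 22.0, "Sex": "male", "Fare": 7.25},
    {"Name": "Passenger B", "Age": 38.0, "Sex": "female", "Fare": 71.28},
    {"Name": "Passenger C", "Age": 26.0, "Sex": "female", "Fare": 7.92},
]

# The feature names are the keys shared by every row.
feature_names = list(passengers[0].keys())

# One feature corresponds to one column: the same key read from every row.
ages = [row["Age"] for row in passengers]

print(feature_names)  # ['Name', 'Age', 'Sex', 'Fare']
print(ages)           # [22.0, 38.0, 26.0]
```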

Features are the basic building blocks of datasets. The quality of the features in your dataset has a major impact on the quality of the insights you will be able to get when you use that dataset for machine learning.

You can improve the quality of your dataset’s features with processes like:

1. feature impact assessment,
2. feature selection,
3. feature engineering. 

All three are difficult and tedious. 


## Feature Impact

Feature impact identifies which features in a dataset have the greatest effect on the outcomes of a machine learning model.

Depending on their properties, different machine learning algorithms focus on different features in a dataset. 

**Impact Example:** features that have strong linear trends (that is, they increase or decrease at a steady rate) will have a high impact in linear methods such as linear regression, while nonlinear methods can also exploit more complex relationships in the data.
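The impact example can be illustrated with the Pearson correlation coefficient, used here as one simple proxy for how visible a feature is to a linear method. This is a sketch on toy data, not a general impact measure: the helper `pearson` and the two toy targets are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy feature and two toy targets that both depend entirely on it.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # steady linear trend
y_quadratic = [v * v for v in x]    # symmetric, nonlinear relationship

r_linear = pearson(x, y_linear)        # close to 1: strong linear signal
r_quadratic = pearson(x, y_quadratic)  # close to 0: invisible to a linear measure

print(round(r_linear, 3), round(r_quadratic, 3))
```

Both targets are fully determined by `x`, yet only the linear one shows up in the correlation. A nonlinear method could still recover the quadratic relationship.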


In the big data world, the size and dimensionality of datasets are unprecedented.

Identifying relevant and valuable information allows us to focus on the factors that matter the most when building data models, saving both time and resources.

Feature impact can be computed by a handful of machine learning algorithms and usually requires intuition and a deeper insight into your data. 

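In practice, it is often estimated empirically, for example by measuring how much a model's performance changes when a feature is altered or removed.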


Feature impact is used for both:

1. feature selection, to improve the accuracy of your models,
2. identifying target leakage, to avoid models that look accurate during training but perform poorly in practice. If a single feature is extremely impactful on a model’s outcomes, that is a primary indicator of target leakage.

Target leakage (also called data leakage) occurs when you train your algorithm on a dataset that includes information that would not be available at the time of prediction, when you apply the model to data you collect in the future.
Because the model has effectively already seen the actual outcomes, its results will be unrealistically accurate on the training data, like bringing an answer sheet into an exam.
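A crude screening step for leakage is to flag any single feature that is almost perfectly correlated with the target. The sketch below assumes numeric features and uses an arbitrary threshold of 0.95; the function `flag_possible_leakage`, the loan data and the feature names are hypothetical. A flagged feature is only a candidate for closer inspection, not proof of leakage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def flag_possible_leakage(features, target, threshold=0.95):
    """Return the names of features whose |correlation| with the target exceeds the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) > threshold
    ]


# Hypothetical loan data: 1 means the loan defaulted.
target = [0, 1, 0, 1, 1, 0]
features = {
    "income": [52, 31, 45, 38, 60, 41],
    # Only known after the outcome has happened, so it leaks the target.
    "days_overdue": [0, 95, 0, 120, 101, 0],
}

suspects = flag_possible_leakage(features, target)
print(suspects)  # ['days_overdue']
```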

## Feature Selection

The initial set of raw features can be redundant and too large to be managed!

Select a subset of features, or construct a new and reduced set of features, to facilitate learning and to improve generalization and interpretability.

**Selection example:** if you’re trying to predict flight delays, today’s temperature may be important, but the temperature three months ago will not be.


Good feature selection eliminates irrelevant or redundant columns from your dataset without sacrificing accuracy. 

Feature selection, in contrast to dimensionality reduction, doesn’t involve creating new features or transforming existing ones, but rather getting rid of the ones that don’t add value to your analysis.
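A minimal sketch of this idea: the hypothetical helper `select_features` below only removes columns and never transforms or creates any. It uses two deliberately simple rules chosen for illustration, dropping constant columns (no information) and exact copies of an earlier column (redundant). The column names and values are made up.

```python
def select_features(columns):
    """Keep columns that vary and are not exact copies of an earlier kept column."""
    kept = {}
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant column: carries no information
        if any(values == other for other in kept.values()):
            continue  # exact duplicate of a column already kept
        kept[name] = values
    return kept


# Made-up flight-delay columns.
columns = {
    "temperature_c": [21.0, 18.5, 25.0, 19.0],
    "temperature_copy": [21.0, 18.5, 25.0, 19.0],    # redundant
    "airport_code": ["VNO", "VNO", "VNO", "VNO"],    # constant
    "wind_speed": [3.2, 7.8, 1.1, 5.4],
}

selected = select_features(columns)
print(list(selected))  # ['temperature_c', 'wind_speed']
```

Real selection methods also weigh each feature's relationship with the target; these two rules only catch the most obvious cases.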

The benefits of feature selection for machine learning include:
- Reducing the chance of overfitting.
- Reducing the CPU, I/O, and RAM load the production system needs to build and use the model: fewer features mean fewer operations to read and preprocess the data, which improves algorithm run speed.
- Increasing the model’s interpretability by revealing the most informative factors that drive the model’s outcomes.


### Useful Links

| blog | article |
|-----|:--------|
|<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2018/09/icon-100x100.png" style="float: left; margin-right: 10px;" width="100"/>|[An Introduction to Feature Selection](https://machinelearningmastery.com/category/machine-learning-process/) by Jason Brownlee on October 6, 2014 in Machine Learning Process |
|<img src="http://www.euro-langues.org/wp-content/uploads/2019/05/1*F0LADxTtsKOgmPa-_7iUEQ.jpeg" style="float: left; margin-right: 10px;" width="100"/>| [Why, How and When to Apply Feature Selection](https://towardsdatascience.com/why-how-and-when-to-apply-feature-selection-e9c69adfabf2) by Sudharsan Asaithambi, Jan 31, 2018 in Towards Data Science |
|<img src="https://cdn-images-1.medium.com/max/1600/1*emiGsBgJu2KHWyjluhKXQw.png" style="float: left; margin-right: 10px;" width="100"/>| [3 Effective Feature Selection Strategies](https://medium.com/ai%C2%B3-theory-practice-business/three-effective-feature-selection-strategies-e1f86f331fb1) by Christopher Dossman, Oct. 22, 2017 in AI3 \| Theory Practice Business |

## Feature Engineering 

“Feature engineering is the art part of data science.” 

Sergey Yurgenson, former #1 ranked global competitive data scientist on Kaggle


Feature engineering is the construction and addition of new variables, or features, to your dataset to improve machine learning model performance and accuracy.

Higher-level features can be derived from already available features and added to the feature vector; for example, in the study of diseases the feature 'Age' is useful and is defined as Age = 'Year of death' minus 'Year of birth'.
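The Age example can be sketched in a few lines. The records and the key names `year_of_birth`, `year_of_death` and `age` are assumptions made for illustration; the only part taken from the text is the formula itself.

```python
# Made-up records; only the two raw features are stored initially.
records = [
    {"year_of_birth": 1901, "year_of_death": 1975},
    {"year_of_birth": 1923, "year_of_death": 2010},
]


def add_age(record):
    """Return a copy of the record with the derived 'age' feature added."""
    enriched = dict(record)
    enriched["age"] = record["year_of_death"] - record["year_of_birth"]
    return enriched


engineered = [add_age(r) for r in records]
print([r["age"] for r in engineered])  # [74, 87]
```

Working on a copy leaves the raw records untouched, so the derived feature can be recomputed or dropped later without losing the original data.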

The most effective feature engineering is based on sound knowledge of the business problem for which you’re trying to gain deeper insight and your available data sources. 


It’s an exercise in engagement with the meaning of the problem and the data. For example, you might improve a model used to estimate likely loan defaults by finding external sources of relevant data, such as local unemployment rates or housing price trends.

Feature engineering requires experimenting with multiple possibilities and combining automated techniques with the intuition and knowledge of a domain expert.

Automating this process is known as feature learning, where a machine not only uses features for learning, but also learns the features itself.

Creating new features gives you a deeper understanding of your data and results in more valuable insights. When done correctly, feature engineering is one of the most valuable techniques of data science, but it’s also one of the most challenging:

“Coming up with features is difficult, time-consuming, [and] requires expert knowledge.”
— Andrew Ng,  chief scientist of Baidu, co-chairman and co-founder of Coursera, and adjunct professor at Stanford University


### Useful Links

| blog | article |
|-----|:--------|
|<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2018/09/icon-100x100.png" style="float: left; margin-right: 10px;" width="100"/>| [Discover Feature Engineering, How to Engineer Features and How to Get Good at It](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/) by Jason Brownlee on September 26, 2014 in Machine Learning Process |
|<img src="http://www.euro-langues.org/wp-content/uploads/2019/05/1*F0LADxTtsKOgmPa-_7iUEQ.jpeg" style="float: left; margin-right: 10px;" width="100"/>| [Understanding Feature Engineering (Part 1) — Continuous Numeric Data: Strategies for working with continuous, numerical data](https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b), by Dipanjan (DJ) Sarkar, Data Scientist @Intel, Jan 4, 2018 in Towards Data Science |
|<img src="https://www.datacamp.com/datacamp-sq.png" style="float: left; margin-right: 10px;" width="100"/>| [Machine Learning with Kaggle: Feature Engineering](https://www.datacamp.com/community/tutorials/feature-engineering-kaggle) by Hugo Bowne-Anderson, Jan. 10, 2018 in Data Camp |
|<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/PDSH-cover-small.png" style="float: left; margin-right: 10px;" width="100"/>| [Feature Engineering](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html) by Jake VanderPlas, Nov. 2016 in Python Data Science Handbook |

## Feature Vectors

A feature vector is an n-dimensional vector of numerical features that represent some object. 

Feature vectors offer a richer numerical representation of objects which may facilitate processing and statistical analysis. 

Feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression. 

Feature vectors are often combined with weights using a dot product in order to construct a linear predictor function that is used to determine a score for making a prediction.
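A minimal sketch of such a linear predictor follows. The function name `linear_score`, the optional `bias` term and the example numbers are assumptions for illustration; in a real model the weights would be learned from data.

```python
def linear_score(features, weights, bias=0.0):
    """Dot product of a feature vector and a weight vector, plus an optional bias."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


x = [2.0, 0.5, 1.0]    # feature vector of one object
w = [0.5, -2.0, 3.0]   # one weight per feature (made-up values)

score = linear_score(x, w, bias=1.0)
print(score)  # 4.0
```

The score can then be compared against a threshold for classification or used directly as the predicted value in regression.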

The vector space associated with feature vectors is often called the feature space. 

**Dimensionality reduction:** the dimensionality of the feature space can be reduced with a number of techniques, such as principal component analysis (PCA).