<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Outliers" data-toc-modified-id="Outliers-1">Outliers</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-2">Learning Outcomes</a></span></li><li><span><a href="#A-Tale-of-Two-Outliers" data-toc-modified-id="A-Tale-of-Two-Outliers-3">A Tale of Two Outliers</a></span></li><li><span><a href="#Types-of-Errors" data-toc-modified-id="Types-of-Errors-4">Types of Errors</a></span></li><li><span><a href="#What-is-the-best-way-to-handle-invalid-values?" data-toc-modified-id="What-is-the-best-way-to-handle-invalid-values?-5">What is the best way to handle invalid values?</a></span></li><li><span><a href="#Type-ahead-for-text-entry" data-toc-modified-id="Type-ahead-for-text-entry-6">Type-ahead for text entry</a></span></li><li><span><a href="#What-is-going-on-in-this-dataset?" data-toc-modified-id="What-is-going-on-in-this-dataset?-7">What is going on in this dataset?</a></span></li><li><span><a href="#2-types-of-novelties" data-toc-modified-id="2-types-of-novelties-8">2 types of novelties</a></span></li><li><span><a href="#Defining-Novelties" data-toc-modified-id="Defining-Novelties-9">Defining Novelties</a></span></li><li><span><a href="#3-standard-deviations-rule" data-toc-modified-id="3-standard-deviations-rule-10">3 standard deviations rule</a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-11">Takeaways</a></span></li><li><span><a href="#Demo:-Bounding-values-with-closures" data-toc-modified-id="Demo:-Bounding-values-with-closures-12">Demo: Bounding values with closures</a></span></li></ul></div>

<center><h2>Outliers</h2></center>
<br>

<center><img src="images/inlier.png" width="90%"/></center>

<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Define outliers in your own words.
- Explain the difference between errors and novelties.


<center><h2>A Tale of Two Outliers</h2></center> 
<br>
<br>

1. Errors
1. Novelties

<center><h2>Types of Errors</h2></center>

- Data entry (caused by human)
- Measurement (caused by sensor)
- Manipulation (caused by code)  
- Other (there are many, many types of errors)

<center><h2>What is the best way to handle invalid values?</h2></center>

Prevent them!

Data validation is very useful.
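
As a minimal sketch of validating at entry time, the hypothetical `validate_age` function below rejects implausible ages before they reach the dataset. The function name and the 0 to 120 range are assumptions chosen for illustration; they are not part of the original material.

```python
def validate_age(raw_value, min_age=0, max_age=120):
    """Return the typed text as an int age, or raise ValueError if it is invalid.

    The function name and the default bounds are illustrative assumptions.
    """
    try:
        age = int(raw_value)
    except (TypeError, ValueError):
        # Covers non-numeric text such as "onety-one" and missing values (None)
        raise ValueError(f"Age must be a whole number, got {raw_value!r}") from None
    if not min_age <= age <= max_age:
        raise ValueError(f"Age must be between {min_age} and {max_age}, got {age}")
    return age


print(validate_age("42"))
```

Rejecting a bad value when it is entered means it never has to be detected as an outlier later.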

<center><h2>Type-ahead for text entry</h2></center>

<center><img src="images/onety-one.png" width="50%"/></center>

Before type-ahead, 20-25% of Google search queries had spelling mistakes.

Type-ahead improved the business metrics more than any machine learning model.

<center><h2>What is going on in this dataset?</h2></center>

<center><img src="images/Sexpartners_histogram0.png" width="80%"/></center>

<center><h2>2 types of novelties</h2></center>

1. Generated by the __same__ statistical process as the rest of your data (just unusual spread / sparse sampling).
2. Generated by a __different__ statistical process than the rest of your data.
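
The two cases can be simulated with a rough sketch like the one below. The distributions, their parameters, and the variable names are all assumptions made for illustration, not taken from any real dataset.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Type 1: every value comes from the same process (one normal distribution).
# Any unusual value here is just a rare draw from the tail.
same_process = [rng.gauss(50, 10) for _ in range(1000)]

# Type 2: most values come from that process, but a few come from a
# different process with its own mean and spread.
different_process = (
    [rng.gauss(50, 10) for _ in range(990)]
    + [rng.gauss(150, 5) for _ in range(10)]
)

print("Largest value, same process:     ", round(max(same_process), 1))
print("Largest value, different process:", round(max(different_process), 1))
```

In the second dataset, the extreme values form their own small cluster rather than thinning out gradually, which is one hint that a different process generated them.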

<center><h2>Defining Novelties</h2></center>

Novelties can be defined ad hoc (by a hand-picked rule) or learned from the data.


<center><h2>3 standard deviations rule</h2></center>

<center><img src="images/deviations.png" width="75%"/></center>

<center>A standard learned rule is to consider any value more than 3 standard deviations from the mean as statistically novel.</center>

<center>However, that assumes roughly normal data and a moderately sized sample.</center>

Image Source: http://www.psychwiki.com/wiki/Detecting_Outliers_-_Univariate
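
The 3 standard deviations rule can be sketched with the standard library alone. The function name `flag_novelties` and the sample values are hypothetical, and the sketch uses the sample standard deviation (`statistics.stdev`).

```python
from statistics import mean, stdev


def flag_novelties(values, n_deviations=3):
    """Return the values more than n_deviations standard deviations from the mean."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - center) > n_deviations * spread]


# 21 made-up values: 20 typical ones and 1 extreme one
data = [10] * 10 + [11] * 10 + [100]
print(flag_novelties(data))  # [100]

# With only 4 values, the extreme one inflates the standard deviation
# so much that nothing is flagged.
print(flag_novelties([1, 2, 3, 1000]))  # []
```

The second call illustrates the sample-size caveat: in a very small sample, no single value can sit 3 sample standard deviations from the mean, so the rule never fires.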

<center><h2>Takeaways</h2></center>

- Outliers can come from errors or novelties.
+ Errors should be mostly discarded.
+ Novelties should be mostly modeled.


<center><h2>Bonus Material</h2></center>

Demo: Bounding values with closures
----

A closure is a function that remembers data from the scope in which it was created.

Below, a factory function makes the closure and binds the min and max values to it.

In [55]:
def make_bound_func(min_value, max_value):
    "Define a bound function with a certain min and max"
    def bound(value):
        "Limit value between the min and max"
        return min(max_value, max(value, min_value))
    return bound

In [56]:
# RGB color channels range from 0 to 255
bound_rgb = make_bound_func(min_value=0, max_value=255)

assert bound_rgb(42) == 42
assert bound_rgb(-1) == 0
assert bound_rgb(256) == 255

In [57]:
from math import inf

bound_non_negative = make_bound_func(min_value=0, max_value=inf)

assert bound_non_negative(42) == 42
assert bound_non_negative(-1) == 0
assert bound_non_negative(1_000_000) == 1_000_000
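
As a further sketch, a bound function can be applied to a whole list of values, and the data bound to the closure can be inspected. The percentage readings below are made up, and clipping (rather than discarding) out-of-range values is an assumption about how they should be handled.

```python
def make_bound_func(min_value, max_value):
    "Define a bound function with a certain min and max"
    def bound(value):
        "Limit value between the min and max"
        return min(max_value, max(value, min_value))
    return bound


bound_percent = make_bound_func(min_value=0, max_value=100)

# Hypothetical readings, two of which fall outside the valid range
readings = [12, -5, 87, 140]
cleaned = [bound_percent(reading) for reading in readings]
print(cleaned)  # [12, 0, 87, 100]

# The min and max live in the closure's cells, keyed here by variable name
bound_data = dict(
    zip(
        bound_percent.__code__.co_freevars,
        (cell.cell_contents for cell in bound_percent.__closure__),
    )
)
print(bound_data)
```

Each call to `make_bound_func` creates a separate closure, so `bound_percent` carries its own min and max independently of any other bound function.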

Learn more: https://gist.github.com/brianspiering/ae91413ab693bb066b483d330900d585