In [1]:
import pandas as pd
import psycopg2 as pg2
from sqlalchemy import create_engine

engine = create_engine('postgresql://testuser:testpass@localhost:5432/postgresql_analysis')

con = pg2.connect(host='localhost',
                  user='testuser',
                  password='testpass',
                  database='postgresql_analysis')
con.autocommit = True
cur = con.cursor()

In [2]:
def select(query):
    """Run a SQL query on the open connection and return a pandas DataFrame."""
    return pd.read_sql(query, con)

### Anomaly Detection

An `anomaly` is something that is different from other members of the same group. In data, an anomaly is a record, an observation, or a value that differs from the remaining data points in a way that raises concerns or suspicions. Anomalies go by a number of different names, including *outliers*, *novelties*, *noise*, *deviations*, and *exceptions*, to name a few.

Anomalies typically have one of two sources: real events that are extreme or otherwise unusual, or errors introduced during data collection or processing. While many of the steps used to detect outliers are the same regardless of the source, how we choose to handle a particular anomaly depends on the root cause. As a result, understanding the root cause and distinguishing between the two types of causes is important to the analysis process.

Data can also contain anomalies because of errors in collection or processing. Manually entered data is notorious for typos and incorrect data. Changes to forms, fields, or validation rules can introduce unexpected values, including nulls.

#### The Data Set

The data for the examples in this chapter is a set of records for all earthquakes recorded by the US Geological Survey (USGS) from 2010 to 2020. The USGS provides the data in a number of formats, including real-time feeds, at [https://earthquake.usgs.gov/earthquakes/feed](https://earthquake.usgs.gov/earthquakes/feed/).

The data set contains approximately 1.5 million records. Each record represents a single earthquake event and includes information such as the timestamp, location, magnitude, depth, and source of the information.

Earthquakes are caused by sudden slips along faults in the tectonic plates that exist on the outer surface of the earth. Locations on the edges of these plates experience many more, and more dramatic, earthquakes than other places. The so-called Ring of Fire is a region along the rim of the Pacific Ocean in which many earthquakes occur. Various locations within this region, including California, Alaska, Japan, and Indonesia, will appear frequently in our analysis.

*Magnitude* is a measure of the size of an earthquake at its source, as measured by its seismic waves. Magnitude is recorded on a logarithmic scale, meaning that the amplitude of a magnitude 5 earthquake is 10 times that of a magnitude 4 earthquake.
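As a quick illustration of the logarithmic scale, the amplitude ratio between two magnitudes is 10 raised to the difference between them. The helper name `amplitude_ratio` is hypothetical and introduced only for this sketch.

```python
def amplitude_ratio(mag_a, mag_b):
    """Return how many times larger the amplitude at mag_a is than at mag_b."""
    # Each whole step on the logarithmic scale is a factor of 10 in amplitude.
    return 10 ** (mag_a - mag_b)


print(amplitude_ratio(5, 4))  # one step apart
print(amplitude_ratio(7, 4))  # three steps apart
```

A difference of three magnitude units therefore corresponds to a thousandfold difference in amplitude, which is why values near 9 are so remarkable.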

#### Detecting Outliers

Although the idea of an anomaly or outlier—a data point that is very different from the rest—seems straightforward, actually finding one in any particular data set poses some challenges. The first challenge has to do with knowing when a value or data point is common or rare, and the second is setting a threshold for marking values on either side of this dividing line.

Generally, the larger or more complete the data set, the easier it is to make a judgment on what is truly anomalous. In some instances, we have labeled or “ground truth” values to which we can refer. A label is generally a column in the data set that indicates whether the record is normal or an outlier. Ground truth can be obtained from industry or scientific sources or from past analysis and might tell us, for example, that **any earthquake greater than magnitude 7 is an anomaly**. In other cases, we must look to the data itself and apply reasonable judgment.

#### Sorting to Find Anomalies

In [4]:
query_01 = """
        SELECT mag
        FROM earthquakes
        ORDER BY 1 desc
        """

select(query_01)

Unnamed: 0,mag
0,
1,
2,
3,
4,
...,...
1495921,-9.99
1495922,-9.99
1495923,-9.99
1495924,-9.99


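The rows with empty values at the top of the output are nulls, which PostgreSQL sorts first in descending order, so the largest recorded magnitudes only become visible once nulls are filtered out. With nulls excluded, there is only one value greater than 9, and there are only two additional values greater than 8.5. In many contexts, these would not appear to be particularly large values. However, with a little domain knowledge about earthquakes, we can recognize that these values are in fact both very large and unusual.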

In [3]:
query_02 = """
        SELECT mag
            ,count(id) as earthquakes
            ,round(count(id) * 100.0 / sum(count(id)) over (partition by 1),8) as pct_earthquakes
        FROM earthquakes
        WHERE mag is not null
        GROUP BY 1
        ORDER BY 1 desc
        """

select(query_02)

Unnamed: 0,mag,earthquakes,pct_earthquakes
0,9.10,1,0.000067
1,8.80,1,0.000067
2,8.60,1,0.000067
3,8.30,2,0.000134
4,8.20,4,0.000269
...,...,...,...
812,-2.50,3,0.000202
813,-2.60,2,0.000134
814,-5.00,1,0.000067
815,-9.00,29,0.001949


In [5]:
query_03 = """
        SELECT mag
            ,count(id) as earthquakes
            ,round(count(id) * 100.0 / sum(count(id)) over (partition by 1),8) as pct_earthquakes
        FROM earthquakes
        WHERE mag is not null
        GROUP BY 1
        ORDER BY 1
        """

select(query_03)

Unnamed: 0,mag,earthquakes,pct_earthquakes
0,-9.99,258,0.017336
1,-9.00,29,0.001949
2,-5.00,1,0.000067
3,-2.60,2,0.000134
4,-2.50,3,0.000202
...,...,...,...
812,8.20,4,0.000269
813,8.30,2,0.000134
814,8.60,1,0.000067
815,8.80,1,0.000067


Each magnitude above 8.5 occurs only once, while two earthquakes registered 8.3. By magnitude 6.9, the counts reach double digits, but those earthquakes still represent a very small percentage of the data.
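The frequency queries above can be mirrored in plain Python to make the calculation explicit. This is only a sketch: the sample values and the function name `magnitude_frequencies` are hypothetical, and `None` stands in for a SQL null.

```python
from collections import Counter

# Hypothetical sample of magnitudes; None stands in for a SQL null.
sample_mags = [1.2, 1.2, 0.8, None, 8.3, 1.2, 0.8, 8.3, 9.1, None]


def magnitude_frequencies(mags):
    """Count each non-null magnitude and its percentage of all non-null rows."""
    counts = Counter(m for m in mags if m is not None)
    total = sum(counts.values())
    # Sort descending by magnitude, mirroring ORDER BY 1 desc.
    return [
        (mag, n, round(n * 100.0 / total, 8))
        for mag, n in sorted(counts.items(), reverse=True)
    ]


for row in magnitude_frequencies(sample_mags):
    print(row)
```

As in the SQL version, the percentage is computed against the total of the non-null rows only, so the percentages sum to 100.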

At the low end of values, –9.99 and –9 occur more frequently than we might expect. Although we can’t take the logarithm of zero or a negative number, a logarithm can be negative when the argument is greater than zero and less than one. For example, log(0.5) is equal to approximately –0.301. The values –9.99 and –9 represent extremely small earthquake magnitudes, and we might question whether such small quakes could really be detected.
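The behavior of the logarithm described above can be checked with Python's standard `math` module. The base-10 logarithm is assumed here, since it matches the value quoted in the text.

```python
import math

# The base-10 logarithm of a value between 0 and 1 is negative.
half_log = math.log10(0.5)
print(round(half_log, 3))

# The logarithm of zero is undefined; Python raises ValueError.
try:
    math.log10(0)
except ValueError as err:
    print("undefined:", err)
```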

In [8]:
query_04 = """
        SELECT place, mag, count(*)
        FROM earthquakes
        WHERE mag is not null
        and place = 'Northern California'
        GROUP BY 1,2
        ORDER BY 1,2 desc
        """

select(query_04)

Unnamed: 0,place,mag,count
0,Northern California,5.60,1
1,Northern California,4.73,1
2,Northern California,4.51,1
3,Northern California,4.43,2
4,Northern California,4.29,1
...,...,...,...
441,Northern California,-0.90,48
442,Northern California,-1.00,23
443,Northern California,-1.10,7
444,Northern California,-1.20,2


“Northern California” is the most common place in the data set, and inspecting just the subset for it, we can see that the high and low values are not nearly as extreme as those for the data set as a whole. Earthquakes over 5.0 magnitude are not uncommon overall, but they are outliers for “Northern California.”

#### Calculating Percentiles and Standard Deviations to Find Anomalies

SQL has a window function, `percent_rank`, that returns the percentile for each row within a partition. As with all window functions, the sorting direction is controlled with an `ORDER BY` clause. Similar to the `rank` function, `percent_rank` does not take any arguments; it operates over all the rows returned by the query. The basic form is:

    percent_rank() over (partition by ... order by ...)

In [12]:
query_05 = """
        SELECT place
            ,mag
            ,percentile
            ,count(*)
        FROM
        (
            SELECT place
                ,mag
                ,percent_rank() over (partition by place order by mag) as percentile
            FROM earthquakes
            WHERE mag is not null
            and place = 'Northern California'
        ) a
        GROUP BY 1,2,3
        ORDER BY 1,2 desc
        """

select(query_05)

Unnamed: 0,place,mag,percentile,count
0,Northern California,5.60,1.000000,1
1,Northern California,4.73,0.999987,1
2,Northern California,4.51,0.999974,1
3,Northern California,4.43,0.999948,2
4,Northern California,4.29,0.999935,1
...,...,...,...,...
441,Northern California,-0.90,0.000427,48
442,Northern California,-1.00,0.000129,23
443,Northern California,-1.10,0.000039,7
444,Northern California,-1.20,0.000013,2


Within Northern California, the magnitude 5.6 earthquake has a percentile of 1, or 100%, indicating that all of the other values are less than this one. The magnitude –1.6 earthquake has a percentile of 0, indicating that no other data points are smaller.
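To make the calculation concrete, here is a minimal pure-Python sketch of how `percent_rank` is defined: (rank - 1) / (rows - 1), where tied values share the lowest rank among them. The function below is an illustration written for this section, not part of any library, and the sample values are hypothetical.

```python
def percent_rank(values):
    """Return the percent_rank of each value, in the original order."""
    n = len(values)
    if n <= 1:
        # A single row is conventionally given a percent_rank of 0.
        return [0.0] * n
    # Zero-based position of the first occurrence of each value in sorted
    # order; this equals (rank - 1) with ties sharing the lowest rank.
    first_position = {}
    for position, value in enumerate(sorted(values)):
        first_position.setdefault(value, position)
    return [first_position[v] / (n - 1) for v in values]


sample = [1, 2, 2, 4, 5]  # hypothetical magnitudes
print(percent_rank(sample))
```

The smallest value always receives 0 and the largest always receives 1, and tied values receive the same percentile, which is why grouping by percentile collapses the tied rows in the query output.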

In [13]:
query_06 = """
        SELECT place
            ,mag
            ,ntile(100) over (partition by place order by mag) as ntile
        FROM earthquakes
        WHERE mag is not null
        and place = 'Central Alaska'
        ORDER BY 1,2 desc
        """

select(query_06)

Unnamed: 0,place,mag,ntile
0,Central Alaska,5.4,100
1,Central Alaska,5.3,100
2,Central Alaska,5.2,100
3,Central Alaska,4.8,100
4,Central Alaska,4.6,100
...,...,...,...
33430,Central Alaska,-0.4,1
33431,Central Alaska,-0.5,1
33432,Central Alaska,-0.5,1
33433,Central Alaska,-0.5,1


In the results, the three earthquakes greater than 5 are in the 100th percentile, 1.5 falls within the 79th percentile, and the smallest values of –0.5 fall in the first percentile.

In [14]:
query_07 = """
        SELECT place, ntile
            ,max(mag) as maximum
            ,min(mag) as minimum
        FROM
        (
            SELECT place
                ,mag
                ,ntile(4) over (partition by place order by mag) as ntile
            FROM earthquakes
            WHERE mag is not null
            and place = 'Central Alaska'
        ) a
        GROUP BY 1,2
        ORDER BY 1,2 desc
        """

select(query_07)

Unnamed: 0,place,ntile,maximum,minimum
0,Central Alaska,4,5.4,1.4
1,Central Alaska,3,1.4,1.1
2,Central Alaska,2,1.1,0.8
3,Central Alaska,1,0.8,-0.5


The highest ntile, 4, which represents the 75th to 100th percentiles, has the widest range, spanning from 1.4 to 5.4. On the other hand, the middle 50 percent of values, which include ntiles 2 and 3, range only from 0.8 to 1.4.
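The bucketing that `ntile` performs can be sketched in pure Python. Rows are sorted and split into buckets of nearly equal size; when the rows do not divide evenly, the earlier buckets receive one extra row. The function below is a hypothetical illustration, and the sample values are made up.

```python
def ntile(values, buckets):
    """Assign each value a bucket number from 1 to buckets, in original order."""
    n = len(values)
    # Indexes of the values in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    base, extra = divmod(n, buckets)
    result = [0] * n
    start = 0
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets each take one additional row.
        size = base + (1 if bucket <= extra else 0)
        for i in order[start:start + size]:
            result[i] = bucket
        start += size
    return result


sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # hypothetical values
print(ntile(sample, 4))
```

Because buckets are filled by row count rather than by value, tied values can land in adjacent buckets. That is why the maximum of one ntile can equal the minimum of the next, as with 1.4, 1.1, and 0.8 in the output above.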

---

In addition to finding the percentile or ntile for each row, we can calculate specific percentiles across the entire result set of a query. To do this, we can use the `percentile_cont` function or the `percentile_disc` function. Both use a slightly different syntax than the window functions discussed previously because they require a `WITHIN GROUP` clause. In databases that implement them as window functions, the form is:

    percentile_cont(numeric) within group (order by field_name) over (partition by field_name)

In PostgreSQL, which is used here, they are ordered-set aggregate functions, so the `over` clause is omitted and any partitioning is done with `GROUP BY`.

The numeric is a value between 0 and 1 that represents the percentile to return. For example, 0.25 returns the 25th percentile.
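The two functions differ in what they return: `percentile_cont` interpolates between neighboring values when the requested percentile falls between rows, while `percentile_disc` returns a value that actually exists in the data. The following pure-Python sketch illustrates that difference under the usual definitions; the function bodies and sample values are written for this illustration only.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile: linear interpolation between neighboring rows."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional row
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return ordered[lower]
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def percentile_disc(values, fraction):
    """Discrete percentile: first value whose cumulative share >= fraction."""
    ordered = sorted(values)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]


sample = [1, 2, 3, 4]  # hypothetical values
print(percentile_cont(sample, 0.5))
print(percentile_disc(sample, 0.5))
```

With an even number of rows, the continuous median falls halfway between the two middle values, whereas the discrete version returns the lower of the two.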

In [15]:
query_08 = """
        SELECT 
        percentile_cont(0.25) within group (order by mag) as pct_25
        ,percentile_cont(0.5) within group (order by mag) as pct_50
        ,percentile_cont(0.75) within group (order by mag) as pct_75
        FROM earthquakes
        WHERE mag is not null
        and place = 'Central Alaska'
        """

select(query_08)

Unnamed: 0,pct_25,pct_50,pct_75
0,0.8,1.1,1.4


The query returns the requested percentiles, summarized across the data set. Notice that the values correspond to the maximum values for ntiles 1, 2, and 3 calculated in the previous example. 

Percentiles for different fields can be calculated within the same query by changing the field in the `ORDER BY` clause:

In [16]:
query_09 = """
        SELECT 
        percentile_cont(0.25) within group (order by mag) as pct_25_mag
        ,percentile_cont(0.25) within group (order by depth) as pct_25_depth
        FROM earthquakes
        WHERE mag is not null
        and place = 'Central Alaska'
        """

select(query_09)

Unnamed: 0,pct_25_mag,pct_25_depth
0,0.8,7.1


Unlike other window functions, percentile_cont and percentile_disc require a `GROUP BY` clause at the query level when other fields are present in the query. 

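For example, if we want to consider two areas within Alaska, the query must group by the place field, and the percentiles are calculated per place. The query below groups by place without selecting it, so the output rows are unlabeled; the first row matches the Central Alaska values from the previous query: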

In [18]:
query_10 = """
        SELECT 
        percentile_cont(0.25) within group (order by mag) as pct_25_mag
        ,percentile_cont(0.25) within group (order by depth) as pct_25_depth
        FROM earthquakes
        WHERE mag is not null
        and place in ('Central Alaska', 'Southern Alaska')
        GROUP BY place
        """

select(query_10)

Unnamed: 0,pct_25_mag,pct_25_depth
0,0.8,7.1
1,1.2,10.1


The **standard deviation** is a measure of the variation in a set of values. A lower value means less variation, while a higher number means more variation. When data is normally distributed around the mean, about 68% of the values lie within +/– one standard deviation from the mean, and about 95% lie within two standard deviations. The standard deviation is calculated as the square root of the sum of squared differences from the mean, divided by the number of observations.
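Written as a formula, with each difference from the mean squared before summing, the population standard deviation is shown below. The symbols are notation chosen for this illustration: $x_i$ is an individual value, $\mu$ is the mean, and $N$ is the number of observations.

```latex
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \mu \right)^2}
```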

The `stddev_pop` function finds the standard deviation of a population. If the data set represents the entire population, as is often the case with a customer data set, use `stddev_pop`. The `stddev_samp` function finds the standard deviation of a sample and differs from the above formula by dividing by N – 1 instead of N. This has the effect of increasing the standard deviation, reflecting the loss of accuracy when only a sample of the entire population is used.
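The same distinction exists in Python's standard `statistics` module, where `pstdev` divides by N and `stdev` divides by N – 1. The sketch below uses a small hypothetical sample to show that the sample version is always the larger of the two.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

population_sd = statistics.pstdev(sample)  # divides by N, like stddev_pop
sample_sd = statistics.stdev(sample)       # divides by N - 1, like stddev_samp

print(population_sd)
print(round(sample_sd, 4))
```

The gap between the two shrinks as the number of observations grows, so for a data set with around 1.5 million rows the choice has very little effect on the result.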

With these functions, we can now calculate the number of standard deviations from the mean for each value in the data set. This value is known as the `z-score` and is a way of standardizing data. Values that are above the average have a positive z-score, and those below the average have a negative z-score.

In [19]:
query_11 = """
        SELECT a.place
            ,a.mag
            ,b.avg_mag
            ,b.std_dev
            ,(a.mag - b.avg_mag) / b.std_dev as z_score
        FROM earthquakes a
        JOIN
        (
            SELECT avg(mag) as avg_mag
                ,stddev_pop(mag) as std_dev
            FROM earthquakes
            WHERE mag is not null
        ) b on 1 = 1
        WHERE a.mag is not null
        ORDER BY 2 desc
        """

select(query_11)

Unnamed: 0,place,mag,avg_mag,std_dev,z_score
0,"2011 Great Tohoku Earthquake, Japan",9.10,1.625102,1.273606,5.869083
1,"offshore Bio-Bio, Chile",8.80,1.625102,1.273606,5.633532
2,off the west coast of northern Sumatra,8.60,1.625102,1.273606,5.476497
3,Sea of Okhotsk,8.30,1.625102,1.273606,5.240945
4,"48km W of Illapel, Chile",8.30,1.625102,1.273606,5.240945
...,...,...,...,...,...
1488239,Utah,-9.99,1.625102,1.273606,-9.119856
1488240,"Yellowstone National Park, Montana",-9.99,1.625102,1.273606,-9.119856
1488241,"Yellowstone National Park, Wyoming",-9.99,1.625102,1.273606,-9.119856
1488242,Utah,-9.99,1.625102,1.273606,-9.119856


The largest earthquakes have a z-score of almost 6, whereas the smallest (excluding the –9 and –9.99 earthquakes that appear to be data entry anomalies) have z-scores close to –3. We can conclude that the **largest earthquakes are more extreme outliers than the ones at the low end**.
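The z-scores in the output can be reproduced by hand from the reported mean and standard deviation. The sketch below plugs in the `avg_mag` and `std_dev` values shown in the results; the function name `z_score` is introduced here for illustration, and because the inputs are rounded to six decimals, the results agree with the query output only to about five decimal places.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


# Mean and population standard deviation as reported in the query output.
avg_mag = 1.625102
std_dev = 1.273606

largest = z_score(9.10, avg_mag, std_dev)
suspect = z_score(-9.99, avg_mag, std_dev)

print(round(largest, 4))
print(round(suspect, 4))
```

The sign carries the direction: the magnitude 9.1 earthquake lies almost six standard deviations above the mean, while the –9.99 records lie more than nine standard deviations below it.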