In [1]:
import pandas as pd
import psycopg2 as pg2
from sqlalchemy import create_engine

engine = create_engine('postgresql://testuser:testpass@localhost:5432/postgresql_analysis')

con = pg2.connect(host='localhost',
                  user='testuser',
                  password='testpass',
                  database='postgresql_analysis')
con.autocommit = True
cur = con.cursor()

In [2]:
def select(query):
    """Run a SQL query on the open connection and return the result as a DataFrame."""
    return pd.read_sql(query, con)

### Anomaly Detection

An `anomaly` is something that is different from other members of the same group. In data, an anomaly is a record, an observation, or a value that differs from the remaining data points in a way that raises concerns or suspicions. Anomalies go by a number of different names, including *outliers*, *novelties*, *noise*, *deviations*, and *exceptions*, to name a few.

Anomalies typically have one of two sources: real events that are extreme or otherwise unusual, or errors introduced during data collection or processing. While many of the steps used to detect outliers are the same regardless of the source, how we choose to handle a particular anomaly depends on the root cause. As a result, understanding the root cause and distinguishing between the two types of causes is important to the analysis process.

Data can also contain anomalies because of errors in collection or processing. Manually entered data is notorious for typos and incorrect data. Changes to forms, fields, or validation rules can introduce unexpected values, including nulls.

#### The Data Set

The data for the examples in this chapter is a set of records for all earthquakes recorded by the US Geological Survey (USGS) from 2010 to 2020. The USGS provides the data in a number of formats, including real-time feeds, at [https://earthquake.usgs.gov/earthquakes/feed](https://earthquake.usgs.gov/earthquakes/feed/).

The data set contains approximately 1.5 million records. Each record represents a single earthquake event and includes information such as the timestamp, location, magnitude, depth, and source of the information.

Earthquakes are caused by sudden slips along faults in the tectonic plates that exist on the outer surface of the earth. Locations on the edges of these plates experience many more, and more dramatic, earthquakes than other places. The so-called Ring of Fire is a region along the rim of the Pacific Ocean in which many earthquakes occur. Various locations within this region, including California, Alaska, Japan, and Indonesia, will appear frequently in our analysis.

*Magnitude* is a measure of the size of an earthquake at its source, as measured by its seismic waves. Magnitude is recorded on a logarithmic scale, meaning that the amplitude of a magnitude 5 earthquake is 10 times that of a magnitude 4 earthquake.
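As a quick illustration of the logarithmic scale, the amplitude ratio between two magnitudes is 10 raised to the difference between them. The helper name `amplitude_ratio` is hypothetical and only sketches the arithmetic; it is not part of the data set or the queries.

```python
def amplitude_ratio(mag_a, mag_b):
    """Return how many times larger the wave amplitude at mag_a is than at mag_b."""
    return 10 ** (mag_a - mag_b)

print(amplitude_ratio(5, 4))    # 10
print(amplitude_ratio(9.1, 7))  # roughly 126
```

A difference of about two magnitude units therefore corresponds to amplitudes more than a hundred times larger, which is why values above 8 stand out so strongly.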

#### Detecting Outliers

Although the idea of an anomaly or outlier—a data point that is very different from the rest—seems straightforward, actually finding one in any particular data set poses some challenges. The first challenge has to do with knowing when a value or data point is common or rare, and the second is setting a threshold for marking values on either side of this dividing line.

Generally, the larger or more complete the data set, the easier it is to make a judgment on what is truly anomalous. In some instances, we have labeled or “ground truth” values to which we can refer. A label is generally a column in the data set that indicates whether the record is normal or an outlier. Ground truth can be obtained from industry or scientific sources or from past analysis and might tell us, for example, that **any earthquake greater than magnitude 7 is an anomaly**. In other cases, we must look to the data itself and apply reasonable judgment.

#### Sorting to Find Anomalies

In [3]:
query_01 = """
        SELECT mag
        FROM earthquakes
        ORDER BY 1 desc
        """

select(query_01)

Unnamed: 0,mag
0,
1,
2,
3,
4,
...,...
1495921,-9.99
1495922,-9.99
1495923,-9.99
1495924,-9.99


Because null values sort first in a descending sort in PostgreSQL, the first rows of this output show empty magnitudes. Among the non-null values, there is only one value greater than 9, and there are only two additional values greater than 8.5. In many contexts, these would not appear to be particularly large values. However, with a little domain knowledge about earthquakes, we can recognize that these values are in fact both very large and unusual.

In [4]:
query_02 = """
        SELECT mag
            ,count(id) as earthquakes
            ,round(count(id) * 100.0 / sum(count(id)) over (partition by 1),8) as pct_earthquakes
        FROM earthquakes
        WHERE mag is not null
        GROUP BY 1
        ORDER BY 1 desc
        """

select(query_02)

Unnamed: 0,mag,earthquakes,pct_earthquakes
0,9.10,1,0.000067
1,8.80,1,0.000067
2,8.60,1,0.000067
3,8.30,2,0.000134
4,8.20,4,0.000269
...,...,...,...
812,-2.50,3,0.000202
813,-2.60,2,0.000134
814,-5.00,1,0.000067
815,-9.00,29,0.001949


In [5]:
query_03 = """
        SELECT mag
            ,count(id) as earthquakes
            ,round(count(id) * 100.0 / sum(count(id)) over (partition by 1),8) as pct_earthquakes
        FROM earthquakes
        WHERE mag is not null
        GROUP BY 1
        ORDER BY 1
        """

select(query_03)

Unnamed: 0,mag,earthquakes,pct_earthquakes
0,-9.99,258,0.017336
1,-9.00,29,0.001949
2,-5.00,1,0.000067
3,-2.60,2,0.000134
4,-2.50,3,0.000202
...,...,...,...
812,8.20,4,0.000269
813,8.30,2,0.000134
814,8.60,1,0.000067
815,8.80,1,0.000067


Each magnitude above 8.5 occurs only once, while two earthquakes registered 8.3. By magnitude 6.9, the counts reach double digits, but those earthquakes still represent a very small percentage of the data.

At the low end of values, –9.99 and –9 occur more frequently than we might expect. Although we can’t take the logarithm of zero or a negative number, a logarithm can be negative when the argument is greater than zero and less than one. For example, log(0.5) is equal to approximately –0.301. The values –9.99 and –9 represent extremely small earthquake magnitudes, and we might question whether such small quakes could really be detected.
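The logarithm claim can be checked directly. The sketch below assumes the base-10 logarithm, which matches the magnitude scale described earlier; the variable names are illustrative only.

```python
import math

# A base-10 logarithm is negative for arguments between 0 and 1.
log_half = math.log10(0.5)
print(round(log_half, 3))   # -0.301

# Relative amplitude implied by a magnitude of -9.99, compared with magnitude 0.
tiny_amplitude = 10 ** -9.99
print(tiny_amplitude)       # on the order of 1e-10
```

An amplitude roughly ten billion times smaller than that of a magnitude 0 event supports the suspicion that –9.99 is a placeholder rather than a real measurement.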

In [6]:
query_04 = """
        SELECT place, mag, count(*)
        FROM earthquakes
        WHERE mag is not null
        and place = 'Northern California'
        GROUP BY 1,2
        ORDER BY 1,2 desc
        """

select(query_04)

Unnamed: 0,place,mag,count
0,Northern California,5.60,1
1,Northern California,4.73,1
2,Northern California,4.51,1
3,Northern California,4.43,2
4,Northern California,4.29,1
...,...,...,...
441,Northern California,-0.90,48
442,Northern California,-1.00,23
443,Northern California,-1.10,7
444,Northern California,-1.20,2


“Northern California” is the most common place in the data set, and inspecting just the subset for it, we can see that the high and low values are not nearly as extreme as those for the data set as a whole. Earthquakes over 5.0 magnitude are not uncommon overall, but they are outliers for “Northern California.”

#### Calculating Percentiles and Standard Deviations to Find Anomalies

SQL has a window function, `percent_rank`, that returns the percentile for each row within a partition. As with all window functions, the sorting direction is controlled with an `ORDER BY` clause. Similar to the `rank` function, `percent_rank` does not take any argument; it operates over all the rows returned by the query. The basic form is:

    percent_rank() over (partition by ... order by ...)
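To make the calculation concrete, here is a minimal Python sketch of what `percent_rank` computes within one partition: (rank − 1) / (rows − 1), where tied values share the lowest rank. The function and the sample magnitudes are made up for illustration and are not taken from the earthquakes table.

```python
def percent_rank(values):
    """Mimic SQL percent_rank(): (rank - 1) / (rows - 1), ties share the lowest rank."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return [0.0]
    # The first index of a value in the sorted list equals the number of
    # strictly smaller rows, which is rank - 1.
    return [ordered.index(v) / (n - 1) for v in values]

mags = [-1.2, 0.5, 0.5, 1.1, 5.6]   # made-up magnitudes
print(percent_rank(mags))           # [0.0, 0.25, 0.25, 0.75, 1.0]
```

The smallest value always gets 0 and the largest gets 1, which is the pattern to look for in the query output.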

In [7]:
query_05 = """
        SELECT place
            ,mag
            ,percentile
            ,count(*)
        FROM
        (
            SELECT place
                ,mag
                ,percent_rank() over (partition by place order by mag) as percentile
            FROM earthquakes
            WHERE mag is not null
            and place = 'Northern California'
        ) a
        GROUP BY 1,2,3
        ORDER BY 1,2 desc
        """

select(query_05)

Unnamed: 0,place,mag,percentile,count
0,Northern California,5.60,1.000000,1
1,Northern California,4.73,0.999987,1
2,Northern California,4.51,0.999974,1
3,Northern California,4.43,0.999948,2
4,Northern California,4.29,0.999935,1
...,...,...,...,...
441,Northern California,-0.90,0.000427,48
442,Northern California,-1.00,0.000129,23
443,Northern California,-1.10,0.000039,7
444,Northern California,-1.20,0.000013,2


Within Northern California, the magnitude 5.6 earthquake has a percentile of 1, or 100%, indicating that all of the other values are less than this one. The magnitude –1.6 earthquake has a percentile of 0, indicating that no other data points are smaller.

In [8]:
query_06 = """
        SELECT place
            ,mag
            ,ntile(100) over (partition by place order by mag) as ntile
        FROM earthquakes
        WHERE mag is not null
        and place = 'Central Alaska'
        ORDER BY 1,2 desc
        """

select(query_06)

Unnamed: 0,place,mag,ntile
0,Central Alaska,5.4,100
1,Central Alaska,5.3,100
2,Central Alaska,5.2,100
3,Central Alaska,4.8,100
4,Central Alaska,4.6,100
...,...,...,...
33430,Central Alaska,-0.4,1
33431,Central Alaska,-0.5,1
33432,Central Alaska,-0.5,1
33433,Central Alaska,-0.5,1


We see that the three earthquakes greater than 5 are in the 100th percentile, 1.5 falls within the 79th percentile, and the smallest values of –0.5 fall in the first percentile.

In [9]:
query_07 = """
        SELECT place, ntile
            ,max(mag) as maximum
            ,min(mag) as minimum
        FROM
        (
            SELECT place
                ,mag
                ,ntile(4) over (partition by place order by mag) as ntile
            FROM earthquakes
            WHERE mag is not null
            and place = 'Central Alaska'
        ) a
        GROUP BY 1,2
        ORDER BY 1,2 desc
        """

select(query_07)

Unnamed: 0,place,ntile,maximum,minimum
0,Central Alaska,4,5.4,1.4
1,Central Alaska,3,1.4,1.1
2,Central Alaska,2,1.1,0.8
3,Central Alaska,1,0.8,-0.5


The highest ntile, 4, which represents the 75th to 100th percentiles, has the widest range, spanning from 1.4 to 5.4. On the other hand, the middle 50 percent of values, which include ntiles 2 and 3, range only from 0.8 to 1.4.
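The bucketing behind `ntile` can be sketched in plain Python: sort the values, then split the rows into the requested number of buckets that are as equal in size as possible, with any extra rows going to the lowest-numbered buckets. This is an illustrative approximation with made-up magnitudes, not the database's implementation.

```python
def ntile(values, buckets):
    """Mimic SQL ntile(): sort ascending and assign near-equal buckets, larger ones first."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), buckets)
    labels = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets each receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        labels.extend([bucket] * size)
    return list(zip(ordered, labels))

mags = [-0.5, 0.2, 0.8, 0.9, 1.1, 1.3, 1.4, 2.0, 4.6, 5.4]   # made-up magnitudes
for mag, bucket in ntile(mags, 4):
    print(mag, bucket)
```

Because buckets are defined by row counts rather than by values, equal magnitudes can land in different ntiles when they straddle a bucket boundary.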

---

In addition to finding the percentile or ntile for each row, we can calculate specific percentiles across the entire result set of a query. To do this, we can use the `percentile_cont` function or the `percentile_disc` function. Many databases treat both as window functions, but with a slightly different syntax than other window functions discussed previously because they require a `WITHIN GROUP` clause. PostgreSQL, which this notebook connects to, implements them as ordered-set aggregate functions, so the queries here omit the `over` clause. The general form of the functions is:

    percentile_cont(numeric) within group (order by field_name) over (partition by field_name)

The numeric is a value between 0 and 1 that represents the percentile to return. For example, 0.25 returns the 25th percentile.
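The difference between the two functions is easier to see with a small sketch. Assuming the usual definitions, `percentile_cont` interpolates linearly between neighboring sorted values, while `percentile_disc` returns the first actual value whose cumulative share of rows reaches the requested fraction. The Python functions below are illustrative stand-ins with the same names, not the database code.

```python
import math

def percentile_cont(values, fraction):
    """Continuous percentile: interpolate linearly between neighboring sorted values."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

def percentile_disc(values, fraction):
    """Discrete percentile: first sorted value whose cumulative share reaches the fraction."""
    ordered = sorted(values)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]

mags = [1, 2, 3, 4]                  # made-up values
print(percentile_cont(mags, 0.5))    # 2.5
print(percentile_disc(mags, 0.5))    # 2
```

With an even number of rows the continuous median falls between two observations, whereas the discrete version always returns a value that exists in the data.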

In [10]:
query_08 = """
        SELECT 
        percentile_cont(0.25) within group (order by mag) as pct_25
        ,percentile_cont(0.5) within group (order by mag) as pct_50
        ,percentile_cont(0.75) within group (order by mag) as pct_75
        FROM earthquakes
        WHERE mag is not null
        and place = 'Central Alaska'
        """

select(query_08)

Unnamed: 0,pct_25,pct_50,pct_75
0,0.8,1.1,1.4


The query returns the requested percentiles, summarized across the data set. Notice that the values correspond to the maximum values for ntiles 1, 2, and 3 calculated in the previous example. 

Percentiles for different fields can be calculated within the same query by changing the field in the `ORDER BY` clause:

In [11]:
query_09 = """
        SELECT 
        percentile_cont(0.25) within group (order by mag) as pct_25_mag
        ,percentile_cont(0.25) within group (order by depth) as pct_25_depth
        FROM earthquakes
        WHERE mag is not null
        and place = 'Central Alaska'
        """

select(query_09)

Unnamed: 0,pct_25_mag,pct_25_depth
0,0.8,7.1


Unlike other window functions, percentile_cont and percentile_disc require a `GROUP BY` clause at the query level when other fields are present in the query. 

For example, if we want to consider two areas within Alaska, and so include the place field, the query must also include it in the `GROUP BY`, and the percentiles are calculated per place:

In [12]:
query_10 = """
        SELECT 
        percentile_cont(0.25) within group (order by mag) as pct_25_mag
        ,percentile_cont(0.25) within group (order by depth) as pct_25_depth
        FROM earthquakes
        WHERE mag is not null
        and place in ('Central Alaska', 'Southern Alaska')
        GROUP BY place
        """

select(query_10)

Unnamed: 0,pct_25_mag,pct_25_depth
0,0.8,7.1
1,1.2,10.1


The **standard deviation** is a measure of the variation in a set of values. A lower value means less variation, while a higher number means more variation. When data is normally distributed around the mean, about 68% of the values lie within +/– one standard deviation from the mean, and about 95% lie within two standard deviations. The standard deviation is calculated by summing the squared differences from the mean, dividing by the number of observations, and taking the square root of the result.

The `stddev_pop` function finds the standard deviation of a population. If the data set represents the entire population, as is often the case with a customer data set, use the stddev_pop. The `stddev_samp` finds the standard deviation of a sample and differs from the above formula by dividing by N – 1 instead of N. This has the effect of increasing the standard deviation, reflecting the loss of accuracy when only a sample of the entire population is used.
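The same distinction exists in Python's standard library, which makes for a quick check of the N versus N – 1 behavior. The magnitudes below are made up for illustration; `statistics.pstdev` plays the role of `stddev_pop` and `statistics.stdev` the role of `stddev_samp`.

```python
import statistics

mags = [0.8, 1.1, 1.4, 2.0, 4.7]   # made-up magnitudes

population_sd = statistics.pstdev(mags)   # divides by N, like stddev_pop
sample_sd = statistics.stdev(mags)        # divides by N - 1, like stddev_samp

print(population_sd, sample_sd)
```

The sample figure is always the larger of the two, and the gap shrinks as the number of observations grows, so with about 1.5 million records the choice makes little practical difference.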

With this function, we can now calculate the number of standard deviations from the mean for each value in the data set. This value is known as the `z-score` and is a way of standardizing data. Values that are above the average have a positive z-score, and those below the average have a negative z-score.

In [13]:
query_11 = """
        SELECT a.place
            ,a.mag
            ,b.avg_mag
            ,b.std_dev
            ,(a.mag - b.avg_mag) / b.std_dev as z_score
        FROM earthquakes a
        JOIN
        (
            SELECT avg(mag) as avg_mag
                ,stddev_pop(mag) as std_dev
            FROM earthquakes
            WHERE mag is not null
        ) b on 1 = 1
        WHERE a.mag is not null
        ORDER BY 2 desc
        """

select(query_11)

Unnamed: 0,place,mag,avg_mag,std_dev,z_score
0,"2011 Great Tohoku Earthquake, Japan",9.10,1.625102,1.273606,5.869083
1,"offshore Bio-Bio, Chile",8.80,1.625102,1.273606,5.633532
2,off the west coast of northern Sumatra,8.60,1.625102,1.273606,5.476497
3,"48km W of Illapel, Chile",8.30,1.625102,1.273606,5.240945
4,Sea of Okhotsk,8.30,1.625102,1.273606,5.240945
...,...,...,...,...,...
1488239,"Yellowstone National Park, Wyoming",-9.99,1.625102,1.273606,-9.119856
1488240,"Yellowstone National Park, Wyoming",-9.99,1.625102,1.273606,-9.119856
1488241,"Yellowstone National Park, Wyoming",-9.99,1.625102,1.273606,-9.119856
1488242,"Yellowstone National Park, Wyoming",-9.99,1.625102,1.273606,-9.119856


The largest earthquakes have a z-score of almost 6, whereas the smallest (excluding the –9 and –9.99 earthquakes that appear to be data entry anomalies) have z-scores close to –3. We can conclude that the **largest earthquakes are more extreme outliers than the ones at the low end**.
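The z-score arithmetic can be reproduced outside the database. The sketch below plugs in the `avg_mag` and `std_dev` values shown in the query output; the function name `z_score` is illustrative.

```python
AVG_MAG = 1.625102   # avg_mag from the query output
STD_DEV = 1.273606   # std_dev from the query output

def z_score(mag, mean=AVG_MAG, std_dev=STD_DEV):
    """Number of standard deviations that mag lies from the mean."""
    return (mag - mean) / std_dev

print(round(z_score(9.10), 2))    # 5.87
print(round(z_score(-9.99), 2))   # -9.12
```

Both results agree with the `z_score` column for the first and last rows of the output.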

#### Graphing to Find Anomalies Visually

The **bar graph** is used to plot a histogram or distribution of the values in a field and is useful for both characterizing the data and spotting outliers. The full extent of values is plotted along one axis, and the number of occurrences of each value is plotted on the other axis. The extreme high and low values are interesting, as is the shape of the plot. We can quickly determine whether the distribution is approximately normal (symmetric around a peak or average value), has another type of distribution, or has peaks at particular values.

In [14]:
query_12 = """
        SELECT mag
            ,count(*) as earthquakes
        FROM earthquakes
        GROUP BY 1
        ORDER BY 1
        """

select(query_12)

Unnamed: 0,mag,earthquakes
0,-9.99,258
1,-9.00,29
2,-5.00,1
3,-2.60,2
4,-2.50,3
...,...,...
813,8.30,2
814,8.60,1
815,8.80,1
816,9.10,1


<img align="left" width="567" alt="Screen Shot 2022-04-26 at 11 25 50 AM" src="https://user-images.githubusercontent.com/73784742/165214234-ef54ffa6-cb46-4fc0-957f-6372f2bac71f.png">

The distribution peaks and is roughly symmetric around a value in the range of 1.1 to 1.4, with almost 40,000 earthquakes of each magnitude, but it has a second peak of almost 20,000 earthquakes around the value 4.4.

<img align="left" width="565" alt="Screen Shot 2022-04-26 at 11 27 26 AM" src="https://user-images.githubusercontent.com/73784742/165214415-3953995d-eab5-466a-a5b4-e4e2c149a608.png">

Here the frequencies of these very high-intensity earthquakes are easier to see, as is the decrease in frequency from more than 10 to only 1 as the value goes from the low 7s to over 8. Thankfully these temblors are extremely rare.

---

A second type of graph that can be used to characterize data and spot outliers is the **scatter plot**. A scatter plot is appropriate when the data set contains at least two numeric values of interest. The x-axis displays the range of values of the first data field, the y-axis displays the range of values of the second data field, and a dot is graphed for every pair of x and y values in the data set.

In [15]:
query_13 = """
        SELECT mag, depth
            ,count(*) as earthquakes
        FROM earthquakes
        GROUP BY 1,2
        ORDER BY 1,2
        """

select(query_13)

Unnamed: 0,mag,depth,earthquakes
0,-9.99,-0.59,1
1,-9.99,-0.35,1
2,-9.99,-0.11,1
3,-9.99,-0.08,1
4,-9.99,0.42,1
...,...,...,...
668084,,289.50,1
668085,,301.60,1
668086,,322.90,1
668087,,324.80,1


<img align="left" width="570" alt="Screen Shot 2022-04-26 at 11 32 50 AM" src="https://user-images.githubusercontent.com/73784742/165214998-71346e37-643b-4015-8730-bb41deb2de9e.png">

In this graph, we can see the same range of magnitudes, now plotted against the depths, which range from just below zero to around 700 kilometers. Interestingly, the high depth values, over 300, correspond to magnitudes that are roughly 4 and higher. Perhaps such deep earthquakes can be detected only after they reach a minimum magnitude.

---

A third type of graph useful in finding and analyzing outliers is the **box plot**, also known as the box-and-whisker plot. These graphs summarize data in the middle of the range of values while retaining the outliers. The graph type is named for the box, or rectangle, in the middle. The line that forms the bottom of the rectangle is located at the 25th percentile value, the line that forms the top is located at the 75th percentile, and the line through the middle is located at the 50th percentile, or median, value.

In [16]:
query_14 = """
        SELECT mag
        FROM earthquakes
        WHERE place like '%Japan%'
        ORDER BY 1
        """

select(query_14)

Unnamed: 0,mag
0,2.7
1,3.1
2,3.2
3,3.2
4,3.2
...,...
16031,7.4
16032,7.7
16033,7.8
16034,7.9


<img align="left" width="524" alt="Screen Shot 2022-04-26 at 11 38 18 AM" src="https://user-images.githubusercontent.com/73784742/165215570-162fa9c0-0fa0-4752-af10-99d7342d791d.png">

In [18]:
query_15 = """
        SELECT ntile_25, median, ntile_75
            ,(ntile_75 - ntile_25) * 1.5 as iqr
            ,ntile_25 - (ntile_75 - ntile_25) * 1.5 as lower_whisker
            ,ntile_75 + (ntile_75 - ntile_25) * 1.5 as upper_whisker
        FROM
        (
            SELECT percentile_cont(0.25) within group (order by mag) as ntile_25
            ,percentile_cont(0.5) within group (order by mag) as median
            ,percentile_cont(0.75) within group (order by mag) as ntile_75
            FROM earthquakes
            WHERE place like '%Japan%'
        ) a
        """

select(query_15)

Unnamed: 0,ntile_25,median,ntile_75,iqr,lower_whisker,upper_whisker
0,4.3,4.5,4.7,0.6,3.7,5.3


The median Japanese earthquake had a magnitude of 4.5, and the whiskers extend from 3.7 to 5.3. The plotted circles represent outlier earthquakes, both small and large. The Great Tohoku Earthquake of 2011, at 9.1, is an obvious outlier, even among the larger earthquakes Japan experienced.
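The whisker positions follow from the quartiles using the common convention of extending 1.5 times the interquartile range beyond the box, which is what `query_15` computes. A minimal sketch using the quartile values from the output (the function name `whiskers` is illustrative):

```python
def whiskers(q1, q3, multiplier=1.5):
    """Return (iqr, lower, upper) using the 1.5 x IQR convention."""
    iqr = q3 - q1
    reach = iqr * multiplier
    return iqr, q1 - reach, q3 + reach

iqr, lower, upper = whiskers(4.3, 4.7)
print(round(iqr, 2), round(lower, 2), round(upper, 2))   # 0.4 3.7 5.3
```

Note that the `iqr` column in the query output (0.6) holds the interquartile range already multiplied by 1.5, whereas this sketch returns the plain range of 0.4.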

<img align="left" width="537" alt="Screen Shot 2022-04-26 at 11 41 40 AM" src="https://user-images.githubusercontent.com/73784742/165216419-55c646fe-6832-4662-9afd-004471a96622.png">

In [19]:
query_16 = """
        SELECT date_part('year',time)::int as year
            ,mag
        FROM earthquakes
        WHERE place like '%Japan%'
        ORDER BY 1,2
        """

select(query_16)

Unnamed: 0,year,mag
0,2010,3.6
1,2010,3.7
2,2010,3.7
3,2010,3.8
4,2010,3.8
...,...,...
16031,2020,6.1
16032,2020,6.3
16033,2020,6.3
16034,2020,6.6


<img width="574" alt="Screen Shot 2022-04-26 at 11 44 38 AM" src="https://user-images.githubusercontent.com/73784742/165216699-2ee05333-7aa6-4bbf-a037-3d6c00be2675.png">

Although the median and the range of the boxes fluctuate a bit from year to year, they are consistently between 4 and 5. Japan experienced large outlier earthquakes every year, with at least one greater than 6.0, and in six of the years it experienced at least one earthquake at or larger than 7.0. Japan is undoubtedly a very seismically active region.

#### Anomalous Values

In [21]:
query_17 = """
        SELECT place, count(*)
        FROM earthquakes
        WHERE depth > 600
        GROUP BY 1
        """

select(query_17)

Unnamed: 0,place,count
0,"100km NW of Ndoi Island, Fiji",1
1,"100km SSW of Ndoi Island, Fiji",1
2,"100km SW of Ndoi Island, Fiji",1
3,"101km ENE of Suva, Fiji",1
4,"101km NNE of Ndoi Island, Fiji",1
...,...,...
672,South of the Fiji Islands,42
673,Strait of Gibraltar,1
674,"Sulawesi, Indonesia",1
675,Vanuatu region,46


Visual inspection suggests that many of these very deep earthquakes happen around Ndoi Island in Fiji. However, the place includes a distance and direction component, such as “100km NW of,” that makes summarization more difficult. We can apply some text parsing to focus on the place itself for better insights. For places that contain some values and then “ of ” and some more values, split on the “ of ” string and take the second part.
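The parsing rule can be tried out in Python before writing the SQL. The sketch below mirrors the idea of splitting on “ of ” and keeping the second part; the function name `place_name` is hypothetical, and the sample strings are place values that appear in the outputs.

```python
def place_name(place):
    """Return the second field when splitting on ' of ', or the whole value if absent."""
    if ' of ' in place:
        return place.split(' of ')[1]
    return place

print(place_name('100km NW of Ndoi Island, Fiji'))   # Ndoi Island, Fiji
print(place_name('Vanuatu region'))                  # Vanuatu region
print(place_name('South of the Fiji Islands'))       # the Fiji Islands
```

The last example shows a side effect of the rule: values such as “South of the Fiji Islands” lose their leading direction and are reduced to “the Fiji Islands.”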

In [23]:
query_18 = """
        SELECT
        case when place like '% of %' then split_part(place, ' of ', 2)
             else place end as place_name
            ,count(*)
        FROM earthquakes
        WHERE depth > 600
        GROUP BY 1
        ORDER BY 2 DESC
        """

select(query_18)

Unnamed: 0,place_name,count
0,"Ndoi Island, Fiji",487
1,Fiji region,186
2,"Lambasa, Fiji",140
3,the Fiji Islands,63
4,"Sola, Vanuatu",50
...,...,...
62,"Paciran, Indonesia",1
63,"Palu, Indonesia",1
64,"Pangai, Tonga",1
65,Peru-Brazil border region,1


Anomalies can come in the form of misspellings, variations in capitalization, or other text errors. The ease of finding these depends on the number of distinct values, or cardinality, of the field. Differences in capitalization can be detected by counting both the distinct values and the distinct values when a lower or upper function is applied.

In [24]:
query_19 = """
        SELECT count(distinct type) as distinct_types
            ,count(distinct lower(type)) as distinct_lower
        FROM earthquakes
        """

select(query_19)

Unnamed: 0,distinct_types,distinct_lower
0,25,24


There are 24 distinct values of the type field, but 25 different forms. To find the specific types, we can use a calculation to flag those values whose lowercase form doesn’t match the actual value. Including the count of records for each form will help contextualize so that we can later decide how to handle the values.

In [25]:
query_20 = """
        SELECT type
            ,lower(type)
            ,type = lower(type) as flag
            ,count(*) as records
        FROM earthquakes
        GROUP BY 1,2,3
        ORDER BY 2,4 desc
        """

select(query_20)

Unnamed: 0,type,lower,flag,records
0,accidental explosion,accidental explosion,True,1
1,acoustic noise,acoustic noise,True,2
2,building collapse,building collapse,True,5
3,chemical explosion,chemical explosion,True,131
4,collapse,collapse,True,1
5,earthquake,earthquake,True,1461750
6,experimental explosion,experimental explosion,True,6
7,explosion,explosion,True,9887
8,ice quake,ice quake,True,10136
9,Ice Quake,ice quake,False,1


The anomalous value of “Ice Quake” is easy to spot, since it is the only value for which the flag calculation returns false. Since there is only one record with this value, compared to 10,136 with the lowercase form, we can assume that it can be grouped together with the other records.

Other text functions can be applied, such as `trim` if we suspect that the values contain extra leading or trailing spaces, or `replace` if we suspect that certain spellings have multiple forms, such as the number “2” and the word “two.”
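The same checks are easy to prototype in Python on a small sample before running them against the full table. The list below is made up for illustration, and `normalize` is a hypothetical helper that combines the effect of `trim` and `lower`.

```python
from collections import Counter

# Made-up sample with a capitalization variant and a leading space.
types = ['ice quake', 'Ice Quake', ' explosion', 'explosion', 'earthquake']

def normalize(value):
    """Trim surrounding whitespace and lowercase, like trim() and lower() in SQL."""
    return value.strip().lower()

raw_counts = Counter(types)
clean_counts = Counter(normalize(t) for t in types)
print(len(raw_counts), len(clean_counts))   # 5 3

# Values that change under normalization are the ones worth reviewing.
flagged = [t for t in types if t != normalize(t)]
print(flagged)                              # ['Ice Quake', ' explosion']
```

Comparing the distinct counts before and after normalization is the same idea as comparing `count(distinct type)` with `count(distinct lower(type))`.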

#### Anomalous Counts or Frequencies

In [28]:
query_21 = """
        SELECT date_trunc('year',time)::date as earthquake_year
            ,count(*) as earthquakes
        FROM earthquakes
        GROUP BY 1
        ORDER BY 1
        """

select(query_21)

Unnamed: 0,earthquake_year,earthquakes
0,2010-01-01,122322
1,2011-01-01,107397
2,2012-01-01,105693
3,2013-01-01,114368
4,2014-01-01,135247
5,2015-01-01,122914
6,2016-01-01,122420
7,2017-01-01,130622
8,2018-01-01,179304
9,2019-01-01,171116


We can see that 2011 and 2012 had low numbers of earthquakes compared to other years. There was also a sharp increase in records in 2018 that was sustained through 2019 and 2020. This seems unusual, and we can hypothesize that the earth became more seismically active suddenly, that there is an error in the data such as duplication of records, or that something changed in the data collection process.

In [29]:
query_22 = """
        SELECT date_trunc('month',time)::date as earthquake_month
            ,count(*) as earthquakes
        FROM earthquakes
        GROUP BY 1
        ORDER BY 1
        """

select(query_22)

Unnamed: 0,earthquake_month,earthquakes
0,2010-01-01,9651
1,2010-02-01,7697
2,2010-03-01,7750
3,2010-04-01,19380
4,2010-05-01,11407
...,...,...
127,2020-08-01,15510
128,2020-09-01,13184
129,2020-10-01,15226
130,2020-11-01,12618


<img align="left" width="575" alt="Screen Shot 2022-04-26 at 12 15 53 PM" src="https://user-images.githubusercontent.com/73784742/165219989-80bc5912-8eff-4384-a35a-5a77e87058e1.png">

We can see that although the number of earthquakes varies from month to month, there does appear to be an overall increase starting in 2017. We can also see that there are three outlier months, in April 2010, July 2018, and July 2019.

---

From here we can continue checking the data at more granular time periods, optionally filtering the result set by a range of dates to focus on these anomalous stretches of time. After narrowing down the specific days or even times of day when the spikes occurred, we might want to break the data down further by other attributes in the data set. This can help explain the anomalies or at least narrow down the conditions in which they occurred.

For example, it turns out that the increase in earthquakes starting in 2017 can be at least partially explained by the status field. The status indicates whether the event has been reviewed by a human (“reviewed”) or was directly posted by a system without review (“automatic”).

In [30]:
query_23 = """
        SELECT date_trunc('month',time)::date as earthquake_month
            ,status
            ,count(*) as earthquakes
        FROM earthquakes
        GROUP BY 1,2
        ORDER BY 1
        """

select(query_23)

Unnamed: 0,earthquake_month,status,earthquakes
0,2010-01-01,automatic,620
1,2010-01-01,reviewed,9031
2,2010-02-01,automatic,695
3,2010-02-01,reviewed,7002
4,2010-03-01,automatic,842
...,...,...,...
267,2020-10-01,reviewed,12856
268,2020-11-01,automatic,2070
269,2020-11-01,reviewed,10548
270,2020-12-01,automatic,3956


<img align="left" width="570" alt="Screen Shot 2022-04-26 at 12 20 46 PM" src="https://user-images.githubusercontent.com/73784742/165220461-482fa837-030b-4836-81da-ed453af1441f.png">

In the graph, we can see that the outlier counts in July 2018 and July 2019 are due to large increases in the number of “automatic”-status earthquakes, whereas the spike in April 2010 was in “reviewed”-status earthquakes. A new type of automatic recording equipment may have been added to the data set in 2017, or perhaps there hasn’t been enough time to review all the recordings yet.

In [31]:
query_24 = """
        SELECT place, count(*) as earthquakes
        FROM earthquakes
        WHERE mag >= 6
        GROUP BY 1
        ORDER BY 2 desc
        """

select(query_24)

Unnamed: 0,place,earthquakes
0,"near the east coast of Honshu, Japan",52
1,"off the east coast of Honshu, Japan",34
2,Vanuatu,28
3,"New Britain region, Papua New Guinea",13
4,Solomon Islands,13
...,...,...
1119,"125km SSW of Tarauaca, Brazil",1
1120,"125km SW of Puerto Madero, Mexico",1
1121,"126km S of Kokopo, Papua New Guinea",1
1122,"127km NE of Bathsheba, Barbados",1


In [32]:
query_25 = """
        SELECT
        case when place like '% of %' then split_part(place, ' of ', 2)
             else place
             end as place
            ,count(*) as earthquakes
        FROM earthquakes
        WHERE mag >= 6
        GROUP BY 1
        ORDER BY 2 desc
        """

select(query_25)

Unnamed: 0,place,earthquakes
0,"Honshu, Japan",89
1,Vanuatu,28
2,"Lata, Solomon Islands",28
3,the Fiji Islands,21
4,"Hihifo, Tonga",20
...,...,...
608,"Port Blair, India",1
609,"Port Mathurin, Mauritius",1
610,"Puerto El Triunfo, El Salvador",1
611,"Puerto Morazan, Nicaragua",1


The region around Honshu, Japan, experienced 89 earthquakes, making it not only the location of the largest earthquake in the data set but also an outlier in the number of very large earthquakes recorded.

#### Anomalies from the Absence of Data