# Seattle rain data

The data was pulled from [this Kaggle data set](https://www.kaggle.com/rtatman/did-it-rain-in-seattle-19482017). Three rows containing "NA" values have been removed.

The name of the database is `ex_seattle_weather`.

If you think you have found an error in the result set, please open an issue on GitHub.

## Date information

Different database engines manipulate dates in different ways. You typically won't be marked down for not knowing the exact date functions in PostgreSQL vs. Oracle vs. MySQL, and so on.

For the following exercises, it is useful to know the `EXTRACT` function:
```sql
SELECT EXTRACT('month' FROM date_weather) AS month FROM weather LIMIT 5;
```
This returns 5 rows containing only the month extracted from the `date_weather` column.
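
To illustrate how engines differ, here is a minimal Python sketch that applies the same "extract the month" idea in SQLite, which has no `EXTRACT` and uses `strftime` instead. The in-memory table and its two rows are made up for illustration; only the column name `date_weather` comes from the exercises.

```python
import sqlite3

# Hypothetical in-memory table; these rows are illustrative, not from the data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date_weather TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?)",
    [("2000-12-01 00:00:00",), ("2001-01-15 00:00:00",)],
)

# SQLite counterpart of PostgreSQL's EXTRACT('month' FROM date_weather):
# strftime returns text such as '12', so cast it to an integer.
rows = conn.execute(
    "SELECT CAST(strftime('%m', date_weather) AS INTEGER) AS month "
    "FROM weather ORDER BY date_weather"
).fetchall()
months = [row[0] for row in rows]
print(months)
```

The logic is the same in both engines; only the function name and the return type differ.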

In [1]:
%load_ext sql
%sql postgres://localhost/ex_seattle_weather

'Connected: @ex_seattle_weather'

## Questions

1. **Select all rows from December 1st, 2000 to December 15th, 2000 (inclusive)**

In [2]:
%%sql
SELECT 
       * 
FROM 
      weather
WHERE 
      date_weather >= timestamp '2000-12-01'
  AND
      date_weather <= timestamp '2000-12-15'
;

 * postgres://localhost/ex_seattle_weather
15 rows affected.


date_weather,inches_rain,temp_max,temp_min,did_rain
2000-12-01 00:00:00,0.04,55.0,39.0,True
2000-12-02 00:00:00,0.18,51.0,37.0,True
2000-12-03 00:00:00,0.0,44.0,34.0,False
2000-12-04 00:00:00,0.0,51.0,37.0,False
2000-12-05 00:00:00,0.0,50.0,36.0,False
2000-12-06 00:00:00,0.0,50.0,35.0,False
2000-12-07 00:00:00,0.0,40.0,34.0,False
2000-12-08 00:00:00,0.02,45.0,30.0,True
2000-12-09 00:00:00,0.06,43.0,36.0,True
2000-12-10 00:00:00,0.0,40.0,30.0,False


2. **Get the average maximum temperature for every year from the year 2000 onward. Order the results by year (ascending)**

In [3]:
%%sql
SELECT 
    DATE_PART('year', date_weather)::int AS year
    , AVG(temp_max)::numeric(5,2) AS avg_max_temp
FROM 
    weather
WHERE
    DATE_PART('year', date_weather)::int >= 2000
GROUP BY
    year
ORDER BY 
    year
;

 * postgres://localhost/ex_seattle_weather
18 rows affected.


year,avg_max_temp
2000,58.67
2001,58.47
2002,58.89
2003,60.44
2004,60.62
2005,60.15
2006,61.04
2007,59.2
2008,58.49
2009,59.91


3. **Get the standard deviation of the maximum temperature per year, from 2000 onward. Order by year (ascending)**

In [4]:
%%sql
SELECT 
    DATE_PART('year', date_weather)::INT AS year
    , STDDEV(temp_max)::NUMERIC(5,2) AS std_dev_temp_max
FROM 
    weather
WHERE
    DATE_PART('year', date_weather)::INT >= 2000
GROUP BY
    year
ORDER BY 
    year
;

 * postgres://localhost/ex_seattle_weather
18 rows affected.


year,std_dev_temp_max
2000,11.49
2001,11.18
2002,12.31
2003,12.87
2004,12.61
2005,11.89
2006,13.05
2007,12.92
2008,13.0
2009,14.23


4. **What are the 10 hottest days on record? Take hottest to mean 'highest maximum temperature'.**

In [5]:
%%sql
SELECT
    DATE(date_weather), temp_max
FROM
    weather
ORDER BY 
    temp_max DESC
LIMIT 10;

 * postgres://localhost/ex_seattle_weather
10 rows affected.


date,temp_max
2009-07-29,103.0
1994-07-20,100.0
1981-08-09,99.0
1991-07-23,99.0
1960-08-09,99.0
1981-08-10,98.0
1960-08-08,98.0
1988-09-02,98.0
1979-07-16,98.0
1967-08-16,98.0


5. **In 2016, what fraction of days did it rain?**

In [6]:
%%sql
WITH rain16 AS (
    SELECT 
        /* count the number of measurements in 2016 as year_days */
        count(*) AS year_days,  
    
        /* count the number of measurements with rain as rain_days */
        SUM(
            CASE 
                WHEN did_rain THEN 1 
                ELSE 0 
            END
        ) AS rain_days
    FROM 
        weather
    WHERE 
        DATE_PART('year', date_weather) = 2016
)
SELECT 
    rain_days
    , year_days
    , rain_days::FLOAT / year_days::FLOAT AS fraction
FROM 
    rain16
;

 * postgres://localhost/ex_seattle_weather
1 rows affected.


rain_days,year_days,fraction
172,366,0.469945355191257


6. **What is the 75th percentile for the amount of rain that fell on a day where _there was some rain_ in 2016?**

Hint: This is a somewhat advanced question. There are a couple of ways of approaching it:
  - count the number of rows with some rain, order them, and take the one ranked at the 75th percentile. This works on all RDBMSes.
  - use a function such as `percent_rank` or `percentile_cont`, which PostgreSQL provides but not every RDBMS supports

Note that these approaches give _slightly_ different answers, depending on whether you interpolate to find the 75th percentile or not.
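
The difference between the two approaches can be sketched in Python with an in-memory SQLite table. The eight rainfall values below are hypothetical, chosen only so the two answers differ visibly. The first approach counts the rows and picks the one at rank `ceil(0.75 * n)`; the second interpolates linearly at position `0.75 * (n - 1)`, which is how `percentile_cont` is defined.

```python
import math
import sqlite3

# Hypothetical rainfall amounts (inches) for rainy days; not taken from the data set.
rain = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 1.00]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (inches_rain REAL)")
conn.executemany("INSERT INTO weather VALUES (?)", [(r,) for r in rain])

# Approach 1: count the rows, order them, and take the row at rank ceil(0.75 * n).
(n,) = conn.execute("SELECT COUNT(*) FROM weather").fetchone()
rank = math.ceil(0.75 * n)
(nearest_rank,) = conn.execute(
    "SELECT inches_rain FROM weather ORDER BY inches_rain LIMIT 1 OFFSET ?",
    (rank - 1,),
).fetchone()

# Approach 2: interpolate between the two neighbouring rows (0-based position).
values = [row[0] for row in conn.execute(
    "SELECT inches_rain FROM weather ORDER BY inches_rain"
)]
pos = 0.75 * (n - 1)
lower, upper = math.floor(pos), math.ceil(pos)
interpolated = values[lower] + (pos - lower) * (values[upper] - values[lower])

print(nearest_rank, round(interpolated, 3))
```

With these made-up values the ranked row is 0.40 while the interpolated value is about 0.425, so neither answer is wrong; they follow different definitions of a percentile.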

Answer: approximately 0.33 inches

In [7]:
%%sql
SELECT
    PERCENTILE_CONT(.75) WITHIN GROUP (ORDER BY inches_rain ASC)
FROM
    weather
WHERE
    DATE_PART('year', date_weather)::INT = 2016
AND
    did_rain
;

 * postgres://localhost/ex_seattle_weather
1 rows affected.


percentile_cont
0.330000013113022


7. **What is the 75th percentile for the amount of rain that fell on any day in 2016?**

Answer: approximately 0.15 inches

In [8]:
%%sql
SELECT
    PERCENTILE_CONT(.75) WITHIN GROUP (ORDER BY inches_rain ASC)
FROM
    weather
WHERE
    DATE_PART('year', date_weather)::int = 2016
;

 * postgres://localhost/ex_seattle_weather
1 rows affected.


percentile_cont
0.150000005960464


8. **Get the 10 years with the hottest average maximum temperature in July. Order from hottest to coolest**

In [9]:
%%sql
SELECT 
    DATE_PART('year', date_weather)::int AS year
    , AVG(temp_max)::NUMERIC(5,2) AS avg_july_temp_max
FROM
    weather
WHERE
    DATE_PART('month', date_weather)::int = 7
GROUP BY 
    DATE_PART('year', date_weather)
ORDER BY 
    AVG(temp_max) DESC
LIMIT 
    10
;

 * postgres://localhost/ex_seattle_weather
10 rows affected.


year,avg_july_temp_max
2015,82.58
1958,81.42
2009,80.97
1985,80.94
2014,80.42
1960,79.65
1965,79.45
1990,79.19
2003,78.97
1994,78.97


9. **Get the 10 years with the coldest average minimum temperature in December. Order from coolest to hottest**

In [10]:
%%sql
SELECT 
    DATE_PART('year', date_weather)::int AS year
    , AVG(temp_min) AS avg_dec_temp_min
FROM
    weather
WHERE
    DATE_PART('month', date_weather)::int = 12
GROUP BY 
    DATE_PART('year', date_weather)::int
ORDER BY 
    AVG(temp_min) ASC
LIMIT 
    10
;

 * postgres://localhost/ex_seattle_weather
10 rows affected.


year,avg_dec_temp_min
1990,30.3870967741935
1948,30.8064516129032
1985,30.9354838709677
1951,31.2258064516129
1964,31.4838709677419
1983,31.5161290322581
1968,32.0322580645161
1984,32.0967741935484
2009,32.0967741935484
1978,32.1612903225806


10. **Repeat the last question, but round the temperatures to 3 decimal places**

HINT: If using the `ROUND` function, you will need to cast your results from `REAL` to `NUMERIC` first. The two-argument form `ROUND(value, places)` is only defined for `NUMERIC`, and PostgreSQL will not implicitly cast `REAL` to `NUMERIC`.

Format for casting:

```sql
SELECT temp_min::numeric FROM weather
``` 
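
As background for the hint, the following Python sketch shows why `REAL` values print with trailing digits and what rounding to 3 decimal places does. It assumes the columns are stored as PostgreSQL `REAL`, a 4-byte float, as the hint implies. The helper `as_real` is a hypothetical name used only here, and `Decimal` stands in for `NUMERIC`.

```python
import struct
from decimal import Decimal, ROUND_HALF_UP


def as_real(x: float) -> float:
    """Round-trip a Python float through 4-byte IEEE 754 storage."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 0.33 cannot be stored exactly in 4 bytes, so extra digits appear
# when the stored value is widened to double precision.
stored = as_real(0.33)
print(f"{stored:.15g}")

# Decimal stands in for NUMERIC here: an exact decimal type that can be
# rounded to a fixed number of places, with ties rounded away from zero.
avg = Decimal("30.3870967741935")  # 1990 value from the previous question
rounded = avg.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
print(rounded)
```

The first printed value matches the `percentile_cont` output in question 6, and the second matches the first row of the rounded result for this question.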

In [11]:
%%sql
SELECT 
    DATE_PART('year', date_weather)::int AS year
    , AVG(temp_min)::NUMERIC(5,3) AS avg_dec_temp_min
FROM
    weather
WHERE
    DATE_PART('month', date_weather)::int = 12
GROUP BY 
    DATE_PART('year', date_weather)::int
ORDER BY 
    AVG(temp_min) ASC
LIMIT 
    10;

 * postgres://localhost/ex_seattle_weather
10 rows affected.


year,avg_dec_temp_min
1990,30.387
1948,30.806
1985,30.935
1951,31.226
1964,31.484
1983,31.516
1968,32.032
1984,32.097
2009,32.097
1978,32.161


11. **Given the results of the previous queries, would it be fair to use this data to claim that 2015 had the "hottest July on record"? Why or why not?**

Answer: No SQL for this question; it is about interpreting the results.

There are many different ways to define "hottest on record". By the definition of "highest monthly average of the daily maximum temperature", 2015 wins. But the hottest-days query (question 4) includes July days from other years that were hotter than any July day in 2015. In general, the average is not robust to outliers. The median temperature, or the total number of days above some threshold, might also be ways of determining the hottest July on record.
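
A small Python sketch makes the point concrete. The two Julys below are entirely hypothetical, and the 90-degree threshold and the `summarize` helper are assumptions chosen for illustration. One month is steadily warm; the other is cooler overall but contains a short heat wave.

```python
from statistics import mean, median

# Hypothetical daily maximum temperatures for two made-up Julys.
steady_july = [80] * 31             # consistently warm, no extremes
spiky_july = [75] * 28 + [100] * 3  # cooler overall, with a 3-day heat wave


def summarize(temps, threshold=90):
    """Return several candidate 'hottest month' metrics for one month."""
    return {
        "mean": round(mean(temps), 2),
        "median": median(temps),
        "max": max(temps),
        "days_at_or_above_threshold": sum(t >= threshold for t in temps),
    }


steady = summarize(steady_july)
spiky = summarize(spiky_july)
print(steady)
print(spiky)
```

The steady month wins on mean and median, while the spiky month wins on the single hottest day and on the count of days at or above the threshold, so the "hottest July" depends on which metric is chosen.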

12. **Give the average inches of rain that fell per day for each month, where the average is taken over 2000 - 2010 (inclusive).** 

In [12]:
%%sql
SELECT 
    DATE_PART('month', date_weather)::int as month
    , AVG(inches_rain) as avg_inches_rain
FROM
    weather
WHERE
        date_weather >= timestamp '2000-01-01'
    AND
        date_weather <  timestamp '2011-01-01'
GROUP BY 
    DATE_PART('month', date_weather)
ORDER BY 
    DATE_PART('month', date_weather) ASC
;

 * postgres://localhost/ex_seattle_weather
12 rows affected.


month,avg_inches_rain
1,0.191612903460248
2,0.094276527248299
3,0.113577712622206
4,0.0853636359626597
5,0.0680351905699524
6,0.050181818189043
7,0.0161290320395724
8,0.0343695011842024
9,0.056930091577415
10,0.115542522634317
