# Tutorial Exercises

This week's tutorial exercises focus on indexing and obtaining descriptive statistics

### Set up Python Libraries

As usual you will need to run this code block to import the relevant Python libraries

In [3]:
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas 
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

### Import a dataset to work with

You will need to download the file OxfordWeather.csv from Canvas to your computer, then import it

In [4]:
weather = pandas.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook/main/data/OxfordWeather.csv")
display(weather)

Unnamed: 0,YYYY,MM,DD,Tmax,Tmin,Tmean,Trange,Rainfall_mm
0,1827,1,1,8.3,5.6,7.0,2.7,0.0
1,1827,1,2,2.2,0.0,1.1,2.2,0.0
2,1827,1,3,-2.2,-8.3,-5.3,6.1,9.7
3,1827,1,4,-1.7,-7.8,-4.8,6.1,0.0
4,1827,1,5,0.0,-10.6,-5.3,10.6,0.0
...,...,...,...,...,...,...,...,...
71338,2022,4,26,15.2,4.1,9.7,11.1,0.0
71339,2022,4,27,10.7,2.6,6.7,8.1,0.0
71340,2022,4,28,12.7,3.9,8.3,8.8,0.0
71341,2022,4,29,11.7,6.7,9.2,5.0,0.0


## Exercises

In the following questions, we use descriptive statistics and indexing to answer some questions about the weather and climate in Oxford.

Where you are asked to calculate a value (such as the mean) rather than output a table, you should **report your answer in words** in the text box below the code block.

Where the question asks you to "comment", you are simply being asked to engage with the data and explain what you notice in plain English. Please discuss with your fellow students and your tutor, as this is a really important skill for data analysis.

### Part 1: Heat

#### a. What was the hottest temperature on record?

Note that the dataset ends in April 2022 and therefore does not include the record heatwave of summer 2022.

In [12]:
# Your code here
weather.Tmax.max()

36.5

*Your text here*

#### b. On what date did the hottest temperature occur?

Hint: you could use `df.query()` to help you here

In [15]:
# Your code here
weather.query('Tmax == 36.5')

Unnamed: 0,YYYY,MM,DD,Tmax,Tmin,Tmean,Trange,Rainfall_mm
70332,2019,7,25,36.5,16.4,26.5,20.1,0.0


*Your text here*
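
The query above hard-codes the value 36.5 found in part (a). As a minimal sketch of two ways to avoid typing the number in by hand, the example below uses a small made-up dataframe (`toy` and its values are invented for illustration, not taken from OxfordWeather.csv). `df.query()` can refer to a Python variable using `@`, and `idxmax()` returns the index label of the first maximum.

```python
import pandas

# Toy data, invented for illustration (not real Oxford records)
toy = pandas.DataFrame({
    "YYYY": [2001, 2001, 2002, 2002],
    "MM":   [7, 8, 7, 8],
    "DD":   [14, 3, 21, 9],
    "Tmax": [29.1, 31.4, 33.0, 30.2],
})

# Option 1: store the maximum in a variable and reference it with @
hottest = toy.Tmax.max()
hottest_rows = toy.query("Tmax == @hottest")

# Option 2: idxmax() gives the index label of the (first) maximum,
# which .loc then uses to pull out that single row
hottest_row = toy.loc[toy.Tmax.idxmax()]

print(hottest_rows)
print(hottest_row)
```

Option 1 returns every row that ties for the maximum, whereas Option 2 returns only the first one, so the two can differ if the record was reached on more than one day.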

#### c. Display the 10 hottest days on record and comment

Hint: you can use `df.sort_values()` and `df.head()` or `df.tail()` to help you here

In [22]:
# Your code here
# Sort ascending by Tmax, so the 10 hottest days are the last 10 rows
weather.sort_values(by='Tmax').tail(10)

Unnamed: 0,YYYY,MM,DD,Tmax,Tmin,Tmean,Trange,Rainfall_mm
70715,2020,8,11,34.1,17.9,26.0,16.2,0.0
54599,1976,6,27,34.3,17.8,26.1,16.5,0.0
70716,2020,8,12,34.4,20.4,27.4,14.0,8.4
64503,2003,8,9,34.6,16.0,25.3,18.6,0.0
30900,1911,8,9,34.8,15.2,25.0,19.6,0.0
65578,2006,7,19,34.8,15.4,25.1,19.4,0.2
38581,1932,8,19,35.1,16.3,25.7,18.8,0.0
59749,1990,8,3,35.1,16.4,25.8,18.7,0.0
70704,2020,7,31,35.1,14.8,25.0,20.3,0.0
70332,2019,7,25,36.5,16.4,26.5,20.1,0.0


*Your comment here*

Six out of the ten hottest days on record occurred in the last 20 years
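
Sorting and then taking the tail is one way to get the largest values. A possible alternative is `df.nlargest()`, which returns the top rows directly with the largest first. The sketch below compares the two on a small invented dataframe (the `toy` values are made up, not real records).

```python
import pandas

# Toy data, invented for illustration
toy = pandas.DataFrame({
    "YYYY": [1990, 1995, 2003, 2006, 2019, 2020],
    "Tmax": [31.0, 29.5, 33.2, 32.1, 35.0, 34.4],
})

# Sort ascending and take the last 3 rows (hottest day ends up at the bottom)
top3_sorted = toy.sort_values(by="Tmax").tail(3)

# nlargest picks out the same rows, but with the hottest day first
top3_largest = toy.nlargest(3, "Tmax")

print(top3_sorted)
print(top3_largest)
```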

#### d. Find the mean of maximum daily temperature (Tmax) for each month and comment

Hint: you can use `df.groupby()` to help you here

In [41]:
# Your code here
weather.groupby('MM').Tmax.mean()

MM
1      6.554444
2      7.401048
3      9.944914
4     13.187517
5     16.795252
6     20.011487
7     21.799007
8     21.192936
9     18.451043
10    14.112639
11     9.640041
12     7.290571
Name: Tmax, dtype: float64

*Your comment here*

Unsurprisingly the warmest months are in summer and the coolest are in winter

#### e. Make a table displaying the mean and standard deviation of Tmax in each month

Hint: A combination of `df.agg()` and `df.groupby()` will help you here

In [43]:
# Your code here
weather.groupby('MM').agg({'Tmax':['mean', 'std']})

MM,Tmax mean,Tmax std
1,6.554444,3.831624
2,7.401048,3.72329
3,9.944914,3.641816
4,13.187517,3.648047
5,16.795252,3.761523
6,20.011487,3.585932
7,21.799007,3.511055
8,21.192936,3.232944
9,18.451043,3.088003
10,14.112639,3.090256


#### f. Make a table displaying the mean of Tmax and Tmin in each month

Hint: A combination of `df.agg()` and `df.groupby()` will help you here

In [46]:
# Your code here
weather.groupby('MM').agg({'Tmax':['mean'], 'Tmin':['mean']})

MM,Tmax mean,Tmin mean
1,6.554444,1.319437
2,7.401048,1.470683
3,9.944914,2.39684
4,13.187517,4.301786
5,16.795252,7.165062
6,20.011487,10.328291
7,21.799007,12.238098
8,21.192936,11.965261
9,18.451043,9.824855
10,14.112639,6.874028
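
Passing a dictionary to `df.agg()` produces a table with two header rows (column name, then statistic). If flat column names are preferred, pandas also supports "named aggregation", sketched below on a small invented dataframe. The output column names `Tmax_mean`, `Tmax_std` and `Tmin_mean` are chosen here for illustration and are not part of the exercise.

```python
import pandas

# Toy data, invented for illustration: three days in each of two months
toy = pandas.DataFrame({
    "MM":   [1, 1, 1, 7, 7, 7],
    "Tmax": [5.0, 7.0, 9.0, 20.0, 22.0, 24.0],
    "Tmin": [0.0, 1.0, 2.0, 11.0, 12.0, 13.0],
})

# Named aggregation: new_column_name=(source_column, statistic)
monthly = toy.groupby("MM").agg(
    Tmax_mean=("Tmax", "mean"),
    Tmax_std=("Tmax", "std"),
    Tmin_mean=("Tmin", "mean"),
)

print(monthly)
```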


### Part 2: Rain

#### a. Run this code block to add a column called <tt>wet</tt> containing a <tt>True</tt> for days on which it rained and <tt>False</tt> otherwise

We will practice adding columns in a later session

In [33]:
# Your code here
weather['wet']=weather.Rainfall_mm>0
weather

Unnamed: 0,YYYY,MM,DD,Tmax,Tmin,Tmean,Trange,Rainfall_mm,wet
0,1827,1,1,8.3,5.6,7.0,2.7,0.0,False
1,1827,1,2,2.2,0.0,1.1,2.2,0.0,False
2,1827,1,3,-2.2,-8.3,-5.3,6.1,9.7,True
3,1827,1,4,-1.7,-7.8,-4.8,6.1,0.0,False
4,1827,1,5,0.0,-10.6,-5.3,10.6,0.0,False
...,...,...,...,...,...,...,...,...,...
71338,2022,4,26,15.2,4.1,9.7,11.1,0.0,False
71339,2022,4,27,10.7,2.6,6.7,8.1,0.0,False
71340,2022,4,28,12.7,3.9,8.3,8.8,0.0,False
71341,2022,4,29,11.7,6.7,9.2,5.0,0.0,False


#### b. What is the proportion of wet days overall?

Hint: The values <tt>True</tt> and <tt>False</tt> can be treated as <tt>1</tt> and <tt>0</tt> respectively.
    
To get the proportion of days on which <tt>wet==True</tt>, we can use a programming trick, which is to simply take the mean of the column <tt>wet</tt>:
    
* say there are 100 days in my sample
    * for 66 of them, <tt>wet==True==1</tt>
    * for the other 34, <tt>wet==False==0</tt>
* If we take the mean, this gives us the proportion of wet days because we:
    * add up all the values (answer=66) 
    * divide by the number of cases (100)
    * result is 66/100 = 0.66 or 66%, the proportion of wet days
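
As a quick check of this reasoning, here is a minimal sketch on a made-up series of ten daily rainfall totals (the values are invented, not taken from the Oxford data). It shows that the mean of a True/False column equals the number of True values divided by the number of rows.

```python
import pandas

# Toy rainfall totals in mm for ten days, invented for illustration
rain = pandas.Series([0.0, 2.5, 0.0, 0.3, 0.0, 0.0, 1.2, 0.0, 4.0, 0.0])

# True on days with any rain, False otherwise
wet = rain > 0

n_wet = wet.sum()      # True counts as 1, so this counts the wet days
prop_wet = wet.mean()  # sum divided by number of days

print(n_wet, len(wet), prop_wet)
```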

In [48]:
# your code here
weather.wet.mean()

0.46311481154423

*Your text here*

It rains on 46% of days

#### c. What is the proportion of wet days in each month? Comment on your findings

Hint: use `df.groupby()`

In [51]:
# your code here
weather.groupby('MM').wet.mean()

MM
1     0.530941
2     0.479046
3     0.448321
4     0.441156
5     0.422829
6     0.405128
7     0.414723
8     0.431596
9     0.421538
10    0.503722
11    0.529231
12    0.528536
Name: wet, dtype: float64

*Your comments here*

The proportion of wet days in each month is always between 40% and 53% (i.e., it rains a lot).
Unsurprisingly, the driest months are in early summer (May, June, July) and the months with the most wet days are in winter (December/January).

#### d. What is the mean quantity of rainfall (in mm) in each month? Comment on your findings

In [52]:
# your code here
weather.groupby('MM').Rainfall_mm.mean()

MM
1     1.768186
2     1.513909
3     1.384546
4     1.471871
5     1.666600
6     1.813607
7     1.888238
8     1.935203
9     1.889658
10    2.173002
11    2.043350
12    1.878412
Name: Rainfall_mm, dtype: float64

*Your comment here*

The quantity of rain follows a different pattern from the proportion of wet days - March has the lowest mean rainfall (perhaps it tends to drizzle in March?!) and October has the highest (it rains like it means it)

#### e. Display the 10 wettest days on record and comment

In [40]:
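# Your code here
# tail(25) shows the 25 wettest days, a wider view than the 10 asked for; use tail(10) for exactly ten
weather.sort_values(by='Rainfall_mm').tail(25)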

Unnamed: 0,YYYY,MM,DD,Tmax,Tmin,Tmean,Trange,Rainfall_mm,wet
17726,1875,7,14,15.1,11.9,13.5,3.2,42.5,True
17823,1875,10,19,10.7,8.9,9.8,1.8,43.0,True
52078,1969,8,2,24.7,16.5,20.6,8.2,43.4,True
32407,1915,9,24,19.3,12.7,16.0,6.6,43.7,True
20636,1883,7,2,25.7,16.1,20.9,9.6,43.9,True
65581,2006,7,22,29.5,19.1,24.3,10.4,44.2,True
51336,1967,7,22,22.8,10.4,16.6,12.4,44.5,True
19553,1880,7,14,15.8,13.7,14.8,2.1,46.0,True
21755,1886,7,25,21.9,10.9,16.4,11.0,46.3,True
62725,1998,9,26,19.9,12.8,16.4,7.1,46.4,True


*Your comment here*

* Almost all the wettest days are in summer, suggesting extreme rainfall is more likely in summer (perhaps due to convection storms?)
* There is no obvious trend for more of the wettest days to be recent (evidence for climate change less clear than in temperature data)

#### f. Compare and contrast the different findings in part 2 c,d, and e

Different descriptive statistics tell us different things about the same data!

*Your comments here!*

Bring together the observations on rainfall
* It rains on pretty much half of all days in Oxford!
* Rain is more likely in winter than summer
* Mean daily rainfall is greatest in autumn and lowest in spring
* Extreme rain is almost always in summer
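
One way to see these contrasts side by side is to put several statistics for each month into a single table. The sketch below does this on a small invented dataframe (the numbers are made up to exaggerate the pattern, and the column names `prop_wet`, `mean_rain`, `max_rain` and `mean_rain_wet_days` are hypothetical). It also adds the mean rainfall on wet days only, which helps separate "it rains often" from "it rains heavily".

```python
import pandas

# Toy data, invented for illustration: a drizzly month (1) and a stormy month (7)
toy = pandas.DataFrame({
    "MM":          [1, 1, 1, 1, 7, 7, 7, 7],
    "Rainfall_mm": [1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 20.0],
})
toy["wet"] = toy.Rainfall_mm > 0

# Proportion of wet days, mean rainfall and wettest day, per month
summary = toy.groupby("MM").agg(
    prop_wet=("wet", "mean"),
    mean_rain=("Rainfall_mm", "mean"),
    max_rain=("Rainfall_mm", "max"),
)

# Mean rainfall on wet days only (rows are matched up by month)
summary["mean_rain_wet_days"] = toy[toy.wet].groupby("MM").Rainfall_mm.mean()

print(summary)
```

In this toy example month 1 has more wet days but month 7 has more rain overall and a far larger extreme, which is the same kind of contrast as in parts c, d and e.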

### Part 3: Snow

#### a. Create a dataframe containing the weather on Christmas day, for all the years in which there was a White Christmas 

Hint: we don't have a column telling us when it has snowed, but it is reasonable to assume this happens when the minimum temperature dips below zero and Rainfall_mm is above zero.

In [61]:
# Your code here
WhiteChristmas = weather.query('MM==12 and DD==25 and Tmin<0 and wet==True')
WhiteChristmas

Unnamed: 0,YYYY,MM,DD,Tmax,Tmin,Tmean,Trange,Rainfall_mm,wet
5472,1841,12,25,6.3,-1.5,2.4,7.8,0.9,True
8394,1849,12,25,4.4,-0.7,1.9,5.1,1.0,True
18256,1876,12,25,1.2,-0.6,0.3,1.8,1.3,True
18621,1877,12,25,3.4,-2.1,0.7,5.5,0.8,True
18986,1878,12,25,4.0,-2.7,0.7,6.7,10.2,True
19351,1879,12,25,1.1,-1.7,-0.3,2.8,0.3,True
23369,1890,12,25,-3.1,-4.8,-4.0,1.7,0.3,True
23734,1891,12,25,1.7,-6.8,-2.6,8.5,0.7,True
25195,1895,12,25,3.2,-0.5,1.4,3.7,5.0,True
28482,1904,12,25,2.7,-1.5,0.6,4.2,0.5,True


#### b. Sort the dataframe <tt>WhiteChristmas</tt> by year and comment

In [62]:
WhiteChristmas.sort_values(by='YYYY')

Unnamed: 0,YYYY,MM,DD,Tmax,Tmin,Tmean,Trange,Rainfall_mm,wet
5472,1841,12,25,6.3,-1.5,2.4,7.8,0.9,True
8394,1849,12,25,4.4,-0.7,1.9,5.1,1.0,True
18256,1876,12,25,1.2,-0.6,0.3,1.8,1.3,True
18621,1877,12,25,3.4,-2.1,0.7,5.5,0.8,True
18986,1878,12,25,4.0,-2.7,0.7,6.7,10.2,True
19351,1879,12,25,1.1,-1.7,-0.3,2.8,0.3,True
23369,1890,12,25,-3.1,-4.8,-4.0,1.7,0.3,True
23734,1891,12,25,1.7,-6.8,-2.6,8.5,0.7,True
25195,1895,12,25,3.2,-0.5,1.4,3.7,5.0,True
28482,1904,12,25,2.7,-1.5,0.6,4.2,0.5,True


*Your comments here*

General descriptive comments and engagement with the data are to be encouraged, e.g.

* White Christmases are not very common now (only three in the last 50 years)
* They have never been all that common, although there was a run of them in the 1870s
* The most recent one was in 2018; however, there wasn't enough snow to play in (0.1mm of rain - so maybe 1mm of snow!)

* There have been only a handful of Christmas days with a worthwhile amount of snow, notably 1878 and 1923 (>10mm rainfall, which would mean more than 10cm of snow)

#### c. Any issues with our definition of 'snow'?

We defined snow as when the <tt>Tmin</tt> falls below zero and Rainfall is non-zero. 

* Do you think this over- or under-estimates the number of snowy days?
* Why?

*Your comments here*

#### d. How common is 'proper' snowfall in Oxford?

Let's focus on days with enough snowfall to make at least a tiny snowman! Assume that this happens when Tmin is below zero and there is more than 4mm of rainfall.

* 4mm of rain makes about 5cm of soggy snow in Oxford conditions, although it would make a much greater depth of powder in a cold dry atmosphere like Utah or Colorado

Create a dataframe called <tt>SnowDays</tt> containing only days with enough snow to make a snowman.

You can check how often this happened in recent years using `df.tail()`

In [71]:
# Keep every qualifying day in SnowDays, then look at the most recent ones
SnowDays = weather.query('Tmin<0 and Rainfall_mm>4')
SnowDays.tail(25)

Unnamed: 0,YYYY,MM,DD,Tmax,Tmin,Tmean,Trange,Rainfall_mm,wet
68718,2015,2,22,9.8,-1.0,4.4,10.8,6.9,True
68749,2015,3,25,9.5,-1.2,4.2,10.7,5.8,True
68992,2015,11,23,9.2,-1.6,3.8,10.8,6.9,True
69098,2016,3,8,8.7,-2.1,3.3,10.8,28.5,True
69343,2016,11,8,5.0,-2.2,1.4,7.2,20.0,True
69396,2016,12,31,8.0,-0.3,3.9,8.3,7.8,True
69402,2017,1,6,8.7,-2.9,2.9,11.6,6.3,True
69410,2017,1,14,5.9,-0.9,2.5,6.8,5.8,True
69433,2017,2,6,7.5,-1.7,2.9,9.2,6.0,True
69740,2017,12,10,1.9,-1.8,0.0,3.7,16.5,True


*Your comments here*

* It snows most years! 
* It mostly snows in January or February, but sometimes in November or December and occasionally in April
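
Claims like "it snows most years" can be checked by counting rather than by reading down a table. The sketch below shows one possible approach on a small invented dataframe (all values are made up, and the names `days_per_year`, `days_per_month` and `prop_years_with_snow` are hypothetical). It applies the same snow definition as above and then counts qualifying days by year and by month.

```python
import pandas

# Toy data, invented for illustration (not real Oxford records)
toy = pandas.DataFrame({
    "YYYY":        [2015, 2015, 2016, 2016, 2017, 2018, 2018, 2018],
    "MM":          [1, 2, 1, 7, 12, 2, 3, 11],
    "Tmin":        [-1.0, -2.0, -0.5, 12.0, 0.5, 1.0, -1.5, -0.2],
    "Rainfall_mm": [6.0, 0.0, 5.0, 10.0, 8.0, 9.0, 4.5, 1.0],
})

# Same working definition of a snow day as in the exercise
snow = toy.query("Tmin<0 and Rainfall_mm>4")

# Number of snow days in each year and in each month (only those with at least one)
days_per_year = snow.groupby("YYYY").size()
days_per_month = snow.groupby("MM").size()

# Proportion of years in the data that had at least one snow day
prop_years_with_snow = snow.YYYY.nunique() / toy.YYYY.nunique()

print(days_per_year)
print(days_per_month)
print(prop_years_with_snow)
```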