# basic statistical analysis with python

In this exercise, we'll look at some basic statistical analysis with Python, starting with using `pandas` to calculate descriptive statistics for our datasets, before moving on to a few common examples of hypothesis tests using `statsmodels`.
 
## data

The data used in this exercise are the historic meteorological observations that we used in previous exercises: the [Armagh Observatory](https://www.metoffice.gov.uk/weather/learn-about/how-forecasts-are-made/observations/recording-observations-for-over-100-years) (1853-present), the Oxford Observatory (1853-present), the Southampton Observatory (1855-2000), and Stornoway Airport (1873-present), all downloaded from the [UK Met Office](https://www.metoffice.gov.uk/research/climate/maps-and-data/historic-station-data). I have copied the **combined_stations.csv** data into this folder - this is the same file that you created while working through the "pandas" exercise.


## loading libraries

As before, we load the packages that we will use in the exercise at the beginning:

In [None]:
import pandas as pd
from pathlib import Path

Next, we'll use `pd.read_csv()` to load the combined station data. We'll also use the `parse_dates` argument to tell `pandas` to read the `date` column as a date:

In [None]:
station_data = pd.read_csv(Path('data', 'combined_stations.csv'), parse_dates=['date'])
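
To see what `parse_dates` changes, here is a minimal sketch using a tiny in-memory table. The values are made up for illustration, and only the `date`, `station`, and `rain` column names are taken from this exercise:

```python
import io
import pandas as pd

# Hypothetical stand-in for combined_stations.csv (illustrative values only)
csv_text = """date,station,rain
1853-01-01,armagh,57.3
1853-02-01,armagh,32.3
1853-01-01,oxford,57.7
"""

# Without parse_dates, the date column is read as text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the date column is converted to a datetime type
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(raw['date'].dtype)
print(parsed['date'].dtype)

# A datetime column gives access to the .dt accessor, e.g. to extract the year
print(parsed['date'].dt.year)
```

Reading the column as a datetime up front means that we don't need to convert it later before grouping or filtering by date.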

## descriptive statistics

Before diving into statistical tests, we'll spend a little bit of time expanding on calculating *descriptive* statistics using `pandas`. We have seen a little bit of this already, using `.groupby()` and `.mean()` to calculate the mean value of `rain` for each station.
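
As a quick reminder of that pattern, here is a small sketch on a toy table. The numbers are invented for illustration and are not taken from the station data:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'station': ['armagh', 'armagh', 'oxford', 'oxford'],
    'rain': [50.0, 70.0, 40.0, 60.0],
})

# Group the rows by station, select the rain column, and average within each group
mean_rain = toy.groupby('station')['rain'].mean()
print(mean_rain)
```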

### describing variables using .describe()

First, we'll have a look at `.describe()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)), which provides a summary of each of the (numeric) columns in the table:

In [None]:
station_data.describe()

In the output above, we can see the count (**count**), mean (**mean**), standard deviation (**std**), minimum (**min**), 1st quartile (**25%**), median (**50%**), 3rd quartile (**75%**), and maximum (**max**) values of each numeric variable.

With this, we can quickly see where we might have errors in our data - for example, if we have non-physical or nonsense values in our variables. When first getting started with a dataset, it can be a good idea to check over the dataset using `.describe()`.
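
As a sketch of this kind of check, the toy table below contains a made-up placeholder value of `-999` in `rain`. Both the data and the choice of `-999` as a missing-data flag are assumptions for illustration, not something found in the station data:

```python
import pandas as pd

# Toy data with one obviously non-physical rainfall value (illustrative only)
toy = pd.DataFrame({'rain': [50.0, 70.0, -999.0, 60.0]})

summary = toy.describe()

# The minimum immediately stands out as impossible for rainfall
print(summary.loc[['min', 'max']])

# Once spotted, the offending rows can be selected for a closer look
suspect = toy[toy['rain'] < 0]
print(suspect)
```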

### using .describe() to summarize groups

What if we wanted to get a summary based on some grouping - for example, for each station? We could select the rows for each value of `station` in turn (for example, using `station_data[station_data['station'] == name]`), then call `.describe()` on each of these subsets.
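
The manual approach, selecting each station's rows and summarizing them separately, might look something like the sketch below. It uses a toy table with invented values, and the `summaries` dictionary is just a hypothetical way of keeping the results together:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'station': ['armagh', 'armagh', 'oxford', 'oxford'],
    'rain': [50.0, 70.0, 40.0, 60.0],
})

# Hypothetical container: one summary table per station
summaries = {}

for name in toy['station'].unique():
    # Select only the rows for this station, then summarize them
    subset = toy[toy['station'] == name]
    summaries[name] = subset.describe()

print(summaries['armagh'])
```

This works, but it leaves us with a separate table for each station that we have to keep track of ourselves.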

Not surprisingly, however, there is an easier way, using `.groupby()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)). First, `.groupby()` splits the table into groups based on the values of `station`; calling `.describe()` on the result then summarizes each group:

In [None]:
station_data.groupby('station').describe()

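The result has one row per station, with a two-level column index: the variable name, then the statistic. We'll store it in a variable so that we can work with it further:

In [None]:
group_summary = station_data.groupby('station').describe()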

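To get a table with one row per station and statistic, and one column per variable, we can reshape the summary for each station in turn and then combine the results:

In [None]:
stations = station_data['station'].unique()

combined_stats = []

for station in stations:
    # the summary for this station: a Series indexed by (variable, statistic)
    this_summary = group_summary.loc[station]
    # the variable names are the first level of that index
    columns = this_summary.index.unique(level=0)

    # place the statistics for each variable side by side, one column per variable
    reshaped = pd.concat([this_summary[ind] for ind in columns], axis=1)
    reshaped.columns = columns

    # move the statistic names out of the index, and record the station name
    reshaped.reset_index(inplace=True)
    reshaped.rename(columns={'index': 'statistic'}, inplace=True)
    reshaped['station'] = station

    combined_stats.append(reshaped.set_index(['station', 'statistic']))

# combine the per-station tables into a single table
combined_stats = pd.concat(combined_stats)
combined_stats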
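
As an aside, a similar (station, statistic) layout can probably be reached more directly with `.stack()`, which moves a level of the column index into the row index. The sketch below is an alternative to the loop above rather than the approach used in this exercise. It uses a toy table with invented values, and `tmax` is a hypothetical second variable:

```python
import pandas as pd

# Toy data (illustrative values only; tmax is a hypothetical second variable)
toy = pd.DataFrame({
    'station': ['armagh', 'armagh', 'oxford', 'oxford'],
    'rain': [50.0, 70.0, 40.0, 60.0],
    'tmax': [10.0, 12.0, 11.0, 15.0],
})

# Columns are a two-level index: (variable, statistic)
group_summary = toy.groupby('station').describe()

# Move the statistic level (level 1) of the columns into the row index
stacked = group_summary.stack(level=1)
stacked.index.names = ['station', 'statistic']

print(stacked)

# Individual values can then be looked up by (station, statistic) and variable
print(stacked.loc[('armagh', 'mean'), 'rain'])
```

Depending on the `pandas` version, `.stack()` may emit a warning about its changing implementation, and the order of the rows may differ from the loop-based result, so it is worth checking the output before relying on it.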

In [None]:
this_summary.index.unique(level=0)

In [None]:
pd.to_datetime(station_data['date'])