## 1 Application of descriptive statistics using Python

## 1.1 Getting ready for the data analysis

We will work with a data file that contains daily temperature data observed at Albany Airport (KALB). This data file is part of the [Global Historical Climatology Network daily (GHCNd)](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily) database, which includes thousands of meteorological stations in the US and the rest of the world.

### 1.1.1 Download and upload of the data file

Download the data file named *USW00014735_temp_1950-2021_daily.csv* from the 
[GitHub repository](https://github.com/oet808/ATMENV315/tree/master/data). Upload it to your local data directory in JupyterLab. Make sure the file name is correct and ends with '.csv'.

### 1.1.2 Take a first look at the data in the file:

- How many columns with data are in the file?
- How many lines (rows) with data are in the file?
- Does the file start with comments or text information at the top? If so, how many lines are non-data lines (header lines)?
- Is there a line with column names (labels) for each column?
- What meteorological data are in the file?
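
One way to answer these questions without leaving the notebook is to print the first few lines of the file with plain Python. The sketch below uses a short made-up text as a stand-in for the real file. The column names and values in `sample_text` are illustrative assumptions, not the contents of the KALB file. With the real file you would call `open(filename)` instead of `io.StringIO(sample_text)`.

```python
import io

# Made-up stand-in for the CSV file (hypothetical column names and values)
sample_text = """time,year,month,day,avgt
1950-01-01,1950.0,1.0,1.0,30.5
1950-01-02,1950.0,1.0,2.0,28.0
1950-01-03,1950.0,1.0,3.0,35.5
"""

# With the real file: f = open(filename)
f = io.StringIO(sample_text)
lines = f.read().splitlines()
f.close()

# Show the first lines to check for header lines and column labels
for line in lines[:3]:
    print(line)

n_columns = len(lines[0].split(','))
n_data_rows = len(lines) - 1  # assumes exactly one line with column names
print('columns:', n_columns, ' data rows:', n_data_rows)
```

The row count here assumes that the file has exactly one non-data line (the column labels). If the file starts with additional comment lines, subtract those as well.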




### 1.1.3 Import of package Pandas

Once you are familiar with the data file and the structure of the data, let's get to work. In this notebook version (spring 2022) we make use of the package Pandas, which gives us new methods and functions. We will work with a new data type that we refer to, somewhat loosely, as a 'dataframe'. At this level that is sufficient: we just want to distinguish it from regular numpy arrays.


In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

# Import the new package Pandas
import pandas as pd


# Tip: You can change the style of the plots by choosing from 
# the matplotlib styles. 
# More help can be found through a quick google search
from matplotlib import style 
style.use('ggplot') #'classic' 


## 1.2 Getting familiar with the Pandas class _pandas.core.frame.DataFrame_


### 1.2.1 Reading the data and creating a dataframe object

Many Python coders use a variable name such as _df_ when they work with a dataframe object. We apply the Pandas function _read_csv()_ to read the data table from the CSV file and assign it to the variable _df_.

In [None]:
local_path='../data/'
filename=local_path+'USW00014735_temp_1950-2021_daily.csv'
df=pd.read_csv(filename,delimiter=',',skiprows=0)

In [None]:
print(type(df))
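
The call above uses `skiprows=0`, which means no lines are skipped before the line with the column names. If a file did start with comment lines, `skiprows` would have to match their number. The following is a minimal sketch of that case. It reads from a made-up text (hypothetical column names and values) instead of the real file so that it runs on its own.

```python
import io
import pandas as pd

# Made-up stand-in for a CSV file with two comment lines at the top
sample_text = """# made-up comment line 1
# made-up comment line 2
year,month,avgt
1950.0,1.0,30.5
1950.0,1.0,28.0
"""

# Skip the two comment lines; the next line provides the column names
toy = pd.read_csv(io.StringIO(sample_text), delimiter=',', skiprows=2)
print(type(toy))
print(toy.shape)  # (number of rows, number of columns)
```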

### 1.2.2 Visualize the dataframe in the Notebook

When you just type in the variable name _df_ and run the cell, you will see the dataframe.
Note: This is quite a nice layout. Jupyter does the formatting for our convenience (Python's print function does not format the table as nicely).


In [None]:
df

### 1.2.3 Getting information about the data columns

Dataframes have additional attributes that describe the data. They can be accessed with
the syntax '_df.something_', where _something_ is the name of the attribute. For example:



In [None]:
df.columns
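
A few other attributes are useful for a first overview. The sketch below applies them to a small made-up dataframe named `toy` (hypothetical values, not the KALB data). The same attributes are available on _df_.

```python
import pandas as pd

# Small made-up dataframe (hypothetical values)
toy = pd.DataFrame({
    'year': [1950.0, 1950.0, 1950.0],
    'month': [1.0, 1.0, 2.0],
    'avgt': [30.5, 28.0, 35.5],
})

print(toy.columns)  # column labels
print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # data type of each column
print(toy.index)    # row index (a counter starting at 0 by default)
```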

### 1.2.4 Getting data columns from the dataframe _df_

Now for the cool thing about these column names: they are strings, and we can use these strings to refer to a data column and get the data from that column.

In [None]:
y=df['avgt'] # get the data column labeled 'avgt' 
x=df.index # get the counter index 
# Note: a Pandas dataframe has an internal index column.
# In general the index can hold other types of values, for example dates.
# (Here it corresponds to the number of days since Jan 1st 1950.)
y

In [None]:
plt.plot(x,y)
plt.xlabel('days since Jan 1st 1950')
plt.ylabel('daily mean temperature [F]')
plt.show()
# Note avgt and index are not simple numpy arrays 
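
To make the last comment concrete: a single column is a Pandas _Series_, and it can be converted to a regular numpy array when needed. This is a minimal sketch on a made-up dataframe named `toy` (hypothetical values).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'avgt': [30.5, 28.0, 35.5]})  # made-up values

y_series = toy['avgt']          # a Pandas Series
y_array = y_series.to_numpy()   # the same values as a numpy array
x_array = toy.index.to_numpy()  # the index converted to a numpy array

print(type(y_series))
print(type(y_array))
```

For plotting, the conversion is usually not needed, because `plt.plot` accepts the Series and the index directly, as in the cell above.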

### 1.2.5 Practice getting data columns from the dataframe _df_ 

- Get the daily minimum temperature and plot the time series.
- Get the daily maximum temperature and plot the time series.
- Get the year data and plot the time series.
- Get the month data and plot the time series.

In [None]:
y=df['maxt'] # get the data column labeled 'maxt' 
x=df.index
plt.plot(x,y)
plt.show()

## 1.3 Getting subsets of data with the query method

We demonstrate how we can use data in one column of the dataframe to select subsets of data rows: 

- select rows for a specific month
- select a range of years
- select a specific day in a month

Ultimately, we want to create a plot of the 30-year climatology of monthly temperatures, similar to the curve in the image below (obtained from [ClimateCharts](https://climatecharts.net/)).

In [None]:
from IPython.display import Image
Image("https://raw.githubusercontent.com/oet808/ATMENV315/master/unit5/example_climate_chart_glens_falls.png",width=600)

---
For this purpose we need to separate the daily data into groups and calculate the mean temperature for each month. There are several ways to do that; one of them is the query method. If you 'push' yourself to comprehend the examples and pay attention to the syntax within the query string, you will have an extremely powerful tool at hand.
The query strings make use of the same syntax rules as Python Boolean expressions.
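
As a minimal sketch of this idea, the query string below gives the same rows as the equivalent Boolean expression written in regular Python. The dataframe `toy` and its values are made up for illustration.

```python
import pandas as pd

# Small made-up dataframe (hypothetical values)
toy = pd.DataFrame({
    'year': [1989.0, 1989.0, 1990.0, 1990.0],
    'month': [1.0, 2.0, 1.0, 2.0],
    'avgt': [20.0, 25.0, 22.0, 27.0],
})

by_query = toy.query("month == 2.0")  # condition written as a query string
by_mask = toy[toy['month'] == 2.0]    # same condition as a Boolean mask

print(by_query.equals(by_mask))   # both selections contain the same rows
print(by_query['avgt'].mean())    # mean temperature of the selected rows
```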


---


## 1.3 Examples for subsampling dataframes using the query method

### 1.3.1 Example: Use the query method to get the data from February

Take another look at the CSV file. The column labeled 'month' contains float numbers 
1.0, 2.0, 3.0 ... 11.0, 12.0.

They represent the calendar months Jan, Feb, Mar, ... , Nov, Dec, respectively.

So, we can start a search for the rows (the row index position) where the column 'month' contains the value 2.0.

Below is the syntax: 


In [None]:
dfq1=df.query("month == 2.0")
dfq1

Note that we store the resulting dataframe in the variable _dfq1_. The number of rows has been considerably reduced because all rows that do not belong to the month of February were dropped.


### 1.3.2 Example: Use the query method to select a year 

We can now apply the query method to the dataframe _dfq1_ to select the February data from the year 1989, for example.






In [None]:
dfq2=dfq1.query("year == 1989.0")
dfq2

### 1.3.3 Example: Use the query method to select a year range 1989-1993

We can now apply the query method to the dataframe _dfq1_ to select the February data of the years 1989-1993, for example.

In [None]:
dfq3=dfq1.query("year >= 1989.0 and year <= 1993.0")
dfq3

### 1.3.4 Example: Combining two or more queries in one expression
Now let's do the two steps in one query:

We can apply the query method to the original dataframe _df_ to select the February data of the years 1989-1993:


In [None]:
dfq4=df.query("month == 2.0 and year >=1989 and year <= 1993.0")
dfq4
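
The combined query is the building block for the monthly climatology mentioned at the start of this section. The sketch below repeats the query for each month and stores the mean temperature. It is one possible approach under stated assumptions, not the only way to do it: the dataframe `toy`, its values, and the names `first_year`, `last_year` and `monthly_mean` are made up for illustration. Inside a query string, the prefix `@` refers to a Python variable.

```python
import pandas as pd

# Made-up values for two months in several years (hypothetical)
toy = pd.DataFrame({
    'year': [1989.0, 1989.0, 1990.0, 1990.0, 1994.0],
    'month': [1.0, 2.0, 1.0, 2.0, 1.0],
    'avgt': [20.0, 25.0, 22.0, 27.0, 50.0],
})

first_year, last_year = 1989, 1993
monthly_mean = {}
for m in [1, 2]:  # with the real data: range(1, 13)
    # @m, @first_year, @last_year refer to the Python variables
    subset = toy.query("month == @m and year >= @first_year and year <= @last_year")
    monthly_mean[m] = subset['avgt'].mean()

print(monthly_mean)
```

The row from 1994 lies outside the selected year range and does not enter the January mean.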

---

### References:

Links to web pages with examples / tutorials on Pandas dataframes:
    
- [Pandas quickstart guide](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Examples how to apply the query method](https://towardsdatascience.com/10-examples-that-will-make-you-use-pandas-query-function-more-often-a8fb3e9361cb)



## Notes:

### Dealing with dates/ time data to make better time series plots

We have a column with time (date) information in the dataframe, but we cannot use it directly for plotting purposes! The __problem__ is that the values are of type string, not numerical datetime information. 

Solution to the problem: We can use Pandas functions to convert the strings in the column 'time' into proper _datetime_ objects, which are numerical values that can be used for plotting the time series.

We do this in the following way:

`df['time']=pd.to_datetime(df['time'])`

We reassign the column '___time___' in the dataframe ___df___ with the converted values that the function call `pd.to_datetime(df['time'])` returns.

In [None]:
df['time']=pd.to_datetime(df['time'])
plt.plot(df['time'],df['avgt']) # use 'avgt' explicitly; y was reassigned to 'maxt' above
plt.xlabel('time')
plt.ylabel('daily mean temperature [F]')
plt.show()
# Note: the x-axis now shows calendar dates instead of a day counter
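
The effect of the conversion can be checked on a small made-up example. The sketch below assumes date strings in the form 'YYYY-MM-DD'; the format used in the real file may differ, and the dataframe `toy` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'time': ['1950-01-01', '1950-01-02', '1950-01-03'],  # dates stored as strings
    'avgt': [30.5, 28.0, 35.5],                          # made-up values
})

# Before the conversion the column does not hold datetime values
print(pd.api.types.is_datetime64_any_dtype(toy['time']))

toy['time'] = pd.to_datetime(toy['time'])

# After the conversion it does
print(pd.api.types.is_datetime64_any_dtype(toy['time']))

# Datetime values give access to calendar information
print(toy['time'].dt.year.tolist())
print(toy['time'].dt.day.tolist())
```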