TODO ADD INTRO

We first import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline





We begin by obtaining the data required for the analysis.

On the National Oceanic and Atmospheric Administration's National Centers for Environmental Information website, we specify the date range of the dataset **(1955-01-01 to 2021-10-20)**. We also need to provide the type of data we require, which in our case is **Daily Summaries**. Finally, we need to specify our search term. Since we want the Daily Summaries of the Hellinikon weather station, we search for that station.

The search tool returns the data from the Hellinikon weather station. The tool also prompts us to specify various other preferences:

* The format of the data: in our case .csv format is selected
* Additional data to include: according to the assignment description and the documentation, the Average Temperature (TAVG) and Precipitation (PRCP) are required for the analysis, so both are selected.
* We also selected the option to include data headers in order to facilitate the organization of the data.
* Finally, the metric system is selected as the dataset unit of measurement.

After specifying the above, we download the required dataset. For portability reasons, the data are accessed from a personal access folder.

We also need to download the alternative Athens weather dataset from the Hellenic Data Service, again available in .csv format. 


We then proceed to read the two csv files and create two DataFrames.

By reading the dataset's documentation, we can determine the contents of each column:

* **DATE** represents the date of the measurement. - (YYYY-MM-DD)
* **STATION** represents the station code, which in our case is the code of the Hellinikon weather station.
* **PRCP** represents the precipitation, or the total daily rainfall. - (mm)
* **PRCP_ATTRIBUTES** contains a list of different attributes of the precipitation measurement.
* **TAVG** represents the average daily temperature. - (Degrees Celsius)
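The attribute columns pack several flags into one comma-separated string (for example `",,E"` or `"H,,S"`). The following is a minimal sketch of how they could be split into separate columns. It assumes that the three fields are, in order, a measurement flag, a quality flag and a source flag, which should be checked against the dataset documentation. The column names `MFLAG`, `QFLAG`, `SFLAG` and the helper `split_attributes` are hypothetical, and the attribute columns are not used further in this analysis.

```python
import pandas as pd

# Tiny stand-in for the NOAA frame; the attribute strings are copied
# from rows displayed in this notebook.
sample = pd.DataFrame({
    "DATE": pd.to_datetime(["1955-01-01", "2021-10-13"]),
    "PRCP_ATTRIBUTES": [",,E", ",,S"],
    "TAVG_ATTRIBUTES": [None, "H,,S"],
})

# Assumed field order (hypothetical names): measurement, quality, source flag.
flag_names = ["MFLAG", "QFLAG", "SFLAG"]


def split_attributes(column: pd.Series) -> pd.DataFrame:
    """Split a comma-separated attribute column into one column per flag."""
    parts = column.str.split(",", expand=True).iloc[:, :3]
    parts.columns = flag_names
    return parts


prcp_flags = split_attributes(sample["PRCP_ATTRIBUTES"])
tavg_flags = split_attributes(sample["TAVG_ATTRIBUTES"])

print(prcp_flags)
print(tavg_flags)
```

Rows whose attribute string is missing simply yield missing flags, so the split does not need special handling for days without a measurement.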



In [2]:
noaa_filename = "data/NOAA.csv"
data = pd.read_csv(noaa_filename, 
                   parse_dates=['DATE']) 

We notice that the secondary data file does not contain any headers. Fortunately, the dataset's description (although not well documented) specifies that the columns describe the following:

* col.  1 represents the date of the measurement. - (YYYY-MM-DD)
* cols. 2 - 4 represent the average, maximum and minimum daily temperature (in that order). - (Degrees Celsius)
* cols. 5 - 7 represent the average, maximum and minimum relative humidity. - (%)
* cols. 8 - 10 represent the average, maximum and minimum atmospheric pressure. - (hPa)
* col. 11 represents the total daily rainfall. - (mm) 
* col. 12 represents the average wind speed. - (km/h)
* col. 13 represents the wind direction.
* col. 14 represents the maximum wind gust speed. - (km/h)

We take care to give columns that describe the same measurement the same name in both datasets, in order to facilitate further processing. The measurements come from different weather stations in Athens (the secondary data come from a station close to the centre of Athens, whereas the Hellinikon weather station is located in the southern part of the city). For the purposes of the assignment, we assume that this does not affect the results.

We therefore create the headers and assign them manually. The names of the other columns reflect their contents, but we will not explain them further as they will not be used in the analysis.

In [3]:
hds_filename = "data/HDS.csv"

secondary_data_headers = ["DATE", "TAVG", "TMAX", "TMIN", "HAVG", "HMAX", "HMIN", 
                          "PAVG", "PMAX", "PMIN", "PRCP", "WSPD", "WDIR", "GSPD"]

secondary_data = pd.read_csv(hds_filename,
                             names=secondary_data_headers,
                             parse_dates=['DATE'])
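To illustrate what `names=` does for a file without a header row, here is a small self-contained sketch. It reads two rows copied from the secondary dataset out of an in-memory buffer instead of `data/HDS.csv`, and it assumes the file is comma-separated. Passing `header=None` explicitly makes it clear that the first line is data rather than column names.

```python
import io

import pandas as pd

# Two rows copied from the secondary dataset, with no header line.
raw = io.StringIO(
    "2010-01-01,17.9,18.1,17.8,61.4,91,33,1003.6,1006.3,1002.0,0.2,4.0,WSW,12.7\n"
    "2010-01-02,15.6,15.7,15.5,57.4,70,45,1005.2,1008.7,1001.5,0.0,6.8,WSW,20.7\n"
)

headers = ["DATE", "TAVG", "TMAX", "TMIN", "HAVG", "HMAX", "HMIN",
           "PAVG", "PMAX", "PMIN", "PRCP", "WSPD", "WDIR", "GSPD"]

# header=None: the first line is data; names=: supply the column labels.
demo = pd.read_csv(raw, header=None, names=headers, parse_dates=["DATE"])

print(demo.dtypes)
print(demo[["DATE", "TAVG", "PRCP", "WDIR"]])
```

Without `names=` (and without `header=None`), pandas would treat the first measurement row as the header and silently drop one day of data.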

 

After reading the files we are left with the following DataFrames:

In [4]:
data

Unnamed: 0,STATION,DATE,PRCP,PRCP_ATTRIBUTES,TAVG,TAVG_ATTRIBUTES
0,GR000016716,1955-01-01,0.0,",,E",,
1,GR000016716,1955-01-02,2.0,",,E",,
2,GR000016716,1955-01-03,0.0,",,E",,
3,GR000016716,1955-01-04,0.0,",,E",,
4,GR000016716,1955-01-05,0.0,",,E",,
...,...,...,...,...,...,...
23532,GR000016716,2021-10-13,0.0,",,S",19.8,"H,,S"
23533,GR000016716,2021-10-14,5.6,",,S",17.5,"H,,S"
23534,GR000016716,2021-10-15,79.2,",,S",19.7,"H,,S"
23535,GR000016716,2021-10-16,2.5,",,S",19.9,"H,,S"


In [5]:
secondary_data

Unnamed: 0,DATE,TAVG,TMAX,TMIN,HAVG,HMAX,HMIN,PAVG,PMAX,PMIN,PRCP,WSPD,WDIR,GSPD
0,2010-01-01,17.9,18.1,17.8,61.4,91,33,1003.6,1006.3,1002.0,0.2,4.0,WSW,12.7
1,2010-01-02,15.6,15.7,15.5,57.4,70,45,1005.2,1008.7,1001.5,0.0,6.8,WSW,20.7
2,2010-01-03,13.5,13.6,13.4,56.0,76,39,1011.7,1016.7,1008.6,0.0,5.0,WSW,15.4
3,2010-01-04,9.5,9.6,9.5,50.7,60,38,1021.3,1023.1,1016.8,0.0,4.3,NNE,11.0
4,2010-01-05,13.4,13.5,13.4,70.5,82,54,1018.7,1022.1,1015.5,0.0,7.9,S,19.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3647,2019-12-27,10.1,10.2,10.0,60.3,79,44,1018.4,1019.9,1016.8,0.0,2.9,NE,8.0
3648,2019-12-28,8.3,8.4,8.2,60.9,82,46,1016.0,1017.2,1014.2,7.2,4.3,NE,12.8
3649,2019-12-29,6.4,6.5,6.4,73.4,82,66,1017.6,1018.9,1016.5,3.4,10.6,NNE,24.5
3650,2019-12-30,4.0,4.0,3.9,83.9,90,65,1020.0,1024.2,1016.6,12.4,5.1,NE,15.0
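The assumption that the difference between the two stations does not affect the results could be checked on the dates where both datasets have a temperature reading. The sketch below shows one possible way to do this with an inner merge on `DATE`. The frames `primary` and `secondary` are toy stand-ins: the `primary` temperatures are made-up values for illustration only, and the column suffixes are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for the two DataFrames (primary values are made up).
primary = pd.DataFrame({
    "DATE": pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-03"]),
    "TAVG": [17.0, 15.0, 13.0],
})
secondary = pd.DataFrame({
    "DATE": pd.to_datetime(["2010-01-02", "2010-01-03", "2010-01-04"]),
    "TAVG": [15.6, 13.5, 9.5],
})

# Keep only the dates present in both frames.
overlap = primary.merge(secondary, on="DATE", how="inner",
                        suffixes=("_NOAA", "_HDS"))

# Per-day difference and its mean absolute value.
overlap["TAVG_DIFF"] = overlap["TAVG_NOAA"] - overlap["TAVG_HDS"]
mean_abs_diff = overlap["TAVG_DIFF"].abs().mean()

print(overlap)
print(mean_abs_diff)
```

On the real data the same merge would be applied to `data` and `secondary_data`; a small mean absolute difference would support treating the two stations as interchangeable.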


We will now explore the primary dataset.

In [6]:
data.describe()


Unnamed: 0,PRCP,TAVG
count,23024.0,21226.0
mean,1.007284,18.306492
std,4.640332,6.937514
min,0.0,-2.0
25%,0.0,12.8
50%,0.0,17.8
75%,0.0,24.3
max,142.0,34.8


In [7]:
data.TAVG.value_counts()

13.9    145
13.1    145
14.2    144
13.3    142
14.9    142
       ... 
0.4       1
33.9      1
2.7       1
1.4       1
1.0       1
Name: TAVG, Length: 348, dtype: int64

In [8]:
data.PRCP.value_counts()

0.0     19138
0.3       227
0.1       226
0.2       215
0.5       168
        ...  
20.9        1
25.5        1
42.2        1
34.0        1
26.8        1
Name: PRCP, Length: 371, dtype: int64

* At first glance, we see no extreme values.
* It is also worth mentioning that the 75th percentile of PRCP is 0.0, so it rained on fewer than 25% of the days (19138 of the 23024 recorded days have zero precipitation), which makes sense in a country such as Greece.
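The share of rainy days can also be computed directly rather than read off the quartiles. The following is a minimal sketch on a toy series, where the values are illustrative and `rainy_share` is a hypothetical name; on the real data the same expressions would be applied to `data['PRCP']`.

```python
import pandas as pd

# Toy precipitation series with one missing day (illustrative values).
prcp = pd.Series([0.0, 0.0, 0.0, 2.0, None, 0.0, 5.6, 0.0, 0.0, 0.0],
                 name="PRCP")

# Only days with a recorded value count towards the share.
recorded = prcp.dropna()

# The mean of a boolean series is the fraction of True values.
rainy_share = (recorded > 0).mean()

# The 75th percentile, as reported by describe().
q75 = recorded.quantile(0.75)

print(f"rainy days: {rainy_share:.1%}, 75th percentile: {q75} mm")
```

A 75th percentile of 0.0 only says that at least 75% of the days were dry; the boolean mean gives the exact fraction.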

Now let's check for missing values. We only care about missing TAVG and PRCP values, as these are the columns we are interested in. Therefore, we first need an overview of any absent temperature or precipitation data.

In [9]:
data[data['PRCP'].isnull() | data['TAVG'].isnull()]

Unnamed: 0,STATION,DATE,PRCP,PRCP_ATTRIBUTES,TAVG,TAVG_ATTRIBUTES
0,GR000016716,1955-01-01,0.0,",,E",,
1,GR000016716,1955-01-02,2.0,",,E",,
2,GR000016716,1955-01-03,0.0,",,E",,
3,GR000016716,1955-01-04,0.0,",,E",,
4,GR000016716,1955-01-05,0.0,",,E",,
...,...,...,...,...,...,...
23503,GR000016716,2021-09-14,,,25.8,"H,,S"
23521,GR000016716,2021-10-02,,,18.9,"H,,S"
23528,GR000016716,2021-10-09,,,20.6,"H,,S"
23530,GR000016716,2021-10-11,,,21.8,"H,,S"


In [13]:
data[data['TAVG'].isnull()].count()

STATION            2311
DATE               2311
PRCP               2308
PRCP_ATTRIBUTES    2308
TAVG                  0
TAVG_ATTRIBUTES       0
dtype: int64

In [12]:
data[data['PRCP'].isnull()].count()

STATION            513
DATE               513
PRCP                 0
PRCP_ATTRIBUTES      0
TAVG               510
TAVG_ATTRIBUTES    510
dtype: int64

We have spotted 2821 rows with at least one missing value. Of these, 2311 are missing temperature data and 513 are missing precipitation data; 3 rows are missing both, which is why the two counts add up to 2824 rather than 2821.
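The relationship between the per-column counts and the number of affected rows can be verified with boolean masks. The sketch below uses a toy frame with illustrative values; the variable names are hypothetical. Rows missing both values are counted once in the "either" total, so the per-column counts overlap.

```python
import numpy as np
import pandas as pd

# Toy frame: one row is missing both PRCP and TAVG.
toy = pd.DataFrame({
    "PRCP": [0.0, np.nan, 1.2, np.nan, 0.0],
    "TAVG": [np.nan, 20.6, 18.9, np.nan, np.nan],
})

missing_prcp = toy["PRCP"].isna()
missing_tavg = toy["TAVG"].isna()

n_prcp = int(missing_prcp.sum())                      # rows missing PRCP
n_tavg = int(missing_tavg.sum())                      # rows missing TAVG
n_both = int((missing_prcp & missing_tavg).sum())     # rows missing both
n_either = int((missing_prcp | missing_tavg).sum())   # rows missing at least one

# Inclusion-exclusion: rows missing both are counted once.
assert n_either == n_prcp + n_tavg - n_both

print(n_prcp, n_tavg, n_both, n_either)
```

Applied to `data`, the same masks would show how the 2311 and 513 counts combine into 2821 affected rows.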

We are now going to analyze the secondary dataset.