# Exploratory data analysis of Auckland rainfall data


>In this notebook, I apply exploratory data analysis (EDA) methods to Auckland rainfall data recorded between 1872 and 2011 and summarise its main characteristics. 

> The data tables are taken from the Auckland Council website: https://environmentauckland.org.nz/Data/DataSet/Summary/Location/A64871M/DataSet/Rainfall/Continuous/Interval/Latest 

>There are five tables containing rainfall amounts in millimetres, measured at two locations: Mt Albert and the city centre. The tables cover time spans ranging from 6 to 140 years, and they include both measured values and synthesized data. I am using the A64871M data file, which covers the period from 1872 to 1997. The table contains five columns and 31652 rows, but only two columns are needed for this analysis: the dates and the amount of rain in millimetres. The dates are given in the NZST (UTC+12) time zone. The interval between consecutive measurements ranges from 1 to 10 days. 

### 1. Importing Python libraries

In [3]:
import numpy as np
import pandas as pd
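#pandas.datetime was deprecated in pandas 1.0 and later removed; use the standard library instead
from datetime import datetime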
import matplotlib.pyplot as plt

### 2. Reading input data file

To read the AucklandRainfall1872-1997.csv file, I am using the pandas function read_csv. 
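
As a rough illustration of what the `read_csv` options in the next cell do, here is a minimal sketch that applies the same options to a small in-memory sample. The three placeholder lines at the top and the exact text layout of the rows are assumptions (the real file's first lines are not shown in this notebook), and `sample` and `demo` are hypothetical names used only here.

```python
import io
import pandas as pd

# Hypothetical sample: three placeholder lines followed by data rows.
# The layout of the real file's first three lines is an assumption.
sample = io.StringIO(
    "placeholder line 1\n"
    "placeholder line 2\n"
    "placeholder line 3\n"
    "1872-01-03 00:00:00,0.0,200,5,\n"
    "1872-01-04 00:00:00,3.81,200,5,\n"
    "1872-01-05 00:00:00,2.03,200,5,\n"
)

demo = pd.read_csv(
    sample,
    skiprows=3,                  # ignore the first three lines
    header=None,                 # the remaining lines hold data only
    names=['timestamp', 'rain_value', 'grade', 'interpol_type', 'event_timestamp'],
    parse_dates=['timestamp'],   # convert the timestamp column to datetime
    index_col='timestamp',       # and use it as the row index
)

print(demo)
print(type(demo.index).__name__)
```

With `index_col='timestamp'` the date column becomes the index, so the resulting frame has four data columns rather than five.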

In [4]:
#setting the file path and the file name
file = 'Data/AucklandRainfall1872-1997.csv'

#reading file into data frame, rain, and setting date as index column of datetime type 
rain=pd.read_csv(file, skiprows=3, header=None,
                  names=['timestamp','rain_value', 'grade', 'interpol_type', 'event_timestamp'],
                  parse_dates=['timestamp'],
                  index_col='timestamp')

#printing the first five rows of the rain data frame
rain.head(5)

| timestamp | rain_value | grade | interpol_type | event_timestamp |
|---|---|---|---|---|
| 1872-01-03 | 0.0 | 200 | 5 | NaN |
| 1872-01-04 | 3.81 | 200 | 5 | NaN |
| 1872-01-05 | 2.03 | 200 | 5 | NaN |
| 1872-01-09 | 0.0 | 200 | 5 | NaN |
| 1872-01-10 | 0.51 | 200 | 5 | NaN |
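
The first five rows already show that the measurements are not evenly spaced. A minimal sketch of how the gap between consecutive timestamps could be checked is given below; it uses only the five dates shown in the `rain.head(5)` output, and the names `idx` and `gaps` are hypothetical.

```python
import pandas as pd

# The five timestamps shown in the rain.head(5) output
idx = pd.DatetimeIndex(['1872-01-03', '1872-01-04', '1872-01-05',
                        '1872-01-09', '1872-01-10'])

# Difference between consecutive timestamps, expressed in whole days;
# the first entry has no predecessor and is dropped
gaps = idx.to_series().diff().dt.days.dropna()

print(gaps.tolist())
print(gaps.min(), gaps.max())
```

The same expression applied to `rain.index` should give the gaps for the full data set, which is one way to check the stated range of measurement intervals.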


### 3. Dropping unnecessary columns

In [5]:
#displaying table info
rain.info(verbose=True)

#counting unique values for grade, interpol_type and event_timestamp
display(rain['grade'].value_counts(sort=True, normalize=True))
display(rain['interpol_type'].value_counts(sort=True, normalize=True))
display(rain['event_timestamp'].value_counts(sort=True, normalize=True))

#dropping the above columns, which are either constant or empty
rain=rain.drop(columns=['event_timestamp', 'grade', 'interpol_type'])

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 31652 entries, 1872-01-03 00:00:00 to 1997-09-30 08:30:00
Data columns (total 4 columns):
rain_value         31652 non-null float64
grade              31652 non-null int64
interpol_type      31652 non-null int64
event_timestamp    0 non-null float64
dtypes: float64(2), int64(2)
memory usage: 1.2 MB


200    1.0
Name: grade, dtype: float64

5    1.0
Name: interpol_type, dtype: float64

Series([], Name: event_timestamp, dtype: float64)
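
The output above shows that `grade` and `interpol_type` each hold a single value and that `event_timestamp` is entirely empty, which is why these columns are dropped. As an alternative to inspecting each column by hand, a minimal sketch of finding such columns programmatically is given below. The `demo` frame is a hypothetical stand-in built from the first five rows, not the real data, and this is not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the rain data frame, built from its first five rows
demo = pd.DataFrame(
    {
        'rain_value': [0.0, 3.81, 2.03, 0.0, 0.51],
        'grade': [200] * 5,
        'interpol_type': [5] * 5,
        'event_timestamp': [np.nan] * 5,
    },
    index=pd.to_datetime(['1872-01-03', '1872-01-04', '1872-01-05',
                          '1872-01-09', '1872-01-10']),
)

# nunique() ignores NaN by default, so an all-NaN column has 0 distinct
# values and a constant column has 1
uninformative = [col for col in demo.columns if demo[col].nunique() <= 1]

demo = demo.drop(columns=uninformative)
print(uninformative)
print(demo.columns.tolist())
```

On this stand-in, the check flags the same three columns that the notebook drops and leaves only `rain_value`.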