## Programming for Data Analysis Project Submission 2018

### Foreword
The investigation and synthesis of data contained in this Jupyter Notebook is the project submission for the second-semester, 10-credit module **Programming for Data Analysis**, part of the *Higher Diploma in Science - Computing (Data Analytics)*, submitted to Dr. Ian McLoughlin, Lecturer and Programme Director at GMIT.

Submitted by Justin Rutherford<br>
December 2018.

### Project Requirements

*Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.*<br>
*1. Investigate the types of variables involved, their likely distributions, and their relationships with each other.*<br>
*2. Synthesise/simulate a data set as closely matching their properties as possible.*<br>
*3. Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.*



### Work Plan

1. Decide on the real-world phenomenon to be simulated and access a publicly available dataset.
2. Review the dataset and extract the relevant data for further investigation.
3. Establish a baseline set of parameters, then simulate data that closely reflects the statistics of the real data.
4. Conduct online research to establish which probability distributions are used to simulate similar datasets.
5. Using the relevant NumPy random distribution functions, simulate datasets based on the statistics obtained.

### Summary Results and Commentary



In [1]:
import pandas as pd

# Read in the CSV file, skipping the descriptive header rows, and select the columns we are interested in.
df = pd.read_csv("http://cli.met.ie/cli/climate_data/webdata/hly375.csv", skiprows=17, low_memory=False, usecols=[0,2,4,10,12,14])

# Key: rain = rainfall (mm), temp = temperature (deg C), msl = mean sea level pressure (hPa), wdsp = wind speed (knots), wddir = wind direction (degrees)
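
The `describe()` output further down shows a blank as the most frequent value for `temp` and `msl`, which is why every column is later reported as `object`. A minimal sketch of one way to handle this at read time follows, assuming the missing-value placeholder is a single space; the three-row CSV text is a hypothetical stand-in for the real file.

```python
import io

import pandas as pd

# Hypothetical extract in the same layout as the weather file;
# a single space is assumed to mark a missing reading.
sample_csv = io.StringIO(
    "date,rain,temp,msl\n"
    "01-aug-2003 01:00, , , \n"
    "01-jan-2017 00:00,2.3,6.9,1019.9\n"
    "01-jan-2017 01:00,0.8,5.6,1020.2\n"
)

# na_values=" " asks pandas to treat the blank placeholder as NaN,
# so the measurement columns are parsed as floats instead of strings.
sample = pd.read_csv(sample_csv, na_values=" ")
print(sample.dtypes)
print(sample.isna().sum())
```

With the columns already numeric, the later `pd.to_numeric` step would become optional rather than required.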

In [2]:
df.tail()

,date,rain,temp,msl,wdsp,wddir
133699,31-oct-2018 20:00,0.0,6.2,1003.2,2,50
133700,31-oct-2018 21:00,0.0,5.7,1003.5,2,300
133701,31-oct-2018 22:00,0.0,5.4,1003.7,3,300
133702,31-oct-2018 23:00,0.0,4.3,1004.1,3,290
133703,01-nov-2018 00:00,0.0,3.6,1004.5,2,340


In [3]:
df.describe()

,date,rain,temp,msl,wdsp,wddir
count,133704,133704.0,133704.0,133704.0,133704,133704
unique,133704,98.0,412.0,849.0,42,37
top,29-dec-2012 22:00,0.0,,,6,180
freq,1,112206.0,2534.0,2651.0,12211,8429


In [4]:
df['Datetime']= pd.to_datetime(df['date'])
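
Without a `format` argument, `pd.to_datetime` has to infer the layout of the date strings. A short sketch of passing the layout explicitly is shown below; the format string is our reading of values such as `31-oct-2018 20:00` in the output above, and the two-element series is only an illustration.

```python
import pandas as pd

# Two date strings in the same layout as the 'date' column.
raw = pd.Series(["01-aug-2003 01:00", "31-oct-2018 23:00"])

# %d = day, %b = abbreviated month name, %Y = four-digit year, %H:%M = 24-hour time.
parsed = pd.to_datetime(raw, format="%d-%b-%Y %H:%M")
print(parsed)
```

Stating the format avoids per-value guessing, which tends to matter on a column of over 130,000 rows.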

In [5]:
df.head()

,date,rain,temp,msl,wdsp,wddir,Datetime
0,01-aug-2003 01:00,,,,,,2003-08-01 01:00:00
1,01-aug-2003 02:00,,,,,,2003-08-01 02:00:00
2,01-aug-2003 03:00,,,,,,2003-08-01 03:00:00
3,01-aug-2003 04:00,,,,,,2003-08-01 04:00:00
4,01-aug-2003 05:00,,,,,,2003-08-01 05:00:00


In [6]:
# To put the Datetime column first (and drop the original 'date' column), select the columns in the desired order.
df = df[['Datetime', 'rain', 'temp', 'msl', 'wdsp', 'wddir']]
df.tail()

,Datetime,rain,temp,msl,wdsp,wddir
133699,2018-10-31 20:00:00,0.0,6.2,1003.2,2,50
133700,2018-10-31 21:00:00,0.0,5.7,1003.5,2,300
133701,2018-10-31 22:00:00,0.0,5.4,1003.7,3,300
133702,2018-10-31 23:00:00,0.0,4.3,1004.1,3,290
133703,2018-11-01 00:00:00,0.0,3.6,1004.5,2,340


In [7]:
df = df.set_index('Datetime')
df.tail()

Datetime,rain,temp,msl,wdsp,wddir
2018-10-31 20:00:00,0.0,6.2,1003.2,2,50
2018-10-31 21:00:00,0.0,5.7,1003.5,2,300
2018-10-31 22:00:00,0.0,5.4,1003.7,3,300
2018-10-31 23:00:00,0.0,4.3,1004.1,3,290
2018-11-01 00:00:00,0.0,3.6,1004.5,2,340


In [8]:
df1 = df.iloc[117647:126407]
df1.head()

Datetime,rain,temp,msl,wdsp,wddir
2017-01-01 00:00:00,2.3,6.9,1019.9,9,330
2017-01-01 01:00:00,0.8,5.6,1020.2,9,330
2017-01-01 02:00:00,0.1,5.3,1020.2,7,340
2017-01-01 03:00:00,0.0,5.2,1020.1,8,340
2017-01-01 04:00:00,0.0,4.3,1020.2,9,340
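
The positional slice above depends on knowing the exact row numbers for 2017. Because the frame is indexed by `Datetime`, the same year can also be selected by label. The sketch below uses a stand-in hourly frame (`demo` is hypothetical, not the weather data) to show the idea.

```python
import numpy as np
import pandas as pd

# Stand-in hourly index running from late 2016 into early 2018.
index = pd.date_range("2016-12-30", "2018-01-02", freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"temp": np.zeros(len(index))}, index=index)

# Partial-string indexing: every row whose timestamp falls in 2017.
year_2017 = demo.loc["2017"]
print(year_2017.index.min(), year_2017.index.max(), len(year_2017))
```

Selecting by label keeps working if rows are later added to or removed from the start of the file, whereas fixed `iloc` positions would shift.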


In [9]:
df1.tail()

Datetime,rain,temp,msl,wdsp,wddir
2017-12-31 19:00:00,0.0,4.7,991.0,11,270
2017-12-31 20:00:00,0.0,4.3,992.6,8,240
2017-12-31 21:00:00,0.1,4.6,993.1,12,250
2017-12-31 22:00:00,0.1,4.3,993.7,9,250
2017-12-31 23:00:00,0.0,4.5,993.7,11,190


In [10]:
df1.describe()

,rain,temp,msl,wdsp,wddir
count,8760.0,8760.0,8760.0,8760,8760
unique,48.0,290.0,595.0,34,36
top,0.0,9.9,,5,190
freq,7488.0,91.0,63.0,825,647


In [11]:
#Let's look at the data types we are working with.
df1.dtypes

rain     object
temp     object
msl      object
wdsp     object
wddir    object
dtype: object

In [13]:
df1 = df1.apply(pd.to_numeric, errors = 'coerce')
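
To illustrate what `errors='coerce'` does, here is a small sketch on a hypothetical three-row frame of strings: any value that cannot be parsed as a number becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical string columns; the single space stands for a missing reading.
raw = pd.DataFrame({"rain": ["0.0", " ", "2.3"], "wdsp": ["9", "7", "8"]})

# Convert column by column; unparseable entries are replaced by NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")
print(clean.dtypes)
print(clean.isna().sum())
```

A column with a coerced `NaN` ends up as `float64`, while a column of clean whole numbers stays `int64`, which matches the mix of dtypes reported for `df1` below.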

In [14]:
df1.dtypes

rain     float64
temp     float64
msl      float64
wdsp       int64
wddir      int64
dtype: object

In [15]:
df1.head()

Datetime,rain,temp,msl,wdsp,wddir
2017-01-01 00:00:00,2.3,6.9,1019.9,9,330
2017-01-01 01:00:00,0.8,5.6,1020.2,9,330
2017-01-01 02:00:00,0.1,5.3,1020.2,7,340
2017-01-01 03:00:00,0.0,5.2,1020.1,8,340
2017-01-01 04:00:00,0.0,4.3,1020.2,9,340


In [17]:
#So now we should have some reference data to use as a guide in generating some random numbers!
Ref_data = df1.describe()
Ref_data

,rain,temp,msl,wdsp,wddir
count,8754.0,8697.0,8697.0,8760.0,8760.0
mean,0.086737,10.47759,1015.125595,7.409932,218.449772
std,0.388932,4.95477,11.442515,4.216714,80.052463
min,0.0,-4.2,970.8,0.0,10.0
25%,0.0,7.3,1008.6,4.0,170.0
50%,0.0,10.8,1016.2,7.0,210.0
75%,0.0,13.9,1022.6,10.0,280.0
max,11.0,25.8,1037.9,37.0,360.0
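
The `count` row is lower than 8760 for some columns, which reflects the values coerced to `NaN`. A brief sketch of reading the reference table follows; the figures are copied from the output above into a small hypothetical frame `ref` so that the example is self-contained.

```python
import pandas as pd

# Figures copied from the describe() output above (2017 subset, 8760 hourly rows).
ref = pd.DataFrame(
    {
        "rain": [8754.0, 0.086737, 0.388932],
        "temp": [8697.0, 10.47759, 4.95477],
        "msl": [8697.0, 1015.125595, 11.442515],
        "wdsp": [8760.0, 7.409932, 4.216714],
    },
    index=["count", "mean", "std"],
)

# Missing readings per variable, and the mean/std pairs for the simulation.
missing = (8760 - ref.loc["count"]).astype(int)
params = ref.loc[["mean", "std"]].T
print(missing)
print(params)
```

In the notebook itself the same lookups could be made directly on `Ref_data`, for example `Ref_data.loc['mean', 'temp']`.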


In [18]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

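# Following the paper 'The Probability Density Function of Rain Rate and the Estimation of Rainfall by Area Integrals' (listed at the end of this notebook), we use the lognormal distribution.
# Arguments: mean and sigma of the underlying normal distribution, then the sample size (hours in a non-leap year).
Rain = np.random.lognormal(0.14, 0.49, 8760)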
Rain

array([1.11278654, 1.65038106, 0.76286213, ..., 1.16994196, 0.75799245,
       3.043568  ])
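
Note that `np.random.lognormal` takes the mean and sigma of the underlying normal distribution, not of the rainfall values themselves. With 0.14 and 0.49 the implied mean is exp(0.14 + 0.49²/2), about 1.30 mm, compared with a reference mean of roughly 0.087 mm. The sketch below shows one way to derive the two parameters from a target mean and standard deviation using the standard lognormal moment formulas; `lognormal_params` is a hypothetical helper, not part of NumPy.

```python
import math


def lognormal_params(target_mean, target_std):
    """Return (mu, sigma) of the underlying normal for a lognormal
    distribution with the given mean and standard deviation."""
    sigma_sq = math.log(1.0 + (target_std / target_mean) ** 2)
    mu = math.log(target_mean) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)


# Reference mean and std for hourly rain, taken from the describe() table.
mu, sigma = lognormal_params(0.086737, 0.388932)

# Check: convert back to the mean and std of the lognormal itself.
implied_mean = math.exp(mu + sigma**2 / 2)
implied_std = math.sqrt((math.exp(sigma**2) - 1) * math.exp(2 * mu + sigma**2))
print(mu, sigma, implied_mean, implied_std)
```

Even with matched moments, a lognormal sample is always positive, whereas the reference data has a median of 0.0 mm, so this should be read as a rough match only.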

In [None]:
Rain.max()

In [None]:
Rain.mean()

In [None]:
plt.hist(Rain)

In [None]:
# Let's increase the number of bins to give a finer view of the sample frequencies.
count, bins, ignored = plt.hist(Rain, 50)

In [None]:
#Now let's look at the statistics of the number array (ref = https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.describe.html)
from scipy import stats
stats.describe(Rain)
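
`stats.describe` returns a named result whose fields can be read individually, which is convenient when comparing against the reference table. A small sketch on a freshly drawn sample follows; the seed is arbitrary and only there to make the example repeatable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for the example
sample = rng.lognormal(0.14, 0.49, 8760)

summary = stats.describe(sample)

# Individual fields of the result.
print(summary.nobs, summary.minmax)
print(summary.mean, summary.variance)  # variance uses ddof=1 by default
print(summary.skewness, summary.kurtosis)
```

Taking the square root of `summary.variance` gives a standard deviation that can be set beside the `std` row of `Ref_data`.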

In [None]:
#Having read "Matching Temperature Data to a Normal Distribution", we will now use a normal distribution to
#generate data comparable to the target dataset.

In [None]:
Temp = np.random.normal(9.04, 4.54, 8760)
Temp
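
As a variation, the sketch below draws the temperatures from a seeded generator so that the notebook gives the same numbers on every run, and takes `loc` and `scale` from the `temp` column of the reference table instead of the 9.04 and 4.54 used in the cell above. The seed value and the name `temp_sim` are assumptions for illustration.

```python
import numpy as np

# Seeded generator: re-running the cell reproduces the same sample.
rng = np.random.default_rng(2018)  # arbitrary seed

# Mean and std of 'temp' from the describe() table.
temp_sim = rng.normal(loc=10.47759, scale=4.95477, size=8760)
print(temp_sim.mean(), temp_sim.std(ddof=1), temp_sim.min(), temp_sim.max())
```

A normal distribution is unbounded, so the simulated minimum and maximum can fall outside the observed range of -4.2 to 25.8 deg C.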

In [None]:
Temp.max()

In [None]:
Temp.min()

In [None]:
Temp.mean()

In [None]:
count, bins, ignored = plt.hist(Temp,20)

In [None]:
stats.describe(Temp)

### Mean Sea Level Pressure (hPa)
Having read the paper *"Statistical Properties of the Atmospheric Pressure Field over the Arctic Ocean"*<sup>[Ref](https://journals.ametsoc.org/doi/pdf/10.1175/1520-0469%281982%29039%3C2229%3ASPOTAP%3E2.0.CO%3B2)</sup>, we will employ a normal distribution to generate random pressure values.

In [None]:
MSL = np.random.normal(1013.5, 13.22, 8760)
MSL

In [None]:
count, bins, ignored = plt.hist(MSL, 50)

In [None]:
stats.describe(MSL)

### Wind Speed (WdSp) in knots
Having read the paper *"Analytical study of different probability distributions for wind speed related to power statistics"*<sup>[Ref](https://ieeexplore.ieee.org/document/5211970)</sup>, we elected to use the Wald (inverse Gaussian) distribution to generate random values for wind speed.

In [None]:
WdSp = np.random.wald(9.65,5.27, 8760)
WdSp
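
For the Wald distribution the variance equals mean³ / scale, so a mean of 9.65 with a scale of 5.27 implies a standard deviation of about 13 knots, well above the reference value of about 4.2. The sketch below shows how a scale could be chosen to match a target mean and standard deviation; `wald_scale` is a hypothetical helper and the seed is arbitrary.

```python
import numpy as np


def wald_scale(target_mean, target_std):
    """Scale parameter giving the requested std for a Wald distribution,
    using variance = mean**3 / scale."""
    return target_mean**3 / target_std**2


# Reference mean and std for wind speed from the describe() table.
scale = wald_scale(7.409932, 4.216714)

rng = np.random.default_rng(2018)  # arbitrary seed
wdsp_sim = rng.wald(7.409932, scale, 8760)
print(scale, wdsp_sim.mean(), wdsp_sim.std(ddof=1))
```

The sample mean and standard deviation should then land close to the reference figures, subject to sampling noise.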

In [None]:
count, bins, ignored = plt.hist(WdSp, 50)

In [None]:
stats.describe(WdSp)
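
The four simulated variables can be gathered into a single data set with an hourly index, mirroring the layout of `df1`. This is a minimal sketch that reuses the distribution parameters from the cells above; the seed and the name `simulated` are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2018)  # arbitrary seed for reproducibility
n_hours = 8760  # one non-leap year of hourly readings

# Hourly timestamps covering 2017, matching the reference subset.
index = pd.date_range(
    "2017-01-01 00:00", periods=n_hours, freq=pd.Timedelta(hours=1), name="Datetime"
)

simulated = pd.DataFrame(
    {
        "rain": rng.lognormal(0.14, 0.49, n_hours),  # mm
        "temp": rng.normal(9.04, 4.54, n_hours),     # deg C
        "msl": rng.normal(1013.5, 13.22, n_hours),   # hPa
        "wdsp": rng.wald(9.65, 5.27, n_hours),       # knots
    },
    index=index,
)
print(simulated.describe())
```

Each column here is drawn independently, so the sketch does not reproduce any relationship between the variables or any seasonal pattern.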

In [None]:
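# density=True scales the histogram to unit area so it can be compared with a PDF
# (the older 'normed' argument has been removed from Matplotlib).
count, bins, ignored = plt.hist(Rain, 50, density=True, align='mid')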

In [None]:
mu=0.14
sigma = 0.49

x = np.linspace(min(bins), max(bins), 10000)
pdf = (np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))/ (x * sigma * np.sqrt(2 * np.pi)))
plt.plot(x, pdf, linewidth=2, color='r')
plt.axis('tight')
plt.show()
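
The hand-written density above can be cross-checked against SciPy's lognormal implementation. In `scipy.stats.lognorm` the shape parameter `s` plays the role of sigma and `scale` is exp(mu); the grid of `x` values below is an arbitrary choice for the comparison.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.14, 0.49
x = np.linspace(0.1, 5.0, 200)  # arbitrary positive grid

# Same formula as in the plotting cell.
manual_pdf = np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma**2)) / (
    x * sigma * np.sqrt(2 * np.pi)
)

# SciPy's parameterisation: s = sigma, scale = exp(mu).
scipy_pdf = stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))
print(np.allclose(manual_pdf, scipy_pdf))
```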

In [None]:
# Use the numeric 2017 subset (df1); df still holds the unconverted string values.
df1.max()

In [None]:
df1.min()

In [None]:
df1.std()

In [None]:
df1.iloc[:,2].mean()

In [None]:
df['msl']

In [None]:
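# Pass the column itself, not its name, and turn unparseable entries into NaN.
pd.to_numeric(df['msl'], errors='coerce')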

In [None]:
#To tidy up the frame view we can set Datetime as the index as follows;
#df = df.set_index(['Datetime'])

In [None]:
#df.describe(exclude=['Datetime'])

### References

[The Probability Density Function of Rain Rate and the Estimation of Rainfall by Area Integrals](https://journals.ametsoc.org/doi/10.1175/1520-0450%281994%29033%3C1255%3ATPDFOR%3E2.0.CO%3B2)

[Simulating Maximum and Minimum Daily Temperature with the Normal Distribution](https://naldc.nal.usda.gov/download/27264/PDF)

[Matching Temperature Data to a Normal Distribution](http://demonstrations.wolfram.com/MatchingTemperatureDataToANormalDistribution/)

[Statistical Properties of the Atmospheric Pressure Field over the Arctic Ocean](https://journals.ametsoc.org/doi/pdf/10.1175/1520-0469%281982%29039%3C2229%3ASPOTAP%3E2.0.CO%3B2)

[Probability distributions for offshore wind speeds](https://engineering.tufts.edu/cee/people/vogel/documents/probabilityDistributionsOffshoreWindSpeeds.pdf)

[Analytical study of different probability distributions for wind speed related to power statistics](https://ieeexplore.ieee.org/document/5211970)