## Programming for Data Analysis Project 2018

### Patrick McDonald G00281051

#### Problem statement

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:

1. Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
2. Investigate the types of variables involved, their likely distributions, and their relationships with each other.
3. Synthesise/simulate a data set as closely matching their properties as possible.
4. Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.


### 1. Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

For this project, I shall extract some wave buoy data from the [M6 weather buoy](http://www.marine.ie/Home/site-area/data-services/real-time-observations/irish-weather-buoy-network) off the west coast of Ireland. I surf occasionally, and many surfers, myself included, use weather buoy data to predict when there will be decent waves to surf. There are many online resources that provide such information, but I thought this would be an enjoyable exploration of raw data that is used every day, worldwide.

In [26]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Downloaded hly62095.csv from https://data.gov.ie/dataset/hourly-data-for-buoy-m6 
# Opened dataset in VSCode. It contains the label legend, so I have skipped these rows.
# I only want 4 relevant columns of data, so I'll use the 'usecols' argument:
# https://realpython.com/python-data-cleaning-numpy-pandas/#dropping-columns-in-a-dataframe

df = pd.read_csv("hly62095.csv", skiprows = 19, low_memory = False, usecols= ['date', 'dir', 'per', 'wavht'])

# Change the date column to a Pythonic datetime - 
# reference: https://github.com/ianmcloughlin/jupyter-teaching-notebooks/raw/master/time-series.ipynb

df['datetime'] = pd.to_datetime(df['date'])

I downloaded hly62095.csv from https://data.gov.ie/dataset/hourly-data-for-buoy-m6 and opened it in VSCode. Rows 1-19 contain the label legend, so I skipped them when reading the file:

### Label legend

```
1.  Station Name: M6
2.  Station Height: 0 M 
3.  Latitude:52.990  ,Longitude: -15.870
4. 
5. 
6.  date:	  -  Date and Time (utc)
7.  temp:  	  -  Air Temperature (C)	
8.  rhum:	  -  Relative Humidity (%)
9.  windsp:	  -  Mean Wind Speed (kt)
10. dir:	  -  Mean Wind 	Direction (degrees)
11. gust:	  -  Maximum Gust (kt)
12. msl:	  -  Mean Sea Level Pressure (hPa)
13. seatp:	  -  Sea Temperature (C)
14. per:	  -  Significant Wave Period (seconds)
15. wavht:	  -  Significant Wave Height (m)
16. mxwav: 	  -  Individual Maximum Wave Height(m)
17. wvdir:    -  Wave Direction (degrees)
18. ind:      -  Indicator    
19. 
20. date,temp,rhum,wdsp,dir,gust,msl,seatp,per,wavht,mxwave,wvdir
21. 25-sep-2006 09:00,15.2, ,8.000,240.000, ,1007.2,15.4,6.000,1.5, , 
22. 25-sep-2006 10:00,15.2, ,8.000,220.000, ,1008.0,15.4,6.000,1.5, ,......... 

```
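The sample rows in the legend show single spaces where a reading is missing. As a minimal sketch, assuming the real file follows the same layout, the snippet below reads a small made-up excerpt (the `sample_csv` string is hypothetical, not the real file) and uses `na_values` so that those blanks become `NaN` and the numeric columns are parsed as floats.

```python
import io

import pandas as pd

# Hypothetical miniature of the file layout: three legend lines,
# a header row, then two data rows. Blank readings are a single space.
sample_csv = (
    "Station Name: M6\n"
    "Station Height: 0 M\n"
    "date:  -  Date and Time (utc)\n"
    "date,temp,dir,per,wavht\n"
    "25-sep-2006 09:00,15.2,240.000,6.000,1.5\n"
    "25-sep-2006 10:00,15.2, ,6.000, \n"
)

sample_df = pd.read_csv(
    io.StringIO(sample_csv),
    skiprows=3,                      # skip the legend lines (19 in the real file)
    usecols=["date", "dir", "per", "wavht"],
    na_values=[" "],                 # treat a lone space as a missing value
)

# Parse the day-month-year timestamps with an explicit format.
sample_df["datetime"] = pd.to_datetime(sample_df["date"], format="%d-%b-%Y %H:%M")

print(sample_df.dtypes)
print(sample_df.isna().sum())
```

With the blanks converted on load, `dir`, `per` and `wavht` should come through as floating-point columns rather than strings.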

In [25]:
# View DataFrame
df

| | date | dir | per | wavht | datetime |
|---|---|---|---|---|---|
| 0 | 25-sep-2006 09:00 | 240.000 | 6.000 | 1.5 | 2006-09-25 09:00:00 |
| 1 | 25-sep-2006 10:00 | 220.000 | 6.000 | 1.5 | 2006-09-25 10:00:00 |
| 2 | 25-sep-2006 11:00 | 220.000 | 6.000 | 1.5 | 2006-09-25 11:00:00 |
| 3 | 25-sep-2006 12:00 | 240.000 | 6.000 | 1.0 | 2006-09-25 12:00:00 |
| 4 | 25-sep-2006 13:00 | 280.000 | 6.000 | 1.2 | 2006-09-25 13:00:00 |
| 5 | 25-sep-2006 14:00 | 280.000 | 5.000 | 1.2 | 2006-09-25 14:00:00 |
| 6 | 25-sep-2006 15:00 | 270.000 | 5.000 | 1.3 | 2006-09-25 15:00:00 |
| 7 | 25-sep-2006 16:00 | 280.000 | 5.000 | 1.4 | 2006-09-25 16:00:00 |
| 8 | 25-sep-2006 17:00 | 280.000 | 5.000 | 1.6 | 2006-09-25 17:00:00 |
| 9 | 25-sep-2006 18:00 | 270.000 | 6.000 | 1.8 | 2006-09-25 18:00:00 |


There are a significant number of missing data points, and it's a large sample. I'm going to explore this further and extract the relevant data for the first week of September 2018. This will give me enough data to explore and simulate for this project.
First, I'll summarise the columns in the set.

In [27]:
df.describe()

| | date | dir | per | wavht | datetime |
|---|---|---|---|---|---|
| count | 94248 | 94248.0 | 94248.0 | 94248.0 | 94248 |
| unique | 94248 | 327.0 | 13.0 | 152.0 | 94248 |
| top | 11-mar-2017 14:00 | 280.0 | 7.0 | 2.0 | 2010-07-31 10:00:00 |
| freq | 1 | 5022.0 | 26723.0 | 2997.0 | 1 |
| first | | | | | 2006-09-25 09:00:00 |
| last | | | | | 2018-09-30 23:00:00 |
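The summary reports `unique`, `top` and `freq` for `dir`, `per` and `wavht` instead of a mean and standard deviation, which is what pandas shows for columns held as strings. A rough sketch of one way to quantify the missing readings, assuming the gaps are stored as blank strings as in the legend's sample rows, is to coerce the columns to numbers and count the resulting `NaN` values. The `raw` frame here is made-up illustration data, not the buoy file.

```python
import pandas as pd

# Hypothetical rows mimicking columns read in as strings,
# with a single space standing in for a missing reading.
raw = pd.DataFrame({
    "dir": ["240.000", " ", "280.000"],
    "per": ["6.000", "6.000", " "],
    "wavht": ["1.5", " ", " "],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Count the missing readings in each column.
missing_per_column = numeric.isna().sum()
print(missing_per_column)

# With numeric dtypes, describe() reports mean, std, min and quartiles.
print(numeric.describe())
```

Applied to the real columns, the same two steps would give a per-column count of gaps and a numeric summary to compare simulated data against.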


I want to view the data for the first week of September 2018, so I'll extract the relevant data points from this dataset.

In [9]:
# Create a datetime index for a data frame.

# Adapted from: https://pandas.pydata.org/pandas-docs/stable/timeseries.html 
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html

# One week commencing from midnight September 1st, 2018

rng = pd.date_range(start='1-sep-2018', periods=168, freq='H')

In [28]:
rng

DatetimeIndex(['2018-09-01 00:00:00', '2018-09-01 01:00:00',
               '2018-09-01 02:00:00', '2018-09-01 03:00:00',
               '2018-09-01 04:00:00', '2018-09-01 05:00:00',
               '2018-09-01 06:00:00', '2018-09-01 07:00:00',
               '2018-09-01 08:00:00', '2018-09-01 09:00:00',
               ...
               '2018-09-07 14:00:00', '2018-09-07 15:00:00',
               '2018-09-07 16:00:00', '2018-09-07 17:00:00',
               '2018-09-07 18:00:00', '2018-09-07 19:00:00',
               '2018-09-07 20:00:00', '2018-09-07 21:00:00',
               '2018-09-07 22:00:00', '2018-09-07 23:00:00'],
              dtype='datetime64[ns]', length=168, freq='H')
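An hourly range like this can then be used to pull the matching rows out of a frame indexed by datetime. The sketch below is one possible approach, not necessarily the one used later in this notebook. It builds a hypothetical `demo` frame with one hour deliberately removed and compares `reindex`, which keeps all 168 slots and leaves `NaN` for absent hours, with a label slice, which returns only the rows that exist. The frequency is given as a `pd.Timedelta` because the `'H'` alias is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

one_hour = pd.Timedelta(hours=1)

# Hypothetical stand-in for the buoy data: 12 days of hourly wave heights
# starting before the week of interest.
full_index = pd.date_range(start="2018-08-30", periods=288, freq=one_hour)
demo = pd.DataFrame({"wavht": np.arange(288) * 0.01}, index=full_index)

# Remove one record to mimic a gap in the observations.
demo = demo.drop(pd.Timestamp("2018-09-03 12:00"))

# One week of hourly timestamps from midnight on 1 September 2018.
week_rng = pd.date_range(start="2018-09-01", periods=168, freq=one_hour)

# reindex keeps every hour in week_rng; hours absent from demo become NaN.
week = demo.reindex(week_rng)

# Label slicing on a DatetimeIndex includes the whole of the end day.
week_sliced = demo.loc["2018-09-01":"2018-09-07"]

print(len(week), int(week["wavht"].isna().sum()), len(week_sliced))
```

`reindex` is convenient when a fixed number of data points is wanted, since gaps stay visible as `NaN` rows instead of silently shortening the sample.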

I'm using 4 variables from the dataset. These are:

1. date: - Date and Time (utc)
2. dir: - Mean Wind Direction (degrees)
3. per: - Significant Wave Period (seconds) - This is important for quality waves!
4. wavht: - Significant Wave Height (m)

In [29]:
df.head(10)

| | date | dir | per | wavht | datetime |
|---|---|---|---|---|---|
| 0 | 25-sep-2006 09:00 | 240.0 | 6.0 | 1.5 | 2006-09-25 09:00:00 |
| 1 | 25-sep-2006 10:00 | 220.0 | 6.0 | 1.5 | 2006-09-25 10:00:00 |
| 2 | 25-sep-2006 11:00 | 220.0 | 6.0 | 1.5 | 2006-09-25 11:00:00 |
| 3 | 25-sep-2006 12:00 | 240.0 | 6.0 | 1.0 | 2006-09-25 12:00:00 |
| 4 | 25-sep-2006 13:00 | 280.0 | 6.0 | 1.2 | 2006-09-25 13:00:00 |
| 5 | 25-sep-2006 14:00 | 280.0 | 5.0 | 1.2 | 2006-09-25 14:00:00 |
| 6 | 25-sep-2006 15:00 | 270.0 | 5.0 | 1.3 | 2006-09-25 15:00:00 |
| 7 | 25-sep-2006 16:00 | 280.0 | 5.0 | 1.4 | 2006-09-25 16:00:00 |
| 8 | 25-sep-2006 17:00 | 280.0 | 5.0 | 1.6 | 2006-09-25 17:00:00 |
| 9 | 25-sep-2006 18:00 | 270.0 | 6.0 | 1.8 | 2006-09-25 18:00:00 |
