# Machine Learning and Statistics
Project notebook for Machine Learning and Statistics @ GMIT - 2020

Author: Maciej Izydorek (G00387873@gmit.ie) Github: [mizydorek](https://github.com/mizydorek/Fundamentals-of-Data-Analysis-Project-2020)

***

#### Project Description

*Create a web service that uses machine learning to make predictions based on the powerproduction data set. The goal is to produce a model that accurately predicts wind turbine power output from wind speed values, as in the data set. You must then develop a web service that will respond with predicted power values based on speed values sent as HTTP requests.*

#### Introduction to Dataset

##### — Packages

In [1]:
# Numerical arrays.
import numpy as np

# Data manipulation and analysis.
import pandas as pd 

# Plotting.
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# plot settings.
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = [16,9]

# Set your custom color palette
colors = ["#495057", "#212529", "#6C757D", "#ADB5BD", "#CED4DA"]
sns.set_palette(sns.color_palette(colors))
cmap = matplotlib.colors.ListedColormap(colors)

##### — Load Dataset

In [2]:
# Load data set.
df = pd.read_csv("https://raw.githubusercontent.com/ianmcloughlin/2020A-machstat-project/master/dataset/powerproduction.csv")

##### — Preview of dataset

In [3]:
# Preview of dataset.
df.T

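| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| speed | 0.0 | 0.125 | 0.15 | 0.225 | 0.275 | 0.325 | 0.4 | 0.45 | 0.501 | 0.526 | ... | 24.499 | 24.525 | 24.575 | 24.65 | 24.75 | 24.775 | 24.85 | 24.875 | 24.95 | 25.0 |
| power | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.331 | 5.186 | 3.826 | 1.048 | 5.553 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |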


In [4]:
# Shape of dataset.
df.shape

(500, 2)

In [5]:
# Have a look at some basic statistical details.
df.describe()

| | speed | power |
| --- | --- | --- |
| count | 500.0 | 500.0 |
| mean | 12.590398 | 48.014584 |
| std | 7.224991 | 41.614572 |
| min | 0.0 | 0.0 |
| 25% | 6.32475 | 5.288 |
| 50% | 12.5505 | 41.6455 |
| 75% | 18.77525 | 93.537 |
| max | 25.0 | 113.556 |


##### — Standard Missing values

In [6]:
# Check if dataset contains any missing values.
# https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
df.isnull().sum().sum()

0

##### — Non-Standard Missing values

In [7]:
# Check if dataset contains any missing values according to specified list.
# https://stackoverflow.com/questions/43424199/display-rows-with-one-or-more-nan-values-in-pandas-dataframe
missing_values=['n/a', 'na', '--', ' ']
df = pd.read_csv('https://raw.githubusercontent.com/ianmcloughlin/2020A-machstat-project/master/dataset/powerproduction.csv', na_values=missing_values)
df.isna().sum().sum()

0
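
To illustrate what `na_values` changes, here is a minimal sketch on a small in-memory CSV. The sample rows and the variable names `raw`, `plain` and `clean` are made up for illustration; they are not taken from powerproduction.csv.

```python
import io

import pandas as pd

# Hypothetical sample: '--' and 'na' stand in for missing readings.
raw = "speed,power\n0.0,0.0\n1.5,--\nna,3.2\n2.0,4.1\n"

# Without na_values, '--' and 'na' are kept as strings,
# so no missing values are reported and the columns are not numeric.
plain = pd.read_csv(io.StringIO(raw))

# With na_values, the placeholders are parsed as NaN
# and both columns are read as floats.
missing_values = ['n/a', 'na', '--', ' ']
clean = pd.read_csv(io.StringIO(raw), na_values=missing_values)

plain_missing = int(plain.isna().sum().sum())
clean_missing = int(clean.isna().sum().sum())
print(plain_missing, clean_missing)
```

A count of zero from `isna()` is therefore only meaningful once every placeholder string that might appear in the file is listed in `na_values`.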

##### — Negative values

In [8]:
# Count negative values in either column.
(df < 0).sum().sum()

0

##### — Correlation

In [9]:
# Display correlation between wind speed and power output.
df.corr()

| | speed | power |
| --- | --- | --- |
| speed | 1.0 | 0.853778 |
| power | 0.853778 | 1.0 |
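
By default `df.corr()` reports the Pearson correlation coefficient, which measures linear association only. As a rough sketch of what that number is, the snippet below computes it by hand and compares it with pandas. The sample values are hypothetical and not taken from powerproduction.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values in the same speed/power layout.
sample = pd.DataFrame({
    "speed": [1.0, 2.0, 3.0, 4.0, 5.0],
    "power": [0.5, 1.9, 3.2, 3.8, 5.1],
})

x = sample["speed"].to_numpy()
y = sample["power"].to_numpy()

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of summed squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as reported by pandas.
r_pandas = sample.corr().loc["speed", "power"]
print(r_manual, r_pandas)
```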


A quick preview shows that the dataset contains 500 rows and two columns, wind speed and power output, given in meters per second (m/s) and kilowatt-hours (kWh) respectively. The dataset contains no standard or non-standard missing values and no negative values. Wind speed and power output are strongly positively correlated (about 0.85). At the tail of the dataset, where wind speed reaches its maximum, the power column contains only zeros. Let's make some plots to identify outliers.
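
Before plotting, one possible way to list the zero readings is to filter on the `power` column. This is only a sketch: the sample rows and the `min_speed` threshold are assumptions made for illustration, not values taken from the dataset or from the analysis that follows.

```python
import pandas as pd

# Hypothetical rows in the same speed/power layout as the dataset.
sample = pd.DataFrame({
    "speed": [0.0, 0.3, 8.0, 12.5, 17.2, 24.9, 25.0],
    "power": [0.0, 0.0, 30.1, 0.0, 95.4, 0.0, 0.0],
})

# Rows where no power output is reported.
zero_power = sample[sample["power"] == 0]

# Assumed threshold: below this speed, zero output is treated as expected.
min_speed = 1.0  # hypothetical value, chosen only for illustration

# Zero readings at higher wind speeds are the candidates worth inspecting.
suspect = zero_power[zero_power["speed"] > min_speed]
print(suspect)
```

On the real data the same filter could be applied to `df` directly, with a threshold chosen after looking at the plots.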