# Datasets
We are going to work with two datasets:

1. Iris dataset.
2. KNMI dataset (weather data from the Netherlands).

The [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) describes iris flowers and their attributes, and it is widely used in data science and machine learning.
We can get it from either of two sources:
1. From scikit-learn.
2. By downloading it from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris).

The [KNMI dataset](https://www.knmi.nl/climatology/daily_data/download.html) contains daily weather measurements from stations across the Netherlands.

## Iris Dataset

### In case you want to play with scikit-learn

scikit-learn comes with a few standard datasets, and iris is one of them.  
It is formatted for direct use by machine learning algorithms.   
This is not covered in this presentation, but for your own reference, here is how to load it from scikit-learn:


In [19]:
from sklearn import datasets
iris_sk = datasets.load_iris()
print ('type of iris: ', type(iris_sk))
print ('dict keys: ' , iris_sk.keys())
print ('type of iris data: ', type(iris_sk.data))
print ('iris column names: ' , iris_sk.feature_names)
print ('type of data: ', type(iris_sk.data))

type of iris:  <class 'sklearn.utils.Bunch'>
dict keys:  dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
type of iris data:  <class 'numpy.ndarray'>
iris column names:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
type of data:  <class 'numpy.ndarray'>


### Load iris from CSV:
We are going to work from the CSV file.

In [20]:
import pandas as pd
iris = pd.read_csv('data/iris.data.csv', header=None, 
    names=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'species'])

In [21]:
iris.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
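
Because the raw file has no header row, `header=None` together with `names` supplies the column labels. Here is a minimal sketch of the same call on an in-memory sample. The two rows are copied from the output above, and `io.StringIO` stands in for the file on disk, so the names `sample` and `iris_sample` are illustrative only:

```python
import io
import pandas as pd

# Two rows in the same header-less layout as iris.data.csv
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)
columns = ['sepal length (cm)', 'sepal width (cm)',
           'petal length (cm)', 'petal width (cm)', 'species']

# header=None tells pandas that the first line is data, not column names
iris_sample = pd.read_csv(sample, header=None, names=columns)
print(iris_sample.shape)
print(iris_sample.dtypes)
```

Without `header=None`, pandas would consume the first data row as the header, silently dropping one observation.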


# Descriptive Statistics with Numpy
`numpy` provides the basics of descriptive statistics.  
`pandas` has its own statistical functions, but they are built on top of `numpy`.  
In this tutorial we will mainly work with `numpy`.

## Get NumPy arrays from a pandas DataFrame:

In [22]:
sepal_widths = iris['sepal width (cm)'].to_numpy()
petal_widths = iris['petal width (cm)'].to_numpy()
petal_lengths = iris['petal length (cm)'].to_numpy()
sepal_lengths = iris['sepal length (cm)'].to_numpy()

## Basic Descriptive Statistics:

In [23]:
import numpy as np
print ('The max of sepal length: ', sepal_lengths.max(), ' -- ', np.max(sepal_lengths))
print ('The min of sepal length: ', sepal_lengths.min(),  ' -- ', np.min(sepal_lengths))
print ('The mean of sepal length: ', sepal_lengths.mean(), ' -- ',  np.mean(sepal_lengths))
print ('The STD of sepal length: ', sepal_lengths.std(),  ' -- ', np.std(sepal_lengths))


The max of sepal length:  7.9  --  7.9
The min of sepal length:  4.3  --  4.3
The mean of sepal length:  5.843333333333334  --  5.843333333333334
The STD of sepal length:  0.8253012917851409  --  0.8253012917851409


### Median and Percentile:

In [24]:
from scipy.stats import scoreatpercentile
print ('The median of sepal length: ', np.median(sepal_lengths))
print ('Percentile or quartile 50%: ', scoreatpercentile(sepal_lengths, 50))
print ('Percentile or quartile 75%: ', scoreatpercentile(sepal_lengths, 75))

The median of sepal length:  5.8
Percentile or quartile 50%:  5.8
Percentile or quartile 75%:  6.4
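
`numpy` also offers percentile functions directly, so `scipy` is not strictly needed for this step. Here is a small sketch on made-up values (the array `values` is illustrative and not taken from the iris data):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative data

median = np.median(values)
q50 = np.percentile(values, 50)             # the 50th percentile is the median
q25, q75 = np.percentile(values, [25, 75])  # several percentiles in one call
iqr = q75 - q25                             # interquartile range

print(median, q50, q25, q75, iqr)
```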


## Pandas statistics:

`pandas` adds its own functionality, built on top of `numpy`.  

### Dataframe's describe:

In [25]:
iris.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
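
Note that `describe()` reports a standard deviation of 0.828066 for sepal length, while `np.std` printed 0.8253 earlier. The two differ in their default `ddof` (delta degrees of freedom): `numpy` divides by `N` (`ddof=0`), whereas `pandas` divides by `N - 1` (`ddof=1`). Here is a short sketch with made-up numbers (the array `data` is illustrative):

```python
import numpy as np
import pandas as pd

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

population_std = np.std(data)        # ddof=0: divide by N
sample_std = np.std(data, ddof=1)    # ddof=1: divide by N - 1
pandas_std = pd.Series(data).std()   # pandas defaults to ddof=1

print(population_std, sample_std, pandas_std)
```

Passing `ddof=1` to `np.std` (or `ddof=0` to the pandas method) makes the two libraries agree.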


### Dataframe's plot:

In [26]:
#ipython magic function
%matplotlib notebook
iris.boxplot(return_type='axes')

<matplotlib.axes._subplots.AxesSubplot at 0x12a7fc4a8>

In [27]:
iris.quantile([0.1, 0.9])

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0.1 | 4.8 | 2.5 | 1.4 | 0.2 |
| 0.9 | 6.9 | 3.61 | 5.8 | 2.2 |


# Dealing with empty / invalid data:

Often the data we get is not ideal and contains empty or invalid values.  
We are going to explore the Netherlands weather data as an example.  

Let us read the weather data and replace empty values with `nan`:

In [28]:
to_float = lambda x: float(x.strip() or np.nan)
min_temp = np.loadtxt('./data/KNMI.txt', delimiter=',', usecols=(12), unpack=True, 
                                converters={12: to_float}, 
                                                 skiprows=1) * .1

print ('min including nan: ', min_temp.min())
print ('max including nan: ', min_temp.max())

min including nan:  nan
max including nan:  nan
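
The `nan` results are expected: `min` and `max` propagate `NaN`, so a single missing value is enough to make the whole result `nan`. The sketch below reproduces this on a tiny in-memory table. The three rows are invented for illustration, so the real KNMI file is not needed:

```python
import io
import numpy as np

# Invented sample: the second row has an empty temperature field
text = "station,temp\n1, 12\n2,   \n3, -5\n"

# Whitespace-only fields become NaN, everything else is parsed as float
to_float = lambda s: float(s.strip() or np.nan)

temps = np.loadtxt(io.StringIO(text), delimiter=',', usecols=1,
                   converters={1: to_float}, skiprows=1, encoding='latin1')

print(temps)        # one entry is NaN
print(temps.min())  # NaN propagates through min()
```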


### Exclude NaN in statistics:

`numpy` provides NaN-aware variants of its statistical functions:

* `nanmin`
* `nanmax`
* `nanmean`
* `nanmedian`
* ...

In [38]:
print ('excluding nan: ')
print ('min', np.nanmin(min_temp))
print ('max', np.nanmax(min_temp))
print ('mean', np.nanmean(min_temp))
print ('median', np.nanmedian(min_temp))

excluding nan: 
min -19.700000000000003
max 23.6
mean 6.627724838066767
median 6.800000000000001


## A more robust approach to excluding invalid data: masked arrays

`nanmin` and its siblings are a shorter, simpler alternative to a more robust way of dealing with invalid data: masked arrays.  
Masked arrays are provided by the sub-module `numpy.ma`.  

In [87]:
import numpy.ma as ma
inp = np.array([1,2,3,-100,5])
negative = lambda x: x < 0
print ('negative of the input array : ', negative(inp))
mask_inp= ma.array(inp, mask = negative(inp))
print ('mean of positive numbers: ', mask_inp.mean())

negative of the input array :  [False False False  True False]
mean of positive numbers:  2.75
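
`numpy.ma` also ships helper constructors that build the mask from a condition, which can replace the explicit `lambda`. Here is a sketch on the same input array, using `ma.masked_less`:

```python
import numpy as np
import numpy.ma as ma

inp = np.array([1, 2, 3, -100, 5])

# Mask every element that is less than 0
masked = ma.masked_less(inp, 0)

print(masked.mask)     # True marks the masked (ignored) element
print(masked.mean())   # mean of the unmasked values only
print(masked.count())  # number of unmasked values
```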


### More features of masked arrays:  

1. With masked arrays we don't need `nanmin` and its siblings.  
We can use the regular `min`, `max`, `mean`, etc.

In [51]:
import numpy.ma as ma

masked_min = ma.array(min_temp, mask=np.isnan(min_temp))
print ('min without using nanmin', masked_min.min())

min without using nanmin -19.700000000000003


2. We can extract only the valid values:

In [54]:
print ('length all : ', len(masked_min))
print ('length of valid: ', len(masked_min.compressed()))

length all :  93272
length of valid:  64224


Another way is to use `masked_invalid`, which masks both `NaN` and infinite values:

In [47]:
# max_temp is loaded from the KNMI file in the "Check for global warming" cells
maxs = ma.masked_invalid(max_temp)
maxs.min()

-13.200000000000001
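
To see what `masked_invalid` does, here is a sketch on a small invented array that contains both `NaN` and an infinite value (the names `raw` and `clean` are illustrative):

```python
import numpy as np
import numpy.ma as ma

raw = np.array([1.0, np.nan, np.inf, -2.0])  # invented sample

clean = ma.masked_invalid(raw)   # masks NaN and +/-inf

print(clean.mask)
print(clean.min(), clean.max())  # computed over the valid entries only
print(clean.compressed())        # plain ndarray holding just the valid values
```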

### Check for global warming:


In [60]:
avg_temp, min_temp, max_temp = np.loadtxt('./data/KNMI.txt', delimiter=',', usecols=(11, 12,14), unpack=True, 
                                converters={11: to_float, 12: to_float, 14: to_float}, 
                                                 encoding='latin1', skiprows=1) * .1

In [61]:
avg_temp = ma.array(avg_temp, mask = np.isnan(avg_temp))
min_temp = ma.array(min_temp, mask = np.isnan(min_temp))
max_temp = ma.array(max_temp, mask = np.isnan(max_temp))

In [57]:
from datetime import datetime as dt
to_year = lambda x: dt.strptime(x, "%Y%m%d").year

years = np.loadtxt('./data/KNMI.txt', delimiter=',', usecols=(1), unpack=True,
                  converters={1: to_year}, encoding='latin1', skiprows=1)

print ('first year: ', years.min(), ' and last year: ', years.max())

first year:  1951.0  and last year:  2019.0


Next, we will draw a graph of the mean temperature by year.

In [58]:
year_range = range(int(years.min()), int(years.max()))

print ('range of years: ', year_range)
avg_of_avg_temp_by_year = [avg_temp[np.where (years == year)].mean() for year in year_range]


range of years:  range(1951, 2019)


#### Another way: 

```python
avg_of_avg_temp_by_year = []
for year in year_range:
    # indices of all measurements taken in this year
    indices = np.where(years == year)
    avg_of_avg_temp_by_year.append(avg_temp[indices].mean())
```
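
The same per-year averaging can also be expressed as a group-by. The sketch below uses `pandas` on invented data (the arrays `years_demo` and `temps_demo` are illustrative). `mean()` skips `NaN` by default, so no explicit mask is needed:

```python
import numpy as np
import pandas as pd

years_demo = np.array([2000, 2000, 2001, 2001, 2001])    # invented sample
temps_demo = np.array([10.0, np.nan, 12.0, 14.0, 16.0])  # NaN marks a missing reading

# Group the temperatures by year and average each group, ignoring NaN
yearly_mean = pd.Series(temps_demo).groupby(years_demo).mean()

print(yearly_mean)
```

This avoids scanning the whole array once per year, which matters for a file with tens of thousands of rows.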

In [59]:
import matplotlib.pyplot as plt
plt.plot(year_range, avg_of_avg_temp_by_year, 'r-', label='Yearly Averages')
plt.plot(year_range, np.ones(len(avg_of_avg_temp_by_year)) * np.mean(avg_of_avg_temp_by_year))
plt.legend(prop={'size': 'x-small'})
plt.show()

# EDA: Beyond Basic Statistics
Descriptive statistics are the basis for deeper, more comprehensive analysis. They are the basic tool of `EDA`, **Exploratory Data Analysis**. EDA helps us understand a dataset better, check its features, and get a preliminary idea about the data.  

## Correlation Coefficient

`numpy` can compute the [correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between two variables:

In [84]:
sun_radiation = np.loadtxt('./data/KNMI.txt', delimiter=',', usecols=(20), unpack=True, 
                                converters={20: to_float}, 
                                                 encoding='latin1', skiprows=1)
sun_radiation = ma.masked_invalid(sun_radiation)
match_temp = avg_temp[np.logical_not(np.isnan(sun_radiation))]
corr = np.corrcoef(match_temp, sun_radiation.compressed())[0][1]
print ('Correlation Coefficient: ', corr)

Correlation Coefficient:  0.6171336988259745
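
`np.corrcoef` does not skip missing values, so both series must be restricted to the positions where each has a valid entry. Here is a minimal sketch of that alignment on invented data (the arrays `x` and `y` are illustrative and are not KNMI measurements):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # invented sample
y = np.array([2.1, np.nan, 6.2, 7.9, 10.1])  # one missing value

# Keep only the positions where both series are valid
valid = ~(np.isnan(x) | np.isnan(y))

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
corr = np.corrcoef(x[valid], y[valid])[0, 1]
print(corr)
```

Building one shared `valid` mask keeps the two arrays the same length and pairs each `x` with its matching `y`.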


## Differentiation (Derivative)

