# Exercise 2: Statistics

## Task 1: Data Analysis with Pandas
Make sure you have installed the pandas package. Download the Iris Plant Dataset from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Iris


In [1]:
import numpy as np
import pandas as pd

### a) Preprocessing and Descriptive Statistics

Read the Iris dataset into a pandas dataframe. Note that you will need to name the columns yourself according to the _attribute information_ on the UCI website above. Print the dataframe and make sure your dataframe has 150 rows.

In [2]:
columns = [
    "sepal length", # in cm
    "sepal width", # in cm
    "petal length", # in cm
    "petal width", # in cm
    "class" # one of: Iris-setosa, Iris-versicolor, Iris-virginica
]

dataFrame = pd.read_csv("./iris.data", header=None, names=columns)

print("Dataframe rows: ", len(dataFrame.index))

dataFrame.head()

Dataframe rows:  150


| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


Use pandas built-in functions to compute the mean, variance, minimum and maximum of the _sepal length_ of all plants in the dataset.

In [3]:
dataFrame["sepal length"].describe()
dataFrame["petal width"].describe()

count    150.000000
mean       1.198667
std        0.763161
min        0.100000
25%        0.300000
50%        1.300000
75%        1.800000
max        2.500000
Name: petal width, dtype: float64

Write a function that takes a (numerical) column of a pandas dataframe as input and computes its median. Use it to compute the median of the _petal width_ and compare it to the output of pandas built-in median function.

In [4]:
import math

def my_median(column):
    # sort_values() returns a new Series, so convert the sorted result to a list
    values = column.sort_values().tolist()
    median_pos = (len(values) - 1) / 2
    start = math.floor(median_pos)
    end = math.ceil(median_pos) + 1

    # one middle value for odd lengths, two for even lengths
    middle = values[start:end]
    return sum(middle) / len(middle)

column = dataFrame["petal width"]

print("built-in median: ", column.median())
print("custom median: ", my_median(column))

built-in median:  1.3
custom median:  1.3


### b) Pearson's Correlation Coefficient

Write a function that takes two (numerical) pandas columns as input and returns their Pearson correlation coefficient. Do not use any pandas/numpy/scipy built-ins.

In [5]:
def pearson (col_a, col_b):
    item_count = len(col_a)
    
    mean_a = sum(col_a) / item_count
    mean_b = sum(col_b) / item_count
        
    deviations_a = list(map(lambda a: a - mean_a, col_a))
    deviations_b = list(map(lambda b: b - mean_b, col_b))
    
    pairs = zip(deviations_a, deviations_b)
    covariance = sum(map(lambda pair: pair[0]*pair[1], pairs)) / item_count  
        
    variance_a = sum(map(lambda a: a**2, deviations_a)) / item_count
    variance_b = sum(map(lambda b: b**2, deviations_b)) / item_count
        
    sigma_a = math.sqrt(variance_a)
    sigma_b = math.sqrt(variance_b)
    
    return covariance / (sigma_a * sigma_b)
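
For reference, the function above appears to implement the usual population form of Pearson's coefficient, written here with $n$ for `item_count` and $\bar{a}$, $\bar{b}$ for the two means (the symbols are chosen for this note only):

$$
r = \frac{\frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{a})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(b_i - \bar{b})^2}}
$$

The factors of $\frac{1}{n}$ cancel, so dividing by $n$ or by $n - 1$ throughout gives the same value of $r$.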

Apply your function to compute the correlation between _sepal length_ and _sepal width_. Check it for correctness by applying the corresponding scipy built-in. 

In [6]:
import scipy.stats

col_a = dataFrame["sepal length"]
col_b = dataFrame["sepal width"]

custom_pearson = pearson(col_a, col_b)
scipy_pearson = scipy.stats.pearsonr(col_a, col_b)

print("Custom Pearson: ", custom_pearson)
print("Scipy Pearson: ", scipy_pearson[0])

Custom Pearson:  -0.10936924995064935
Scipy Pearson:  -0.10936924995064937


### c) Hypothesis Testing

Compute the mean _sepal width_ for all plants that are classed as _Iris-versicolor_ and for all plants that are classed as _Iris-virginica_ .

In [7]:
filter_by_class = dataFrame["class"].isin(["Iris-versicolor", "Iris-virginica"])

dataFrame[filter_by_class]["sepal width"].mean()

2.8719999999999994
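
The task asks for one mean per class, while the value above pools both classes. A minimal sketch of a per-class computation with `groupby` is shown below. The `demo` frame and its numbers are hypothetical placeholders, not Iris measurements.

```python
import pandas as pd

# Hypothetical placeholder values, used only to make the sketch runnable.
demo = pd.DataFrame({
    "sepal width": [3.2, 2.8, 3.0, 2.7],
    "class": ["Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

# Keep the two classes of interest, then average within each class.
selected = demo[demo["class"].isin(["Iris-versicolor", "Iris-virginica"])]
class_means = selected.groupby("class")["sepal width"].mean()
print(class_means)
```

Replacing `demo` with `dataFrame` would give the two class means for the real data.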

Consider the null hypothesis that there is no difference in the means of the groups. Compute the corresponding _p_-value by shuffling the class labels 100,000 times and computing the difference in means each of these times. Do you observe a significant difference?
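
One possible shape for such a permutation test is sketched below, using only the standard library. Shuffling the pooled values and splitting them at the original group size is equivalent to shuffling the labels. The function name, the two-sided comparison, the fixed seed, and the demo values are all assumptions made for this sketch.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=100_000, seed=0):
    """Two-sided permutation test for a difference in means (sketch)."""
    rng = random.Random(seed)
    a, b = list(group_a), list(group_b)
    n_a, n_b = len(a), len(b)
    observed = abs(sum(a) / n_a - sum(b) / n_b)

    pooled = a + b
    total = sum(pooled)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassigns values to the two groups at random
        sum_a = sum(pooled[:n_a])
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # small tolerance so rounding does not hide exact ties
        if diff >= observed - 1e-12:
            hits += 1
    return hits / n_permutations

# Hypothetical sepal widths; fewer permutations keep the demo fast.
demo_versicolor = [2.8, 2.7, 3.0, 2.5, 2.9]
demo_virginica = [3.0, 3.2, 2.9, 3.1, 3.3]
print(permutation_p_value(demo_versicolor, demo_virginica, n_permutations=2_000))
```

For the exercise, the two inputs would be the _sepal width_ columns of the two classes, and the resulting _p_-value would be compared against a chosen significance level such as 0.05.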

### d) The Bootstrap

We consider the _sepal width_ of all plants in the data that are classed as _Iris-setosa_. Compute the 95% confidence interval of their mean by bootstrapping the data 10000 times. 
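
A minimal sketch of a percentile bootstrap for the mean follows, again with the standard library only. The function name, the percentile method, and the fixed seed are assumptions. The demo input is the _sepal width_ of the five rows printed by `head()` above, standing in for the full set of _Iris-setosa_ rows.

```python
import random

def bootstrap_mean_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean (sketch)."""
    rng = random.Random(seed)
    data = list(values)
    n = len(data)

    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(n_resamples))

    # Cut off the lower and upper tails of the bootstrap distribution.
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]

# Stand-in data: sepal widths of the five rows shown by head() above.
demo_setosa = [3.5, 3.0, 3.2, 3.1, 3.6]
lower, upper = bootstrap_mean_ci(demo_setosa)
print(lower, upper)
```

On the real data, `values` would be the _sepal width_ column filtered to `class == "Iris-setosa"`.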

## Task 2: A Dice Game

Consider the following dice game: you roll 5 dice and get points for each die.
For each 1 you get 100 points, for each 6 you get 60 points, and for every other number you get its face value (e.g., 3 points for a 3). Your total score is the sum of these points.

### a) Simulation and Plotting

Simulate the game 100,000 times, and save both every individual die roll and the resulting score for each of the 100,000 rounds. Plot a histogram of the scores.

In [8]:
# use matplotlib/pyplot
from matplotlib import pyplot as plt

ModuleNotFoundError: No module named 'matplotlib'
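
The simulation itself does not need matplotlib. A minimal sketch with the standard library is shown below. The names `POINTS` and `simulate_game` and the fixed seed are choices made for this sketch.

```python
import random
from collections import Counter

# Points per face: 1 -> 100, 6 -> 60, every other face -> its own value.
POINTS = {1: 100, 2: 2, 3: 3, 4: 4, 5: 5, 6: 60}

def simulate_game(n_rounds=100_000, n_dice=5, seed=0):
    """Return every roll and the score of every round."""
    rng = random.Random(seed)
    rolls = [[rng.randint(1, 6) for _ in range(n_dice)] for _ in range(n_rounds)]
    scores = [sum(POINTS[die] for die in roll) for roll in rolls]
    return rolls, scores

rolls, scores = simulate_game()

# Frequency of each score: the data behind a histogram.
score_counts = Counter(scores)
print(score_counts.most_common(5))
```

Once matplotlib is installed, `plt.hist(scores, bins=50)` followed by `plt.show()` would draw the histogram from the same `scores` list.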

### b) Hypothesis Testing pt. 2
Assume that in your initial roll, you scored 268. Is this significantly above what is to be expected? Compute the corresponding _p_-value.
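
One way to read this is as a one-sided test: the _p_-value is the share of simulated rounds that score at least 268. The sketch below is self-contained and therefore repeats a compact simulation. The helper names and the seed are assumptions.

```python
import random

POINTS = {1: 100, 2: 2, 3: 3, 4: 4, 5: 5, 6: 60}

def simulate_scores(n_rounds=100_000, n_dice=5, seed=0):
    """Simulate the game and return the score of each round."""
    rng = random.Random(seed)
    return [
        sum(POINTS[rng.randint(1, 6)] for _ in range(n_dice))
        for _ in range(n_rounds)
    ]

def p_value_at_least(scores, observed):
    """Fraction of simulated scores that are >= the observed score."""
    return sum(score >= observed for score in scores) / len(scores)

scores = simulate_scores()
p_value = p_value_at_least(scores, 268)
print(p_value)
```

The result would then be compared against a chosen significance level such as 0.05.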

### c) Conditional Probability and Bayes' Theorem

Based on your simulation, estimate the probability of scoring over 100 points, given that you did not roll a single 1.

Now estimate the probability of scoring over 100 points, and apply your previous results and Bayes' Theorem to compute the probability of not rolling a 1 given that you score over 100 points.
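
Both estimates can be read off simulated rounds by counting. The following sketch is self-contained, so it repeats a compact simulation. Variable names and the seed are assumptions. It applies Bayes' Theorem in the form P(no 1 | over 100) = P(over 100 | no 1) · P(no 1) / P(over 100).

```python
import random

POINTS = {1: 100, 2: 2, 3: 3, 4: 4, 5: 5, 6: 60}

def simulate(n_rounds=100_000, n_dice=5, seed=0):
    """Return the rolls and the score of every simulated round."""
    rng = random.Random(seed)
    rolls = [[rng.randint(1, 6) for _ in range(n_dice)] for _ in range(n_rounds)]
    scores = [sum(POINTS[die] for die in roll) for roll in rolls]
    return rolls, scores

rolls, scores = simulate()
n = len(scores)

no_one = [1 not in roll for roll in rolls]       # event B: no 1 rolled
over_100 = [score > 100 for score in scores]     # event A: score over 100
both = sum(a and b for a, b in zip(over_100, no_one))

p_no_one = sum(no_one) / n                       # P(B)
p_over = sum(over_100) / n                       # P(A)
p_over_given_no_one = both / sum(no_one)         # P(A | B)

# Bayes' Theorem: P(B | A) = P(A | B) * P(B) / P(A)
p_no_one_given_over = p_over_given_no_one * p_no_one / p_over
print(p_over_given_no_one, p_over, p_no_one_given_over)
```

As a sanity check, the Bayes result should agree with the direct count `both / sum(over_100)` from the same simulation.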