# Practicum 4: Exploring Data



## Task 6: Computing summary statistics using numpy

  - Load the Iris dataset (`data/iris.data`) into a 4x150 numpy array.
  - Complete the tables using numpy.

Hint: you can exploit the fact that the input is ordered by class: the first 50 records are Iris Setosa,
records 51-100 are Iris Versicolor, and records 101-150 are Iris Virginica.

We will use the **csv** module for reading in data from a file.

In [1]:
import csv

It is common to import **numpy** under the briefer name **np**.

In [2]:
import numpy as np

The data set is stored in a comma-separated text file.

We read it and store it as a list of records, where each record is represented using a dict.

In [3]:
def load_iris_data(filename):
    records = []
    with open(filename, 'rt') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        for row in csvreader:
            if len(row) == 5:  # 4 measurements plus the class label; skips blank or malformed lines
                records.append({
                    "sepal_length": float(row[0]),
                    "sepal_width": float(row[1]),
                    "petal_length": float(row[2]),
                    "petal_width": float(row[3]),
                    "class": row[4]
                })
    return records

iris_data = load_iris_data("data/iris.data")
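To see what the length check does, here is a minimal sketch that runs the same parsing logic on an in-memory sample instead of the real file. The helper name `parse_iris_rows` and the sample rows are illustrative assumptions, not part of the practicum.

```python
import csv
import io


def parse_iris_rows(lines):
    """Hypothetical helper: same parsing logic as load_iris_data, but for any iterable of lines."""
    records = []
    for row in csv.reader(lines, delimiter=','):
        if len(row) == 5:  # 4 measurements plus the class label
            records.append({
                "sepal_length": float(row[0]),
                "sepal_width": float(row[1]),
                "petal_length": float(row[2]),
                "petal_width": float(row[3]),
                "class": row[4],
            })
    return records


# Illustrative sample in the same comma-separated format, ending with a blank line.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "\n"
)
sample_records = parse_iris_rows(sample)
print(len(sample_records))         # the blank line is skipped, so 2 records remain
print(sample_records[0]["class"])  # Iris-setosa
```

The blank line is parsed as an empty row, so it fails the `len(row) == 5` check and is dropped rather than raising an error.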

Next, we load the data into a numpy array with one row per attribute and one column per record.

In [4]:
arr = np.array([
    [x['sepal_length'] for x in iris_data],
    [x['sepal_width'] for x in iris_data],
    [x['petal_length'] for x in iris_data],
    [x['petal_width'] for x in iris_data],
], float)
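The layout of `arr` (attributes as rows, records as columns) is what makes the class hint usable: each class is a contiguous block of 50 columns. The sketch below illustrates this indexing on a placeholder array of the same shape. The names `demo`, `class_slices`, and the attribute constants are assumptions introduced for illustration.

```python
import numpy as np

# Stand-in for the real data: 4 attributes x 150 records of placeholder values.
demo = np.arange(4 * 150, dtype=float).reshape(4, 150)

# Row index of each attribute, in the order used to build the array.
SEPAL_LENGTH, SEPAL_WIDTH, PETAL_LENGTH, PETAL_WIDTH = range(4)

# Column ranges per class, relying on the file being ordered by class.
class_slices = {
    "Iris Setosa": slice(0, 50),
    "Iris Versicolor": slice(50, 100),
    "Iris Virginica": slice(100, 150),
}

# Select one attribute for one class with a single 2-D index.
virginica_petal_length = demo[PETAL_LENGTH, class_slices["Iris Virginica"]]
print(demo.shape)                    # (4, 150)
print(virginica_petal_length.shape)  # (50,)
```

Indexing with `demo[row, columns]` is equivalent to the chained form `demo[row][columns]` used in the cells of this practicum.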

### Complete the tables using numpy

| Class | Attribute | Summary statistic | Result |
| --- | --- | --- | --- |
| Iris Setosa | sepal length | mean | 5.01 |
| Iris Virginica | petal length | median | 5.55 |
| Iris Versicolor | sepal width | range | 1.4 |
| All together | sepal length | 70th percentile | 6.3 |
| All together | sepal width | 70th percentile | 3.2 |

  * What is the mean `sepal length` for Iris Setosa?

In [5]:
np.mean(arr[0][0:50])

5.0060000000000002

  * What is the median `petal length` for Iris Virginica?

In [6]:
np.median(arr[2][100:150])

5.5499999999999998

  * What is the range of `sepal width` for Iris Versicolor?

In [7]:
np.ptp(arr[1][50:100])

1.3999999999999999
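`np.ptp` ("peak to peak") returns the difference between the largest and smallest value, which is the range asked for here. A small sketch on made-up values, assuming nothing about the Iris data itself:

```python
import numpy as np

# Illustrative values, not taken from the dataset.
values = np.array([2.0, 3.5, 2.5, 3.0])

value_range = np.ptp(values)                  # peak-to-peak: max minus min
manual_range = values.max() - values.min()    # the same computation written out

print(value_range)   # 1.5
print(manual_range)  # 1.5
```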

  * What is the 70th percentile for `sepal length` and `sepal width` (for all classes together)?

In [8]:
np.percentile(arr[0], 70)

6.2999999999999998

In [9]:
np.percentile(arr[1], 70)

3.2000000000000002
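By default, `np.percentile` interpolates linearly between the two sorted values closest to the requested position. The following sketch reproduces that calculation by hand on a small made-up array; the variable names are illustrative.

```python
import numpy as np

# Illustrative values, not taken from the dataset.
values = np.array([4.0, 1.0, 3.0, 5.0, 2.0])
q = 70

sorted_values = np.sort(values)

# Fractional index of the q-th percentile in the sorted array.
position = (len(sorted_values) - 1) * q / 100
lower = int(np.floor(position))
upper = min(lower + 1, len(sorted_values) - 1)
fraction = position - lower

# Linear interpolation between the two neighbouring sorted values.
manual = sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])
builtin = np.percentile(values, q)

print(np.isclose(manual, builtin))  # True
```

For these five values the position is 2.8, so the result lies 80% of the way from the third sorted value (3.0) to the fourth (4.0).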

  * Which class (Setosa/Versicolor/Virginica) shows the highest variance in `petal width`?

| Class | Sample variance |
| --- | --- |
| Iris Setosa | 0.0115 |
| Iris Versicolor | 0.0391 |
| Iris Virginica | 0.0754 |

In [10]:
np.var(arr[3][0:50], ddof=1)

0.011493877551020411

In [11]:
np.var(arr[3][50:100], ddof=1)

0.03910612244897959

In [12]:
np.var(arr[3][100:150], ddof=1)

0.075432653061224486
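In the table above, Iris Virginica has the largest value. The cells use `ddof=1`, which makes `np.var` divide by n - 1 (the sample variance) instead of n. The sketch below shows the difference on a tiny made-up array, and then one possible way to get all three per-class variances in a single call by reshaping. The synthetic `petal_width_demo` array is an assumption standing in for `arr[3]`.

```python
import numpy as np

# ddof=0 (default) divides by n; ddof=1 divides by n - 1.
values = np.array([1.0, 2.0, 3.0, 4.0])
population_var = np.var(values)
sample_var = np.var(values, ddof=1)
print(population_var)  # 1.25

# Synthetic stand-in for arr[3]: three blocks of 50 values with increasing spread.
spread = np.linspace(-1.0, 1.0, 50)
petal_width_demo = np.concatenate([
    0.2 + 0.1 * spread,
    1.3 + 0.2 * spread,
    2.0 + 0.3 * spread,
])

class_names = ["Iris Setosa", "Iris Versicolor", "Iris Virginica"]

# Reshape to (3 classes, 50 records) and take the variance along each row.
per_class_var = petal_width_demo.reshape(3, 50).var(axis=1, ddof=1)
highest = class_names[int(np.argmax(per_class_var))]
print(highest)  # Iris Virginica (for this synthetic data)
```

The reshape only works because the records are ordered by class and every class has exactly 50 records.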