# Pandas use case:

# Part I - Data Loading, Munging, Missing Values

First, you need to understand the data in more detail, then clean and combine it.

**Try to make as many of your solution cells idempotent as possible.**

# Iris Dataset revisited

Do you remember Bernd the Botanist from the previous lecture? The one with the Iris flowers? If not, go back and re-read this business problem!

Bernd has decided to go ahead and solve his Iris flower classification problem. As there are too many flowers for him to measure all by himself, he has asked a few people to help him (you will find all datasets mentioned below in the folder `data`).

### 1) Mary, the biology student
He asks Mary to measure all the Iris setosa he has in the lab. At the end of the week, Mary provides him with a file `setosa.csv` and mentions "sorry it took so long, but I had to study for my final exam in every break I took". 

### 2) Tom, the gardener
He asks Tom to measure all the Iris versicolor. Tom is a very diligent person and comes back after two days with the file `versicolor.xlsx`, telling Bernd "I did all the measurements for the 50 flowers you asked for, first the sepal length and width in centimeters, and then the petal length and width, which I did in millimeters, as the numbers were quite small. Hope this helps!"

### 3) Angi and Angus, two summer interns
He asks Angi and Angus to measure all the Iris virginica. The two decide to split up the work. They number each flower (from 1 to 50). Angi does the sepal measurements while Angus is responsible for the petal measurements. At the end of the week they give him two files: Angi has done her measurements in cm, starting with plant number 1 and going forward (`virginica angi.txt`), and Angus has done his measurements in mm, starting with plant number 50 and going backwards (`virginica angus.csv`).

### Exercise I.1

Load each of the 4 files into a separate DataFrame, called 'setosa_raw', 'versicolor_raw', 'angi_virginica_raw' and 'angus_virginica_raw'.

In [2]:
# imports
%matplotlib notebook
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

In [3]:
## ---------- SOLUTIONS

In [4]:
## let's look at the files using operating system commands first
# (the following commented out to keep the output concise)
#!ls -la /data/datasets/iris/lecture03
#!cat /data/datasets/iris/lecture03/setosa.csv
# etc

In [5]:
setosa_raw = pd.read_csv('data/setosa.csv', delimiter=';')
setosa_raw

Unnamed: 0,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.9,3.0,1.4,
4,4.6,3.1,1.5,0.2
5,5.0,3.6,1.4,0.2
6,5.4,3.9,1.7,0.4
7,4.6,3.4,1.4,0.3
8,5.0,3.4,1.5,0.2
9,4.6,3.5,,0.1


In [6]:
versicolor_raw = pd.read_excel('data/versicolor.xlsx', header=None, 
                               names=['sepal length', 'sepal width', 'petal length', 
                                      'petal width'])
versicolor_raw

Unnamed: 0,sepal length,sepal width,petal length,petal width
0,7.0,3.2,47,14
1,6.4,3.2,45,15
2,6.9,3.1,49,15
3,5.5,2.3,40,13
4,6.5,2.8,46,15
5,5.7,2.8,45,13
6,6.3,3.3,47,16
7,4.9,2.4,33,10
8,6.6,2.9,46,13
9,5.2,2.7,39,14


In [7]:
angi_virginica_raw = pd.read_csv('data/virginica angi.txt', 
                                delimiter=r'\s+', header=None,
                                index_col=0, names=['sepal length', 'sepal width'])
# header = None is not really needed, but nice to have for documentation
angi_virginica_raw

Unnamed: 0,sepal length,sepal width
1,6.3,3.3
2,5.8,2.7
3,7.1,3.0
4,6.3,2.9
5,6.5,3.0
6,7.6,3.0
7,4.9,2.5
8,7.3,2.9
9,6.7,2.5
10,7.2,3.6


In [8]:
angus_virginica_raw = pd.read_csv('data/virginica angus.csv', 
                                  index_col=0, 
                                  names=['petal length', 'petal width'], header=0)
# header = 0 is needed here: it marks the first row as a header row, which 'names' then replaces
angus_virginica_raw

Unnamed: 0,petal length,petal width
50,51,18
49,54,23
48,52,20
47,50,19
46,52,23
45,57,25
44,59,23
43,51,19
42,51,23
41,56,24
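The effect of `header` and `names` in the two virginica cells can be tried out without the real files. The following is a minimal sketch on in-memory text; the file contents and the extra column name `plant` are made up for illustration and are not taken from the course data.

```python
import io
import pandas as pd

# Hypothetical whitespace-separated content without a header row
# (plant number, sepal length, sepal width).
angi_text = "1  6.3 3.3\n2  5.8 2.7\n"
angi_toy = pd.read_csv(io.StringIO(angi_text), sep=r'\s+', header=None,
                       index_col=0,
                       names=['plant', 'sepal length', 'sepal width'])

# Hypothetical comma-separated content WITH a header row.
angus_text = "plant,pl,pw\n50,51,18\n49,54,23\n"

# header=0: the first row is a header and is replaced by `names`.
with_header = pd.read_csv(io.StringIO(angus_text), header=0, index_col=0,
                          names=['plant', 'petal length', 'petal width'])

# No header argument: because `names` is given, pandas treats every row as
# data, so the old header row ends up as the first data row.
without_header = pd.read_csv(io.StringIO(angus_text), index_col=0,
                             names=['plant', 'petal length', 'petal width'])

print(with_header)
print(without_header)
```

In the second call the header text lands in the data, which also turns the numeric columns into strings. That is the situation `header=0` avoids.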


### Exercise I.2

Convert all measurements which are not in cm to cm (by changing the 'xxx_raw' DataFrames) and combine the DataFrames for Angi and Angus into one DataFrame 'virginica_raw'.

In [9]:
## ---------- SOLUTIONS

In [10]:
versicolor_raw['petal length'] = versicolor_raw['petal length']/10
versicolor_raw['petal width'] = versicolor_raw['petal width']/10

In [11]:
angus_virginica_raw['petal length'] = angus_virginica_raw['petal length']/10
angus_virginica_raw['petal width'] = angus_virginica_raw['petal width']/10
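Note that dividing a column in place is not idempotent: running such a cell twice divides by 100. One possible way around this, sketched here on toy data, is to derive the converted table from an untouched source every time. The helper `to_cm` and the constant `MM_COLUMNS` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for a freshly loaded table whose petal columns are in mm.
loaded = pd.DataFrame({'sepal length': [7.0, 6.4],
                       'petal length': [47, 45],
                       'petal width': [14, 15]})

MM_COLUMNS = ['petal length', 'petal width']  # assumed to be recorded in mm


def to_cm(df, mm_columns=MM_COLUMNS):
    """Return a new DataFrame with the given columns converted from mm to cm."""
    out = df.copy()                           # leave the input untouched
    out[mm_columns] = out[mm_columns] / 10    # 10 mm = 1 cm
    return out


converted = to_cm(loaded)
converted_again = to_cm(loaded)  # rerunning gives the same result
print(converted)
```

Another option is to put the `read_excel` or `read_csv` call and the division into the same cell, so that every run starts again from the file.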

In [12]:
virginica_raw = pd.merge(angi_virginica_raw, angus_virginica_raw, 
                         left_index=True, right_index=True)
virginica_raw

Unnamed: 0,sepal length,sepal width,petal length,petal width
1,6.3,3.3,6.0,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3.0,5.9,2.1
4,6.3,2.9,5.6,1.8
5,6.5,3.0,5.8,2.2
6,7.6,3.0,6.6,2.1
7,4.9,2.5,4.5,1.7
8,7.3,2.9,6.3,1.8
9,6.7,2.5,5.8,1.8
10,7.2,3.6,6.1,2.5
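Merging on the index matches rows by plant number, so it does not matter that one file counts up and the other counts down. A small sketch with made-up numbers illustrates this. The `validate='one_to_one'` argument is an optional extra that is not part of the solution above; it makes pandas raise an error if a plant number occurs more than once on either side.

```python
import pandas as pd

# Toy data: same plant numbers, but listed in opposite order.
sepals = pd.DataFrame({'sepal length': [6.3, 5.8, 7.1]}, index=[1, 2, 3])
petals = pd.DataFrame({'petal length': [5.9, 5.1, 6.0]}, index=[3, 2, 1])

merged = pd.merge(sepals, petals,
                  left_index=True, right_index=True,
                  validate='one_to_one')   # fail loudly on duplicate plant numbers
merged = merged.sort_index()               # make the row order explicit

print(merged)
```

Plant 1 ends up with the petal length that was listed last in `petals`, because rows are paired by index label and not by position.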


### Exercise I.3

Now you should have one DataFrame for each kind of Iris (each 'class'). Let's check each of these for missing values!

* Remove all tuples (rows) with missing values, creating three new DataFrames 'xxx_nmv' (for NoMissingValues). Do not change the 'xxx_raw' DataFrames!
* Replace all missing values with the mean of the attribute, again creating three new DataFrames 'xxx_mmv' (for MeanMissingValues).

Save all three 'xxx_nmv' datasets into three csv files in the 'output' directory.

In [13]:
## ---------- SOLUTIONS

In [14]:
# only setosa contains missing values, the other two are fine
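A statement like the one above can be checked by counting missing entries per column. The sketch below uses a toy frame with made-up values; the same calls could be applied to each of the three DataFrames.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in each column.
toy = pd.DataFrame({'petal length': [1.4, np.nan, 1.3],
                    'petal width': [0.2, 0.2, np.nan]})

missing_per_column = toy.isna().sum()    # number of NaN entries per column
has_missing = toy.isna().any().any()     # True if there is any NaN at all

print(missing_per_column)
print(has_missing)
```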

In [15]:
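# .copy() gives independent DataFrames, so later changes cannot affect the 'xxx_raw' data
versicolor_nmv = versicolor_raw.copy()
versicolor_mmv = versicolor_raw.copy()
virginica_nmv = virginica_raw.copy()
virginica_mmv = virginica_raw.copy()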
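Keep in mind that plain assignment only binds a second name to the same DataFrame, whereas `.copy()` creates independent data. The difference matters because the exercise asks to leave the 'xxx_raw' DataFrames unchanged. A minimal sketch on toy data, with illustrative variable names:

```python
import pandas as pd

raw = pd.DataFrame({'petal width': [14, 15]})

alias = raw                # second name for the very same object
independent = raw.copy()   # separate object with its own data

alias.loc[0, 'petal width'] = 99   # this also changes `raw`

print(raw)           # shows 99 in the first row
print(independent)   # still shows 14 in the first row
```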

In [16]:
setosa_nmv = setosa_raw.dropna()
setosa_nmv

Unnamed: 0,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
4,4.6,3.1,1.5,0.2
5,5.0,3.6,1.4,0.2
6,5.4,3.9,1.7,0.4
7,4.6,3.4,1.4,0.3
8,5.0,3.4,1.5,0.2
10,4.4,2.9,1.4,0.2
11,4.9,3.1,1.5,0.1


In [17]:
setosa_mmv = setosa_raw.fillna(value=setosa_raw.mean())
setosa_mmv

Unnamed: 0,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.9,3.0,1.4,0.25614
4,4.6,3.1,1.5,0.2
5,5.0,3.6,1.4,0.2
6,5.4,3.9,1.7,0.4
7,4.6,3.4,1.4,0.3
8,5.0,3.4,1.5,0.2
9,4.6,3.5,2.012281,0.1
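How `dropna` and `fillna` with column means behave is easiest to see on a tiny frame. The numbers below are made up for illustration. `DataFrame.mean()` skips NaN entries by default, so each gap is filled with the mean of the values that are present in the same column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'petal length': [1.0, np.nan, 3.0],
                    'petal width': [0.2, 0.4, np.nan]})

column_means = toy.mean()            # NaN entries are skipped by default
filled = toy.fillna(column_means)    # each gap gets the mean of its own column
dropped = toy.dropna()               # alternative: discard incomplete rows

print(filled)
print(dropped)
```

Both calls return new DataFrames, so `toy` itself still contains its gaps afterwards.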


In [18]:
import os
os.makedirs('output', exist_ok=True)  # create the target directory if it is missing
setosa_nmv.to_csv('output/setosa_nmv.csv')
versicolor_nmv.to_csv('output/versicolor_nmv.csv')
virginica_nmv.to_csv('output/virginica_nmv.csv')
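By default `to_csv` writes the index as the first column, so a saved file can be read back with `index_col=0`. The following sketch checks such a round trip on toy data. It writes into a temporary directory rather than the real output folder, and the file name `toy_nmv.csv` is made up.

```python
import os
import tempfile
import pandas as pd

# Toy frame with a gap in the index, as left behind by dropna().
frame = pd.DataFrame({'sepal length': [5.1, 4.9]}, index=[0, 2])

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'output')
    os.makedirs(out_dir, exist_ok=True)         # safe to rerun
    path = os.path.join(out_dir, 'toy_nmv.csv')
    frame.to_csv(path)                          # index becomes the first column
    restored = pd.read_csv(path, index_col=0)   # first column becomes the index again

print(restored)
```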

------