# T3 Wrangling Data with Python 

We will work with three very different datasets in today's tutorial:

1. An image (file: `gcamp7.bmp`)
2. A text (file: `gospel-of-buddha.txt`)
3. Tabular data (file: `time-series2.xls`)

We will load each of these datasets, understand the content and structure of the data, and display it in a sensible way. Note that all three accompanying files need to be downloaded from the GitHub repository. 


## Working with an image

Our image is in the bitmap format which has the file ending `.bmp`. We can use `matplotlib` to load the image. 

In [4]:
import numpy as np
import matplotlib.pyplot as plt

img = plt.imread('gcamp7.bmp')

Let's first try to understand the structure of the data. In other words, how is the image information represented in the data structure?
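
One possible way to inspect the structure is sketched below. Because the real file may not be at hand, the sketch uses a small synthetic array called `img_demo` (a hypothetical stand-in, not the actual image); the same attributes can be queried on `img`.

```python
import numpy as np

# Hypothetical stand-in for the loaded image: 4 x 6 pixels, 3 colour channels.
# With the real file, the array returned by plt.imread('gcamp7.bmp') takes its place.
img_demo = np.zeros((4, 6, 3), dtype=np.uint8)
img_demo[:, :, 1] = 50                    # assumption for the demo: signal in one channel only

print(type(img_demo))                     # the container type (a numpy array)
print(img_demo.shape)                     # (rows, columns, channels)
print(img_demo.dtype)                     # data type of each pixel value
print(img_demo.min(), img_demo.max())     # range of values stored in the array
```

The shape tells us how many pixels and channels there are, and the data type together with the minimum and maximum tells us which range of brightness values to expect.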

Let's plot the image. We can use the matplotlib function `plt.imshow()` for this. 
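
A minimal sketch of the plotting step, again on a synthetic stand-in (`img_demo` is hypothetical and only illustrates the call); with the real data, `img` would be passed to `imshow` instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical demo image: a bright rectangle in one colour channel.
img_demo = np.zeros((4, 6, 3), dtype=np.uint8)
img_demo[1:3, 2:5, 1] = 200

fig, ax = plt.subplots()
ax.imshow(img_demo)              # draws the array as an image
ax.set_title('demo image')
plt.show()
```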

Now, let's extract the only channel containing information. Which channel contains the information and how can we access it? Let's save it in a new variable called `img2` and plot it. 
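
One way to find the informative channel is to look at the largest value in each channel, as sketched below. The array `img_demo` and the choice of which channel holds the signal are assumptions made for the demo only; which channel is informative in `gcamp7.bmp` has to be checked on the real data.

```python
import numpy as np

# Hypothetical demo image; in this demo only channel 1 carries signal.
img_demo = np.zeros((4, 6, 3), dtype=np.uint8)
img_demo[1:3, 2:5, 1] = 200

channel_max = img_demo.max(axis=(0, 1))      # largest value in each channel
print(channel_max)

informative = int(np.argmax(channel_max))    # index of the channel with the largest value
img2_demo = img_demo[:, :, informative]      # 2D array holding that channel only
print(informative, img2_demo.shape)
```

A single-channel array like this can be displayed with `plt.imshow(img2_demo, cmap='gray')`.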

We next want to get an idea of the distribution of brightness levels in the new image (`img2`). Let's plot the histogram for that purpose using the matplotlib function `plt.hist()`. Choose an appropriate number of bins with the argument `bins=...` in the `plt.hist()` function. <br> <b>Attention!</b> The image needs to be flattened first (try `img2.flatten()`).
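
A possible sketch of the histogram step follows. The random array `img2_demo` is a hypothetical stand-in for `img2`, and the choice of 64 bins over the range 0 to 256 is just one reasonable option for 8-bit data, not a prescribed value.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for img2: random 8-bit brightness values.
rng = np.random.default_rng(0)
img2_demo = rng.integers(0, 256, size=(50, 60), dtype=np.uint8)

fig, ax = plt.subplots()
# flatten() turns the 2D image into a 1D list of pixel values
counts, bin_edges, _ = ax.hist(img2_demo.flatten(), bins=64, range=(0, 256))
ax.set_xlabel('brightness level')
ax.set_ylabel('number of pixels')
plt.show()
```

Without flattening, `hist` would treat each column of the 2D array as a separate dataset, which is not what we want here.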

Let's extract the brightest regions. What percentage of pixels have brightness levels above 60, for example? 
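
The percentage can be computed with a boolean mask, as in this small sketch. The values in `img2_demo` are made up for illustration.

```python
import numpy as np

# Hypothetical 2 x 3 stand-in for img2.
img2_demo = np.array([[10, 70, 61],
                      [60, 200, 5]], dtype=np.uint8)

bright = img2_demo > 60             # boolean array: True where the pixel exceeds 60
percentage = 100 * bright.mean()    # the mean of booleans is the fraction of True values
print(percentage)
```

Note that a pixel with a value of exactly 60 does not count as "above 60".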

Lastly, let's plot the image after setting the pixels with a brightness level above 60 to 255 and the remaining pixels to 0. 
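
One way to build such a binary image is `np.where`, sketched here on made-up values (`img2_demo` is a hypothetical stand-in for `img2`).

```python
import numpy as np

# Hypothetical 2 x 3 stand-in for img2.
img2_demo = np.array([[10, 70, 61],
                      [60, 200, 5]], dtype=np.uint8)

# 255 where the pixel is above 60, 0 everywhere else
binary = np.where(img2_demo > 60, 255, 0).astype(np.uint8)
print(binary)
```

The result can be shown with `plt.imshow(binary, cmap='gray')`. Using `np.where` leaves the original array untouched, which is convenient if we want to try other thresholds later.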

## Working with texts 

We will work with the text of the book: 
- The Gospel of Buddha, by Paul Carus (Only because I could find that text easily online.)

The code below opens the text file, reads its content and splits the content into individual words based on the whitespace between the words. 

In [218]:
file_path = 'gospel-of-buddha.txt'
with open(file_path, 'r') as file:   # open the text file as read-only; it is closed automatically afterwards
    text = file.read()               # reads the entire text contained in the file
words = text.split()                 # splits the text into individual words, using whitespace as delimiter
words = np.array(words)              # converts the list of words into a numpy array

Let's learn more about the length and content of the book. In which language is the book written? How many words are contained in the book? What is the word at the 20000th position in the book? 
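
The sketch below shows how the length and individual entries of such a word array can be queried. The sentence in `text_demo` is a made-up stand-in for the book text.

```python
import numpy as np

# Hypothetical stand-in for the book text.
text_demo = "the quick brown fox jumps over the lazy dog"
words_demo = np.array(text_demo.split())

print(len(words_demo))      # number of words
print(words_demo[:5])       # a look at the first few words
print(words_demo[3])        # the 4th word, because indexing starts at 0
```

Keep the zero-based indexing in mind: if we count positions starting from one, the word at the 20000th position of `words` is `words[19999]`.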

Let's extract some statistics of the text. How often do the words `'are'`, `'mother'` and `'a'` occur in the text? 
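
Counting a given word can be done by comparing the array with that word and summing the matches. The following sketch uses a made-up word list (`words_demo`) in place of the book.

```python
import numpy as np

# Hypothetical stand-in for the word array of the book.
words_demo = np.array("a rose is a rose is a rose".split())

targets = ['a', 'rose', 'mother']
# words_demo == t is a boolean array; summing it counts the exact matches
counts_demo = {t: int(np.sum(words_demo == t)) for t in targets}
print(counts_demo)
```

The comparison is exact: it is case-sensitive, and a word with attached punctuation (such as `'mother,'`) is not counted as `'mother'`.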

Let's do this a bit more systematically and find the most common word in the text. This can be implemented with numpy functions as shown below. Let's understand what each line is doing. 

In [None]:
unique, pos = np.unique(words, return_inverse=True)  # sorted unique words, plus for each word in the text the index of its entry in `unique`
counts = np.bincount(pos)                            # counts how often each index (i.e. each unique word) occurs
maxpos = counts.argmax()                             # index of the word with the most occurrences
print(unique[maxpos], ':', counts[maxpos])
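
To see what the intermediate arrays contain, it can help to run the same steps on a tiny made-up example. The names ending in `_demo` are hypothetical and chosen so that the variables of the notebook are not overwritten.

```python
import numpy as np

words_demo = np.array(['b', 'a', 'b', 'c', 'b', 'a'])

unique_demo, pos_demo = np.unique(words_demo, return_inverse=True)
print(unique_demo)     # sorted unique words: a, b, c
print(pos_demo)        # index into unique_demo for every word: 1 0 1 2 1 0

counts_demo = np.bincount(pos_demo)
print(counts_demo)     # occurrences of a, b, c: 2 3 1

maxpos_demo = counts_demo.argmax()
print(unique_demo[maxpos_demo], ':', counts_demo[maxpos_demo])
```

If several words share the highest count, `argmax` returns the first of them in the sorted order of `unique_demo`.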

## Working with tabular data using pandas 

We will learn how to use the `pandas` library. `pandas` is a powerful package for working with time series, and the section below gives a first introduction to it. 

Note that you need the `time-series2.xls` file from the GitHub repository to complete the exercises below. The file has to be located in the same directory as this notebook.

### Load data 
First we will load the data saved in an Excel file using the pandas `read_excel` function.

In [3]:
import pandas as pd

data2 = pd.read_excel('time-series2.xls',sheet_name='NZRainfall',index_col='DATE') # load excel spreadsheet

Let's convert the `DATE` column, which we use as the index, into the date-time format recognized by pandas. 

In [49]:
data2.index = pd.to_datetime(data2.index) # convert the index column to the date format

### Get an idea about the data

We already have an idea from the sheet name what the data is about. Let's find out more about the data. 

- What does the data look like? What information is contained in the data? 
- What is the interval/sampling frequency of the data? 
- What are the dimensions of the `DataFrame`? 
- Get the statistics of the data by using the pandas `[name of the DataFrame].describe()` function. 

**Hint:** You can simply display the `DataFrame` on the screen, or see the first lines with `data2.head()`. 

The table is a 2D array (3 columns and 154 rows/entries) of time stamps and rainfall values for three towns in New Zealand - Auckland, Christchurch, Wellington. The rainfall is measured monthly (the interval is 1 month) and given for the period from Jan 2000 through Dec 2012. 
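
The calls used to answer these questions can be sketched on a small synthetic table. `rain_demo` is a hypothetical stand-in for `data2`: the column names are assumed from the town names above and the values are random, so only the calls, not the numbers, carry over to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data2: 24 monthly values for three towns.
dates = pd.date_range('2000-01-01', periods=24, freq='MS')    # month-start time stamps
rng = np.random.default_rng(0)
rain_demo = pd.DataFrame(rng.uniform(0, 200, size=(24, 3)),
                         index=dates,
                         columns=['Auckland', 'Christchurch', 'Wellington'])
rain_demo.index.name = 'DATE'

print(rain_demo.head())                            # first five rows
print(rain_demo.shape)                             # (number of rows, number of columns)
print(rain_demo.index[1] - rain_demo.index[0])     # spacing between the first two time stamps
print(rain_demo.describe())                        # count, mean, std, min, quartiles, max per column
```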

### Plotting and slicing the data 

`pandas` has a built-in plot function which is called by `[name of the DataFrame].plot()`. 

- Plot the entire data.
- Plot the data for the year 2004.
- Plot the data from the year 2006 through 2011. 
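
A sketch of how these three plots could be produced is given below. It relies on the fact that a `DataFrame` with a date-time index can be sliced with date strings through `.loc`. The table `rain_demo` is again a hypothetical stand-in for `data2`, with assumed column names and random values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for data2: monthly values from Jan 2000 through Dec 2012.
dates = pd.date_range('2000-01-01', '2012-12-01', freq='MS')
rng = np.random.default_rng(1)
rain_demo = pd.DataFrame(rng.uniform(0, 200, size=(len(dates), 3)),
                         index=dates,
                         columns=['Auckland', 'Christchurch', 'Wellington'])

year_2004 = rain_demo.loc['2004']           # all rows whose time stamp lies in 2004
span = rain_demo.loc['2006':'2011']         # both end years are included in the slice

rain_demo.plot(title='all data')
year_2004.plot(title='2004')
span.plot(title='2006 through 2011')
plt.show()
```

Unlike ordinary Python slices, label-based slices with `.loc` include the end point, so `'2006':'2011'` covers all of 2011.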

Next, let's access individual parts of the data. Let's plot the rainfall data of `Christchurch` only. 

Lastly, extract and show all the months for which the rainfall in Christchurch exceeded 100 mm.
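
Selecting one column and filtering rows by a condition can be sketched as follows. The table `rain_demo` and its values are made up for illustration, and the column name `'Christchurch'` is an assumption about how the column is labelled in the spreadsheet.

```python
import pandas as pd

# Hypothetical stand-in for data2 with six monthly entries.
dates = pd.date_range('2000-01-01', periods=6, freq='MS')
rain_demo = pd.DataFrame({'Auckland':     [80.0, 120.0, 95.0, 60.0, 130.0, 70.0],
                          'Christchurch': [40.0, 101.5, 100.0, 150.2, 20.0, 99.9],
                          'Wellington':   [110.0, 90.0, 75.0, 140.0, 55.0, 85.0]},
                         index=dates)

chch = rain_demo['Christchurch']                 # a single column (a pandas Series)
wet = rain_demo[rain_demo['Christchurch'] > 100] # rows where the condition is True
print(wet)
```

The single column can be plotted with `chch.plot()`. The expression inside the brackets is a boolean Series, and indexing the `DataFrame` with it keeps only the rows marked `True`, so a month with exactly 100 mm is not included.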

## The end