# Session 03: Working with Image Corpora

As this workshop is about image data, we wanted to start with images
in Python as soon as possible. Here we show how to organize a corpus 
of images and how to read data about images into Python.

## Setup

We need to load several Python modules that provide functionality used
throughout this tutorial. We will also set up some default parameters
that make the graphical output easier to look at. **Make sure you run this block
of code prior to proceeding.**

In [1]:
%pylab inline

import numpy as np
import scipy as sp
import pandas as pd
import urllib.request  # import the submodule explicitly; urlretrieve is used later

import os
from os.path import join

Populating the interactive namespace from numpy and matplotlib


In [2]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches

plt.rcParams["figure.figsize"] = (8,8)

In [3]:
imread

<function matplotlib.pyplot.imread(fname, format=None)>

### Loading an image into Python

We are now ready to read an image file into Python. We have several corpora that we
will be working with, but for now let's just read in a test
image I took of a teapot in my kitchen at home. To do this, we need to tell
Python where the image is: it's in a directory called 'test', which is inside
a directory called 'images', and the file is called 'teapot.jpg'. Once we have
the filename, we can read the image into Python with the function `imread`
as follows:

In [None]:
img = imread(join("..", "images", "test", "teapot.jpg"))
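As a small aside on how the path above is assembled: `join` glues the directory names together with the separator used by your operating system. The sketch below only builds and inspects the path; it does not read the image. The `exists` check is an addition for illustration and is not part of the original workflow.

```python
import os
from os.path import join, exists

# Build the relative path used above, one directory name at a time.
teapot_path = join("..", "images", "test", "teapot.jpg")

# The separator depends on the operating system ("/" or "\").
print(teapot_path)

# Checking for the file first gives a clearer message than the error
# raised when an image cannot be found.
if not exists(teapot_path):
    print("File not found; current working directory is:", os.getcwd())
```

If the file is reported as missing, the notebook is probably running from a different directory than the one the relative path assumes.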

There is now an object in Python called `img` that contains all of the data that
describes my image of a teapot. We can have Python display the image itself by
calling the function `plt.imshow` on the image, as follows:

In [None]:
plt.imshow(img)

It may not look like much, but trust me, it makes delicious tea.

### Building a corpus with metadata

In the next session we will dive into how images are actually represented
in Python and what we can and want to do with image data. Now, though, let's
talk a bit about how one should organize a corpus of images.

To start, create a subfolder in the images directory called 'example'. Then,
go find 3-5 images on [Wikipedia](https://www.wikipedia.org/). Please do not
pick an SVG image (more on this later). Make sure to keep the pages for these
files open, as we will need the information on them.
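If you prefer to create the subfolder from Python rather than from your file browser, a minimal sketch follows. The helper name `ensure_dir` is hypothetical; it simply wraps `os.makedirs`.

```python
import os

def ensure_dir(path):
    """Create the directory, and any missing parents, if it does not exist."""
    os.makedirs(path, exist_ok=True)  # exist_ok avoids an error on re-runs
    return path

# In this notebook the call would be:
# ensure_dir(os.path.join("..", "images", "example"))
```

Because of `exist_ok=True`, running the cell a second time leaves an existing directory and its contents untouched.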

Next, open a new spreadsheet in your favorite spreadsheet program (Excel, OpenOffice,
Google Sheets, etc.). Write the following column names in the very first row,
taking care not to leave any blank rows or columns:

- **page**
- **description**
- **date**
- **source**
- **author**
- **url**
- **filename**

Fill in the first six fields using the information on each image's page. Then,
create a filename for each image. Filenames may consist of only letters,
numbers, dashes, and underscores (no spaces). Make sure to include
a file extension that matches the original image file.
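The naming rule above can be checked mechanically. The following is a rough sketch using a regular expression; the function name `is_valid_filename` and the sample names are hypothetical, and the pattern assumes a single alphanumeric extension such as `jpg` or `png`.

```python
import re

# Letters, digits, dashes, and underscores, followed by a dot and an
# alphanumeric extension.
VALID_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")

def is_valid_filename(name):
    """Return True if `name` follows the naming rule described above."""
    return VALID_NAME.fullmatch(name) is not None

# Hypothetical sample names: two valid, one with a space, one without
# an extension.
for name in ["eiffel_tower.jpg", "eiffel tower.jpg", "teapot-01.png", "teapot"]:
    print(name, is_valid_filename(name))
```

The check does not confirm that the extension matches the original image file; that still has to be verified by eye.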

When you are done, export the data as a comma-separated values (CSV) file named
`example.csv`. Save the file in the directory `data`.
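To give a sense of what the exported file should look like, here is a sketch that writes one row with Python's built-in `csv` module. Every value in the row is a made-up placeholder (the `example.org` addresses do not point to real images); only the column names come from the list above.

```python
import csv
import io

COLUMNS = ["page", "description", "date", "source", "author", "url", "filename"]

# Hypothetical placeholder row; replace with the details of your own image.
rows = [
    {
        "page": "https://example.org/wiki/File:Placeholder.jpg",
        "description": "A placeholder, with a comma",
        "date": "2000-01-01",
        "source": "Own work",
        "author": "Jane Doe",
        "url": "https://example.org/images/placeholder.jpg",
        "filename": "placeholder.jpg",
    },
]

# Write to an in-memory buffer so that we can inspect the text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()     # first line: the column names
writer.writerows(rows)   # one line per image
csv_text = buffer.getvalue()
print(csv_text)
```

Note that the description, which contains a comma, is wrapped in double quotes in the output. Spreadsheet programs apply the same quoting when exporting, which is why commas inside a field do not break the file.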

### Reading csv files with pandas

We are now going to use the pandas library to read in the CSV file that you
created. We will read the file into a Python object called `df`.

In [None]:
df = pd.read_csv(join("..", "data", "example.csv"))

We can print out the pandas data frame object by placing its name on a
line by itself at the end of a cell.

In [None]:
df

There is a particular reason that we included columns for the URL and filename.
In general, the other fields are flexible and you can modify them to suit the needs
of your corpus.

It will be useful to grab certain parts of this object. To access a particular
column, follow the name of the dataset with a dot and the name of the column:

In [None]:
df.page

To get a specific row of the column, follow this with square brackets and
the number of the row that you want. Note that Python starts numbering at zero
rather than one.

In [None]:
df.page[2] 
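The same values can be reached in a few equivalent ways. The sketch below uses a small stand-in table called `demo` (hypothetical placeholder values, not your `example.csv`) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the metadata table.
demo = pd.DataFrame({
    "page": ["page-a", "page-b", "page-c"],
    "filename": ["a.jpg", "b.jpg", "c.jpg"],
})

# Dot access and bracket access return the same column. Bracket access
# also works for column names that contain spaces.
same_column = demo.page.equals(demo["page"])

# With the default row labels (0, 1, 2, ...), these two agree:
third_by_label = demo.page[2]             # looks up the row labelled 2
third_by_position = demo["page"].iloc[2]  # looks up the third row

print(same_column, third_by_label, third_by_position)
```

The distinction between label and position only matters once rows have been filtered or reordered; for a freshly read CSV file the two forms give the same result.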

Try to access the url of the first image in the corpus:

### Hydrating the corpus (Part 1)

Currently we have only metadata about the images, not copies of the
images themselves. We need to download these from the URLs given in the
data. This process is called *hydration*. It is a useful way of sharing
some datasets because (1) it avoids issues with the terms of service (ToS) of
some websites, and (2) it drastically reduces the size of the files that need
to be shared with others.

Let's start by seeing how to download the first image in the data. You already
saw how to get the URL of the file; where do we want to save the downloaded file?
Let's put it in our `example` directory and name it according to the filename
column in the metadata.

In [None]:
output_file = join("..", "images", "example", df.filename[0])
output_file

Now, to actually download the file run the following:

In [None]:
urllib.request.urlretrieve(df.url[0], output_file)

Go into the directory and confirm that the image was downloaded.
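If the download fails with an `HTTPError` such as 403 Forbidden, one possible cause is that the server rejects requests sent with Python's default `User-Agent` header. A workaround that may help, sketched below, is to install an opener that sends a descriptive header. The function name `install_user_agent` and the identifier string are hypothetical; `build_opener` and `install_opener` are part of `urllib.request`.

```python
import urllib.request

def install_user_agent(user_agent):
    """Make later urllib.request calls send the given User-Agent header."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    urllib.request.install_opener(opener)  # used by urlretrieve from now on
    return opener

# Hypothetical identifier; describe your own project and contact details.
opener = install_user_agent("image-corpus-workshop/0.1 (you@example.org)")
```

After this cell has been run once, calls to `urllib.request.urlretrieve` in the same session go through the installed opener.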

### Hydrating the corpus (Part 2)

We now want to repeat the process above for each row in the dataset. Here,
with just three rows, we could manually change the zeros to ones, re-run
the code, change the ones to twos, and then be finished. In general, though,
we need a better way of cycling through every row in the dataset. To do
this, we need a `for` loop.

In a loop, we write code that will get run multiple times. On each execution
one variable, called the index, will change. Here's a simple example that prints
out every number from 0 through 6:

In [None]:
for i in range(7):
    print(i)

Notice that the code inside the loop is indented by four spaces (the notebook
will help you by inserting the spaces whenever you press Tab) and that the value
of `i` changes each time the code is run. Here's another example that prints out
the description of each row in our metadata:

In [None]:
for i in range(df.shape[0]):
    print(df.description[i])
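The loop above relies on `df.shape`, which holds the number of rows and the number of columns, so `df.shape[0]` is the row count. The sketch below shows this on a small stand-in table (`demo`, with hypothetical values) and compares looping over row numbers with looping over the column directly.

```python
import pandas as pd

# Hypothetical stand-in for the metadata table.
demo = pd.DataFrame({"description": ["first", "second", "third"]})

# shape is (number of rows, number of columns).
n_rows = demo.shape[0]

# Loop over row numbers, as in the cell above.
by_index = []
for i in range(n_rows):
    by_index.append(demo.description[i])

# Loop over the values of the column directly.
direct = []
for text in demo.description:
    direct.append(text)

print(n_rows, by_index, direct)
```

Both loops visit the same values in the same order. Looping over row numbers is handy when, as in the download step, you need values from two columns of the same row.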

Now, try to download images for all rows in your dataset using a loop in Python:

In [None]:
for i in range(df.shape[0]):
    output_file = join("..", "images", "example", df.filename[i])
    urllib.request.urlretrieve(df.url[i], output_file)
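With a larger corpus, a single broken URL would stop the loop above partway through. A somewhat more defensive variant is sketched here, under the assumption that skipping files that already exist and collecting failures is the behavior you want. The function name `hydrate` is hypothetical, and it takes plain (url, filename) pairs so that it does not depend on pandas.

```python
import os
import urllib.request

def hydrate(pairs, out_dir):
    """Download each (url, filename) pair into out_dir.

    Files that already exist are skipped. Failures are collected and
    returned as (url, message) tuples instead of stopping the loop.
    """
    os.makedirs(out_dir, exist_ok=True)
    failures = []
    for url, filename in pairs:
        output_file = os.path.join(out_dir, filename)
        if os.path.exists(output_file):
            continue  # already downloaded on a previous run
        try:
            urllib.request.urlretrieve(url, output_file)
        except (OSError, ValueError) as err:
            # OSError covers urllib's URLError and HTTPError;
            # ValueError is raised for malformed URLs.
            failures.append((url, str(err)))
    return failures
```

In this notebook it could be called as `failures = hydrate(zip(df.url, df.filename), join("..", "images", "example"))`, after which `failures` lists any rows that need attention. One caveat: a download that is interrupted midway can leave a partial file behind, which a later run would skip, so it is worth deleting any suspiciously small files before re-running.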