In [3]:
from dstaster import *

<h2>Loading the dataset</h2>

The following code cell loads a dataset and stores it in the variable `collection`. The data is held in a `DataFrame`, a table-like data structure provided by the pandas library (abbreviated in the code as `pd`).

The second line passes the DataFrame to Jupyter's `display(...)` function, which shows a pretty-printed excerpt of the dataset. If you run the cell, you should see a table with six columns (artist, title, year, groundtruth, height and width).

In [4]:
collection = pd.read_csv("../tate/paintings.csv", index_col=0)
display(collection)

Unnamed: 0,artist,title,year,groundtruth,height,width
T13896,John Constable,Salisbury Cathedral from the Meadows,1831,L,1537,1920
T05010,Pablo Picasso,Weeping Woman,1937,O,608,500
N05915,Pablo Picasso,Bust of a Woman,1909,P,727,600
N00530,Joseph Mallord William Turner,Snow Storm - Steam-Boat off a Harbour’s Mouth,1842,L,914,1219
T00598,Richard Dadd,The Fairy Feller’s Master-Stroke,1855,O,540,394
...,...,...,...,...,...,...
N05609,Maurice Sterne,Mexican Church Interior,1934,O,1283,1022
T14823,Unknown artist,Leon Trotsky,1980,P,510,480
AL00397,Louise Bourgeois,Untitled,1946,O,660,1116
T14824,Unknown artist,Leon Trotsky,1980,P,638,511
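
To see what `index_col=0` does, we can load a tiny CSV from a string instead of a file. This is only a sketch: the variable names `csv_text` and `sample` are made up for illustration, and the three rows are copied from the table above rather than read from `paintings.csv`.

```python
import io
import pandas as pd

# Three rows in the same layout as the table above; the first
# (unnamed) column holds the row identifiers.
csv_text = """,artist,title,year,groundtruth,height,width
T13896,John Constable,Salisbury Cathedral from the Meadows,1831,L,1537,1920
T05010,Pablo Picasso,Weeping Woman,1937,O,608,500
N05915,Pablo Picasso,Bust of a Woman,1909,P,727,600
"""

# index_col=0 uses the first column as the row index
# instead of treating it as a regular data column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # the six data columns
print(list(sample.index))    # the row labels
```

Without `index_col=0`, the identifiers would show up as a seventh data column and the rows would be labelled 0, 1, 2 instead.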


Pandas DataFrames are powerful data structures with a lot of built-in functionality. For example, we can ask a DataFrame to compute common statistics for all numerical columns: calling `collection.describe()` returns a new DataFrame containing a summary of `collection`. Note that we can omit the call to `display`: Jupyter automatically displays the value of the last expression in a cell.

In [5]:
collection.describe()

Unnamed: 0,year,height,width
count,2158.0,2158.0,2158.0
mean,1873.828082,960.444856,1026.646895
std,76.739168,529.841346,642.269151
min,1594.0,137.0,102.0
25%,1824.0,610.0,616.0
50%,1889.5,813.0,893.5
75%,1934.0,1219.0,1232.0
max,2017.0,4285.0,8915.0
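
Because the summary is itself a DataFrame, individual statistics can be looked up by row and column label. The following sketch uses a small made-up table (`toy` is a hypothetical stand-in, not the Tate data) so that the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns.
toy = pd.DataFrame({
    "artist": ["A", "B", "C", "D"],
    "height": [100, 200, 300, 400],
    "width": [50, 150, 250, 350],
})

summary = toy.describe()

# By default only the numeric columns are summarized.
print(list(summary.columns))

# .loc[row_label, column_label] picks out a single statistic.
print(summary.loc["mean", "height"])  # 250.0
print(summary.loc["50%", "width"])    # 200.0 (the median)
```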


<h2>Working with columns</h2>

Each column of the DataFrame can be accessed individually using the index brackets `[]`. For example, `collection['artist']` gives us the artist column, `collection['year']` the year column, and so on.

<div class="note">Note: A single column of a DataFrame is a data structure called a Series, so its representation in the notebook looks slightly different.</div>

<div class="task">
    <div class="no">1</div>
    <div class="text">
        Change the index string in the following cell to
        values other than <code>'artist'</code> and observe how 
        the output changes.
    </div>
</div>

In [6]:
collection['artist']

T13896                    John Constable
T05010                     Pablo Picasso
N05915                     Pablo Picasso
N00530     Joseph Mallord William Turner
T00598                      Richard Dadd
                       ...              
N05609                    Maurice Sterne
T14823                    Unknown artist
AL00397                 Louise Bourgeois
T14824                    Unknown artist
T14825                    Unknown artist
Name: artist, Length: 2158, dtype: object
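
The difference between a Series and a DataFrame can be checked directly. In this sketch, `toy` is a hypothetical two-row table; passing a single column name returns a Series, while passing a list of names returns a DataFrame, even if the list has only one entry.

```python
import pandas as pd

# Hypothetical two-row table using identifiers from the data above.
toy = pd.DataFrame(
    {"artist": ["John Constable", "Pablo Picasso"], "year": [1831, 1937]},
    index=["T13896", "T05010"],
)

column = toy["artist"]   # a single name: returns a Series
table = toy[["artist"]]  # a list of names: returns a one-column DataFrame

print(type(column).__name__)  # Series
print(type(table).__name__)   # DataFrame

# The Series keeps the row labels, so entries can be looked up by label.
print(column["T05010"])       # Pablo Picasso
```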

Pandas Series also come equipped with a number of methods that allow us to quickly compute various statistics. Let's say we want to find out which artist appears most often in our dataset. We can use the method `.value_counts()` to obtain a count for every unique entry in `collection['artist']`. The output is already sorted from high to low, so we can read off the most common artist at the top:

In [24]:
collection['artist'].value_counts()

Joseph Mallord William Turner    240
John Constable                    34
John Singer Sargent               32
Sir Joshua Reynolds               30
Thomas Gainsborough               25
                                ... 
Auguste Renoir                     1
Hugh Carter                        1
Edward Bower                       1
Maurice de Vlaminck                1
George Smith of Chichester         1
Name: artist, Length: 869, dtype: int64
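
Since the result of `.value_counts()` is again a Series, we can pick out the top entry programmatically instead of reading it off the output. The sketch below uses a short made-up list of names (`artists` is hypothetical); `normalize=True` is a standard option that returns shares instead of absolute counts.

```python
import pandas as pd

# Hypothetical list of six entries with three distinct names.
artists = pd.Series(
    ["Turner", "Constable", "Turner", "Picasso", "Turner", "Constable"]
)

counts = artists.value_counts()

print(counts.index[0])  # the most frequent entry: Turner
print(counts.iloc[0])   # how often it occurs: 3
print(counts.head(2))   # the two most frequent entries

# normalize=True returns relative frequencies instead of counts.
shares = artists.value_counts(normalize=True)
print(shares["Turner"])  # 0.5
```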

The <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html">complete list of available methods</a> to aggregate and modify pandas Series is quite long! A few methods we are interested in are `.sum()`, `.count()`, `.mean()`, `.min()`, `.median()`, and `.max()`.

<div class="task">
    <div class="no">2</div>
    <div class="text">
        Change <code>.sum()</code> in the following cell to each of the
        methods mentioned above and observe the resulting output.
    </div>
</div>

In [32]:
collection['width'].sum()

2215504
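
To get a feeling for these methods before trying them on the real data, here is a sketch on a small made-up Series (`widths` is hypothetical). One entry is deliberately left missing to show that `.count()` only counts the values that are present, whereas `len(...)` counts all entries.

```python
import pandas as pd

# Hypothetical widths; the last entry is missing on purpose.
widths = pd.Series([406, 375, 610, 356, None])

print(len(widths))      # 5 entries in total
print(widths.count())   # 4 entries that are not missing
print(widths.sum())     # 1747.0
print(widths.mean())    # 436.75
print(widths.min())     # 356.0
print(widths.median())  # 390.5
print(widths.max())     # 610.0
```

By default, the aggregation methods skip missing values, which is why the mean is computed over four entries rather than five.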

<h2>Querying data</h2>

We can also write basic queries to filter the data. The following cell creates a <i>boolean index</i>: for every entry of the year column, it checks whether the value is equal to 1900.

In [34]:
idx = collection['year'] == 1900
idx # Equivalent to display(idx)

T13896     False
T05010     False
N05915     False
N00530     False
T00598     False
           ...  
N05609     False
T14823     False
AL00397    False
T14824     False
T14825     False
Name: year, Length: 2158, dtype: bool

The result is a Series object that contains exactly as many entries as the series `collection['year']` (it also shares its <em>index</em>, which is displayed on the left-hand side of the output). This series contains the value `True` at every position where the corresponding entry in `collection['year']` is equal to 1900, and the value `False` everywhere else.
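
A small sketch makes these properties concrete. The Series `years` and its labels are made up for illustration. Since `True` counts as 1 and `False` as 0, calling `.sum()` on a boolean Series tells us how many entries matched.

```python
import pandas as pd

# Hypothetical years with made-up row labels.
years = pd.Series([1831, 1900, 1909, 1900], index=["a", "b", "c", "d"])

mask = years == 1900

print(mask.tolist())                   # [False, True, False, True]
print(mask.index.equals(years.index))  # True: same index as the input
print(mask.sum())                      # 2 entries are equal to 1900
```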

If we pass such a series of True/False to a DataFrame using the index brackets `[]`, the result is a new DataFrame which contains all rows for which the corresponding value was `True`:

In [35]:
collection[idx]

Unnamed: 0,artist,title,year,groundtruth,height,width
N01839,Sir Frank Dicksee,The Two Crowns,1900,O,2311,1823
N02940,Sir William Orpen,The Mirror,1900,P,508,406
N05261,Philip Wilson Steer,Seated Nude: The Black Hat,1900,P,508,406
N05905,Henri Matisse,Notre-Dame,1900,L,460,375
N01772,Ralph Peacock,The Sisters,1900,P,1675,1275
N01901,Sir James Jebusa Shannon,The Flower Girl,1900,P,838,660
N04917,Sir Alfred East,Golden Autumn,1900,L,1225,1530
N01838,Harry William Adams,Winter’s Sleep,1900,L,1226,1841
N06080,Ambrose McEvoy,"Bessborough Street, Pimlico",1900,O,457,356
T03648,William Evelyn Osborn,"Royal Avenue, Chelsea",1900,O,505,610
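
The same filtering step can be sketched on a three-row table (`toy` is a hypothetical excerpt built from rows shown earlier). Note that the filtered result keeps the original row labels and that the original DataFrame is left unchanged.

```python
import pandas as pd

# Hypothetical three-row excerpt.
toy = pd.DataFrame(
    {
        "title": ["Notre-Dame", "Snow Storm", "The Mirror"],
        "year": [1900, 1842, 1900],
    },
    index=["N05905", "N00530", "N02940"],
)

mask = toy["year"] == 1900
selected = toy[mask]  # a new DataFrame with only the matching rows

print(list(selected.index))  # ['N05905', 'N02940']
print(len(selected))         # 2, the number of True values in mask
print(len(toy))              # 3, the original table is untouched
```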


<div class="task">
    <div class="no">3</div>
    <div class="text">
        Try changing the boolean statement 
        on the right-hand side of the first line by replacing the 
        equality operator <code>==</code> with the less-than operator 
        <code>&lt;</code> or the greater-than operator <code>&gt;</code>.
        What output do you expect? What output do you see?
    </div>
</div>

In [12]:
idx = collection['year'] == 1900
collection[idx]

Unnamed: 0,artist,title,year,groundtruth,height,width
N01839,Sir Frank Dicksee,The Two Crowns,1900,O,2311,1823
N02940,Sir William Orpen,The Mirror,1900,P,508,406
N05261,Philip Wilson Steer,Seated Nude: The Black Hat,1900,P,508,406
N05905,Henri Matisse,Notre-Dame,1900,L,460,375
N01772,Ralph Peacock,The Sisters,1900,P,1675,1275
N01901,Sir James Jebusa Shannon,The Flower Girl,1900,P,838,660
N04917,Sir Alfred East,Golden Autumn,1900,L,1225,1530
N01838,Harry William Adams,Winter’s Sleep,1900,L,1226,1841
N06080,Ambrose McEvoy,"Bessborough Street, Pimlico",1900,O,457,356
T03648,William Evelyn Osborn,"Royal Avenue, Chelsea",1900,O,505,610


Let us conclude this topic by noting that we do not have to store the index Series in a variable. Instead, we can put the boolean expression directly inside the index brackets, a pattern you will find a lot in pandas code. Moreover, boolean queries can be combined. For example, the following retrieves all rows for which the year is 1900 <em>and</em> the width is at most 400:

In [39]:
collection[(collection['year'] == 1900) & (collection['width'] <= 400)]

Unnamed: 0,artist,title,year,groundtruth,height,width
N05905,Henri Matisse,Notre-Dame,1900,L,460,375
N06080,Ambrose McEvoy,"Bessborough Street, Pimlico",1900,O,457,356
