# Python Open Labs: Data wrangling with pandas

## Setup

With this Google Colaboratory (Colab) notebook open, click the "Copy to Drive" button that appears in the menu bar. The notebook will then be attached to your own user account, so you can edit it in any way you like -- you can even take notes directly in the notebook.

## Open Labs agenda

1. **Guided activity**: One of the instructors will share their screen to work through the guided activity and teach concepts along the way.

2. **Open lab time**: After the guided portion of the Open Lab, the rest of the time is for you to ask questions, work collaboratively, or have self-guided practice time. You will have access to instructors and peers for questions and support.

## Learning objectives

By the end of our workshop today, we hope you'll understand basic pandas methods for normalizing values, modifying data, dealing with missing data, and working with strings and dates.

## Today's Topics

- Replacing values in a DataFrame column
- Creating new columns using expressions
- Creating new columns with functions
- Dealing with missing values
- Working with string data
- Working with dates

## Questions during the workshop

Please feel free to ask questions throughout the workshop.

We have a second instructor who will be available during the workshop. They will answer questions as they are able and will collect questions whose answers might help everyone, to be answered at the end of the workshop.

The open lab time is when you will be able to ask more questions and work together on the exercises.

## Guided Instruction

In this Open Lab we introduce how to use the pandas library to "wrangle" data, that is, to clean, manipulate, and prepare datasets for analysis.

In this section, we will work through examples using data from the [Museum of Modern Art (MoMA) research dataset](https://github.com/MuseumofModernArt/collection) containing records of all of the works that have been cataloged in the database of the MoMA collection.

> "The Museum’s website features 89,695 artworks from 26,494 artists. This research dataset contains 138,151 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in our database. It includes basic metadata for each work, including title, artist, date made, medium, dimensions, and date acquired by the Museum. Some of these records have incomplete information and are noted as “not Curator Approved.”" - [MoMA GitHub repository for collection data](https://github.com/MuseumofModernArt/collection)

We have prepared a dataset that consists of a subset of MoMA artworks classified as paintings, together with their associated artist information, to use in the following activities. We will be referencing the data that we have prepared in our [GitHub repository for teaching datasets](https://github.com/ncsu-libraries-data-vis/teaching-datasets/tree/main/moma_data).

In [1]:
# Import the Pandas library as pd (callable in our code as pd)
import pandas as pd

### Load the dataset

In [2]:
# Import the MoMA paintings with artist information dataset (CSV file)
# The file location
file_url = 'https://raw.githubusercontent.com/ncsu-libraries-data-vis/teaching-datasets/main/moma_data/moma_paintings_full.csv'

# Read in the file and print out the DataFrame
paintings = pd.read_csv(file_url)
paintings

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Height (cm),Length (cm),Weight (kg),Width (cm),Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate
1,"Rope and People, I",Joan Miró,4016.0,"Barcelona, March 27, 1935","Oil on cardboard mounted on wood, with coil of...","41 1/4 x 29 3/8"" (104.8 x 74.6 cm)",Gift of the Pierre Matisse Gallery,71.1936,Painting,Painting & Sculpture,...,104.800000,,,74.600000,,"Spanish, 1893–1983",Spanish,Male,1893.0,1983.0
2,Fire in the Evening,Paul Klee,3130.0,1929,Oil on cardboard,"13 3/8 x 13 1/4"" (33.8 x 33.3 cm)",Mr. and Mrs. Joachim Jean Aberbach Fund,153.1970,Painting,Painting & Sculpture,...,33.800000,,,33.300000,,"German, born Switzerland. 1879–1940",German,Male,1879.0,1940.0
3,Portrait of an Equilibrist,Paul Klee,3130.0,1927,Oil and collage on cardboard over wood with pa...,"24 7/8 x 15 3/4"" (63.2 x 40 cm)",Mrs. Simon Guggenheim Fund,195.1966,Painting,Painting & Sculpture,...,60.300000,,,36.800000,,"German, born Switzerland. 1879–1940",German,Male,1879.0,1940.0
4,Guitar,Pablo Picasso,4609.0,"Paris, early 1919","Oil, charcoal and pinned paper on canvas","7' 1"" x 31"" (216 x 78.8 cm)",Gift of A. Conger Goodyear,384.1955,Painting,Painting & Sculpture,...,215.900000,,,78.700000,,"Spanish, 1881–1973",Spanish,Male,1881.0,1973.0
5,Grandmother,Arthur Dove,1602.0,1925,"Shingles, needlepoint, page from Concordance, ...","20 x 21 1/4"" (50.8 x 54.0 cm)",Gift of Philip L. Goodwin (by exchange),636.1939,Painting,Painting & Sculpture,...,50.800000,,,54.000000,,"American, 1880–1946",American,Male,1880.0,1946.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38098,Zacimba Gaba,Dalton Paula,132719.0,2020,"Oil, pencil, and gold leaf on two joined canvases","24 × 17 3/4"" (61 × 45.1 cm), in two parts",,TR16514.1,Painting,Painting & Sculpture,...,60.960122,,,45.085090,,"Brazilian, born 1982",Brazilian,,1982.0,0.0
38099,Zumbi,Dalton Paula,132719.0,2020,"Oil, pencil, and gold leaf on two joined canvases","24 × 17 3/4"" (61 × 45.1 cm), in two parts",,TR16514.2,Painting,Painting & Sculpture,...,60.960122,,,45.085090,,"Brazilian, born 1982",Brazilian,,1982.0,0.0
38100,Vertigo #2,Ouattara Watts,132954.0,2011,"Acrylic, paper pulp, cut and pasted fabrics, a...","118 1/4 × 165 1/2 × 3 3/4"" (300.4 × 420.4 × 9....",,TR16516,Painting,Painting & Sculpture,...,300.355601,,,420.370841,,"American, born Ivory Coast, 1957",American,,1957.0,0.0
38101,Lot 111113 (flare 1),Donald Moffett,7435.0,2013,Acrylic and lacquer on linen with cotton and a...,"54 × 44"" (137.2 × 111.8 cm)",,TR16517,Painting,Painting & Sculpture,...,137.160274,,,111.760224,,"American, born 1955",American,Male,1955.0,0.0


In [3]:
# Observe a summary of the DataFrame columns using the DataFrame method info()
paintings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2350 entries, 1 to 38102
Data columns (total 26 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Title               2350 non-null   object 
 1   Artist              2349 non-null   object 
 2   ConstituentID       2349 non-null   float64
 3   Date                2343 non-null   object 
 4   Medium              2349 non-null   object 
 5   Dimensions          2345 non-null   object 
 6   CreditLine          2343 non-null   object 
 7   AccessionNumber     2350 non-null   object 
 8   Classification      2350 non-null   object 
 9   Department          2350 non-null   object 
 10  DateAcquired        2340 non-null   object 
 11  Cataloged           2350 non-null   object 
 12  ObjectID            2350 non-null   int64  
 13  Circumference (cm)  0 non-null      float64
 14  Depth (cm)          403 non-null    float64
 15  Diameter (cm)       12 non-null     float64
 16  Heigh

### Replacing values in a column

We can replace values in a column by first accessing the column and then using the Series method `replace()` (*remember that accessing one column from a DataFrame returns a pandas Series*). The `replace()` method accepts a dictionary in which the keys are the values to be replaced and the values are the new values to insert.

We will demonstrate this method by replacing the values `Y` and `N` in the `Cataloged` column with the more explicit values `Yes` and `No`, respectively. Because `replace()` returns a new Series, we will assign the result back to the `Cataloged` column to update the DataFrame. (Calling `replace()` with `inplace=True` on a selected column is not reliable in recent versions of pandas.)

In [4]:
# Print the unique values contained in the "Cataloged" column using the
# Series method unique()
paintings['Cataloged'].unique()

array(['Y', 'N'], dtype=object)

In [5]:
# Replace the values "Y" and "N" in the "Cataloged" column with "Yes" and "No"
paintings['Cataloged'] = paintings['Cataloged'].replace({'Y': 'Yes', 'N': 'No'})

# Print out the unique values of the "Cataloged" column
paintings['Cataloged'].unique()

array(['Yes', 'No'], dtype=object)
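
As a small, self-contained illustration of dictionary-based replacement, the sketch below uses a made-up column rather than the MoMA data (the name `toy` and its values are assumptions for illustration only). It shows that values not listed in the dictionary are left unchanged, and that `replace()` returns a new Series, which is assigned back to the column here.

```python
import pandas as pd

# Hypothetical toy data, not taken from the MoMA dataset
toy = pd.DataFrame({'Cataloged': ['Y', 'N', 'Y', 'Unknown']})

# replace() returns a new Series; assign it back to update the column
toy['Cataloged'] = toy['Cataloged'].replace({'Y': 'Yes', 'N': 'No'})

# Values that are not keys in the dictionary ("Unknown") are unchanged
print(toy['Cataloged'].tolist())
```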

### Creating a new column using an expression

We can create new columns in an existing DataFrame and assign them values calculated from existing columns. This is done by indexing the DataFrame with the name of the new column (for example, `df['NEW_COLUMN_NAME']`) and assigning to it the result of an expression, which can include existing columns from the DataFrame.

We will create a new column named `Area (cm^2)` that contains the area of a painting in square centimeters, based on the values in the columns `Width (cm)` and `Height (cm)`.

In [6]:
# Create a new column "Area (cm^2)" containing the size of a painting based
# on the width and height value of the painting
paintings['Area (cm^2)'] = paintings['Width (cm)'] * paintings['Height (cm)']

# Print out the full DataFrame
paintings

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Length (cm),Weight (kg),Width (cm),Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Area (cm^2)
1,"Rope and People, I",Joan Miró,4016.0,"Barcelona, March 27, 1935","Oil on cardboard mounted on wood, with coil of...","41 1/4 x 29 3/8"" (104.8 x 74.6 cm)",Gift of the Pierre Matisse Gallery,71.1936,Painting,Painting & Sculpture,...,,,74.600000,,"Spanish, 1893–1983",Spanish,Male,1893.0,1983.0,7818.080000
2,Fire in the Evening,Paul Klee,3130.0,1929,Oil on cardboard,"13 3/8 x 13 1/4"" (33.8 x 33.3 cm)",Mr. and Mrs. Joachim Jean Aberbach Fund,153.1970,Painting,Painting & Sculpture,...,,,33.300000,,"German, born Switzerland. 1879–1940",German,Male,1879.0,1940.0,1125.540000
3,Portrait of an Equilibrist,Paul Klee,3130.0,1927,Oil and collage on cardboard over wood with pa...,"24 7/8 x 15 3/4"" (63.2 x 40 cm)",Mrs. Simon Guggenheim Fund,195.1966,Painting,Painting & Sculpture,...,,,36.800000,,"German, born Switzerland. 1879–1940",German,Male,1879.0,1940.0,2219.040000
4,Guitar,Pablo Picasso,4609.0,"Paris, early 1919","Oil, charcoal and pinned paper on canvas","7' 1"" x 31"" (216 x 78.8 cm)",Gift of A. Conger Goodyear,384.1955,Painting,Painting & Sculpture,...,,,78.700000,,"Spanish, 1881–1973",Spanish,Male,1881.0,1973.0,16991.330000
5,Grandmother,Arthur Dove,1602.0,1925,"Shingles, needlepoint, page from Concordance, ...","20 x 21 1/4"" (50.8 x 54.0 cm)",Gift of Philip L. Goodwin (by exchange),636.1939,Painting,Painting & Sculpture,...,,,54.000000,,"American, 1880–1946",American,Male,1880.0,1946.0,2743.200000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38098,Zacimba Gaba,Dalton Paula,132719.0,2020,"Oil, pencil, and gold leaf on two joined canvases","24 × 17 3/4"" (61 × 45.1 cm), in two parts",,TR16514.1,Painting,Painting & Sculpture,...,,,45.085090,,"Brazilian, born 1982",Brazilian,,1982.0,0.0,2748.392594
38099,Zumbi,Dalton Paula,132719.0,2020,"Oil, pencil, and gold leaf on two joined canvases","24 × 17 3/4"" (61 × 45.1 cm), in two parts",,TR16514.2,Painting,Painting & Sculpture,...,,,45.085090,,"Brazilian, born 1982",Brazilian,,1982.0,0.0,2748.392594
38100,Vertigo #2,Ouattara Watts,132954.0,2011,"Acrylic, paper pulp, cut and pasted fabrics, a...","118 1/4 × 165 1/2 × 3 3/4"" (300.4 × 420.4 × 9....",,TR16516,Painting,Painting & Sculpture,...,,,420.370841,,"American, born Ivory Coast, 1957",American,,1957.0,0.0,126260.736392
38101,Lot 111113 (flare 1),Donald Moffett,7435.0,2013,Acrylic and lacquer on linen with cotton and a...,"54 × 44"" (137.2 × 111.8 cm)",,TR16517,Painting,Painting & Sculpture,...,,,111.760224,,"American, born 1955",American,Male,1955.0,0.0,15329.062916
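
Column expressions are evaluated row by row, and a missing value in either operand generally produces a missing result. The sketch below illustrates this on a made-up DataFrame (the name `toy` and its measurements are hypothetical, not MoMA data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy measurements; NaN marks a missing height
toy = pd.DataFrame({
    'Width (cm)': [10.0, 20.0, 30.0],
    'Height (cm)': [5.0, np.nan, 2.0],
})

# Column arithmetic is applied element-wise, row by row
toy['Area (cm^2)'] = toy['Width (cm)'] * toy['Height (cm)']

# The second row has no height, so its area is NaN as well
print(toy)
```

This is worth keeping in mind when a computed column is later summarized, since rows with missing dimensions will also be missing an area.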


### Creating a new column using apply

Sometimes we need to create new data using more complex logic than a simple expression. We will create a new column named `OilPainting` to identify all paintings that contain the word "oil" in their medium description (contained in the column `Medium`). This cannot be reliably accomplished without some additional manipulation of the medium description.

We will use the Series method `apply()` to call the function `is_oil_based_painting`, which contains the necessary code to produce the desired results, on the column `Medium`.

In [8]:
# Return whether a painting is oil-based (Yes) or not (No) based on the
# occurrence of the word "oil" in the artwork medium description
def is_oil_based_painting(medium):
    # Test if value is a string (can't apply string methods on NaNs)
    if isinstance(medium, str):
        # Create a list of lowercase words, commas removed, from description
        description = medium.lower().replace(',', '').split(' ')
        # Test if "oil" is in list
        if 'oil' in description:
            return 'Yes'
        return 'No'
    # Missing (non-string) values fall through and return None
    return None

# Use the Series method apply to call the "is_oil_based_painting" function on
# the column "Medium"
paintings['OilPainting'] = paintings['Medium'].apply(
    is_oil_based_painting
)

# Print out the resulting DataFrame
paintings

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Weight (kg),Width (cm),Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Area (cm^2),OilPainting
1,"Rope and People, I",Joan Miró,4016.0,"Barcelona, March 27, 1935","Oil on cardboard mounted on wood, with coil of...","41 1/4 x 29 3/8"" (104.8 x 74.6 cm)",Gift of the Pierre Matisse Gallery,71.1936,Painting,Painting & Sculpture,...,,74.600000,,"Spanish, 1893–1983",Spanish,Male,1893.0,1983.0,7818.080000,Yes
2,Fire in the Evening,Paul Klee,3130.0,1929,Oil on cardboard,"13 3/8 x 13 1/4"" (33.8 x 33.3 cm)",Mr. and Mrs. Joachim Jean Aberbach Fund,153.1970,Painting,Painting & Sculpture,...,,33.300000,,"German, born Switzerland. 1879–1940",German,Male,1879.0,1940.0,1125.540000,Yes
3,Portrait of an Equilibrist,Paul Klee,3130.0,1927,Oil and collage on cardboard over wood with pa...,"24 7/8 x 15 3/4"" (63.2 x 40 cm)",Mrs. Simon Guggenheim Fund,195.1966,Painting,Painting & Sculpture,...,,36.800000,,"German, born Switzerland. 1879–1940",German,Male,1879.0,1940.0,2219.040000,Yes
4,Guitar,Pablo Picasso,4609.0,"Paris, early 1919","Oil, charcoal and pinned paper on canvas","7' 1"" x 31"" (216 x 78.8 cm)",Gift of A. Conger Goodyear,384.1955,Painting,Painting & Sculpture,...,,78.700000,,"Spanish, 1881–1973",Spanish,Male,1881.0,1973.0,16991.330000,Yes
5,Grandmother,Arthur Dove,1602.0,1925,"Shingles, needlepoint, page from Concordance, ...","20 x 21 1/4"" (50.8 x 54.0 cm)",Gift of Philip L. Goodwin (by exchange),636.1939,Painting,Painting & Sculpture,...,,54.000000,,"American, 1880–1946",American,Male,1880.0,1946.0,2743.200000,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38098,Zacimba Gaba,Dalton Paula,132719.0,2020,"Oil, pencil, and gold leaf on two joined canvases","24 × 17 3/4"" (61 × 45.1 cm), in two parts",,TR16514.1,Painting,Painting & Sculpture,...,,45.085090,,"Brazilian, born 1982",Brazilian,,1982.0,0.0,2748.392594,Yes
38099,Zumbi,Dalton Paula,132719.0,2020,"Oil, pencil, and gold leaf on two joined canvases","24 × 17 3/4"" (61 × 45.1 cm), in two parts",,TR16514.2,Painting,Painting & Sculpture,...,,45.085090,,"Brazilian, born 1982",Brazilian,,1982.0,0.0,2748.392594,Yes
38100,Vertigo #2,Ouattara Watts,132954.0,2011,"Acrylic, paper pulp, cut and pasted fabrics, a...","118 1/4 × 165 1/2 × 3 3/4"" (300.4 × 420.4 × 9....",,TR16516,Painting,Painting & Sculpture,...,,420.370841,,"American, born Ivory Coast, 1957",American,,1957.0,0.0,126260.736392,No
38101,Lot 111113 (flare 1),Donald Moffett,7435.0,2013,Acrylic and lacquer on linen with cotton and a...,"54 × 44"" (137.2 × 111.8 cm)",,TR16517,Painting,Painting & Sculpture,...,,111.760224,,"American, born 1955",American,Male,1955.0,0.0,15329.062916,No
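
For comparison, a similar result can often be obtained without `apply()` by using the vectorized `.str` accessor. The sketch below is an alternative, not the approach used in this workshop: it runs on made-up medium descriptions (`medium` is a hypothetical Series) and matches "oil" as a whole word with a regular expression, so it may classify some descriptions differently from `is_oil_based_painting` (for example, hyphenated terms such as "oil-based").

```python
import pandas as pd

# Hypothetical medium descriptions, including one missing value
medium = pd.Series(['Oil on canvas', 'Enamel on board', None, 'Foil on paper'])

# Match "oil" as a whole word, ignoring case; treat missing values as False
has_oil = medium.str.contains(r'\boil\b', case=False, regex=True, na=False)

# Map booleans to labels, then blank out rows where the medium was missing
oil_painting = has_oil.map({True: 'Yes', False: 'No'}).where(medium.notna())

print(oil_painting.tolist())
```

The word-boundary pattern keeps "Foil on paper" from being counted as an oil painting, which a plain substring search would get wrong.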


### Removing missing data

The MoMA paintings dataset contains many missing values. Missing values in pandas are typically represented as `NaN`. We can handle missing values in several ways: removing all rows or columns that contain one or more `NaN`s, removing rows or columns based on the occurrence of `NaN`s within specific rows or columns, or filling in `NaN` values with another value.

We want to work with paintings that have artist information associated with them. A painting that does not have artist information can be identified by a value of `NaN` in the column `ArtistBio`. We will create a subset of the full `paintings` DataFrame by removing all rows that contain `NaN` in the `ArtistBio` column using the DataFrame method `dropna()` and specifying a column `subset` over which to look for `NaN`s.

In [9]:
# Create a new DataFrame containing only paintings that include artist
# information by removing rows that include "NaN" in the column "ArtistBio"
# using the DataFrame method dropna()
paintings_clean = paintings.dropna(subset=['ArtistBio'])

# Print out the resulting DataFrame
paintings_clean

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Weight (kg),Width (cm),Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Area (cm^2),OilPainting
1,"Rope and People, I",Joan Miró,4016.0,"Barcelona, March 27, 1935","Oil on cardboard mounted on wood, with coil of...","41 1/4 x 29 3/8"" (104.8 x 74.6 cm)",Gift of the Pierre Matisse Gallery,71.1936,Painting,Painting & Sculpture,...,,74.600000,,"Spanish, 1893–1983",Spanish,Male,1893.0,1983.0,7818.080000,Yes
2,Fire in the Evening,Paul Klee,3130.0,1929,Oil on cardboard,"13 3/8 x 13 1/4"" (33.8 x 33.3 cm)",Mr. and Mrs. Joachim Jean Aberbach Fund,153.1970,Painting,Painting & Sculpture,...,,33.300000,,"German, born Switzerland. 1879–1940",German,Male,1879.0,1940.0,1125.540000,Yes
3,Portrait of an Equilibrist,Paul Klee,3130.0,1927,Oil and collage on cardboard over wood with pa...,"24 7/8 x 15 3/4"" (63.2 x 40 cm)",Mrs. Simon Guggenheim Fund,195.1966,Painting,Painting & Sculpture,...,,36.800000,,"German, born Switzerland. 1879–1940",German,Male,1879.0,1940.0,2219.040000,Yes
4,Guitar,Pablo Picasso,4609.0,"Paris, early 1919","Oil, charcoal and pinned paper on canvas","7' 1"" x 31"" (216 x 78.8 cm)",Gift of A. Conger Goodyear,384.1955,Painting,Painting & Sculpture,...,,78.700000,,"Spanish, 1881–1973",Spanish,Male,1881.0,1973.0,16991.330000,Yes
5,Grandmother,Arthur Dove,1602.0,1925,"Shingles, needlepoint, page from Concordance, ...","20 x 21 1/4"" (50.8 x 54.0 cm)",Gift of Philip L. Goodwin (by exchange),636.1939,Painting,Painting & Sculpture,...,,54.000000,,"American, 1880–1946",American,Male,1880.0,1946.0,2743.200000,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38098,Zacimba Gaba,Dalton Paula,132719.0,2020,"Oil, pencil, and gold leaf on two joined canvases","24 × 17 3/4"" (61 × 45.1 cm), in two parts",,TR16514.1,Painting,Painting & Sculpture,...,,45.085090,,"Brazilian, born 1982",Brazilian,,1982.0,0.0,2748.392594,Yes
38099,Zumbi,Dalton Paula,132719.0,2020,"Oil, pencil, and gold leaf on two joined canvases","24 × 17 3/4"" (61 × 45.1 cm), in two parts",,TR16514.2,Painting,Painting & Sculpture,...,,45.085090,,"Brazilian, born 1982",Brazilian,,1982.0,0.0,2748.392594,Yes
38100,Vertigo #2,Ouattara Watts,132954.0,2011,"Acrylic, paper pulp, cut and pasted fabrics, a...","118 1/4 × 165 1/2 × 3 3/4"" (300.4 × 420.4 × 9....",,TR16516,Painting,Painting & Sculpture,...,,420.370841,,"American, born Ivory Coast, 1957",American,,1957.0,0.0,126260.736392,No
38101,Lot 111113 (flare 1),Donald Moffett,7435.0,2013,Acrylic and lacquer on linen with cotton and a...,"54 × 44"" (137.2 × 111.8 cm)",,TR16517,Painting,Painting & Sculpture,...,,111.760224,,"American, born 1955",American,Male,1955.0,0.0,15329.062916,No
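
The three strategies described at the start of this section can be compared side by side on a small example. The sketch below uses a made-up DataFrame (`toy` and its values are hypothetical) and the DataFrame methods `dropna()` and `fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two different columns
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'ArtistBio': ['bio a', np.nan, 'bio c'],
    'Depth (cm)': [np.nan, 2.0, np.nan],
})

# 1. Drop rows with NaN in any column (every row here has one, so none remain)
any_nan_dropped = toy.dropna()

# 2. Drop rows only when NaN occurs in the "ArtistBio" column
artist_known = toy.dropna(subset=['ArtistBio'])

# 3. Keep all rows and fill NaN in "Depth (cm)" with a chosen value
filled = toy.fillna({'Depth (cm)': 0.0})

print(len(any_nan_dropped), artist_known['Title'].tolist())
print(filled)
```

Dropping on a `subset` keeps far more rows than dropping on every column, which is why the workshop restricts the check to `ArtistBio`.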


### Working with string data

A lot of the MoMA data consists of strings. In the version of pandas used here, there is no dedicated default string data type; strings are represented by the pandas *object* data type. We can apply string methods to a pandas Series by accessing its `.str` attribute. For example, we can access the string methods for the `Artist` column in the paintings dataset by calling `paintings['Artist'].str`.

If we look at the `Gender` column, we see that the gender types are not represented in a normalized way (for example, female is represented in some cases by the value `Female` and in other cases `female`). We will use the `lower()` string method to produce a lowercase string for all values in the `Gender` column.

In [10]:
# Print the unique values contained in the "Gender" column using the
# Series method unique()
paintings_clean['Gender'].unique()

array(['Male', 'Female', 'male', nan, 'female'], dtype=object)

In [11]:
# Reassign the column "Gender" with the results of calling the string method
# lower() on the values in the "Gender" column
paintings_clean['Gender'] = paintings_clean['Gender'].str.lower()

# Print the unique values contained in the "Gender" column using the
# Series method unique()
paintings_clean['Gender'].unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  paintings_clean['Gender'] = paintings_clean['Gender'].str.lower()


array(['male', 'female', nan], dtype=object)
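
The warning shown above (pandas calls it `SettingWithCopyWarning`) appears because pandas cannot tell whether `paintings_clean` is an independent DataFrame or is still tied to `paintings`. One common way to avoid it, sketched below on made-up data, is to call `.copy()` when creating the subset so that later assignments clearly target a separate DataFrame. The names `toy` and `toy_clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the MoMA dataset
toy = pd.DataFrame({
    'ArtistBio': ['bio a', np.nan, 'bio c', 'bio d'],
    'Gender': ['Male', 'female', 'Female', np.nan],
})

# .copy() makes the subset an independent DataFrame
toy_clean = toy.dropna(subset=['ArtistBio']).copy()

# Normalize case; NaN values stay NaN under the .str accessor
toy_clean['Gender'] = toy_clean['Gender'].str.lower()

print(toy_clean['Gender'].unique())
```

The original `toy` DataFrame is left untouched, so its `Gender` values keep their original capitalization.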

### Working with Datetime data

Our MoMA dataset contains some columns of data that represent date values (for example, `DateAcquired`, `BeginDate`, `EndDate`). Currently, the `DateAcquired` column is recognized as strings (the pandas object data type). It contains the date that an artwork was acquired by the museum in the form `YYYY-MM-DD` (for example, `1964-10-06`). If we wanted to filter these values by year, we could not do so conveniently in their current string format.

We can convert the values in the `DateAcquired` column to the pandas datetime data type, which holds date and time information, by calling the pandas function `to_datetime()` on the column.

In [12]:
# Reassign the string values in the column "DateAcquired" as Datetime values
# using the pandas method to_datetime()
paintings_clean['DateAcquired'] = pd.to_datetime(
    paintings_clean['DateAcquired']
)

# Print out the new "DateAcquired" column
paintings_clean['DateAcquired']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  paintings_clean['DateAcquired'] = pd.to_datetime(paintings_clean['DateAcquired'])


1       1936-10-16
2       1970-04-08
3       1966-04-12
4       1955-12-28
5       1939-12-08
           ...    
38098   2020-10-26
38099   2020-10-26
38100   2020-10-26
38101   2020-10-26
38102   2020-10-26
Name: DateAcquired, Length: 2348, dtype: datetime64[ns]
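
Real-world date columns sometimes contain values that cannot be parsed. The sketch below, on made-up strings (`raw` is a hypothetical Series), shows how the `errors='coerce'` argument of `to_datetime()` converts unparseable values to `NaT` (the datetime equivalent of `NaN`) instead of raising an error, and how the `.dt` accessor exposes components such as the year.

```python
import pandas as pd

# Hypothetical date strings, including one malformed and one missing value
raw = pd.Series(['1964-10-06', 'not a date', None])

# errors='coerce' turns values that cannot be parsed into NaT
dates = pd.to_datetime(raw, errors='coerce')

# The .dt accessor exposes components such as the year
print(dates.dt.year)
```

Whether coercing bad values is appropriate depends on the analysis; it can hide data-entry problems that are worth inspecting first.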

We can now filter the DataFrame by date, based on the values in the `DateAcquired` column. Let's filter the cleaned paintings DataFrame to only include paintings acquired before 1960.

In [20]:
# Filter the cleaned paintings DataFrame to only include paintings acquired
# before 1960
paintings_clean[paintings_clean['DateAcquired'] < pd.to_datetime('1960')]

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Weight (kg),Width (cm),Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Area (cm^2),OilPainting
1,"Rope and People, I",Joan Miró,4016.0,"Barcelona, March 27, 1935","Oil on cardboard mounted on wood, with coil of...","41 1/4 x 29 3/8"" (104.8 x 74.6 cm)",Gift of the Pierre Matisse Gallery,71.1936,Painting,Painting & Sculpture,...,,74.6000,,"Spanish, 1893–1983",Spanish,male,1893.0,1983.0,7818.08000,Yes
4,Guitar,Pablo Picasso,4609.0,"Paris, early 1919","Oil, charcoal and pinned paper on canvas","7' 1"" x 31"" (216 x 78.8 cm)",Gift of A. Conger Goodyear,384.1955,Painting,Painting & Sculpture,...,,78.7000,,"Spanish, 1881–1973",Spanish,male,1881.0,1973.0,16991.33000,Yes
5,Grandmother,Arthur Dove,1602.0,1925,"Shingles, needlepoint, page from Concordance, ...","20 x 21 1/4"" (50.8 x 54.0 cm)",Gift of Philip L. Goodwin (by exchange),636.1939,Painting,Painting & Sculpture,...,,54.0000,,"American, 1880–1946",American,male,1880.0,1946.0,2743.20000,No
19213,Daylight Savings Time,Pierre Roy,5065.0,1929,Oil on canvas,"21 1/2 x 15"" (54.6 x 38.1 cm)",Gift of Mrs. Ray Slater Murphy,1.1931,Painting,Painting & Sculpture,...,,38.1000,,"French, 1880–1950",French,male,1880.0,1950.0,2080.26000,Yes
19214,The Bather,Paul Cézanne,1053.0,c. 1885,Oil on canvas,"50 x 38 1/8"" (127 x 96.8 cm)",Lillie P. Bliss Collection. Conservation was m...,1.1934,Painting,Painting & Sculpture,...,,96.8000,,"French, 1839–1906",French,male,1839.0,1906.0,12293.60000,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20955,Tribulations of Saint Anthony,James Ensor,1739.0,1887,Oil on canvas,"46 3/8 x 66"" (117.8 x 167.6 cm)",Purchase,1642.1940,Painting,Painting & Sculpture,...,,167.6000,,"Belgian, 1860–1949",Belgian,male,1860.0,1949.0,19743.28000,Yes
20956,Leaves and Navels,Jean (Hans) Arp,11.0,1929,Oil and cord on canvas,"13 3/4 x 10 3/4"" (35 x 27.3 cm)",Purchase,1647.1940,Painting,Painting & Sculpture,...,,27.3000,,"French, born Germany (Alsace). 1886–1966",French,male,1886.0,1966.0,955.50000,Yes
20957,Ethnography,David Alfaro Siqueiros,5454.0,1939,Enamel on board,"48 1/8 x 32 3/8"" (122.2 x 82.2 cm)",Gift of Abby Aldrich Rockefeller,1657.1940,Painting,Painting & Sculpture,...,,82.2000,,"Mexican, 1896–1974",Mexican,male,1896.0,1974.0,10044.84000,No
21063,"Colorhythm, 1",Alejandro Otero,4445.0,1955,Enamel on plywood,"6' 6 3/4"" x 19"" (200.1 x 48.2 cm)",Inter-American Fund,21.1956,Painting,Painting & Sculpture,...,,48.2000,,"Venezuelan, 1921–1990",Venezuelan,male,1921.0,1990.0,9644.82000,No
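
Once a column holds datetime values, the same kind of filter can be written in a few equivalent ways. The sketch below uses a made-up DataFrame (`toy` and its dates are hypothetical) to compare a timestamp comparison, a comparison on the year component, and a date range using the Series method `between()`.

```python
import pandas as pd

# Hypothetical acquisition dates, not taken from the MoMA dataset
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'DateAcquired': pd.to_datetime(
        ['1936-10-16', '1959-12-31', '1960-01-01', '2020-10-26']
    ),
})

# Comparing against a timestamp: acquired before January 1, 1960
before_1960 = toy[toy['DateAcquired'] < pd.to_datetime('1960')]

# The same filter written with the year component
before_1960_by_year = toy[toy['DateAcquired'].dt.year < 1960]

# A date range, inclusive on both ends by default
fifties = toy[toy['DateAcquired'].between(
    pd.Timestamp('1950-01-01'), pd.Timestamp('1959-12-31')
)]

print(before_1960['Title'].tolist(), fifties['Title'].tolist())
```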


----

## Open work time

You can use this time to ask questions, collaborate, or work on the following exercises (on your own or in a group).

For these exercises you will be using a dataset that consists of MoMA artworks classified as photographs and each photograph's associated artist information. This dataset has the same columns and column names as the original paintings dataset from the guided activity.

Before starting the exercises you will need to load the new dataset as a DataFrame. It is available as a CSV file, and the URL to the file is provided in the variable below.

In [21]:
# URL to the photographs dataset
photos_file_url = 'https://raw.githubusercontent.com/ncsu-libraries-data-vis/teaching-datasets/main/moma_data/moma_photographs_full.csv'

# Import the photographs dataset as a DataFrame
photos = pd.read_csv(photos_file_url)
photos

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Height (cm),Length (cm),Weight (kg),Width (cm),Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate
0,Untitled from VVV Portfolio,David Hare,2504.0,"c. 1941, published 1943",Gelatin silver print mounted on paper from a p...,"composition: 12 x 9 3/4"" (30.5 x 24.8 cm); she...",The Louis E. Stern Collection,1113.1964.6,Photograph,Drawings & Prints,...,30.50,,,24.8,,"American, 1917–1992",American,Male,1917.0,1992.0
7,Tuileries Sanglier / d'apres l'antique,Eugène Atget,229.0,1911,Albumen silver print,"8 11/16 × 6 9/16"" (22 × 16.7 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1,Photograph,Photography,...,,,,,,"French, 1857–1927",French,Male,1857.0,1927.0
8,Sapin (Trianon),Eugène Atget,229.0,1910-14,Albumen silver print,"Approx. 7 1/8 × 8 5/8"" (18.1 × 21.9 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.10,Photograph,Photography,...,,,,,,"French, 1857–1927",French,Male,1857.0,1927.0
9,"Versailles, vase par Ballin",Eugène Atget,229.0,1902,Matte albumen silver print,"Approx. 8 9/16 × 7 1/16"" (21.8 × 18 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.100,Photograph,Photography,...,,,,,,"French, 1857–1927",French,Male,1857.0,1927.0
10,Facteur,Eugène Atget,229.0,1899-1900,Gelatin silver printing-out-paper print,"Approx. 8 11/16 × 6 9/16"" (22 × 16.7 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1000,Photograph,Photography,...,,,,,,"French, 1857–1927",French,Male,1857.0,1927.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38133,Untitled,Unknown photographer,8595.0,c. 1910,Gelatin silver print,"2 3/8 × 4 1/8"" (6 × 10.5 cm)",Gift of John Jeremiah Sullivan,TR16527.20,Photograph,Photography,...,6.00,,,10.5,,,,,0.0,0.0
38134,Untitled,Unknown photographer,8595.0,c. 1910,"Gelatin silver print, printed later","8 1/16 × 13 1/16"" (20.5 × 33.2 cm)",Gift of John Jeremiah Sullivan,TR16527.21,Photograph,Photography,...,20.50,,,33.2,,,,,0.0,0.0
38135,Untitled,Unknown photographer,8595.0,c. 1918-30,Gelatin silver print (postcard),"4 × 3 3/8"" (10.2 × 8.6 cm)",Gift of John Jeremiah Sullivan,TR16527.22,Photograph,Photography,...,10.20,,,8.6,,,,,0.0,0.0
38136,Untitled,Unknown photographer,8595.0,c. 1900,Gelatin silver print,"6 7/16 × 9 3/4"" (16.4 × 24.7 cm)",Gift of John Jeremiah Sullivan,TR16527.23,Photograph,Photography,...,16.36,,,24.7,,,,,0.0,0.0


### Exercise 1: Rename column values

In the photographs dataset, replace the values in the column `Cataloged`. Replace all occurrences of the value `Y` with the value `Yes` and all occurrences of the value `N` with the value `No`. Overwrite the existing values with the new values.

In [36]:
# Replace "Y" and "N" in the "Cataloged" column with "Yes" and "No", then print
photos['Cataloged'] = photos['Cataloged'].replace({'Y': 'Yes', 'N': 'No'})
photos['Cataloged']

0        Yes
7        Yes
8        Yes
9        Yes
10       Yes
        ... 
38133    Yes
38134    Yes
38135    Yes
38136    Yes
38137    Yes
Name: Cataloged, Length: 31443, dtype: object

### Exercise 2: Create a new column using an expression

Create a new column, `Aspect`, that contains the aspect ratio (width / height) of a photograph using the values in the `Width (cm)` and `Height (cm)` columns.

In [37]:
# Create a new column named "Aspect" that contains the aspect ratio of a
# photograph based on the values in columns "Width (cm)" and "Height (cm)"
photos['Aspect'] = photos['Width (cm)'] / photos['Height (cm)']
photos

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Length (cm),Weight (kg),Width (cm),Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Aspect
0,Untitled from VVV Portfolio,David Hare,2504.0,"c. 1941, published 1943",Gelatin silver print mounted on paper from a p...,"composition: 12 x 9 3/4"" (30.5 x 24.8 cm); she...",The Louis E. Stern Collection,1113.1964.6,Photograph,Drawings & Prints,...,,,24.8,,"American, 1917–1992",American,Male,1917.0,1992.0,0.813115
7,Tuileries Sanglier / d'apres l'antique,Eugène Atget,229.0,1911,Albumen silver print,"8 11/16 × 6 9/16"" (22 × 16.7 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1,Photograph,Photography,...,,,,,"French, 1857–1927",French,Male,1857.0,1927.0,
8,Sapin (Trianon),Eugène Atget,229.0,1910-14,Albumen silver print,"Approx. 7 1/8 × 8 5/8"" (18.1 × 21.9 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.10,Photograph,Photography,...,,,,,"French, 1857–1927",French,Male,1857.0,1927.0,
9,"Versailles, vase par Ballin",Eugène Atget,229.0,1902,Matte albumen silver print,"Approx. 8 9/16 × 7 1/16"" (21.8 × 18 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.100,Photograph,Photography,...,,,,,"French, 1857–1927",French,Male,1857.0,1927.0,
10,Facteur,Eugène Atget,229.0,1899-1900,Gelatin silver printing-out-paper print,"Approx. 8 11/16 × 6 9/16"" (22 × 16.7 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1000,Photograph,Photography,...,,,,,"French, 1857–1927",French,Male,1857.0,1927.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38133,Untitled,Unknown photographer,8595.0,c. 1910,Gelatin silver print,"2 3/8 × 4 1/8"" (6 × 10.5 cm)",Gift of John Jeremiah Sullivan,TR16527.20,Photograph,Photography,...,,,10.5,,,,,0.0,0.0,1.750000
38134,Untitled,Unknown photographer,8595.0,c. 1910,"Gelatin silver print, printed later","8 1/16 × 13 1/16"" (20.5 × 33.2 cm)",Gift of John Jeremiah Sullivan,TR16527.21,Photograph,Photography,...,,,33.2,,,,,0.0,0.0,1.619512
38135,Untitled,Unknown photographer,8595.0,c. 1918-30,Gelatin silver print (postcard),"4 × 3 3/8"" (10.2 × 8.6 cm)",Gift of John Jeremiah Sullivan,TR16527.22,Photograph,Photography,...,,,8.6,,,,,0.0,0.0,0.843137
38136,Untitled,Unknown photographer,8595.0,c. 1900,Gelatin silver print,"6 7/16 × 9 3/4"" (16.4 × 24.7 cm)",Gift of John Jeremiah Sullivan,TR16527.23,Photograph,Photography,...,,,24.7,,,,,0.0,0.0,1.509780
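
The `NaN` values in the `Aspect` column above come from rows where a dimension is missing. The sketch below reproduces that behavior on a small made-up DataFrame and also shows one possible way to handle a height of zero, which is an assumption for illustration rather than something observed in the MoMA data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the MoMA columns (values are made up for illustration)
toy = pd.DataFrame({
    'Width (cm)': [24.8, np.nan, 10.5, 8.0],
    'Height (cm)': [30.5, 22.0, 6.0, 0.0],
})

# Element-wise division: a missing operand gives NaN, a zero height gives inf
toy['Aspect'] = toy['Width (cm)'] / toy['Height (cm)']

# Treat infinite ratios as missing so they can be handled along with the NaNs
toy['Aspect'] = toy['Aspect'].replace([np.inf, -np.inf], np.nan)

print(toy)
```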


### Exercise 3: Create a new column using apply

Use the values in the column `BeginDate` to create a new column, `CenturyBorn`, in the photos DataFrame that indicates the century in which an artist was born using a function that returns:
- `18th` if the artist was born between 1700-1799,
- `19th` if the artist was born between 1800-1899,
- `20th` if the artist was born between 1900-1999,
- `21st` if the artist was born between 2000-present, and
- `unknown` otherwise.

The `BeginDate` column contains the year in which an artist was born. Unknown artist birth years are identified with the value `0`. The function `century_born()` has been provided for you to use with the apply method, but you can create your own for extra practice.

In [55]:
# Return the century in which an artist was born given their birth year
def century_born(year):
    if year > 1999:
        return '21st'
    elif year > 1899:
        return '20th'
    elif year > 1799:
        return '19th'
    elif year > 1699:
        return '18th'
    return 'unknown'

# Create a new column named "CenturyBorn" that contains the century in which an
# artist was born using data from the column "BeginDate" and the function
# century_born
photos['CenturyBorn'] = photos['BeginDate'].apply(century_born)

photos[['CenturyBorn', 'BeginDate']]

| | CenturyBorn | BeginDate |
|---|---|---|
| 0 | 20th | 1917.0 |
| 7 | 19th | 1857.0 |
| 8 | 19th | 1857.0 |
| 9 | 19th | 1857.0 |
| 10 | 19th | 1857.0 |
| ... | ... | ... |
| 38133 | unknown | 0.0 |
| 38134 | unknown | 0.0 |
| 38135 | unknown | 0.0 |
| 38136 | unknown | 0.0 |
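
For extra practice, the same century labels can also be produced without `apply`. The sketch below uses `pd.cut` on a small made-up Series of birth years; the bin edges and the `years` values are assumptions chosen to mirror the rules in `century_born()`, not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Made-up birth years, including the "unknown" marker 0 and a missing value
years = pd.Series([1917.0, 1857.0, 0.0, np.nan, 2001.0, 1700.0, 1999.0])

# With right=False each bin includes its left edge: [1700, 1800), [1800, 1900), ...
bins = [1700, 1800, 1900, 2000, np.inf]
labels = ['18th', '19th', '20th', '21st']
century = pd.cut(years, bins=bins, labels=labels, right=False)

# Values outside every bin (0, NaN, anything before 1700) come back as NaN;
# convert to plain strings and label them 'unknown'
century = century.astype(object).fillna('unknown')

print(century.tolist())
```

On a column of this size either approach is fast enough; `apply` with a named function is often easier to read, while `pd.cut` keeps the bin edges in one place.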


### Exercise 4: Remove missing values

We are only interested in artworks that have a defined width and height. Use the column `Aspect` to remove any rows from the dataset that do not have a defined aspect ratio (in other words, values that are `NaN`).

In [85]:
# Remove rows from the dataset that have an "NaN" value in the "Aspect" column
photos_clean = photos.dropna(subset=['Aspect'])

photos_clean

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Weight (kg),Width (cm),Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Aspect,CenturyBorn
0,Untitled from VVV Portfolio,David Hare,2504.0,"c. 1941, published 1943",Gelatin silver print mounted on paper from a p...,"composition: 12 x 9 3/4"" (30.5 x 24.8 cm); she...",The Louis E. Stern Collection,1113.1964.6,Photograph,Drawings & Prints,...,,24.8,,"American, 1917–1992",American,Male,1917.0,1992.0,0.813115,20th
13,Marchand de paniers,Eugène Atget,229.0,1899-1900,Albumen silver print,"8 3/8 × 6 5/8"" (21.3 × 16.8 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1003,Photograph,Photography,...,,16.8,,"French, 1857–1927",French,Male,1857.0,1927.0,0.788732,19th
15,Porte de Montreuil,Eugène Atget,229.0,1913,Matte albumen silver print,"9 × 6 7/8"" (22.8 × 17.5 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1005,Photograph,Photography,...,,17.5,,"French, 1857–1927",French,Male,1857.0,1927.0,0.767544,19th
272,"Versailles, vase",Eugène Atget,229.0,1905,Albumen silver print,"Approx. 8 7/8 × 7"" (22.5 × 17.8 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.124,Photograph,Photography,...,,17.8,,"French, 1857–1927",French,Male,1857.0,1927.0,0.791111,19th
277,Bords de la Marne,Eugène Atget,229.0,1903,Gelatin silver printing-out-paper print,"6 11/16 × 8 3/4"" (17 × 22.2 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1244,Photograph,Photography,...,,22.2,,"French, 1857–1927",French,Male,1857.0,1927.0,1.305882,19th
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38133,Untitled,Unknown photographer,8595.0,c. 1910,Gelatin silver print,"2 3/8 × 4 1/8"" (6 × 10.5 cm)",Gift of John Jeremiah Sullivan,TR16527.20,Photograph,Photography,...,,10.5,,,,,0.0,0.0,1.750000,unknown
38134,Untitled,Unknown photographer,8595.0,c. 1910,"Gelatin silver print, printed later","8 1/16 × 13 1/16"" (20.5 × 33.2 cm)",Gift of John Jeremiah Sullivan,TR16527.21,Photograph,Photography,...,,33.2,,,,,0.0,0.0,1.619512,unknown
38135,Untitled,Unknown photographer,8595.0,c. 1918-30,Gelatin silver print (postcard),"4 × 3 3/8"" (10.2 × 8.6 cm)",Gift of John Jeremiah Sullivan,TR16527.22,Photograph,Photography,...,,8.6,,,,,0.0,0.0,0.843137,unknown
38136,Untitled,Unknown photographer,8595.0,c. 1900,Gelatin silver print,"6 7/16 × 9 3/4"" (16.4 × 24.7 cm)",Gift of John Jeremiah Sullivan,TR16527.23,Photograph,Photography,...,,24.7,,,,,0.0,0.0,1.509780,unknown
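
Notice that the index in the output above now skips values (0, 13, 15, 272, ...). The sketch below, on a small made-up DataFrame, shows why: `dropna()` returns a new DataFrame, checks only the columns listed in `subset`, and keeps the original index labels of the surviving rows. The column values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Aspect': [0.81, np.nan, 1.75, np.nan],
    'Gender': ['Male', np.nan, np.nan, 'Female'],
})

# Only the "Aspect" column is checked; missing values elsewhere are ignored
toy_clean = toy.dropna(subset=['Aspect'])

# Equivalent boolean-mask form
toy_mask = toy[toy['Aspect'].notna()]

# The original is untouched and the surviving rows keep their index labels
print(len(toy), len(toy_clean))
print(toy_clean.index.tolist())

# Optional: renumber the rows from 0 after filtering
toy_renumbered = toy_clean.reset_index(drop=True)
```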


### Exercise 5: Use a string method to normalize data

Normalize the values in the column `Gender` by applying the string method `title()` to convert all strings to title case (for example, convert `male` to `Male` and `female` to `Female`). Print out the unique values of this column after performing the operation to ensure only three values are present (`Male`, `Female`, and `nan`).

In [86]:
# Normalize the string values in the "Gender" column to title case (each word
# beginning with a capital letter)
photos['Gender'] = photos['Gender'].str.title()
photos['Gender'].unique()

array(['Male', 'Female', nan], dtype=object)
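
String methods return a new Series rather than changing the column in place, so the result needs to be assigned back to take effect. The sketch below shows this on a small made-up Series and chains `str.strip()` before `str.title()`; the stray whitespace and the upper-case value are hypothetical inputs, not values observed in the MoMA data.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration, including stray whitespace and a missing value
gender = pd.Series(['Male', 'male', ' female ', 'FEMALE', np.nan])

# .str methods skip missing values, so NaN stays NaN
normalized = gender.str.strip().str.title()

# The original Series is unchanged until the result is assigned back
print(gender.unique())
print(normalized.unique())
```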

### Exercise 6: Convert string data to Datetime

Convert the string data contained in the column `DateAcquired` to Datetime data type. Reassign the new Datetime values to the existing `DateAcquired` column and then use this column to filter the DataFrame for items acquired in or after the year 2020.

In [90]:
# Convert the string data in the column "DateAcquired" to a Datetime data type
photos['DateAcquired'] = pd.to_datetime(photos['DateAcquired'])

# Use the updated column to filter the DataFrame to photos acquired in or after
# the year 2020
photos[photos['DateAcquired'] >= pd.to_datetime('2020')]

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Weight (kg),Width (cm),Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Aspect,CenturyBorn
37502,Untitled,Alfredo Boulton,20354.0,1934,Gelatin silver print,"9 1/2 × 7 1/16"" (24.2 × 18 cm)",Gift of Sofía Maduro through the Latin America...,28.2020.1,Photograph,Photography,...,,18.0,,"Venezuelan, 1908–1995",Venezuelan,Male,1908.0,1995.0,0.743802,20th
37510,Untitled,Victoria Cabezas,132160.0,1983-84,Chromogenic color print (solarized),"10 × 8 7/8"" (25.4 × 22.5 cm)",Acquired through the generosity of Judko Rosen...,TR16426.1,Photograph,Photography,...,,22.5,,"American, born 1950",American,Female,1950.0,0.0,0.885827,20th
37512,Untitled,Alfredo Boulton,20354.0,1934,Gelatin silver print,"9 5/8 × 7 1/16"" (24.5 × 18 cm)",Gift of Sofía Maduro through the Latin America...,28.2020.2,Photograph,Photography,...,,18.0,,"Venezuelan, 1908–1995",Venezuelan,Male,1908.0,1995.0,0.734694,20th
37513,Untitled,Alfredo Boulton,20354.0,1934,Gelatin silver print,"6 7/8 × 9 7/16"" (17.4 × 24 cm)",Gift of Sofía Maduro through the Latin America...,28.2020.3,Photograph,Photography,...,,24.0,,"Venezuelan, 1908–1995",Venezuelan,Male,1908.0,1995.0,1.379310,20th
37514,Untitled,Alfredo Boulton,20354.0,1934,Gelatin silver print,"9 3/4 × 6 3/4"" (24.8 × 17.1 cm)",Gift of Sofía Maduro through the Latin America...,28.2020.4,Photograph,Photography,...,,17.1,,"Venezuelan, 1908–1995",Venezuelan,Male,1908.0,1995.0,0.689516,20th
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38133,Untitled,Unknown photographer,8595.0,c. 1910,Gelatin silver print,"2 3/8 × 4 1/8"" (6 × 10.5 cm)",Gift of John Jeremiah Sullivan,TR16527.20,Photograph,Photography,...,,10.5,,,,,0.0,0.0,1.750000,unknown
38134,Untitled,Unknown photographer,8595.0,c. 1910,"Gelatin silver print, printed later","8 1/16 × 13 1/16"" (20.5 × 33.2 cm)",Gift of John Jeremiah Sullivan,TR16527.21,Photograph,Photography,...,,33.2,,,,,0.0,0.0,1.619512,unknown
38135,Untitled,Unknown photographer,8595.0,c. 1918-30,Gelatin silver print (postcard),"4 × 3 3/8"" (10.2 × 8.6 cm)",Gift of John Jeremiah Sullivan,TR16527.22,Photograph,Photography,...,,8.6,,,,,0.0,0.0,0.843137,unknown
38136,Untitled,Unknown photographer,8595.0,c. 1900,Gelatin silver print,"6 7/16 × 9 3/4"" (16.4 × 24.7 cm)",Gift of John Jeremiah Sullivan,TR16527.23,Photograph,Photography,...,,24.7,,,,,0.0,0.0,1.509780,unknown
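
The sketch below repeats the conversion and filter on a small made-up Series. It assumes some entries cannot be parsed as dates (the string `'unknown'` is hypothetical) and uses `errors='coerce'` so that those entries become `NaT` instead of raising an error. It also shows filtering with the `.dt.year` accessor as an alternative to comparing against a timestamp.

```python
import pandas as pd

# Made-up acquisition dates; 'unknown' and None stand in for unusable entries
acquired = pd.Series(['1964-10-06', '2020-03-11', '2021-01-15', 'unknown', None])

# errors='coerce' turns unparseable strings into NaT instead of raising an error
dates = pd.to_datetime(acquired, errors='coerce')

# Two equivalent filters for "acquired in or after 2020"; NaT never matches
by_timestamp = dates[dates >= pd.Timestamp('2020-01-01')]
by_year = dates[dates.dt.year >= 2020]

print(by_year)
```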


## Further resources

### Unfilled version of this notebook

[Python Open Labs Week 3 unfilled notebook](https://colab.research.google.com/github/ncsu-libraries-data-vis/python-open-labs/blob/main/Open_Lab_3_data_wrangling_with_pandas/Open_Lab_3_data_wrangling_with_pandas.ipynb) - a version of this notebook without the code filled in for the guided activity and exercises. Use the unfilled version to learn these materials or to lead a workshop session.

### Learning resources

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html) - a free online version of Jake VanderPlas's introduction to data science with Python, including a chapter on data manipulation with pandas.
- [Python Programming for Data Science](https://www.tomasbeuzen.com/python-programming-for-data-science/README.html) - a website providing an overview of data science with Python, including pandas.
- [Real Python](https://realpython.com/) - many tutorials at a range of skill levels.
- [LinkedIn Learning](https://www.lynda.com/Python-training-tutorials/415-0.html) - free with NC State accounts, with several video series for learning Python.
- [Dataquest](https://www.dataquest.io/) - a series of courses, free at first and then paid, with an emphasis on data science.

### Finding help with pandas

The [pandas website](https://pandas.pydata.org/) and [online documentation](http://pandas.pydata.org/pandas-docs/stable/) are useful resources, as is the indispensable [Stack Overflow "pandas" tag](https://stackoverflow.com/questions/tagged/pandas). There is also a (much younger, much smaller) [sister site dedicated to Data Science questions that has a "pandas" tag](https://datascience.stackexchange.com/questions/tagged/pandas).

## Evaluation Survey
Please take a minute to answer these questions. Your responses help us improve future workshops.

[go.ncsu.edu/dvs-eval](https://go.ncsu.edu/dvs-eval)

## Credits

This workshop was created by Claire Cahoon and Walt Gurley, adapted from previous workshop materials by Scott Bailey and Simon Wiles, of Stanford Libraries.