# Python Open Labs: Working with multiple datasets in pandas

## Setup
With this Google Colaboratory (Colab) notebook open, click the "Copy to Drive" button that appears in the menu bar. The notebook will then be attached to your own user account, so you can edit it in any way you like -- you can even take notes directly in the notebook.

## Instructors
- Walt Gurley
- Claire Cahoon

## Open Labs agenda

1.   **Guided activity**: One of the instructors will share their screen to work through the guided activity and teach concepts along the way.

2.   **Open lab time**: After the guided portion of the Open Lab, the rest of the time is for you to ask questions, work collaboratively, or have self-guided practice time. You will have access to instructors and peers for questions and support.

Breakout rooms will be available if you would like to work in small groups. If you have trouble joining a room, ask in the chat to be moved into a room.

## Learning objectives

By the end of today's workshop, we hope you'll understand basic pandas methods for loading, combining, and preparing different types of datasets for analysis.

## Today's Topics

- Editing DataFrame index labels and column headers
- Concatenating DataFrames
- Merging DataFrames

## Questions during the workshop

Please feel free to ask questions throughout the workshop.

A second instructor will be available during the workshop. They will answer questions as they are able and will collect questions whose answers might help everyone, to be addressed at the end of the workshop.

The open lab time is when you will be able to ask more questions and work together on the exercises.

## Guided Instruction

In this Open Lab we're introducing how to use the pandas library to load, combine, and prepare multiple datasets for analysis.

In this section, we will work through examples using data from the [Museum of Modern Art (MoMA) research dataset](https://github.com/MuseumofModernArt/collection) containing records of all of the works that have been cataloged in the database of the MoMA collection.

> "The Museum’s website features 89,695 artworks from 26,494 artists. This research dataset contains 138,151 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in our database. It includes basic metadata for each work, including title, artist, date made, medium, dimensions, and date acquired by the Museum. Some of these records have incomplete information and are noted as “not Curator Approved.”" - [MoMA GitHub repository for collection data](https://github.com/MuseumofModernArt/collection)

We have split the dataset into several subsets (paintings, sculptures, photographs, and artist information) and saved them as different file types to use in activities. We will be referencing the data that we have prepared in our [GitHub repository for teaching datasets](https://github.com/ncsu-libraries-data-vis/teaching-datasets/tree/main/moma_data).

In [43]:
# Import the pandas library under the alias pd
import pandas as pd

### Load the datasets

In [44]:
# Import the MoMA paintings dataset (CSV file)

# The file location
paintings_file_url = 'https://raw.githubusercontent.com/ncsu-libraries-data-vis/teaching-datasets/main/moma_data/moma_paintings.csv'

# Read in the file and display the first five rows of the DataFrame
paintings = pd.read_csv(paintings_file_url)
paintings.head()

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Rope and People, I",Joan Miró,4016,"Barcelona, March 27, 1935","Oil on cardboard mounted on wood, with coil of...","41 1/4 x 29 3/8"" (104.8 x 74.6 cm)",Gift of the Pierre Matisse Gallery,71.1936,Painting,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjE2MDU0NiJd...,,,,104.8,,,74.6,,
1,Fire in the Evening,Paul Klee,3130,1929,Oil on cardboard,"13 3/8 x 13 1/4"" (33.8 x 33.3 cm)",Mr. and Mrs. Joachim Jean Aberbach Fund,153.197,Painting,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjE3Njc2NyJd...,,,,33.8,,,33.3,,
2,Portrait of an Equilibrist,Paul Klee,3130,1927,Oil and collage on cardboard over wood with pa...,"24 7/8 x 15 3/4"" (63.2 x 40 cm)",Mrs. Simon Guggenheim Fund,195.1966,Painting,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjE3OTI4NSJd...,,,,60.3,,,36.8,,
3,Guitar,Pablo Picasso,4609,"Paris, early 1919","Oil, charcoal and pinned paper on canvas","7' 1"" x 31"" (216 x 78.8 cm)",Gift of A. Conger Goodyear,384.1955,Painting,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjE1MDQ2MiJd...,,,,215.9,,,78.7,,
4,Grandmother,Arthur Dove,1602,1925,"Shingles, needlepoint, page from Concordance, ...","20 x 21 1/4"" (50.8 x 54.0 cm)",Gift of Philip L. Goodwin (by exchange),636.1939,Painting,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjI0NzA5NCJd...,,,,50.8,,,54.0,,
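
The `Unnamed: 0` header in the output above typically appears when a CSV file was saved with its index and then read back without telling pandas which column holds the index labels. The following is a minimal sketch of how index labels and column headers might be edited in that situation. It uses a tiny in-memory stand-in rather than the real file, and the names `csv_text`, `RecordID`, `row_a`, and `row_b` are hypothetical.

```python
import io
import pandas as pd

# A tiny stand-in for a CSV saved with its index (values taken from the rows shown above)
csv_text = """,Title,Artist,ConstituentID
0,Fire in the Evening,Paul Klee,3130
1,Guitar,Pablo Picasso,4609
"""

# Default read: the unlabeled first column is given the header "Unnamed: 0"
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.columns.tolist())

# Option 1: tell pandas that the first column holds the index labels
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: keep the column but give it a descriptive header (hypothetical name)
renamed = default_read.rename(columns={'Unnamed: 0': 'RecordID'})

# Index labels can be edited the same way, by passing a mapping to `index`
relabeled = indexed.rename(index={0: 'row_a', 1: 'row_b'})
print(relabeled)
```

`rename` returns a new DataFrame and leaves the original unchanged, so the result needs to be assigned to a variable to be kept.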


In [45]:
# Import the MoMA sculptures dataset (JSON file)

# The file location
sculptures_file_url = 'https://raw.githubusercontent.com/ncsu-libraries-data-vis/teaching-datasets/main/moma_data/moma_sculptures.json'

# Read in the file and display the first five rows of the DataFrame
sculptures = pd.read_json(sculptures_file_url)
sculptures.head()

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
73418,Surface with Vibrating Texture,Getulio Alviani,137,1964,Brushed aluminum on board,"33 x 32 3/4"" (83.6 x 83.2 cm)",Larry Aldrich Foundation Fund,105.1965,Sculpture,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjIwODIwOCJd...,,,,83.6,,,83.2,,
73474,IN RELATION TO AN INCREASE IN QUANTITY REGARDL...,Lawrence Weiner,6288,1973-74,LANGUAGE + THE MATERIALS REFERRED TO,Dimensions variable,Given anonymously,117.1975,Sculpture,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjMxODk1MSJd...,,,,,,,,,
73564,3 Standard Stoppages,Marcel Duchamp,1634,Paris 1913-14,"Wood box 11 1/8 x 50 7/8 x 9"" (28.2 x 129.2 x ...",,Katherine S. Dreier Bequest,149.1953.a-i,Sculpture,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjEzODY0NSJd...,,,,13.3,,,120.0,,
73567,To Be Looked at (from the Other Side of the Gl...,Marcel Duchamp,1634,Buenos Aires 1918,"Oil, silver leaf, lead wire, and magnifying le...","Overall 22"" (55.8 cm) high",Katherine S. Dreier Bequest,150.1953,Sculpture,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjI0MzI3MyJd...,,,,49.5,,,39.7,,
73733,Revolving,Kurt Schwitters,5293,1919,"Wood, metal, cord, cardboard, wool, wire, leat...","48 3/8 x 35"" (122.7 x 88.7 cm)",Advisory Committee Fund,231.1968,Sculpture,Painting & Sculpture,...,http://www.moma.org/media/W1siZiIsIjEyMjc3MCJd...,,,,122.7,,,88.7,,


In [46]:
# Import the MoMA photographs dataset (Excel file)

# The file location
photos_file_url = 'https://raw.githubusercontent.com/ncsu-libraries-data-vis/teaching-datasets/main/moma_data/moma_photographs.xlsx'

# Read in the file and display the first five rows of the DataFrame
photos = pd.read_excel(photos_file_url)
photos.head()

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,Untitled from VVV Portfolio,David Hare,2504,"c. 1941, published 1943",Gelatin silver print mounted on paper from a p...,"composition: 12 x 9 3/4"" (30.5 x 24.8 cm); she...",The Louis E. Stern Collection,1113.1964.6,Photograph,Drawings & Prints,...,http://www.moma.org/media/W1siZiIsIjM0NTUzOCJd...,,,,30.5,,,24.8,,
1,Tuileries Sanglier / d'apres l'antique,Eugène Atget,229,1911,Albumen silver print,"8 11/16 × 6 9/16"" (22 × 16.7 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1,Photograph,Photography,...,http://www.moma.org/media/W1siZiIsIjMwMTMwNCJd...,,,,,,,,,
2,Sapin (Trianon),Eugène Atget,229,1910-14,Albumen silver print,"Approx. 7 1/8 × 8 5/8"" (18.1 × 21.9 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.10,Photograph,Photography,...,http://www.moma.org/media/W1siZiIsIjMxODEwMSJd...,,,,,,,,,
3,"Versailles, vase par Ballin",Eugène Atget,229,1902,Matte albumen silver print,"Approx. 8 9/16 × 7 1/16"" (21.8 × 18 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.100,Photograph,Photography,...,http://www.moma.org/media/W1siZiIsIjMxODEwMiJd...,,,,,,,,,
4,Facteur,Eugène Atget,229,1899-1900,Gelatin silver printing-out-paper print,"Approx. 8 11/16 × 6 9/16"" (22 × 16.7 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1000,Photograph,Photography,...,http://www.moma.org/media/W1siZiIsIjI4NjE4MiJd...,,,,,,,,,


### Concatenate datasets
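
Concatenation stacks DataFrames that share the same columns into one longer DataFrame. The sketch below is one way this step might look; it uses small stand-in DataFrames (`paintings_sample` and `sculptures_sample`, hypothetical names with values taken from the outputs above) so that it runs without downloading the files.

```python
import pandas as pd

# Small stand-ins for the loaded DataFrames (values taken from the outputs above)
paintings_sample = pd.DataFrame({
    'Title': ['Fire in the Evening', 'Guitar'],
    'Artist': ['Paul Klee', 'Pablo Picasso'],
    'Classification': ['Painting', 'Painting'],
})
sculptures_sample = pd.DataFrame(
    {
        'Title': ['Revolving'],
        'Artist': ['Kurt Schwitters'],
        'Classification': ['Sculpture'],
    },
    index=[73733],
)

# Stack the rows; ignore_index=True replaces the original labels with 0..n-1
artworks = pd.concat([paintings_sample, sculptures_sample], ignore_index=True)
print(artworks)

# Without ignore_index, the original index labels are kept (0, 1, 73733 here)
kept_labels = pd.concat([paintings_sample, sculptures_sample])
print(kept_labels.index.tolist())
```

Assuming the three loaded DataFrames share the same columns, as the outputs above suggest, the same call could be applied to them as `pd.concat([paintings, sculptures, photos], ignore_index=True)`.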

### Join datasets on shared column values
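
Merging combines columns from two DataFrames by matching values in a shared column. In the artwork outputs above, `ConstituentID` identifies the artist, so it is a natural key for attaching artist information. The sketch below assumes a small, hypothetical artist lookup table; the column name `DisplayName` and the sample DataFrames are illustrative and are not loaded from the real artist file.

```python
import pandas as pd

# Stand-in for the artworks data (values taken from the outputs above)
artworks_sample = pd.DataFrame({
    'Title': ['Fire in the Evening', 'Guitar', 'Grandmother'],
    'ConstituentID': [3130, 4609, 1602],
})

# Hypothetical artist lookup table; 'DisplayName' is an assumed column name
artists_sample = pd.DataFrame({
    'ConstituentID': [3130, 4609],
    'DisplayName': ['Paul Klee', 'Pablo Picasso'],
})

# Inner merge (the default): keep only rows whose ConstituentID is in both tables
inner = pd.merge(artworks_sample, artists_sample, on='ConstituentID')
print(inner)

# Left merge: keep every artwork; artworks with no matching artist get NaN
left = pd.merge(artworks_sample, artists_sample, on='ConstituentID', how='left')
print(left)
```

The choice of `how` determines which rows survive: `'inner'` drops the artwork whose artist is missing from the lookup table, while `'left'` keeps it with a missing value.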

## Further resources

### Filled version of this notebook

[Python Open Labs Week 1 filled notebook](https://colab.research.google.com/github/NCSU-Libraries/data-viz-workshops/blob/master/Python_Open_Labs/Reading_exploring_and_writing_data_with_Pandas/Python_Open_Labs_Week1_filled.ipynb) - a version of this notebook with all code filled in for the guided activity and exercises. TODO - update link

### Learning resources

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html) - a free online version of Jake VanderPlas's introduction to data science with Python, including a chapter on data manipulation with pandas.
- [Python Programming for Data Science](https://www.tomasbeuzen.com/python-programming-for-data-science/README.html) - a website providing a great overview of conducting data science with Python including pandas.
- [Real Python](https://realpython.com/) - a large collection of tutorials at different skill levels.
- [LinkedIn Learning](https://www.lynda.com/Python-training-tutorials/415-0.html) - free with NC State accounts, with several video series for learning Python.
- [Dataquest](https://www.dataquest.io/) - a series of courses, free at first and then paid, with an emphasis on data science.

### Finding help with pandas

The [pandas website](https://pandas.pydata.org/) and [online documentation](http://pandas.pydata.org/pandas-docs/stable/) are useful resources, as is the indispensable [Stack Overflow "pandas" tag](https://stackoverflow.com/questions/tagged/pandas). There is also a (much younger, much smaller) [sister site dedicated to Data Science questions that has a "pandas" tag](https://datascience.stackexchange.com/questions/tagged/pandas).

## Evaluation Survey
Please take a minute to answer these questions; your responses help us improve future workshops.

[go.ncsu.edu/dvs-eval](https://go.ncsu.edu/dvs-eval)

## Credits

This workshop was created by Claire Cahoon and Walt Gurley, adapted from previous workshop materials by Scott Bailey and Simon Wiles, of Stanford Libraries.