# Python Open Labs: Exploratory analysis with pandas

## Setup
With this Google Colaboratory (Colab) notebook open, click the "Copy to Drive" button that appears in the menu bar. The notebook will then be attached to your own user account, so you can edit it in any way you like -- you can even take notes directly in the notebook.

## Instructors
- Walt Gurley
- Claire Cahoon

## Open Labs agenda

1.   **Guided activity**: One of the instructors will share their screen to work through the guided activity and teach concepts along the way.

2.   **Open lab time**: After the guided portion of the Open Lab, the rest of the time is for you to ask questions, work collaboratively, or have self-guided practice time. You will have access to instructors and peers for questions and support.

Breakout rooms will be available if you would like to work in small groups. If you have trouble joining a room, ask in the chat to be moved into a room.

## Learning objectives

By the end of the workshop today, we hope you'll be able to explore datasets using aggregation methods and grouping.

## Today's Topics
- Exploratory analysis
- Unique values
- Value counts
- Minimum, maximum, and average
- Grouping using `groupby()`

## Questions during the workshop

Please feel free to ask questions throughout the workshop.

A second instructor will be available during the workshop. They will answer questions as they are able, and will collect questions whose answers might help everyone so they can be addressed at the end of the workshop.

The open lab time is when you will be able to ask more questions and work together on the exercises.

## Using Jupyter Notebooks and Google Colaboratory

Google Colab notebooks are a way to write and run Python code in an interactive way. If you would like to know more about Colaboratory and how to use notebooks, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

If you'd like to install a Python distribution locally, we're happy to help. Feel free to [get help from our graduate consultants](https://www.lib.ncsu.edu/dxl) or [schedule an appointment with Libraries staff](https://go.ncsu.edu/dvs-request).

## Guided Instruction
In this section, we will work through examples using data from the [Museum of Modern Art (MoMA) research dataset](https://github.com/MuseumofModernArt/collection) containing records of all of the works that have been cataloged in the database of the MoMA collection.

We have prepared a dataset that consists of a subset of MoMA artworks classified as paintings and their associated artist information to use in the following activities. We will be referencing the data that we have prepared in our [GitHub repository for teaching datasets](https://github.com/ncsu-libraries-data-vis/teaching-datasets/tree/main/moma_data).

### Exploratory analysis of the dataset

After observing and cleaning our dataset, it is now easier to conduct analyses on our data. We will conduct some numerical analyses that will help us explore questions such as:

- How many unique artists are represented in the dataset?
- Which artist nationalities appear most often?
- Which departments house the paintings?
- What are the smallest, largest, and average dimensions of the works?
- How do oil paintings compare with other paintings?

We can do this by calculating summaries of rows and columns, grouping data, and visualizing the results.

### Importing the dataset

In [1]:
# Import the pandas library as pd (callable in our code as pd)
import pandas as pd

In [2]:
# Import the paintings data from a csv file
# This dataset was cleaned based on methods from previous workshops
paintings_file_url = 'https://raw.githubusercontent.com/ncsu-libraries-data-vis/teaching-datasets/main/moma_data/moma_paintings_final.csv'

paintings = pd.read_csv(paintings_file_url)

# Print out the first five rows of the dataset
paintings.head()

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Area (cm^2),Aspect,OilPainting,YearCreated
0,"Rope and People, I",Joan Miró,4016.0,"Barcelona, March 27, 1935","Oil on cardboard mounted on wood, with coil of...","41 1/4 x 29 3/8"" (104.8 x 74.6 cm)",Gift of the Pierre Matisse Gallery,71.1936,Painting,Painting & Sculpture,...,,"Spanish, 1893–1983",Spanish,male,1893.0,1983.0,7818.08,0.711832,Y,1935.0
1,Fire in the Evening,Paul Klee,3130.0,1929,Oil on cardboard,"13 3/8 x 13 1/4"" (33.8 x 33.3 cm)",Mr. and Mrs. Joachim Jean Aberbach Fund,153.197,Painting,Painting & Sculpture,...,,"German, born Switzerland. 1879–1940",German,male,1879.0,1940.0,1125.54,0.985207,Y,1929.0
2,Portrait of an Equilibrist,Paul Klee,3130.0,1927,Oil and collage on cardboard over wood with pa...,"24 7/8 x 15 3/4"" (63.2 x 40 cm)",Mrs. Simon Guggenheim Fund,195.1966,Painting,Painting & Sculpture,...,,"German, born Switzerland. 1879–1940",German,male,1879.0,1940.0,2219.04,0.610282,Y,1927.0
3,Guitar,Pablo Picasso,4609.0,"Paris, early 1919","Oil, charcoal and pinned paper on canvas","7' 1"" x 31"" (216 x 78.8 cm)",Gift of A. Conger Goodyear,384.1955,Painting,Painting & Sculpture,...,,"Spanish, 1881–1973",Spanish,male,1881.0,1973.0,16991.33,0.364521,Y,1919.0
4,Grandmother,Arthur Dove,1602.0,1925,"Shingles, needlepoint, page from Concordance, ...","20 x 21 1/4"" (50.8 x 54.0 cm)",Gift of Philip L. Goodwin (by exchange),636.1939,Painting,Painting & Sculpture,...,,"American, 1880–1946",American,male,1880.0,1946.0,2743.2,1.062992,N,1925.0


### Aggregation Methods

There are several methods that can be used to calculate aggregated values from the dataset, such as the number of unique values, unique value counts, minimum, maximum, and average.

#### Unique
`unique()` will return an array containing each unique value in a column of data. Duplicate values are only shown once, so it is a useful tool for finding each different response in a column of data. If we were interested in how many different responses there were, we could use the `len()` function to find the length of that array.

In this example, we use the `unique()` method on the "Artist" column to create an array of unique artist names to see each artist that is represented in the collection. The length of this array will provide the number of unique artists.

In [3]:
# Create an array of the unique artists with unique()
unique_artists = paintings['Artist'].unique()

# Print out the unique artists
unique_artists

array(['Joan Miró', 'Paul Klee', 'Pablo Picasso', ..., 'Ouattara Watts',
       'Donald Moffett', 'Matthew Wong'], dtype=object)

In [4]:
# Get the length of the new array using len()
# How many unique artists are there?
len(unique_artists)

1018
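
The behavior of `unique()` described above can be tried on a small scale. The following is a minimal sketch that uses a made-up Series (the `artists` variable and its values are assumptions for illustration, not part of the MoMA data):

```python
import pandas as pd

# Hypothetical stand-in for the "Artist" column
artists = pd.Series(['Paul Klee', 'Joan Miró', 'Paul Klee',
                     'Pablo Picasso', 'Joan Miró'])

# unique() returns an array with each value listed once,
# in the order in which the values first appear
unique_names = artists.unique()
print(unique_names)

# The length of that array is the number of distinct values
print(len(unique_names))
```

Because the values come back in order of first appearance rather than alphabetically, you may want to sort the result if you plan to scan it by eye.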

#### Value counts

`.value_counts()` shows how many instances there are of each unique entry in a column. It lists each unique value and how many times it appears in a column of data.

Here, we are interested in seeing the nationalities of artists in order to figure out which areas of the world are most represented in MoMA.

We will specify the `Nationality` column in our DataFrame and call the method `value_counts()`. This will return a Series with an index label for each nationality in the data and a value corresponding to the number of times that nationality is listed in the `Nationality` column of the DataFrame.

In [5]:
# Count the occurrences of unique values in the column 'Nationality'
nationality = paintings['Nationality'].value_counts()

# Sort the Series by the value counts using sort_values()
nationality.sort_values(ascending=False)

American         1149
French            321
German            129
Spanish            96
British            87
Italian            75
Japanese           55
Brazilian          48
Belgian            35
Dutch              35
Venezuelan         35
Russian            32
Argentine          31
Mexican            28
Swiss              20
Austrian           19
Canadian           14
Uruguayan          12
Cuban              12
Australian          8
Chilean             8
Israeli             8
Polish              8
Danish              6
Romanian            6
Haitian             5
Korean              5
Colombian           5
Czech               5
Congolese           5
Indian              5
Hungarian           4
Peruvian            4
Swedish             3
Croatian            3
South African       3
Turkish             2
Yugoslav            2
Icelandic           2
Iranian             2
Irish               2
Zimbabwean          2
Moroccan            1
Sudanese            1
Norwegian           1
Ukrainian 
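
Raw counts can be hard to compare at a glance, so it may help to view them as proportions or to include missing values. The sketch below uses a small made-up Series (the `nationalities` values are assumptions for illustration) together with two optional parameters of `value_counts()`, `normalize` and `dropna`:

```python
import pandas as pd

# Hypothetical stand-in for the "Nationality" column, with one missing value
nationalities = pd.Series(['American', 'French', 'American', 'German',
                           'American', 'French', None, 'American'])

# Default behavior: missing values are dropped, largest count first
counts = nationalities.value_counts()
print(counts)

# normalize=True returns proportions of the non-missing values
shares = nationalities.value_counts(normalize=True)
print(shares)

# dropna=False adds a row that counts the missing values
with_missing = nationalities.value_counts(dropna=False)
print(with_missing)
```

Note that `value_counts()` already sorts from most to least frequent by default, so an explicit `sort_values()` is mainly useful when you want a different order.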

`.value_counts()` can also be a useful exploratory tool. You are able to see how many different categories are in a particular column and if there are areas that you would like to investigate more. 

For example, our data today only contains artwork that is classified as a "painting" in the `Classification` column. However, if we look at the `.value_counts()` to show how many pieces of art are in each department, we can see that they aren't all in the Painting & Sculpture department. Which paintings are housed in other departments and why? Should they be included in our dataset of paintings or not? We can take what we learned in the value counts and continue to filter the data to search for answers.

In [6]:
# Find the value counts for each department in the paintings data
paintings['Department'].value_counts()

Painting & Sculpture     2286
Drawings & Prints          29
Film                       27
Fluxus Collection           5
Media and Performance       3
Name: Department, dtype: int64

In [7]:
# To find out more about why some paintings are housed in the Film department
# filter the paintings data by rows where the department is Film
# notice that these paintings are all by the same artist
paintings[paintings["Department"] == "Film"]

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Area (cm^2),Aspect,OilPainting,YearCreated
1946,A Thief in Paradise,Batiste Madalena,35204.0,1925,Tempera on poster board,"Overall: 44 x 23 3/4"" (111.8 x 60.3 cm)",Gift of Judith and Steven Katten,F2012.1307,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,6741.948968,0.539773,N,1925.0
1947,Beggar on Horseback,Batiste Madalena,35204.0,1925,Tempera on poster board,"Overall: 44 1/8 x 24 3/4"" (112.1 x 62.9 cm)",Courtesy of Judith and Steven Katten,F2012.1308,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,7045.780221,0.560907,N,1925.0
1948,Classmates,Batiste Madalena,35204.0,1924,Tempera on poster board,"Overall: 44 x 24 3/4"" (111.8 x 62.9 cm)",Gift of Judith and Steven Katten,F2012.1309,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,7025.820503,0.5625,N,1924.0
1949,The Freshman,Batiste Madalena,35204.0,1925,Tempera and paper on poster board,"Overall: 43 x 24 1/8"" (109.2 x 61.3 cm)",Courtesy of Judith and Steven Katten,F1922,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,6692.755321,0.561047,N,1925.0
1950,The Haunted House,Batiste Madalena,35204.0,1928,Tempera on poster board,"Overall: 43 3/4 x 24 3/4"" (111.1 x 62.9 cm)",Courtesy of Judith and Steven Katten,TR14133.10,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,6985.901069,0.565714,N,1928.0
1951,The Kid Brother,Batiste Madalena,35204.0,1927,Tempera on poster board,"Overall: 43 3/4 x 24 3/8"" (111.1 x 61.9 cm)",Courtesy of Judith and Steven Katten,F2012.1310,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,6880.054083,0.557143,N,1927.0
1952,The Loves of Carmen,Batiste Madalena,35204.0,c. 1927,Tempera on poster board,"Overall: 43 x 24"" (109.2 x 61 cm)",Courtesy of Judith and Steven Katten,TR14133.6,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,6658.077832,0.55814,N,1927.0
1953,The Noose,Batiste Madalena,35204.0,c. 1928,Tempera on poster board,"Overall: 43 1/2 x 24 3/4"" (110.5 x 62.9 cm)",Courtesy of Judith and Steven Katten,F1926,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,6945.981634,0.568966,N,1928.0
1954,So This Is Marriage,Batiste Madalena,35204.0,1925,Tempera on poster board,"Overall: 43 1/8 x 24 1/4"" (109.5 x 61.6 cm)",Courtesy of Judith and Steven Katten,TR14133.7,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,6746.9893,0.562319,N,1925.0
1955,The Unknown,Batiste Madalena,35204.0,1927,Tempera on poster board,"Overall: 42 3/4 x 24"" (108.6 x 61 cm)",Courtesy of Judith and Steven Katten,F1929,Painting,Film,...,,"American, 1902–1988",American,male,1902.0,1988.0,6619.368077,0.561404,N,1927.0
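
Rather than reading through the filtered rows to check who made them, we can combine the filter with `value_counts()`. This is a minimal sketch on a made-up DataFrame (the `works` table and its titles and artist names are hypothetical):

```python
import pandas as pd

# Hypothetical miniature version of the paintings data
works = pd.DataFrame({
    'Title': ['Poster A', 'Poster B', 'Canvas C', 'Canvas D'],
    'Artist': ['Artist One', 'Artist One', 'Artist Two', 'Artist Three'],
    'Department': ['Film', 'Film', 'Painting & Sculpture',
                   'Painting & Sculpture'],
})

# Build a True/False mask, then keep only the rows where it is True
is_film = works['Department'] == 'Film'
film_works = works[is_film]

# Count how many of the filtered rows belong to each artist
film_artists = film_works['Artist'].value_counts()
print(film_artists)
```

If the result has a single row, every work in that department comes from one artist.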


#### Minimum, maximum, and average

We can also calculate aggregates like the minimum, maximum, and mean of values in a DataFrame or Series. This is primarily useful for numeric values. For strings, aggregates are based on alphabetical order, with uppercase preceding lowercase (for example, "B" would come before "a").

Here are a few examples:

- `mean()` to find the average of a range
- `min()` to find the smallest value
- `max()` to find the largest value
- `sum()` to sum the values of a range
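
To see how these methods treat numbers and strings differently, here is a small sketch using made-up Series (the `numbers` and `words` values are assumptions for illustration). The last line uses Python's built-in `min()` with `key=str.lower` as one possible way to ignore case:

```python
import pandas as pd

# Numeric values: the aggregates behave as you would expect
numbers = pd.Series([3, 1, 2])
print(numbers.mean())  # 2.0
print(numbers.sum())   # 6

# String values: uppercase letters sort before lowercase letters
words = pd.Series(['apple', 'Banana', 'cherry'])
print(words.min())  # Banana
print(words.max())  # cherry

# One way to find the alphabetical minimum while ignoring case
lowest_ignoring_case = min(words, key=str.lower)
print(lowest_ignoring_case)  # apple
```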

In [8]:
# Calculate the minimum values for each column with .min()
# Note the minimum values for columns not containing numbers (e.g., the minimum
# for strings is alphabetical order, with uppercase characters preceding
# lowercase characters - "B" comes before "a")
paintings.min()

Title                       "#1 - 1966"
ConstituentID                        11
AccessionNumber                  1.1931
Classification                 Painting
Department            Drawings & Prints
Cataloged                             N
ObjectID                          33621
Circumference (cm)                  NaN
Depth (cm)                            0
Diameter (cm)                      21.3
Height (cm)                           0
Length (cm)                         NaN
Weight (kg)                     18.0975
Width (cm)                            0
Duration (sec.)                     NaN
BeginDate                             0
EndDate                               0
Area (cm^2)                           0
Aspect                                0
YearCreated                        1872
dtype: object

In [9]:
# Calculate the average height for all pieces of art in this collection with .mean()
paintings['Height (cm)'].mean()

122.08828976977154

In [10]:
# Calculate the minimum, maximum, and average diameter for art in this collection with .agg()
paintings['Diameter (cm)'].agg(['mean', 'min', 'max'])

mean    138.357934
min      21.300000
max     203.200000
Name: Diameter (cm), dtype: float64
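
`.agg()` also works on several columns at once. The sketch below uses a made-up DataFrame (the `sizes` table and its values are hypothetical) and also shows `describe()`, which reports a fixed set of summary statistics including how many non-missing values went into them:

```python
import pandas as pd

# Hypothetical measurements, with one missing width
sizes = pd.DataFrame({
    'Height (cm)': [10.0, 20.0, 60.0],
    'Width (cm)': [5.0, 15.0, None],
})

# One row per statistic, one column per measurement
summary = sizes.agg(['mean', 'min', 'max'])
print(summary)

# describe() adds the count of non-missing values, among other statistics
width_stats = sizes['Width (cm)'].describe()
print(width_stats)
```

Missing values are skipped by default, so a mean can be based on far fewer rows than the DataFrame contains. Checking the count alongside the mean helps avoid over-reading a statistic built from only a handful of works.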

### Group values using groupby

We may be interested in seeing our data in groups. After the data has been grouped, we can do the same calculations on it as before (mean, min, max). In this example, we group the data based on the column `OilPainting` to see the average of each column based on whether or not it is an oil painting.

We can do this by calling `groupby()` on our dataset and passing in the column we would like to group by. We will group our data by the column `OilPainting` and then use the `.mean()` method to find the average of every column based on those groups.

In [11]:
# Group the dataset by "OilPainting"
oil_paintings = paintings.groupby('OilPainting')

# This creates a groupby object that contains information about the groups
type(oil_paintings)

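# Average the numeric columns in each group
# (numeric_only=True skips text columns, which pandas 2.0 and later require)
oil_paintings.mean(numeric_only=True)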

OilPainting,ConstituentID,ObjectID,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Duration (sec.),BeginDate,EndDate,Area (cm^2),Aspect,YearCreated
N,11023.11889,115468.081902,,8.924672,146.96,126.913884,,112.000073,146.602565,,1923.900925,1081.113606,28282.497759,1.188009,1969.260347
Y,6136.550251,92430.844221,,7.091307,132.2136,119.908527,,88.42825,122.469139,,1904.556533,1626.805905,19438.358664,1.059232,1947.796226
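
Averaging every column produces a wide table, and some of those averages (such as the mean of an ID column) are not meaningful. If we only care about one measurement, we can select that column from the grouped object before aggregating. This is a minimal sketch on a made-up DataFrame (the `works` table and its values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature version of the paintings data
works = pd.DataFrame({
    'OilPainting': ['Y', 'N', 'Y', 'N', 'Y'],
    'Medium': ['Oil on canvas', 'Tempera on board', 'Oil on wood',
               'Acrylic on canvas', 'Oil on canvas'],
    'Height (cm)': [100.0, 50.0, 140.0, 70.0, 120.0],
})

# Split the rows into groups, pick one column, then average within each group
mean_height = works.groupby('OilPainting')['Height (cm)'].mean()
print(mean_height)
```

The result is a Series indexed by the group labels, which is easier to read and sort than the full table.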


In [12]:
# Group the art based on the listed nationality of the artist
nationalities = paintings.groupby('Nationality')

# We can also sort the grouped data by any column
# Find the mean, then sort by "Height (cm)"
# ascending=False puts the tallest groups at the top
nationalities.mean(numeric_only=True).sort_values('Height (cm)', ascending=False)

Nationality,ConstituentID,ObjectID,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Duration (sec.),BeginDate,EndDate,Area (cm^2),Aspect,YearCreated
Korean,44082.2,188725.4,,0.0,,186.502616,,,220.17968,,1938.4,402.2,42191.547168,1.306454,1982.2
Guyanese,719.0,78919.0,,,,183.5,,,122.7,,1934.0,0.0,22515.45,0.668665,1973.0
Danish,19420.666667,123385.5,,0.0,,165.611007,,,120.4918,,1951.333333,1004.833333,20605.756608,0.76773,1994.0
South African,7521.0,113773.666667,,,,163.618653,,,176.6362,,1953.0,0.0,29453.453294,1.11502,2002.666667
Indian,8220.4,83941.6,,,,155.6081,,,196.14018,,1936.6,1201.6,36668.079211,1.218904,1971.2
Hungarian,10152.0,90595.5,,,,152.522606,,,184.565088,,1910.0,1996.5,46016.860521,1.108948,1945.0
Swiss,3951.35,107237.05,,0.0,,148.010914,,,105.463921,,1904.15,1672.65,20199.559041,0.929979,1951.55
British,7654.045977,113498.793103,,4.169565,,138.690094,,,137.833605,,1927.528736,1119.770115,24998.005399,1.018145,1968.632184
American,6949.020888,94392.101828,,12.975261,120.6994,137.475268,,96.285524,154.128731,,1920.357702,1381.623151,28716.39242,1.154387,1962.225524
Peruvian,11407.5,89292.5,,,,137.136347,,,134.156331,,1924.75,1494.5,19770.908848,1.132755,1968.25
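
A group average can be driven by very few rows, so it is often worth showing the group size next to the mean. The sketch below uses a made-up DataFrame (the `works` table, the group labels `A`, `B`, `C`, and the heights are all hypothetical) and passes a list of aggregations to `.agg()`:

```python
import pandas as pd

# Hypothetical data: group B contains a single, very tall work
works = pd.DataFrame({
    'Nationality': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Height (cm)': [100.0, 120.0, 140.0, 200.0, 90.0, 110.0],
})

# Compute the mean height and the number of works for each group
summary = (
    works.groupby('Nationality')['Height (cm)']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
)
print(summary)
```

Here group `B` sorts to the top, but its `count` of 1 shows that the "average" is just one work.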


You can also use `groupby()` to group data by multiple variables. We will create a hierarchical grouping of `OilPainting` and then `Gender` to see the counts of oil paintings by different genders. We can use `.mean()` to find the average for each of those subcategories, or we can use `.size()` to find counts of each category.

In [13]:
# Group the data by OilPainting and then Gender 
# find the counts of subcategories with .size()
paintings.groupby(['OilPainting', 'Gender']).size()

OilPainting  Gender
N            female     108
             male       646
Y            female     187
             male      1398
dtype: int64
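
The result of grouping by two columns is a Series with a two-level index. If a table is easier to read, `unstack()` can move one of those levels into the columns. This is a minimal sketch on a made-up DataFrame (the `works` table and its values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature version of the paintings data
works = pd.DataFrame({
    'OilPainting': ['Y', 'Y', 'N', 'Y', 'N', 'Y'],
    'Gender': ['male', 'female', 'male', 'male', 'female', 'male'],
})

# Count the rows in each (OilPainting, Gender) combination
counts = works.groupby(['OilPainting', 'Gender']).size()
print(counts)

# Move the Gender level into the columns to get a two-way table
table = counts.unstack('Gender')
print(table)
```

Each cell of `table` holds the count for one combination, with `OilPainting` down the rows and `Gender` across the columns.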

## Open work time
You can use this time to ask questions, collaborate, or work on the following activities (on your own or in a group).

Exercises 1 and 2 use a dataset about photographs from the MoMA collection, while Exercises 3 and 4 return to the `paintings` data. The photographs dataset has the same columns as the examples above, but the contents of the rows are different. You can run the first cell below to read in the dataset as a DataFrame.

In [14]:
photos_file_url = 'https://raw.githubusercontent.com/ncsu-libraries-data-vis/teaching-datasets/main/moma_data/moma_photographs_final.csv'

photos = pd.read_csv(photos_file_url)

# Print out the first five rows of the dataset
photos.head()

Unnamed: 0,Title,Artist,ConstituentID,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,...,Duration (sec.),ArtistBio,Nationality,Gender,BeginDate,EndDate,Area (cm^2),Aspect,OilPainting,YearCreated
0,Untitled from VVV Portfolio,David Hare,2504.0,"c. 1941, published 1943",Gelatin silver print mounted on paper from a p...,"composition: 12 x 9 3/4"" (30.5 x 24.8 cm); she...",The Louis E. Stern Collection,1113.1964.6,Photograph,Drawings & Prints,...,,"American, 1917–1992",American,male,1917.0,1992.0,756.4,0.813115,N,1941.0
1,Tuileries Sanglier / d'apres l'antique,Eugène Atget,229.0,1911,Albumen silver print,"8 11/16 × 6 9/16"" (22 × 16.7 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1,Photograph,Photography,...,,"French, 1857–1927",French,male,1857.0,1927.0,,,N,1911.0
2,Sapin (Trianon),Eugène Atget,229.0,1910-14,Albumen silver print,"Approx. 7 1/8 × 8 5/8"" (18.1 × 21.9 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.10,Photograph,Photography,...,,"French, 1857–1927",French,male,1857.0,1927.0,,,N,1910.0
3,"Versailles, vase par Ballin",Eugène Atget,229.0,1902,Matte albumen silver print,"Approx. 8 9/16 × 7 1/16"" (21.8 × 18 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.100,Photograph,Photography,...,,"French, 1857–1927",French,male,1857.0,1927.0,,,N,1902.0
4,Facteur,Eugène Atget,229.0,1899-1900,Gelatin silver printing-out-paper print,"Approx. 8 11/16 × 6 9/16"" (22 × 16.7 cm)",Abbott-Levy Collection. Partial gift of Shirle...,1.1969.1000,Photograph,Photography,...,,"French, 1857–1927",French,male,1857.0,1927.0,,,N,1899.0


### Exercise 1: Find value counts

1a. How many different mediums were used to create these photographs?

1b. What is the most common medium? Print out a list of how many of each of the mediums is listed with the most common at the top.

In [15]:
# 1a. Print how many different mediums are listed
len(photos["Medium"].unique())

1094

In [16]:
# 1b. Print a list of how many of each medium is in the data
photos["Medium"].value_counts()

Gelatin silver print                                                   14953
Albumen silver print                                                    4768
Chromogenic color print                                                 1670
Gelatin silver printing-out-paper print                                  909
Pigmented inkjet print                                                   516
                                                                       ...  
Salted paper print from a glass negative (copy of a daguerreotype)         1
Nine gelatin silver prints with applied color                              1
Ten vintage gelatin silver prints and video (color, sound; 20 min.)        1
Album of twelve gelatin silver prints (Rayographs)                         1
Chromogenic color print in self-lubricating frame                          1
Name: Medium, Length: 1093, dtype: int64
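
The two answers above differ by one (1094 versus 1093). A likely explanation, which we have not verified against the file, is that the `Medium` column contains at least one missing value: `unique()` includes missing values in its result, while `value_counts()` drops them by default. The sketch below reproduces that effect on a made-up Series (the `mediums` values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the "Medium" column, with one missing value
mediums = pd.Series(['Gelatin silver print', 'Albumen silver print',
                     'Gelatin silver print', None])

# unique() keeps the missing value as its own entry
n_unique_with_missing = len(mediums.unique())
print(n_unique_with_missing)  # 3

# value_counts() ignores missing values unless dropna=False is passed
n_counted = len(mediums.value_counts())
print(n_counted)  # 2

# nunique() also ignores missing values by default
print(mediums.nunique())              # 2
print(mediums.nunique(dropna=False))  # 3
```

If you only want to count real mediums, `nunique()` gives that number directly.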

### Exercise 2: Find the average, minimum and maximum

Find the average, minimum, and maximum from the column 'Width (cm)'.

In [17]:
# Get general statistics of the width of the photographs
photos['Width (cm)'].agg(['mean', 'min', 'max'])

mean      31.970628
min        0.000000
max     1226.822454
Name: Width (cm), dtype: float64

### Exercise 3: Grouping Values

Find the average year created for each department (Painting & Sculpture, Drawings & Prints, Film, Fluxus Collection, Media and Performance). Group the data using 'Department', then show a table with the average 'YearCreated' for each department, sorted from oldest to newest.

> Bonus discussion question: does showing the data this way tell the whole story? What other factors could affect the average year the art was created?

In [18]:
# Create the groupby object and store in a variable
departments = paintings.groupby('Department')

# Find the mean of each group and sort values by YearCreated
departments.mean(numeric_only=True).sort_values('YearCreated', ascending=True)

Department,ConstituentID,ObjectID,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Duration (sec.),BeginDate,EndDate,Area (cm^2),Aspect,YearCreated
Film,35204.0,123348.111111,,0.0,,109.588519,,,61.017272,,1902.0,1988.0,6688.993217,0.556802,1926.074074
Painting & Sculpture,7401.635171,99089.028871,,8.700456,138.357934,123.939428,,96.285524,132.694652,,1910.492126,1463.969379,22778.105516,1.103793,1954.675
Fluxus Collection,5262.5,133986.4,,1.04,,14.24,,,13.42,,1934.0,0.0,465.448,1.158385,1976.333333
Drawings & Prints,6794.62069,130695.896552,,1.809131,,17.787072,,,20.433278,,1934.310345,274.758621,582.894061,1.266991,1976.653846
Media and Performance,8383.0,129208.666667,,0.0,,20.955042,,,28.575057,,1959.0,0.0,602.958195,1.359944,1995.666667


### Exercise 4: Group values by two factors

Find how many works of art each artist created using different mediums. Group by artist and medium to create a table that shows how many works of art in the collection by each artist used each medium.

In [19]:
# Create the groupby object and store in a variable
test = paintings.groupby(['Artist', 'Medium'])

# Find the number of works in each group with .size()
test.size()

Artist                     Medium                                                                                     
A. E. Gallatin             Oil on canvas                                                                                  1
A.R. Penck (Ralf Winkler)  Synthetic polymer paint on canvas                                                              1
Abraham Palatnik           Jacaranda wood                                                                                 1
                           Wood, metal, synthetic fabric, lightbulbs, and motor                                           1
Abraham Rattner            Oil on canvas                                                                                  1
                                                                                                                         ..
Zvi Gali                   Encaustic on plywood                                                                           1
Édouard Vuill

## Further resources

### Unfilled version of this notebook

[Python Open Labs Week 4 unfilled notebook](https://colab.research.google.com/github/ncsu-libraries-data-vis/python-open-labs/blob/main/Open_Lab_4_exploratory_analysis_with_pandas/Open_Lab_4_exploratory_analysis_with_pandas.ipynb) - a blank version of this notebook with empty cells for the guided activity and exercises.

### Learning resources

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html) - a free, online version of Jake VanderPlas' introduction to data science with Python, includes a chapter on data manipulation with pandas.
- [Python Programming for Data Science](https://www.tomasbeuzen.com/python-programming-for-data-science/README.html) - a website providing a great overview of conducting data science with Python including pandas.
- [Real Python](https://realpython.com/) contains a lot of different tutorials at different levels
- [LinkedIn Learning](https://www.lynda.com/Python-training-tutorials/415-0.html) is free with NC State accounts and contains several video series for learning Python
- [Dataquest](https://www.dataquest.io/) is a free then paid series of courses with an emphasis on data science

### Finding help with pandas

The [pandas website](https://pandas.pydata.org/) and [online documentation](http://pandas.pydata.org/pandas-docs/stable/) are useful resources, and of course the indispensable [Stack Overflow has a "pandas" tag](https://stackoverflow.com/questions/tagged/pandas). There is also a (much younger, much smaller) [sister site dedicated to Data Science questions that has a "pandas" tag](https://datascience.stackexchange.com/questions/tagged/pandas) too.

## Evaluation Survey
Please take a minute to answer these questions, which help us improve future workshops.

https://go.ncsu.edu/dvs-eval

## Credits

This workshop was created by Claire Cahoon and Walt Gurley, adapted from previous workshop materials by Scott Bailey and Simon Wiles, of Stanford Libraries.