# Dataset 2: MOMA Artworks Data

We read in the CSV file obtained from https://github.com/MuseumofModernArt/collection/blob/master/Artworks.csv and use `head()` to peek at the first five rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

art = pd.read_csv('Artworks.csv')
art.head()

| | Title | Artist | ConstituentID | ArtistBio | Nationality | BeginDate | EndDate | Gender | Date | Medium | ... | ThumbnailURL | Circumference (cm) | Depth (cm) | Diameter (cm) | Height (cm) | Length (cm) | Weight (kg) | Width (cm) | Seat Height (cm) | Duration (sec.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ferdinandsbrücke Project, Vienna, Austria, Ele... | Otto Wagner | 6210 | (Austrian, 1841–1918) | (Austrian) | (1841) | (1918) | (Male) | 1896 | Ink and cut-and-pasted painted pages on paper | ... | http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s... | | | | 48.6 | | | 168.9 | | |
| 1 | City of Music, National Superior Conservatory ... | Christian de Portzamparc | 7470 | (French, born 1944) | (French) | (1944) | (0) | (Male) | 1987 | Paint and colored pencil on print | ... | http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw... | | | | 40.6401 | | | 29.8451 | | |
| 2 | Villa near Vienna Project, Outside Vienna, Aus... | Emil Hoppe | 7605 | (Austrian, 1876–1957) | (Austrian) | (1876) | (1957) | (Male) | 1903 | Graphite, pen, color pencil, ink, and gouache ... | ... | http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw... | | | | 34.3 | | | 31.8 | | |
| 3 | The Manhattan Transcripts Project, New York, N... | Bernard Tschumi | 7056 | (French and Swiss, born Switzerland 1944) | () | (1944) | (0) | (Male) | 1980 | Photographic reproduction with colored synthet... | ... | http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi... | | | | 50.8 | | | 50.8 | | |
| 4 | Villa, project, outside Vienna, Austria, Exter... | Emil Hoppe | 7605 | (Austrian, 1876–1957) | (Austrian) | (1876) | (1957) | (Male) | 1903 | Graphite, color pencil, ink, and gouache on tr... | ... | http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi... | | | | 38.4 | | | 19.1 | | |


This is a wide dataset, and even `head()` cannot show every column. Notice the ellipsis in the column headings, and that the output reports "5 rows x 29 columns" at the bottom. We use `count()` to get the number of non-null values in each column.

In [2]:
art.count()

Title                 133288
Artist                131876
ConstituentID         131876
ArtistBio             128071
Nationality           131876
BeginDate             131876
EndDate               131876
Gender                131876
Date                  131195
Medium                122182
Dimensions            122203
CreditLine            130433
AccessionNumber       133331
Classification        133330
Department            133331
DateAcquired          127220
Cataloged             133331
ObjectID              133331
URL                    75569
ThumbnailURL           64648
Circumference (cm)        10
Depth (cm)             12149
Diameter (cm)           1418
Height (cm)           114603
Length (cm)              738
Weight (kg)              291
Width (cm)            113704
Seat Height (cm)           0
Duration (sec.)         3153
dtype: int64

The highest count is 133,331, which suggests the dataset has 133,331 records. Columns with lower counts are missing values; `Seat Height (cm)`, for example, has no values at all. For this analysis, we focus on the `DateAcquired` column, which has 127,220 non-null values.
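As a complement to `count()`, missing values can also be counted directly. The sketch below uses a small made-up table (the `toy` DataFrame and its values are assumptions for illustration, not rows from `Artworks.csv`) to show that `isna().sum()` and `count()` always add up to the number of rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Artworks table (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "DateAcquired": ["1995-01-17", None, "1997-01-15", None],
    "Seat Height (cm)": [np.nan, np.nan, np.nan, np.nan],
})

total_rows = len(toy)                 # number of records, regardless of missing values
missing = toy.isna().sum()            # missing entries per column
missing_share = missing / total_rows  # fraction of each column that is missing

print(total_rows)
print(missing)
print(missing_share)
```

Using `len()` for the record count avoids relying on at least one column being fully populated, which is what reading the maximum of `count()` assumes.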

In [3]:
acquired = art.loc[:, 'DateAcquired']
acquired

0         1996-04-09
1         1995-01-17
2         1997-01-15
3         1995-01-17
4         1997-01-15
5         1995-01-17
6         1995-01-17
7         1995-01-17
8         1995-01-17
9         1995-01-17
10        1995-01-17
11        1995-01-17
12        1995-01-17
13        1995-01-17
14        1995-01-17
15        1995-01-17
16        1995-01-17
17        1995-01-17
18        1995-01-17
19        1995-01-17
20        1995-01-17
21        1995-01-17
22        1995-01-17
23        1995-01-17
24        1995-01-17
25        1995-01-17
26        1995-01-17
27        1995-01-17
28        1995-01-17
29        1995-01-17
             ...    
133301    2018-02-07
133302    2018-02-07
133303    2018-02-07
133304    2018-02-07
133305    2018-02-07
133306    2018-02-07
133307    2018-02-07
133308    2018-02-07
133309    2018-02-07
133310    2018-02-07
133311           NaN
133312           NaN
133313    2018-02-07
133314    2018-02-07
133315    2018-02-07
133316    2017-05-24
133317    201

We use `pd.to_datetime()` to convert the date strings into datetimes. We pass `errors='coerce'` so that any value that cannot be parsed is returned as `NaT` instead of raising an exception and stopping the conversion.

In [4]:
clean = pd.to_datetime(acquired, errors='coerce')
clean

0        1996-04-09
1        1995-01-17
2        1997-01-15
3        1995-01-17
4        1997-01-15
5        1995-01-17
6        1995-01-17
7        1995-01-17
8        1995-01-17
9        1995-01-17
10       1995-01-17
11       1995-01-17
12       1995-01-17
13       1995-01-17
14       1995-01-17
15       1995-01-17
16       1995-01-17
17       1995-01-17
18       1995-01-17
19       1995-01-17
20       1995-01-17
21       1995-01-17
22       1995-01-17
23       1995-01-17
24       1995-01-17
25       1995-01-17
26       1995-01-17
27       1995-01-17
28       1995-01-17
29       1995-01-17
            ...    
133301   2018-02-07
133302   2018-02-07
133303   2018-02-07
133304   2018-02-07
133305   2018-02-07
133306   2018-02-07
133307   2018-02-07
133308   2018-02-07
133309   2018-02-07
133310   2018-02-07
133311          NaT
133312          NaT
133313   2018-02-07
133314   2018-02-07
133315   2018-02-07
133316   2017-05-24
133317   2017-05-24
133318   2017-05-24
133319   2017-05-24
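To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up Series (the values, including the string `"not a date"`, are assumptions for illustration). It also counts how many entries were present but unparseable, which is a useful check before discarding `NaT` values.

```python
import pandas as pd

# Illustrative values only; "not a date" stands in for any unparseable entry
raw = pd.Series(["1996-04-09", None, "not a date", "2018-02-07"])

# errors="coerce" turns both missing and unparseable entries into NaT
parsed = pd.to_datetime(raw, errors="coerce")

# Entries that were present in the input but could not be parsed
failed_to_parse = int((raw.notna() & parsed.isna()).sum())

# With the default errors="raise", the same input stops with an exception
try:
    pd.to_datetime(raw)
    raised = False
except ValueError:
    raised = True

print(parsed)
print(failed_to_parse, raised)
```

If `failed_to_parse` is zero on the real column, every `NaT` comes from a value that was already missing rather than from a malformed date.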


Now we use `dropna()` to drop the entries whose value is `NaT`.

In [5]:
acquired = clean.dropna()
acquired

0        1996-04-09
1        1995-01-17
2        1997-01-15
3        1995-01-17
4        1997-01-15
5        1995-01-17
6        1995-01-17
7        1995-01-17
8        1995-01-17
9        1995-01-17
10       1995-01-17
11       1995-01-17
12       1995-01-17
13       1995-01-17
14       1995-01-17
15       1995-01-17
16       1995-01-17
17       1995-01-17
18       1995-01-17
19       1995-01-17
20       1995-01-17
21       1995-01-17
22       1995-01-17
23       1995-01-17
24       1995-01-17
25       1995-01-17
26       1995-01-17
27       1995-01-17
28       1995-01-17
29       1995-01-17
            ...    
133298   2018-02-07
133299   2018-02-07
133300   2018-02-07
133301   2018-02-07
133302   2018-02-07
133303   2018-02-07
133304   2018-02-07
133305   2018-02-07
133306   2018-02-07
133307   2018-02-07
133308   2018-02-07
133309   2018-02-07
133310   2018-02-07
133313   2018-02-07
133314   2018-02-07
133315   2018-02-07
133316   2017-05-24
133317   2017-05-24
133318   2017-05-24
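Note that `dropna()` keeps the original index labels, which is why 133311 and 133312 are absent from the output above. The sketch below, on made-up values, shows that behavior, how to count the dropped entries, and how `reset_index(drop=True)` could renumber the result if a contiguous index were wanted (this analysis does not need it).

```python
import pandas as pd

# Illustrative values only: two missing dates between two valid ones
dates = pd.to_datetime(pd.Series(["2018-02-07", None, None, "2017-05-24"]))

kept = dates.dropna()                     # NaT entries removed, index labels preserved
dropped = len(dates) - len(kept)          # how many entries were discarded
renumbered = kept.reset_index(drop=True)  # optional: contiguous 0..n-1 index

print(list(kept.index))
print(dropped)
print(list(renumbered.index))
```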


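Now we can use `value_counts()` and `head()` to find the five dates on which the most items were added to the MOMA collection. `value_counts()` sorts by count in descending order, so the first five rows are the top five dates.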

In [6]:
acquired.value_counts().head()

1964-10-06    11220
2008-10-08     5423
1968-03-06     4992
2005-05-10     2606
1974-01-10     1668
Name: DateAcquired, dtype: int64

Based on this analysis, we conclude that MOMA saw its collection grow the most on **October 6, 1964**, when it added **11,220** items to the collection.
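The top date and its count can also be read off programmatically instead of from the printed table. This is a minimal sketch on made-up dates (the `toy_dates` Series is an assumption for illustration); on the real data the same calls would be applied to `acquired`.

```python
import pandas as pd

# Illustrative acquisition dates, not taken from the dataset
toy_dates = pd.to_datetime(pd.Series([
    "1964-10-06", "1964-10-06", "1964-10-06",
    "2008-10-08", "2008-10-08",
    "1968-03-06",
]))

counts = toy_dates.value_counts()  # one row per distinct date, sorted by count (descending)
top_day = counts.idxmax()          # the date with the highest count
top_count = counts.max()           # how many items share that date

print(top_day.date(), top_count)
```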