# Minneapolis Institute of Art (MIA) data exploration

The Minneapolis Institute of Art has published its collection data (for works both on and off exhibit) publicly on GitHub. Below I examine the data and build a data pipeline that can upsert future information into my final table. My final table will be used to visualize the data and answer some questions that I come up with along the way.

In [223]:
import pandas as pd
import json
import glob

In [224]:
path1 = "collection-main\\departments\\1.json"
path2 = "collection-main\\departments\\2.json"
dept1 = pd.read_json(path1)
dept2 = pd.read_json(path2)

In [225]:
print(dept2.describe())
print(dept2.info())

            id       artworks
count  41524.0   41524.000000
mean       2.0   73288.526154
std        0.0   33144.521075
min        2.0       0.000000
25%        2.0   51615.750000
50%        2.0   69288.500000
75%        2.0   83592.250000
max        2.0  142432.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41524 entries, 0 to 41523
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   name      41524 non-null  object
 1   id        41524 non-null  int64 
 2   artworks  41524 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 973.3+ KB
None


In [226]:
print(dept1.head())
print(dept2.head())

                                     name  id  artworks
0  Chinese, South and Southeast Asian Art   1        66
1  Chinese, South and Southeast Asian Art   1        67
2  Chinese, South and Southeast Asian Art   1        68
3  Chinese, South and Southeast Asian Art   1        69
4  Chinese, South and Southeast Asian Art   1        70
                  name  id  artworks
0  Prints and Drawings   2         0
1  Prints and Drawings   2         1
2  Prints and Drawings   2         2
3  Prints and Drawings   2         3
4  Prints and Drawings   2         4


Each department is assigned a number, which is the id in this table, and name is the name of the department. The only column that changes within a file is artworks. I think those values are artwork ids, and so far there appear to be no duplicates, though I haven't verified that yet.
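
The "no duplicates" guess can be checked directly. Here is a minimal sketch on a toy frame; `dept_sample` is a made-up stand-in for one department file, and the repeated id is planted on purpose so the check has something to find.

```python
import pandas as pd

# Toy stand-in for one department file (hypothetical values, not real data).
dept_sample = pd.DataFrame({
    "name": ["Prints and Drawings"] * 5,
    "id": [2] * 5,
    "artworks": [0, 1, 2, 3, 3],  # the repeated 3 is deliberate
})

# is_unique is False as soon as any artwork id appears more than once.
has_duplicates = not dept_sample["artworks"].is_unique

# duplicated() flags the second and later occurrences of each repeated id.
duplicate_ids = dept_sample.loc[dept_sample["artworks"].duplicated(), "artworks"].tolist()
print(has_duplicates, duplicate_ids)
```

Running the same two expressions on `dept1` or `dept2` would settle the question for the real files.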

In [227]:
path_to_exhibitions = "collection-main\\exhibitions\\0\\10.json"
with open(path_to_exhibitions, 'r') as f:
  exhibit = json.load(f)

print(exhibit)

{'exhibition_id': 10, 'exhibition_department': 'Decorative Arts, Textiles & Sculpture', 'exhibition_title': 'Japonisme', 'exhibition_description': None, 'begin': 2000, 'end': 2001, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26, 2001', 'public_info': 0, 'objects': [292, 3868, 4036, 4515, 5130, 8317, 8363, 12785, 29034, 40607, 40981, 40982], 'venues': []}


In [228]:
object_path = "collection-main\\objects\\0\\0.json"
object0 = pd.read_json(object_path)
print(object0.info())
print(object0.describe())
print(object0.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 36 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   accession_number     4 non-null      float64
 1   art_champions_text   0 non-null      float64
 2   artist               4 non-null      object 
 3   catalogue_raissonne  0 non-null      float64
 4   classification       4 non-null      object 
 5   continent            4 non-null      object 
 6   country              4 non-null      object 
 7   creditline           4 non-null      object 
 8   culture              0 non-null      float64
 9   curator_approved     4 non-null      int64  
 10  dated                4 non-null      object 
 11  department           4 non-null      object 
 12  description          4 non-null      object 
 13  dimension            4 non-null      object 
 14  id                   4 non-null      object 
 15  image                4 non-null      object 

The see_also field refers to other artworks that are related to this artwork. In this case it is a set of four pieces that go together; some objects have this field empty. I think each object JSON file contains just one piece, even though read_json makes it look like more, probably because the other fields are repeated once for every entry in the see_also list.
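
The row multiplication can be reproduced with a toy record. This sketch uses the `pd.DataFrame` constructor rather than `pd.read_json`, on the assumption that both treat a mix of scalar fields and one list-valued field the same way; the `record` dict is hypothetical and heavily trimmed.

```python
import pandas as pd

# Hypothetical, heavily trimmed object record; the real files have 36 fields.
record = {"title": "Example piece", "see_also": ["a", "b", "c", "d"]}

# Scalar fields are repeated once per entry of the list-valued field.
broadcast = pd.DataFrame(record)

# Wrapping the record in a list keeps one row per object, with the list intact.
one_row = pd.DataFrame([record])

print(len(broadcast), len(one_row))
```

Loading each file with `json.load` and building the frame from a list of records is one way to keep exactly one row per artwork without relying on dropping duplicates afterwards.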

It seems that the file names in the objects folder are the object ids. I don't love that personally, but it is what we have to work with. So the artwork ids in the department files connect to the names of the object files.
The exhibition files, in turn, list the ids of the artworks that are part of each exhibition.
So artwork ids are the key we'll be working around for the most part.

So what might the structure of our relational database look like? What are the relationships?

- One to Many relationship between Department and Object (dept_id and art_id)
- Many to Many relationship between Object and Exhibition (art_id and exhibit_id)
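
The many-to-many side can be made concrete by flattening an exhibition record into one row per exhibition/artwork pair. This is only a sketch: the two fields are copied from the exhibition record printed earlier, and the column names `exhibit_id` and `art_id` are my own choice.

```python
import pandas as pd

# Fields copied from the exhibition record printed earlier (other keys omitted).
exhibit = {
    "exhibition_id": 10,
    "objects": [292, 3868, 4036, 4515, 5130, 8317, 8363, 12785, 29034, 40607, 40981, 40982],
}

# The scalar exhibition id is repeated for every artwork id in the list.
exhibit_objects = pd.DataFrame({
    "exhibit_id": exhibit["exhibition_id"],
    "art_id": exhibit["objects"],
})
print(exhibit_objects.shape)
```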

I want a department table, an object table, and an exhibit table, plus a cross-reference table between object and exhibit. That makes four tables in total:

- The department table will have a primary key of dept_id.
- The object table will have a primary key of art_id.
- The exhibit table will have a primary key of exhibit_id.
- The cross-reference table will contain art_id and exhibit_id pairs.
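
As a rough sketch of that layout, here are the four tables in an in-memory SQLite database. The table names, the non-key columns, and the choice of SQLite are all assumptions for illustration; the real object table would carry many more columns.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE object (
    art_id  INTEGER PRIMARY KEY,
    dept_id INTEGER REFERENCES department(dept_id)
);
CREATE TABLE exhibit (
    exhibit_id INTEGER PRIMARY KEY,
    title      TEXT
);
-- Cross-reference table: one row per (artwork, exhibition) pair.
CREATE TABLE object_exhibit (
    art_id     INTEGER REFERENCES object(art_id),
    exhibit_id INTEGER REFERENCES exhibit(exhibit_id),
    PRIMARY KEY (art_id, exhibit_id)
);
""")

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
```

The composite primary key on the cross-reference table means inserting the same pair twice is rejected, which should help when the pipeline upserts later.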

## Beginning Construction of tables
### Department Table
The department table will contain the department id and the department name, along with a count of the artworks in each department.

In [229]:
dept_df = []
for file in glob.glob("collection-main\\departments\\*.json"):
    dept_df.append(pd.read_json(file))
print(len(dept_df))


10


In [230]:
departments = pd.concat(dept_df).groupby(['name', 'id']).artworks.count().reset_index()
departments.rename(columns={'artworks': 'num_artworks'}, inplace=True)
print(departments.head())
print(departments.info())


                                      name  id  num_artworks
0           Art of Africa and the Americas   8          5880
1   Chinese, South and Southeast Asian Art   1          8989
2                         Contemporary Art  14           655
3  Decorative Arts, Textiles and Sculpture   4         15328
4                  Japanese and Korean Art  13          9629
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          10 non-null     object
 1   id            10 non-null     int64 
 2   num_artworks  10 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes
None


### Artwork Table
This table needs a lot of columns to cover all the fields that are present in the original JSON files.
How exactly should the see_also field be handled? If we use pd.read_json we get multiple rows per object, and if we drop duplicates we lose the see_also items. How important is see_also to this project? Related pieces are already grouped via the artist name.

I'm going to drop the see_also column completely, and also drop duplicates using accession_number as the subset.
This is also going to end up being a gigantic table, since there are so many artworks.

In [231]:
# Start with folder 0 of the artwork. There is much more, but I can run the rest through my
# Python script later; it should be a good test of my data pipeline.
artwork_df = []
for file in glob.glob("collection-main\\objects\\0\\*.json"):
    artwork_df.append(pd.read_json(file))
print(len(artwork_df))

906


In [232]:
artworks = pd.concat(artwork_df).drop_duplicates(subset=['accession_number'])
print(artworks.info())
print(artworks.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 900 entries, 0 to 0
Data columns (total 36 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   accession_number     900 non-null    object 
 1   art_champions_text   0 non-null      float64
 2   artist               900 non-null    object 
 3   catalogue_raissonne  10 non-null     object 
 4   classification       900 non-null    object 
 5   continent            889 non-null    object 
 6   country              888 non-null    object 
 7   creditline           900 non-null    object 
 8   culture              153 non-null    object 
 9   curator_approved     900 non-null    int64  
 10  dated                900 non-null    object 
 11  department           900 non-null    object 
 12  description          900 non-null    object 
 13  dimension            883 non-null    object 
 14  id                   900 non-null    object 
 15  image                900 non-null    object 

I want to keep the following columns: id, accession_number, artist, classification/object type, continent, country, creditline, dated (needs transforming), department/dept_id, dimension (needs transforming), medium (?), room (needs transforming: add another column where on display = 1 or 0), and style.

In [233]:
artwork_slim = artworks.drop(columns=['art_champions_text', 'catalogue_raissonne', 'culture', 'description', 'image', 'image_copyright', 'image_height', 'image_width', 'inscription', 'life_date', 'markings', 'nationality', 'portfolio', 'provenance', 'restricted', 'rights_type', 'role', 'see_also', 'signed', 'text', 'title'])
print(artwork_slim.info())

<class 'pandas.core.frame.DataFrame'>
Index: 900 entries, 0 to 0
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   accession_number  900 non-null    object
 1   artist            900 non-null    object
 2   classification    900 non-null    object
 3   continent         889 non-null    object
 4   country           888 non-null    object
 5   creditline        900 non-null    object
 6   curator_approved  900 non-null    int64 
 7   dated             900 non-null    object
 8   department        900 non-null    object
 9   dimension         883 non-null    object
 10  id                900 non-null    object
 11  medium            900 non-null    object
 12  object_name       835 non-null    object
 13  room              900 non-null    object
 14  style             898 non-null    object
dtypes: int64(1), object(14)
memory usage: 112.5+ KB
None


In [234]:
# Next, change id from an object (string) into a number; a lot needs to be trimmed off first.
# Also transform dated, dimension, and room to make them more usable.

In [235]:
artwork_slim['id'] = artwork_slim.id.apply(lambda x: x[(x.rindex('/')+1):])
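
The line above trims the id but leaves it as a string. A possible vectorized version that also converts to an integer is sketched below. It assumes each id is a URL-like string ending in the numeric object id; the example URLs are made up for illustration.

```python
import pandas as pd

# Assumed id format: a URL-like string ending in the numeric object id.
ids = pd.Series(["http://example.org/objects/0", "http://example.org/objects/1234"])

# Split once from the right, keep the last piece, then cast to int.
numeric_ids = ids.str.rsplit("/", n=1).str[-1].astype(int)
print(numeric_ids.tolist())
```

If any id does not end in digits, `astype(int)` raises, which is a useful early warning when the rest of the folders are processed.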

In [236]:
# Add column to easily see if artwork is currently on display or not (0=not displayed, 1=displayed)
artwork_slim['display'] = artwork_slim.room.apply(lambda x: 0 if x=='Not on View' else 1)

In [241]:
# dimension is more complicated than I originally thought. I'd like to split it out into height, width, and depth,
# but there are a lot of formats to handle.
artwork_slim.fillna({'dimension': '0 in.'}, inplace=True)
dimensions = artwork_slim.dimension

In [242]:
contains_inches = dimensions[dimensions.apply(lambda x: ("in" in x))]

In [243]:
print(len(contains_inches))

898


In [244]:
no_inches = dimensions[dimensions.apply(lambda x: ('in' not in x))]
print(len(no_inches))
print(no_inches)

2
0    H.14.5 x W.7.6 x D.3.8
0                         f
Name: dimension, dtype: object
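
One of the two outliers uses a labelled format, which is easy to split into height, width, and depth. The sketch below handles only that `H.… x W.… x D.…` pattern; the function name is hypothetical, the units are not stated in the data, and the inch-based formats would need their own handling.

```python
import re

# Matches the labelled format seen above, e.g. "H.14.5 x W.7.6 x D.3.8".
LABELLED = re.compile(r"\b([HWD])\.\s*(\d+(?:\.\d+)?)")
KEYS = {"H": "height", "W": "width", "D": "depth"}

def parse_labelled_dimension(text):
    """Return height/width/depth as floats, or None where a part is missing."""
    parsed = {"height": None, "width": None, "depth": None}
    for label, value in LABELLED.findall(text):
        parsed[KEYS[label]] = float(value)
    return parsed

print(parse_labelled_dimension("H.14.5 x W.7.6 x D.3.8"))
print(parse_labelled_dimension("f"))
```

Strings that do not match, such as the stray `f`, come back with all three values as None, so they can be flagged for manual review rather than silently dropped.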
