# Minneapolis Institute of Art (MIA) data exploration

The Minneapolis Institute of Art has published its collection data (both on and off exhibit) publicly on GitHub. Below I examine the data and build a data pipeline to upsert future information into my final table. That final table will be used to visualize the data and answer some questions that I come up with along the way.

In [1]:
import pandas as pd
import json
import glob
import re
import datetime
from datetime import date
import numpy as np

In [2]:
path1 = "collection-main\\departments\\1.json"
path2 = "collection-main\\departments\\2.json"
dept1 = pd.read_json(path1)
dept2 = pd.read_json(path2)

In [3]:
print(dept2.describe())
print(dept2.info())

            id       artworks
count  41524.0   41524.000000
mean       2.0   73288.526154
std        0.0   33144.521075
min        2.0       0.000000
25%        2.0   51615.750000
50%        2.0   69288.500000
75%        2.0   83592.250000
max        2.0  142432.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41524 entries, 0 to 41523
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   name      41524 non-null  object
 1   id        41524 non-null  int64 
 2   artworks  41524 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 973.3+ KB
None


In [4]:
print(dept1.head())
print(dept2.head())

                                     name  id  artworks
0  Chinese, South and Southeast Asian Art   1        66
1  Chinese, South and Southeast Asian Art   1        67
2  Chinese, South and Southeast Asian Art   1        68
3  Chinese, South and Southeast Asian Art   1        69
4  Chinese, South and Southeast Asian Art   1        70
                  name  id  artworks
0  Prints and Drawings   2         0
1  Prints and Drawings   2         1
2  Prints and Drawings   2         2
3  Prints and Drawings   2         3
4  Prints and Drawings   2         4


So each department is assigned a number (the id in this table), and name is the name of the department. The only value that changes from row to row is the artworks column. I think those are art ids, and I'm thinking there are probably no duplicates, at least so far.

In [5]:
path_to_exhibitions = "collection-main\\exhibitions\\0\\10.json"
df_list = []
with open(path_to_exhibitions, 'r') as f:
    exhibit = json.load(f)  # probably going to need a try except here to make sure that the file is not empty. Some of them are.
    for art in exhibit['objects']:
        one_line = {'exhibition_id': exhibit['exhibition_id'], 'art_id': art, 'display_date': exhibit['display_date']}
        print(one_line)
        df_list.append(one_line)

df_test = pd.DataFrame(df_list)
print(df_test.head())

{'exhibition_id': 10, 'art_id': 292, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26, 2001'}
{'exhibition_id': 10, 'art_id': 3868, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26, 2001'}
{'exhibition_id': 10, 'art_id': 4036, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26, 2001'}
{'exhibition_id': 10, 'art_id': 4515, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26, 2001'}
{'exhibition_id': 10, 'art_id': 5130, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26, 2001'}
{'exhibition_id': 10, 'art_id': 8317, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26, 2001'}
{'exhibition_id': 10, 'art_id': 8363, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26, 2001'}
{'exhibition_id': 10, 'art_id': 12785, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26, 2001'}
{'exhibition_id': 10, 'art_id': 29034, 'display_date': 'Tuesday, September 19, 2000 - Friday, October 26

^^I want exhibition_id, display_date, and the objects list.
Alright, the above is a proof of concept for going through a ton of files and grabbing the 3 things I want. Then I want to transform the display date into the number of days the exhibition was on display. Later I can add up how long each piece has been on display and see which one was displayed the longest (and maybe also which piece has been in the most distinct exhibitions).
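
A minimal sketch of those two steps, assuming every `display_date` follows the `Weekday, Month D, YYYY - Weekday, Month D, YYYY` pattern seen in the output above, and that an empty exhibition file simply fails JSON parsing. The helper names `load_exhibition` and `display_days` are hypothetical and not part of the MIA data or this notebook.

```python
import json
from datetime import datetime

DATE_FORMAT = "%A, %B %d, %Y"  # e.g. "Tuesday, September 19, 2000"


def load_exhibition(path):
    """Return the parsed exhibition dict, or None if the file is missing, empty, or invalid."""
    try:
        with open(path, "r") as f:
            return json.load(f)
    except (json.JSONDecodeError, OSError):
        return None


def display_days(display_date):
    """Number of days between the two dates in a 'start - end' display_date string."""
    start_text, end_text = display_date.split(" - ")
    start = datetime.strptime(start_text.strip(), DATE_FORMAT)
    end = datetime.strptime(end_text.strip(), DATE_FORMAT)
    return (end - start).days


print(display_days("Tuesday, September 19, 2000 - Friday, October 26, 2001"))
```

Display dates in any other format would raise a `ValueError` here, which may be useful as a signal that a file needs manual review.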

In [6]:
object_path = "collection-main\\objects\\0\\0.json"
object0 = pd.read_json(object_path)
print(object0.info())
print(object0.describe())
print(object0.department.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 36 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   accession_number     4 non-null      float64
 1   art_champions_text   0 non-null      float64
 2   artist               4 non-null      object 
 3   catalogue_raissonne  0 non-null      float64
 4   classification       4 non-null      object 
 5   continent            4 non-null      object 
 6   country              4 non-null      object 
 7   creditline           4 non-null      object 
 8   culture              0 non-null      float64
 9   curator_approved     4 non-null      int64  
 10  dated                4 non-null      object 
 11  department           4 non-null      object 
 12  description          4 non-null      object 
 13  dimension            4 non-null      object 
 14  id                   4 non-null      object 
 15  image                4 non-null      object 

^^see_also refers to other artworks that are related to this artwork. In this case it is a set of four pieces that go together. Some objects have this field empty. I think each object json contains just one piece, even though read_json makes it look like more.
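
One way to check the one-piece-per-file guess is to read a file with `json.load` and collect the resulting dicts, which keeps `see_also` as a single list-valued field. That would also explain the four rows, if `pd.read_json` repeats the scalar fields once per `see_also` entry. A minimal sketch using a made-up record (the field values are placeholders, not real MIA data):

```python
import json

# Hypothetical stand-in for the contents of one object file.
raw = '{"id": "example/0", "title": "Example piece", "see_also": ["", "1", "2", "3"]}'

record = json.loads(raw)  # a single dict, i.e. one artwork
rows = [record]           # one list entry per file, however long see_also is

print(len(rows))                # number of artworks collected
print(len(record["see_also"]))  # number of related pieces listed
```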

It seems to me that the file names in the object folder are the object ids. I don't love that personally, but it is what we have to work with. So the artwork id in the department files connects to the names of the object files.
And the exhibition files include the artwork ids that are part of the exhibition.
So artwork ids are basically what we'll be working around for the most part.

So what might the data structure of our relational database look like? What are the relationships?

- One to Many relationship between Department and Object (dept_id and art_id)
- Many to Many relationship between Object and Exhibition (art_id and exhibit_id)

I want a department table, an object table, and an exhibit table.
The department table will have a primary key of dept_id.
The object table will have a primary key of art_id.
The exhibit table will have a primary key of exhibit_id. A cross-reference table between object and exhibit will contain art_id and exhibit_id pairs.
Four (4) tables in total.
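
The four tables could be sketched like this with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the table names, the `object_exhibit` name for the cross-reference table, and the non-key columns are placeholders, and the final tables in this project are written out as CSV files rather than to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE object (
    art_id  INTEGER PRIMARY KEY,
    dept_id INTEGER REFERENCES department(dept_id)
);
CREATE TABLE exhibit (
    exhibit_id   INTEGER PRIMARY KEY,
    display_date TEXT
);
-- cross-reference table resolving the many-to-many relationship
CREATE TABLE object_exhibit (
    art_id     INTEGER REFERENCES object(art_id),
    exhibit_id INTEGER REFERENCES exhibit(exhibit_id),
    PRIMARY KEY (art_id, exhibit_id)
);
""")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The composite primary key on `object_exhibit` keeps the same artwork from being listed twice for one exhibition.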

## Beginning Construction of tables
### Department Table
The department table will contain the department id and the department name

In [7]:
dept_df = []
for file in glob.glob("collection-main\\departments\\*.json"):
    dept_df.append(pd.read_json(file))
print(len(dept_df))



10


In [10]:
print(departments.head())

                                     name  id  artworks
0  Chinese, South and Southeast Asian Art   1        66
1  Chinese, South and Southeast Asian Art   1        67
2  Chinese, South and Southeast Asian Art   1        68
3  Chinese, South and Southeast Asian Art   1        69
4  Chinese, South and Southeast Asian Art   1        70


In [9]:
departments = pd.concat(dept_df)
print(departments.artworks.nunique())
department_condensed = departments.groupby(['name', 'id']).artworks.count().reset_index()
department_condensed.rename(columns={'artworks': 'num_artworks'}, inplace=True)
print(len(departments))
print(len(department_condensed))
print(department_condensed)
deduplicated = departments.drop_duplicates()  # drops only rows that are identical in every column
print(len(deduplicated))


97024
104475
10
                                      name  id  num_artworks
0           Art of Africa and the Americas   8          5880
1   Chinese, South and Southeast Asian Art   1          8989
2                         Contemporary Art  14           655
3  Decorative Arts, Textiles and Sculpture   4         15328
4                  Japanese and Korean Art  13          9629
5     Minnesota Artists Exhibition Program  10           126
6                                Paintings   6          1875
7                Photography and New Media   7         13001
8                      Prints and Drawings   2         41524
9                                 Textiles   5          7468
104475


In [11]:
duplicates = departments[departments.duplicated(subset=['artworks'], keep=False)]
print(len(duplicates))
print(duplicates.name.unique())
print(duplicates.head())

14902
['Decorative Arts, Textiles and Sculpture' 'Textiles']
                                       name  id  artworks
0   Decorative Arts, Textiles and Sculpture   4        49
4   Decorative Arts, Textiles and Sculpture   4        55
6   Decorative Arts, Textiles and Sculpture   4        64
7   Decorative Arts, Textiles and Sculpture   4        65
10  Decorative Arts, Textiles and Sculpture   4       109


In [13]:
subset = departments.drop_duplicates(subset=['artworks'])
print(len(subset))

97024


In [14]:
departments.to_csv('C:\\Users\\henge\\PycharmProjects\\MIA\\final_tables\\departments.csv')

### Artwork Table
This table needs a lot of columns to cover all the fields that are present in the original json files.
And how exactly should the see_also field be handled? If we use pd.read_json, we get multiple rows per file, and if we drop duplicates we lose the see_also items. How important is see_also to this project? The pieces are already grouped via the artist name.

I'm going to go ahead and drop the see_also column completely, and also drop duplicates by using accession_number as the subset.
This is also going to end up being a gigantic table, since there are SO MANY artworks.

In [17]:
# I'm going to start with folder 0 of the artwork. (I know there is so much more, but I can run all of that
# through my Python script later. It'll be a good test of my data pipeline, I think.)
artwork_df = []
for file in glob.glob("collection-main\\objects\\0\\*.json"):
    artwork_df.append(pd.read_json(file))
print(len(artwork_df))

906


In [18]:
artworks = pd.concat(artwork_df).drop_duplicates(subset=['accession_number'])
print(artworks.info())
print(artworks.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 900 entries, 0 to 0
Data columns (total 36 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   accession_number     900 non-null    object 
 1   art_champions_text   0 non-null      float64
 2   artist               900 non-null    object 
 3   catalogue_raissonne  10 non-null     object 
 4   classification       900 non-null    object 
 5   continent            889 non-null    object 
 6   country              888 non-null    object 
 7   creditline           900 non-null    object 
 8   culture              153 non-null    object 
 9   curator_approved     900 non-null    int64  
 10  dated                900 non-null    object 
 11  department           900 non-null    object 
 12  description          900 non-null    object 
 13  dimension            883 non-null    object 
 14  id                   900 non-null    object 
 15  image                900 non-null    object 
 1

I want info on the following columns: id, accession_number, artist, classification/object type, continent, country, creditline, dated (need to transform), department/dept_id, dimension (need to transform), medium (?), room (need to transform - make another column with on display= 1 or 0), style

In [19]:
artwork_slim = artworks.drop(columns=['art_champions_text', 'catalogue_raissonne', 'culture', 'description', 'image', 'image_copyright', 'image_height', 'image_width', 'inscription', 'life_date', 'markings', 'nationality', 'portfolio', 'provenance', 'restricted', 'rights_type', 'role', 'see_also', 'signed', 'text', 'title', 'object_name'])
print(artwork_slim.info())

<class 'pandas.core.frame.DataFrame'>
Index: 900 entries, 0 to 0
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   accession_number  900 non-null    object
 1   artist            900 non-null    object
 2   classification    900 non-null    object
 3   continent         889 non-null    object
 4   country           888 non-null    object
 5   creditline        900 non-null    object
 6   curator_approved  900 non-null    int64 
 7   dated             900 non-null    object
 8   department        900 non-null    object
 9   dimension         883 non-null    object
 10  id                900 non-null    object
 11  medium            900 non-null    object
 12  room              900 non-null    object
 13  style             898 non-null    object
dtypes: int64(1), object(13)
memory usage: 105.5+ KB
None


In [20]:
# want to change the id now into a number instead of an object, need to trim a lot off.
# Also want to change the dated, dimension, and room to make them more usable to me.

In [21]:
artwork_slim['id'] = artwork_slim.id.apply(lambda x: x[(x.rindex('/')+1):])
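
The comment in cell 20 asks for a number, while the slice above still leaves a string. A small sketch of the extra conversion, assuming every id is a string ending in `/<number>` (the example value and the helper name `numeric_id` are hypothetical):

```python
def numeric_id(raw_id):
    """Return the integer that follows the final '/' of an id string."""
    return int(raw_id.rsplit("/", 1)[-1])


print(numeric_id("http://example.org/objects/731"))
```

It could be applied with `artwork_slim['id'].apply(numeric_id)`, which would also make the id column easier to join against the integer artworks column in the department files.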

In [22]:
# Add column to easily see if artwork is currently on display or not (0=not displayed, 1=displayed)
artwork_slim['display'] = artwork_slim.room.apply(lambda x: 0 if x=='Not on View' else 1)

For below: I want to drop the entries where the dimension is NaN. These entries will be stored in another table for review by the museum staff.

In [23]:
no_dimensions = artwork_slim[artwork_slim['dimension'].isnull()]
print(no_dimensions.info())
print(no_dimensions.head())

<class 'pandas.core.frame.DataFrame'>
Index: 17 entries, 0 to 0
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   accession_number  17 non-null     object
 1   artist            17 non-null     object
 2   classification    17 non-null     object
 3   continent         13 non-null     object
 4   country           13 non-null     object
 5   creditline        17 non-null     object
 6   curator_approved  17 non-null     int64 
 7   dated             17 non-null     object
 8   department        17 non-null     object
 9   dimension         0 non-null      object
 10  id                17 non-null     object
 11  medium            17 non-null     object
 12  room              17 non-null     object
 13  style             17 non-null     object
 14  display           17 non-null     int64 
dtypes: int64(2), object(13)
memory usage: 2.1+ KB
None
  accession_number                                             ar

In [24]:
# dimension is more complicated than I originally thought. I'd like to split it out into height, width, depth,
# but there are a lot of formats that we need to work with.
artwork_slim.dropna(subset='dimension', inplace=True)  # remove the entries with NaN in the dimension column
print(artwork_slim.info())
# If the dimension contains no digits at all, replace it with '0 in'.
# If it has digits but no 'in' unit, append 'in' so that the trimming step that follows can find it.
artwork_slim['dimension'] = artwork_slim.dimension.apply(lambda x: '0 in' if not re.search('[0-9]', x) else (x + 'in' if not re.search('in', x) else x))


<class 'pandas.core.frame.DataFrame'>
Index: 883 entries, 0 to 0
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   accession_number  883 non-null    object
 1   artist            883 non-null    object
 2   classification    883 non-null    object
 3   continent         876 non-null    object
 4   country           875 non-null    object
 5   creditline        883 non-null    object
 6   curator_approved  883 non-null    int64 
 7   dated             883 non-null    object
 8   department        883 non-null    object
 9   dimension         883 non-null    object
 10  id                883 non-null    object
 11  medium            883 non-null    object
 12  room              883 non-null    object
 13  style             881 non-null    object
 14  display           883 non-null    int64 
dtypes: int64(2), object(13)
memory usage: 110.4+ KB
None


In [25]:
# expand dimensions into 3 columns
# trim dimensions to only include up to the inches
artwork_slim['dimension'] = artwork_slim.dimension.apply(lambda x: x[:x.index('in')])
artwork_slim['dimension'] = artwork_slim.dimension.apply(lambda z: z.replace('×', 'x'))
expanded_dimensions = artwork_slim['dimension'].str.split('x', expand=True)

In [26]:
expanded_dimensions.rename({0: 'height', 1: 'width', 2: 'depth'}, axis='columns', inplace=True)
print(expanded_dimensions.head())

    height     width depth     3
0  68-5/8    25-1/8   None  None
0  70 5/8    24 7/8   None  None
0      27    37 1/2     6   None
0   3 3/8    1 5/16   None  None
0   5 1/4         1   None  None


In [27]:
expanded_dimensions.replace({None: '0'}, inplace=True)
# Change the dimension strings into numbers. Again, this has to deal with multiple possible formats.
def make_inches_num(in_string):
    # keep only digits, hyphens, slashes, and whitespace
    just_nums = re.sub(r'[^0-9\-/\s]*', '', in_string)
    string_parts = just_nums.strip().replace('-', ' ').split(' ')
    decimal = 0.0
    if len(string_parts) > 3:
        # add message here about wrong format (warning)
        return float('nan')  # NaN is my flag to grab entries that I think have messed up size information
    for part in string_parts:
        if re.search(r'[^0-9/]', part) or part == '':
            # add message here about measurement being in wrong format (warning)
            return float('nan')
        elif '/' in part:
            # fractional part such as '5/8'
            numerator, denominator = part.split('/')
            addition = float(numerator) / int(denominator)
        else:
            addition = float(part)
        decimal += addition
    return decimal

expanded_dimensions['height'] = expanded_dimensions.height.apply(make_inches_num)
expanded_dimensions['width'] = expanded_dimensions.width.apply(make_inches_num)
expanded_dimensions['depth'] = expanded_dimensions.depth.apply(make_inches_num)


In [28]:
expanded_dimensions.drop(labels=3, axis=1, inplace=True)
#wrong_dimensions = expanded_dimensions[expanded_dimensions]

In [29]:
wrong_dimension_format = expanded_dimensions[expanded_dimensions.isna().any(axis=1)]
artwork_thick = pd.concat([artwork_slim, expanded_dimensions], axis=1)
print(artwork_thick.head())

  accession_number                                             artist  \
0             10.1  Artist: Frederick G. Smith; Artist: Formerly a...   
0             10.2  Artist: Frederick G. Smith; Artist: Formerly a...   
0           16.496                                                      
0            16.51                                                      
0            16.52                                                      

       classification continent  country  \
0            Drawings    Europe  England   
0            Drawings    Europe  England   
0   Sculpture; Models    Africa    Egypt   
0           Sculpture    Africa    Egypt   
0           Sculpture    Africa    Egypt   

                                          creditline  curator_approved  \
0  Gift of Mrs. C. J. Martin, in memory of Charle...                 0   
0  Gift of Mrs. C. J. Martin, in memory of Charle...                 0   
0                     The William Hood Dunwoody Fund                 0   


In [30]:
wrong_dimensions = artwork_thick[artwork_thick[['height', 'width', 'depth']].isna().any(axis=1)]
print(wrong_dimensions)

  accession_number                artist classification continent  \
0            46.12  Artist: Aelbert Cuyp      Paintings    Europe   
0       50.46.3a,b                            Metalwork      Asia   

       country                      creditline  curator_approved  \
0  Netherlands     The John R. Van Derlip Fund                 0   
0        China  Bequest of Alfred F. Pillsbury                 0   

           dated    department                                dimension   id  \
0           1649  European Art           26 1/2 x 22 1/4 x 5/16 to 3/8   731   
0  1300-1201 BCE     Asian Art  14 5/16 x 11 3/16 x 9 1/8 (Diam: 8 7/8   972   

         medium         room                  style  display   height  \
0  Oil on panel         G312           17th century        1  26.5000   
0        Bronze  Not on View  13th-12th century BCE        0  14.3125   

     width  depth  
0  22.2500    NaN  
0  11.1875    NaN  


In [31]:
print(artwork_thick.info())
artwork_thick.dropna(axis=0, subset=['height', 'width', 'depth'], inplace=True)
print(artwork_thick.info())

<class 'pandas.core.frame.DataFrame'>
Index: 883 entries, 0 to 0
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   accession_number  883 non-null    object 
 1   artist            883 non-null    object 
 2   classification    883 non-null    object 
 3   continent         876 non-null    object 
 4   country           875 non-null    object 
 5   creditline        883 non-null    object 
 6   curator_approved  883 non-null    int64  
 7   dated             883 non-null    object 
 8   department        883 non-null    object 
 9   dimension         883 non-null    object 
 10  id                883 non-null    object 
 11  medium            883 non-null    object 
 12  room              883 non-null    object 
 13  style             881 non-null    object 
 14  display           883 non-null    int64  
 15  height            883 non-null    float64
 16  width             883 non-null    float64
 17  dept

In [32]:
print(artwork_thick['dated'].head(50))

0                             c.1888-89
0                            c. 1888-89
0                 22nd-18th century BCE
0                          1320-656 BCE
0                    c. 664 BCE - 30 CE
0                           2nd century
0                    c. 664 BCE - 30 CE
0                          1320-656 BCE
0                           664-525 BCE
0                          1070-712 BCE
0                               c. 1915
0                          c. 1460-1485
0                          19th century
0                               1475–85
0                     17th-20th century
0                     12th-13th century
0            mid 10th-late 13th century
0                          19th century
0                          16th century
0    Fourth quarter of the 16th century
0                          16th century
0                          16th century
0                          16th century
0                          18th century
0                                  1898


What I want to do with the dated column is make a column that holds the date as a number (negative for BCE), and I think another column that puts it in a time range (need to figure out appropriate time ranges). I will first make the columns with dates.

Regarding the formatting of the above, it seems everything that is BCE can be represented as a negative number.

- Check if BCE exists (also check BC in case dates were entered before that change)
- Check for CE (and AD for same reason as above)
- Check for 'century'. If it is present, the century number is multiplied by 100 after adjusting one end of the range: for CE ranges subtract 1 from the starting century, and for BCE ranges add 1 to the (negative) ending century.
(ex. 22nd-18th century BCE --> start: -22\*100 = -2200; end: -18 + 1 = -17 --> -17\*100 = -1700. So the final range is -2200 to -1700, which is what the conversion code below returns.)
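
A minimal sketch of the century arithmetic on its own, written to match what `deconstruct_dated` in the next cell returns for century ranges (for example `[-2200, -1700]` for '22nd-18th century BCE'). `century_range` is a hypothetical helper, and it only covers ranges where both ends share the same era.

```python
def century_range(first, last, bce=False):
    """Convert a century range such as '17th-20th century' into [start, end] years.

    For CE the starting century is reduced by one before multiplying by 100;
    for BCE the ending century is moved one closer to zero.
    """
    if bce:
        return [-first * 100, (-last + 1) * 100]
    return [(first - 1) * 100, last * 100]


print(century_range(22, 18, bce=True))  # 22nd-18th century BCE
print(century_range(17, 20))            # 17th-20th century
```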

In [33]:
def era_convert(*args, century):
    eras = []
    for item in args:
        if 'BCE' in item or 'BC' in item:
            eras.append(-1)
        elif 'CE' in item or 'AD' in item:
            eras.append(1)
        else:
            eras.append(0)
    subs = [0,0]
    if century != 1:
        if eras == [1,1]:
            subs = [-1, 0]
        elif eras == [-1,-1]:
            subs = [0, 1]
    else:
        subs = [0,0]
    return eras, subs
    
# I think the best way forward is to make a function that can then be passed to apply
def deconstruct_dated(indate):
    str_date = str(indate)
    dates = []  # begin splitting the current string
    multiply = 1
    if not re.search(r'[0-9]', str_date) or str_date=='':
        return [float('Nan'), float('Nan')]
    else:
        if re.search('century', str_date, re.I) is not None:  # converting century to an actual number
            multiply = 100
        str_date = str_date.replace('–', '-')
        
        if '-' in str_date:
            dates = str_date.split('-')
            times = re.findall(r'BCE|CE|BC|AD', str_date)
        else:
            dates.append(str_date)
            times = re.findall(r'BCE|CE|BC|AD', str_date)
        if len(times) == 0:  # account for no specification of BCE or CE
            times = ['CE', 'CE']
        
        if len(times) != len(dates):  # accounts for entries where there is only one specification of a BCE/CE
            times.append(times[0])
     
        times, subs = era_convert(*times, century=multiply)
        

    # now let's go through and edit these a bit more
    # sub default is [0,1] if century present
        today = date.today()
        #print(type(today.year))
        d_converted = []
        for t, d, s in zip(times, dates, subs):
            d_clean = int(re.sub(r'[^0-9]*', '', d))
            # put a check here for ridiculous numbers
            if (d_clean*t) > today.year:
                print(d_clean)
                d_clean = int(d_clean/10000)
            d_converted.append(((d_clean*t)+s)*multiply)
    
        if len(d_converted) == 1:
            d_converted.append(d_converted[0])
        
        if d_converted[0] > d_converted[1]:
            d_converted[1] = d_converted[1] + (int(d_converted[0]/100)*100)
        
        return d_converted
 
print(deconstruct_dated('41st-30th century BCE'))

[-4100, -2900]


In [34]:
expanded_date = pd.DataFrame()
# each entry of holder is a [start, end] pair of years
holder = artwork_thick['dated'].apply(deconstruct_dated)

16881688
17421743
15651611
16061611
16101611
16171611
16121611
15651611
15651611
15651611


In [35]:
print(holder.head(50))

0      [1888, 1889]
0      [1888, 1889]
0    [-2200, -1700]
0     [-1320, -656]
0        [-664, 30]
0        [200, 200]
0        [-664, 30]
0     [-1320, -656]
0      [-664, -525]
0     [-1070, -712]
0      [1915, 1915]
0      [1460, 1485]
0      [1900, 1900]
0      [1475, 1485]
0      [1600, 2000]
0      [1100, 1300]
0       [900, 1300]
0      [1900, 1900]
0      [1600, 1600]
0      [1600, 1600]
0      [1600, 1600]
0      [1600, 1600]
0      [1600, 1600]
0      [1800, 1800]
0      [1898, 1898]
0      [1900, 1900]
0      [1830, 1830]
0        [571, 571]
0      [1900, 1900]
0      [1900, 1900]
0      [1840, 1840]
0      [1450, 1460]
0      [1800, 2000]
0      [1950, 1950]
0      [1890, 1890]
0      [2000, 2000]
0      [2000, 2000]
0      [2000, 2000]
0      [2000, 2000]
0      [2000, 2000]
0      [1800, 2000]
0      [1897, 1897]
0      [1800, 2000]
0      [1899, 1899]
0      [1900, 1900]
0      [1900, 1900]
0      [1800, 2000]
0      [1905, 1905]
0      [1700, 1700]
0      [1150, 1150]


In [36]:
expanded_date['start'] = holder.apply(lambda x: x[0])
expanded_date['end'] = holder.apply(lambda x: x[1])
#expanded_date.astype({'start': int, 'end': int})

In [37]:
#expanded_date = expanded_date.reset_index()
print(expanded_date.dtypes)

start    float64
end      float64
dtype: object


In [38]:

artwork_complete = pd.concat([artwork_thick, expanded_date], axis=1)
artwork_complete.drop(columns=['dated', 'department', 'dimension', 'room'], inplace=True)
print(artwork_complete.head())


  accession_number                                             artist  \
0             10.1  Artist: Frederick G. Smith; Artist: Formerly a...   
0             10.2  Artist: Frederick G. Smith; Artist: Formerly a...   
0           16.496                                                      
0            16.51                                                      
0            16.52                                                      

       classification continent  country  \
0            Drawings    Europe  England   
0            Drawings    Europe  England   
0   Sculpture; Models    Africa    Egypt   
0           Sculpture    Africa    Egypt   
0           Sculpture    Africa    Egypt   

                                          creditline  curator_approved   id  \
0  Gift of Mrs. C. J. Martin, in memory of Charle...                 0    0   
0  Gift of Mrs. C. J. Martin, in memory of Charle...                 0    1   
0                     The William Hood Dunwoody Fund       

Alright, that was a little messy. I want to set up the ranges for each time period. I need to see the highest and lowest start and end dates

In [39]:
print(artwork_complete.describe())

       curator_approved     display        height       width       depth  \
count        881.000000  881.000000    881.000000  881.000000  881.000000   
mean           0.044268    0.315551     40.963181   19.070162    3.229711   
std            0.205806    0.464999    526.795793   39.245955   18.837082   
min            0.000000    0.000000      0.000000    0.000000    0.000000   
25%            0.000000    0.000000      4.125000    2.000000    0.000000   
50%            0.000000    0.000000      9.500000    7.000000    0.000000   
75%            0.000000    1.000000     29.500000   20.500000    1.187500   
max            1.000000    1.000000  15625.000000  525.000000  525.000000   

             start          end  
count   878.000000   878.000000  
mean    970.268793  1081.422551  
std    1209.003902  1077.015898  
min   -4000.000000 -3000.000000  
25%       0.000000   425.000000  
50%    1662.000000  1700.000000  
75%    1830.000000  1857.500000  
max    2000.000000  2000.000000  


The minimum is -4000 and the maximum is 2000 (for this dataset; keep in mind we don't have contemporary art yet).
From https://www.historyskills.com/historical-knowledge/chronology/:

- stone age = 2.5 mil - 3000 BCE
- bronze age = 3000 - 1200 BCE
- iron age = 1200 - 800 BCE
- classical age = 800 BCE - 476 CE
- middle ages = 476 - 1450
- modern age = 1450 - present

I will be using the 'start' date to organize these items
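
The age boundaries listed above can also be written as a lookup table. The following is a sketch of that alternative using the standard library's `bisect` module. The names `BOUNDS`, `AGES`, and `age_for` are hypothetical, and each boundary year is treated as belonging to the earlier age.

```python
from bisect import bisect_left

# Last year (inclusive) of each age except the final one; negative values are BCE.
BOUNDS = [-3000, -1200, -800, 476, 1450]
AGES = ['stone age', 'bronze age', 'iron age',
        'classical age', 'middle ages', 'modern age']


def age_for(year):
    """Return the name of the age that a start year falls into."""
    return AGES[bisect_left(BOUNDS, year)]


print(age_for(-2200))
print(age_for(1888))
```

Keeping the thresholds in one list makes it easier to adjust the ranges later without touching the branching logic.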

In [40]:
def ages_decode(number):
    age = ''
    if number <= -3000:
        age = 'stone age'
    elif number <= -1200:
        age = 'bronze age'
    elif number <= -800:
        age = 'iron age'
    elif number <= 476:
        age = 'classical age'
    elif number <= 1450:
        age = 'middle ages'
    else:
        age = 'modern age'
    return age

missing_dates = artwork_complete[artwork_complete[['start', 'end']].isna().any(axis=1)]
artwork_complete.dropna(axis=0, subset=['start', 'end'], inplace=True)
artwork_complete['age'] = artwork_complete.start.apply(ages_decode)

In [41]:
print(artwork_complete.head())
print(artwork_complete.info())

  accession_number                                             artist  \
0             10.1  Artist: Frederick G. Smith; Artist: Formerly a...   
0             10.2  Artist: Frederick G. Smith; Artist: Formerly a...   
0           16.496                                                      
0            16.51                                                      
0            16.52                                                      

       classification continent  country  \
0            Drawings    Europe  England   
0            Drawings    Europe  England   
0   Sculpture; Models    Africa    Egypt   
0           Sculpture    Africa    Egypt   
0           Sculpture    Africa    Egypt   

                                          creditline  curator_approved   id  \
0  Gift of Mrs. C. J. Martin, in memory of Charle...                 0    0   
0  Gift of Mrs. C. J. Martin, in memory of Charle...                 0    1   
0                     The William Hood Dunwoody Fund       

In [42]:
continents = artwork_complete.groupby('continent').accession_number.count()

In [43]:
print(continents)

continent
Africa            45
Asia             376
Europe           206
North America    225
Oceania            5
South America     15
Name: accession_number, dtype: int64


In [44]:
missing_continents = artwork_complete[artwork_complete['continent'].isnull()]
print(len(missing_continents))
artwork_complete['continent'] = artwork_complete.continent.fillna('Unknown')
artwork_complete['country'] = artwork_complete.country.fillna('Unknown')

6
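
These six rows are the gap between the `groupby` totals above (872) and the 878 rows in the table: `groupby` leaves out rows whose key is NaN by default. A small sketch with made-up rows (the `toy` frame is hypothetical) shows two ways to keep them visible without filling first:

```python
import numpy as np
import pandas as pd

# Made-up rows; only the 'continent' column matters here.
toy = pd.DataFrame({
    'accession_number': ['a.1', 'a.2', 'a.3', 'a.4'],
    'continent': ['Asia', 'Asia', np.nan, 'Europe'],
})

# Default groupby silently skips the NaN key.
default_counts = toy.groupby('continent').accession_number.count()

# dropna=False keeps a NaN group, so the counts add up to len(toy).
counts_with_missing = toy.groupby('continent', dropna=False).accession_number.count()

# value_counts offers the same switch.
value_counts_with_missing = toy['continent'].value_counts(dropna=False)

print(default_counts.sum(), counts_with_missing.sum(), len(toy))
```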


In [48]:
print(artwork_complete.info())
print(artwork_complete.id.nunique())
artwork_complete = artwork_complete.reset_index(drop=True)  # the old index is 0 on every row, so discard it
print(artwork_complete.head())

<class 'pandas.core.frame.DataFrame'>
Index: 878 entries, 0 to 0
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   accession_number  878 non-null    object 
 1   artist            878 non-null    object 
 2   classification    878 non-null    object 
 3   continent         878 non-null    object 
 4   country           878 non-null    object 
 5   creditline        878 non-null    object 
 6   curator_approved  878 non-null    int64  
 7   id                878 non-null    object 
 8   medium            878 non-null    object 
 9   style             878 non-null    object 
 10  display           878 non-null    int64  
 11  height            878 non-null    float64
 12  width             878 non-null    float64
 13  depth             878 non-null    float64
 14  start             878 non-null    float64
 15  end               878 non-null    float64
 16  age               878 non-null    object 
dtypes: f

In [49]:
artwork_complete.to_csv('C:\\Users\\henge\\PycharmProjects\\MIA\\final_tables\\artworks.csv')

## Moving onto the Exhibitions information

I have my proof of concept from the top. I need to extend it so it works across many files. For now I'm only going to do one of the folders; the rest can be processed once I turn this into a much more polished Python script.

In [50]:
df_list = []
for f in glob.glob("collection-main\\exhibitions\\0\\*.json"):
    try:
        with open(f, 'r') as file:
            info = json.load(file)  # some of the files are empty, which makes json.load raise
    except Exception as error:
        print(error)
        print("file is empty")  # TODO: include the file name and log this to another file instead
        continue

    # one row per (exhibition, artwork) pair
    for art in info['objects']:
        one_line = {'exhibition_id': info['exhibition_id'], 'art_id': art, 'display_date': info['display_date']}
        df_list.append(one_line)

print(len(df_list))
exhibits = pd.DataFrame(df_list)
print(exhibits.head())  #gorgeous actually
print(len(exhibits))

Expecting value: line 1 column 1 (char 0)
file is empty
... (the same two lines repeat once for every empty file)
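
The comments in the loop above ask for the file name in the message and for the message to go to a log. A minimal sketch of that idea follows. It assumes every non-empty file has the `exhibition_id`, `objects`, and `display_date` keys used above and that the files are UTF-8; `load_exhibit_rows`, the logger name, and the demo data are hypothetical.

```python
import json
import logging
import tempfile
from pathlib import Path

import pandas as pd

logger = logging.getLogger("mia_exhibits")  # hypothetical logger name


def load_exhibit_rows(folder):
    """Return (DataFrame of exhibition/art pairs, names of files that were skipped)."""
    rows, skipped = [], []
    for json_path in sorted(Path(folder).glob("*.json")):
        try:
            with open(json_path, "r", encoding="utf-8") as handle:
                info = json.load(handle)
        except json.JSONDecodeError as error:
            # Empty files land here with "Expecting value: line 1 column 1 (char 0)".
            logger.warning("skipping %s: %s", json_path.name, error)
            skipped.append(json_path.name)
            continue
        for art_id in info["objects"]:
            rows.append({
                "exhibition_id": info["exhibition_id"],
                "art_id": art_id,
                "display_date": info["display_date"],
            })
    columns = ["exhibition_id", "art_id", "display_date"]
    return pd.DataFrame(rows, columns=columns), skipped


# Demo on a temporary folder with one made-up exhibition and one empty file.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "1.json").write_text(json.dumps({
        "exhibition_id": 1,
        "objects": [101, 102],
        "display_date": "Saturday, January 1, 2000 - Monday, January 31, 2000",
    }), encoding="utf-8")
    Path(demo_dir, "2.json").write_text("", encoding="utf-8")
    demo_exhibits, demo_skipped = load_exhibit_rows(demo_dir)

print(demo_exhibits)
print(demo_skipped)
```

Catching only `json.JSONDecodeError` means a genuinely unexpected problem, such as a missing key, still stops the run instead of being reported as an empty file.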

In [51]:
print(exhibits.info())
print(exhibits.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12059 entries, 0 to 12058
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   exhibition_id  12059 non-null  int64 
 1   art_id         12059 non-null  int64 
 2   display_date   12059 non-null  object
dtypes: int64(2), object(1)
memory usage: 282.8+ KB
None
   exhibition_id  art_id                                       display_date
0             10     292  Tuesday, September 19, 2000 - Friday, October ...
1             10    3868  Tuesday, September 19, 2000 - Friday, October ...
2             10    4036  Tuesday, September 19, 2000 - Friday, October ...
3             10    4515  Tuesday, September 19, 2000 - Friday, October ...
4             10    5130  Tuesday, September 19, 2000 - Friday, October ...


In [52]:
exhibit_art_df = exhibits[['exhibition_id', 'art_id']]
# Chain drop() onto the de-duplicated frame; calling drop(..., inplace=True) on it afterwards triggers pandas' SettingWithCopyWarning
exhibit_df = exhibits.drop_duplicates(subset=['exhibition_id']).drop(columns='art_id')


In [55]:
print(exhibit_art_df.nunique())
# A DataFrame has no .unique(); count the distinct (exhibition_id, art_id) rows instead
print(len(exhibit_art_df.drop_duplicates()))

exhibition_id      707
art_id           11048
dtype: int64

In [None]:
print(exhibit_df.info())
print(exhibit_df.exhibition_id.nunique())
print(exhibit_df.head())

In [None]:
exhibit_art_df = exhibit_art_df.drop_duplicates()  # drop_duplicates returns a new frame, so keep the result
print(exhibit_art_df.info())

In [None]:
exhibit_art_df.to_csv('C:\\Users\\henge\\PycharmProjects\\MIA\\final_tables\\exhibit_art.csv')

OK, awesome! There isn't any missing data here, which makes me happy. We do still need to process the display_date column, though.

In [None]:
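def format_dates(date_text):
    # Parse one half of a display_date string; return NaT if it is empty or cannot be parsed.
    # (The parameter is not called `date` so it does not shadow datetime.date imported above.)
    date_text = date_text.replace(' to', '').strip()
    if date_text == '':
        return pd.NaT
    try:
        return pd.to_datetime(date_text, format='mixed')
    except (ValueError, TypeError, OverflowError):
        return pd.NaT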
            


In [None]:

expanded_show_dates = pd.DataFrame()
expanded_show_dates[['start', 'end']] = exhibit_df['display_date'].str.split(r'-| to', expand=True)

print(expanded_show_dates.info())
print(expanded_show_dates.head())
expanded_show_dates['end'] = expanded_show_dates['end'].fillna('')
print(expanded_show_dates.info())


In [None]:
expanded_show_dates['start_datetime'] = expanded_show_dates.start.apply(format_dates)
expanded_show_dates['end_datetime'] = expanded_show_dates.end.apply(format_dates)

# format_dates returns NaT for anything it cannot parse, so look for nulls
print(expanded_show_dates.info())
print(expanded_show_dates[expanded_show_dates['start_datetime'].isnull()])
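
The split-then-parse steps above can also be done without a row-by-row `apply`. This is only a sketch under assumptions: the example strings are made up, the separators are assumed to be " - " or " to " surrounded by whitespace, and `format="mixed"` needs pandas 2.0 or later (the notebook already relies on it).

```python
import pandas as pd

# Made-up display_date strings covering the shapes handled above:
# a " - " separator, a " to " separator, a single date, and an empty string.
display_dates = pd.Series([
    "Saturday, January 1, 2000 - Monday, January 31, 2000",
    "Saturday, January 1, 2000 to Monday, January 31, 2000",
    "Saturday, January 1, 2000",
    "",
])

# n=1 splits on the first separator only, so there are never more than two columns.
parts = display_dates.str.split(r"\s+-\s+|\s+to\s+", n=1, expand=True, regex=True)
parts.columns = ["start", "end"]

# errors="coerce" turns empty, missing, or unparseable text into NaT.
parsed = pd.DataFrame({
    "start_datetime": pd.to_datetime(parts["start"], format="mixed", errors="coerce"),
    "end_datetime": pd.to_datetime(parts["end"], format="mixed", errors="coerce"),
})
print(parsed)
```

Requiring whitespace around the hyphen keeps a date that itself contains a hyphen, such as an ISO-style date, from being split in the wrong place.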

In [None]:
final_exhibits = pd.concat([exhibit_df, expanded_show_dates], axis=1)
print(final_exhibits.head())

In [None]:
# Exhibition length as a Timedelta; rows with a missing start or end come out as NaT
final_exhibits['days'] = final_exhibits['end_datetime'] - final_exhibits['start_datetime']

In [None]:
print(len(final_exhibits))

In [None]:
#final_exhibits['days'] = exhibits['days']
incorrect_format_dates = final_exhibits[final_exhibits['days'].isnull()]
final_exhibits.dropna(subset=['days'], axis=0, how='any', inplace=True)
print(final_exhibits.info())

In [None]:
print(final_exhibits.describe())
print(final_exhibits.sort_values('days', ascending=False))

In [None]:
#final_exhibits.drop(columns=['display_date', 'start', 'end'], inplace=True)
final_exhibits['days'] = final_exhibits['days'].dt.days
print(final_exhibits.head())

In [None]:
print(final_exhibits.info())

In [None]:
final_exhibits.to_csv('C:\\Users\\henge\\PycharmProjects\\MIA\\final_tables\\exhibits.csv')