# Initial research questions

* comparative analysis of gender representation in artwork creation between born-digital and analogue art collections
* potentially, additional details on medium, location…
* semantic analysis of the narratives about the artworks, i.e. which keywords are associated with different artwork types

### The story could be:
The internet was supposed to revolutionize things, so how did it do when we look at who makes art and who gets included in collections?


A simple way to plan your work is:

 * choose the research question
 * map the question to the pieces of information needed to answer it (e.g. periods, counts)
 * map the data to specific data types (categorical, numerical, ordinal)
 * choose the plot(s) that best help you visualise a pattern (e.g. a bar chart)
 * get your data in some form (e.g. SPARQL query results)
 * filter and manipulate your data (select the variables that matter, perform operations such as counting)
 * create a data structure that fits the plotting requirements (a table, a JSON, etc.), including the number of variables needed (e.g. one categorical and one numerical)
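
The last steps of this plan can be sketched on toy data. This is only an illustration: the `records` table, its values and the `n_artworks` column name are made up for the example and are not taken from the MoMA or Rhizome datasets.

```python
import pandas as pd

# Toy records standing in for query results (hypothetical values)
records = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Gender": ["F", "M", "M", "missing", "F"],
    "Collection": ["born digital", "born digital", "analogue",
                   "analogue", "analogue"],
})

# Select the variables that matter and count artworks per group.
# The result has two categorical columns and one numerical column,
# which is the shape a grouped bar chart expects.
counts = (
    records.groupby(["Collection", "Gender"])
    .size()
    .reset_index(name="n_artworks")
)
print(counts)
```

A table in this long format could then be passed to a plotting function such as `sns.barplot` with `x="Collection"`, `y="n_artworks"` and `hue="Gender"`.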


In [141]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re

In [184]:
#IMPORT DATASETS
artists = pd.read_pickle('MOMA_data/pickle/MoMAartists.pkl')
before80 = pd.read_pickle('MOMA_data/pickle/old_artworks.pkl')
after80 = pd.read_pickle('MOMA_data/pickle/new_artworks.pkl')
rhizome = pd.read_csv('Rhizome_data/rhizome_artworks_extra.csv', index_col=0)
rhizome = pd.DataFrame(rhizome)

In [156]:
len(before80), len(after80), len(rhizome)


(100692, 34154, 2270)

In [185]:
rhizome

Unnamed: 0,ID,URL,Title,Artist,dateAcquired,dateCreated,Nationality,Gender
0,879,https://artbase.rhizome.org/wiki/Q2423,ZUR FARBENLEHRE (THEORY OF COLOURS),Steven Jones,2007,2007,British,M
1,1020,https://artbase.rhizome.org/wiki/Q4089,Zones de Convergence,cicero,2005,2005,missing,missing
2,"243, 701",https://artbase.rhizome.org/wiki/Q1475,Zombie and Mummy,"Dragan Espenschied, Olia Lialina",2004,2002,"German, Russian","M, F"
3,312,https://artbase.rhizome.org/wiki/Q4374,"Zaira, City of Memories",Gokcen Erguven,2004,2004,Turkish,F
4,920,https://artbase.rhizome.org/wiki/Q3972,Z_G [zeitgeist gestalten],Tiago Borges,2008,2007,Angolan,M
...,...,...,...,...,...,...,...,...
2265,1075,https://artbase.rhizome.org/wiki/Q4358,1999,joan escofet,2001,2000,missing,missing
2266,771,https://artbase.rhizome.org/wiki/Q3761,1969,Rhea Myers,2004,2004,British,F
2267,859,https://artbase.rhizome.org/wiki/Q2283,1953,Skye Thorstenson,2003,2002,missing,M
2268,481,https://artbase.rhizome.org/wiki/Q2511,160,Katie Lips,2005,2005,British,F


### Further cleaning

To avoid mistakes when an artwork has multiple authors, I decided to split such artworks into separate rows, obtaining a new dataframe in which each row corresponds to a single artist.

In [163]:
def remake_df(df):
    """Split rows listing several comma-separated artists into one row per artist."""
    # rows whose ID cell lists more than one artist
    to_clean = df[df.ID.str.contains(',', regex=False, na=False)]

    new_rows = []
    for ind, row in to_clean.iterrows():
        # split the parallel comma-separated cells into lists
        ids = row.ID.split(', ')
        names = row.Artist.split(', ')
        nationalities = row.Nationality.split(', ')
        genders = row.Gender.split(', ')
        for idx, artist_id in enumerate(ids):
            # copy the row so each artist gets an independent record
            new_row = row.copy()
            # change the information in the cells
            new_row.ID = artist_id
            new_row.Artist = names[idx]
            new_row.Nationality = nationalities[idx]
            new_row.Gender = genders[idx]
            new_rows.append(new_row)

    # drop the multi-artist rows and add the per-artist rows
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    df = df.drop(to_clean.index)
    clean = pd.DataFrame(new_rows, columns=df.columns)
    return pd.concat([df, clean], ignore_index=True)
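
As an alternative to looping over rows, the same split can be sketched with `DataFrame.explode`. This is not the approach used in this notebook, only a possible shortcut: it assumes pandas 1.3 or later (multi-column `explode`) and that the lists in each row have matching lengths, otherwise `explode` raises a `ValueError`. The two rows are copied from the Rhizome table above; the name `toy` is made up for the example.

```python
import pandas as pd

# Two rows taken from the Rhizome table: one single-artist, one with two artists
toy = pd.DataFrame({
    "ID": ["879", "243, 701"],
    "Title": ["ZUR FARBENLEHRE (THEORY OF COLOURS)", "Zombie and Mummy"],
    "Artist": ["Steven Jones", "Dragan Espenschied, Olia Lialina"],
    "Nationality": ["British", "German, Russian"],
    "Gender": ["M", "M, F"],
})

list_cols = ["ID", "Artist", "Nationality", "Gender"]

# Turn each comma-separated cell into a list ...
split = toy.copy()
for col in list_cols:
    split[col] = split[col].str.split(", ")

# ... then expand the lists in parallel, one row per artist
per_artist = split.explode(list_cols, ignore_index=True)
print(per_artist)
```

Columns that are not exploded (here `Title`) are repeated on every new row, which matches what the loop-based function does.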

In [164]:
before80 =  remake_df(before80)



In [165]:
after80 = remake_df(after80)

In [171]:
cols = rhizome.columns
cols

Index(['Unnamed: 0', 'ID', 'URL', 'Title', 'Artist', 'dateAcquired',
       'dateCreated', 'Nationality', 'Gender'],
      dtype='object')

In [187]:

to_clean = rhizome[rhizome.ID.str.contains(',', regex= True, na=False)]
to_clean

Unnamed: 0,ID,URL,Title,Artist,dateAcquired,dateCreated,Nationality,Gender
2,"243, 701",https://artbase.rhizome.org/wiki/Q1475,Zombie and Mummy,"Dragan Espenschied, Olia Lialina",2004,2002,"German, Russian","M, F"
6,"1214, 1215",https://artbase.rhizome.org/wiki/Q2067,YouTube School for Social Politics,Red76,2011,2011,"American, American","M, M"
33,"702, 1244",https://artbase.rhizome.org/wiki/Q1258,wormholes to oli und ten,"Oliver Kauselmann, Thorsten Kloepfer",2002,2002,"German, missing","M, M"
49,"1186, 1245",https://artbase.rhizome.org/wiki/Q3551,Wish,Boredomresearch,2008,2007,"British, missing","F, missing"
63,"885, 557",https://artbase.rhizome.org/wiki/Q1233,Why + Wherefore,"Summer Guthery, Lumi Tan",2011,2007,"missing, American","F, F"
...,...,...,...,...,...,...,...,...
2216,"1260, 1261",https://artbase.rhizome.org/wiki/Q1380,1.1 Acre Flat Screen,eteam,2005,2005,"German, German","M, F"
2218,"1246, 1247",https://artbase.rhizome.org/wiki/Q3063,1 year performance video (aka samHsiehUpdate),MTAA,2005,2004,"American, American","M, M"
2228,"1270, 1271",https://artbase.rhizome.org/wiki/Q4335,%20wrong (RHIZOME--splash%20_%20JODI%20%_),JODI,2000,1998,"Dutch, Belgian","F, M"
2237,"30, 1202, 1206",https://artbase.rhizome.org/wiki/Q2916,[V]ote-auction,UBERMORGEN,2001,2000,"Italian, Austrian, Swiss/American","M, F, M"


When remaking the rhizome dataframe I discovered some issues in the dataset, such as duplicate IDs and missing artist names. A brief check for lists of different lengths showed that row 777 is problematic: it has three IDs, nationalities and genders but only two names, and their order gives no clear way to match them. I have temporarily changed the row, deleting:
- 926 from the IDs
- American from the nationalities
- missing from the genders
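
A check of this kind can be sketched as follows. It is a minimal illustration, not the exact analysis I ran: the first two rows are copied from the Rhizome table, while the third row and the names `toy`, `lengths` and `problems` are invented for the example. A single artist name with several IDs is accepted here, on the assumption that it denotes a collective.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": ["243, 701", "1214, 1215", "1, 2, 3"],
    "Artist": ["Dragan Espenschied, Olia Lialina", "Red76", "Artist A, Artist B"],
    "Nationality": ["German, Russian", "American, American", "x, y, z"],
    "Gender": ["M, F", "M, M", "M, F, missing"],
})

list_cols = ["ID", "Artist", "Nationality", "Gender"]

# Number of comma-separated items in each cell
lengths = toy[list_cols].apply(lambda col: col.str.split(", ").str.len())

n_ids = lengths["ID"]
# Nationalities and genders must match the number of IDs
same_meta = (lengths["Nationality"] == n_ids) & (lengths["Gender"] == n_ids)
# Names must match too, unless there is one name (assumed to be a collective)
names_ok = (lengths["Artist"] == n_ids) | (lengths["Artist"] == 1)

# Rows that cannot be split safely
problems = toy[~(same_meta & names_ok)]
print(problems)
```

Running such a check before the split shows which rows need manual correction, instead of finding them through an `IndexError` halfway through the loop.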

In [206]:
clean = pd.DataFrame(columns=cols)

for ind, row in to_clean.iterrows():
    # split the parallel comma-separated cells into lists
    ids = row.ID.split(', ')
    names = row.Artist.split(', ')
    # print rows whose number of IDs and names differ, for inspection
    if len(ids) != len(names):
        print(row)
    nationalities = row.Nationality.split(', ')
    genders = row.Gender.split(', ')

    for idx, artist_id in enumerate(ids):
        # copy the row so each artist gets an independent record
        new_row = row.copy()
        # change the information in the cells
        new_row.ID = artist_id
        if len(names) == 1:
            # a single name for several IDs is treated as a collective
            new_row.Artist = 'collective member'
        else:
            new_row.Artist = names[idx]

        new_row.Nationality = nationalities[idx]
        new_row.Gender = genders[idx]
        clean.loc[len(clean)] = new_row
        print('END')

# drop every multi-artist row (not only the last one) and add the per-artist rows
rhizome = rhizome.drop(to_clean.index)
new = pd.concat([rhizome, clean])



END
END
ID                                          1214, 1215
URL             https://artbase.rhizome.org/wiki/Q2067
Title               YouTube School for Social Politics
Artist                                           Red76
dateAcquired                                      2011
dateCreated                                       2011
Nationality                         American, American
Gender                                            M, M
Name: 6, dtype: object
END
END
END
END
ID                                          1186, 1245
URL             https://artbase.rhizome.org/wiki/Q3551
Title                                             Wish
Artist                                 Boredomresearch
dateAcquired                                      2008
dateCreated                                       2007
Nationality                           British, missing
Gender                                      F, missing
Name: 49, dtype: object
END
END
END
END
END
END
ID                       

IndexError: list index out of range

In [None]:
workingDf = pd.concat([before80, after80, rhizome], ignore_index=True)
workingDf

### Questions

1. Artist nationalities across the two databases

How many artists per country?


How many nationalities appear in the MoMA artists dataset and in the two artwork datasets?

In [137]:
# bar-plot, for plotting we need a new df with columns: 

#count nationalities for the sets
def count_nat(df):
    # distinct values in the Nationality column (including 'missing')
    return list(set(df['Nationality']))

In [139]:
natBefore80 = count_nat(before80)
natAfter80 = count_nat(after80)

len(natBefore80), len(natAfter80)

(94, 99)
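
For the "artists per country" question, one possible sketch counts each artist once rather than once per artwork. The rows below reuse artists, IDs and nationalities that appear in the MoMA table; the names `toy`, `artists_once`, `per_country` and `n_artists` are made up for the example, and treating `'missing'` as a non-nationality is an assumption.

```python
import pandas as pd

# One row per artwork; Bernard Tschumi appears twice
toy = pd.DataFrame({
    "ID": ["7470", "7056", "7056", "132939"],
    "Artist": ["Christian de Portzamparc", "Bernard Tschumi",
               "Bernard Tschumi", "Alejandro Kuropatwa"],
    "Nationality": ["French", "missing", "missing", "Argentine"],
})

# Keep one row per artist so prolific artists do not inflate the counts
artists_once = toy.drop_duplicates(subset="ID")

# One categorical and one numerical column, ready for a bar plot
per_country = (
    artists_once["Nationality"]
    .value_counts()
    .rename_axis("Nationality")
    .reset_index(name="n_artists")
)

# Number of distinct nationalities, not counting the 'missing' placeholder
known = artists_once.loc[artists_once["Nationality"] != "missing", "Nationality"]
n_nationalities = known.nunique()
print(per_country)
print(n_nationalities)
```

Whether `'missing'` should be excluded depends on the question: it is worth keeping when the amount of missing data is itself part of the story.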

Unnamed: 0,Title,Artist,ID,DateCreated,Medium,Department,DateAcquired,URL,ThumbnailURL,Nationality,Gender
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,1987,Paint and colored pencil on print,Architecture & Design,1995-01-17,http://www.moma.org/collection/works/3,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,French,M
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,1980,Photographic reproduction with colored synthet...,Architecture & Design,1995-01-17,http://www.moma.org/collection/works/5,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,missing,M
31,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,1980,Photographic reproduction with colored synthet...,Architecture & Design,1995-01-17,http://www.moma.org/collection/works/33,http://www.moma.org/media/W1siZiIsIjIwMCJdLFsi...,missing,M
35,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,1980,Photographic reproduction with colored synthet...,Architecture & Design,1995-01-17,http://www.moma.org/collection/works/38,http://www.moma.org/media/W1siZiIsIjI2NyJdLFsi...,missing,M
40,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,1980,Ink on tracing paper,Architecture & Design,1995-01-17,http://www.moma.org/collection/works/44,http://www.moma.org/media/W1siZiIsIjI5NiJdLFsi...,missing,M
...,...,...,...,...,...,...,...,...,...,...,...
138114,Cóctel (Cocktail),Alejandro Kuropatwa,132939,1996,Chromogenic color print,Photography,2020-10-07,missing,missing,Argentine,missing
138115,Cóctel (Cocktail),Alejandro Kuropatwa,132939,1996,Chromogenic color print,Photography,2020-10-07,missing,missing,Argentine,missing
138116,Cóctel (Cocktail),Alejandro Kuropatwa,132939,1996,Chromogenic color print,Photography,2020-10-07,missing,missing,Argentine,missing
138117,Cóctel (Cocktail),Alejandro Kuropatwa,132939,1996,Chromogenic color print,Photography,2020-10-07,missing,missing,Argentine,missing
