# The Museum of Modern Art (MoMA)
We're going to work with data about the art in the Museum of Modern Art (MoMA). MoMA, a museum in New York City, has one of the largest collections of modern art in the world.

MoMA publishes several data sets in its GitHub repository. We'll work with a version of the artwork data set that has been prepared for this mission: we tidied the data for teaching purposes and sampled it down from over 135,000 rows to a more manageable 17,000 rows.

MoMA doesn't provide a data dictionary for the data, but we have provided an explanation of each column for you:

- Title: The title of the artwork.
- Artist: The name of the artist who created the artwork.
- Nationality: The nationality of the artist.
- BeginDate: The year in which the artist was born.
- EndDate: The year in which the artist died.
- Gender: The gender of the artist.
- Date: The date that the artwork was created.
- Department: The department inside MoMA to which the artwork belongs.
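
Because each row is read as a plain list, a column is reached by its position. A minimal sketch, assuming the CSV columns appear in the order listed above (this agrees with the indexes used in this mission's code, such as `row[2]` for Nationality and `row[6]` for Date). The names `COLUMNS` and `col` are hypothetical helpers, and the sample row is made up:

```python
# Column order assumed from the list above; it agrees with the
# indexes used in the code cells (row[2], row[3], row[4], row[5], row[6]).
COLUMNS = ["Title", "Artist", "Nationality", "BeginDate",
           "EndDate", "Gender", "Date", "Department"]

def col(name):
    """Return the list index of a column name (hypothetical helper)."""
    return COLUMNS.index(name)

# A made-up row, only to illustrate the lookup
sample_row = ["Untitled", "Jane Doe", "(American)", "(1900)",
              "(1980)", "(Female)", "1950", "Example Department"]

print(sample_row[col("Nationality")])  # (American)
print(sample_row[col("Date")])         # 1950
```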

In [41]:
# import the reader function from the csv module
from csv import reader

# use the python built-in function open()
# to open the artworks.csv file
opened_file = open('artworks.csv')

# use csv.reader() to parse the data from
# the opened file
read_file = reader(opened_file)

# use list() to convert the read file
# into a list of lists format
moma = list(read_file)

# remove the first row of the data, which
# contains the column names
moma = moma[1:]
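
The cell above never closes the file. A sketch of the same steps using a `with` block, which closes the file automatically. The helper name `load_rows` is hypothetical, and `encoding="utf-8"` is an assumption about how the file was saved. Passing `newline=""` follows the `csv` module's recommendation for files handed to `reader`.

```python
from csv import reader

def load_rows(path):
    """Return (header, rows) for a CSV file (hypothetical helper)."""
    # encoding="utf-8" is an assumption about the file;
    # newline="" lets the csv module handle line endings itself
    with open(path, encoding="utf-8", newline="") as opened_file:
        rows = list(reader(opened_file))
    # the first row holds the column names
    return rows[0], rows[1:]
```

With this helper, the cell above could be written as `header, moma = load_rows('artworks.csv')`.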

In [42]:
# removing the parentheses from both the Nationality and Gender columns
for row in moma:
    nationality = row[2]
    nationality = nationality.replace("(", "")
    nationality = nationality.replace(")", "")
    row[2] = nationality
  
    gender = row[5]
    gender = gender.replace("(", "")
    gender = gender.replace(")", "")
    row[5] = gender
    
print(moma[300][2])
print(moma[400][2])
print(moma[500][2])
print("\n")
print(moma[300][5])
print(moma[400][5])
print(moma[500][5])    

American
Dutch
Swiss


Male
Female
Male


The *str.title()* method returns a copy of the string in title case: the first letter of each word is converted to uppercase and the remaining letters to lowercase.
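
A few illustrative calls (the input strings are made up, not taken from the data set). An empty string stays empty, which is why the `if not gender` check in the next cell still works after the conversion:

```python
# title() capitalizes the first letter of each word
print("male".title())                  # Male
print("AMERICAN".title())              # American
print("gender unknown/other".title())  # Gender Unknown/Other

# an empty string stays empty, so `not value` still detects it
print(repr("".title()))                # ''
```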

In [43]:

for row in moma:
    gender = row[5]
    
    # convert the gender to title case
    gender = gender.title()
    
    # if there is no gender, set
    # a descriptive value
    if not gender:
        gender = 'Gender Unknown/Other'
    
    row[5] = gender
    
    
    nationality = row[2]
    
    nationality = nationality.title()
    
    if not nationality:
        nationality = 'Nationality Unknown'
    
    row[2] = nationality
   


In [49]:
def clean_and_convert(date):
    # check that we don't have an empty string
    if date != "":
        # strip the parentheses, then convert
        # the remaining digits to an integer
        date = date.replace("(", "")
        date = date.replace(")", "")
        date = int(date)
    return date

for row in moma:
    birth_date = row[3]
    death_date = row[4]
    
    birth_date = clean_and_convert(birth_date)
    death_date = clean_and_convert(death_date)
    
    row[3] = birth_date
    row[4] = death_date
 

AttributeError: 'int' object has no attribute 'replace'

In [45]:
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]

bad_chars = ["(",")","c","C",".","s","'", " "]

def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char, "")
    return string

stripped_test_data = []

for d in test_data:
    date = strip_characters(d)
    stripped_test_data.append(date)

print(stripped_test_data)

['1912', '1929', '1913-1923', '1951', '1994', '1934', '1915', '1995', '1912', '1988', '2002', '1957-1959', '1955', '1970', '1990-1999']
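
As an alternative to looping over `bad_chars`, the same characters can be removed in one pass with `str.maketrans` and `str.translate`. This is a swapped-in technique rather than the approach used above, and `strip_characters_translate` is a hypothetical name:

```python
# Same characters as bad_chars, written as one string
BAD_CHARS = "()cC.s' "

# The third argument of str.maketrans lists characters to delete
DELETE_TABLE = str.maketrans("", "", BAD_CHARS)

def strip_characters_translate(string):
    """Remove every character in BAD_CHARS from the string."""
    return string.translate(DELETE_TABLE)

print(strip_characters_translate("c. 1970's"))     # 1970
print(strip_characters_translate("C. 1990-1999"))  # 1990-1999
print(strip_characters_translate("(1951)"))        # 1951
```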


In [46]:
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's", 
             "C. 1990-1999"]

bad_chars = ["(",")","c","C",".","s","'", " "]

def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char,"")
    return string

stripped_test_data = ['1912', '1929', '1913-1923',
                      '1951', '1994', '1934',
                      '1915', '1995', '1912',
                      '1988', '2002', '1957-1959',
                      '1955', '1970', '1990-1999']

def process_date(date):
    # a dash means the value is a range of years, e.g. "1913-1923"
    if "-" in date:
        # split the string into the years before and after the dash
        split_date = date.split("-")
        date_one = split_date[0]
        date_two = split_date[1]
        # convert both years to integers and average them
        date = (int(date_one) + int(date_two)) / 2
        date = round(date)
    else:
        # a single year: convert the value to an integer
        date = int(date)
    return date

        
processed_test_data = []
for d in stripped_test_data:
    date = process_date(d)
    processed_test_data.append(date)

for row in moma:
    date = row[6]
    date = strip_characters(date)   # remove any bad characters
    date = process_date(date)       # convert the date to an integer
    row[6] = date
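
A quick check of the range handling, reusing values from `stripped_test_data`. One detail worth knowing: Python 3's built-in `round()` rounds exact halves to the nearest even integer, so the midpoint 1994.5 of "1990-1999" becomes 1994 rather than 1995. The function is repeated here only so that the sketch runs on its own:

```python
def process_date(date):
    # same logic as the cell above, condensed
    if "-" in date:
        date_one, date_two = date.split("-")
        return round((int(date_one) + int(date_two)) / 2)
    return int(date)

print(process_date("1912"))       # 1912
print(process_date("1913-1923"))  # 1918
print(process_date("1957-1959"))  # 1958
print(process_date("1990-1999"))  # 1994 (1994.5 rounds to the even neighbor)
```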
    

In this mission, we'll build on the data cleaning we did with the Museum of Modern Art (MoMA) data set in the previous mission, and get into the fun part: analyzing the data!

These techniques will be extremely valuable as you continue to learn how to be a data expert. You'll not only use them whenever you analyze data, but also when you explore data before performing a more complex task, such as machine learning. We'll learn how to:

- Calculate how old the artist was when they created their artwork.
- Analyze and interpret the distribution of artist ages.
- Create functions which summarize our data.
- Print summaries in an easy-to-read way.

So that you don't have to re-clean the data, we have prepared a CSV file, artworks_clean.csv, that contains the results of all the data cleaning you performed. Even though we converted the numeric columns to integer types in the previous mission, the values became text again when we saved the results as a CSV.

Even though we don't have to clean the data again, we do have to convert these values to numeric types so we can analyze them. You may remember that some of the rows have missing values, so we'll need to handle those as well.

In [47]:
from csv import reader

# Read the `artworks_clean.csv` file
opened_file = open('artworks_clean.csv')
read_file = reader(opened_file)
moma = list(read_file)
moma = moma[1:]

# Convert the birthdate values to integer
for row in moma:
    birth_date = row[3]
    if birth_date != "":
        birth_date = int(birth_date)
    row[3] = birth_date
    
# Convert the death date values to integer
for row in moma:
    death_date = row[4]
    if death_date != "":
        death_date = int(death_date)
    row[4] = death_date

# Convert date values to integer
for row in moma:
    date = row[6]
    if date != "":
        date = int(date)
    row[6] = date
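
The three loops above follow the same pattern, so they could be folded into one helper. A minimal sketch: `to_int_or_empty` and `NUMERIC_COLUMNS` are hypothetical names, and the sample rows are made up for illustration:

```python
# Indexes of BeginDate, EndDate and Date, as used in the loops above
NUMERIC_COLUMNS = [3, 4, 6]

def to_int_or_empty(value):
    """Convert a non-empty string to int; leave "" untouched."""
    if value != "":
        return int(value)
    return value

# Made-up rows standing in for the `moma` list
sample = [
    ["Title A", "Artist A", "American", "1900", "1980",
     "Male", "1950", "Dept"],
    ["Title B", "Artist B", "Nationality Unknown", "", "",
     "Gender Unknown/Other", "1960", "Dept"],
]

for row in sample:
    for index in NUMERIC_COLUMNS:
        row[index] = to_int_or_empty(row[index])

print(sample[0][3], sample[0][4], sample[0][6])  # 1900 1980 1950
print(repr(sample[1][3]), sample[1][6])          # '' 1960
```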

We're going to work on calculating the ages at which artists created their pieces of art. We need to subtract the artist's birth year (BeginDate) from the year in which their artwork was created (Date).

While every row has a value for Date, there are some that are missing values for BeginDate. When we cleaned BeginDate, we encountered some missing values and left them as empty strings (""). We'll use a value of 0 for these cases, which we'll replace with something more meaningful later on.

There are a handful of cases where the artist's age (according to our data set) is very low, including some where the age is negative. We could investigate these specific cases one by one, but since we're looking for a summary, we'll take care of them by categorizing artists aged 20 or younger as "Unknown" as well. This has the handy effect of also categorizing the artists without birth years as "Unknown", since their placeholder age is 0.
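
The rule described above can be sketched as a single function. The name `age_or_unknown` is hypothetical and the input years are made up; the threshold mirrors the `age > 20` comparison used in this mission's code:

```python
def age_or_unknown(date, birth):
    """Return the artist's age at creation, or 'Unknown'."""
    # a missing birth year is stored as "", so treat it as age 0
    if isinstance(birth, int):
        age = date - birth
    else:
        age = 0
    # very low, zero, and negative ages are all grouped together
    if age > 20:
        return age
    return "Unknown"

print(age_or_unknown(1950, 1900))  # 50
print(age_or_unknown(1950, ""))    # Unknown
print(age_or_unknown(1950, 1960))  # Unknown
print(age_or_unknown(1950, 1930))  # Unknown
```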


In [48]:
ages = []   # an empty list to store each artist's age
for row in moma:
    date = row[6]
    birth = row[3]
    if isinstance(birth, int):
        age = date - birth
    else:
        age = 0
    ages.append(age)
    
final_ages = []  #to store the final age data.
for age in ages:
    if age > 20:
        final_age = age
    else:
        final_age = 'Unknown'
    final_ages.append(final_age)