<a href="https://colab.research.google.com/github/pkaiser8/info-664-final/blob/main/Peter_Kaiser_Final_Cooper_Hewitt_Object_Randomizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Three (or less) Randomized Object Records from the Collection of the Cooper Hewitt, Smithsonian Design Museum

## Topic Overview

During this course, one of the lessons that inspired me the most was our week exploring the pandas library and its use in data analysis and visualization. I was also particularly inspired by [the article](https://www.vam.ac.uk/blog/museum-life/designing-for-serendipity-in-the-museum-surprise-encounters-with-objects-and-stories?srsltid=AfmBOoqqoX16ArOT_GsRIH56p799nD84dScbdqHVQvBzU6L6Adwf0Cb0) about researchers examining how to design for serendipity in the Victoria and Albert Museum’s collections and how we can incorporate chance into navigating and exploring massive institutional collections.

I wanted to explore a well-known public database of design objects from the Cooper Hewitt, Smithsonian Design Museum. The API section of their collection webpage shows a [spinning wheel](https://apidocs.cooperhewitt.org/explore-the-data/) that randomly picks from preselected values to generate an API query, which also inspired me. With the help of the pandas library, I wanted to replicate this wheel by writing a program that lets a user select their keys manually, then lets random chance choose the values for those keys and produce up to three random records that share those values. The user can rerun the code for new results repeatedly, and you rarely get the same set of records twice.

## Challenges

The first challenge I faced was the inconsistency of the values cataloged in the CSV file I found on Cooper Hewitt’s GitHub. The variability in data entry suggests that many museum catalogers have entered data using different approaches and standards over the years. I saw many similar but slightly different values within the same keys. By refining these cells, I could condense four or five similar-sounding values into a single one. I focused specifically on keys like “date,” “type,” and “medium.” I hoped these refinements would create more opportunities for my program to find records with matching values. However, given the size of this dataset, my brief refinements and transformations may have only scratched the surface.
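
The actual refinement steps are not shown in this notebook. As a rough sketch of the same idea in pandas, the example below condenses a few variant spellings into one value. The sample rows, the `type_replacements` mapping, and the `normalize_column` helper are all hypothetical and are not taken from the real CSV.

```python
import pandas as pd

# Hypothetical sample of inconsistent catalog entries (not from the real CSV).
sample_df = pd.DataFrame({
    "type": ["Teacup", "teacup ", "Tea cup", "Sidewall", "sidewall"],
})

# Hypothetical mapping from a variant spelling to one preferred value.
type_replacements = {"tea cup": "teacup"}

def normalize_column(series, replacements):
    """Trim whitespace, lowercase, then map known variants to one value."""
    cleaned = series.str.strip().str.lower()
    return cleaned.replace(replacements)

sample_df["type"] = normalize_column(sample_df["type"], type_replacements)

# Count how many records now share each cleaned value.
print(sample_df["type"].value_counts())
```

Fewer distinct values per key means larger groups, which gives the randomizer a better chance of finding several records that match.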

Another challenge was deciding how many keys from the DataFrame to factor into the program. Having the user choose three keys makes it less likely that enough records share all three values. With just two keys, there is a better chance the program will find three records to output.
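
The trade-off between two and three keys can be illustrated on a toy table. This is only a sketch with made-up rows, and `share_of_groups_with_at_least` is a hypothetical helper name. It measures what share of the groups hold at least three records.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "type": ["Teacup", "Teacup", "Teacup", "Teacup", "Sidewall", "Sidewall"],
    "decade": [1920, 1920, 1920, 1930, 1850, 1850],
    "medium": ["Porcelain", "Porcelain", "Stoneware", "Porcelain", "Paper", "Paper"],
})

def share_of_groups_with_at_least(df, keys, minimum=3):
    """Return the fraction of groups that contain at least `minimum` rows."""
    sizes = df.groupby(keys).size()
    return (sizes >= minimum).mean()

two_key_share = share_of_groups_with_at_least(toy_df, ["type", "decade"])
three_key_share = share_of_groups_with_at_least(toy_df, ["type", "decade", "medium"])

print(f"Two keys:   {two_key_share:.2f}")
print(f"Three keys: {three_key_share:.2f}")
```

In this toy table, adding the third key splits the only three-record group apart, so no group reaches three records anymore.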

Lastly, the number of key options was too vast, so I curated a list of keys that I knew would yield some of the best results from the data, because their values are often repeated and they do not contain too many unique values.
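
One possible way to shortlist such keys is to count the distinct values and the share of blank cells in every column. The sketch below uses made-up rows, and `summarize_key_candidates` is a hypothetical helper, not the method used to build the curated list.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "accession_number": ["a-1", "a-2", "a-3", "a-4"],
    "type": ["Teacup", "Teacup", "Sidewall", "Sidewall"],
    "medium": ["Porcelain", None, "Paper", "Stoneware"],
})

def summarize_key_candidates(df):
    """List each column with its number of unique values and share of blanks."""
    summary = pd.DataFrame({
        "unique_values": df.nunique(),      # distinct non-null values per column
        "missing_share": df.isna().mean(),  # fraction of blank cells per column
    })
    # Columns with few unique values come first: they repeat the most.
    return summary.sort_values("unique_values")

summary = summarize_key_candidates(toy_df)
print(summary)
```

Columns near the top of the summary repeat their values often, while a column like an accession number, where every value is unique, would make a poor grouping key.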

## The Program

### 1. Import libraries, establish file path for .CSV datasheet and request user input
In this first section, we import the required libraries (pandas and random) to run our code correctly. We then load my refined dataset, which is linked in the ReadMe file of the [GitHub repository](https://github.com/pkaiser8/info-664-final) that this program lives in. This gives us our loaded DataFrame.

Next, we ask the user to select and input two keys from a curated, filtered list of columns in the DataFrame. Then, we use the groupby() method to group the records by these selections. This feature lets the user steer their own means of random discovery from the data.




In [1]:
import pandas as pd
import random

# Establish file path for .CSV data sheet:
data_filepath = '/content/objects-refined.csv'

def load_data(data_filepath):
    """
    Loads data from a CSV file.

    Inputs:
        data_filepath: The file or URL path to the CSV file.

    Returns:
        The loaded DataFrame.
    """

    # Read the data from the .CSV defined in data_filepath.
    # Use low_memory=False to process entire file at once.
    objects_df = pd.read_csv(data_filepath, low_memory=False)

    return objects_df

def get_user_input(objects_df):
  """Prompts the user for input to establish key selection.

  Inputs:
      objects_df (DataFrame): The loaded DataFrame.

  Returns:
      user_input (list): A list of the two user selected keys,
      as typed by the user at the two input() prompts.
  """

  print(f'Welcome to the Cooper Hewitt, Smithsonian Design Museum collections object randomizer. \nPlease see below a curated list of keys used to define objects in the collection CSV file:\n')

  # Prints a filtered list of the CSV columns so the user can decide from my curated list of keys to input:
  column_list = objects_df.columns.to_list()

  curated_keys = ['date', 'decade', 'type', 'medium', 'woe:country_name', 'year_acquired']
  filtered_column_list = [column for column in column_list if column in curated_keys]

  print("Curated keys:\n" + "\n".join(f'{column}' for column in filtered_column_list))
  print()
  print(f'Select which keys you want to pair to randomly find values within them.\nWith those randomly selected values the program will find three (or less) records that share those common values.\n')

  # User inputted information for each desired key:
  user_selected_key_1 = input(f'Please enter one of the keys listed above. It is best to copy and paste everything between the quotes:\n')
  print()

  user_selected_key_2 = input(f'Please enter a second key:\n')
  print()

  print(f'You have selected "{user_selected_key_1}" and "{user_selected_key_2}" as your grouped keys.')

  # Create a list with these two user inputted key selections:
  user_input = [user_selected_key_1, user_selected_key_2]

  return user_input

def group_data(data_filepath, user_input):
  """Groups the loaded DataFrame by the two user selected key columns.

  Note: this function groups the objects_df DataFrame that was already
  loaded at the notebook level; it does not re-read the CSV file.

  Inputs:
    data_filepath (string): The file or URL path to the CSV file (not read here).
    user_input (list): The two key columns to group the data by.

  Returns:
    grouped_df (DataFrameGroupBy): The grouped DataFrame.
  """

  # Use the groupby() method to group the rows in the DataFrame
  # based on specific values in the two user defined key columns:
  grouped_df = objects_df.groupby(user_input)

  return grouped_df

# Run the functions to get user input for key selection

# Load the data:
objects_df = load_data(data_filepath)

# Get user input:
user_input = get_user_input(objects_df)

# Apply the grouping:
grouped_df = group_data(data_filepath, user_input)

Welcome to the Cooper Hewitt, Smithsonian Design Museum collections object randomizer. 
Please see below a curated list of keys used to define objects in the collection CSV file:

Curated keys:
date
decade
medium
type
woe:country_name
year_acquired

Select which keys you want to pair to randomly find values within them.
With those randomly selected values the program will find three (or less) records that share those common values.

Please enter one of the keys listed above. It is best to copy and paste everything between the quotes:
decade

Please enter a second key:
woe:country_name

You have selected "decade" and "woe:country_name" as your grouped keys.
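
The cell above uses whatever the user types as-is, so a typo or a stray quotation mark would only surface later as an error from groupby(). Below is a minimal sketch of one way to guard against that. The helper names `clean_key` and `prompt_for_key` are hypothetical and are not part of the notebook; the list of keys is the curated list from the cell above.

```python
# The curated keys from the cell above.
CURATED_KEYS = ['date', 'decade', 'type', 'medium', 'woe:country_name', 'year_acquired']

def clean_key(raw_text, allowed_keys=CURATED_KEYS):
    """Strip spaces and pasted quotes; return the key, or None if not allowed."""
    candidate = raw_text.strip().strip('"\'')
    return candidate if candidate in allowed_keys else None

def prompt_for_key(prompt, ask=input, allowed_keys=CURATED_KEYS):
    """Keep asking until the user enters one of the allowed keys.

    `ask` defaults to the built-in input(); passing another function
    makes the helper easy to try out without typing.
    """
    while True:
        key = clean_key(ask(prompt), allowed_keys)
        if key is not None:
            return key
        print(f"Sorry, that is not one of: {', '.join(allowed_keys)}")

# Try the cleaner on a few hypothetical entries:
print(clean_key(' "decade" '))
print(clean_key('colour'))
```

With helpers like these, the two `input()` calls in `get_user_input` could be swapped for `prompt_for_key(...)`, so only keys from the curated list ever reach the grouping step.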


### 2. Selection of random values from grouping and extracting up to three full records

Now, with the user-selected keys grouped, the program can pick a random pair of values from those keys. It then pulls at most three records (entire rows) from the group of records that share both values.

_I.e., if the randomly selected "type" is "Teacup" and the randomly selected "date" is "1925", the program will pull up to three records that are all Teacups made in 1925._

If one grouped value comes up as "nan" or a null entry, the printed text asks the user to rerun the cell or return to the previous cell to select two new keys. If the program finds only one or two records, it still returns them so the user can continue.
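
As an alternative to asking the user to rerun the cell, blank values could be filtered out before grouping. The sketch below is not the notebook's approach: `pick_random_records` is a hypothetical helper, the rows are made up, and it assumes that dropping every row with a blank in either key is acceptable.

```python
import random
import pandas as pd

def pick_random_records(df, keys, num_records=3, seed=None):
    """Pick a random group with no blank key values and sample up to num_records rows."""
    rng = random.Random(seed)

    # Drop rows where any of the chosen keys is blank, so a blank group is never picked.
    complete_rows = df.dropna(subset=keys)
    if complete_rows.empty:
        return None, complete_rows

    grouped = complete_rows.groupby(keys)
    group_value = rng.choice(list(grouped.groups.keys()))
    records = grouped.get_group(group_value)

    # Never ask sample() for more rows than the group actually holds.
    sample_size = min(num_records, len(records))
    return group_value, records.sample(n=sample_size, random_state=seed)

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "decade": [1920, 1920, 1920, 1920, None, 1930],
    "country": ["Japan", "Japan", "Japan", "Japan", "Chile", None],
    "id": [1, 2, 3, 4, 5, 6],
})

group_value, records = pick_random_records(toy_df, ["decade", "country"], seed=1)
print(group_value)
print(records)
```

Using `min()` for the sample size also folds the "three or fewer" branch into a single call to `sample()`.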

In [7]:
# This is the ideal number of records the function below should aim to return:
num_records = 3

def select_random_group_and_records(grouped_df, num_records=3):
    """Selects a random group and a specified number of random records from that group.

    Inputs:
        grouped_df: The grouped DataFrame.
        num_records (int): The number of records to select. Defaults to 3.

    Returns:
        tuple: (selected_group_value, random_records)
        selected_group_value: The key (pair of values) of the selected group.
        random_records (DataFrame): The randomly selected records. This is an
        empty DataFrame, and an error message is printed, if no records are
        found for the selected group.
    """

    # Creates a list of all group keys (pairs of values) found in grouped_df:
    group_key_value = list(grouped_df.groups.keys())

    # random.choice() picks one group key at random from the list above:
    selected_group_value = random.choice(group_key_value)

    try:
        # Using the pandas get_group() method, extracts the whole records
        # that belong to the randomly chosen selected_group_value:
        selected_group_records = grouped_df.get_group(selected_group_value)

    except KeyError:
        # Print an explainer text describing what went wrong if a KeyError occurs:
        print("Could not find records since one group value is blank, please run this cell again.\nYou may also reselect your two keys in the cell above and rerun both cells.\n")

        return selected_group_value, pd.DataFrame()

    # Checks whether the selected group holds at least
    # the desired number of records (num_records):
    if len(selected_group_records) >= num_records:
        # Randomly samples num_records (three by default) records from the group:
        random_records = selected_group_records.sample(n=num_records)

    else:
        # If fewer records are found, it will still return what was found:
        random_records = selected_group_records

    # Returns the selected grouped valued based on user inputted keys and the
    # random records that share those randomly selected values in those keys:
    return selected_group_value, random_records

# Call the functions directly to execute the code

grouped_df = group_data(data_filepath, user_input)

selected_group_value, random_records = select_random_group_and_records(grouped_df, num_records)

print("Randomly selected group values based on user selected keys:\n")
print(f'{user_input[0]} = {selected_group_value[0]}')
print(f'{user_input[1]} = {selected_group_value[1]}\n')
if not random_records.empty:
  print("Randomly Selected Records:")
  print()
  print(random_records)

Randomly selected group values based on user selected keys:

decade = 1970.0
woe:country_name = Chile

Randomly Selected Records:

       accession_number         creditline  date  decade  department_id  \
136608        1971-37-1  Gift of Inge Duxi  1970  1970.0       35347501   

                                              description  \
136608  Vertical hanging with a cross-like form in bla...   

                                      dimensions  dimensions_raw gallery_text  \
136608  H x W: 99 x 57 cm (39 in. x 22 7/16 in.)             NaN          NaN   

              id  ...     type     type_id  \
136608  18473111  ...  Hanging  35257233.0   

                                                      url videos woe:country  \
136608  https://collection.cooperhewitt.org/objects/18...    NaN  23424782.0   

       woe:country_id  woe:country_name year_acquired  year_end  year_start  
136608     23424782.0             Chile        1971.0       NaN         NaN  

[1 rows x 38 columns]

### 3. Data Extraction and HTML Table Generation and Display

In this final section, we take the variable random_records, which contains one to three complete records generated by the functions above, and select specific metadata columns to feed into an HTML table. The table is styled with inline CSS for the preferred look and feel.

The following functions format the rows of the table and check whether the 'Image' field contains an image link. If it does not, the row defaults to a Cooper Hewitt logo. The row function also wraps the image in an \<a> tag so the user may click to view a larger version of the image in a new browser tab.

The final result is a neatly displayed HTML table presenting the image and details of one to three randomly selected records based on the user input provided in the first section. The "HTML" class is imported from the IPython.display module, which lets the notebook render HTML built in the Python code.
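
One thing to watch for when building HTML by hand is that cell values are inserted as-is, so a character such as `<` or `&` in a title could be read as markup, and a blank value would show up as "nan". Below is a small sketch of one way to handle both cases, using `html.escape` from the Python standard library. The `make_cell` helper is hypothetical and is not part of the notebook.

```python
import html
import pandas as pd

def make_cell(value):
    """Build one <td> cell: blank for missing values, escaped text otherwise."""
    text = "" if pd.isna(value) else html.escape(str(value))
    return f"<td style='border: 1px solid #ddd; padding: 8px;'>{text}</td>"

# Hypothetical values for illustration:
print(make_cell("Cup & saucer"))
print(make_cell(float("nan")))
```

A helper like this could replace the f-string that writes each non-image cell in `create_html_table_rows`.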

In [None]:
# Import HTML and display from IPython.display
# Source: https://ipython.readthedocs.io/en/8.26.0/api/generated/IPython.display.html
from IPython.display import HTML, display

def extract_record_data(random_records):
  """Extracts desired metadata from the records and makes a list of dictionaries.
  Input:
      random_records: The 1-3 randomly selected records from the grouped DataFrame.

  Return:
      all_records_data: A list of dictionaries containing the desired metadata.
  """

  # Create a list of the desired metadata to be displayed in the HTML table:
  all_records_data = []

  # Loop over each record (row) and pick out its values by column position
  # to define each column in the HTML table display:
  for row in random_records.values:
    record_data = {
        'Image': row[20],
        'Title': row[24],
        'Date': row[2],
        'Medium': row[17],
        'Dimensions': row[6],
        'Type': row[28],
        'Country': row[34],
        'Accession Number': row[0]
        }
    all_records_data.append(record_data)
  return all_records_data

def create_table_CSS_header():
  """Creates the HTML table header with some CSS styling.
  Input:
      None

  Return:
      table_css_header: CSS styling for the html_table.
  """

  table_css_header = """
  <table style='border-collapse: separate; border-spacing: 10px; border: 2px solid #ddd;'>
    <tr>
        <th style='border: 2px dotted #fff; padding: 8px;'>Image</th>
        <th style='border: 2px dotted #fff; padding: 8px;'>Title</th>
        <th style='border: 2px dotted #fff; padding: 8px;'>Date</th>
        <th style='border: 2px dotted #fff; padding: 8px;'>Medium</th>
        <th style='border: 2px dotted #fff; padding: 8px;'>Dimensions</th>
        <th style='border: 2px dotted #fff; padding: 8px;'>Type</th>
        <th style='border: 2px dotted #fff; padding: 8px;'>Country</th>
        <th style='border: 2px dotted #fff; padding: 8px;'>Accession Number</th>
    </tr>
  """
  return table_css_header

def create_html_table_rows(all_records_data):
  """Creates the HTML table rows with record data.
  Input:
      all_records_data: A list of dictionaries containing the desired metadata.

  Return:
      html_rows: HTML table rows with record data.
  """

  html_rows = ""
  # Builds one HTML table row for each record (1-3 records)
  # stored in the all_records_data list:
  for record in all_records_data:
    # Open the table row in the HTML:
    html_rows += "<tr>"

    # Check to see if the record contains a valid image link:
    image_link = record['Image'] if 'Image' in record and pd.notna(record['Image']) else 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Cooper_Hewitt%2C_Smithsonian_Design_Museum_logo.svg/320px-Cooper_Hewitt%2C_Smithsonian_Design_Museum_logo.svg.png'

    # Wrap the image in an <a> tag to create a link in the
    # image thumbnail to view the full size picture in a separate tab:
    html_rows += f"<td style='border: 1px solid #ddd; padding: 8px;'><a href='{image_link}' target='_blank'><img src='{image_link}' width='100'></a></td>"

    for key, value in record.items():
        # Skip 'Image' as it's already handled:
        if key != 'Image':
            html_rows += f"<td style='border: 1px solid #ddd; padding: 8px;'>{value}</td>"

    # Close the table row in the HTML:
    html_rows += "</tr>"
  return html_rows

def generate_html_table(random_records):
  """Generates the complete HTML table.
  Input:
      random_records: The 1-3 randomly selected records from the grouped DataFrame.

  Return:
      html_table: The complete HTML table.
  """

  all_records_data = extract_record_data(random_records)
  html_table = create_table_CSS_header()
  html_table += create_html_table_rows(all_records_data)
  html_table += "</table>"
  return html_table

print("Three or less Cooper Hewitt collection objects selected from the database with\nthe following shared group values under the user selected keys:\n")
print(f'{user_input[0]} = {selected_group_value[0]}')
print(f'{user_input[1]} = {selected_group_value[1]}')
print()

html_table = generate_html_table(random_records)
display(HTML(html_table))

Three or less Cooper Hewitt collection objects selected from the database with
the following shared group values under the user selected keys:

medium = Block-printed on machine-made paper
woe:country_name = United States



| Image | Title | Date | Medium | Dimensions | Type | Country | Accession Number |
|---|---|---|---|---|---|---|---|
|  | Sidewall (possibly USA), ca. 1860 | Ca. 1860 | Block-printed on machine-made paper | Overall: 55 x 48 cm (21 5/8 x 18 7/8 in.) | Sidewall | United States | 1998-75-157 |
|  | Sidewall (United States or England), 1835–45 | 1835–45 | Block-printed on machine-made paper | 104.5 x 63.5 cm (41 1/8 x 25 in.) | Sidewall | United States | 1976-46-34 |
|  | Sidewall (possibly USA), 1850–70 | 1850–70 | Block-printed on machine-made paper | a) Overall: 39.5 x 21 cm (15 9/16 x 8 1/4 in.) b) 24.5 x 29.5 cm | Sidewall | United States | 1998-75-161-a,b |


## Conclusion

This program represents a refined and fun version of my minimum viable product. I achieved the desired result early but kept refining and updating the code with streamlined features and more documentation. I wanted users to control what they see by letting them select the keys to search by. By trying different options, the user can discover which combination of keys reveals the best results from the Cooper Hewitt object database.

If I keep working on this program, I want the user to be able to discover additional objects collocated above and below the randomly selected objects in the dataset, mimicking the serendipity of discovering assets in a physical archive.
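
As a starting point for that idea, the sketch below returns a record together with the rows directly above and below it. The `get_neighbors` helper and the rows are hypothetical, and the sketch assumes every row label in the DataFrame index is unique.

```python
import pandas as pd

def get_neighbors(df, row_label, window=1):
    """Return the row with this index label plus `window` rows above and below it."""
    # Convert the index label into a row position (assumes unique labels).
    position = df.index.get_loc(row_label)
    start = max(position - window, 0)
    stop = min(position + window + 1, len(df))
    return df.iloc[start:stop]

# Made-up rows for illustration only.
toy_df = pd.DataFrame(
    {"title": ["Bowl", "Teacup", "Saucer", "Vase"]},
    index=[10, 20, 30, 40],
)

print(get_neighbors(toy_df, 30))
```

Because the slice is clipped at both ends, a record at the very top or bottom of the dataset simply returns fewer neighbors.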

Additionally, I would have liked to add a portion to the code that lets users save the results they want to a CSV file, in the hope that they will encounter those objects if they ever visit the Cooper Hewitt in person at a later date.
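
A minimal sketch of how such a save feature might work is shown below. The `save_favorites` helper and the file name `saved_objects.csv` are hypothetical; the two sample rows reuse accession numbers from the output shown earlier. Each call appends to the file, and the column header is written only the first time.

```python
import os
import tempfile
import pandas as pd

def save_favorites(records, csv_path="saved_objects.csv"):
    """Append the given records to a CSV file, writing the header only once."""
    file_exists = os.path.exists(csv_path)
    records.to_csv(csv_path, mode="a", header=not file_exists, index=False)

# Two sample rows for illustration only.
favorites_df = pd.DataFrame({
    "accession_number": ["1998-75-157", "1976-46-34"],
    "type": ["Sidewall", "Sidewall"],
})

# Save the rows in two separate calls to a temporary folder, then read them back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "saved_objects.csv")
    save_favorites(favorites_df.iloc[:1], path)
    save_favorites(favorites_df.iloc[1:], path)
    saved = pd.read_csv(path)

print(saved)
```

In the notebook itself, the call would be something like `save_favorites(random_records)` after a run the user wants to keep.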