# Notebook for Generating a Partially Synthetic Dataset for Roughness Detection

As I write this on June 27th, the University of Minnesota - Twin Cities' Summer Undergraduate Research Expo, a research symposium for undergraduates and select high schoolers, is a little over a month away. A central part of our device's design is a computer-vision approach to texture analysis: we take images of a surface, then use image feature extraction techniques (or, potentially in the future, a convolutional neural network if enough data is obtained) together with a traditional machine learning regression model to rate the roughness of the surface on a scale of 1-10.

However, obtaining this data is incredibly challenging. It often requires large surveys, which are time-consuming to distribute and also prone to error (likely just as much error as the technique used here). Throughout this journal we therefore explore using Gemini fine-tuning to create synthetically rated data, so that our computer-vision texture-analysis model can be trained with reasonable accuracy.

In [2]:
import pandas as pd
import json

In [3]:
with open("../data/RRS_Survey.json") as js:
    rrs_dict = json.load(js)
rrs_dict

{'list': [{'index': 0,
   'Bark.jpg': 1,
   'Wall(4).jpg': 2,
   'Leather chair.jpg': 3,
   'Skin.png': 4,
   'Skin(1).jpg': 5,
   'timestamp': '2025-05-20 05:04:34.483005',
   'name': '',
   'gender': 'Male'},
  {'index': 1,
   'Clay sculpture_.jpg': 4,
   'Whiteboard.jpg': 1,
   'Wooden sign.jpg': 2,
   'Rock.jpg': 5,
   'Wall(5).jpg': 3,
   'timestamp': '2025-05-20 05:07:05.432513',
   'name': 'Arianna Lam',
   'gender': 'Female'},
  {'index': 2,
   'Wall(5).jpg': 5,
   'Stuffed animal fabric.jpg': 1,
   'Wooden sign.jpg': 4,
   'Rope.jpg': 2,
   'Styrofoam_.jpg': 3,
   'timestamp': '2025-05-20 11:11:10.468182',
   'name': 'Victoria Wysocki ',
   'gender': 'Female'},
  {'index': 3,
   'Pavement_.jpg': 5,
   'Weave bag(1).jpg': 3,
   'Towel(1).jpg': 2,
   'Whiteboard.jpg': 1,
   'Wall(5).jpg': 4,
   'timestamp': '2025-05-20 11:30:23.112832',
   'name': 'Katelyn',
   'gender': 'Female'},
  {'index': 4,
   'Styrofoam_.jpg': 2,
   'Table(1).jpg': 4,
   'Sidewalk.jpg': 5,
   'Skin.png': 

In [4]:
pd.read_json("../data/RRS_Survey.json")

|  | list |
|---|---|
| 0 | {'index': 0, 'Bark.jpg': 1, 'Wall(4).jpg': 2, ... |
| 1 | {'index': 1, 'Clay sculpture_.jpg': 4, 'Whiteb... |
| 2 | {'index': 2, 'Wall(5).jpg': 5, 'Stuffed animal... |
| 3 | {'index': 3, 'Pavement_.jpg': 5, 'Weave bag(1)... |
| 4 | {'index': 4, 'Styrofoam_.jpg': 2, 'Table(1).jp... |
| ... | ... |
| 85 | {'index': 85, 'Wall(4).jpg': 3, 'Wood.jpg': 2,... |
| 86 | {'index': 86, 'Wooden sign.jpg': 3, 'Rock.jpg'... |
| 87 | {'index': 87, 'Bed headboard fabric_.jpg': 2, ... |
| 88 | {'index': 88, 'Leather chair.jpg': 3, 'Paper.j... |


Note that in the released version of this notebook and the surrounding software (published as a GitHub repository), the original survey data file will not be available. Only an anonymized version will be included, to prevent unnecessary leakage of personal information. The cells below show how this personal information is removed.

In [5]:
# Keep only the contents of the 'list' key of the JSON, which holds all of the survey responses
rrs_dict = rrs_dict['list']

In [6]:
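# Remove name, timestamp, and gender from every response

for item in rrs_dict:
    # pop with a default so a response that lacks a field does not raise KeyError
    item.pop('name', None)
    item.pop('timestamp', None)
    item.pop('gender', None)
rrs_dict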

[{'index': 0,
  'Bark.jpg': 1,
  'Wall(4).jpg': 2,
  'Leather chair.jpg': 3,
  'Skin.png': 4,
  'Skin(1).jpg': 5},
 {'index': 1,
  'Clay sculpture_.jpg': 4,
  'Whiteboard.jpg': 1,
  'Wooden sign.jpg': 2,
  'Rock.jpg': 5,
  'Wall(5).jpg': 3},
 {'index': 2,
  'Wall(5).jpg': 5,
  'Stuffed animal fabric.jpg': 1,
  'Wooden sign.jpg': 4,
  'Rope.jpg': 2,
  'Styrofoam_.jpg': 3},
 {'index': 3,
  'Pavement_.jpg': 5,
  'Weave bag(1).jpg': 3,
  'Towel(1).jpg': 2,
  'Whiteboard.jpg': 1,
  'Wall(5).jpg': 4},
 {'index': 4,
  'Styrofoam_.jpg': 2,
  'Table(1).jpg': 4,
  'Sidewalk.jpg': 5,
  'Skin.png': 1,
  'Pavement.jpg': 3},
 {'index': 5,
  'Clay sculpture_.jpg': 5,
  'Wall(5).jpg': 4,
  'Wall(1).jpg': 3,
  'Bedsheets_.jpg': 1,
  'Fabric.jpg': 2},
 {'index': 6,
  'Wall(4).jpg': 2,
  'Fan vent.jpg': 5,
  'Fan vent(1).jpg': 4,
  'Rope.jpg': 1,
  'Towel.jpg': 3},
 {'index': 7,
  'Wooden sign.jpg': 2,
  'Hair(2).jpg': 3,
  'Skin(1).jpg': 1,
  'Fan vent(1).jpg': 4,
  'Towel.jpg': 5},
 {'index': 8,
  'Roc
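
The removal step above edits each response in place. As an alternative, the minimal sketch below builds new dictionaries and then checks that no personal field survived. The names `PERSONAL_KEYS` and `anonymize` and the sample record are hypothetical and only serve as illustration; the three field names are the ones removed in the cell above.

```python
# Fields treated as personal information (the same three removed above)
PERSONAL_KEYS = {"name", "timestamp", "gender"}

def anonymize(responses):
    """Return copies of the responses without personal fields; inputs are not mutated."""
    return [
        {key: value for key, value in response.items() if key not in PERSONAL_KEYS}
        for response in responses
    ]

# Made-up sample record, not taken from the survey
sample = [{"index": 0, "Bark.jpg": 1, "name": "A. Example",
           "timestamp": "2025-01-01 00:00:00", "gender": "Other"}]
clean = anonymize(sample)

# Sanity check: no personal key remains in any cleaned response
assert all(PERSONAL_KEYS.isdisjoint(response) for response in clean)
print(clean)
```

Running a check like this before writing the file gives some assurance that the saved JSON contains only the index and the image ratings.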

In [7]:
# Save the anonymized responses to a JSON file in the ../data folder
with open("../data/Anonymized_Relative_Roughness_Survey_Results.json", "w") as f:
    json.dump(rrs_dict, f)

## Exploratory Data Analysis 

The collected data is in a somewhat unusual format, so I believe it is best represented with a data structure like a hash map, where each key appears only once and maps to a single value. Here that would be something like a `Map<Tuple, float>`: the tuple holds the first and second image (_although I need a consistent way of ordering them so that the keys always match_), and the value, a number in [-1, 1], indicates which of the two is rougher. A value of -1 means every vote said the first image was rougher, 1 means every vote said the second image was rougher, and 0 means the votes were split evenly between the two.
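
As a small illustration of this structure, the sketch below builds a single entry by hand. The helper name `pair_key` and its use of `sorted` are assumptions made for this example (the notebook defines its own ordering function in the next cell). The three votes are the ones that appear for this pair in the output further down.

```python
from statistics import mean

def pair_key(image_a: str, image_b: str) -> tuple:
    """Hypothetical helper: order the two names so (a, b) and (b, a) share one key."""
    return tuple(sorted((image_a, image_b)))

# Votes for one pair: +1 = second image rougher, -1 = first image rougher
pair_votes = {pair_key("Leather chair.jpg", "Bark.jpg"): [1, -1, -1]}

for key, votes in pair_votes.items():
    score = mean(votes)  # always lies in [-1, 1]
    print(key, round(score, 3))
```

Two of the three votes say the first image (`Bark.jpg`) is rougher, so the score is negative.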

In [8]:
# Create a 'map' (really just a dictionary) that stores the ratings for each pair of images.
# The key is a tuple of two image names and the value holds the score in [-1, 1] explained above.
# As I later discovered, for this algorithm to work it also needs to store the individual votes.
ratings_map = {}

# Sorts a tuple into string-comparison (roughly alphabetical) order so that each pair maps to exactly one key.
# Returns None when the two entries are identical.
def sort_tuple(input_tuple: tuple):
    if input_tuple[0] == input_tuple[1]:
        return None
    if input_tuple[0] > input_tuple[1]:
        return (input_tuple[1], input_tuple[0])
    return input_tuple

In [9]:
# Go through each response in rrs_dict and pair every key with each key that comes after it.
# Note: 'index' is still a key of every response here, so it gets paired like an image (see the output below).

for result in rrs_dict:  # iterate through each survey response
    for i, key_i in enumerate(result.keys()):  # first item of the pair
        for j, key_j in enumerate(result.keys()):  # second item; O(n^2) pairs per response
            if j <= i:  # skip pairs that have already been matched
                continue
            image_tuple = sort_tuple((key_i, key_j))
            if image_tuple is None:
                continue
            if image_tuple not in ratings_map:
                ratings_map[image_tuple] = {'results': []}
            difference: int

            # -1 if the first image in the key has the higher rating, otherwise 1
            if result[image_tuple[0]] - result[image_tuple[1]] > 0:
                difference = -1
            else:
                difference = 1
            ratings_map[image_tuple]['results'].append(difference)
            


In [10]:

print(len(ratings_map))

ratings_map


685


{('Bark.jpg', 'index'): {'results': [-1, 1, 1, 1, 1, 1, 1]},
 ('Wall(4).jpg', 'index'): {'results': [-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]},
 ('Leather chair.jpg',
  'index'): {'results': [-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]},
 ('Skin.png', 'index'): {'results': [-1, 1, 1]},
 ('Skin(1).jpg', 'index'): {'results': [-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]},
 ('Bark.jpg', 'Wall(4).jpg'): {'results': [1]},
 ('Bark.jpg', 'Leather chair.jpg'): {'results': [1, -1, -1]},
 ('Bark.jpg', 'Skin.png'): {'results': [1]},
 ('Bark.jpg', 'Skin(1).jpg'): {'results': [1, -1]},
 ('Leather chair.jpg', 'Wall(4).jpg'): {'results': [-1]},
 ('Skin.png', 'Wall(4).jpg'): {'results': [-1]},
 ('Skin(1).jpg', 'Wall(4).jpg'): {'results': [-1, 1, 1]},
 ('Leather chair.jpg', 'Skin.png'): {'results': [1]},
 ('Leather chair.jpg', 'Skin(1).jpg'): {'results': [1, -1]},
 ('Skin(1).jpg', 'Skin.png'): {'results': [-1]},
 ('Clay sculpture_.jpg', 'index'): {'results': [-1, 1, 1, 1, 1, 1]},
 ('Whiteboard.jpg', 'index'): {'results': [1, 
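
The output above contains keys such as `('Bark.jpg', 'index')`, because each response still carries its `'index'` field and the loop pairs it like an image. One way to avoid this is sketched below, under the assumption that `'index'` is the only non-image key left after anonymization: filter the keys first, then let `itertools.combinations` produce each unordered pair once. The function name `build_ratings_map` and the toy responses are hypothetical.

```python
from itertools import combinations

NON_IMAGE_KEYS = {"index"}  # assumption: the only metadata key left in each response

def build_ratings_map(responses):
    """Collect one vote per image pair per response.

    A vote is -1 if the first image of the sorted pair has the higher rating,
    and +1 otherwise (the same convention as the loop above).
    """
    ratings = {}
    for response in responses:
        # Sorting the names first means every pair comes out in a consistent order
        images = sorted(key for key in response if key not in NON_IMAGE_KEYS)
        for first, second in combinations(images, 2):
            vote = -1 if response[first] > response[second] else 1
            ratings.setdefault((first, second), {"results": []})["results"].append(vote)
    return ratings

# Two made-up responses rating the same pair in opposite directions
toy = [
    {"index": 0, "Bark.jpg": 1, "Skin.png": 4},
    {"index": 1, "Skin.png": 2, "Bark.jpg": 5},
]
toy_map = build_ratings_map(toy)
print(toy_map)  # {('Bark.jpg', 'Skin.png'): {'results': [1, -1]}}
```

With this filtering the total number of pairs would be lower than the 685 reported above, since that count includes the `'index'` pairs.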

In [11]:
for pair in ratings_map:
    results = ratings_map[pair]['results']
    # Average vote in [-1, 1]; defaults to 0 when a pair has no votes
    ratings_map[pair]['avg'] = sum(results) / len(results) if results else 0

In [12]:
ratings_map

{('Bark.jpg', 'index'): {'results': [-1, 1, 1, 1, 1, 1, 1],
  'avg': 0.7142857142857143},
 ('Wall(4).jpg', 'index'): {'results': [-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  'avg': 0.8181818181818182},
 ('Leather chair.jpg',
  'index'): {'results': [-1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1], 'avg': 0.8181818181818182},
 ('Skin.png', 'index'): {'results': [-1, 1, 1], 'avg': 0.3333333333333333},
 ('Skin(1).jpg', 'index'): {'results': [-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  'avg': 0.8333333333333334},
 ('Bark.jpg', 'Wall(4).jpg'): {'results': [1], 'avg': 1.0},
 ('Bark.jpg', 'Leather chair.jpg'): {'results': [1, -1, -1],
  'avg': -0.3333333333333333},
 ('Bark.jpg', 'Skin.png'): {'results': [1], 'avg': 1.0},
 ('Bark.jpg', 'Skin(1).jpg'): {'results': [1, -1], 'avg': 0.0},
 ('Leather chair.jpg', 'Wall(4).jpg'): {'results': [-1], 'avg': -1.0},
 ('Skin.png', 'Wall(4).jpg'): {'results': [-1], 'avg': -1.0},
 ('Skin(1).jpg', 'Wall(4).jpg'): {'results': [-1, 1, 1],
  'avg': 0.333333
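
Because every vote is either -1 or +1, the `avg` value can be read as a vote share: if `k` of `n` votes say the second image is rougher, then `avg = (k - (n - k)) / n`, so `k / n = (avg + 1) / 2`. The sketch below applies that conversion. The function name `share_second_rougher` is hypothetical, and the votes are those shown above for `('Bark.jpg', 'Leather chair.jpg')`.

```python
def share_second_rougher(avg: float) -> float:
    """Convert an average vote in [-1, 1] into the fraction of votes for the second image."""
    return (avg + 1) / 2

votes = [1, -1, -1]  # one vote for the second image, two for the first
avg = sum(votes) / len(votes)
share = share_second_rougher(avg)
print(f"avg = {avg:.3f}, share voting second image rougher = {share:.3f}")
```

An `avg` of 0 therefore corresponds to a 50/50 split, and -1 or 1 to a unanimous vote.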

In [13]:
from collections import Counter

# Get the length of the results list for each pair
lengths = [len(v['results']) for v in ratings_map.values()]

# Count occurrences of each length
length_counts = Counter(lengths)

# Print counts for each length from 0 to max length
for i in range(0, max(lengths)+1):
    print(f"{i}: {length_counts.get(i, 0)}")

0: 0
1: 431
2: 148
3: 41
4: 15
5: 5
6: 8
7: 6
8: 5
9: 5
10: 8
11: 4
12: 4
13: 3
14: 1
15: 1
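
The counts above show that 431 of the 685 pairs have only a single vote, so many averages rest on one response. If later steps should use only better-supported pairs, one option is a minimum-vote threshold, sketched below. The function name `filter_by_votes` and the default of 3 votes are assumptions for illustration, not part of the analysis above.

```python
def filter_by_votes(ratings, min_votes=3):
    """Keep only pairs whose 'results' list has at least min_votes entries.

    min_votes=3 is an arbitrary illustrative threshold.
    """
    return {
        pair: entry
        for pair, entry in ratings.items()
        if len(entry["results"]) >= min_votes
    }

# Toy data in the same shape as ratings_map (votes taken from the output shown earlier)
toy_ratings = {
    ("Bark.jpg", "Leather chair.jpg"): {"results": [1, -1, -1]},
    ("Bark.jpg", "Skin.png"): {"results": [1]},
}
kept = filter_by_votes(toy_ratings)
print(list(kept))  # [('Bark.jpg', 'Leather chair.jpg')]
```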
