# Problem

Users can submit a list of attributes for which they want to find the closest matching images.  User requests are captured as nested JSON.  Each image also has a list of attributes that apply to it, also stored as JSON.


EXAMPLE:

{'sex': 'male',
 'age': 25,
 'skin': {'wrinkles': 1, 'scars': True},
 'eyes': 'blue',
 'hair': {'colour': 'brown', 'texture': 'wavy', 'length': 'short'},
 'emotion': 'happy',
 'ears': 'Vulcan',
 'nose': 'red'
 }


Comparing nested JSON objects is very slow.  Calculating the Levenshtein distance between an input JSON object with up to 22 attributes and each of 1 million existing JSON objects takes a very long time.  We need to speed this up.

# Assumptions

We are dealing with 1 million images, each of which has a json file containing a series of attributes which describe the image.

Each attribute has a set of possible values.

Some attributes may contain nested values.

Some attributes are categorical (e.g. colour), others are boolean (1 or True).

Not every image has every attribute.



# Idea

Flatten the nested JSON objects and turn them into higher-dimensional vectors.

# Problems to resolve

* Most attributes can be resolved and flattened as strings, but age will be an integer.
    * This is probably easily resolved by giving each age its own column (e.g. age.24, age.25).
    * Depends on how Luc has age encoded.
    * Age almost certainly doesn't need to be exact: what does an image of a 26 year old look like vs. 25?
* ~~Need to figure out how to transform True/False values like {wrinkles: 1}~~
* Is there a definitive list of attributes?
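
One way to act on the point that age probably doesn't need to be exact is to bucket ages into ranges before flattening, so that 25 and 26 land in the same column. A minimal sketch follows; the helper name `bucket_age` and the 10-year bucket width are assumptions for illustration, not something that has been decided.

```python
def bucket_age(age: int, width: int = 10) -> str:
    """Map an exact age to a range label, e.g. 25 -> '20-29' (hypothetical helper)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

record = {'sex': 'male', 'age': 25}
# Replace the exact age with its bucket before flattening.
record['age'] = bucket_age(record['age'])
print(record)   # {'sex': 'male', 'age': '20-29'}
```

With this in place the flattened token would be `age.20-29` rather than `age.25`, which also shrinks the number of age columns.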

# Key Functions

In [1]:
from collections.abc import MutableMapping
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance
import json
import pickle

### Flatten via generator

Flattening functions totally stolen: https://www.freecodecamp.org/news/how-to-flatten-a-dictionary-in-python-in-4-different-ways/

Using the generator option is much more memory efficient

#### sample\['hair'\]\['colour'\]: 'brown' becomes 'hair.colour.brown'

In [2]:
def flatten_dict_gen(d, parent_key, sep):
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):      # the value is itself a nested key/value object
            # Recurse directly into the generator instead of building an intermediate dict
            yield from flatten_dict_gen(v, new_key, sep)
        else:
            yield new_key, v

In [3]:
def flatten_dict(d: MutableMapping, parent_key: str = '', sep: str = '.'):
    return dict(flatten_dict_gen(d, parent_key, sep))
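
To make the flattening concrete, here is a small self-contained sketch that applies the same idea to part of the example record from the Problem section. The function is re-declared under the hypothetical name `flatten` so the snippet runs on its own.

```python
from collections.abc import MutableMapping

def flatten(d, parent_key='', sep='.'):
    """Yield (dotted_key, value) pairs for a nested mapping."""
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            yield from flatten(v, new_key, sep)
        else:
            yield new_key, v

sample = {'sex': 'male',
          'skin': {'wrinkles': 1, 'scars': True},
          'hair': {'colour': 'brown', 'texture': 'wavy'}}
flat_sample = dict(flatten(sample))
print(flat_sample)
# {'sex': 'male', 'skin.wrinkles': 1, 'skin.scars': True,
#  'hair.colour': 'brown', 'hair.texture': 'wavy'}
```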

### Convert flat json to string

Once we have a flat dictionary, we need to combine the key/value pairs into a single string.  

Key/value pairs with a boolean value need special handling: only the key is emitted, so the presence of the token indicates that the attribute was true in the original JSON.

In [4]:
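def flat_to_string(in_dict):
    # Boolean-like values emit only the key; everything else emits "key.value".
    # True == 1 and False == 0 in Python, so (0, 1) covers all four cases.
    # Caveat: a False/0 value still emits its key.
    return " ".join(k if v in (0, 1) else f"{k}.{v}" for k, v in in_dict.items())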
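
The membership test above treats any value equal to 0 or 1 as a flag, so a `False` value still emits its key, and a numeric attribute whose value happens to be 1 loses its value. If that turns out to matter, a stricter variant could check the type explicitly. The sketch below is only one possible approach: `flat_to_string_strict` is a hypothetical name, `skin.freckles` is a made-up attribute, and dropping `False` flags is a design choice not taken in this notebook. Note that it would encode `{wrinkles: 1}` as `skin.wrinkles.1`, so it only fits if flags are stored as real booleans.

```python
def flat_to_string_strict(in_dict):
    """Emit 'key' for True flags, nothing for False flags, 'key.value' otherwise."""
    tokens = []
    for k, v in in_dict.items():
        if isinstance(v, bool):
            if v:                      # keep only flags that are actually True
                tokens.append(k)
        else:
            tokens.append(f"{k}.{v}")  # ints such as 1 keep their value
    return " ".join(tokens)

flat_example = {'skin.scars': True, 'skin.freckles': False, 'skin.wrinkles': 1, 'age': 25}
print(flat_to_string_strict(flat_example))   # skin.scars skin.wrinkles.1 age.25
```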

# Load test file, flatten, convert to string, and vectorize

#### Load json file containing nested json objects

In [5]:
%%time
with open("sample_json_1000000.json", 'r') as fin:
    dict_list = json.load(fin)

Wall time: 3.59 s


In [6]:
len(dict_list)

1000000

#### Flatten json objects and convert to list of strings

In [7]:
%%time
string_list = []
for item in dict_list:
    flat = flatten_dict(item)          # nested dict -> flat dict with dotted keys
    as_str = flat_to_string(flat)      # flat dict -> space-separated token string
    string_list.append(as_str)

Wall time: 9.24 s


In [8]:
flat

{'sex': 'female',
 'age': 73,
 'hair.colour': 'gray',
 'hair.length': 'medium',
 'ethnicity': 'asian',
 'eyebrows': 'bushy',
 'accessories': 'earrings'}

In [9]:
as_str

'sex.female age.73 hair.colour.gray hair.length.medium ethnicity.asian eyebrows.bushy accessories.earrings'

In [10]:
len(string_list)

1000000

#### Vectorize the list of strings

In [11]:
vectorizer = CountVectorizer(token_pattern=r'\S+')  # each whitespace-separated token is one feature

In [12]:
%%time
X = vectorizer.fit_transform(string_list)

Wall time: 5.97 s


In [13]:
Y = vectorizer.get_feature_names_out().tolist()  # plain list of str, in column order

In [14]:
len(Y)

102

#### Complete list of attributes

In [15]:
Y

['accessories.earrings',
 'accessories.glasses',
 'accessories.hat',
 'age.10',
 'age.11',
 'age.12',
 'age.13',
 'age.14',
 'age.15',
 'age.16',
 'age.17',
 'age.18',
 'age.19',
 'age.20',
 'age.21',
 'age.22',
 'age.23',
 'age.24',
 'age.25',
 'age.26',
 'age.27',
 'age.28',
 'age.29',
 'age.30',
 'age.31',
 'age.32',
 'age.33',
 'age.34',
 'age.35',
 'age.36',
 'age.37',
 'age.38',
 'age.39',
 'age.40',
 'age.41',
 'age.42',
 'age.43',
 'age.44',
 'age.45',
 'age.46',
 'age.47',
 'age.48',
 'age.49',
 'age.50',
 'age.51',
 'age.52',
 'age.53',
 'age.54',
 'age.55',
 'age.56',
 'age.57',
 'age.58',
 'age.59',
 'age.60',
 'age.61',
 'age.62',
 'age.63',
 'age.64',
 'age.65',
 'age.66',
 'age.67',
 'age.68',
 'age.69',
 'age.70',
 'age.71',
 'age.72',
 'age.73',
 'age.74',
 'age.75',
 'age.76',
 'age.77',
 'age.78',
 'age.79',
 'age.80',
 'ears.big',
 'ears.droopy',
 'ears.huge',
 'emotion.angry',
 'emotion.happy',
 'emotion.sad',
 'ethnicity.asian',
 'ethnicity.black',
 'ethnicity.cau

#### Save the list of attributes to file

In [16]:
with open("attributes.pickle", "wb") as fout:
    pickle.dump(Y,fout)

# Find similar vector in array

### Get array from vectorized results

In [17]:
x_array = X.toarray()

In [18]:
x_array.shape

(1000000, 102)

### Extract row as test vector

In [19]:
test_index = 375

In [20]:
test_row = x_array[test_index : test_index+1,:]

In [21]:
test_row

array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]], dtype=int64)

## Find the nearest matches in the matrix for the test vector

In [22]:
from scipy.spatial import distance

In [23]:
%%time
distances = distance.cdist(test_row, x_array, "cosine")[0]
five_closest = np.argsort(distances)[:5]  # get N closest matches
#closest_match = np.argmin(distances) # this gives index of closest match



Wall time: 310 ms
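
Sorting all one million distances just to keep five does more work than needed. A sketch of a partial selection with `np.argpartition`, which moves the k smallest values to the front in arbitrary order so that only those k need sorting; the toy `toy_distances` array stands in for the real one.

```python
import numpy as np

toy_distances = np.array([0.9, 0.1, 0.5, 0.0, 0.7, 0.3])
k = 3

# Indices of the k smallest distances, in arbitrary order.
candidates = np.argpartition(toy_distances, k)[:k]
# Sort just those k candidates by distance.
k_closest = candidates[np.argsort(toy_distances[candidates])]
print(k_closest)   # [3 1 5]
```

This requires `k` to be smaller than the number of rows, which is not a concern for 5 out of 1M.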


In [24]:
five_closest

array([   375, 719164, 698364, 221157, 226679], dtype=int64)

In [25]:
distances[375]

0.0

### You can save the entire 1M record matrix as a numpy array

In [26]:
x_array.shape

(1000000, 102)

In [27]:
with open("image_matrix", 'wb') as fout2:
    np.save(fout2, x_array)

In [28]:
with open("image_matrix", 'rb') as fin3:
    new_x_array = np.load(fin3)

In [29]:
new_x_array.shape

(1000000, 102)
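
Since the matrix holds only small counts, it may be worth storing it in a narrower dtype: at int64 a (1000000, 102) array takes about 816 MB, while uint8 would take about 102 MB. A sketch with a toy array and an in-memory buffer; it assumes every count fits in 0-255, which holds for 0/1 indicators.

```python
import io
import numpy as np

toy = np.array([[0, 1, 0], [1, 0, 1]], dtype=np.int64)
small = toy.astype(np.uint8)          # 1 byte per cell instead of 8
print(toy.nbytes, small.nbytes)       # 48 6

buf = io.BytesIO()                    # stands in for a file opened with 'wb'
np.save(buf, small)
buf.seek(0)
restored = np.load(buf)               # dtype is preserved on load
```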

## Take an input json file and check it against the 1M record matrix

Note: this seems hacky.  There may be a better way to compare the vector of the input json (which will only contain a few columns) against the 102 column rows of the matrix.  This sounds like a question for Sir HEALY, Earl of Embedding.
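
One possibly cleaner route, offered as a sketch rather than a settled answer: a fitted `CountVectorizer` can vectorize new strings itself via `transform`, which reuses the fitted vocabulary and ignores unseen tokens, so the query row always has the same columns as the matrix. The tiny corpus below is made up for illustration; in this notebook the fitted `vectorizer` from earlier would be reused (or pickled alongside the matrix).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for string_list (illustrative data only).
corpus = ["sex.male age.55", "sex.female age.73 hair.colour.gray"]
vec = CountVectorizer(token_pattern=r'\S+')
vec.fit(corpus)

# 'ears.vulcan' is not in the fitted vocabulary, so transform() drops it.
query = vec.transform(["sex.male age.55 ears.vulcan"])
print(sorted(vec.vocabulary_))   # column order
print(query.toarray())           # [[1 0 0 0 1]]
```

The result is a sparse row; `query.toarray()` gives a dense 2D array that can be passed to `distance.cdist` in place of a hand-built input vector.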

In [30]:
from collections import defaultdict

#### Load list of all attributes from file

In [31]:
with open("attributes.pickle", "rb") as fin2:
    attributes = pickle.load(fin2)

#### Create dict with all attributes as keys with 0 values

In [32]:
attributes_dict = defaultdict.fromkeys(attributes, 0)

#### Get user input as json

In [33]:
input_json = {
    'sex' : 'male',
    'age' : 55,
    'ears' : 'big',
    'hair' : {'colour':'blonde' }
}

#### Flatten user input and convert to string

In [34]:
flat_input = flatten_dict(input_json)

In [35]:
input_str = flat_to_string(flat_input)

In [36]:
input_str

'sex.male age.55 ears.big hair.colour.blonde'

#### Iterate over attributes in the string and set their values in attributes_dict to 1

In [37]:
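for attribute in input_str.lower().split():   # CountVectorizer lowercases tokens by default
    if attribute in attributes_dict:          # skip unknown attributes so the vector keeps its length
        attributes_dict[attribute] = 1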

In [38]:
attributes_dict

defaultdict(None,
            {'accessories.earrings': 0,
             'accessories.glasses': 0,
             'accessories.hat': 0,
             'age.10': 0,
             'age.11': 0,
             'age.12': 0,
             'age.13': 0,
             'age.14': 0,
             'age.15': 0,
             'age.16': 0,
             'age.17': 0,
             'age.18': 0,
             'age.19': 0,
             'age.20': 0,
             'age.21': 0,
             'age.22': 0,
             'age.23': 0,
             'age.24': 0,
             'age.25': 0,
             'age.26': 0,
             'age.27': 0,
             'age.28': 0,
             'age.29': 0,
             'age.30': 0,
             'age.31': 0,
             'age.32': 0,
             'age.33': 0,
             'age.34': 0,
             'age.35': 0,
             'age.36': 0,
             'age.37': 0,
             'age.38': 0,
             'age.39': 0,
             'age.40': 0,
             'age.41': 0,
             'age.42': 0,
          

#### Create a 2D array of the values in the attributes_dict

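As mentioned, this is a real hack. It relies on attributes_dict preserving the order of the saved attribute list, so that position i of the vector lines up with column i of the matrix.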

In [39]:
input_vector = np.array([list(attributes_dict.values())])

In [40]:
input_vector[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

In [41]:
input_distances = distance.cdist(input_vector, x_array, "cosine")[0]
five_closest = np.argsort(input_distances)[:5]  # get N closest matches

In [42]:
five_closest

array([771755, 687823, 268132, 355174, 895159], dtype=int64)

In [43]:
input_distances[771755]

0.10557280900008414
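
That distance can be checked by hand. For 0/1 vectors the dot product is the number of shared attributes and each norm is the square root of the number of attributes set, so cosine distance reduces to 1 - shared / sqrt(n_query * n_image). The query sets 4 attributes; assuming the best match shares all 4 and sets 5 in total (which is what the row printed for index 771755 suggests), the arithmetic is:

```python
import math

n_query = 4    # attributes set in the input vector
n_image = 5    # attributes set in the matched row (read off the printed row)
shared = 4     # attributes set in both (assumed: every query attribute matches)

cosine_distance = 1 - shared / math.sqrt(n_query * n_image)
print(round(cosine_distance, 6))   # 0.105573
```

One consequence is that an image is penalised for every extra attribute it has beyond the query, so short queries favour sparsely tagged images.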

In [44]:
x_array[771755]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=int64)