# Similarities for heterogeneous objects
### Introduction to Data Mining (CSE 5243)

Updated 2021-01-24 by [Michael Burkhardt](mailto:burkhardt.5@osu.edu)

In [1]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import StandardScaler

In [2]:
# Function to generate a random collection of toys
shapes = ['plane', 'boat', 'car']
colors = ['red', 'blue', 'green']
sizes = [10, 25, 50]
conditions = ['worn', 'fair', 'good', 'new']
N = 3

def make_toys():
    return pd.DataFrame({
        'shape': np.random.choice(shapes, N),
        'color': np.random.choice(colors, N),
        'size' : np.random.choice(sizes, N),
        'condition' : np.random.choice(conditions, N)
    }, index=['A', 'B', 'C'])

### Proximity functions for numeric attributes

For the `size` attribute, let's define dissimilarity as follows:

$$
d(p,q) = \frac{|p-q|}{n_{max}-n_{min}}
$$

And we'll define similarity in terms of dissimilarity:
$$
s(p,q) = 1-d(p,q)
$$

where I've chosen $n_{min}=10$ and $n_{max}=50$.
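
As a quick check of the formula, take two sizes from the list above, $p=10$ and $q=25$ (a pair chosen only for illustration):

$$
d(10,25) = \frac{|10-25|}{50-10} = 0.375, \qquad s(10,25) = 1 - 0.375 = 0.625
$$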

In [3]:
def dissim_num(p, q, attr=None):
    assert attr is not None, 'You must specify a list of possible values'
    min_val = min(attr) # Minimum possible size
    max_val = max(attr) # Maximum possible size
    assert min_val <= p <= max_val, 'p must be between {} and {}'.format(min_val, max_val)
    assert min_val <= q <= max_val, 'q must be between {} and {}'.format(min_val, max_val)
    return(abs(p - q) / (max_val - min_val))

def sim_num(p, q, attr=None):
    assert attr is not None, 'You must specify a list of possible values'
    return(1 - dissim_num(p, q, attr))

### Proximity functions for nominal attributes

Because the `shape` and `color` attributes are nominal, we'll define dissimilarity as either 0 or 1:

$$
d(p,q)=\left\{ \begin{array}{rcl}
0 & \textrm{if} & p = q \\
1 & \textrm{if} & p \ne q
\end{array}\right.
$$

And because we know the value of $d(p,q)$ will always be either $0$ or $1$, we can
define similarity as:

$$
s(p,q)=1-d(p,q)
$$

In [4]:
def dissim_nom(p, q):
    return(0 if p==q else 1)

def sim_nom(p, q):
    return(1-dissim_nom(p, q))

### Proximity functions for ordinal attributes
The `condition` attribute is ordinal, so we need to decide how to define the values quantitatively.

Let's use a simple scale with equal intervals: worn (0) < fair (1) < good (2) < new (3)

With that in mind, we'll define dissimilarity as:

$$
d(p,q) = \frac{|p-q|}{n-1}
$$

where $n=4$ is the number of levels on the scale.

We'll then define similarity as:

$$
s(p,q) = 1 - d(p,q)
$$
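
For instance, comparing a toy in fair condition (1) with a new one (3) on the scale above (an illustrative pair):

$$
d(\textrm{fair},\textrm{new}) = \frac{|1-3|}{4-1} = \frac{2}{3}, \qquad s(\textrm{fair},\textrm{new}) = 1 - \frac{2}{3} = \frac{1}{3}
$$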

In [5]:
def dissim_ord(p, q, attr):
    assert attr is not None, 'You must specify a list of possible values'
    val_p = attr.index(p)
    val_q = attr.index(q)
    return(abs(val_p-val_q)/(len(attr)-1))

def sim_ord(p, q, attr):
    assert attr is not None, 'You must specify a list of possible values'
    return(1-dissim_ord(p, q, attr))

Recall the formula for the similarity of heterogeneous objects (Tan2e, Formula 2.28):

$$
s(\mathbf{x},\mathbf{y}) = \frac{\sum_{k=1}^{n}\delta_k s_k(\mathbf{x},\mathbf{y})}{\sum_{k=1}^{n}\delta_k}
$$

where

* $s_k$ is a similarity function for the $k^{th}$ attribute, with values ranging from $0.0$ to $1.0$.
* $\delta_k$ is an indicator variable that is 0 if the $k^{th}$ attribute is asymmetric and both $x_k$ and $y_k$ are 0, and 1 otherwise.

For our purposes, since there are no asymmetric attributes, every $\delta_k$ equals 1 and the formula simplifies to:

$$
s(\mathbf{x},\mathbf{y}) = \sum_{k=1}^{n}\frac{s_k(\mathbf{x},\mathbf{y})}{n}
$$

where $n$ is the number of attributes.
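
The general form with indicator variables can be sketched as follows. This is only an illustration: the function name `mixed_similarity`, its list-based arguments, and the choice to raise an error when every indicator is 0 are assumptions, not part of the notebook or the textbook.

```python
def mixed_similarity(sims, deltas):
    """Combine per-attribute similarities s_k using indicators delta_k.

    Hypothetical helper: sims and deltas are equal-length sequences,
    with each delta_k equal to 0 or 1.
    """
    if len(sims) != len(deltas):
        raise ValueError('sims and deltas must have the same length')
    denom = sum(deltas)
    if denom == 0:
        # Assumption: with no contributing attribute the ratio is undefined
        raise ValueError('no attribute contributes to the comparison')
    return sum(d * s for d, s in zip(deltas, sims)) / denom

# All indicators are 1, so the result is the plain mean of the s_k values
print(mixed_similarity([0, 1, 1, 1/3], [1, 1, 1, 1]))

# Made-up case: the third attribute is asymmetric with both values 0,
# so it is excluded from both the numerator and the denominator
print(mixed_similarity([0, 1, 0, 1/3], [1, 1, 0, 1]))
```

With every $\delta_k$ set to 1, the denominator is $n$ and the function returns the simple average used in the rest of this notebook.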

In [6]:
# Function to compute the similarity of two toys
def sim_rec(objP, objQ):
    return np.mean([
        sim_nom(objP['shape'], objQ['shape']),
        sim_nom(objP['color'], objQ['color']),
        sim_num(objP['size'], objQ['size'], sizes),
        sim_ord(objP['condition'], objQ['condition'], conditions)
    ])

In [7]:
# Generate a new set of toys
toys = make_toys()
toys

|   | shape | color | size | condition |
|---|-------|-------|------|-----------|
| A | plane | blue  | 10   | fair      |
| B | boat  | red   | 10   | worn      |
| C | car   | blue  | 10   | new       |


In [8]:
# Compute and display the similarities
simAB = sim_rec(toys.loc['A'], toys.loc['B'])
simAC = sim_rec(toys.loc['A'], toys.loc['C'])
simBC = sim_rec(toys.loc['B'], toys.loc['C'])
print('s(A,B)={:0.3f}, s(A,C)={:0.3f}, s(B,C)={:0.3f}'.format(simAB, simAC, simBC))

s(A,B)=0.417, s(A,C)=0.583, s(B,C)=0.250
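
The three pairwise values can also be arranged as a full similarity matrix. The sketch below is self-contained, so it hard-codes the three toys from the table above instead of calling `make_toys()`, and `toy_similarity` is a hypothetical, compact restatement of the per-attribute rules defined earlier.

```python
import pandas as pd

sizes = [10, 25, 50]
conditions = ['worn', 'fair', 'good', 'new']

# The toys from the table above, hard-coded so the example is reproducible
toys = pd.DataFrame({
    'shape': ['plane', 'boat', 'car'],
    'color': ['blue', 'red', 'blue'],
    'size': [10, 10, 10],
    'condition': ['fair', 'worn', 'new'],
}, index=['A', 'B', 'C'])

def toy_similarity(p, q):
    """Average the four per-attribute similarities of two toys (hypothetical helper)."""
    rank_p = conditions.index(p['condition'])
    rank_q = conditions.index(q['condition'])
    parts = [
        1.0 if p['shape'] == q['shape'] else 0.0,                    # nominal
        1.0 if p['color'] == q['color'] else 0.0,                    # nominal
        1 - abs(p['size'] - q['size']) / (max(sizes) - min(sizes)),  # numeric
        1 - abs(rank_p - rank_q) / (len(conditions) - 1),            # ordinal
    ]
    return sum(parts) / len(parts)

# Evaluate every ordered pair of toys, including each toy with itself
labels = list(toys.index)
sim_matrix = pd.DataFrame(
    [[toy_similarity(toys.loc[i], toys.loc[j]) for j in labels] for i in labels],
    index=labels, columns=labels,
)
print(sim_matrix.round(3))
```

The matrix is symmetric with ones on the diagonal, and its off-diagonal entries correspond to the three values printed above.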
