
# **Setup**

Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [5]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt, nltk

np.set_printoptions(linewidth=10000, precision=4, edgeitems=20, suppress=True)
pd.set_option('display.max_columns', 100)   # show up to 100 columns
pd.set_option('display.max_colwidth', 100)  # show up to 100 characters per cell
pd.set_option('display.precision', 2)       # display floats with 2 decimal places
pd.set_option('display.max_rows', 8)        # truncate displayed dataframes to 8 rows

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

Review the code Professor Melnikov used to compare strings in the previous video.

## Binary Similarity

**Binary comparison** is a simple method for measuring the similarity between two objects, where the return value `False` (or 0) indicates different objects and `True` (or 1) indicates equivalent objects. Typically, character strings are considered equivalent when they have exactly the same sequence of characters. Since Python is a case-sensitive language, uppercase and lowercase letters are treated as different characters.
    
The examples below demonstrate Python's built-in equality test, where strings with different characters are not equal. However, Python evaluates arithmetic expressions before comparing them, so it treats 0, 1-1, 4/2-2, and 2+2-4 as equal.

In [6]:
"Cornell"=="eCornell"       # binary (Boolean) comparison; False=0, True=1
0 == 1-1 == 4/2-2 == 2+2-4  # compares all elements at once

False

True
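
Because string comparison is case-sensitive, one common way to compare strings while ignoring case is to normalize both sides first. The sketch below is a minimal illustration rather than code from the video; the helper name `same_ignoring_case` is hypothetical. It also spells out how Python evaluates the chained comparison shown above.

```python
def same_ignoring_case(s1, s2):
    """Binary similarity that ignores letter case (hypothetical helper)."""
    return s1.casefold() == s2.casefold()

exact = "Cornell" == "cornell"                      # case-sensitive comparison
relaxed = same_ignoring_case("Cornell", "cornell")  # case-insensitive comparison

# A chained comparison is the conjunction of its adjacent pairwise comparisons.
chained = 0 == 1-1 == 4/2-2 == 2+2-4
expanded = (0 == 1-1) and (1-1 == 4/2-2) and (4/2-2 == 2+2-4)

print(exact, relaxed, chained, expanded)
```

Normalizing case is a design choice: it makes the binary test more forgiving, but it also erases distinctions that may matter, such as acronyms versus ordinary words.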

## Jaccard Similarity

[**Jaccard similarity**](https://scikit-learn.org/stable/modules/model_evaluation.html#jaccard-similarity-score) produces a similarity score in the interval $[0,1]$, where 0 indicates no overlap and 1 indicates identical sets. It works on any two sets of numbers, characters, words, or other objects: compute the intersection $I$ (the distinct shared elements) and the union $U$ (all distinct elements), then take the ratio of their cardinalities, $|I|/|U|$. In other words, it is the fraction of all distinct elements that the two sets share.
    
The example below demonstrates the intersection and union sets of characters in two words.

In [7]:
A, B = set('Cornell'), set('eCornell') # convert to sets of characters
print(f'intersection: {A & B}')        # or A.intersection(B)
print(f'union: {A | B}')               # or A.union(B)

intersection: {'C', 'l', 'r', 'o', 'n', 'e'}
union: {'C', 'l', 'r', 'o', 'n', 'e'}


Note that a high Jaccard similarity (or high overlap in characters) does not necessarily indicate a high semantic similarity. In one of the examples below, `'cat'` and `'act'` have Jaccard similarity of 1, but they are not semantically similar.
    
Jaccard similarity also tends to be biased toward longer words, since more characters result in a higher chance of shared characters. Sets do not store repeated elements, so `'Cornell'` and `'eCornell'` reduce to the same set of characters and have a Jaccard similarity of 1.

In [8]:
JaccardSim = lambda A=set(), B=set(): len(A & B) / len(A | B)    # A, B = sets of characters (defaults are empty sets, not dicts)
JaccardDemo = lambda s1='word1', s2='word2': print(f'Jaccard({s1}, {s2}) = {JaccardSim(set(s1), set(s2))}')

JaccardDemo('Cornell', 'eCornell')  # sets remove duplicate characters, so the resulting sets are the same
JaccardDemo('cat', 'dog')           # animal-ness of these terms is not represented in their characters
JaccardDemo('cat', 'act')           # the resulting sets are the same, but meanings are different
JaccardDemo('cat', 'catch')         # high similarity is misleading. The semantic meanings are different as well
JaccardDemo('cats', 'dogs')         # plural form contributes the same "s" to each word

Jaccard(Cornell, eCornell) = 1.0
Jaccard(cat, dog) = 0.0
Jaccard(cat, act) = 1.0
Jaccard(cat, catch) = 0.75
Jaccard(cats, dogs) = 0.14285714285714285


## Correlation

Correlation is another measure of similarity that returns values in the interval $[-1,1]$, but it can only be computed on two numeric vectors of the same size. More precisely, it is a statistical measure of the linear relationship between two numeric vectors.

1. A value of 1 indicates a perfectly positive linear relation, such as for vectors $x=[1,2,3]$ and $y=[3,5,7]$, where you can express one vector as a linear function of the other: $y=2x+1$
1. A value of -1 indicates a perfectly opposite linear relation, such as for vectors $x=[1,2,3]$ and $y=[0,-1,-2]$, where you can express the relation between corresponding elements via a linear formula: $y=-x+1$
1. A value of 0 indicates no linear relation between two numeric vectors, such as for vectors $x=[1,2,3]$ and $y=[1,2,1]$, where $y$ rises and then falls with no overall linear trend. (For a constant vector such as $y=[2,2,2]$, correlation is undefined because the vector has zero variance.)
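
These cases can be checked numerically with `np.corrcoef`. The sketch below is illustrative only: the helper name `pearson` is hypothetical, and the vector `[1, 2, 1]` is an assumed example for the zero-correlation case, not one taken from the video. The last lines show that a constant vector such as `[2, 2, 2]` yields `nan` rather than 0, because its variance is zero.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two equal-length numeric vectors (hypothetical helper)."""
    return np.corrcoef(a, b)[0, 1]

x = np.array([1, 2, 3])

r_pos = pearson(x, 2 * x + 1)             # y = 2x + 1: perfectly positive relation
r_neg = pearson(x, -x + 1)                # y = -x + 1: perfectly opposite relation
r_zero = pearson(x, np.array([1, 2, 1]))  # assumed example with no linear trend

# A constant vector has zero variance, so the correlation is undefined (nan).
with np.errstate(invalid='ignore', divide='ignore'):
    r_const = pearson(x, np.array([2, 2, 2]))

print(r_pos, r_neg, r_zero, r_const)
```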
    
You cannot compute correlation on characters directly, but you can convert the characters to their numeric ASCII codes. You then need to pad these numeric vectors (with zeros, for example) to the same length so that you can apply the correlation function. Padding on the right aligns the beginnings of words, so it is preferred when words share a stem and differ mainly by their suffixes. Padding on the left aligns the endings of words, so it is preferred when words differ mainly by their prefixes.
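
The two padding directions can be sketched on plain Python lists as follows. This is a minimal illustration that assumes zero as the padding value; the function name `to_codes` and its parameters are hypothetical.

```python
def to_codes(word, width, side="right", fill=0):
    """Convert a word to character codes and pad to `width` (hypothetical helper)."""
    codes = [ord(ch) for ch in word]
    pad = [fill] * (width - len(codes))   # empty list if the word is already long enough
    return codes + pad if side == "right" else pad + codes

words = ['cat', 'cats']
width = max(len(w) for w in words)        # length of the longest word

right_padded = [to_codes(w, width, side="right") for w in words]
left_padded = [to_codes(w, width, side="left") for w in words]

print(right_padded)   # word beginnings are aligned
print(left_padded)    # word endings are aligned
```

With right padding, the shared stem `cat` occupies the same positions in both vectors. With left padding, the final letters are aligned instead, so the stem shifts by one position.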
    
The code below converts several words into sequences of their numeric ASCII codes and packages them as rows in a dataframe. Here, `NaN` is used to pad the vectors. Note that all words start with the same letter `"c"`, and most start with the same stem `"cat"`, so all vectors start with the same ASCII code (99). The words are right-padded, as shown by the `NaN` values appended on the right of each shorter vector.

In [15]:
ord("A")

65

In [9]:
VectorizeTerms = lambda LsTerms, lower=True: pd.DataFrame([[ord(ch) for ch in (term.lower() if lower else term)] for term in LsTerms], index=LsTerms)  # lower=True converts each term to lowercase before encoding
LsWords = ['cat', 'cats', 'cows', 'cattle', 'catlike', 'catfish']
df = VectorizeTerms(LsWords)  # rows of unequal length are automatically right-padded with NaN to a common length
print(f'df.shape = {df.shape}')
df    # missing characters are right-padded with NaN

df.shape = (6, 7)


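|         | 0  | 1   | 2   | 3     | 4     | 5     | 6     |
|---------|----|-----|-----|-------|-------|-------|-------|
| cat     | 99 | 97  | 116 | NaN   | NaN   | NaN   | NaN   |
| cats    | 99 | 97  | 116 | 115.0 | NaN   | NaN   | NaN   |
| cows    | 99 | 111 | 119 | 115.0 | NaN   | NaN   | NaN   |
| cattle  | 99 | 97  | 116 | 116.0 | 108.0 | 101.0 | NaN   |
| catlike | 99 | 97  | 116 | 108.0 | 105.0 | 107.0 | 101.0 |
| catfish | 99 | 97  | 116 | 102.0 | 105.0 | 115.0 | 104.0 |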


If left-padding is needed, you could use [`.zfill()`](https://docs.python.org/3/library/stdtypes.html#str.zfill) and [`.rjust()`](https://docs.python.org/3/library/stdtypes.html#str.rjust) string methods as shown below.

In [16]:
nMaxLen = max(len(s) for s in LsWords)  # find the length of the longest word
'cat'.zfill(nMaxLen) # left-pads with zeros to the length of the longest word
'cat'.rjust(nMaxLen) # left-pads with space characters

'0000cat'

'    cat'

Next, use the [`corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) method to compute the pairwise correlations of the word vectors. Because `corr()` correlates columns, the code first transposes `df` so that each word becomes a column, and it replaces the `NaN` padding with zeros. The result is a square symmetric matrix with ones on the diagonal, which indicate perfect correlation of each word (represented by a row vector in `df`) with itself. Any other cell $ij$ holds the correlation of the character code vectors of word $i$ and word $j$ in `df`.

In contrast to Jaccard similarity, the position of each value matters here, and any two vectors must lie in the same-dimensional space (i.e., must have the same number of values).
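
To see the sensitivity to position, compare two words made of the same letters in a different order. The sketch below is an illustration under assumed inputs (the word pair `'cat'` and `'tac'` is not from the video): the character sets are identical, so the Jaccard similarity is 1, while the correlation of the code vectors is far from 1.

```python
import numpy as np

a, b = 'cat', 'tac'   # same letters, different order (assumed example)

# Jaccard similarity ignores order because it works on sets of characters.
jac = len(set(a) & set(b)) / len(set(a) | set(b))

# Correlation compares values position by position, so order matters.
codes_a = [ord(ch) for ch in a]   # [99, 97, 116]
codes_b = [ord(ch) for ch in b]   # [116, 97, 99]
r = np.corrcoef(codes_a, codes_b)[0, 1]

print(f'Jaccard = {jac}, correlation = {r:.3f}')
```

Here the correlation is about -0.33, even though the two words contain exactly the same characters.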

In [18]:
cm = sns.light_palette("brown", as_cmap=True)
mSim = df.T.fillna(0).corr().values     # numpy (square) correlation matrix
print(f'mSim.shape = {mSim.shape}')
# Applying format before styling
styled_df = pd.DataFrame(mSim, index=df.index, columns=df.index).style.format("{:.3f}")
styled_df.background_gradient(cmap=cm, axis=1) # Applying background gradient to styled df

# Alternatively, you could chain format with background_gradient
# (pd.DataFrame(mSim, index=df.index, columns=df.index)
#  .style.background_gradient(cmap=cm, axis=1)
#  .format("{:.3f}"))

mSim.shape = (6, 6)


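|         | cat    | cats   | cows   | cattle | catlike | catfish |
|---------|--------|--------|--------|--------|---------|---------|
| cat     | 1.000  | 0.707  | 0.729  | 0.312  | -0.002  | -0.092  |
| cats    | 0.707  | 1.000  | 0.997  | 0.507  | 0.168   | -0.243  |
| cows    | 0.729  | 0.997  | 1.000  | 0.500  | 0.130   | -0.267  |
| cattle  | 0.312  | 0.507  | 0.500  | 1.000  | 0.404   | 0.162   |
| catlike | -0.002 | 0.168  | 0.130  | 0.404  | 1.000   | 0.826   |
| catfish | -0.092 | -0.243 | -0.267 | 0.162  | 0.826   | 1.000   |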


# **Optional Practice**

Now you will practice some of these basic string manipulation techniques. You will apply some measures discussed above to the quotes of famous mathematicians.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

In [19]:
sQuote1 = 'Pure mathematics is, in its way, the poetry of logical ideas.' # Albert Einstein, German theoretical physicist
sQuote2 = 'Mathematics is the most beautiful and most powerful creation of the human spirit.' # Stefan Banach, Polish mathematician
sQuote3 = 'What is mathematics? It is only a systematic effort of solving puzzles posed by nature.' # Shakuntala Devi

## Task 1. Apply Jaccard similarity to characters

Which two (lowercased) quotes are most similar, and which two are least similar, with respect to Jaccard similarity applied to their characters?

<b>Hint:</b> Apply <code>JaccardSim()</code> to compute all pairwise similarities of the pairs of sets of characters derived from these quote strings.

In [20]:
A, B, C = set(sQuote1), set(sQuote2), set(sQuote3)
print(f'intersection: {A & B}')        # or A.intersection(B)
print(f'union: {A | B}')               # or A.union(B)

intersection: {'f', 'u', 'a', 'o', '.', ' ', 't', 'w', 'c', 'n', 'i', 'm', 's', 'e', 'l', 'r', 'p', 'h', 'd'}
union: {'u', 'f', ',', 'a', 'o', '.', 'y', ' ', 'b', 'M', 't', 'w', 'c', 'n', 'i', 'm', 'P', 's', 'e', 'g', 'l', 'r', 'p', 'h', 'd'}


In [24]:
SsQuote1, SsQuote2, SsQuote3 = set(sQuote1.lower()), set(sQuote2.lower()), set(sQuote3.lower())

print('Einstein vs Banach:', JaccardSim(SsQuote1, SsQuote2))
print('Einstein vs Devi:  ', JaccardSim(SsQuote1, SsQuote3))
print('Banach vs Devi:    ', JaccardSim(SsQuote2, SsQuote3))

Einstein vs Banach: 0.8260869565217391
Einstein vs Devi:   0.8076923076923077
Banach vs Devi:     0.8


<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
The quotes by Einstein and Banach are most similar because they yield the highest Jaccard similarity. Likewise, the quotes by Banach and Devi are least similar because they yield the lowest Jaccard similarity on their character sets.
            <pre>
SsQuote1, SsQuote2, SsQuote3 = set(sQuote1.lower()), set(sQuote2.lower()), set(sQuote3.lower())

print('Einstein vs Banach:', JaccardSim(SsQuote1, SsQuote2))
print('Einstein vs Devi:  ', JaccardSim(SsQuote1, SsQuote3))
print('Banach vs Devi:    ', JaccardSim(SsQuote2, SsQuote3))
</pre>
</details>
</font>
<hr>
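
With more than three strings, writing each pair by hand becomes tedious. The sketch below shows one way to compute all pairwise character-level similarities with `itertools.combinations`. It is an illustration rather than the course's solution: the helper names `jaccard` and `pairwise_jaccard` and the sample words are hypothetical, and returning 0.0 for two empty sets is an assumed convention.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets; assumed to be 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(texts):
    """Return {(name1, name2): similarity} for every unordered pair of texts."""
    char_sets = {name: set(text.lower()) for name, text in texts.items()}
    return {(n1, n2): jaccard(char_sets[n1], char_sets[n2])
            for n1, n2 in combinations(char_sets, 2)}

demo = {'w1': 'cat', 'w2': 'cats', 'w3': 'dogs'}   # hypothetical sample inputs
scores = pairwise_jaccard(demo)

most = max(scores, key=scores.get)    # pair with the highest similarity
least = min(scores, key=scores.get)   # pair with the lowest similarity
print(scores)
print('most similar:', most, '| least similar:', least)
```

Passing a dictionary such as `{'Einstein': sQuote1, 'Banach': sQuote2, 'Devi': sQuote3}` would apply the same comparison to the quotes above.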

## Task 2. Apply Jaccard similarity to words

Which two (lowercased) quotes are most similar, and which two are least similar, with respect to Jaccard similarity applied to their words?

<b>Hint:</b> Use <code>nltk.word_tokenize()</code> to split the sentences above into lists of lower cased words. Then use <code>JaccardSim()</code> to compute all pairwise similarities of the pairs of sets of words derived from these lists.

In [26]:
nltk.download('punkt_tab')
SsQuote1 = set(nltk.word_tokenize(sQuote1.lower()))
SsQuote2 = set(nltk.word_tokenize(sQuote2.lower()))
SsQuote3 = set(nltk.word_tokenize(sQuote3.lower()))

print('Einstein vs Banach:', JaccardSim(SsQuote1, SsQuote2))
print('Einstein vs Devi:  ', JaccardSim(SsQuote1, SsQuote3))
print('Banach vs Devi:    ', JaccardSim(SsQuote2, SsQuote3))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Einstein vs Banach: 0.25
Einstein vs Devi:   0.16
Banach vs Devi:     0.16666666666666666


<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
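The quotes by Einstein and Banach are most similar because they yield the highest Jaccard similarity. The quotes by Einstein and Devi are least similar because they yield the lowest Jaccard similarity on their word sets.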
            <pre>
SsQuote1 = set(nltk.word_tokenize(sQuote1.lower()))
SsQuote2 = set(nltk.word_tokenize(sQuote2.lower()))
SsQuote3 = set(nltk.word_tokenize(sQuote3.lower()))

print('Einstein vs Banach:', JaccardSim(SsQuote1, SsQuote2))
print('Einstein vs Devi:  ', JaccardSim(SsQuote1, SsQuote3))
print('Banach vs Devi:    ', JaccardSim(SsQuote2, SsQuote3))
</pre>
</details>
</font>
<hr>
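
If the NLTK tokenizer data is unavailable, a regular expression can serve as a rough stand-in for word tokenization. The sketch below is an assumption-laden illustration, not the course's solution: the helper `simple_tokenize` is hypothetical, and its pattern (runs of word characters, or single punctuation marks) will not match `nltk.word_tokenize()` in general, for example on contractions.

```python
import re

def simple_tokenize(text):
    """Split lowercased text into words and single punctuation marks (hypothetical helper)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

q1 = 'Pure mathematics is, in its way, the poetry of logical ideas.'
q2 = 'Mathematics is the most beautiful and most powerful creation of the human spirit.'

words1, words2 = set(simple_tokenize(q1)), set(simple_tokenize(q2))
shared = words1 & words2
sim = jaccard(words1, words2)

print(sorted(shared))
print(sim)
```

For these two quotes, the stand-in tokenizer yields the same word-level similarity as the result shown above (0.25), since the sentences contain only plain words and simple punctuation.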