Today we're going to use a distance measure, Euclidean distance, to retrieve the most similar documents. The idea is that we can represent the content, the style, or the sentiment of a text as a numeric vector. We can then measure the similarity between these vectors as a way to measure the similarity between the texts they represent. Instead of using a text classifier to distinguish between classes, we'll just find the most similar examples.
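
To make the distance idea concrete before we touch the real data, here is a minimal sketch of Euclidean distance on three toy vectors. The numbers and the `euclidean` helper are made up for illustration and are not part of the *text_analytics* package.

```python
import math

# Three toy feature vectors (made-up numbers, not from the lab's data).
doc_a = [3.0, 0.0, 1.0]
doc_b = [2.0, 1.0, 1.0]
doc_c = [0.0, 5.0, 4.0]

def euclidean(u, v):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_ab = euclidean(doc_a, doc_b)
d_ac = euclidean(doc_a, doc_c)
print(d_ab, d_ac)
```

The smaller the distance, the more similar the two texts are taken to be, so here `doc_b` counts as closer to `doc_a` than `doc_c` does.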

Let's start by getting our environment ready.

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd

ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


We're going to start by looking at author-based similarity, using style features. So, let's load our data from Project Gutenberg. We don't need the whole dataset, so we only load part of it.

In [2]:
file = os.path.join(ai.data_dir, "stylistics.authorship_1850.gz")
df = pd.read_csv(file, nrows = 2000)
print(df)

         Author                Title  \
0      abbott_j  alexander_the_great   
1      abbott_j  alexander_the_great   
2      abbott_j  alexander_the_great   
3      abbott_j  alexander_the_great   
4      abbott_j  alexander_the_great   
...         ...                  ...   
1995  bennett_a   the_old_wives_tale   
1996  bennett_a   the_old_wives_tale   
1997  bennett_a   the_old_wives_tale   
1998  bennett_a   the_old_wives_tale   
1999  bennett_a   the_old_wives_tale   

                                                   Text  
0     note project gutenberg also has an html versio...  
1     it will be recollected to epirus where her fri...  
2     it would be best to endeavor to effect a landi...  
3     transport his army across the straits the army...  
4     that the true greatness of the soul of alexand...  
...                                                 ...  
1995  the stipendiary achieved marvellously the illu...  
1996  winter overcoat he directed the vast affair of...

Now, we're going to get ready to find the most similar samples. We're doing two things here: (1) we're transforming the text into vectors representing style and (2) we're choosing a random sample to look at. This means we'll randomly select one chunk and then find the samples most similar to it.

In [3]:
import random

x, vocab_size = ai.get_features(df, "style")
sample = random.randint(0, len(df) - 1)  # randint includes both endpoints, so stop at the last row
print(x)
print(sample)

  (0, 0)	338
  (0, 1)	273
  (0, 2)	256
  (0, 3)	125
  (0, 4)	88
  (0, 5)	121
  (0, 6)	1
  (0, 7)	115
  (0, 8)	48
  (0, 9)	102
  (0, 10)	25
  (0, 11)	118
  (0, 13)	62
  (0, 14)	28
  (0, 15)	32
  (0, 16)	33
  (0, 17)	18
  (0, 18)	8
  (0, 19)	16
  (0, 20)	16
  (0, 21)	35
  (0, 22)	34
  (0, 23)	4
  (0, 24)	27
  (0, 25)	13
  :	:
  (1999, 209)	2
  (1999, 210)	1
  (1999, 212)	2
  (1999, 215)	1
  (1999, 218)	2
  (1999, 220)	1
  (1999, 221)	4
  (1999, 222)	2
  (1999, 224)	3
  (1999, 225)	1
  (1999, 232)	3
  (1999, 236)	1
  (1999, 237)	1
  (1999, 239)	1
  (1999, 242)	1
  (1999, 244)	1
  (1999, 250)	3
  (1999, 252)	4
  (1999, 253)	1
  (1999, 278)	1
  (1999, 281)	2
  (1999, 285)	1
  (1999, 295)	2
  (1999, 301)	5
  (1999, 303)	1
983
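
Each `(row, column)  value` line in the output above is one non-zero count in a sparse matrix: the text chunk, the feature, and how often that feature occurs. We haven't looked at how *get_features* builds these counts, so the sketch below is only an assumption about one common approach to style features, counting function words. The word list and the `style_vector` helper are hypothetical.

```python
from collections import Counter

# Hypothetical, tiny function-word list; a real style model would use many more.
FUNCTION_WORDS = ["the", "of", "and", "to", "in"]

def style_vector(text):
    """Count how often each function word occurs in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in FUNCTION_WORDS]

vec = style_vector("the army of alexander marched to the sea and to the hills")
print(vec)
```

Counts like these say little about what a text is about, but they tend to differ from author to author, which is why they're useful for style.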


So now we know what we're looking for. Here's the code. As always, if you want to dig deeper, take a look under the hood in our *text_analytics* package. We're using the *linguistic_distance* function to find the 5 other samples that are most similar to ours. 

The *x* variable passes in our linguistic features.

The *y* variable passes in our metadata (the author names).

The *sample* variable tells the function which chunk to use for the similarity search. We can rerun the code block below to generate as many new samples as we'd like.

And the *n* variable tells it how many similar examples to find.

In [4]:
sample = random.randint(0, len(df) - 1)  # randint includes both endpoints, so stop at the last row
y_sample, y_closest = ai.linguistic_distance(x = x, y = df.loc[:,"Author"].values, sample = sample, n = 5)
print(y_sample, y_closest)

altsheler_j ['altsheler_j', 'altsheler_j', 'altsheler_j', 'altsheler_j', 'altsheler_j']
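
If you're curious what a search like this might look like, here is a minimal sketch on toy data. It is not the package's implementation: the `closest_labels` and `euclidean` helpers and the toy vectors are assumptions for illustration, using plain Python lists rather than a sparse matrix.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_labels(vectors, labels, sample, n=5):
    """Return the sample's label and the labels of its n nearest neighbours."""
    target = vectors[sample]
    # Distance from the sample to every other row (the sample itself is skipped).
    distances = [
        (euclidean(target, row), i)
        for i, row in enumerate(vectors)
        if i != sample
    ]
    distances.sort()  # smallest distance first; ties fall back to row order
    return labels[sample], [labels[i] for _, i in distances[:n]]

toy_vectors = [[1, 0], [1, 1], [9, 9], [8, 9], [2, 0]]
toy_labels = ["a", "a", "b", "b", "a"]
result = closest_labels(toy_vectors, toy_labels, sample=0, n=2)
print(result)
```

Notice that the labels are never used to decide which rows are closest. They're only looked up afterwards so we can read the result.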


Let's do it again, with content features. This time we're going to start by loading our pre-trained TF-IDF vectorizer.

In [5]:
file_phrases = os.path.join(ai.data_dir, "sociolinguistics.english_all.gz")
ai.phrases = ai.deserialize("phrases", file_phrases + ".phrases.json")
ai.tfidf_vectorizer = ai.deserialize("tfidf_model", file_phrases + ".tfidf.json")

print("Done!")

Done!
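
As a reminder of what TF-IDF does, here is a minimal sketch of one textbook variant: each word's count in a document is multiplied by the log of how rare that word is across all documents. The toy documents and the `tfidf` helper are made up for illustration, and the pre-trained vectorizer we just loaded may well use a different weighting or smoothing.

```python
import math
from collections import Counter

docs = [
    "flu levels rise in atlanta",
    "roads and bridges in lagos",
    "flu season in lagos",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Weight each word by its count times log(N / document frequency)."""
    counts = Counter(tokens)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}

weights = tfidf(tokenized[0])
print(weights)
```

A word that shows up in every document (like "in") gets a weight of zero, while a word that shows up in only one document gets the highest weight. That's what makes these vectors reflect content.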


Now we can repeat the same code, this time using *content* features. And we'll look at our tweets from different cities.

In [6]:
file = os.path.join(ai.data_dir, "sociolinguistics.english_cities.gz")
df = pd.read_csv(file, nrows = 2000)
print(df)

x, vocab_size = ai.get_features(df, "content")

            City                                               Text
0     washington   you really need to go back to bar tending or ...
1         london   jay finley christ in explains why today is co...
2          lagos   forget if this happened truly it s definitely...
3        toronto   yall i love this skin big thanks to for makin...
4        nairobi   the late brilliant prof ali mazrui explains h...
...          ...                                                ...
1995     atlanta   according to cdc s latest levels of u s flu l...
1996       lagos   list of the roads and bridges that buhari is ...
1997     calgary   instead of condemning the assault of by a uni...
1998     phoenix   also vexing how can we use the experience of ...
1999  washington   just landed in the united kingdom heading to ...

[2000 rows x 2 columns]


In [7]:
sample = random.randint(0, len(df) - 1)  # randint includes both endpoints, so stop at the last row

y_sample, y_closest = ai.linguistic_distance(x = x, y = df.loc[:,"City"].values, sample = sample, n = 5)
print(y_sample, y_closest)

johannesburg ['atlanta', 'johannesburg', 'lagos', 'johannesburg', 'johannesburg']


And there you go! 

In this lab, we've seen how to retrieve the most similar texts using a simple distance measure. We've looked at author style (using books from the 19th century) and content (using tweets), and the final cells below repeat the search with sentiment (using hotel reviews). The basic idea here is that texts can be similar to one another in these three different ways.

Do distance metrics give results as accurate as the classifiers we've used? Probably not. But remember that they're a lot simpler and they are unsupervised: the search compares feature vectors directly, so there's no classifier to train on labeled data.
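
One rough way to compare a similarity search with a classifier is to treat the neighbours as votes. The sketch below takes the neighbour labels from the city example above and computes the majority label and the share of neighbours that match the sample. The `majority_label` helper is hypothetical and is not part of the *text_analytics* package.

```python
from collections import Counter

def majority_label(neighbour_labels):
    """Return the most common label among the retrieved neighbours."""
    return Counter(neighbour_labels).most_common(1)[0][0]

# Labels copied from the city example above.
y_sample = "johannesburg"
y_closest = ["atlanta", "johannesburg", "lagos", "johannesburg", "johannesburg"]

predicted = majority_label(y_closest)
share = y_closest.count(y_sample) / len(y_closest)
print(predicted, share)
```

Repeating this over many random samples and averaging the share would give a simple accuracy-style number to set beside a classifier's score.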

In [8]:
file = os.path.join(ai.data_dir, "economic.hotels_as_reviews.gz")
df = pd.read_csv(file, nrows = 5000)
print(df)

x, vocab_size = ai.get_features(df, "sentiment")

                                       Hotel Rating  \
0                 11th Avenue Hotel & Hostel    low   
1                                3 West Club   high   
2                                  414 Hotel   high   
3     70 park avenue hotel - a Kimpton Hotel   high   
4       A Victory Inn & Suites Phoenix North    low   
...                                      ...    ...   
4995                         The Lenox Hotel   high   
4996                    The Listel Vancouver   high   
4997                         The Loden Hotel   high   
4998                            The Lombardy   high   
4999                      The Lonsdale Hotel    low   

                                                   Text  
0     This hostel is in a very good location, close ...  
1     We had 5 nights here and were unsure as to wha...  
2     This is a small boutique hotel with a nice int...  
3     I stayed at 70 Park Ave Hotel the night before...  
4     I made a reservation. Cancelled 2 hours lat

In [9]:
import random
sample = random.randint(0, len(df) - 1)  # randint includes both endpoints, so stop at the last row

y_sample, y_closest = ai.linguistic_distance(x = x, y = df.loc[:,"Rating"].values, sample = sample, n = 5)
print(y_sample, y_closest)

low ['low', 'low', 'low', 'low', 'low']
