## Ex 3: Cosine similarity for sentences

Read a [blog post about cosine similarity](https://kanoki.org/2018/12/27/text-matching-cosine-similarity/). In this exercise we will implement the same analysis for the text of Anna Karenina.

a) First, install the [pandas library](https://pandas.pydata.org/). Then take a look at the 'Numerical Feature Vectors' section of the blog post and create a corpus variable from the Anna Karenina sentences. Use nltk's `sent_tokenize` function from Exercise set 1, exercise 3. Take only the first 500 sentences to reduce computation time. Note: you may need to adjust `nlp.max_length`. 1.5p

b) When the corpus variable is ready, implement the TF-IDF transformation from the blog post and then the cosine similarity. 1.5p

c) Using two for-loops, find which sentences have the highest cosine similarity. TF-IDF vectors have no negative components, so their cosine similarity lies in the range [0, 1] and the highest possible value is 1. You should first get a result where the highest cosine similarity comes from comparing a sentence with itself (cosine similarity 1). Then add code that ignores the case where the two sentences are the same. 2p

In [79]:
import pandas as pd

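# Read the whole novel; the context manager closes the file afterwards
with open("anna_karenina.txt", "r") as file:
    f = file.read()
# Replace newlines with spaces so that line breaks do not split sentences
f = f.replace("\n", " ")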


In [80]:
from itertools import islice

from spacy.lang.en import English

# Blank English pipeline with only a rule-based sentence splitter
nlp = English()
nlp.add_pipe("sentencizer")
# Raise the limit to the length of the document (1984305 characters here)
nlp.max_length = len(f)

doc = nlp(f)
# Keep the first 500 sentences as plain strings
sentences = [str(sentence) for sentence in islice(doc.sents, 500)]


In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer

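vectorizer = TfidfVectorizer()

# Sparse matrix with one row per sentence and one column per vocabulary term
trsfm = vectorizer.fit_transform(sentences)

# Dense view for inspection; get_feature_names_out() requires scikit-learn >= 1.0
pd.DataFrame(trsfm.toarray(), columns=vectorizer.get_feature_names_out())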

,01,1399,1998,2020,22,_come,_her_,_il,_kammerjunker_,_stranger_,...,yesterday,yet,you,young,younger,youngest,your,yourself,youth,zahar
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
1,0.117145,0.117145,0.117145,0.117145,0.117145,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.055479,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.159908,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
496,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
497,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.408903,0.0,0.0,0.0,0.000000,0.431702,0.0,0.0
498,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.223639,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0


In [82]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(trsfm)
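
For intuition, each entry `cos_sim[i, j]` is the dot product of rows `i` and `j` divided by the product of their Euclidean norms. The block below is a minimal sketch of that formula on a made-up 3×3 matrix. The array `toy` and the helper `cosine_matrix` are illustrative names that are not part of the exercise, and the sketch is not meant to describe how scikit-learn computes the result internally.

```python
import numpy as np

def cosine_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    rows = np.asarray(rows, dtype=float)
    # Length of every row vector, kept as a column for broadcasting
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / norms  # assumes that no row is all zeros
    # Dot products of unit vectors are the cosines of the angles between them
    return unit @ unit.T

# Made-up data: rows 0 and 1 point in the same direction, row 2 is orthogonal
toy = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
])
sim = cosine_matrix(toy)
print(np.round(sim, 3))
```

Rows that point in the same direction score 1 regardless of their length, which is why a sentence compared with itself, or with an exact duplicate, reaches the maximum.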

In [83]:
# Print every pair of sentence indices whose cosine similarity is exactly 1.0

def same():
    biggest_similarity_same = []

    for num, sent in enumerate(cos_sim):
        for n, sen in enumerate(sent):
            if sen == 1.0:
                biggest_similarity_same.append((num, n))

    print(biggest_similarity_same)

same()

[(0, 0), (2, 2), (4, 4), (9, 9), (17, 17), (18, 18), (19, 19), (20, 20), (21, 21), (24, 24), (24, 455), (25, 25), (29, 29), (30, 30), (31, 31), (31, 73), (35, 35), (36, 36), (38, 38), (43, 43), (44, 44), (45, 45), (46, 46), (50, 50), (54, 54), (55, 55), (56, 56), (58, 58), (59, 59), (60, 60), (60, 288), (62, 62), (66, 66), (68, 68), (69, 69), (71, 71), (73, 31), (73, 73), (75, 75), (77, 77), (78, 78), (82, 82), (83, 83), (85, 85), (88, 88), (91, 91), (92, 92), (94, 94), (95, 95), (97, 97), (100, 100), (101, 101), (110, 110), (111, 111), (114, 114), (115, 115), (116, 116), (118, 118), (119, 119), (119, 316), (122, 122), (125, 125), (126, 126), (127, 127), (128, 128), (130, 130), (131, 131), (133, 133), (134, 134), (136, 136), (137, 137), (138, 138), (142, 142), (144, 144), (145, 145), (148, 148), (149, 149), (155, 155), (158, 158), (159, 159), (161, 161), (165, 165), (170, 170), (171, 171), (174, 174), (175, 175), (176, 176), (179, 179), (180, 180), (181, 181), (183, 183), (184, 184), (

In [84]:
# For each sentence, print the index of its best match together with the similarity value.
# This shows why not all 500 sentences appeared in the output above: many self-similarities
# differ from 1.0 by a floating-point rounding error.

def all_same():
    biggest_similarity_allsame = []

    for num, sent in enumerate(cos_sim):
        similarity = 0 # variable for storing the biggest cosine similarity
        sents = tuple() # variable for storing the indexes of the sentences
        for n, sen in enumerate(sent):
            if sen > similarity: # find the biggest similarity in each sentence
                similarity = sen
                sents = (num, n, sen)
        biggest_similarity_allsame.append(sents)

    for result in biggest_similarity_allsame:
        print(result)

all_same()

(0, 0, 1.0)
(1, 1, 1.0000000000000002)
(2, 2, 1.0)
(3, 3, 1.0000000000000002)
(4, 4, 1.0)
(5, 5, 1.0000000000000004)
(6, 6, 1.0000000000000004)
(7, 7, 1.0000000000000002)
(8, 8, 1.0000000000000002)
(9, 9, 1.0)
(10, 10, 1.0000000000000002)
(11, 11, 1.0000000000000002)
(12, 12, 1.0000000000000002)
(13, 13, 1.0000000000000002)
(14, 14, 1.0000000000000002)
(15, 15, 1.0000000000000002)
(16, 16, 1.0000000000000004)
(17, 17, 1.0)
(18, 18, 1.0)
(19, 19, 1.0)
(20, 20, 1.0)
(21, 21, 1.0)
(22, 22, 0.9999999999999999)
(23, 23, 0.9999999999999998)
(24, 24, 1.0)
(25, 25, 1.0)
(26, 26, 1.0000000000000002)
(27, 27, 1.0000000000000002)
(28, 28, 1.0000000000000002)
(29, 29, 1.0)
(30, 30, 1.0)
(31, 31, 1.0)
(32, 32, 1.0000000000000004)
(33, 33, 1.0000000000000002)
(34, 34, 1.0000000000000002)
(35, 35, 1.0)
(36, 36, 1.0)
(37, 37, 1.0000000000000002)
(38, 38, 1.0)
(39, 39, 1.0000000000000004)
(40, 40, 1.0000000000000002)
(41, 41, 1.000000000000001)
(42, 42, 1.0000000000000002)
(43, 43, 1.0)
(44, 44, 1.0)
(
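
Values such as `1.0000000000000002` are rounding artefacts, so an exact `== 1.0` test misses them. One possible way around this is a tolerant comparison with `np.isclose`. The sketch below uses a small made-up matrix rather than the real `cos_sim`; the helper name `near_one_pairs` and the tolerance `1e-9` are assumptions chosen for illustration.

```python
import numpy as np

def near_one_pairs(matrix, atol=1e-9):
    """Return (row, column) index pairs whose value is within atol of 1.0."""
    mask = np.isclose(matrix, 1.0, rtol=0.0, atol=atol)
    rows, cols = np.nonzero(mask)
    return list(zip(rows.tolist(), cols.tolist()))

# Made-up vectors: row 2 is row 0 scaled by two, so the two are parallel
vectors = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [2.0, 4.0, 6.0],
])
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
toy_sim = unit @ unit.T

pairs = near_one_pairs(toy_sim)
print(pairs)
```

With a tolerance, every diagonal entry is reported, along with parallel rows such as the duplicated sentences seen in the earlier output.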

In [85]:
# Skip identical sentences and print the pair of sentences with the highest cosine similarity.
# As shown above, identical sentences do not always score exactly 1.0, so the cutoff is 0.9 instead of 1.
# Note that this also skips any pair of different sentences scoring 0.9 or higher.

def two_biggest():
    similarity = 0
    sents = tuple()

    for num, sent in enumerate(cos_sim):
        for n, sen in enumerate(sent):
            if sen > similarity and sen < 0.9:
                similarity = sen
                sents = (num, n)

    print('Sentences:', sents, '\nCosine similarity:', similarity)
    
two_biggest()

Sentences: (46, 74) 
Cosine similarity: 0.8968316867493057
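
The same search can also be written without explicit loops. The following is a rough sketch that assumes a square similarity matrix and uses a made-up `toy_sim` in place of `cos_sim`; the helper `most_similar_pair` is a hypothetical name. It behaves differently from the 0.9 cutoff: only the diagonal (a sentence compared with itself) is excluded, so duplicated sentences at different positions would still be reported.

```python
import numpy as np

def most_similar_pair(sim):
    """Return (i, j, value) for the largest off-diagonal entry of sim."""
    masked = np.array(sim, dtype=float)  # work on a copy, leave the input intact
    np.fill_diagonal(masked, -np.inf)    # rule out self-comparisons
    # argmax works on the flattened array, so convert back to 2-D indices
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    return int(i), int(j), float(masked[i, j])

# Made-up symmetric similarity matrix
toy_sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
])
best = most_similar_pair(toy_sim)
print(best)
```

Because the matrix is symmetric, each pair appears twice; `np.argmax` returns the first occurrence in row-major order.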
