## SOFT COSINE SIMILARITY FOR DOCUMENT MATCHING
***A soft cosine or ("soft" similarity) between two vectors considers similarities between pairs of features. The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in VSM, which help generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity.*** 

~ Wikipedia (https://en.wikipedia.org/wiki/Cosine_similarity)
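
To make the definition above concrete, here is a minimal sketch of the soft cosine measure in plain Python: the usual dot products are replaced by sums weighted with a term-similarity matrix `S`. The three-term vocabulary and the 0.8 similarity between "soldier" and "military" are made-up values for illustration, not numbers taken from the embedding model used later in this notebook.

```python
import math

def bilinear(a, b, S):
    """Return sum over i, j of S[i][j] * a[i] * b[j]."""
    return sum(S[i][j] * a[i] * b[j] for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b, S):
    """Soft cosine of vectors a and b under the term-similarity matrix S."""
    denom = math.sqrt(bilinear(a, a, S)) * math.sqrt(bilinear(b, b, S))
    return bilinear(a, b, S) / denom if denom else 0.0

# Hypothetical vocabulary: ["soldier", "military", "protest"]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # features treated as unrelated
S = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]         # "soldier" ~ "military" (assumed 0.8)

doc_a = [1, 0, 0]  # contains only "soldier"
doc_b = [0, 1, 0]  # contains only "military"

plain = soft_cosine(doc_a, doc_b, identity)  # with the identity matrix this is the ordinary cosine
soft = soft_cosine(doc_a, doc_b, S)
print(plain, soft)
```

With the identity matrix the two documents share no terms and score zero; with `S` they score as similar because their terms are related.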

In [1]:
import numpy as np
import pandas as pd

In [2]:
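from gensim.corpora import Dictionary
# Import path as used with gensim 3.x; in gensim 4.x WordEmbeddingSimilarityIndex is imported from gensim.similarities instead
from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests when filtering tokens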

***Read-in data***

The dataset used in this notebook comes from the African Conflicts dataset on Kaggle:

https://www.kaggle.com/jboysen/african-conflicts

In [7]:
#df = pd.read_csv(r'data/african_conflicts.csv', nrows=4000, encoding="ANSI")
df = pd.read_csv(r'C:\Users\joneszc\Documents\Python_Scripts\Spellcheck\data\african_conflicts.csv', nrows=2000, encoding="ANSI")

# Create a DataFrame that contains only the 'NOTES' column for text analysis.
# .copy() makes df_notes independent of df, so adding a column does not trigger SettingWithCopyWarning.
df_notes = df[['NOTES']].copy()
df_notes['index'] = df_notes.index
df_notes.head()



| | NOTES | index |
|---|---|---|
| 0 | A Berber student was shot while in police cust... | 0 |
| 1 | Riots were reported in numerous villages in Ka... | 1 |
| 2 | Students protested in the Amizour area. At lea... | 2 |
| 3 | Rioters threw molotov cocktails, rocks and bur... | 3 |
| 4 | Rioters threw molotov cocktails, rocks and bur... | 4 |


In [8]:
# Randomly sample 50% of the rows of df_notes into a separate DataFrame, df_test
df_test = df_notes.sample(frac=0.5)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 353 to 91
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   NOTES   1000 non-null   object
 1   index   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 23.4+ KB


In [9]:
# Drop rows with NaN values. dropna() keeps the original index labels (use reset_index(drop=True) to renumber).
# This uses the full df_notes; to work on the 50% sample instead, use: Df = df_test.dropna()
Df = df_notes.dropna()
#Df.head()

***Load a word embedding model. Rather than training our own with Gensim's Word2Vec, we can download a pretrained model:***

In [10]:
# To get word vectors we need a word embedding model, so we download a pretrained FastText model through gensim's downloader API.
import gensim.downloader as api

In [11]:
# Download the FastText model
# https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html
w2v_model = api.load('fasttext-wiki-news-subwords-300')

***Use Gensim's WordEmbeddingSimilarityIndex() to create a term similarity index for our word embedding***

In [12]:
# Prepare the similarity index

similarity_index = WordEmbeddingSimilarityIndex(w2v_model)  # w2v_model holds the pretrained FastText vectors loaded above
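
As a rough picture of what a term similarity index built from embeddings computes: for a pair of terms it takes the cosine similarity of their word vectors, discards values at or below a threshold, and raises the rest to an exponent. The sketch below mirrors that idea in plain Python. The 3-dimensional vectors are invented for illustration, and the threshold of 0.0 and exponent of 2.0 are what I understand gensim's defaults to be, so treat both as assumptions.

```python
import math

def cosine(u, v):
    """Ordinary cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def term_similarity(u, v, threshold=0.0, exponent=2.0):
    """Cosine similarity, zeroed at or below the threshold, then raised to the exponent."""
    sim = cosine(u, v)
    return sim ** exponent if sim > threshold else 0.0

# Made-up toy embeddings; real FastText vectors have 300 dimensions.
toy_vectors = {
    "soldier": [1.0, 0.0, 0.0],
    "military": [0.6, 0.8, 0.0],
    "wages": [0.0, 0.0, 1.0],
}

related = term_similarity(toy_vectors["soldier"], toy_vectors["military"])
unrelated = term_similarity(toy_vectors["soldier"], toy_vectors["wages"])
print(related, unrelated)
```

Squaring the cosine pushes weak similarities toward zero, so only clearly related terms contribute much to the document score.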

***Prepare the text data for numerical analysis***

In [13]:
# Prepare and clean the text documents in the 'NOTES' column
documents = Df['NOTES'].tolist()  # Convert the 'NOTES' column to a list of strings
documents_token = [x.lower().split() for x in documents]  # Lowercase and split on whitespace
documents_token = [[w for w in doc if w not in stop_words] for doc in documents_token]  # Remove stopwords
documents_token = [[w for w in doc if w.isalpha()] for doc in documents_token]  # Keep purely alphabetic tokens; drops numbers and any token with attached punctuation (e.g. 'cocktails,')

print(documents_token[0:7],"\n")
print("number of documents: ", len(documents_token))

[['berber', 'student', 'shot', 'police', 'custody', 'police', 'station', 'beni', 'later', 'died'], ['riots', 'reported', 'numerous', 'villages', 'resulting', 'dozens', 'wounded', 'clashes', 'protesters', 'police', 'significant', 'material'], ['students', 'protested', 'amizour', 'least', 'later', 'arrested', 'allegedly', 'insulting'], ['rioters', 'threw', 'molotov', 'rocks', 'burning', 'tires', 'gendarmerie', 'stations', 'beni'], ['rioters', 'threw', 'molotov', 'rocks', 'burning', 'tires', 'gendarmerie', 'stations', 'beni'], ['rioters', 'threw', 'molotov', 'rocks', 'burning', 'tires', 'gendarmerie', 'stations', 'beni'], ['protesters', 'attacked', 'gendarmerie', 'detachment', 'rocks', 'set', 'fire', 'two', 'gendarmerie', 'vehicle', 'well', 'registry', 'office', 'court']] 

number of documents:  2000
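
Because `isalpha()` rejects any token that still has punctuation attached, a word such as "cocktails," in the fourth note is dropped entirely, as the printed tokens show. One possible alternative, sketched here as an assumption rather than the method this notebook uses, strips punctuation before filtering and wraps the steps in a single helper so that documents and queries are cleaned the same way. The `preprocess` name and the tiny stopword set are hypothetical; in the notebook, the NLTK `stop_words` would be passed in. The sentence is modeled on the fourth note.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def preprocess(text, stop_words):
    """Lowercase, split on whitespace, strip punctuation, then drop stopwords and non-alphabetic tokens."""
    tokens = text.lower().split()
    tokens = [t.translate(_STRIP_PUNCT) for t in tokens]
    return [t for t in tokens if t.isalpha() and t not in stop_words]

demo_stop_words = {"and", "the", "a"}  # stand-in for the NLTK stopword list
sentence = "Rioters threw molotov cocktails, rocks and burning tires."
tokens = preprocess(sentence, demo_stop_words)
print(tokens)
```

One trade-off to keep in mind: deleting punctuation also joins hyphenated words and contractions into a single token, which may or may not be what you want.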


***Use Gensim to create a Dictionary from our documents, then use SparseTermSimilarityMatrix() to create a similarity matrix***

In [14]:
# From the tokenized documents, build a dictionary, bag-of-words vectors, and a term similarity matrix.
dictionary = Dictionary(documents_token)
bow_docs = [dictionary.doc2bow(document) for document in documents_token]
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary) # Use termsim_index for custom model
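
For readers unfamiliar with the bag-of-words format: `Dictionary` assigns each distinct token an integer id, and `doc2bow` turns a token list into (id, count) pairs, skipping tokens the dictionary has not seen. The plain-Python sketch below imitates that behaviour on two toy documents. `build_vocab` and `to_bow` are hypothetical helpers written for this illustration, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = [["police", "station", "police"], ["rioters", "police"]]
vocab = build_vocab(docs)
bow = to_bow(["police", "police", "unknownword"], vocab)
print(vocab, bow)
```

The unknown token simply disappears from the vector, which is why a query made only of out-of-vocabulary words cannot match anything.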

***Create and save the index matrix of Soft Cosine Similarity scores against our documents***

In [15]:
# Build the Soft Cosine Similarity index over the documents; num_best=10 limits each query to the 10 most similar documents
doc_similarity_index = SoftCosineSimilarity(bow_docs, similarity_matrix, num_best=10)
#doc_similarity_index.save('models/gensim_docsim_index')  # Optional: save the index to disk for later use

***Time to query the documents with new information and hopefully find a match***

In [23]:
# To use the docsim index, search a query string against it to find the most similar documents.

query = input("Query Entered: ")
query = query.lower().split()  # Lowercase and split on whitespace
query = [w for w in query if w not in stop_words]  # Remove stopwords
query = [w for w in query if w.isalpha()]  # Keep purely alphabetic tokens, as for the documents

print("\n")

# Each result is a (document position, similarity score) pair
similarities = doc_similarity_index[dictionary.doc2bow(query)]
results = [documents[i] for i, _ in similarities]
score_list = [score for _, score in similarities]

print('Input key terms : {}\n'.format(' '.join(query)))
print("\n")
for score, result in zip(score_list, results):
    print('{:.3f} : {}'.format(score, result))

Query Entered: Rweru Hill Military position


Input key terms : rweru hill military position



0.914 : A soldier from the Rweru Hill Military position was tied up by Imbonerakure and beaten severely. 
0.683 : 2 Congolese soldiers were killed in an attack by FRPI militiamen on their position in Kaswara village in Ituri. There was also an unconfirmed report of the death of a child of one of the soldiers. A military spokesman said the attack was actually an ambush of a patrol.
0.535 : Dozens of youth set up road blocks on roads leading to voting centres in Raffour and threw rocks at offices of the local military police to prevent people from voting in the national legislative elections. The action started the day before the elections, and continued on election day. They clashed with members of the military police on several occasions during the riot. At least 3 were arrested.
0.535 : Dozens of youth set up road blocks on roads leading to voting centres in Raffour and threw rocks at offic
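
If none of the query terms appear in the dictionary, `doc2bow` returns an empty vector and the index may come back with no matches, in which case the printing loop silently shows nothing. A small guard makes that case explicit. The `format_results` helper below is a hypothetical addition; it assumes `similarities` is a list of (document position, score) pairs, which is the shape the query code unpacks. The scores in the demo are made up.

```python
def format_results(similarities, documents):
    """Return printable 'score : text' lines, or a notice when nothing matched."""
    if not similarities:
        return ["No matching documents found."]
    return ["{:.3f} : {}".format(score, documents[i]) for i, score in similarities]

demo_documents = [
    "LRA invades village, abducts civilians",
    "Municipal workers protest late payment of wages",
]

lines = format_results([(1, 0.9), (0, 0.25)], demo_documents)  # illustrative scores only
empty = format_results([], demo_documents)

for line in lines + empty:
    print(line)
```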



LRA leaving peace talks in Juba and regrouping in Garamba.

LRA invades village, abducts civilians

Municipal workers protest late payment of wages