# Project Text Mining (Part 2) 

This notebook was written as the Text Mining project for the Master Data Mining at Université Lyon 2.

## Objective:

The main goal is to practice the concepts seen in class and apply text mining techniques to a dataset.

## Plan:

This notebook covers the following section:

$\rightarrow$ Building a search engine


**Owners**: Lia Furtado and Hugo Vinision 


---

In [93]:
import json
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
from transformers import BertTokenizer, TFBertModel
import os

In [94]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re
import nltk 
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from nltk.corpus import stopwords
import unicodedata
from collections import Counter
import spacy
import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
import math
import pickle

 
---
## Exercise 3: Building a search engine


#### Keyword Search Engine
The goal of this section is to build a search engine that takes a few keywords and returns the documents most similar to them. Similarity is measured with the cosine similarity metric.

$\rightarrow$ **Cosine Similarity**

This measure captures how close two vectors are in direction: it is the cosine of the angle between them, that is, the dot product of the vectors divided by the product of their lengths.

![image.png](attachment:image.png)
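
The definition above can be written as $\cos(\theta) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. As a minimal sketch of that formula, the snippet below computes it with the standard library only. The function name `cosine_similarity_manual` and the zero-vector convention are illustrative assumptions; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine_similarity_manual(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score close to 1, orthogonal vectors score 0
same_direction = cosine_similarity_manual([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity_manual([1, 0], [0, 1])
print(same_direction, orthogonal)
```

Because only the angle matters, a long document and a short query can still score highly when they use the same terms in similar proportions.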

Additionally, we tuned the vectorizer parameters to see how the results change.
We experimented with:

* Several different document embedding techniques 

#### Title Search Engine

We also built a second search engine: given the title of an article, it proposes the articles most similar to that article.
We use the following information to compare the articles:

* Text content, compared with cosine similarity
* Authors who co-wrote the paper
* Articles published in the same conferences 
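
The author and venue comparisons listed above can be sketched with plain set operations. This is only an illustration under stated assumptions: the dictionary layout, the field names `authors` and `venue`, the sample articles, and the function `shared_metadata` are all hypothetical and do not come from the dataset.

```python
def shared_metadata(query_article, candidate):
    """Report which authors and venue a candidate shares with the query article."""
    common_authors = sorted(set(query_article["authors"]) & set(candidate["authors"]))
    same_venue = query_article["venue"] == candidate["venue"]
    return {"common_authors": common_authors, "same_venue": same_venue}

# Hypothetical records, invented for this example
query_article = {"title": "A", "authors": ["ana silva", "li wei"], "venue": "icdm"}
candidate = {"title": "B", "authors": ["li wei", "john doe"], "venue": "kdd"}

info = shared_metadata(query_article, candidate)
print(info)
```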



In [153]:
data = pd.read_csv('dblp_2016_cleaned.csv')

In [154]:
# The 40 most used words in the bag of words; they are too common to help the search
ERASE_WORDS = {'based','data', 'proposed', 'paper', 'model','method','results','time','algorithm','using','problem',
               'two', 'system','performance','approach','network','show','also','information','analysis','new',
               'used','systems', 'different','study','methods','networks','number','one','order','set','algorithms',
               'high','control','models','propose','learning','use','image','problems'}

def clean_words(msg):
    """Return msg with the most frequent corpus words removed."""
    return " ".join(token for token in word_tokenize(msg) if token not in ERASE_WORDS)
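
The cell above hard-codes the 40 most used words. As a rough sketch of how such a list could be derived, the snippet below counts tokens with `collections.Counter`. It is an assumption-laden illustration: it splits on whitespace instead of using `word_tokenize`, and the helper names and toy corpus are hypothetical.

```python
from collections import Counter

def most_frequent_words(documents, k):
    """Return the k most frequent whitespace-separated tokens across documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [word for word, _ in counts.most_common(k)]

def remove_words(text, words_to_erase):
    """Drop every token of text that appears in words_to_erase."""
    erase = set(words_to_erase)
    return " ".join(tok for tok in text.split() if tok not in erase)

toy_corpus = [
    "data model for stream data",
    "data driven model",
    "fuzzy decision tree",
]
frequent = most_frequent_words(toy_corpus, k=2)
print(frequent)
print(remove_words("data driven decision tree model", frequent))
```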

### Keyword search engine

**Finding the documents most similar to the keywords**

In [155]:
#Setting the query
query = ["decision", "tree", "classification"]

In [156]:
def fit_method(method, data):
    """Fit (or load) the vectorizer for the requested embedding method."""
    if method == "tfidf":
        print("Training  TF-IDF...")
        # min_df=2 drops terms that appear in fewer than two documents
        vectorizer = TfidfVectorizer(min_df=2)
        vectorizer.fit(data)
        print("Done")

    elif method == 'doc2vec':
        if not os.path.exists('do2vec_tokenizer.pkl'):
            text_clean_tokenized = data.apply(word_tokenize)

            print("Training  Doc2Vec...")
            # One TaggedDocument per row; tags are 1-based strings
            tagged_docs = [TaggedDocument(words=list_tokens, tags=[str(i + 1)])
                           for i, list_tokens in enumerate(text_clean_tokenized)]

            vectorizer = Doc2Vec(vector_size=100, window=10, min_count=5, workers=11, alpha=0.025)
            vectorizer.build_vocab(tagged_docs)
            vectorizer.train(tagged_docs, total_examples=vectorizer.corpus_count, epochs=20)
            print("Saving Doc2Vec Model")
            with open('do2vec_tokenizer.pkl', 'wb') as f:
                pickle.dump(vectorizer, f)
        else:
            print("Loading Doc2Vec...")
            with open('do2vec_tokenizer.pkl', 'rb') as f:
                vectorizer = pickle.load(f)
        print("Done")

    elif method == 'bert':
        print("Loading BERT Model...")
        # Only the tokenizer is loaded here; the model itself is loaded in encode_text
        vectorizer = BertTokenizer.from_pretrained('bert-base-uncased')

    else:
        # Fail early instead of returning an undefined vectorizer
        raise ValueError(f"Unknown method: {method}")

    return vectorizer
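
The TF-IDF branch above uses `min_df=2`, which keeps only terms found in at least two documents. The sketch below illustrates that filtering rule with the standard library. It is not scikit-learn's implementation, and the helper names and toy documents are assumptions made for the example.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df

def build_vocabulary(tokenized_docs, min_df=2):
    """Keep only terms whose document frequency is at least min_df."""
    df = document_frequency(tokenized_docs)
    return sorted(term for term, count in df.items() if count >= min_df)

docs = [
    "decision tree classification".split(),
    "fuzzy decision tree".split(),
    "stream classification".split(),
]
vocabulary = build_vocabulary(docs, min_df=2)
print(vocabulary)
```

Terms seen in a single document (`fuzzy` and `stream` here) are dropped, which shrinks the vocabulary and removes many one-off tokens.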


In [165]:
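def encode_text_bert(method, index, query, vectorizer):
    """Return the BERT vector of the document at position index."""
    if not os.path.exists('bert_vectors'):
        # No cached vectors: encode the document text itself, not the query
        vector = encode_text(method, [data['text_clean'].iloc[index]], vectorizer)
    else:
        # Cached vectors are indexed by document position
        with open('bert_vectors', 'rb') as f:
            bert_vectors = np.load(f, allow_pickle=True)
        vector = bert_vectors[index, :]
    return vector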

In [166]:
def encode_text_doc2vec(index, vectorizer):
    vector  = vectorizer.dv[index]
    return vector

In [171]:
_bert_model = None  # TFBertModel is loaded once, on first use, then reused

def encode_text(method, text, vectorizer):
    """Turn a list of words or strings into a vector with the given method."""
    global _bert_model
    query_string = " ".join(text)

    if method == "tfidf":
        query_cleaned_string = clean_words(query_string)
        # Project the query into the vector space learned by the vectorizer
        vector = vectorizer.transform([query_cleaned_string])
    elif method == 'doc2vec':
        # Tokenize the query and infer its vector with the trained Doc2Vec model
        vector = vectorizer.infer_vector(word_tokenize(query_string))
    elif method == 'bert':
        if _bert_model is None:
            _bert_model = TFBertModel.from_pretrained("bert-base-uncased")
        # truncation=True keeps long documents within the model's maximum input length
        encoded_input = vectorizer(query_string, return_tensors='tf', truncation=True)
        output = _bert_model(encoded_input)
        bert_output = output.last_hidden_state.numpy()
        # Mean-pool over the token axis: (1, tokens, 768) -> (1, 768)
        vector = bert_output.mean(axis=1)
    else:
        raise ValueError(f"Unknown method: {method}")

    return vector
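
The BERT branch above turns one vector per token into a single document vector by averaging over the token axis. A minimal sketch of that mean pooling, using plain lists instead of tensors; the function name `mean_pool` and the 3-dimensional toy vectors are assumptions for illustration (the real hidden states have 768 dimensions).

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n_tokens = len(token_vectors)
    # zip(*...) groups the values of each dimension across all tokens
    return [sum(column) / n_tokens for column in zip(*token_vectors)]

tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
doc_vector = mean_pool(tokens)
print(doc_vector)
```

The pooled vector has the same dimensionality whatever the number of tokens, which is what makes documents of different lengths comparable with cosine similarity.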


In [172]:
# Compute the cosine similarity between two vectors
def get_cosine_similarity(feature_vec_1, feature_vec_2):    
    return cosine_similarity(feature_vec_1.reshape(1, -1), feature_vec_2.reshape(1, -1))[0][0]

In [173]:
def test_query_cossine_similarity(query, vectorizer, method):
    best_similarity = 0
    best_index = None
    query_vec = encode_text(method, query, vectorizer)
    most_similars_doc = []
    # Compare the query with every document in the dataset
    for index, row in data.iterrows():
        if method == 'bert':
            document_vec = encode_text_bert(method, index, query, vectorizer)
        elif method == 'doc2vec':
            document_vec = encode_text_doc2vec(index, vectorizer)
        else:
            document_vec = encode_text(method, [data['text_clean'].iloc[index]], vectorizer)

        similarity = get_cosine_similarity(document_vec, query_vec)
        # Keep track of the document with the highest cosine similarity so far
        if similarity > best_similarity:
            best_similarity = similarity
            best_index = index

    # best_index stays None when no document has a positive similarity
    if best_index is not None:
        most_similars_doc.append({"index": best_index, "document": data['text_clean'].iloc[best_index],
                                  "cossine_similarity": best_similarity})

    return most_similars_doc
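
The function above keeps a single best match, while the notebook's comments also mention keeping every document whose cosine similarity exceeds 0.45. A minimal sketch of that threshold-based variant, assuming the similarity scores have already been computed; the function name `filter_by_threshold` and the sample scores are hypothetical.

```python
def filter_by_threshold(scores, threshold=0.45):
    """Return (index, score) pairs strictly above threshold, best match first."""
    hits = [(i, s) for i, s in enumerate(scores) if s > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Made-up similarity scores, one per document
similarities = [0.12, 0.46, 0.50, 0.45, 0.03]
matches = filter_by_threshold(similarities)
print(matches)
```

A fixed threshold suits sparse TF-IDF vectors reasonably well, but a value chosen for one embedding does not necessarily transfer to another, so it may need to be tuned per method.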
    

In [151]:
# Get the documents most similar to the query
vectorizer = fit_method("tfidf", data['text_clean'])
most_similars_doc = test_query_cossine_similarity(query, vectorizer, "tfidf" )
most_similars_df = pd.DataFrame.from_dict(most_similars_doc, orient='columns')
most_similars_df.style.highlight_max(subset= ['cossine_similarity'] , color = 'green')

Training  TF-IDF...
Done
Query vec
  (0, 17446)	0.6425180639897455
  (0, 4077)	0.5442665378163174
  (0, 2608)	0.5393926892907508


| | index | document | cossine_similarity |
|---|---|---|---|
| 0 | 2 | classification datasets mining decision tree important induction research mining mainly classification prediction widely decision tree far shortcoming inclining choose attributes many values discussed decision tree improved version attributes divided groups apply selection measure groups gain good divide attributes values groups steps done get good classification misclassification ratio classify sets accurately efficiently | 0.498959 |
| 1 | 990 | mflexdt multi flexible fuzzy decision tree stream classification many real world applications instances arrive sequentially form streams processing poses challenges machine adhering line strategies extend flexible fuzzy decision tree flexdt multiple partitioning makes possible carry automatic line fuzzy classification aimed balance accuracy tree size stream mining objective classification predict true class incoming instances real terms evaluation accuracy tree depth significant factors influencing series experiments demonstrate produces optimal trees numeric nominal features variables | 0.464133 |


**Comments:** `The dataframe shows the documents whose cosine similarity with the keywords is greater than 0.45, together with their main text and similarity score.`

#### Compare several different document embedding techniques 

In [162]:
#Setting the query
query = ["decision", "tree", "classification"]

In [174]:
methods = ["tfidf", "doc2vec", "bert"]

results = pd.DataFrame(columns = ['index', 'document', 'cossine_similarity', 'vectorizer'])

# Run the keyword search once per vectorizer and collect the results
for method in methods:
    print('Calculating for method')
    print(method)
    vectorizer = fit_method(method, data['text_clean'])
    most_similars_doc = test_query_cossine_similarity(query, vectorizer, method)

    method_results = pd.json_normalize(most_similars_doc)
    # Store the method name (not the fitted vectorizer object) with each row
    method_results['vectorizer'] = method
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    results = pd.concat([results, method_results], ignore_index=True)

Calculating for method
bert
Loading BERT Model...


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


(1, 768)
(768,)
[... identical (768,) lines omitted ...]


KeyboardInterrupt: 

In [175]:
results

Unnamed: 0,index,document,cossine_similarity,vectorizer


**Comments:** `The simple count vector had a higher cosine similarity than TF-IDF, so we chose it to continue the search engine. `

### Title search engine

In [None]:
df_connections = data[['title', 'text_clean', 'text_clean_tokenized','authors_clean', 'venue_clean']]

In [None]:
def build_representation_bert(title, n):
    query = df_connections["text_clean"][df_connections["title"] == title].iloc[0]
    df = df_connections[df_connections["title"] != title].reset_index(drop=True)

    # encode_bert is not defined in this notebook; as an assumption, encode_text
    # with the BERT tokenizer returned by fit_method is used in its place
    tokenizer = fit_method('bert', df['text_clean'])
    # Encode the cleaned text of every other document (one vector per document)
    list_bert_data = [encode_text('bert', [text], tokenizer) for text in df['text_clean']]

    query = encode_text('bert', [query], tokenizer)

    # Compare the query with every document in the dataset
    list_similar_values = [get_cosine_similarity(bert_data, query) for bert_data in list_bert_data]

    # Get the n most similar documents (in ascending order of similarity)
    top_ids = np.argsort(list_similar_values)[-n:]
    most_similars_doc = []
    for ids in top_ids:
        most_similars_doc.append({"title" : df['title'].iloc[ids], "cossine_similarity": list_similar_values[ids]})
    return most_similars_doc
    

In [None]:
query = 'Width of Points in the Streaming Model'
list_similar = build_representation_bert(query, 10)
print(list_similar)

In [None]:
def text_document_similarity(title, n):
    text = df_connections["text_clean"][df_connections["title"] == title].iloc[0]
    # Exclude the query article and renumber the remaining rows from 0
    df = df_connections[df_connections["title"] != title].reset_index(drop=True)

    # One TaggedDocument per remaining article
    tagged_docs = []
    for i, list_tokens in enumerate(df['text_clean_tokenized']):
        tagged_docs.append(TaggedDocument(words=list_tokens, tags=[str(i+1)]))

    print('Training Doc2Vec...')

    d2v_model = Doc2Vec(vector_size=100, window=10, min_count=5, workers=11,alpha=0.025)
    d2v_model.build_vocab(tagged_docs)
    d2v_model.train(tagged_docs,total_examples=d2v_model.corpus_count, epochs=20)

    # Tokenize the query text and infer its vector with the trained model
    query = word_tokenize(text)
    query_vec = d2v_model.infer_vector(query)

    print('Comparing text documents...')

    list_similar_values = []
    # Compare the query with every document in the dataset
    for index, row in df.iterrows():
        document_vec = d2v_model.infer_vector(df['text_clean_tokenized'].iloc[index])
        list_similar_values.append(get_cosine_similarity(document_vec, query_vec))

    # Get the n most similar documents (in ascending order of similarity)
    top_ids = np.argsort(list_similar_values)[-n:]
    most_similars_doc = []
    for ids in top_ids:
        most_similars_doc.append({"title" : df['title'].iloc[ids], "cossine_similarity": list_similar_values[ids]})
    return most_similars_doc
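
The selection `np.argsort(list_similar_values)[-n:]` returns the n best indices in ascending order of similarity, which is why `find_similiar_documents` reverses the list before displaying it. As a small standard-library sketch of the same top-n selection returned best first; the function name `top_n_indices` and the scores are illustrative assumptions.

```python
import heapq

def top_n_indices(scores, n):
    """Return the indices of the n highest scores, best first."""
    return heapq.nlargest(n, range(len(scores)), key=scores.__getitem__)

scores = [0.20, 0.90, 0.40, 0.75, 0.10]
best = top_n_indices(scores, 3)
print(best)

# Same selection in ascending order, as an argsort-based slice would give it
ascending = sorted(range(len(scores)), key=scores.__getitem__)[-3:]
print(ascending)
```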

In [None]:
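def authors_or_venues_in_common(title, most_similars_doc):
    query_article = df_connections[df_connections["title"] == title].iloc[0]

    def as_set(value):
        # Assumption: a cell holds either a collection of names or a single string
        return set(value) if isinstance(value, (list, tuple, set)) else {value}

    authors = as_set(query_article["authors_clean"])
    venues = as_set(query_article["venue_clean"])
    updated_docs = []

    print('Checking common authors and venues...')

    for doc in most_similars_doc:
        # .iloc[0] gives the row's values; `in` on a Series would test index labels instead
        article = df_connections[df_connections["title"] == doc['title']].iloc[0]

        common_authors = authors & as_set(article['authors_clean'])
        common_venues = venues & as_set(article['venue_clean'])

        # Report every shared author or venue, not only the last one checked
        doc['author'] = ", ".join(sorted(map(str, common_authors))) if common_authors else 'No authors in common'
        doc['venue'] = ", ".join(sorted(map(str, common_venues))) if common_venues else 'No venues in common'

        updated_docs.append(doc)

    return updated_docs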

In [None]:
def find_similiar_documents(title, number_of_documents):
    
    most_similars_doc = text_document_similarity(title, number_of_documents)
    updated_docs = authors_or_venues_in_common(title, most_similars_doc)
    
    results =  pd.DataFrame(updated_docs[::-1])
    print("\n")
    print("Similar documents of: "+ str(title) + "\n")
    print(results)

    return results

In [None]:
import ipywidgets as widgets
from IPython.display import display

dropdown = widgets.Dropdown(options = df_connections.title)

def dropdown_eventhandler(change):
    title = change.new
    results = find_similiar_documents(title, 10)


In [None]:
dropdown.observe(dropdown_eventhandler, names='value')

print("Choose from the listbox the Title you want to find similar documents: ")
display(dropdown)

**Results**
```
From the tests, the search engine that best found documents similar to the query used a TF model with cosine similarity, limiting the vocabulary size and removing the least used words, the stop words, and the most common words.
```