# Project Text Mining (Part 1) 

This notebook was created as the Text Mining project for the Master's in Data Mining at Université Lyon 2.

## Objective:

The main goal is to practice the concepts seen in class and apply text mining techniques to a dataset.

## Plan:

This notebook is divided into the following sections:

$\rightarrow$ Data acquisition

$\rightarrow$ Building a word index

$\rightarrow$ Grouping documents by cluster and/or topic


**Owners**: Lia Furtado and Hugo Vinision 


---

In [1]:
import json
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re
import nltk 
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from nltk.corpus import stopwords
import unicodedata
from collections import Counter
import string
from sklearn.metrics.pairwise import cosine_similarity
from itertools import combinations

from yellowbrick.text import FreqDistVisualizer
import seaborn as sns
import math
import networkx as nx
from pyvis.network import Network
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.manifold import Isomap
from plotly.subplots import make_subplots
import plotly.express as px


---
## Data acquisition

This part loads the dataset from https://www.aminer.org/citation and builds a smaller, better-structured working dataset.
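
As a minimal sketch of the loading and filtering idea, the example below reads records stored one JSON object per line and keeps only those from a given year that have an abstract. The three sample records and the helper name `load_papers` are hypothetical and only illustrate the layout; the real file is much larger.

```python
import io
import json

# Hypothetical three-record sample in a one-JSON-object-per-line layout.
sample = io.StringIO(
    '{"title": "A", "abstract": "Text mining.", "year": 2016}\n'
    '{"title": "B", "year": 2016}\n'
    '{"title": "C", "abstract": "Older paper.", "year": 2015}\n'
)

def load_papers(lines, year):
    """Keep records published in `year` that have a non-empty abstract."""
    papers = []
    for line in lines:
        record = json.loads(line)
        if record.get("year") == year and record.get("abstract"):
            papers.append(record)
    return papers

papers_2016 = load_papers(sample, 2016)
print([p["title"] for p in papers_2016])  # ['A']
```

Filtering while reading, as sketched here, avoids keeping every record in memory before the subset is taken.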



In [3]:
# Load the dataset downloaded from the website (one JSON record per line)
data = []
with open('dblp-ref/dblp-ref-0.json') as f:
    for line in f:
        data.append(json.loads(line))
        
df = pd.json_normalize(data)

# Keep a small subset: only the papers published in 2016
data_2016 = df[df['year'] == 2016]

# Remove papers with a missing abstract; copy to avoid SettingWithCopyWarning later
data_2016 = data_2016[~data_2016['abstract'].isnull()].copy()
# Reindex the dataframe
data_2016.reset_index(drop=True, inplace=True)
data = data_2016


In [None]:
data.head()

---
## Building a word index


This section first preprocesses the texts by removing stop words and overly frequent words (rare words are filtered later through the vectorizer's `min_df`; stemming is left disabled). It then visualizes the most common words to understand the data. Finally, it vectorizes the documents with TF-IDF and Doc2Vec. The steps are:


* Text cleaning and pre-processing of author and venue information
* Exploratory data analysis (by visualization)
* Text vectorization
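
The cleaning step can be sketched with the standard library alone. The stop-word set below is a tiny hypothetical stand-in for NLTK's English list, and `simple_clean` is an illustrative helper, not the function used later in this notebook.

```python
import re
import string
import unicodedata

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "for"}

def simple_clean(text):
    # Replace every punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Decompose accented characters and drop the combining marks (é -> e).
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")
    # Drop any token that contains a digit.
    text = re.sub(r"\S*\d+\S*", "", text)
    # Lowercase, split on whitespace, remove stop words and one-letter tokens.
    tokens = [t for t in text.lower().split()
              if t not in STOP_WORDS and len(t) > 1]
    return " ".join(tokens)

print(simple_clean("The café of 2016: a study in Text-Mining!"))
# cafe study text mining
```

Stripping diacritics before any ASCII-only filter matters: done in the other order, the accented letter would be deleted instead of being mapped to its base letter.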


### Text cleaning and pre-processing of author and venue information

In [None]:
#Joining the title and abstract text
data['text'] = data['title'] + ' ' + data['abstract']



In [None]:
#Loading the stopwords
stop_words = stopwords.words('english')
stopwords_en = set(stop_words)


In [None]:
def cleanup_authors(msg):
    # remove brackets, parentheses and braces
    sentence = re.sub(r"[\([{})\]]", r'', msg)
    # remove quotes, periods and slashes
    sentence = sentence.replace("'", "").replace(".", "") 
    sentence = sentence.replace("\\", "").replace("/", "") 
    sentence = sentence.replace('"','') 
    # strip diacritics first (e.g. é -> e) so accented letters are not dropped below
    sentence = ''.join(c for c in unicodedata.normalize('NFD', sentence)
                  if unicodedata.category(c) != 'Mn')
    # remove all remaining non-ASCII characters (e.g. Greek letters)
    sentence = re.sub(r'[^\x00-\x7f]', r'', sentence)
    # remove tokens containing digits
    sentence = re.sub(r"\S*\d+\S*", "", sentence)
    # split into individual authors and hyphenate each name
    sentence = sentence.split(", ")
    sentence = [word.lower().replace(' ','-') for word in sentence]
    return sentence

In [None]:
def cleanup_venue(msg):
    # remove brackets, parentheses and braces
    # (the earlier punctuation-stripping result was overwritten here, so it is dropped)
    sentence = re.sub(r"[\([{})\]]", r'', msg)
    # remove colons, periods, slashes, quotes and the connectors "& " and "and "
    sentence = sentence.replace(":", "") 
    sentence = sentence.replace(".", "")
    sentence = sentence.replace("\\", "").replace("/", "") 
    sentence = sentence.replace('"','') 
    sentence = sentence.replace("& ", "")
    sentence = sentence.replace("and ", "")
    # split into individual venues and hyphenate each name
    sentence = sentence.split(", ")
    sentence = [word.lower().replace(' ','-') for word in sentence]
    return sentence

In [None]:
def cleanup_text(msg):
    # replace punctuation with spaces
    no_punctuation = [char if char not in string.punctuation else ' ' for char in msg]
    sentence = ''.join(no_punctuation)
    # strip diacritics first (e.g. é -> e) so accented letters are not dropped below
    sentence = ''.join(c for c in unicodedata.normalize('NFD', sentence)
                  if unicodedata.category(c) != 'Mn')
    # remove all remaining non-ASCII characters (e.g. Greek letters)
    sentence = re.sub(r'[^\x00-\x7f]', r'', sentence)
    # remove tokens containing digits
    sentence = re.sub(r"\S*\d+\S*", "", sentence)
    # word tokenization splits the text into individual words
    sentence = nltk.word_tokenize(sentence)
    # stemming is currently disabled
    #stemmer = PorterStemmer()
    # lowercase, then drop stop words and one-character tokens
    return " ".join(word.lower() for word in sentence if word.lower() not in stopwords_en and len(word.lower())>1)


In [None]:
# Remove the 40 most frequent words found in the bag of words
def clean_words(msg):
    erase_words = {'based','data', 'proposed', 'paper', 'model','method','results','time','algorithm','using','problem',
                   'two', 'system','performance','approach','network','show','also','information','analysis','new',
                   'used','systems', 'different','study','methods','networks','number','one','order','set','algorithms',
                   'high','control','models','propose','learning','use','image','problems'}
    
    return " ".join(word for word in word_tokenize(msg) if word not in erase_words)

In [None]:
data.head()

In [None]:
data['text_clean'] = data['text'].apply(lambda x:cleanup_text(x))
data['text_clean'] = data['text_clean'].apply(lambda x:clean_words(x))

In [None]:
data['text_clean_tokenized'] = data['text_clean'].apply(nltk.word_tokenize)
# Drop papers without a venue; copy so later column assignments do not raise SettingWithCopyWarning
data = data[~data['venue'].isnull()].copy()
data.reset_index(inplace=True, drop=True)

In [None]:
data['authors_clean'] = data['authors'].apply(lambda x:cleanup_authors(str(x)))
data['venue_clean'] = data['venue'].apply(lambda x:cleanup_venue(x))

In [None]:
data.head()

In [None]:
data.to_csv('dblp_2016_cleaned.csv', index=False)

### Exploratory data analysis (by visualization)

**Most Common words** 

In [None]:
docs = [[w.lower() for w in word_tokenize(text)] 
            for text in list(data['text_clean'])]
bag_of_words = [item for sublist in docs for item in sublist]

In [None]:
# The 30 most common words and their counts
counter = Counter(bag_of_words)
most = counter.most_common(30)
x = [word for word, count in most]
y = [count for word, count in most]

plt.figure(figsize=(25,10))
plt.xticks(fontsize=18, rotation=90)
plt.bar(x,y)
plt.show()

In [None]:
# Generate a word cloud from all the cleaned words
plt.figure(figsize=(10,8))
word_cloud = WordCloud(background_color='black',
                          max_font_size = 80
                         ).generate(" ".join(bag_of_words))
plt.imshow(word_cloud)
plt.axis('off')
plt.show()


**Most Common Authors and Venues** 

In [None]:
bag_of_authors = [item for sublist in data['authors_clean'] for item in sublist]

bag_of_venues = [item for sublist in data['venue_clean'] for item in sublist]

In [None]:
# Top 10 authors
plt.figure(figsize=(8, 10)) 

plt.subplot(2, 1, 1)

top10authors = pd.DataFrame.from_records(
    Counter(bag_of_authors).most_common(10), columns=["Author", "Count"]
)
sns.barplot(x="Count", y="Author", data=top10authors, palette="RdBu_r")
plt.title("Top 10 Authors")

plt.subplot(2, 1, 2)

# TOP 10 Venues
top10venues = pd.DataFrame.from_records(
    Counter(bag_of_venues).most_common(10),
    columns=["Venues", "Count"],
)

sns.barplot(x="Count", y="Venues", data=top10venues)
plt.title("Top 10 Venues")

**Network visualization of the top 20 authors**

In [None]:
df_connections = data[['title', 'text_clean', 'text_clean_tokenized','authors_clean', 'venue_clean']]

In [None]:
top20authors = pd.DataFrame.from_records(
    Counter(bag_of_authors).most_common(20), columns=["Name", "Count"])

top20_names = set(top20authors['Name'])

# Keep each article once if at least one of its authors is in the top 20
# (appending once per matching author would duplicate rows and edges)
articles_20_common_authors = [
    connection for _, connection in df_connections.iterrows()
    if top20_names.intersection(connection['authors_clean'])
]
            
df_20 = pd.DataFrame(articles_20_common_authors)

In [None]:
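# All unordered pairs of co-authors for each article
df_20['authors_combination'] = df_20['authors_clean'].apply(lambda x: list(combinations(x[::-1], 2)))
# One row per co-author pair; single-author articles yield NaN and are dropped below
df_20 = df_20.explode('authors_combination', ignore_index=True)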

df_20 = df_20[~df_20['authors_combination'].isnull()] 
df_20.reset_index(inplace=True, drop=True)
df_20['From'], df_20['To'] = zip(*df_20.authors_combination)

df_graph = df_20[['From', 'To', 'title']]

In [None]:
MG= nx.from_pandas_edgelist(df_graph, 'From', 'To', edge_attr=['title'], 
                                 create_using=nx.MultiGraph())

MG.edges(data=True)

In [None]:
net = Network(height='650px', width='100%', font_color='black', notebook =True)
net.from_nx(MG)
net.show("example.html")

### Text Vectorization

$\rightarrow$ **Term Frequency Inverse Document Frequency Vectorizer**
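This vectorizer weights each word by how informative it is across the whole collection of documents.

The TF-IDF score of a word in a document is the number of times the word appears in that document multiplied by the inverse document frequency of the word across the set of documents. Words that occur in many documents therefore receive lower weights.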
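
To make the weighting concrete, here is a small hand-rolled sketch on a hypothetical three-document corpus. It assumes the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default, $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, followed by L2 normalization of each document vector; other TF-IDF variants differ in these details.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and whitespace-tokenizable.
docs = [
    "graph clustering graph",
    "text clustering",
    "text mining text",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    # Raw term counts times idf, then L2-normalize the document vector.
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first document, "graph" outweighs "clustering" both because it occurs twice and because it appears in only one document of the corpus.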

$\rightarrow$ **Doc2Vec**


$\rightarrow$ **BERT**




**TF-IDF**

In [None]:
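# Ignore terms that appear in fewer than 2 documents
vectorizer = TfidfVectorizer(min_df=2)
vectors = vectorizer.fit_transform(data['text_clean'])
feature_names = vectorizer.get_feature_names_out()
# Dense ndarray copy of the sparse matrix (memory-heavy for large corpora)
tfidf_vectors = vectors.toarray()

tfid = pd.DataFrame(tfidf_vectors, columns=feature_names)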

In [None]:
visualizer = FreqDistVisualizer(features=feature_names, orient='v', size=(1080, 560))
visualizer.fit(vectors)
visualizer.show()

In [None]:
tfid

**Doc2Vec**

In [None]:
tagged_docs = []

for i, list_tokens in enumerate(data['text_clean_tokenized']):
    tagged_docs.append(TaggedDocument(words=list_tokens, tags=[str(i+1)]))

In [None]:
print('Training Doc2Vec...')

d2v_model = Doc2Vec(vector_size=100, window=10, min_count=5, workers=11,alpha=0.025)
d2v_model.build_vocab(tagged_docs)
d2v_model.train(tagged_docs,total_examples=d2v_model.corpus_count, epochs=20)

In [None]:
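# Collect the learned vector of each document and print some info
print('Collecting document vectors...')

doc2vec_vectors = []

for index in range(len(data['text_clean_tokenized'])):
    # Documents were tagged str(i+1) above, so look each vector up by that tag
    doc2vec_vectors.append(d2v_model.dv[str(index + 1)])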
        
doc2vec_vectors = np.array(doc2vec_vectors)

print('Total number of documents parsed: {}'.format(len(doc2vec_vectors)))
print('Size of vector embeddings: ', doc2vec_vectors.shape[1])
print('Shape of vectors embeddings matrix: ', doc2vec_vectors.shape)

---
## Grouping documents by cluster and/or topic


This section clusters the documents with K-Means (k = 10) on each vector representation built above (Doc2Vec and TF-IDF), then projects the vectors to two dimensions with Isomap to visualize the resulting clusters.
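
For intuition about the clustering step, the sketch below implements plain Lloyd's algorithm (the iteration behind K-Means) on six hypothetical 2-D points. It is a simplification: it seeds the centroids with the first `k` points, whereas scikit-learn's `KMeans` uses k-means++ initialization by default, and it is not the code used in this notebook.

```python
def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm on a list of equal-length tuples."""
    # Simplified initialization: use the first k points as centroids.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Two well-separated hypothetical groups of points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Even though both initial centroids start in the same group here, the alternating assignment and update steps pull one centroid toward the distant group within a couple of iterations.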

In [None]:

k = 10
representations = {"Doc2Vec" : doc2vec_vectors, \
                   "TF-IDF" : np.asarray(tfidf_vectors)}

print('Clustering documents...')

results = []
for key, value in representations.items():
    print("Representation " + str(key))
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(value)
    results.append({"Representation" : key, "labels" :list(labels)})


**Clustering visualization for each embedding**

In [None]:
# One subplot per representation, titled in the same order as `results`
fig = make_subplots(rows=len(results), cols=1,
                    subplot_titles=[res['Representation'] for res in results])
for i, res in enumerate(results, start=1):
    
    print('Isomap for ' + str(res['Representation']) +'...')

    #embedding = TSNE(n_components=2, init='pca')
    embedding = Isomap(n_components=2)

    components = embedding.fit_transform(representations[res['Representation']])
    labels = res['labels']
    
    print('Plot for ' + str(res['Representation']))

    X = components[:, 0]
    y = components[:, 1]
    
    # Pass coordinates by keyword: the first positional argument of px.scatter is a data frame
    fig1 = px.scatter(x=X, y=y, color=labels)
    trace1 = fig1['data'][0]

    fig.add_trace(trace1, row=i, col=1)

fig.update_layout(height=1500, width=600, title_text="Isomap projection of each embedding, colored by cluster")
fig.show()

In [None]:
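import umap  # provided by the umap-learn package

# NOTE: `lda_vectors` is not defined anywhere in this notebook;
# it must be computed before this cell can run.
mapper_lda = umap.UMAP(metric='cosine').fit_transform(lda_vectors)
mapper_lda.shape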

In [None]:
plt.scatter(
    mapper_lda[:, 0],
    mapper_lda[:, 1])
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the LDA document vectors', fontsize=24)
plt.show()