# Open Avenues - Week 5 
## Word2Vec Model

The main idea behind Word2Vec is to learn word embeddings by training a shallow neural network on a large corpus of text. The training process is typically done using either the Skip-gram model or the Continuous Bag of Words (CBOW) model.

Skip-gram model: This model predicts the context words (the surrounding words) given a target word. It is trained to maximize the probability of the observed context words given that target word.

Continuous Bag of Words (CBOW) model: This model works in the opposite direction and predicts the target word given its context words. It is trained to maximize the probability of the target word given its surrounding context.
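To make the difference concrete, here is a minimal pure-Python sketch of the training examples each model would see for one short token list. The helper names `skipgram_pairs` and `cbow_pairs` and the toy tokens are hypothetical, and real implementations add details (such as subsampling and negative sampling) that are left out here. In gensim, the choice between the two is made with the `sg` argument of `Word2Vec` (`sg=0` for CBOW, which is the default, and `sg=1` for Skip-gram).

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs: one training example per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """Return (context_words, target) pairs: one training example per position."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


# Toy sentence (illustrative only, not taken from the dataset)
tokens = ["lungs", "clear", "pleural", "effusion"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)
print(sg)
print(cb)
```

Skip-gram produces one example per (target, context) pair, while CBOW produces one example per position with the whole context grouped together.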

In [18]:
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec,KeyedVectors
from gensim.test.utils import datapath
import re
import unicodedata
from tqdm import tqdm
import gensim
import multiprocessing
import random
import pandas as pd

Let's import the dataset and create a list for each column.

In [31]:

df = pd.read_csv("open_ave_data.csv")
# Fill missing values with an empty string so that no literal "nan" token enters the corpus
df.fillna("", inplace=True)

findings = df["findings"].values.tolist()
clinical = df["clinicaldata"].values.tolist()
exam = df["ExamName"].values.tolist()
impression = df["impression"].values.tolist()


Next, we clean the dataset by removing stop words, punctuation, digits, and very short tokens.

In [20]:
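# Requires the NLTK stop-word corpus: nltk.download('stopwords')
stopwords_list = set(stopwords.words('english'))

def clean_data(w):
    w = w.lower()
    # Delete punctuation; no space is inserted, so "lungs/pleura" becomes "lungspleura"
    w = re.sub(r'[^\w\s]', '', w)
    # Replace every digit with a space
    w = re.sub(r"[0-9]", " ", w)
    words = w.split()
    # Keep tokens that are not stop words and are longer than two characters
    clean_words = [word for word in words if word not in stopwords_list and len(word) > 2]
    return clean_words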


Next, we'll create a function that processes every report in a corpus. It applies the clean_data function from above to each entry, which lowercases the text, removes stop words, punctuation, and digits, and returns a list of tokens per report.

In [30]:
def get_sentences(corpus):
    sent=list(map(clean_data,corpus))               # Clean the sentences
    return sent
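
As a quick illustration of what the cleaning step produces, the sketch below applies the same regular expressions to two made-up report snippets. The function name `clean_text`, the tiny `STOPWORDS` set, and the sample sentences are assumptions for demonstration; the notebook itself uses the full NLTK English stop-word list.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOPWORDS = {"the", "are", "is", "no", "and", "or", "of"}


def clean_text(text, stopwords=STOPWORDS):
    """Apply the same steps as clean_data, with the stop-word list passed in."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # delete punctuation (no space inserted)
    text = re.sub(r"[0-9]", " ", text)   # replace each digit with a space
    return [t for t in text.split() if t not in stopwords and len(t) > 2]


# Made-up snippets, not taken from the dataset
reports = [
    "Lungs/pleura: The lungs are clear. No pleural effusion.",
    "Heart size is normal; 2 views of the chest.",
]
cleaned = [clean_text(r) for r in reports]
print(cleaned)
```

Because punctuation is deleted rather than replaced with a space, words joined by a slash are fused into one token, which would explain tokens such as 'lungspleura' in the vocabulary.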

### Creating the Vocabulary

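Let's create our Word2Vec model. It is only configured here; the vocabulary is built and the model is trained in the following steps.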

In [22]:
cores = multiprocessing.cpu_count()
# Leave one core free, but always use at least one worker thread
model = Word2Vec(min_count=5, window=5, vector_size=300, workers=max(1, cores - 1), max_vocab_size=100000)

Before we train the model, we must build the vocabulary, which is the set of unique tokens in our corpus. Because the model was created with min_count=5, only tokens that appear at least five times are kept.

In [32]:
corpus = get_sentences(findings)

# Building the vocabulary using entire dataset
model.build_vocab(corpus)

# Dictionary that maps each word in the vocabulary to its integer index
word_dict = model.wv.key_to_index

# To see the words in the vocabulary
keys = model.wv.index_to_key
print(keys)


['findings', 'normal', 'pleural', 'pneumothorax', 'effusion', 'heart', 'mediastinum', 'within', 'focal', 'lungspleura', 'cardiomediastinal', 'opacities', 'lungs', 'none', 'evident', 'unremarkable', 'mediastinal', 'size', 'exam', 'contour', 'contours', 'limitations', 'clear', 'acute', 'limits', 'pulmonary', 'silhouette', 'osseous', 'abnormality', 'volumes', 'consolidation', 'structures', 'stable', 'significant', 'tube', 'effusions', 'lung', 'cardiac', 'infiltrate', 'right', 'chest', 'adenopathy', 'left', 'visualized', 'bony', 'vessels', 'atelectasis', 'seen', 'bone', 'tip', 'mild', 'bones', 'evidence', 'process', 'endotracheal', 'skeletal', 'appear', 'abnormalities', 'vascularity', 'vasculature', 'demonstrate', 'mass', 'vascular', 'edema', 'unchanged', 'identified', 'congestion', 'bilateral', 'either', 'fluid', 'catheter', 'noted', 'cardiomegaly', 'visible', 'changes', 'tubes', 'interstitial', 'portable', 'airspace', 'carina', 'enteric', 'upper', 'central', 'soft', 'intact', 'prior', 'i
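
The vocabulary step can be pictured as counting tokens, dropping the rare ones, and indexing the rest by frequency. The following is a simplified sketch of that idea, not gensim's implementation; `build_vocab` here is a hypothetical stand-alone helper, the toy corpus is made up, and the alphabetical tie-breaking is an assumption chosen to keep the output deterministic.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Count tokens, drop rare ones, and index the rest by descending frequency."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = [(tok, c) for tok, c in counts.items() if c >= min_count]
    # Sort by count (descending), then alphabetically so ties are deterministic
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    index_to_key = [tok for tok, _ in kept]
    key_to_index = {tok: i for i, tok in enumerate(index_to_key)}
    return index_to_key, key_to_index


# Toy tokenized corpus (illustrative only)
toy_corpus = [
    ["lungs", "clear", "pleural", "effusion"],
    ["lungs", "clear", "heart", "normal"],
    ["lungs", "normal"],
]
index_to_key, key_to_index = build_vocab(toy_corpus, min_count=2)
print(index_to_key)
print(key_to_index)
```

This mirrors the two attributes used above: `index_to_key` is a list of words, and `key_to_index` is the reverse lookup from word to position.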

In [26]:
sample =random.sample(corpus,300)

### Training the Model

In [27]:
model.train(corpus,total_examples=model.corpus_count,epochs=50)


(841471, 1798300)

The returned tuple holds the effective number of words trained on (after out-of-vocabulary tokens are dropped) and the total number of raw words processed, both summed over all 50 epochs.

### Using the Model

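We can query the trained model for the words most similar to an input word. The most_similar method ranks the vocabulary by the cosine similarity between word vectors and returns the top matches with their scores.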

In [34]:
model.wv.most_similar(positive=["lungs"],topn=5)

[('clear', 0.6960206031799316),
 ('effusions', 0.6200727224349976),
 ('pulmonary', 0.5714744329452515),
 ('osseous', 0.5662592649459839),
 ('lungspleura', 0.5508614778518677)]
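
The scores above are cosine similarities between word vectors. As a rough sketch of that ranking, the example below computes it by hand on made-up 2-dimensional vectors; the `cosine` and `most_similar` helpers and the `toy_vectors` values are hypothetical and are not the trained 300-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Made-up vectors for illustration only
toy_vectors = {
    "lungs": [1.0, 0.0],
    "clear": [0.8, 0.6],
    "heart": [0.0, 1.0],
    "tube": [-1.0, 0.0],
}
ranked = most_similar("lungs", toy_vectors, topn=2)
print(ranked)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score -1, so a higher score means the two words tend to appear in similar contexts.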