Setting up the environment:
python==3.7

Libraries used:

xlrd==1.1.0: https://pypi.org/project/xlrd/
spaCy==2.0.12: https://spacy.io/usage/
gensim==3.4.0: https://radimrehurek.com/gensim/install.html
scikit-learn==0.19.1: http://scikit-learn.org/stable/install.html
seaborn==0.8: https://seaborn.pydata.org/installing.html

In [1]:
import re  # For preprocessing
import pandas as pd  # For data handling
from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import spacy  # For preprocessing

import logging  # Set up logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)

#### Loading the data

In [2]:
import os
cws = os.getcwd()
print(cws)

C:\Users\gourav.kumar


In [3]:
df = pd.read_csv('Final_combined_data_2.csv')
df.shape

(331907, 2)

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,data
0,0,I have posted a few pictures from the first 5 ...
1,1,Your UHP implants look great! really like the ...
2,2,Hey! You are looking great already! Can't wait...
3,3,I'm only 1.5 out and for the most part been tr...
4,4,Sit and rest. And see if you can find some one...


In [5]:
comment = df['data']  # Series holding the raw comment text

In [6]:
len(comment)

331907

#### Preprocessing

Lemmatize each line and remove stopwords and non-alphabetic characters.

In [7]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])  # disable NER and the dependency parser for speed

def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spacy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Word2Vec uses context words to learn the vector representation of a target word,
    # if a sentence is only one or two words long,
    # the benefit for the training is very small
    if len(txt) > 2:
        return ' '.join(txt)

Remove non-alphabetic characters:

In [8]:
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in comment)
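To see what this regular expression does to a single comment, here is a minimal sketch. The helper name `brief_clean` and the sample string are made up for illustration; only the pattern itself comes from the cell above.

```python
import re

def brief_clean(text):
    """Replace each run of characters other than letters and apostrophes
    with a single space, then lowercase the result."""
    return re.sub("[^A-Za-z']+", ' ', str(text)).lower()

sample = "I'm only 1.5 out!"  # made-up comment, not taken from the dataset
print(repr(brief_clean(sample)))
```

Digits and punctuation are dropped together with the whitespace around them, so numeric details such as "1.5" do not survive this step, while contractions such as "i'm" are kept intact.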


In [9]:
brief_cleaning

<generator object <genexpr> at 0x000001D861A4F930>

Use spaCy's .pipe() method to speed up the cleaning process:

In [10]:
t = time()

txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=20000, n_threads=-1)]

print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

Time to clean up everything: 39.68 mins


Put the results in a DataFrame to remove missing values and duplicates:

In [11]:
df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape

(325094, 1)
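The effect of `dropna().drop_duplicates()` on the cleaned lines can be sketched without pandas. This is an illustration only: the helper name and the sample rows are invented, and it assumes the default pandas behaviour of keeping the first occurrence of each duplicate.

```python
def drop_missing_and_duplicates(rows):
    """Keep the first occurrence of each non-None row, preserving order."""
    # dict.fromkeys keeps insertion order and silently ignores repeated keys
    return list(dict.fromkeys(row for row in rows if row is not None))

# cleaning() returns None for lines with two or fewer remaining tokens
cleaned = ['feel good today', None, 'feel good today', 'post op week']
print(drop_missing_and_duplicates(cleaned))
```

This is why the row count drops from 331907 to 325094: very short comments become None in `cleaning()` and repeated comments collapse to one row.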

#### Bigrams

We use Gensim's Phrases model to automatically detect common phrases (bigrams) in a list of sentences: https://radimrehurek.com/gensim/models/phrases.html

In [12]:
from gensim.models.phrases import Phrases, Phraser

INFO - 18:15:52: 'pattern' package found; tag filters are available for English


In [13]:
sent = [row.split() for row in df_clean['clean']]

In [14]:
len(sent)

325094

Create relevant phrases from the list of sentences:

In [15]:
phrases = Phrases(sent, min_count=30, progress_per=1000)

INFO - 18:16:58: collecting all words and their counts
INFO - 18:16:58: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 18:16:58: PROGRESS: at sentence #1000, processed 34628 words and 24027 word types
INFO - 18:16:59: PROGRESS: at sentence #2000, processed 68707 words and 44029 word types
INFO - 18:16:59: PROGRESS: at sentence #3000, processed 100886 words and 60553 word types
INFO - 18:16:59: PROGRESS: at sentence #4000, processed 131960 words and 75132 word types
INFO - 18:16:59: PROGRESS: at sentence #5000, processed 161161 words and 89177 word types
INFO - 18:16:59: PROGRESS: at sentence #6000, processed 194844 words and 104134 word types
INFO - 18:16:59: PROGRESS: at sentence #7000, processed 226247 words and 116801 word types
INFO - 18:16:59: PROGRESS: at sentence #8000, processed 254592 words and 127817 word types
INFO - 18:16:59: PROGRESS: at sentence #9000, processed 284764 words and 140149 word types
INFO - 18:16:59: PROGRESS: at sentence #10000, processe

Transform the corpus based on the bigrams detected:

In [16]:
bigram_sentences = phrases[sent]
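The idea behind bigram detection can be sketched in plain Python. This is a simplified illustration, not Gensim's implementation: it assumes the scoring formula from the original word2vec phrase paper, (bigram count - min_count) / (count of first word * count of second word) * vocabulary size, and Gensim's own bookkeeping (for example, what counts toward the vocabulary size) differs in detail. The function names, toy sentences, and threshold are invented.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score every adjacent word pair; higher means 'more phrase-like'."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n in bigrams.items()
    }

def merge_bigrams(sentence, scores, threshold):
    """Join adjacent words with '_' when their score exceeds the threshold."""
    out, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            out.append('_'.join(pair))
            i += 2  # skip the second word, it is now part of the phrase
        else:
            out.append(sentence[i])
            i += 1
    return out

toy = [['scar', 'tissue', 'hurt'],
       ['scar', 'tissue', 'massage'],
       ['massage', 'help']]
scores = score_bigrams(toy, min_count=1)
print(merge_bigrams(toy[0], scores, threshold=1.0))
```

Pairs that occur together more often than their individual frequencies would suggest get a high score, which is how tokens such as `scar_tissue` or `post_op` end up in the vocabulary.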



Most Frequent Words:
Mainly a sanity check of the effectiveness of the lemmatization, removal of stopwords, and addition of bigrams.

In [18]:
word_freq = defaultdict(int)
for sentence in bigram_sentences:  # avoid rebinding `sent`, the full list of sentences
    for token in sentence:
        word_freq[token] += 1
len(word_freq)

70931
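An equivalent way to run this sanity check is `collections.Counter`, which also returns the counts next to the words. A minimal sketch, with an invented helper name and toy sentences:

```python
from collections import Counter

def top_words(sentences, n=10):
    """Return the n most frequent tokens with their counts."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return counts.most_common(n)

toy_sentences = [['feel', 'good'], ['feel', 'sore'], ['post_op', 'feel']]
print(top_words(toy_sentences, n=2))
```

Seeing the counts alongside the tokens makes it easier to spot leftover stopwords or stray tokens near the top of the list.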

In [19]:
sorted(word_freq, key=word_freq.get, reverse=True)

['feel',
 'look',
 'like',
 'day',
 'go',
 'think',
 'week',
 'ps',
 'know',
 'get',
 'want',
 'post',
 'surgery',
 'time',
 'boob',
 'implant',
 'say',
 'right',
 'bra',
 'tell',
 'big',
 'post_op',
 'good',
 'month',
 'ba',
 'breast',
 'help',
 'quote_originally',
 'start',
 'wear',
 'size',
 'hope',
 'cc',
 'thing',
 'take',
 'lol',
 'pain',
 'try',
 'little',
 'need',
 'thank',
 'not',
 'work',
 'sure',
 'way',
 'small',
 'lot',
 'great',
 'well',
 'girl',
 'normal',
 'love',
 'bad',
 'have',
 'come',
 'ask',
 'drop',
 'left',
 'be',
 'long',
 'find',
 'happy',
 'change',
 'wait',
 'today',
 'sleep',
 'pretty',
 'bit',
 'different',
 'massage',
 'muscle',
 'hard',
 'maybe',
 'lady',
 'recovery',
 'hurt',
 'good_luck',
 'body',
 'new',
 'away',
 'incision',
 'able',
 'lift',
 'high',
 'nipple',
 'year',
 'see',
 'surgeon',
 'pic',
 'happen',
 'hear',
 'tight',
 'let',
 'end',
 'actually',
 'night',
 'notice',
 "'",
 'wonder',
 'soon',
 'use',
 'worry',
 'heal',
 'definitely',
 'gues

#### Training the model

Gensim Word2Vec implementation:
We use the Gensim implementation of word2vec: https://radimrehurek.com/gensim/models/word2vec.html

In [20]:
import multiprocessing

from gensim.models import Word2Vec

1. Word2Vec():
In this first step, we set up the parameters of the model one by one.
We purposefully do not supply the sentences parameter, which leaves the model uninitialized.
2. .build_vocab():
This builds the vocabulary from a sequence of sentences and thereby initializes the model.
With logging enabled, you can follow the progress and, more importantly, the effect of min_count and sample on the word corpus. These two parameters, sample in particular, have a great influence on the performance of the model. Displaying both makes their influence easier to manage accurately.
3. .train():
Finally, this trains the model.
The logs here are mainly useful for monitoring, for example to check that no worker thread finishes instantaneously.

In [21]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer

The parameters (suggested ranges in parentheses):
min_count = int - Ignores all words with a total absolute frequency lower than this. - (2, 100)
window = int - The maximum distance between the current and the predicted word within a sentence, i.e. window words on the left and window words on the right of the target. - (2, 10)
size = int - Dimensionality of the feature vectors. - (50, 300)
sample = float - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential. - (0, 1e-5)
alpha = float - The initial learning rate. - (0.01, 0.05)
min_alpha = float - The learning rate drops linearly to min_alpha as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00
negative = int - If > 0, negative sampling is used, and the value specifies how many "noise words" should be drawn. If set to 0, no negative sampling is used. - (5, 20)
workers = int - Use this many worker threads to train the model (faster training on multicore machines).
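The interaction between alpha, min_alpha and the number of epochs can be checked with a few lines of arithmetic. This sketch assumes a strictly linear schedule over the whole training run; the function name is invented, and the values are the ones used in this notebook (alpha=0.03, min_alpha=0.0007, 30 epochs).

```python
def learning_rate(progress, alpha=0.03, min_alpha=0.0007):
    """Linearly interpolate from alpha (progress=0.0) to min_alpha (progress=1.0)."""
    return alpha - (alpha - min_alpha) * progress

for progress in (0.0, 0.5, 1.0):
    print(progress, round(learning_rate(progress), 5))

# Rule of thumb from the parameter list: alpha - (min_alpha * epochs) should be close to 0
alpha, min_alpha, epochs = 0.03, 0.0007, 30
print(round(alpha - min_alpha * epochs, 4))
```

With these settings the rule of thumb gives 0.009, which is close to zero relative to the starting rate of 0.03.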

In [22]:
w2v_model = Word2Vec(min_count=20,
                     window=2,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)
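To make the window=2 setting concrete, the sketch below lists which context words fall inside a fixed symmetric window around each target. It is an illustration of the window span only: the function name and the toy sentence are invented, and real word2vec implementations typically also shrink the window at random for each target, which is not modelled here.

```python
def context_pairs(sentence, window=2):
    """Yield (target, context) pairs using a fixed symmetric window."""
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, sentence[j]

sentence = ['sharp_pain', 'left', 'incision', 'week', 'post_op']
pairs = list(context_pairs(sentence, window=2))
print([context for target, context in pairs if target == 'incision'])
```

A small window like this tends to capture words that can substitute for each other, which suits the goal of finding near-synonyms and misspellings of the same term.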

#### Building the vocabulary table

In [23]:
t = time()

w2v_model.build_vocab(bigram_sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 18:19:48: collecting all words and their counts
INFO - 18:19:48: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 18:19:50: PROGRESS: at sentence #10000, processed 295437 words, keeping 11712 word types
INFO - 18:19:51: PROGRESS: at sentence #20000, processed 578570 words, keeping 16396 word types
INFO - 18:19:53: PROGRESS: at sentence #30000, processed 836714 words, keeping 19641 word types
INFO - 18:19:54: PROGRESS: at sentence #40000, processed 1096616 words, keeping 22440 word types
INFO - 18:19:56: PROGRESS: at sentence #50000, processed 1362669 words, keeping 24963 word types
INFO - 18:19:58: PROGRESS: at sentence #60000, processed 1632533 words, keeping 27279 word types
INFO - 18:19:59: PROGRESS: at sentence #70000, processed 1904732 words, keeping 29551 word types
INFO - 18:20:01: PROGRESS: at sentence #80000, processed 2186421 words, keeping 31740 word types
INFO - 18:20:02: PROGRESS: at sentence #90000, processed 2460629 words, keeping 33762 wor

Time to build vocab: 0.96 mins
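The effect of min_count and sample on the vocabulary can be sketched as follows. This is a hedged illustration: the keep-probability formula is the one used in the original word2vec code, (sqrt(count / threshold) + 1) * (threshold / count) with threshold = sample * total words, and the word counts below are made up, not taken from this corpus.

```python
from math import sqrt

def prune_vocab(word_counts, min_count=20):
    """Drop every word that occurs fewer than min_count times."""
    return {w: c for w, c in word_counts.items() if c >= min_count}

def keep_probability(count, total_words, sample=6e-5):
    """Probability that a single occurrence of a word survives downsampling."""
    threshold = sample * total_words
    return min(1.0, (sqrt(count / threshold) + 1) * (threshold / count))

# Made-up counts for illustration only
counts = {'feel': 60000, 'capsule': 600, 'seepage': 25, 'rareword': 3}
kept = prune_vocab(counts, min_count=20)
total = sum(kept.values())
for word, count in kept.items():
    print(word, round(keep_probability(count, total), 3))
```

Very frequent words are kept only a small fraction of the time, while words near the min_count cutoff are mostly or always kept, which is why sample has such a strong influence on what the model actually sees.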


#### Training the model

In [25]:
t = time()

w2v_model.train(bigram_sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 18:53:21: training model with 7 workers on 11654 vocabulary and 300 features, using sg=0 hs=0 sample=6e-05 negative=20 window=2
INFO - 18:53:22: EPOCH 1 - PROGRESS: at 2.35% examples, 91585 words/s, in_qsize 0, out_qsize 0
INFO - 18:53:23: EPOCH 1 - PROGRESS: at 4.79% examples, 86919 words/s, in_qsize 0, out_qsize 2
INFO - 18:53:24: EPOCH 1 - PROGRESS: at 7.42% examples, 86849 words/s, in_qsize 0, out_qsize 0
INFO - 18:53:25: EPOCH 1 - PROGRESS: at 10.00% examples, 85478 words/s, in_qsize 0, out_qsize 0
INFO - 18:53:26: EPOCH 1 - PROGRESS: at 12.78% examples, 85792 words/s, in_qsize 0, out_qsize 0
INFO - 18:53:27: EPOCH 1 - PROGRESS: at 15.79% examples, 88292 words/s, in_qsize 0, out_qsize 0
INFO - 18:53:29: EPOCH 1 - PROGRESS: at 18.02% examples, 86017 words/s, in_qsize 0, out_qsize 0
INFO - 18:53:30: EPOCH 1 - PROGRESS: at 20.19% examples, 84571 words/s, in_qsize 0, out_qsize 0
INFO - 18:53:31: EPOCH 1 - PROGRESS: at 21.63% examples, 80871 words/s, in_qsize 0, out_qsize 1
INFO

Time to train the model: 29.36 mins


As we do not plan to train the model any further, we are calling init_sims(), which will make the model much more memory-efficient:

In [26]:
w2v_model.init_sims(replace=True)

INFO - 19:29:33: precomputing L2-norms of word weight vectors
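The log line above refers to L2-normalizing the word vectors. A minimal sketch of what that means, using invented helper names and toy 2-dimensional vectors: once every vector has unit length, a plain dot product already equals the cosine similarity.

```python
from math import sqrt

def unit_normalize(vec):
    """Scale a vector so that its L2 norm is 1."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = unit_normalize([3.0, 4.0])
b = unit_normalize([4.0, 3.0])
print(a, b)
print(dot(a, a), dot(a, b))
```

With replace=True the original, unnormalized vectors are discarded, which saves memory but means the model should not be trained any further afterwards.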


#### Exploring the model

Most similar to:
Here, we ask the model to find the words most similar to some of the terms and complications related to breast augmentation.
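The scores in the outputs below are cosine similarities between word vectors. The following sketch shows the ranking idea on a toy vocabulary; the function names and the 2-dimensional vectors are invented for illustration and have nothing to do with the trained model.

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]  # raises KeyError for out-of-vocabulary words
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

toy_vectors = {  # made-up 2-d vectors
    'pain': [1.0, 0.1],
    'ache': [0.9, 0.2],
    'bra': [0.1, 1.0],
}
print(most_similar('pain', toy_vectors, topn=2))
```

A score near 1 means the two words appear in very similar contexts; scores around 0.3 to 0.4, as for some of the rarer terms below, indicate a much weaker association.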

In [27]:
w2v_model.wv.most_similar(positive=["pain_killer"])

[('pain_med', 0.7719273567199707),
 ('painkiller', 0.6681125164031982),
 ('med', 0.6607716679573059),
 ('pain_pill', 0.6564115285873413),
 ('percocet', 0.6370357275009155),
 ('tylenol', 0.6269216537475586),
 ('vicodin', 0.6085245609283447),
 ('medication', 0.5994677543640137),
 ('pain_me', 0.5520679950714111),
 ('narcotic', 0.5439748764038086)]

In [28]:
w2v_model.wv.most_similar(positive=["scar_tissue"])

[('capsule', 0.6500141620635986),
 ('pocket', 0.5868897438049316),
 ('tissue', 0.4843810200691223),
 ('massage', 0.46953046321868896),
 ('collagen', 0.4461927115917206),
 ('pocket_open', 0.44118815660476685),
 ('suture', 0.4394689202308655),
 ('scar', 0.4298346936702728),
 ('internal_suture', 0.4245665371417999),
 ('stitch', 0.4167569875717163)]

In [29]:
w2v_model.wv.most_similar(positive=["crease_incision"])

[('incision', 0.7371639609336853),
 ('crease', 0.7074934244155884),
 ('areola_incision', 0.6257808208465576),
 ('incision_site', 0.6030789613723755),
 ('areola', 0.578480064868927),
 ('nipple', 0.5502513647079468),
 ('inframammary', 0.534883975982666),
 ('scar', 0.5214579105377197),
 ('transax', 0.5091649889945984),
 ('insicion', 0.5056347846984863)]

In [30]:
w2v_model.wv.most_similar(positive=["pec_muscle"])

[('pecs', 0.6888641119003296),
 ('muscle', 0.6768326163291931),
 ('pectoral_muscle', 0.6529515981674194),
 ('chest', 0.5200837850570679),
 ('pec', 0.504664421081543),
 ('stretch', 0.48584020137786865),
 ('muscle_relax', 0.47358646988868713),
 ('pectoral', 0.458560585975647),
 ('tight', 0.44817256927490234),
 ('peck_muscle', 0.43367624282836914)]

In [31]:
w2v_model.wv.most_similar(positive=["sharp_pain"])

[('pain', 0.567895770072937),
 ('stinging', 0.5662801861763),
 ('shooting_pain', 0.5640618801116943),
 ('sharp', 0.5628449320793152),
 ('sharp_shooting', 0.5623195171356201),
 ('zinger', 0.5591235160827637),
 ('stab_pain', 0.5555984973907471),
 ('sore', 0.5460119247436523),
 ('burn_sensation', 0.5433788895606995),
 ('ache', 0.535645604133606)]

In [32]:
w2v_model.wv.most_similar(positive=["flex_distortion"])

[('muscle_distortion', 0.5728683471679688),
 ('distortion', 0.5126267671585083),
 ('switch_over', 0.4412931203842163),
 ('flex', 0.41555944085121155),
 ('over', 0.40872031450271606),
 ('flexion', 0.3854365944862366),
 ('flex_pec', 0.37905266880989075),
 ('direct_chest', 0.3698456287384033),
 ('under', 0.3685559630393982),
 ('flex_pecs', 0.36781829595565796)]

In [33]:
w2v_model.wv.most_similar(positive=["capsular_contracture"])

[('contracture', 0.4695882499217987),
 ('complication', 0.45382338762283325),
 ('bottom', 0.43589794635772705),
 ('capsular_contraction', 0.43111127614974976),
 ('lateral_displacement', 0.38797900080680847),
 ('rupture', 0.387267529964447),
 ('double_bubble', 0.3838402032852173),
 ('hematoma', 0.3813149929046631),
 ('cc', 0.3726479113101959),
 ('baker_grade', 0.37199491262435913)]

In [34]:
w2v_model.wv.most_similar(positive=["lateral_displacement"])

[('bottom', 0.5596328377723694),
 ('ld', 0.5282319784164429),
 ('symmastia', 0.47798413038253784),
 ('pocket', 0.4620778560638428),
 ('laterally', 0.4500505328178406),
 ('bottoming', 0.42339855432510376),
 ('revision', 0.42185282707214355),
 ('double_bubble', 0.4204094111919403),
 ('fix', 0.4203122854232788),
 ('synmastia', 0.41652417182922363)]

In [35]:
w2v_model.wv.most_similar(positive=["mondor_cord"])

[('cord', 0.544319748878479),
 ('inflamme', 0.451033353805542),
 ('mondor', 0.4292127192020416),
 ('warm_compress', 0.4287552237510681),
 ('tendon', 0.4252537488937378),
 ('inflamed', 0.4051695764064789),
 ('mc', 0.3953125476837158),
 ('mondor_chord', 0.3803102374076843),
 ('vein', 0.3756590485572815),
 ('bruise', 0.3591253161430359)]

In [36]:
w2v_model.wv.most_similar(positive=["uneven"])

[('asymmetry', 0.5620050430297852),
 ('asymmetrical', 0.5591391324996948),
 ('symmetrical', 0.5486911535263062),
 ('lopsided', 0.5470714569091797),
 ('drop', 0.537290096282959),
 ('left', 0.533637523651123),
 ('high', 0.5246310234069824),
 ('right', 0.5123416185379028),
 ('righty', 0.5049702525138855),
 ('lefty', 0.504669725894928)]

In [37]:
w2v_model.wv.most_similar(positive=["mentor"])

[('natrelle', 0.6718122959136963),
 ('saline', 0.6259774565696716),
 ('silicone', 0.6253339648246765),
 ('high_profile', 0.6120752692222595),
 ('allergan', 0.6118178367614746),
 ('sientra', 0.6036064624786377),
 ('hp', 0.5920089483261108),
 ('cc', 0.5918722748756409),
 ('mod', 0.5825388431549072),
 ('mentor_smooth', 0.5639688968658447)]

In [38]:
w2v_model.wv.most_similar(positive=["infection"])

[('bacterial_infection', 0.5802803635597229),
 ('infect', 0.5197092294692993),
 ('bacteria', 0.5177251100540161),
 ('hematoma', 0.49822643399238586),
 ('fever', 0.4968625009059906),
 ('antibiotic', 0.49263250827789307),
 ('bleed', 0.48705923557281494),
 ('staph_infection', 0.4795267581939697),
 ('abscess', 0.4742536246776581),
 ('infected', 0.46035993099212646)]

In [39]:
w2v_model.wv.most_similar(positive=["bleeding"])

[('bleed', 0.6283478736877441),
 ('blood', 0.4303320050239563),
 ('hematoma', 0.42663756012916565),
 ('bruise', 0.40917515754699707),
 ('infection', 0.40609481930732727),
 ('bruising', 0.4022197723388672),
 ('drain', 0.3894175887107849),
 ('seepage', 0.3595198690891266),
 ('drainage', 0.3547538220882416),
 ('seroma', 0.3515244126319885)]

In [40]:
w2v_model.wv.most_similar(positive=["deflation"])

[('deflate', 0.4429897964000702),
 ('rupture', 0.4291231632232666),
 ('ripple', 0.38684412837028503),
 ('redo', 0.3580423593521118),
 ('rippling', 0.35363781452178955),
 ('bottom', 0.3520503044128418),
 ('slow_leak', 0.34148913621902466),
 ('saline', 0.33866196870803833),
 ('replace', 0.3322773575782776),
 ('redone', 0.314984530210495)]

In [41]:
w2v_model.wv.most_similar(positive=["seroma"])

[('hematoma', 0.5244783163070679),
 ('infection', 0.4332888126373291),
 ('drain', 0.39634811878204346),
 ('develop_hematoma', 0.3937504291534424),
 ('leak', 0.3854773938655853),
 ('hemotoma', 0.3613206148147583),
 ('fluid', 0.35510432720184326),
 ('bleeding', 0.3515244126319885),
 ('staph_infection', 0.3507139980792999),
 ('leakage', 0.34809410572052)]

In [42]:
w2v_model.wv.most_similar(positive=["nerve"])

[('nerve_ending', 0.6059892773628235),
 ('nerve_regenerate', 0.5731019973754883),
 ('nerve_regeneration', 0.5478879809379578),
 ('sensation', 0.5098955035209656),
 ('zinger_nerve', 0.5046253800392151),
 ('numbness', 0.5033354759216309),
 ('zinger', 0.49731311202049255),
 ('numb', 0.4952511787414551),
 ('regenerating', 0.4948093295097351),
 ('regenerate', 0.48305097222328186)]

In [43]:
w2v_model.wv.most_similar(positive=["rash"])

[('red_rash', 0.5656169652938843),
 ('blister', 0.5238522291183472),
 ('itchy_rash', 0.5080475807189941),
 ('hive', 0.5042273998260498),
 ('heat_rash', 0.4852822721004486),
 ('red_bump', 0.4628583788871765),
 ('allergic_reaction', 0.4571758210659027),
 ('itch', 0.4446898400783539),
 ('pimple', 0.43949031829833984),
 ('itchy', 0.4363400936126709)]

In [44]:
w2v_model.wv.most_similar(positive=["necrosis"])

[('infection', 0.3688545227050781),
 ('hematomas', 0.3366979658603668),
 ('hematoma', 0.315030038356781),
 ('wound', 0.3081358075141907),
 ('thicken', 0.30756449699401855),
 ('blood_clot', 0.30197346210479736),
 ('calcification', 0.300001323223114),
 ('develop_hematoma', 0.29688704013824463),
 ('yellow_bruise', 0.29528290033340454),
 ('lymphoma', 0.2944490313529968)]

In [45]:
w2v_model.wv.most_similar(positive=["immune"])

[('candida', 0.36207014322280884),
 ('resistant', 0.34940803050994873),
 ('magnesium', 0.3370727300643921),
 ('flora', 0.3255401849746704),
 ('tocopherol', 0.32351985573768616),
 ('immune_system', 0.3210991322994232),
 ('probiotic', 0.3204610347747803),
 ('fungal', 0.31527093052864075),
 ('bacterial', 0.30764222145080566),
 ('dehydration', 0.3071424663066864)]

In [46]:
w2v_model.wv.most_similar(positive=["breast_feed"])

[('breastfeed', 0.6608566045761108),
 ('breastfe', 0.6179194450378418),
 ('breast_feeding', 0.5206254720687866),
 ('breastfee', 0.5018883347511292),
 ('breastfeeding', 0.48939743638038635),
 ("bf'e", 0.462465763092041),
 ('breastfed', 0.45577239990234375),
 ('pregnancy', 0.42248737812042236),
 ('pregnancy_breastfeeding', 0.4187733232975006),
 ('child', 0.4060478210449219)]

In [47]:
w2v_model.wv.most_similar(positive=["discoloration"])

[('bruise', 0.5153228044509888),
 ('bruising', 0.4760998487472534),
 ('redness', 0.465229868888855),
 ('red', 0.41518741846084595),
 ('hot_touch', 0.38376861810684204),
 ('yellow_bruise', 0.3819354772567749),
 ('fading', 0.3801232576370239),
 ('dark_purple', 0.37573766708374023),
 ('reddish', 0.3738534450531006),
 ('pinkish', 0.37351131439208984)]

In [48]:
w2v_model.wv.most_similar(positive=["mental"])

[('emotional', 0.39350593090057373),
 ('psychological', 0.3814128041267395),
 ('emotion', 0.37410619854927063),
 ('physically_emotionally', 0.3231189250946045),
 ('mentally', 0.3160097002983093),
 ('emotionally', 0.30673450231552124),
 ('brain', 0.2844589352607727),
 ('boobie_blue', 0.2736678719520569),
 ('prepare', 0.26804351806640625),
 ('depressed', 0.26563483476638794)]

In [49]:
w2v_model.wv.most_similar(positive=["depression"])

[('boobie_blue', 0.45089906454086304),
 ('anxiety', 0.4337421655654907),
 ('blue', 0.40142595767974854),
 ('emotion', 0.3916175067424774),
 ('emotional', 0.3762742280960083),
 ('sadness', 0.35681992769241333),
 ('constipation', 0.3499300479888916),
 ('disorder', 0.346127986907959),
 ('emotional_rollercoaster', 0.34565383195877075),
 ('depressed', 0.33530735969543457)]

In [50]:
w2v_model.wv.most_similar(positive=["fatigue"])

[('tired', 0.4214109182357788),
 ('tiredness', 0.37359175086021423),
 ('stamina', 0.35196539759635925),
 ('energy', 0.3400428891181946),
 ('sluggish', 0.3396419882774353),
 ('pain_med', 0.33576294779777527),
 ('exhaust', 0.3322882056236267),
 ('energy_level', 0.32993096113204956),
 ('exhaustion', 0.3265444040298462),
 ('nausea', 0.3223051130771637)]

In [51]:
w2v_model.wv.most_similar(positive=["ptosis"])

[('sag', 0.4873615801334381),
 ('grade_ptosis', 0.39859527349472046),
 ('pseudoptosis', 0.386099249124527),
 ('sagging', 0.38246479630470276),
 ('sagginess', 0.3807069659233093),
 ('droop', 0.35905319452285767),
 ('tuberous_breast', 0.3504652678966522),
 ('laxity', 0.343325138092041),
 ('lift', 0.33654820919036865),
 ('cc', 0.32274481654167175)]

In [52]:
w2v_model.wv.most_similar(positive=["syndrome"])

[('arthritis', 0.3387957215309143),
 ('disease', 0.2901279330253601),
 ('symptom', 0.28212952613830566),
 ('sufferer', 0.2633492648601532),
 ('cervical', 0.2590476870536804),
 ('diabete', 0.24795088171958923),
 ('stem', 0.24393150210380554),
 ('alcl', 0.24265354871749878),
 ('pectus', 0.23527023196220398),
 ('disc', 0.23274171352386475)]

In [53]:
w2v_model.wv.most_similar(positive=["derma"])

[('bio_oil', 0.37564828991889954),
 ('roller', 0.37177008390426636),
 ('mederma', 0.3676777184009552),
 ('kelo_cote', 0.34892141819000244),
 ('lotion', 0.33939167857170105),
 ('coating', 0.33935442566871643),
 ('cocoa', 0.3321106433868408),
 ('maderma', 0.32200783491134644),
 ('foam', 0.31623896956443787),
 ('steri_strip', 0.3149036467075348)]

In [54]:
w2v_model.wv.most_similar(positive=["cancer"])

[('breast_cancer', 0.47749531269073486),
 ('lymphoma', 0.47525709867477417),
 ('cancerous', 0.4169241189956665),
 ('alcl', 0.38720428943634033),
 ('survivor', 0.31787538528442383),
 ('disease', 0.3174636662006378),
 ('health_issue', 0.31698015332221985),
 ('autoimmune_disease', 0.31366193294525146),
 ('health', 0.3057671785354614),
 ('tumor', 0.29919642210006714)]

In [62]:
w2v_model.wv.most_similar(positive=["replacement"])

[('replace', 0.445171594619751),
 ('revision', 0.4066496193408966),
 ('reoperation', 0.40225672721862793),
 ('redo', 0.3876747488975525),
 ('exchange', 0.3786391615867615),
 ('warranty', 0.37226665019989014),
 ('cost', 0.3680764436721802),
 ('capsulectomy', 0.3384048640727997),
 ('eligible', 0.3381170332431793),
 ('bilateral', 0.33667948842048645)]

In [63]:
w2v_model.wv.most_similar(positive=["revision"])

[('redo', 0.7754956483840942),
 ('fix', 0.6805518269538879),
 ('revise', 0.6261372566223145),
 ('complication', 0.551436185836792),
 ('original', 0.5338786840438843),
 ('bottom', 0.5155431628227234),
 ('repair', 0.511853039264679),
 ('internal_bra', 0.5076748132705688),
 ('exchange', 0.5000610947608948),
 ('redone', 0.4951501786708832)]

In [55]:
# Read a text file and tokenize its contents into a list of words
from nltk.tokenize import word_tokenize

# The context manager closes the file once it has been read
with open("breast_implant_complications.txt", newline='') as file:
    result = file.read()
words = word_tokenize(result)

In [56]:
for i in words:
    print(i)

pain
fibromyalgia
arthiritis
mammogram
scar_tissue
pectoral_muscle
sharp_pain
shooting_pain
sharp_stabbing
burn_sensation
flex_distortion
muscle_distortion
capsule
lateral_discplacement
ld
mondor_cord
mondor_chord
internal_bleeding
nerve_regeneration
itchy_rash
heat_rash
allergic_reaction
bacterial_infection
emotional_rollercoaster
mood_swing
tire_easily
energy_level
nausea_vomiting
infection
bacterial
abcess
staph
infect
bacteria
redness
discolouration
bleeding
seep
clot
rupture
ruptured
leak
deflation
deflate
void
overfilled
capsular
contracture
capsular_contracture
capsular_contraction
contraction
hematoma
haematome
hematomas
evacuate
seroma
displacement
lateral
displace
horrible
mild
chronic
discomfort
soreness
sore
tighness
painful
distorted
distort
sligth
shape
palpability
firmness
improper
malposition
calcium
deposits
dimpling
puckering
pucker
pcukered
benelli
wrinkling
asymmetrical
deformity
scarring
sloshing
slosh
gurgling
crackling
squeaking
squishing
toxic
shock
shocked
synd

In [57]:
df = pd.DataFrame(columns = ['Word','Similar_words'])

In [58]:
df

Unnamed: 0,Word,Similar_words


In [59]:
len(words)

123

In [60]:
def similar_words(words, topn):
    # Return the lowercased query words and, for each one, its topn nearest neighbours
    df_word = []
    df_similar_words = []
    for i in words:
        word = i.lower()
        df_word.append(word)
        try:
            df_similar_words.append(w2v_model.wv.most_similar(positive=[word], topn=topn))
        except KeyError:
            # most_similar raises KeyError for words that are not in the vocabulary
            df_similar_words.append("Word not in Dictionary")
    return df_word, df_similar_words

In [61]:
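# Keep the two returned lists so that later cells can use them
df_word, df_similar_words = similar_words(words, 5)
df_word, df_similar_words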

(['pain',
  'fibromyalgia',
  'arthiritis',
  'mammogram',
  'scar_tissue',
  'pectoral_muscle',
  'sharp_pain',
  'shooting_pain',
  'sharp_stabbing',
  'burn_sensation',
  'flex_distortion',
  'muscle_distortion',
  'capsule',
  'lateral_discplacement',
  'ld',
  'mondor_cord',
  'mondor_chord',
  'internal_bleeding',
  'nerve_regeneration',
  'itchy_rash',
  'heat_rash',
  'allergic_reaction',
  'bacterial_infection',
  'emotional_rollercoaster',
  'mood_swing',
  'tire_easily',
  'energy_level',
  'nausea_vomiting',
  'infection',
  'bacterial',
  'abcess',
  'staph',
  'infect',
  'bacteria',
  'redness',
  'discolouration',
  'bleeding',
  'seep',
  'clot',
  'rupture',
  'ruptured',
  'leak',
  'deflation',
  'deflate',
  'void',
  'overfilled',
  'capsular',
  'contracture',
  'capsular_contracture',
  'capsular_contraction',
  'contraction',
  'hematoma',
  'haematome',
  'hematomas',
  'evacuate',
  'seroma',
  'displacement',
  'lateral',
  'displace',
  'horrible',
  'mild'

In [None]:
len(df_word)

In [None]:
len(df_similar_words)
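One possible way to turn the two lists into the Word / Similar_words table prepared earlier is to flatten them into one row per (word, neighbour) pair. This is a sketch under assumptions: the helper name `to_rows`, the column names `Similar_word` and `Score`, and the demo values are all illustrative and not part of the original notebook.

```python
def to_rows(words, neighbours):
    """Pair each query word with its neighbours, one row per (word, neighbour)."""
    rows = []
    for word, similar in zip(words, neighbours):
        if isinstance(similar, str):
            # Placeholder string stored for out-of-vocabulary words
            rows.append({'Word': word, 'Similar_word': None, 'Score': None})
            continue
        for other, score in similar:
            rows.append({'Word': word, 'Similar_word': other, 'Score': score})
    return rows

# Made-up stand-ins for the two lists returned by similar_words()
demo_words = ['seroma', 'arthiritis']
demo_neighbours = [[('hematoma', 0.52), ('infection', 0.43)],
                   'Word not in Dictionary']
rows = to_rows(demo_words, demo_neighbours)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would then give a table with those three columns, and misspelled query words such as 'arthiritis' remain visible as rows without neighbours.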