# In this notebook I will:
* Remove reviews that contain only advertisements (NOT AT THIS TIME)
* Tokenize, lemmatize, and remove stop words, then drop words that show up only once unless they are special (words that indicate a condition, medication, side effect, or caregiver role)
* Rejoin each processed review into a string for bag-of-words (BOW) analysis
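
As a rough illustration of the steps listed above, here is a minimal standard-library sketch. The stop-word list, the "special" vocabulary, and the function names are all hypothetical stand-ins, and lemmatization is left out entirely; the notebook itself relies on spaCy for that.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list and "special" vocabulary
STOP_WORDS = {"i", "the", "a", "and", "my", "was", "it", "on", "to", "of"}
SPECIAL_WORDS = {"rash", "insomnia", "lamictal"}

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(reviews):
    """Drop stop words, then drop words seen once corpus-wide unless special."""
    token_lists = [[t for t in tokenize(r) if t not in STOP_WORDS] for r in reviews]
    counts = Counter(t for toks in token_lists for t in toks)
    kept = [[t for t in toks if counts[t] > 1 or t in SPECIAL_WORDS]
            for toks in token_lists]
    # Rejoin each review into one string for bag-of-words vectorizers
    return [" ".join(toks) for toks in kept]

reviews = [
    "The insomnia was horrendous and my sleep suffered",
    "I sleep fine on Lamictal, no rash",
    "My sleep improved",
]
processed = preprocess(reviews)
print(processed)
```

Words that occur only once survive here only if they are in the special vocabulary, which is the behavior the third bullet describes.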

In [1]:
import pandas as pd
import numpy as np
import glob

# Haven't decided whether I like nltk or spacy better yet
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet#, stopwords
#stops = stopwords.words('english')
import spacy
from spacy.tokenizer import Tokenizer
import en_core_web_lg
nlp = en_core_web_lg.load()

# Magical gensim module
from gensim import corpora
from gensim.models import LsiModel, LdaModel
from gensim.models.coherencemodel import CoherenceModel

# A method to process text in nltk:
# https://pythonhealthcare.org/2018/12/14/101-pre-processing-data-tokenization-stemming-and-removal-of-stop-words/

# same process in spacy
# https://spacy.io/usage/linguistic-features

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# https://stackoverflow.com/questions/13928155/spell-checker-for-python/48280566
from autocorrect import Speller
spell = Speller(lang='en')

In [5]:
# Adjusting stop words in spacy to not lose a bunch of negatives for the sentiment analysis
# for word in [u'nor',u'none',u'not',u'alone',u'no',u'never',u'cannot',u'always']:
#     nlp.vocab[word].is_stop = False
# nlp.vocab[u'thing'].is_stop = True
tokenizer = Tokenizer(nlp.vocab)

# Working on processing text data

In [6]:
def get_wordnet_pos(treebank_tag):
    # https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
#     elif treebank_tag.startswith('NN'):
#         return wordnet.ADJ # Considering ADJ_SET to be same as ADJ
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

def check_PoS(word):
    return get_wordnet_pos(nltk.pos_tag([word])[0][1])

def useful_synonyms(word):
    # Finding PoS of word
    to_pos = check_PoS(word)
    
    # Finding all synonyms in all the parts of speech
    words = []
    syns = wordnet.synsets(word)

    # Chopping down to most common versions of words...this works for side effects more than words like 'cat'
    if len(syns) >= 2:
        synList = syns[:2]
    else:
        synList = syns
#     if len(syns)%2 and (len(syns) != 1):
#         synList = syns[:len(syns)//2]
#     else:
#         synList = syns[:len(syns)//2+1]

    # Finding all the forms of a word
    for syn in synList:
        for l in syn.lemmas():
            form = l.derivationally_related_forms()
            words.append(l.name())
            for f in form:
                words.append(f.name())
                
    # Getting all the unique words that match the desired part of speech
    words = list(np.unique(words))
    pos = nltk.pos_tag(words)
    return_words = [word.replace('_',' ') for word, word_pos in pos if get_wordnet_pos(word_pos)==to_pos]

    # Re-adding the original word if it was dropped despite matching to_pos (e.g., with 'weight')
    if check_PoS(word) == to_pos and word not in return_words:
        return_words.append(word)
        
    return return_words

In [7]:
# Tokenize with spaCy: lowercase, drop stop words and non-alphabetic tokens, return lemmas
def spacyTokenizer(s: str)-> list:
    doc = tokenizer(s.lower().strip())
    tokens = []
    for token in doc:
        if not token.is_stop and token.is_alpha and token.lemma_ != '-PRON-':
            tokens.append(token.lemma_)
        
    return tokens

In [73]:
def parseRevnew(file, return_df=False):
    reviews = pd.read_csv(file, sep='$')['Comment']
    clean_reviews = [spacyTokenizer(rev.replace('/', ' ')) for rev in reviews]
    cleaner_reviews = findTop(clean_reviews, 50)
    cleaner_reviews = [[spell(word.lower()) for word in rev] for rev in cleaner_reviews]
    
    if return_df:
        return cleaner_reviews, reviews
    else:
        return cleaner_reviews#consider, ignore

def parseSEorig(file):
    sideEff = np.genfromtxt(file, delimiter='$', dtype=str)
    clean_SEs = [[spell(word) for word in spacyTokenizer(SE)] for SE in sideEff]
    # NOTE: this TF-IDF-filtered result is computed but never returned
    cleaner_reviews = findTop(clean_SEs, 3)
    
    return clean_SEs
    
def parseSE_FAERs(file, meds):
    sideEff = pd.read_csv(file, sep='$').set_index('Concept ID').dropna(subset=['Percentage observed'])
    sideEff = sideEff.fillna(value='')

    # Looking for the medication name that has the side effects
    meds_obs = sideEff.copy(deep=True)
    meds_obs['Medications observed'] = [obs.split(', ') for obs in meds_obs['Medications observed']]
    
    medList = []
    for obs in meds_obs['Medications observed']: medList += obs
        
    to_check = np.unique(medList)
    
    Found = False
    for med in meds.lower().split(', '):
        if med in to_check: 
            Found = True
            break # Stop when I've found the name

    # Copy the filtered rows so the column assignments below don't act on a view
    sideEff = sideEff[[med in obs for obs in meds_obs['Medications observed']]].copy()
    
    sideEff['Joined'] = sideEff['Definition'] + sideEff['Synonyms']
    check_both = lambda combo: sum([c in nlp.vocab for c in combo.split(' ')]) == len(combo.split(' '))
    sideEff['Joined'] = [', '.join([word for word in words.split(', ') if word.find('-') == -1 and check_both(word)]) for words in sideEff['Joined']]
    clean_SEs = [list(set([spell(word) for word in spacyTokenizer(SE)])) for SE in sideEff['Joined']]
    clean_SEs = [[word for word in SE if len(word) > 3] for SE in clean_SEs]
    clean_SEs = findTop(clean_SEs,5)
    
#     ignore = [SE for SE in clean_SEs if len(SE) <= 2]
#     consider = [SE for SE in clean_SEs if len(SE) > 2]
    
#     # Testing effect of just adding in more language to work with
#     new_consider = []
#     for chunk in consider:
#         extended = []
#         for w in chunk:
#             extended += [s for s in useful_synonyms(w) if s.find('_') == -1]
#         new_consider.append(extended)
    
    return clean_SEs#consider, ignore

# TFIDF section
https://buhrmann.github.io/tfidf-analysis.html

In [25]:
# Not super useful to take the mean across columns, instead look at top 10 scoring words in each side effect
def findTop(strList, keeptop=10):
    tfidf_vectr = TfidfVectorizer()
    corpus = [' '.join(SE) for SE in strList]
    tfidf_score = tfidf_vectr.fit_transform(corpus).toarray()
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    features = np.array(tfidf_vectr.get_feature_names_out())
    
    words = []
    for row in tfidf_score:
        inds = row.argsort()[::-1][:keeptop]
        row_words = []
        for ind in inds:
            if row[ind].round(2) != 0:
                row_words.append(features[ind])
                #print(features[ind],' '*(50-len(features[ind])), row[ind].round(2))
        #print('\n')
        if not row_words:
            row_words = list(features[inds][:5])
        words.append(row_words)
    return words
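
To make the "top scoring words per side effect" idea concrete, here is a small standard-library sketch. It uses the plain `tf * log(N / df)` weighting, which is an assumption made for brevity: scikit-learn's `TfidfVectorizer` smooths the IDF term and L2-normalizes each row, so the actual scores (and sometimes the rankings) from `findTop` will differ. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def top_tfidf_words(token_lists, keep_top=2):
    """Return the keep_top highest-scoring TF-IDF words for each document."""
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each word
    doc_freq = Counter(word for toks in token_lists for word in set(toks))
    top_words = []
    for toks in token_lists:
        term_freq = Counter(toks)
        scores = {
            word: (count / len(toks)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
        # Sort by descending score, breaking ties alphabetically; drop zero scores
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top_words.append([w for w in ranked if scores[w] > 0][:keep_top])
    return top_words

docs = [
    ["weight", "gain", "increase"],
    ["heavy", "bleed", "increase"],
    ["mood", "swing", "increase"],
]
top = top_tfidf_words(docs)
print(top)
```

A word such as "increase" that appears in every document scores zero and is filtered out, which is the main reason for ranking by TF-IDF rather than raw counts.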

# LSA/LDA section

In [26]:
# https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
def genDictandDocMatrix(cleaned_text):
    dictionary = corpora.Dictionary(cleaned_text)
    matrix = [dictionary.doc2bow(doc) for doc in cleaned_text]
    return dictionary, matrix

def formatLSAresult(topics: list) -> None:
    for topic in topics:
        title = "Topic {:g}: \n".format(topic[0])
        term_cluster = [term.strip().split('*')[1][1:-1] for term in topic[1].split('+')]
        term_weight = [term.strip().split('*')[0] for term in topic[1].split('+')]

        print(title, ', '.join(term_cluster),'\n',', '.join(term_weight))
        
def produceLSA(n_topics, cleanText, n_word_report=10):
    dictionary, matrix = genDictandDocMatrix(cleanText)
    lsamodel = LsiModel(matrix, num_topics=n_topics, id2word=dictionary)
    result = lsamodel.print_topics(num_topics=n_topics, num_words=n_word_report)

    return result, lsamodel

def produceLDA(n_topics, cleanText, n_word_report=10):
    dictionary, matrix = genDictandDocMatrix(cleanText)
    ldamodel = LdaModel(matrix, num_topics=n_topics, id2word=dictionary)
    result = ldamodel.print_topics(num_topics=n_topics, num_words=n_word_report)

    return result, ldamodel

#result, model = produceLSA(10, reviews)
#formatLSAresult(result)
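
The string handling in `formatLSAresult` is easier to follow on a single example. The sketch below assumes each topic is a `(topic_id, 'weight*"term" + weight*"term"')` tuple, which is the shape gensim's `print_topics` returns; the sample tuple and the helper name `parse_topic` are made up for illustration.

```python
def parse_topic(topic_string):
    """Split a 'weight*"term" + ...' topic string into (term, weight) pairs."""
    pairs = []
    for chunk in topic_string.split(" + "):
        weight, term = chunk.strip().split("*")
        pairs.append((term.strip('"'), float(weight)))
    return pairs

# Hypothetical topic tuple, shaped like one element of print_topics() output
example_topic = (0, '0.612*"sleep" + 0.401*"insomnia" + -0.220*"rash"')
terms = parse_topic(example_topic[1])
print(terms)
```

Returning pairs instead of printing makes it possible to collect the terms per topic, which is what the commented-out dictionary code further down appears to be aiming for.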

In [27]:
# Testing out idea of randomly joining side effects and pulling out concepts
from random import shuffle

#test = cleanSEs.copy()

# Joining test results randomly
def try_shuffled_LSA(test, numjoin=5):
    joined_test = []
    inds = np.arange(len(test))
    shuffle(inds)

    if inds.size % numjoin:
        extras = inds[-(inds.size % numjoin):]
        evendiv = inds[:-(inds.size % numjoin)]
        inds = evendiv.reshape((-1,numjoin))
    else:
        extras = None
        inds = inds.reshape((-1,numjoin))


    for ind_set in inds:
        new_join = []
        for ind in ind_set: new_join += test[ind]
        joined_test.append(new_join)

    if extras is not None:
        for i,ind in enumerate(extras):
            joined_test[i] += test[ind]

    result, model = produceLDA(len(test)//numjoin, joined_test, 10)
    formatLSAresult(result)
#     topics = {}
#      for topic in result:
#         title = "Topic {:g}: \n".format(topic[0])
#         term_cluster = [term.strip().split('*')[1][1:-1] for term in topic[1].split('+')]
#         term_weight = [term.strip().split('*')[0] for term in topic[1].split('+')]
        
#         topics[topic[0]] = term_cluster
        
#     return topics
    
    
def process_shuffled_results(topic_dict_list):
    for topics_dict in topic_dict_list:
        word_pile = []
        for key in topics_dict:
            word_pile.append(topics_dict[key])
        word_pile = np.array(word_pile)

In [28]:
#try_shuffled_LSA(test)

# Now checking for side effects in WebMD reviews

In [74]:
def find_sideEffects_inReviews_FAERsinformed(revFile, sefile1, sefile2, faers=True):

    # Parsing reviews
    reviews, fullrevs = parseRevnew(revFile, return_df=True)
    if faers:
        cond = sefile1[sefile1.find('faers_results/')+14:sefile1.rfind('/')]
        medication = revFile[revFile.rfind('/')+1:revFile.find('_'+cond)]
        meds = pd.read_csv('UniqueMedications/Medications_unique_{:s}.csv'.format(cond), sep='$')['All names']
        meds = [allnames for allnames in meds if medication.lower() in allnames.lower().split(', ')][0]

    # Parsing side effects: the original list, plus the FAERS-derived list when requested
    listSEs = parseSEorig(sefile2)
    if faers:
        listSEs = parseSE_FAERs(sefile1, meds) + listSEs
    
    BagOSE = ' '.join([' '.join(SE) for SE in listSEs])

    # Counting, for each review, how many of its words appear in each side effect's word list
    found = []
    for ind, rev in enumerate(reviews):
        item = {}
        for SE in listSEs:
            # Match words in reviews to side effects and then add them to found, build dataframe with this info
            item[', '.join(SE)] = len([word for word in rev if word.lower() in SE])
        found.append(item)
    
    SE_match = pd.DataFrame(found)
    SE_match['Full Review'] = fullrevs.values
    
    # Return the master product
    return SE_match
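
The matching step can be tried in isolation on toy data. This is a minimal sketch with hypothetical inputs and a hypothetical helper name, `count_matches`; it builds plain dictionaries, whereas the function above wraps the same rows in a `pd.DataFrame`.

```python
def count_matches(review_tokens, side_effects):
    """For each review, count tokens that appear in each side-effect word list."""
    rows = []
    for tokens in review_tokens:
        row = {}
        for effect_words in side_effects:
            label = ", ".join(effect_words)
            row[label] = sum(1 for tok in tokens if tok.lower() in effect_words)
        rows.append(row)
    return rows

# Hypothetical pre-processed reviews and side-effect word lists
review_tokens = [["gain", "weight", "headache"], ["sleep", "great"]]
side_effects = [["weight", "gain"], ["headache"], ["trouble", "sleep"]]
match_rows = count_matches(review_tokens, side_effects)
print(match_rows)
```

Each row maps a comma-joined side-effect label to a match count, so the column names double as a record of which words defined the side effect.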

In [97]:
df = find_sideEffects_inReviews_FAERsinformed('ProcessedReviews/Bipolar-Disorder/Lamictal_Bipolar-Disorder_parsed_reviews.csv', 
                                              'NERstuff/faers_results/Bipolar-Disorder/SideEffectsExtracted.csv',
                                              'SideEffects/Bipolar-Disorder_SideEffects.csv')

# df = find_sideEffects_inReviews_FAERsinformed('ProcessedReviews/Bipolar-Disorder/Lamictal_Bipolar-Disorder_parsed_reviews.csv', 
#                                               'moddedSideEffects/Bipolar-Disorder_SideEffects_denormed.csv')

In [127]:
newdf = df.drop(columns='Full Review')
review_inds = []

# Allowing one- and two-item side effects to count on a single match UNLESS they contain a very generic word
generic = ['feel', 'pain', 'abnormal', 'change', 'disorder', 'problem', 'decrease', 'increase', 'loss']
colLens = np.array([len(col.split(', ')) + 2*any(g in col for g in generic) for col in newdf.columns])

for ind in newdf.index:
    if (((colLens < 3) & newdf.loc[ind].gt(0)) | newdf.loc[ind].gt(1)).sum(): 
        review_inds.append(ind)
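
The selection rule in this cell can be restated without pandas. The sketch below is one reading of that logic, with hypothetical helper names: a label's effective length is its word count plus a penalty of 2 when it contains a generic word, and a review is flagged when a short label matches at least once or any label matches more than once.

```python
GENERIC_WORDS = ("feel", "pain", "abnormal", "change", "disorder",
                 "problem", "decrease", "increase", "loss")

def effective_length(label):
    """Word count of a side-effect label, plus 2 if it contains a generic word."""
    penalty = 2 if any(g in label for g in GENERIC_WORDS) else 0
    return len(label.split(", ")) + penalty

def is_flagged(match_counts):
    """Flag a review if a short label matches once or any label matches twice."""
    return any(
        (effective_length(label) < 3 and count > 0) or count > 1
        for label, count in match_counts.items()
    )

print(is_flagged({"trouble, sleep": 1}))                # short label, one match
print(is_flagged({"weight, gain, body, increase": 1}))  # long label, one match
print(is_flagged({"mood, change": 2}))                  # generic label, two matches
```

Because the generic check is a substring test, a label such as "mood, change" is treated as long even though it has only two words, so it needs two matching words before a review is flagged.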

In [128]:
for ind in review_inds:
    print(df.loc[ind]['Full Review'], '\n\n', 
          newdf.loc[ind][np.logical_or(np.logical_and((colLens < 3), newdf.loc[ind].gt(0)), newdf.loc[ind].gt(1))],
          '\n'*5)

I sleep just fine, I started at 25 mg and slowly went up to 100 mg currently. I'm tired and sleep great.no rash, no unbearable side effects. 

 trouble, sleep    1
Name: 0, dtype: int64 





I did not get much sleep while on this medicine. The insomnia side effect is horrendous. Even with adding Ambien to the mix, I still would watch the sun rise. Also, the depersonalization side effect is pretty bad. I just didn't care about anything, and my passion for art was completely gone.

I won't ever take this medicine again.  

 trouble, sleep    1
Name: 1, dtype: int64 






 like, kidney, pressure, high, failure    2
Name: 3, dtype: int64 





I've been taking this medication for about a year and it's been very effective for me. It helps me feel more steady and stable, and avoid the constant highs and lows of bipolar. My doctor upped my dose when I began a severe bout of sleepwalking after coming off a strong sleep med. After increasing lamotrigine I stopped sleepwalking. My experience

headache and mood change 

 headache    1
Name: 151, dtype: int64 





I have tried a variety of Meds. Most seem to help in the beginning then eventually taper off, causing my doctor to "switch" meds. I have been on lamictal about 6 months now, in combination with effexor, klonopin and trazodone. I felt a drastic improvement with the lamictal @ about 8 weeks time. Continued to be extremely satisfied with the med untill the generic form was released. I still take the medication but it makes me nauseated, have noticed a decline in effectiveness. My doctor is happy to require brand name but I do not want the HASSLE OF THE INSURANCE COMPANY! Brand name lamictal, I reccommened 100% for anyone suffering from bi-polar depression. Generic ...I hope you have a hardy stomach!!! 

 cause    1
Name: 153, dtype: int64 





I am a 48 year old male. I had been taking brand name Lamictal since 2005. I was switched to generic Lamotrigine in August 2008. In late September, my left breast became very s

I'm on Adderall 20mg & Lamictal 250mg. I have always had depression, but after 7 yrs in bed due to the death of my mother in 2003 & only child in 2007 at the age of 23 I finally went to a Psyc 1 1/2 yr ago & was dx with Bipolar Depression. I fought dx because of no manic (whoo hoo, happy) moods, then was told it was due to my anger & the depression was so deep it was very hard to come out of. I was put on Adderall with no side effects. It does get me out of bed in the am. My husband calls Lamictal the miracle drug. That's because he was on the receiving end of my anger rants & they were really bad! I was doing great with my medication for a while, but I can't afford therapy & it is a daily struggle to keep myself motivated. I do get unscheduled motivation, what I call "A bug up my ..." & I go until I can't, but they are far & few in between. I think I need an increase in my Lamictal. I have noticed that I am having a hard time with my anger lately. All in all I am satisfied with Lamict

Did not notice any changes at first.  Then noticed an uncontollable rage I have never felt before.  My body started  reacting to my menstral cycles, symptoms lasted longer til I was actually on my period for 10 days at a time with heavy bleeding and large clotting.  I went to see doctor to find out I had gained 20 lbs in less than 3 months and he didnt think it had anything to do with med but said I could stop taking and did not wean me off even though I was very slowly brought up to dose. Suffered withdrawel of headaches, weepiness, and insomnia.  But my menstration is back to normal. And I do not feel rage, but still am 20 lbs heavier! 

 bleed                           1
gain, weight, body, increase    2
withdrawal, drug                1
heavy, bleed                    2
stop, menstrual, bleed          2
weight, gain                    1
Name: 440, dtype: int64 





I have bipolar I disorder. Been on it for over 3 months now. I should have seen some change in my moods by now. I see

This drug has given me my life back! I am up to 450mg and love it. I did have a problem however for a little while, there is a generic out now and its doesnt react as well as the name brand. My therapist stated this is the case in %50 of her practice. Seems to me they need to do more research before the keep it out. 
Other than that it is wonderful. My racing thoughts have gone down, my anxiety, in conjunction with effexor, is almost non existant and the mood swings have stabilized. I never knew what being "normal" was till I started taking this drug. Thanks!  

 withdrawal, drug                             1
interaction, take, medical, certain, drug    2
wide, mood, swing                            2
Name: 582, dtype: int64 





I have been struggling with my schizoaffective-bipolar for five or six years. I'm currently on a drug cocktail right now (300 mg Lamictal, 650 mg Seroquel, 300 mg Wellbutrin, and 1500 mg Metformin), but Lamictal is the only medicine prescribed that addresses

I also am experiencing swollen lymph glands.  Overall though it has helped a lot, and the side effects are very mild compared to other medicines I have tried. 

 node, lymph, swell, enlarge    2
Name: 721, dtype: int64 





I developed swollen lymph glands. 

 node, lymph, swell, enlarge    2
Name: 722, dtype: int64 





Exactly 2 weeks after starting lamictal, I developed a serious tremor in my left hand, headache and dizziness, and suicidal thoughts. Noone would listen to me about it being a reaction to the med. 3 different practitioners made me feel I was losing my mind. 5 days after stopping it symptoms got better and improved daily until gone. Never again will I be made to feel like I don't know my own body. 

 headache    1
Name: 723, dtype: int64 





Do not take with Acetaminophen.
NO Darvocet, NO Lorcet, No Vicodin.
Tylenol increases the rate Lamictal is broken down. Very bad mood swing and fatigue.
OK to TAKE Vicoprofen.
ALSO, adjust dosage with Depakote.
I love Lamic

Have been on 800mgs of lamictal for yrs. Only had one hospitalization due to MDs not knowing that blurry vision is a side effect. I also take 450 msg of seroquel after zyprexa failed. Also take 300 msg of welbutrin and 1.5 mg of klonopin. This has been my cocktail for years and seems to work well together. I've had to deal with BP for over 40 yrs. Very light sensitive with SAD in winter. This year had a bout of mania from too much sun. Mania is not fun at all. Have had 4 rounds of ECT during my times when the depression was very serious. It's not bad and doesn't hurt. Highly recommend it. Stops depression on the spot. Lamictal when used with other meds can be a real life saver and it keeps the mood swings farther apart. Nothing can ever cure BP so the best thing to do is structure your life around eating,exercising for those with weight gain plus it gets the endorphins firing off which feels good, getting regular sleep no matter what. The brain relies on structured environments when it

I finally was diagnoised with bpd/depression when I was 50 yrs old. Up til then i was only treated for depression. I of course tried many bpd medications and lamictal was the only one to finally help and did almost immediately now after getting up to 75 mg and 6 months of 20 mg of celexa I'm moving up to 100 of lamictal and 40 of celexa.  I'm look forward to the balance of emotions that I need especially with my husband being terminally ill its hard to control depression right now. But the mania is so much better!!!! 

 action, control             1
uncontrolled, especially    1
Name: 1011, dtype: int64 





Took the maintenance dose for half a year. The following side effects were experienced: yawning constantly/eyes watering, fatigue/tired/ emotional numbing, easily bruising/petechiae down both of my arms quite often, weight gain.  You might think that "emotional numbing" would be an appropriate result for this indication (suspected bipolar); however, just a couple weeks after all t

In [129]:
len(review_inds), newdf.shape

(729, (1124, 344))