## Discovering impact of the Series 'Euphoria' through NLP
### Analysis based on posts and comments on the `r/euphoria` subreddit  

#### 3. LDA - Topic Modeling

*Every document is a probability distribution over topics*

*goal*: LDA learns the topic mix in each document and the word mix in each topic   

*how*: LDA starts by assigning a random topic to every word (these assignments will be wrong). It then iterates: for each word, it looks at how often each topic occurs in the document and how often the word occurs in each topic overall. Based on this information, it assigns the word a new topic.
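The loop described above can be sketched as a toy collapsed Gibbs sampler. This is only an illustration of the idea: `gensim` fits LDA with a variational method rather than this sampler, and the function name, the `alpha`/`beta` smoothing values, and the toy corpus are assumptions made for the example.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # total words per topic
    assignments = []

    # Step 1: assign a random topic to every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample the topic of each word.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k_old = assignments[d][i]
                # Take the current assignment out of the counts.
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                # weight(k) = (how common topic k is in this doc)
                #           * (how common word w is in topic k overall)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1
    return doc_topic, topic_word

# Hypothetical toy corpus: word ids 0-2 and 3-5 form two loose groups.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
print(doc_topic)
```

Each row of `doc_topic` is the (unnormalized) topic mix of one document, and each row of `topic_word` is the word mix of one topic.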

`k = 2` is a good starting point for the number of topics  

*input*: term-document matrix (TDM), number of topics `k`, number of iterations  
*output*: top words in each topic; check whether they make sense

*tools*:  
`gensim`

alternate factorization methods: 
- NMF
- LSI
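As a rough illustration of the factorization idea behind LSI, the sketch below applies a truncated SVD to a small term-document matrix with `numpy`. The counts and term list are made up for the example; in practice `gensim`'s `LsiModel` would be the tool to use on the real corpus.

```python
import numpy as np

# Hypothetical term-document counts: rows are terms, columns are documents.
terms = ["weed", "shrooms", "elliot", "fez"]
tdm_toy = np.array([
    [4, 0, 3],
    [2, 0, 1],
    [0, 5, 1],
    [0, 2, 0],
], dtype=float)

k = 2  # number of latent topics to keep
U, s, Vt = np.linalg.svd(tdm_toy, full_matrices=False)
term_topic = U[:, :k] * s[:k]   # term loadings on each latent topic, shape (4, 2)
topic_doc = Vt[:k, :]           # document coordinates in topic space, shape (2, 3)

# Terms with the largest absolute loading describe each latent topic.
for topic in range(k):
    order = np.argsort(-np.abs(term_topic[:, topic]))
    print(topic, [terms[i] for i in order[:2]])
```

Unlike LDA, the loadings here can be negative and are not probabilities, which makes the topics harder to read as word distributions.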

#### 3 posts in the r/euphoria subreddit  
N ~ 1,709 comments
1. Question: Does euphoria make you less likely to try drugs? Or are you more curious than you were before?
2. Not enough people are talking about Elliot's response to Rue telling him about her plan to get "free" drugs from Laurie
3. As an ex-opioid addict, Zendaya's withdrawal scenes are the most realistic portrayal I've ever seen before. her acting is phenomenal.

**Try 1**

In [1]:
# bring in data
import pandas as pd
import pickle

data = pd.read_pickle('../dat/tdm_stop.pkl')
data

post,aa,aana,ab,aback,abby,abhorrence,ability,able,abroad,absolute,...,zealand,zendaya,zendayas,zero,zoloft,zombie,zone,zoo,zooming,zs
smur2x,1,0,0,0,1,0,0,9,0,4,...,1,8,1,0,0,0,1,0,1,1
sn2vpk,2,1,1,1,0,0,2,6,0,0,...,0,2,0,1,0,1,0,0,0,0
sqhl33,3,0,1,0,0,1,0,14,1,4,...,0,4,0,3,1,1,1,1,0,0


In [2]:
from gensim import matutils, models
import scipy.sparse as sp

In [3]:
tdm = data.transpose()
tdm.head()

post,smur2x,sn2vpk,sqhl33
aa,1,2,3
aana,0,1,0
ab,0,1,1
aback,0,1,0
abby,1,0,0


In [5]:
# put tdm in gensim format
sparse_counts = sp.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
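To make the gensim format concrete: each document becomes a list of `(term_id, count)` pairs with the zero counts left out. The helper below is a plain-Python sketch of that conversion, not gensim's own code, applied to the five rows of `tdm.head()` shown above; `columns_to_bow` is a name made up for the example.

```python
def columns_to_bow(term_doc_counts):
    """Turn a terms x documents count table into one bag-of-words list per document."""
    num_docs = len(term_doc_counts[0])
    corpus = []
    for doc in range(num_docs):
        bow = [(term_id, row[doc])
               for term_id, row in enumerate(term_doc_counts)
               if row[doc] != 0]
        corpus.append(bow)
    return corpus

# Rows: aa, aana, ab, aback, abby; columns: smur2x, sn2vpk, sqhl33.
tdm_head = [
    [1, 2, 3],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
]
bows = columns_to_bow(tdm_head)
for post, bow in zip(["smur2x", "sn2vpk", "sqhl33"], bows):
    print(post, bow)
# smur2x [(0, 1), (4, 1)]
# sn2vpk [(0, 2), (1, 1), (2, 1), (3, 1)]
# sqhl33 [(0, 3), (2, 1)]
```

This is why the matrix is transposed first: the conversion treats columns as documents and rows as terms.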

In [6]:
# a dictionary of all terms - required by gensim
# cv contains the whole vocabulary of the corpus
with open('../dat/cv_Stop.pkl', 'rb') as f:
    cv = pickle.load(f)
# invert term -> id into id -> term
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [7]:
# we have corpus and id2word, now we can create the lda model
# specify other parameters
# more passes over the corpus give the model more chances to converge,
# so the topics may make more sense
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)

In [8]:
lda.print_topics()

[(0,
  '0.011*"weed" + 0.010*"makes" + 0.009*"try" + 0.007*"opiates" + 0.006*"season" + 0.006*"watching" + 0.006*"euphoria" + 0.006*"drug" + 0.006*"life" + 0.006*"feel"'),
 (1,
  '0.009*"think" + 0.009*"elliot" + 0.007*"addict" + 0.007*"jules" + 0.006*"going" + 0.006*"drug" + 0.005*"got" + 0.004*"said" + 0.004*"laurie" + 0.004*"good"')]
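The strings returned by `print_topics()` are awkward to compare across runs. A small sketch for turning one of them into `(word, weight)` pairs is shown here; `parse_topic` is a helper name made up for this note (gensim's `show_topic` returns such pairs directly from the model).

```python
import re

def parse_topic(topic_string):
    """Split a print_topics() string into (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)]

# Topic 0 from the 2-topic model above, truncated to its first three terms.
topic_0 = '0.011*"weed" + 0.010*"makes" + 0.009*"try"'
pairs = parse_topic(topic_0)
print(pairs)
# [('weed', 0.011), ('makes', 0.01), ('try', 0.009)]
```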

In [9]:
# lda for 3 topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.001*"try" + 0.001*"think" + 0.000*"makes" + 0.000*"weed" + 0.000*"drug" + 0.000*"addict" + 0.000*"opiates" + 0.000*"going" + 0.000*"season" + 0.000*"doing"'),
 (1,
  '0.012*"weed" + 0.011*"makes" + 0.010*"try" + 0.007*"opiates" + 0.007*"season" + 0.007*"watching" + 0.006*"euphoria" + 0.006*"life" + 0.006*"drug" + 0.006*"feel"'),
 (2,
  '0.010*"think" + 0.009*"elliot" + 0.008*"addict" + 0.007*"jules" + 0.006*"going" + 0.006*"drug" + 0.005*"got" + 0.005*"said" + 0.005*"laurie" + 0.004*"good"')]

In [10]:
# lda for 4 topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.001*"weed" + 0.001*"try" + 0.001*"makes" + 0.000*"think" + 0.000*"going" + 0.000*"opiates" + 0.000*"watching" + 0.000*"make" + 0.000*"drug" + 0.000*"got"'),
 (1,
  '0.008*"withdrawal" + 0.006*"addict" + 0.006*"yawning" + 0.006*"going" + 0.005*"episode" + 0.004*"feel" + 0.004*"clean" + 0.004*"withdrawals" + 0.004*"life" + 0.003*"bad"'),
 (2,
  '0.000*"got" + 0.000*"addict" + 0.000*"think" + 0.000*"going" + 0.000*"feel" + 0.000*"makes" + 0.000*"life" + 0.000*"weed" + 0.000*"good" + 0.000*"drug"'),
 (3,
  '0.009*"think" + 0.009*"weed" + 0.008*"makes" + 0.007*"drug" + 0.007*"try" + 0.006*"addict" + 0.006*"elliot" + 0.005*"going" + 0.005*"life" + 0.005*"got"')]

**Try 2 - nouns only**  

Part-of-speech filtering with `nltk`

In [11]:
from nltk import word_tokenize, pos_tag

def nouns(text):
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)

In [13]:
# read clean data
data_clean = pd.read_pickle('../dat/corpus.pkl')
data_clean

post,body,post_q
smur2x,first congrats and continued success on your s...,Likely to try drugs
sn2vpk,i think the difference between elliot and rue ...,Elliots response to free drugs
sqhl33,i already smoke weed and do psychedelics but i...,Realistic portrayal of withdrawal


In [15]:
# filter so only nouns are left
data_nouns = pd.DataFrame(data_clean.body.apply(nouns))
data_nouns

post,body
smur2x,congrats success sobriety show exaddict downsi...
sn2vpk,i difference elliot rue health issues life ell...
sqhl33,i weed psychedelics i opiates everything i add...


In [17]:
# create new dtm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

# add additional stop words
add_stop_words = ['i', 'like','just','rue','did','really','people','way','know','use',
                  'time','drugs','want','does','addiction']
# a list is the documented type for custom stop words in CountVectorizer
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# dtm
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.body)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn


post,aa,aana,aback,abhorrence,ability,absolute,absolutes,absorption,abuse,abuser,...,yuck,yum,yup,zendaya,zendayas,zombie,zone,zoo,zooming,zs
smur2x,1,0,0,0,0,2,0,1,0,0,...,1,1,0,6,1,0,0,0,1,1
sn2vpk,0,1,1,0,2,0,1,0,4,0,...,0,0,0,2,0,1,0,0,0,0
sqhl33,1,0,0,1,0,2,0,0,6,1,...,0,0,1,4,0,1,1,1,0,0


In [18]:
# create gensim corpus
corpusn = matutils.Sparse2Corpus(sp.csr_matrix(data_dtmn.transpose()))
# vocab
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [19]:
# 2 topics
ldan = models.ldamodel.LdaModel(corpus=corpusn, id2word=id2wordn, num_topics=2, passes=10)
ldan.print_topics()

[(0,
  '0.001*"drug" + 0.001*"life" + 0.001*"weed" + 0.001*"episode" + 0.001*"opiates" + 0.001*"thing" + 0.001*"season" + 0.001*"pain" + 0.001*"jules" + 0.001*"day"'),
 (1,
  '0.015*"drug" + 0.012*"addict" + 0.012*"life" + 0.010*"season" + 0.010*"weed" + 0.009*"episode" + 0.009*"opiates" + 0.009*"jules" + 0.008*"years" + 0.007*"lot"')]

In [20]:
# 3 topics
ldan = models.ldamodel.LdaModel(corpus=corpusn, id2word=id2wordn, num_topics=3, passes=10)
ldan.print_topics()

[(0,
  '0.014*"season" + 0.014*"life" + 0.014*"weed" + 0.013*"drug" + 0.013*"opiates" + 0.011*"addict" + 0.011*"euphoria" + 0.010*"episode" + 0.010*"pain" + 0.009*"years"'),
 (1,
  '0.001*"drug" + 0.001*"life" + 0.001*"addict" + 0.001*"jules" + 0.001*"pain" + 0.001*"years" + 0.001*"episode" + 0.001*"thing" + 0.001*"season" + 0.001*"elliot"'),
 (2,
  '0.021*"jules" + 0.018*"elliot" + 0.016*"drug" + 0.015*"addict" + 0.009*"plan" + 0.008*"heroin" + 0.008*"rues" + 0.008*"thing" + 0.007*"life" + 0.007*"laurie"')]

In [21]:
# 4 topics
ldan = models.ldamodel.LdaModel(corpus=corpusn, id2word=id2wordn, num_topics=4, passes=10)
ldan.print_topics()

[(0,
  '0.022*"jules" + 0.019*"elliot" + 0.017*"drug" + 0.016*"addict" + 0.010*"plan" + 0.009*"heroin" + 0.009*"rues" + 0.008*"thing" + 0.008*"life" + 0.007*"laurie"'),
 (1,
  '0.002*"elliot" + 0.002*"drug" + 0.001*"addict" + 0.001*"jules" + 0.001*"life" + 0.001*"heroin" + 0.001*"thing" + 0.001*"rues" + 0.001*"episode" + 0.001*"things"'),
 (2,
  '0.015*"season" + 0.014*"life" + 0.014*"weed" + 0.014*"drug" + 0.013*"opiates" + 0.011*"addict" + 0.011*"euphoria" + 0.011*"episode" + 0.010*"pain" + 0.010*"years"'),
 (3,
  '0.001*"drug" + 0.001*"life" + 0.001*"pain" + 0.001*"weed" + 0.001*"episode" + 0.001*"addict" + 0.001*"opiates" + 0.001*"euphoria" + 0.001*"season" + 0.001*"things"')]

**Try 3 - nouns and adjectives**

In [24]:
def nouns_adj(text):
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)
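Both filters rely on the Penn Treebank tag prefixes: noun tags start with `NN` and adjective tags with `JJ`. The sketch below shows the prefix test on hand-written tags, so it runs without the `nltk` tagger data; the tags are assumptions for illustration, not `pos_tag` output, and `keep_tags` is a made-up helper name.

```python
def keep_tags(tagged_tokens, prefixes=("NN", "JJ")):
    """Keep words whose tag starts with one of the given prefixes."""
    return " ".join(word for word, tag in tagged_tokens if tag.startswith(prefixes))

# Hand-written tags for illustration; real tags would come from nltk.pos_tag.
tagged = [("i", "NN"), ("think", "VBP"), ("the", "DT"), ("mental", "JJ"),
          ("health", "NN"), ("issues", "NNS"), ("are", "VBP"), ("real", "JJ")]

nouns_only = keep_tags(tagged, prefixes=("NN",))
nouns_and_adjectives = keep_tags(tagged)
print(nouns_only)
# i health issues
print(nouns_and_adjectives)
# i mental health issues real
```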

In [25]:
data_nouns_adj = pd.DataFrame(data_clean.body.apply(nouns_adj))
data_nouns_adj

post,body
smur2x,congrats continued success sobriety incredible...
sn2vpk,i difference elliot rue mental health issues m...
sqhl33,i weed psychedelics i opiates everything i add...


In [36]:
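# dtm; max_df=0.8 drops terms that appear in more than 80% of the documents
# (with 3 posts, that means terms that appear in all three)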
cvna = CountVectorizer(stop_words=stop_words, max_df=0.8)
data_cvna = cvna.fit_transform(data_nouns_adj.body)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna


post,aa,aana,ab,aback,abby,abhorrence,ability,absolute,absolutes,absorption,...,youtube,yuck,yum,yup,zendayas,zombie,zone,zoo,zooming,zs
smur2x,1,0,0,0,1,0,0,4,0,1,...,0,1,1,0,1,0,0,0,1,1
sn2vpk,0,1,1,1,0,0,2,0,1,0,...,0,0,0,1,0,1,0,0,0,0
sqhl33,3,0,0,0,0,1,0,4,0,0,...,1,0,0,1,0,1,1,1,0,0
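The effect of `max_df=0.8` on a three-document corpus can be sketched in plain Python: a term is dropped when its document frequency exceeds `0.8 * 3 = 2.4`, that is, when it occurs in all three posts. This is consistent with `zendaya`, which has non-zero counts for every post in the first matrix, no longer appearing in the columns above. The token lists below are made up for the example, and `apply_max_df` is only a sketch of the rule, not scikit-learn's code.

```python
def apply_max_df(doc_tokens, max_df):
    """Drop terms whose document frequency exceeds max_df * number of documents."""
    n_docs = len(doc_tokens)
    doc_freq = {}
    for tokens in doc_tokens:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if df <= limit)

# Hypothetical three-document corpus, one token list per post.
docs = [
    ["zendaya", "weed", "fun"],
    ["zendaya", "elliot", "fez"],
    ["zendaya", "weed", "yawning"],
]
kept = apply_max_df(docs, max_df=0.8)
print(kept)
# ['elliot', 'fez', 'fun', 'weed', 'yawning']
```

Removing the terms shared by every post leaves only vocabulary that separates the posts, which may explain why the topics in this try look more distinct.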


In [None]:
# create gensim corpus
corpusna = matutils.Sparse2Corpus(sp.csr_matrix(data_dtmna.transpose()))
# vocab
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [39]:
# 2 topics
ldana = models.ldamodel.LdaModel(corpus=corpusna, id2word=id2wordna, num_topics=2, passes=10)
ldana.print_topics()

[(0,
  '0.007*"yawning" + 0.005*"realistic" + 0.004*"opioids" + 0.004*"wd" + 0.004*"cold" + 0.003*"dopamine" + 0.002*"portrayal" + 0.002*"track" + 0.002*"ep" + 0.002*"acting"'),
 (1,
  '0.017*"weed" + 0.012*"elliot" + 0.006*"fun" + 0.005*"psychedelics" + 0.005*"idea" + 0.004*"kid" + 0.004*"shrooms" + 0.004*"addictive" + 0.004*"party" + 0.004*"interested"')]

In [40]:
# 3 topics
ldana = models.ldamodel.LdaModel(corpus=corpusna, id2word=id2wordna, num_topics=3, passes=10)
ldana.print_topics()

[(0,
  '0.001*"weed" + 0.001*"elliot" + 0.000*"psychedelics" + 0.000*"fun" + 0.000*"idea" + 0.000*"shrooms" + 0.000*"addictive" + 0.000*"opioids" + 0.000*"kid" + 0.000*"party"'),
 (1,
  '0.020*"weed" + 0.007*"psychedelics" + 0.005*"shrooms" + 0.005*"fun" + 0.005*"opioids" + 0.004*"addictive" + 0.004*"interested" + 0.003*"party" + 0.003*"couple" + 0.003*"desire"'),
 (2,
  '0.028*"elliot" + 0.009*"kid" + 0.007*"idea" + 0.007*"fez" + 0.006*"line" + 0.005*"fun" + 0.005*"hes" + 0.005*"social" + 0.004*"character" + 0.004*"moment"')]

In [41]:
# 4 topics
ldana = models.ldamodel.LdaModel(corpus=corpusna, id2word=id2wordna, num_topics=4, passes=10)
ldana.print_topics()

[(0,
  '0.026*"weed" + 0.009*"psychedelics" + 0.007*"shrooms" + 0.006*"fun" + 0.006*"addictive" + 0.005*"interested" + 0.004*"party" + 0.004*"opioids" + 0.004*"desire" + 0.003*"scary"'),
 (1,
  '0.001*"weed" + 0.001*"elliot" + 0.000*"psychedelics" + 0.000*"fun" + 0.000*"idea" + 0.000*"party" + 0.000*"interested" + 0.000*"addictive" + 0.000*"shrooms" + 0.000*"opioids"'),
 (2,
  '0.031*"elliot" + 0.009*"kid" + 0.008*"idea" + 0.008*"fez" + 0.007*"line" + 0.006*"fun" + 0.006*"hes" + 0.005*"social" + 0.005*"character" + 0.004*"moment"'),
 (3,
  '0.010*"yawning" + 0.007*"realistic" + 0.005*"opioids" + 0.005*"cold" + 0.005*"wd" + 0.004*"dopamine" + 0.003*"portrayal" + 0.003*"track" + 0.003*"ep" + 0.003*"acting"')]

**ID topics**

The last model makes the most sense: 4 topics with nouns and adjectives.

In [42]:
# 4 topics - more passes
ldana = models.ldamodel.LdaModel(corpus=corpusna, id2word=id2wordna, num_topics=4, passes=100)
ldana.print_topics()

[(0,
  '0.026*"weed" + 0.009*"psychedelics" + 0.007*"shrooms" + 0.006*"fun" + 0.006*"addictive" + 0.005*"interested" + 0.004*"party" + 0.004*"opioids" + 0.004*"desire" + 0.003*"scary"'),
 (1,
  '0.000*"bloodstream" + 0.000*"blessing" + 0.000*"shoot" + 0.000*"tough" + 0.000*"lie" + 0.000*"psychopath" + 0.000*"prevalent" + 0.000*"thread" + 0.000*"powerful" + 0.000*"bot"'),
 (2,
  '0.031*"elliot" + 0.010*"kid" + 0.008*"idea" + 0.008*"fez" + 0.007*"line" + 0.006*"hes" + 0.006*"fun" + 0.005*"social" + 0.005*"character" + 0.004*"moment"'),
 (3,
  '0.010*"yawning" + 0.007*"realistic" + 0.005*"cold" + 0.005*"wd" + 0.005*"opioids" + 0.004*"dopamine" + 0.003*"ep" + 0.003*"track" + 0.003*"portrayal" + 0.003*"incredible"')]

**Topics**

- topic 0: types of drugs, party
- topic 1: drug actions (?)
- topic 2: characters in the show
- topic 3: effects of drugs (?)

In [43]:
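# which topic is most likely for each document?
# (taking the max avoids an unpacking error when a document has more than one topic)
corpus_transformed = ldana[corpusna]
list(zip([max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed], data_dtmna.index))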

[(3, 'smur2x'), (2, 'sn2vpk'), (0, 'sqhl33')]

In [66]:
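# make gensim Dictionary from the vocabulary
from gensim import corpora
dict_na = corpora.Dictionary([data_dtmna.columns])

In [70]:
from gensim import corpora, models
# NOTE: `texts` expects tokenized documents (a list of token lists);
# passing the column names instead is a likely cause of the nan below
coh_model_lda = models.CoherenceModel(model=ldana, 
                                      texts=data_dtmna.columns,
                                      coherence='c_v',
                                      dictionary=dict_na)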

In [71]:
coh_lda = coh_model_lda.get_coherence()
print('coherence score: ', coh_lda)

coherence score:  nan


  m_lr_i = np.log(numerator / denominator)
  return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))


In [44]:
# coherence analysis to find optimal number of topics
from gensim.models import CoherenceModel

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics
    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics
    start : Initial num of topics
    step : Increment between each topic number
    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


In [45]:
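# fails: `dictionary` must be a gensim corpora.Dictionary, not a plain dict,
# and `texts` should be a list of token lists rather than raw strings
compute_coherence_values(dictionary=id2wordna, corpus=corpusna, texts=data_nouns_adj.body, start=2, limit=40, step=1)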

AttributeError: 'dict' object has no attribute 'id2token'
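Once `compute_coherence_values` returns a list of scores, the number of topics can be read off by pairing each score with the `range(start, limit, step)` value that produced it. A minimal sketch follows; `best_num_topics` is a helper name made up here, and the scores are placeholders, not results from this notebook.

```python
def best_num_topics(coherence_values, start, limit, step):
    """Return (num_topics, score) with the highest coherence, ignoring NaN scores."""
    candidates = [
        (k, score)
        for k, score in zip(range(start, limit, step), coherence_values)
        if score == score  # NaN is the only value that is not equal to itself
    ]
    if not candidates:
        raise ValueError("no finite coherence scores to compare")
    return max(candidates, key=lambda pair: pair[1])

# Placeholder scores for illustration only.
placeholder_scores = [0.31, float("nan"), 0.42, 0.38]
best = best_num_topics(placeholder_scores, start=2, limit=6, step=1)
print(best)
# (4, 0.42)
```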

In [None]:
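# implement LDA
from gensim.models import LdaMulticore

def lda_model(corpus, dictionary, num_topics=10, passes=20):
    """
    Create an LDA model, trained in parallel with 2 worker processes
    """
    # use a local name that does not shadow the function itself
    model = LdaMulticore(corpus=corpus,
                         id2word=dictionary,
                         num_topics=num_topics,
                         passes=passes,
                         workers=2)
    return model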