## Discovering the impact of the series 'Euphoria' through NLP
### Analysis based on posts and comments on the `r/euphoria` subreddit  

#### 3. LDA - Topic Modeling

= *Every document is a probability distribution over topics*

*goal*: LDA learns the topic mix in each document, then the word mix in each topic

*how*: LDA randomly assigns a topic to every word (these first assignments will be wrong). Then, iteratively, for each word it looks at how often each topic occurs in the document and how often the word occurs in each topic overall. Based on this information, it assigns the word a new topic.
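
The loop described above can be sketched as a toy collapsed Gibbs sampler in plain Python. This is only an illustration of the idea: the three short documents are invented, `alpha` and `beta` are assumed smoothing values, and gensim's `LdaModel` fits the model with a variational method rather than this sampler.

```python
import random

def toy_lda_gibbs(docs, num_topics=2, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]   # topic counts per document
    topic_word = [{} for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                 # total words per topic

    def update(d, w, k, step):
        doc_topic[d][k] += step
        topic_word[k][w] = topic_word[k].get(w, 0) + step
        topic_total[k] += step

    # random initial topic for every word (will be wrong)
    assignments = []
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assignments[d]):
            update(d, w, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)  # forget the current topic
                # weight = (topic frequency in doc) * (word frequency in topic)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k_new
                update(d, w, k_new, +1)
    return doc_topic, topic_word

# invented toy documents, not taken from the subreddit data
docs = [
    "weed try weed opiates try".split(),
    "elliot jules laurie elliot jules".split(),
    "weed opiates elliot laurie".split(),
]
doc_topic, topic_word = toy_lda_gibbs(docs)
print(doc_topic)  # per-document topic counts; each row sums to the document length
```

Each row of `doc_topic` is an unnormalized topic mix for one document, and each dict in `topic_word` is an unnormalized word mix for one topic.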

`k = 2` is a good starting point for the number of topics  

*input*: term-document matrix (TDM), number of topics `k`, number of iterations (passes)  
*output*: top words in each topic, which you then judge for whether they make sense

*tools*:  
`gensim`

alternative factorization methods:
- NMF (non-negative matrix factorization)
- LSI (latent semantic indexing)

#### 3 posts in the r/euphoria subreddit  
N ~ 1,709 comments
1. Question: Does euphoria make you less likely to try drugs? Or are you more curious than you were before?
2. Not enough people are talking about Elliot's response to Rue telling him about her plan to get "free" drugs from Laurie
3. As an ex-opioid addict, Zendaya's withdrawal scenes are the most realistic portrayal I've ever seen before. her acting is phenomenal.

**Try 1**

In [1]:
# bring in data
import pandas as pd
import pickle

data = pd.read_pickle('../dat/tdm_stop.pkl')
data

post,aa,aana,ab,aback,abby,abhorrence,ability,able,abroad,absolute,...,zealand,zendaya,zendayas,zero,zoloft,zombie,zone,zoo,zooming,zs
smur2x,1,0,0,0,1,0,0,9,0,4,...,1,8,1,0,0,0,1,0,1,1
sn2vpk,2,1,1,1,0,0,2,6,0,0,...,0,2,0,1,0,1,0,0,0,0
sqhl33,3,0,1,0,0,1,0,14,1,4,...,0,4,0,3,1,1,1,1,0,0


In [2]:
from gensim import matutils, models
import scipy.sparse as sp

In [3]:
tdm = data.transpose()
tdm.head()

post,smur2x,sn2vpk,sqhl33
aa,1,2,3
aana,0,1,0
ab,0,1,1
aback,0,1,0
abby,1,0,0


In [5]:
# put the term-document matrix in gensim's corpus format
# (Sparse2Corpus treats each column as one document by default)
sparse_counts = sp.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
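
To make the corpus format concrete: a gensim corpus is an iterable of documents, each a list of `(term_id, count)` pairs. The sketch below builds that structure by hand from the first five rows of the term-document matrix shown above. `tdm_to_bow` is a hypothetical helper written for illustration, not a gensim function.

```python
def tdm_to_bow(term_doc_rows):
    """term_doc_rows[t][d] is the count of term t in document d."""
    num_docs = len(term_doc_rows[0])
    corpus = []
    for d in range(num_docs):
        # keep only the terms that actually occur in document d
        doc = [(t, row[d]) for t, row in enumerate(term_doc_rows) if row[d] > 0]
        corpus.append(doc)
    return corpus

# first five rows of the term-document matrix (columns: smur2x, sn2vpk, sqhl33)
rows = [
    [1, 2, 3],  # aa
    [0, 1, 0],  # aana
    [0, 1, 1],  # ab
    [0, 1, 0],  # aback
    [1, 0, 0],  # abby
]
bow = tdm_to_bow(rows)
print(bow[0])  # [(0, 1), (4, 1)]
```

Terms with a zero count are dropped, which is why a sparse matrix is a natural intermediate step.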

In [6]:
# gensim also needs a dictionary mapping each term id to its term
# cv is the fitted vectorizer, which contains the whole vocabulary of the corpus
with open('../dat/cv_Stop.pkl', 'rb') as f:
    cv = pickle.load(f)
id2word = {idx: term for term, idx in cv.vocabulary_.items()}
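
A small sketch of what the inversion does. The `vocabulary` dict here is a hypothetical stand-in for `cv.vocabulary_`, which maps each term to its column index; gensim wants the opposite direction.

```python
# hypothetical stand-in for cv.vocabulary_ (term -> column index)
vocabulary = {'aa': 0, 'aana': 1, 'ab': 2, 'aback': 3, 'abby': 4}

# invert it so gensim can look up the term for each id
id2word_demo = {idx: term for term, idx in vocabulary.items()}
print(id2word_demo[4])  # abby
```

The ids only line up with the corpus if the vectorizer is the same one that produced the count matrix, which is assumed to be the case for the two pickle files.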

In [7]:
# with corpus and id2word in place, we can create the LDA model
# num_topics and passes are the main parameters to set
# more passes give the model more chances to converge, so the topics may make more sense
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)

In [8]:
lda.print_topics()

[(0,
  '0.011*"weed" + 0.010*"makes" + 0.009*"try" + 0.007*"opiates" + 0.006*"season" + 0.006*"watching" + 0.006*"euphoria" + 0.006*"drug" + 0.006*"life" + 0.006*"feel"'),
 (1,
  '0.009*"think" + 0.009*"elliot" + 0.007*"addict" + 0.007*"jules" + 0.006*"going" + 0.006*"drug" + 0.005*"got" + 0.004*"said" + 0.004*"laurie" + 0.004*"good"')]

In [9]:
# lda for 3 topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.001*"try" + 0.001*"think" + 0.000*"makes" + 0.000*"weed" + 0.000*"drug" + 0.000*"addict" + 0.000*"opiates" + 0.000*"going" + 0.000*"season" + 0.000*"doing"'),
 (1,
  '0.012*"weed" + 0.011*"makes" + 0.010*"try" + 0.007*"opiates" + 0.007*"season" + 0.007*"watching" + 0.006*"euphoria" + 0.006*"life" + 0.006*"drug" + 0.006*"feel"'),
 (2,
  '0.010*"think" + 0.009*"elliot" + 0.008*"addict" + 0.007*"jules" + 0.006*"going" + 0.006*"drug" + 0.005*"got" + 0.005*"said" + 0.005*"laurie" + 0.004*"good"')]

In [10]:
# lda for 4 topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.001*"weed" + 0.001*"try" + 0.001*"makes" + 0.000*"think" + 0.000*"going" + 0.000*"opiates" + 0.000*"watching" + 0.000*"make" + 0.000*"drug" + 0.000*"got"'),
 (1,
  '0.008*"withdrawal" + 0.006*"addict" + 0.006*"yawning" + 0.006*"going" + 0.005*"episode" + 0.004*"feel" + 0.004*"clean" + 0.004*"withdrawals" + 0.004*"life" + 0.003*"bad"'),
 (2,
  '0.000*"got" + 0.000*"addict" + 0.000*"think" + 0.000*"going" + 0.000*"feel" + 0.000*"makes" + 0.000*"life" + 0.000*"weed" + 0.000*"good" + 0.000*"drug"'),
 (3,
  '0.009*"think" + 0.009*"weed" + 0.008*"makes" + 0.007*"drug" + 0.007*"try" + 0.006*"addict" + 0.006*"elliot" + 0.005*"going" + 0.005*"life" + 0.005*"got"')]
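
In the 3- and 4-topic runs, some topics have every top word at a weight of 0.001 or less. One possible reading is that, with only three documents, the model is effectively using fewer topics than requested. The sketch below flags such topics from `print_topics()` output. The helper names and the `min_top_weight` threshold are assumptions made for illustration.

```python
import re

def top_weights(topic_string):
    """Parse '0.011*"weed" + 0.010*"makes"' into [('weed', 0.011), ('makes', 0.01)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

def flag_empty_topics(topics, min_top_weight=0.002):
    """Return ids of topics whose largest printed word weight is below the threshold."""
    return [
        topic_id
        for topic_id, topic_string in topics
        if max(weight for _, weight in top_weights(topic_string)) < min_top_weight
    ]

# abbreviated copy of the 4-topic output (first three words per topic)
topics_k4 = [
    (0, '0.001*"weed" + 0.001*"try" + 0.001*"makes"'),
    (1, '0.008*"withdrawal" + 0.006*"addict" + 0.006*"yawning"'),
    (2, '0.000*"got" + 0.000*"addict" + 0.000*"think"'),
    (3, '0.009*"think" + 0.009*"weed" + 0.008*"makes"'),
]
print(flag_empty_topics(topics_k4))  # [0, 2]
```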

**Try 2**

Keep only nouns, using `nltk` part-of-speech tags

In [11]:
from nltk import word_tokenize, pos_tag

def nouns(text):
    """Tokenize a string and keep only the nouns (tags starting with 'NN')."""
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)
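
The tag check keeps every Penn Treebank noun tag (`NN`, `NNS`, `NNP`, `NNPS`) because they all start with `NN`. A minimal sketch of just that filter, using hand-written tags rather than real `pos_tag` output (the tagged list is an assumption for illustration):

```python
is_noun = lambda pos: pos[:2] == 'NN'

# hand-written (word, tag) pairs; real tags would come from nltk's pos_tag
tagged = [
    ('zendaya', 'NNP'),
    ('portrays', 'VBZ'),
    ('withdrawals', 'NNS'),
    ('realistically', 'RB'),
    ('scene', 'NN'),
]
nouns_only = [word for word, pos in tagged if is_noun(pos)]
print(' '.join(nouns_only))  # zendaya withdrawals scene
```

Running the real `nouns` function also requires the nltk tokenizer and tagger models to be downloaded once with `nltk.download`.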

In [13]:
# read clean data
data_clean = pd.read_pickle('../dat/corpus.pkl')
data_clean

post,body,post_q
smur2x,first congrats and continued success on your s...,Likely to try drugs
sn2vpk,i think the difference between elliot and rue ...,Elliots response to free drugs
sqhl33,i already smoke weed and do psychedelics but i...,Realistic portrayal of withdrawal


In [15]:
# filter so only nouns are left
data_nouns = pd.DataFrame(data_clean.body.apply(nouns))
data_nouns

post,body
smur2x,congrats success sobriety show exaddict downsi...
sn2vpk,i difference elliot rue health issues life ell...
sqhl33,i weed psychedelics i opiates everything i add...


In [16]:
# create new dtm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

# add additional stop words
add_stop_words = ['i', 'like','just','rue','did','really','people','way','know','use',
                  'time','drugs','want','does','addiction']
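# newer scikit-learn versions expect stop_words as a list, not a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# build the document-term matrix from the noun-only text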
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.body)
data_dtm = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtm.index = data_nouns.index
data_dtm


post,aa,aana,aback,abhorrence,ability,absolute,absolutes,absorption,abuse,abuser,...,yuck,yum,yup,zendaya,zendayas,zombie,zone,zoo,zooming,zs
smur2x,1,0,0,0,0,2,0,1,0,0,...,1,1,0,6,1,0,0,0,1,1
sn2vpk,0,1,1,0,2,0,1,0,4,0,...,0,0,0,2,0,1,0,0,0,0
sqhl33,1,0,0,1,0,2,0,0,6,1,...,0,0,1,4,0,1,1,1,0,0


In [None]:
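# implement LDA
from gensim.models import LdaMulticore

def lda_model(corpus, dictionary, num_topics=10, passes=20):
    """
    Train an LDA model with gensim's multicore implementation.

    corpus: gensim-format corpus; dictionary: mapping from term id to term
    """
    model = LdaMulticore(corpus=corpus,
                         id2word=dictionary,
                         num_topics=num_topics,
                         passes=passes,
                         workers=2)
    return model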