## Discovering the impact of the series 'Euphoria' through NLP
### Analysis based on posts and comments on the `r/euphoria` subreddit  

#### 3. LDA - Topic Modeling

*Every document is a probability distribution over topics, and every topic is a probability distribution over words.*

*goal*: LDA learns the topic mix of each document and the word mix of each topic.   

*how*: LDA starts by randomly assigning a topic to every word (these assignments will be wrong). It then iterates: for each word, it looks at how often each topic occurs in the document and how often the word occurs in each topic overall, and reassigns the word's topic based on that information.
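
The reassignment loop described above can be sketched as a toy collapsed Gibbs sampler. This is only an illustration of the idea: the function name `toy_lda_gibbs` and the hyperparameters `alpha` and `beta` are assumptions made for this sketch, and `gensim`'s `LdaModel` fits the model with online variational Bayes rather than Gibbs sampling.

```python
import random

def toy_lda_gibbs(docs, vocab_size, k=2, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * k for _ in docs]                 # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(k)]   # word counts per topic
    topic_total = [0] * k                               # words assigned to each topic
    assign = []                                         # current topic of every word occurrence

    # Step 1: give every word occurrence a random topic (will be wrong).
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(k) for _ in doc])
        for w, z in zip(doc, assign[d]):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    # Step 2: revisit each word and re-draw its topic from the current counts.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # take the word's current assignment out of the counts
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # weight = (topic frequency in this doc) * (word frequency in the topic)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word
```

Normalizing each row of `doc_topic` gives a rough topic mix per document, and the largest entries in each row of `topic_word` are that topic's top words.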

`k = 2` is a good starting point for the number of topics.  

*input*: term-document matrix (TDM), number of topics `k`, number of iterations  
*output*: top words in each topic; check by hand whether they make sense

*tools*:  
`gensim`

alternate factorization methods: 
- NMF
- LSI
- look at BERT TM

#### All posts in the r/euphoria subreddit that match 'rue', posted between Jan 9 and Feb 2, 2022 (season 2)

**1st experiment**  
N ≈ 20,823 posts + comments (100%)  
Some notable comments:  
- "I was curious, so I read the /r/opiates discussion on the episode and they agree it's accurate. There's not a single negative comment."
- "Having been the partner/girlfriend of an addict, I cannot re-watch several parts of Season 2 because they feel so real to me."
- "Absolutely spot on. I've had my struggles w Vicodin and there were points when I was dope sick, I would do anything to get those damn pills. I'm still feeling some type of a way after last night's episode. That's how accurate it was"
- "Y’all attach too much of your selves to the characters."

**EXP 1**

In [1]:
# bring in data
import pandas as pd
import pickle

data = pd.read_pickle('../dat/tdm_stop.pkl')
data

Unnamed: 0,ab,aback,abaedefabdfef,abafbfbedbada,abandon,abandoned,abandoning,abandonment,abandons,abashed,...,zorbcaps,zoted,zouabi,zoya,zqcsrpwsge,zrue,zs,zsuzsana,zurich,zy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19278,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19279,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19280,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19281,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [2]:
from gensim import matutils, models
import scipy.sparse as sp

In [3]:
tdm = data.transpose()
tdm.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19272,19273,19274,19275,19277,19278,19279,19280,19281,19282
ab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aback,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abaedefabdfef,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abafbfbedbada,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# put tdm in gensim format
sparse_counts = sp.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
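
`Sparse2Corpus` treats columns as documents by default (`documents_columns=True`), which is why the matrix is transposed to terms-as-rows first. Each document then becomes a list of `(term_id, count)` pairs. A plain-Python sketch of that conversion, using a hypothetical helper `tdm_to_bow` that is not part of `gensim`:

```python
def tdm_to_bow(tdm_rows):
    """tdm_rows[t][d] is the count of term t in document d (terms as rows)."""
    n_docs = len(tdm_rows[0]) if tdm_rows else 0
    corpus = []
    for d in range(n_docs):
        # keep only the non-zero counts, as (term_id, count) pairs
        corpus.append([(t, row[d]) for t, row in enumerate(tdm_rows) if row[d]])
    return corpus
```

The term ids are row positions in the TDM, so the `id2word` mapping has to use the same positions as the vectorizer's vocabulary.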

In [5]:
# a dictionary of all terms - required by gensim
# cv is the fitted CountVectorizer; it holds the whole vocabulary of the corpus
with open('../dat/cv_Stop.pkl', 'rb') as f:
    cv = pickle.load(f)
# invert term -> column index into column index -> term
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [6]:
# we have corpus and id2word, now we can create the lda model
# num_topics and passes are the main parameters to tune
# more passes over the corpus usually give more sensible topics, at the cost of runtime
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)

In [7]:
lda.print_topics()

[(0,
  '0.019*"jules" + 0.018*"like" + 0.014*"think" + 0.010*"ml" + 0.009*"season" + 0.008*"going" + 0.008*"box" + 0.008*"nate" + 0.008*"know" + 0.008*"really"'),
 (1,
  '0.007*"people" + 0.006*"amp" + 0.004*"like" + 0.004*"addiction" + 0.004*"time" + 0.003*"xb" + 0.003*"life" + 0.003*"day" + 0.003*"wel" + 0.003*"know"')]

In [8]:
# lda for 3 topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.026*"jules" + 0.022*"like" + 0.015*"think" + 0.012*"people" + 0.011*"nate" + 0.011*"season" + 0.009*"really" + 0.008*"cassie" + 0.008*"does" + 0.007*"know"'),
 (1,
  '0.012*"fez" + 0.010*"like" + 0.009*"going" + 0.009*"know" + 0.008*"did" + 0.007*"think" + 0.007*"time" + 0.006*"drugs" + 0.005*"episode" + 0.005*"got"'),
 (2,
  '0.024*"ml" + 0.019*"box" + 0.015*"niche" + 0.012*"tester" + 0.010*"amp" + 0.007*"designer" + 0.006*"new" + 0.006*"vintage" + 0.005*"cap" + 0.005*"xb"')]

In [9]:
# lda for 4 topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.034*"jules" + 0.021*"like" + 0.017*"think" + 0.014*"people" + 0.010*"does" + 0.009*"really" + 0.009*"elliot" + 0.008*"know" + 0.008*"feel" + 0.007*"relationship"'),
 (1,
  '0.027*"ml" + 0.021*"box" + 0.017*"niche" + 0.014*"tester" + 0.011*"amp" + 0.008*"designer" + 0.007*"vintage" + 0.007*"new" + 0.006*"cap" + 0.006*"xb"'),
 (2,
  '0.020*"season" + 0.018*"like" + 0.013*"nate" + 0.011*"cassie" + 0.011*"episode" + 0.010*"lexi" + 0.010*"character" + 0.010*"think" + 0.008*"characters" + 0.008*"maddy"'),
 (3,
  '0.012*"fez" + 0.008*"going" + 0.007*"did" + 0.007*"time" + 0.007*"know" + 0.006*"like" + 0.006*"drugs" + 0.005*"drug" + 0.005*"got" + 0.005*"think"')]

**TRY 2**  

only nouns - `nltk`

In [10]:
from nltk import word_tokenize, pos_tag

def nouns(text):
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)

In [11]:
# read clean data
data_clean = pd.read_pickle('../dat/corpus.pkl')
data_clean

Unnamed: 0,0
0,it is a turning point for the series fs
1,i am glad we finally get a whole episode dedic...
2,my thoughts exactly all i could think of while...
3,i honestly thought with the end of ep the seas...
4,same i feel like we have not really seen much ...
...,...
19278,the which extremely out franchised culver in t...
19279,so this hapened back in late november and i on...
19280,i want to use the face claim of alexis bledel ...
19281,i just ordered some seeds and i am wanting to ...


In [12]:
# filter so only nouns are left
data_nouns = pd.DataFrame(data_clean[0].apply(nouns))
data_nouns

Unnamed: 0,0
0,point series fs
1,i glad episode protagonist
2,thoughts i watching show shitshow
3,i end ep season space kind mediocre
4,season episode
...,...
19278,culver family wel desserts etc culvers culvers...
19279,november i work dimsum restaurant city folowin...
19280,i face claim bledel character specificaly seas...
19281,i seeds way


In [16]:
# create new dtm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

# add additional stop words, including keyboard-mash tokens (TODO: find a systematic way to filter these)
add_stop_words = ['i', 'just','did', 'ab', 'abc', 'abcb', 'abcny', 'abd', 'abdabca', 'fs', 
                  'zpqxhxhzanapjsjbf', 'zqcsrpwsge', 'zqnuhckwdqwrhkuo', 'zs', 'zshwbhethehenozxfyqg',
                  'zsmkbrmwngzsibrntkt', 'zy', 'zwhnrmujykdxmntiub', 'afqjcnguytghbsuvixmglpwzqbg', 'ebecadcbdfcbafbdb',
                  'abfbmltmqspf', 'abfafebfbad']
# newer scikit-learn versions expect stop_words as a list rather than a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# dtm
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns[0])
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn


Unnamed: 0,aback,abandon,abandonment,abandons,abdomen,abducts,abe,abey,abfafebfbad,abfbmltmqspf,...,zomer,zomg,zone,zongao,zora,zorbcaps,zouabi,zoya,zrue,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19278,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19279,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19280,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19281,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
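
One possible answer to the keyboard-mash question is a rough heuristic that flags suspicious tokens for review instead of listing them by hand. The sketch below is an assumption, not something used in this analysis: the names `looks_like_keysmash` and `split_vocabulary` and both thresholds are made up for illustration, and the rule can misfire on real words with long consonant clusters.

```python
import re

HEX_LIKE = re.compile(r'^[a-f]{8,}$')                  # e.g. leftovers of hashes or ids
CONSONANT_RUN = re.compile(r'[b-df-hj-np-tv-xz]{6,}')  # six or more consonants in a row

def looks_like_keysmash(token):
    """Heuristic only: flags tokens that are unlikely to be English words."""
    token = token.lower()
    return bool(HEX_LIKE.match(token) or CONSONANT_RUN.search(token))

def split_vocabulary(tokens):
    """Return (kept, flagged) so the flagged list can be checked by hand."""
    flagged = [t for t in tokens if looks_like_keysmash(t)]
    kept = [t for t in tokens if not looks_like_keysmash(t)]
    return kept, flagged
```

The flagged tokens could be appended to `add_stop_words` after a quick manual check. A simpler alternative is `CountVectorizer(min_df=2)`, which drops any term that appears in only one document.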


In [17]:
# create gensim corpus
corpusn = matutils.Sparse2Corpus(sp.csr_matrix(data_dtmn.transpose()))
# vocab
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [None]:
# 2 topics
ldan = models.ldamodel.LdaModel(corpus=corpusn, id2word=id2wordn, num_topics=2, passes=10)
ldan.print_topics()

In [None]:
# 3 topics
ldan = models.ldamodel.LdaModel(corpus=corpusn, id2word=id2wordn, num_topics=3, passes=10)
ldan.print_topics()

In [18]:
# 4 topics
ldan = models.ldamodel.LdaModel(corpus=corpusn, id2word=id2wordn, num_topics=4, passes=10)
ldan.print_topics()

[(0,
  '0.030*"box" + 0.022*"niche" + 0.020*"tester" + 0.011*"designer" + 0.010*"vintage" + 0.009*"amp" + 0.008*"cap" + 0.006*"chanel" + 0.006*"xb" + 0.004*"city"'),
 (1,
  '0.091*"rue" + 0.089*"jules" + 0.023*"episode" + 0.019*"relationship" + 0.015*"elliot" + 0.015*"way" + 0.013*"people" + 0.012*"drugs" + 0.011*"rues" + 0.010*"season"'),
 (2,
  '0.019*"drugs" + 0.017*"people" + 0.016*"fez" + 0.015*"rue" + 0.015*"drug" + 0.015*"time" + 0.015*"life" + 0.009*"way" + 0.008*"family" + 0.007*"house"'),
 (3,
  '0.040*"season" + 0.027*"cassie" + 0.026*"nate" + 0.026*"character" + 0.024*"people" + 0.022*"characters" + 0.015*"maddy" + 0.015*"lexi" + 0.013*"lot" + 0.012*"story"')]

**Try 3 - nouns and adjectives**

In [20]:
def nouns_adj(text):
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)
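
`nouns` and `nouns_adj` differ only in which tag prefixes they keep: in the Penn Treebank tag set `NN` covers `NN`, `NNS`, `NNP` and `NNPS`, and `JJ` covers `JJ`, `JJR` and `JJS`. A small sketch of one shared filter that works on already tagged tokens; the name `keep_tags` is hypothetical:

```python
def keep_tags(tagged, prefixes=('NN', 'JJ')):
    """tagged is a list of (word, tag) pairs, e.g. the output of nltk.pos_tag."""
    return ' '.join(word for word, tag in tagged if tag.startswith(prefixes))
```

Calling `keep_tags(pos_tag(word_tokenize(text)), ('NN',))` should give the nouns-only variant.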

In [21]:
data_nouns_adj = pd.DataFrame(data_clean[0].apply(nouns_adj))
data_nouns_adj

Unnamed: 0,0
0,turning point series fs
1,i glad whole episode protagonist
2,thoughts i watching show shitshow
3,i end ep season space kind mediocre real narative
4,same i season whole episode
...,...
19278,culver famous branched restaurant family wel d...
19279,late november i most i work nice chinese ameri...
19280,i face claim alexis bledel character specifica...
19281,i seeds best way


In [25]:
# dtm (max_df=0.8 drops terms that appear in more than 80% of the documents)
cvna = CountVectorizer(stop_words=stop_words, max_df=0.8)
data_cvna = cvna.fit_transform(data_nouns_adj[0])
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna


Unnamed: 0,ab,aback,abandon,abandonment,abandons,abc,abcb,abcny,abd,abdabca,...,zpqxhxhzanapjsjbf,zqcsrpwsge,zqnuhckwdqwrhkuo,zrue,zs,zshwbhethehenozxfyqg,zsmkbrmwngzsibrntkt,zurich,zwhnrmujykdxmntiub,zy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19286,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19287,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19288,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
# create gensim corpus
corpusna = matutils.Sparse2Corpus(sp.csr_matrix(data_dtmna.transpose()))
# vocab
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [27]:
# 2 topics
ldana = models.ldamodel.LdaModel(corpus=corpusna, id2word=id2wordna, num_topics=2, passes=10)
ldana.print_topics()

[(0,
  '0.036*"rue" + 0.025*"jules" + 0.014*"people" + 0.012*"season" + 0.010*"nate" + 0.009*"episode" + 0.008*"drugs" + 0.008*"way" + 0.007*"time" + 0.007*"cassie"'),
 (1,
  '0.031*"ful" + 0.031*"movie" + 0.018*"movies" + 0.013*"free" + 0.012*"online" + 0.009*"amp" + 0.009*"man" + 0.008*"home" + 0.008*"watch" + 0.007*"scream"')]

In [28]:
# 3 topics
ldana = models.ldamodel.LdaModel(corpus=corpusna, id2word=id2wordna, num_topics=3, passes=10)
ldana.print_topics()

[(0,
  '0.052*"rue" + 0.036*"jules" + 0.016*"people" + 0.015*"season" + 0.014*"nate" + 0.013*"episode" + 0.011*"drugs" + 0.011*"cassie" + 0.010*"way" + 0.008*"character"'),
 (1,
  '0.008*"fez" + 0.008*"time" + 0.006*"school" + 0.006*"people" + 0.006*"season" + 0.005*"euphoria" + 0.004*"sam" + 0.004*"high" + 0.004*"year" + 0.004*"wel"'),
 (2,
  '0.033*"ful" + 0.033*"movie" + 0.019*"movies" + 0.014*"free" + 0.012*"online" + 0.010*"amp" + 0.009*"man" + 0.009*"home" + 0.008*"watch" + 0.008*"scream"')]

In [29]:
# 4 topics
ldana = models.ldamodel.LdaModel(corpus=corpusna, id2word=id2wordna, num_topics=4, passes=10)
ldana.print_topics()

[(0,
  '0.041*"rue" + 0.028*"jules" + 0.016*"people" + 0.013*"season" + 0.011*"nate" + 0.010*"episode" + 0.009*"drugs" + 0.008*"time" + 0.008*"cassie" + 0.008*"way"'),
 (1,
  '0.023*"king" + 0.021*"hero" + 0.020*"video" + 0.012*"academia" + 0.011*"heroes" + 0.009*"sound" + 0.009*"site" + 0.008*"tv" + 0.007*"films" + 0.007*"good"'),
 (2,
  '0.044*"movie" + 0.029*"ful" + 0.027*"movies" + 0.018*"free" + 0.017*"online" + 0.011*"watch" + 0.011*"scream" + 0.007*"resolution" + 0.007*"hd" + 0.007*"quality"'),
 (3,
  '0.049*"ful" + 0.039*"man" + 0.030*"ml" + 0.029*"home" + 0.029*"xb" + 0.028*"box" + 0.026*"way" + 0.024*"spider" + 0.022*"amp" + 0.021*"niche"')]

**ID topics**

The last model makes the most sense: 4 topics with nouns and adjectives.

In [30]:
# 4 topics - more passes
ldana = models.ldamodel.LdaModel(corpus=corpusna, id2word=id2wordna, num_topics=4, passes=100)
ldana.print_topics()

[(0,
  '0.063*"rue" + 0.040*"jules" + 0.020*"people" + 0.016*"drugs" + 0.011*"way" + 0.011*"elliot" + 0.009*"drug" + 0.009*"relationship" + 0.008*"life" + 0.008*"time"'),
 (1,
  '0.024*"season" + 0.021*"nate" + 0.016*"cassie" + 0.015*"character" + 0.012*"episode" + 0.012*"characters" + 0.012*"fez" + 0.010*"lexi" + 0.010*"maddy" + 0.010*"jules"'),
 (2,
  '0.028*"ful" + 0.022*"ml" + 0.020*"box" + 0.015*"niche" + 0.013*"tester" + 0.010*"amp" + 0.008*"designer" + 0.007*"new" + 0.007*"vintage" + 0.006*"xb"'),
 (3,
  '0.042*"movie" + 0.030*"ful" + 0.025*"movies" + 0.017*"free" + 0.016*"online" + 0.011*"man" + 0.010*"watch" + 0.010*"home" + 0.010*"scream" + 0.009*"way"')]

**Topics**

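- topic 0: Rue and Jules, drugs and relationships
- topic 1: characters and the season (Nate, Cassie, Fez, Lexi, Maddy)
- topic 2: off-topic product listings (ml, box, niche, tester) (?)
- topic 3: off-topic movie streaming posts (movie, free, online, watch) (?)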

In [43]:
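# what is the dominant topic of each document?
# each document maps to a list of (topic_id, probability) pairs, possibly more than one

corpus_transformed = ldana[corpusna]
dominant_topics = [max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed]
list(zip(dominant_topics, data_dtmna.index))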

[(3, 'smur2x'), (2, 'sn2vpk'), (0, 'sqhl33')]

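In [66]:
# make a gensim Dictionary from the vocabulary
from gensim import corpora, models
dict_na = corpora.Dictionary([data_dtmna.columns])

In [70]:
# NOTE: `texts` is expected to be a list of tokenized documents; passing the
# vocabulary (column names) instead is the likely cause of the nan score below
coh_model_lda = models.CoherenceModel(model=ldana,
                                      texts=data_dtmna.columns,
                                      coherence='c_v',
                                      dictionary=dict_na)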

In [71]:
coh_lda = coh_model_lda.get_coherence()
print('coherence score: ', coh_lda)

coherence score:  nan


  m_lr_i = np.log(numerator / denominator)
  return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
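
The `nan` is plausibly a data-shape problem: `CoherenceModel` expects `texts` to be a list of tokenized documents, while the call above passes the vocabulary itself. To show why that matters, here is a plain-Python sketch of a simpler, UMass-style co-occurrence score. It is not the `c_v` measure and not `gensim`'s implementation, and `umass_coherence` is a name made up for this illustration.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, tokenized_docs):
    """UMass-style score: mean of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # number of documents that contain all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for w_j, w_i in combinations(topic_words, 2):  # w_j is ranked above w_i
        d_j = doc_freq(w_j)
        if d_j == 0:
            # the word never shows up in the texts, so the score is undefined
            return float('nan')
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else float('nan')
```

If the "documents" are plain strings instead of token lists, no topic word is ever found in them and the score degenerates to `nan`, which is the same failure mode suspected above.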


In [44]:
# coherence analysis to find optimal number of topics
from gensim.models import CoherenceModel

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics
    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics
    start : Initial num of topics
    step : Increment between each topic number
    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values
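
Once `compute_coherence_values` returns, the usual next step is to pick the topic count with the highest coherence. A minimal sketch, assuming the same `start` and `step` that were passed to the function above; `pick_num_topics` is a hypothetical helper, and `nan` scores are skipped:

```python
import math

def pick_num_topics(coherence_values, start=2, step=3):
    """Return (num_topics, coherence) for the best-scoring model, skipping nan scores."""
    scored = [(value, start + i * step)
              for i, value in enumerate(coherence_values)
              if not math.isnan(value)]
    if not scored:
        raise ValueError('no usable coherence values')
    value, num_topics = max(scored)
    return num_topics, value
```

The highest score is only a starting point; the top words of the chosen model still need a manual sanity check.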


In [45]:
compute_coherence_values(dictionary=id2wordna, corpus=corpusna, texts=data_nouns_adj.body, start=2, limit=40, step=1)

AttributeError: 'dict' object has no attribute 'id2token'
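
The error suggests that `CoherenceModel` needs a real `gensim` `Dictionary` rather than the plain `dict` in `id2wordna`, and `texts` needs to be tokenized documents rather than raw strings. A hedged sketch of how both inputs might be prepared: `tokenize_docs` and `to_gensim_dictionary` are hypothetical helpers, and `Dictionary.from_corpus` is used here on the assumption that reusing the existing term ids keeps the dictionary aligned with `corpusna`.

```python
def tokenize_docs(raw_docs, analyzer):
    """Tokenize each document with the given callable (e.g. cvna.build_analyzer())."""
    return [analyzer(doc) for doc in raw_docs]

def to_gensim_dictionary(corpus, id2word):
    """Wrap an existing id -> term mapping in a gensim Dictionary."""
    from gensim.corpora import Dictionary  # imported lazily; requires gensim
    # reuse the existing term ids so they stay aligned with `corpus`
    return Dictionary.from_corpus(corpus, id2word=id2word)
```

A call could then look like `compute_coherence_values(dictionary=to_gensim_dictionary(corpusna, id2wordna), corpus=corpusna, texts=tokenize_docs(data_nouns_adj[0], cvna.build_analyzer()), start=2, limit=40, step=1)`. The frame printed earlier has a column named `0`, so `data_nouns_adj.body` may also need to become `data_nouns_adj[0]`.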

In [None]:
# implement LDA
from gensim.models import LdaMulticore

def lda_model(corpus, dictionary, num_topics=10, passes=20):
    """
    Train a multicore LDA model and return it
    """
    # use a local name that does not shadow the function itself
    model = LdaMulticore(corpus=corpus,
                         id2word=dictionary,
                         num_topics=num_topics,
                         passes=passes,
                         workers=2)
    return model