## Discovering the impact of the series 'Euphoria' through NLP
### Analysis based on posts and comments on the `r/euphoria` subreddit

#### Question & Data Cleaning

Analysis plan:  

0. identify question(s) and data sources
1. clean data to get it into a standard format for further analysis
   1. corpus (collection of texts) - to dataframe using `pandas`
   2. document-term matrix - clean, tokenize, tdm
2. EDA
3. topic modeling based on comments referencing 'rue' posts on r/euphoria
4. *network analysis between users of drug and euphoria communities* - time permitting
5. aspect based sentiment analysis of euphoria drug-related comments
6. *topic modeling based on comments that refer to euphoria on r/opioids, r/cannabis, r/benzodiazepines* - time permitting

#### 0. What is our question?

*What about the drug portrayal on HBO's Euphoria makes it engaging to fans?*
*Specifically, what topics emerge in the comments and is there an observable difference between seasons 1 and 2?*

Input: the question, "what makes viewers of HBO's Euphoria engage in online discourse on Reddit during seasons 1 and 2?"  
Output: data that is cleaned, organized, and in a standard format that can be used in later analysis  

**Data Source(s)**  
1. Reddit - `r/euphoria`
   1. posts/comments
   2. filtering comments on date and 'Rue'
      1. S1: June 16 - Aug 4, 2019
      2. S2: Jan 9 - Feb 7, 2022
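
The date windows and the 'Rue' filter above can be sketched as a small predicate. This is only an illustration: the record fields `body` and `created_utc` and the helper names `_epoch` and `in_season` are assumptions made here, not names taken from the notebook.

```python
import re
from datetime import datetime, timezone

def _epoch(y, m, d, end=False):
    """UTC epoch seconds for the start (or, with end=True, the last second) of a day."""
    hms = (23, 59, 59) if end else (0, 0, 0)
    return int(datetime(y, m, d, *hms, tzinfo=timezone.utc).timestamp())

# Season windows from the list above (end dates inclusive).
SEASONS = {
    "S1": (_epoch(2019, 6, 16), _epoch(2019, 8, 4, end=True)),
    "S2": (_epoch(2022, 1, 9), _epoch(2022, 2, 7, end=True)),
}

RUE = re.compile(r"\brue\b", re.IGNORECASE)  # whole word, so "true" does not match

def in_season(comment, season):
    """True if the comment mentions 'Rue' and was created inside the season window."""
    start, end = SEASONS[season]
    return bool(RUE.search(comment["body"])) and start <= comment["created_utc"] <= end

# Hypothetical records, for illustration only.
sample = [
    {"body": "Rue's intervention scene was hard to watch", "created_utc": _epoch(2022, 1, 20)},
    {"body": "Rue and Jules at the carnival", "created_utc": _epoch(2019, 7, 1)},
    {"body": "so true, Fez deserves better", "created_utc": _epoch(2022, 1, 20)},
]
s2_rue = [c for c in sample if in_season(c, "S2")]
```

The word-boundary pattern is a design choice: a plain substring test for "rue" would also pick up words such as "true" or "cruel".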

**Limit Scope**
- Using library `psaw`, pull posts that mention 'Rue' during the time frame
- Using post ids, pull all comment trees for each post
- Experiment with tree depth : 100%, 85%, 50% or top N
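
One possible reading of the tree-depth experiment is a cap on how deep replies are followed. The sketch below works under that assumption and uses a hypothetical nested shape (`body`, `replies`); if the comments arrive as flat records they would first have to be linked through their parent ids.

```python
def flatten(tree, depth=0):
    """Yield (depth, body) for every comment in a nested reply tree."""
    yield depth, tree["body"]
    for reply in tree.get("replies", []):
        yield from flatten(reply, depth + 1)

def limit_depth(tree, fraction=1.0):
    """Keep comments whose depth is within `fraction` of the tree's maximum depth."""
    rows = list(flatten(tree))
    max_depth = max(d for d, _ in rows)
    cutoff = int(max_depth * fraction)
    return [body for d, body in rows if d <= cutoff]

# Hypothetical thread, for illustration only.
thread = {
    "body": "top-level comment",
    "replies": [
        {"body": "reply", "replies": [{"body": "reply to reply", "replies": []}]},
        {"body": "second reply", "replies": []},
    ],
}
full = limit_depth(thread, 1.0)   # all four comments
half = limit_depth(thread, 0.5)   # depth 0 and 1 only
```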

**Data Gathering**
- `psaw`
  - wrapper around the pushshift.io Reddit API
- `pickle`
  - saving data for later
- `pandas`
  - exporting data to csv
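
A minimal sketch of the save-and-export step, assuming the pulled comments are held as a plain Python list. The file names and the `body` column are placeholders chosen for this example.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical comment bodies standing in for the pulled data.
comments = ["rue carried this episode", "that scene was intense"]

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "comments.pkl")
    csv_path = os.path.join(tmp, "comments.csv")

    # pickle: save the raw Python object so a later session can reload it as-is
    with open(pkl_path, "wb") as fh:
        pickle.dump(comments, fh)
    with open(pkl_path, "rb") as fh:
        restored = pickle.load(fh)

    # pandas: export a tabular copy to CSV
    pd.DataFrame({"body": restored}).to_csv(csv_path, index=False)
    round_trip = pd.read_csv(csv_path)
```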


In [24]:
# load libraries
import pandas as pd
import numpy as np
import pickle

# run clean_funs.py to get the functions
exec(open('clean_funs.py').read())

In [25]:
# load data
from pandas import read_pickle

raw = read_pickle('../dat/s2_rue_comments.pkl')

In [26]:
# collect the index labels of rows whose value is a float (e.g. NaN)
# note: type(x) == 'float64' compares a type to a string and is never True
float_indices = raw.index[raw[0].apply(lambda x: isinstance(x, float))]

In [27]:
# collect the positions of rows whose value is a float
# (positions match the labels used by drop() only with a default 0..n-1 index)
float_indices = []
for i, value in enumerate(raw[0]):
    if isinstance(value, float):
        float_indices.append(i) 

In [28]:
# remove the rows with float indices
raw_no_float = raw.drop(float_indices)
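
The same filter can be written without collecting indices first. This is a sketch on a toy frame, assuming as the notebook does that the comment text sits in column `0` and that the unwanted rows hold floats such as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `raw`: column 0 holds comment text, with float/NaN noise.
toy = pd.DataFrame({0: ["rue relapsed", np.nan, "great episode", 3.5]})

# Boolean mask: True where the cell is a Python or NumPy float.
is_float = toy[0].apply(lambda x: isinstance(x, float))

# Keep everything else; a mask works for any index, not just 0..n-1.
toy_no_float = toy[~is_float]
```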

In [29]:
# remove spammy comments
raw_no_float['isSpam'] = raw_no_float[0].apply(clean_spam)

In [30]:
# drop the rows with spammy comments
# .copy() makes this an independent frame, so adding a column below
# does not raise a SettingWithCopyWarning
raw_no_float_no_spam = raw_no_float[raw_no_float['isSpam'] != 'spam'].copy()

In [31]:
# tag each comment with its detected language (used to drop non-English comments)
raw_no_float_no_spam['lang'] = raw_no_float_no_spam[0].apply(detect_language)

In [32]:
# filter for english
comments_ls = raw_no_float_no_spam[raw_no_float_no_spam['lang'] == 'en'][0].tolist()

#### 1. Clean data  

Common steps:
- remove punctuation  
- lowercase letters  
- remove numbers

Future steps after tokenization:  
- stem
- lemmatize
- combine phrases like 'thank you' to bigrams
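
The common steps listed above can be sketched in one function. `basic_clean` is a hypothetical name used for illustration; the notebook's own `round1` to `round4` live in `clean_funs.py` and may differ.

```python
import re
import string

def basic_clean(text):
    """Lowercase, strip punctuation, drop tokens containing digits, squeeze spaces."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = basic_clean("Rue's 2nd relapse in S2E5 was HEARTBREAKING!!!")
# cleaned == "rues relapse in was heartbreaking"
```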

In [33]:
data_clean = []
for comment in comments_ls:
    data_clean.append(round1(comment))

In [34]:
data_clean2 = []
for comment in data_clean:
    data_clean2.append(round2(comment))

In [35]:
data_clean3 = []
for comment in data_clean2:
    data_clean3.append(round3(comment))

*Because cleaning repeated characters also affects words with legitimate double letters (addict, struggle), do stemming and lemmatizing first, then clean repeated characters.*
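
The double-letter problem is easy to reproduce with a regular expression. The two helpers below are illustrative, not the notebook's functions: the first collapses every repeated character, the second only shortens runs of three or more.

```python
import re

def collapse_all(text):
    """Collapse every run of a repeated character to one (damages real double letters)."""
    return re.sub(r"(.)\1+", r"\1", text)

def collapse_long_runs(text):
    """Shorten only runs of three or more to two, leaving normal doubles intact."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(collapse_all("addict struggle"))                      # adict strugle
print(collapse_long_runs("addict struggle baaaaaddddddd"))  # addict struggle baadd
```

The gentler rule still leaves `baadd` rather than `bad`, so it trades some residual noise for keeping real words intact.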

In [36]:
data_clean3a = []
for comment in data_clean3:
    data_clean3a.append(round3a(comment))

In [37]:
# modify('switching to needle is baaaaaddddddd she manipulated her into needle use planning to pimp her out to pay her debt dark af')

In [38]:
data_clean4 = []
for comment in data_clean3a:
    data_clean4.append(round4(comment))

there are a lot of words with similar roots: play, played, playing
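
To show how shared roots can be merged, here is a deliberately crude suffix stripper. It is a toy stand-in, not the stemmer or lemmatizer used in `clean_funs.py`, and it also shows how stemming can produce non-words.

```python
from collections import Counter

def toy_stem(word):
    """Strip a few common suffixes; a crude illustration, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["play", "played", "playing", "plays", "rue"]
counts = Counter(toy_stem(t) for t in tokens)  # play: 4, rue: 1

odd = toy_stem("struggling")  # "struggl", a non-word
```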

also, somehow a lot of random posts ended up in the clean set:  
- ' shiping usa only payment paypal fampf buyer pay fe for paypal gampsdecants ml or ml htpsimgurcomalhnzx htpsimgurcomaiozvyq of botles and respective level measured with a syringe for reference ptfe tape betwen botle and nozle thread and parafilm around where nozle and botle met decants are individualy buble wraped before going inside zip bag and then inside buble mailersthanks for loking please chatpm if interested or if you have any questionsampxb htpsimgurcomalhnzx link for a photo of botlesmlmlgardenia antiguaorangerie venisepivoine suzhourose darabierogue malachitesable nuitvert malachiteampxb htpsimgurcomasvqrut link for a photo of botlesmlml rue cambonbeigebois de ilescoromandelcristale edtcristale eau vertela pausale liono edtno edpno poudreno sycomoreampxb htpsimgurcomausue link for a photo of botlesmlmlambre nuitbois dargentcuir canagebalade sauvagebele de jourdioramoureau noirefeve delicieusegrand balgris diorholy peonyjasmin de angesla cole noiremilylaforetmitzahnew lok oud ispahanoud rosewodpatchouli imperialpurple oudsantal noirvanila dioramavetiverampxb htpsimgurcomagoari link for a photo of botlesmlmlvelvet amber sunvelvet desert oudvelvet exotic leathervelvet incensovelvet mimosa blomvelvet tender oudampxb htpsimgurcomaksojbo link for a photo of botlesmlmlmlcarnal flowerlys mediteranethe monthe nightampxb htpsimgurcomazdbkwra link for a photo of botlesmlmla chant for the nympha song for the rosethe eye of the tigerthe voice of the snakeampxb htpsimgurcomahipx link for a photo of botlesmlmlangelique noirebois darmeniecuir belugaencens mythiquejoyeuse tubereuseneroli outrenoirsantal royalshalimar milsime vanila planifoliampxb htpsimgurcomalibgsh link for a photo of botlesmlmlamyris homeaqua universalis fortegrand soiroud extraitoud satin modampxb htpsimgurcomaeysuwsh link for a photo of botlesmlmlbabylondark lightday for nightdesert serenademarienbadmiracle of the roseampxb htpsimgurcomagdqne link for a photo of botlesmlmlambre sultanborneo cuir 
mauresquede profundisfumerie turquetubereuse crimineleampxb htpsimgurcomajxlzlf link for a photo of botlesmlml rue de belechase saint place sulpiceatlas gardencabancaftancapelinexquisite embroiderymagnificent goldsaharienesplendid wodtrenchtuxedoveloursvinyle'
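
Posts like this one could be caught after cleaning with a simple keyword check. This is a sketch with a hypothetical term list; it includes collapsed spellings such as `shiping`, because that is how the words appear in the cleaned text.

```python
# Hypothetical keyword list, chosen from the sample above.
SALE_TERMS = {"paypal", "shiping", "shipping", "decants", "buyer", "payment"}

def looks_like_sale_post(text, min_hits=2):
    """Flag text that contains at least `min_hits` distinct marketplace terms."""
    hits = SALE_TERMS.intersection(text.split())
    return len(hits) >= min_hits

docs = [
    "shiping usa only payment paypal buyer pay fe for paypal",
    "rue is realy strugling this season",
]
kept = [d for d in docs if not looks_like_sale_post(d)]
```

Requiring two distinct terms keeps a single stray mention of "payment" from removing a genuine comment.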

**Organize data**  

1. Corpus = `data_clean5_df`
2. Document-term matrix = `data_tdm`

In [39]:
data_clean4_df = pd.DataFrame(data_clean4)

In [40]:
# remove any empty rows
data_clean5_df = data_clean4_df[data_clean4_df[0] != '']
data_clean5_df = data_clean5_df[data_clean5_df[0] != ' ']
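
The two comparisons above only catch the empty string and a single space. A sketch of a more general filter on a toy frame, assuming every value in column `0` is a string:

```python
import pandas as pd

toy = pd.DataFrame({0: ["rue relapse", "", " ", "   ", "good episode"]})

# Strip whitespace first so strings made of any number of spaces count as empty.
non_empty = toy[toy[0].str.strip() != ""]
```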

In [41]:
# pickle it
data_clean5_df.to_pickle('../dat/corpus.pkl')

In [42]:
# `nlp` is assumed to be the spaCy pipeline loaded by clean_funs.py
sw_spacy = nlp.Defaults.stop_words | {'rt', 'via', '…'}

In [43]:
# create document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.feature_extraction import text
add_stop_words = ['i', 'just','did', 'ab', 'amp', 'ml', 'xb','abc', 'abcb', 'abcny', 'abd', 'abdabca', 'fs', 
                  'zpqxhxhzanapjsjbf', 'zqcsrpwsge', 'zqnuhckwdqwrhkuo', 'zs', 'zshwbhethehenozxfyqg',
                  'zsmkbrmwngzsibrntkt', 'zy', 'zwhnrmujykdxmntiub', 'afqjcnguytghbsuvixmglpwzqbg', 'ebecadcbdfcbafbdb',
                  'abfbmltmqspf', 'abfafebfbad', 'abaedefabdfef', 'abafbfbedbada', 'her', 'him',  'and',
                  'episode', 'season', 's', 'lol']

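# use spaCy stop words instead of sklearn's built-in list
# CountVectorizer documents `stop_words` as a list, and newer scikit-learn
# versions validate the type, so convert the set before passing it in
stop_words = list(sw_spacy.union(add_stop_words))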

cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_clean5_df[0])
data_tdm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_tdm_index = data_tdm.index
data_tdm



Unnamed: 0,aback,abandon,abandonment,abash,abdoman,abduct,abey,abhorent,abide,abie,...,zombie,zomer,zomg,zone,zongao,zote,zoya,zrue,zsuzsana,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19036,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19037,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19038,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19039,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
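
The matrix is mostly zeros because each comment uses only a handful of the vocabulary terms. A small standard-library sketch of the same idea (rows are documents, columns are terms) makes the layout concrete. It illustrates what the counts mean and is not how `CountVectorizer` is implemented; the three documents are made up.

```python
from collections import Counter

docs = ["rue relapse again", "rue and jules", "jules left rue"]
stop_words = {"and", "again"}

# One Counter per document, skipping stop words.
doc_counts = [Counter(w for w in d.split() if w not in stop_words) for d in docs]

# The sorted vocabulary fixes the column order.
vocab = sorted(set().union(*doc_counts))

# Rows are documents, columns are terms, cells are counts.
matrix = [[c[term] for term in vocab] for c in doc_counts]
```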


Meh, not sure how I feel about stemming and lemmatizing.
They create odd tokens that I think add more noise than help.

In [44]:
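# pickle for later use
# NOTE: the input loaded above is the season 2 file (s2_rue_comments.pkl),
# but this writes to tdm_s1.pkl; check that the file name is intended
data_tdm.to_pickle('../dat/tdm_s1.pkl')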