## Discovering impact of the Series 'Euphoria' through NLP
### Analysis based on posts and comments on the `r/euphoria` subreddit

#### Question & Data Cleaning

Analysis plan:  

0. identify question(s) and data sources
1. clean data to get it into a standard format for further analysis
   1. corpus (collection of texts) - to dataframe using `pandas`
   2. document-term matrix - clean, tokenize, tdm
2. EDA
3. topic modeling based on comments referencing 'rue' posts on r/euphoria
4. *network analysis between users of drug and euphoria communities* - time permitting
5. aspect based sentiment analysis of euphoria drug-related comments
6. *topic modeling based on comments that refer to euphoria on r/opioids, r/cannabis, r/benzodiazepenes* - time permitting

#### 0. What is our question?

*What about the drug portrayal on HBO's Euphoria makes it engaging to fans?*
*Specifically, what topics emerge in the commments and is there a observable difference between seasons 1 and 2?*

I: the question, "what makes viewers of HBO's Euphoria engaged in online discourse on Reddit - Seasons 1 and 2?"  
O: data that cleaned, organized, in a standard format that can be used in future analysis  

**Data Source(s)**  
1. Reddit - `r/euphoria`
   1. posts/comments
   2. filtering comments on date and 'Rue'
      1. S1:
      2. S2: jan 9 - feb 7, 2022

**Limit Scope**
- Using library `psaw`, pull posts that mention 'Rue' during the time frame
- Using post ids, pull all comment trees for each post
- Experiment with tree depth : 100%, 85%, 50% or top N

**Data Gathering**
- `psaw`
  - wrapper around pushift.io reddit api
- `pickle`
  - saving data for later
- `pandas`
  - exporting data to csv


In [1]:
# load libraries
import pandas as pd
import numpy as np
import pickle

In [2]:
# load data
from pandas import read_pickle

comments = read_pickle('../dat/s2_rue_comments.pkl')

In [3]:
# remove rows where body = '[deleted]'
comments = comments[comments[0] != '[deleted]']
# remove any row that contatins an automatically generated response by the bot
pattern_del = 'Thank you for your submission'
filtered = comments[0].str.contains(pattern_del, na = False)
comments = comments[~filtered]
# remove solicitation comments
pattern_ads = 'paypal'
filtered_ads = comments[0].str.contains(pattern_ads, na = False)
comments = comments[~filtered_ads]


In [4]:
# remove non=english comments
from langdetect import detect

# add a new column for language
def detect_language(text):
    try:
        return detect(text)
    except:
        return 'unknown'

comments['lang'] = comments[0].apply(detect_language)


In [5]:
# filter for english
comments = comments[comments['lang'] == 'en']

In [6]:
comments_ls = comments[0].tolist()

#### 1. Clean data  

Common steps:
- remove punctuation  
- lowercase letters  
- remove numbers

Future steps after tokenization:  
- stem
- lemmatize
- combine phrases like 'thank you' to bigrams

In [7]:
# load libraries
from string import punctuation
import re
import unicodedata

In [8]:
# clean and remove numbers
# def cleaning_numbers(data):
#     return re.sub('[0-9]+', '', data)

In [8]:
# using regex - clean and remove URLs
def cleaning_URLs(text):
    return re.sub('((www.[^s]+)|(https?://[^s]+))',' ',text)

In [9]:
def clean_round1(text):
    # convert all to string
    text = str(text)
    # lower
    text = text.lower()
    # text in squre brackets
    text = re.sub('\[.*?\]', ' ', text)
    # urls
    text = cleaning_URLs(text)
    # punctuation
    text = re.sub('[%s]' % re.escape(punctuation), ' ', text)
    # remove numbers
    text = re.sub('[0-9]+', '', text)
    # it looks like there are edits being made to comments
    # remove any instance of edit:
    text = re.sub('edit:', '', text)
    # remove any user handles
    text = re.sub('@[a-zA-Z0-9_]+', '', text)
    return text

round1 = lambda x: clean_round1(x)

In [10]:
data_clean = []
for comment in comments_ls:
    data_clean.append(round1(comment))

In [11]:
# round two of cleaning: remove line breaks, emojis, qutes, etc.
def clean_round2(text):
    text = re.sub('\n', '', text)
    # text = re.sub('[''""...]', '', text)
    # remove emojis
    text = ''.join(c for c in text if not unicodedata.combining(c))
    return text

round2 = lambda x: clean_round2(x)

In [12]:
data_clean2 = []
for comment in data_clean:
    data_clean2.append(round2(comment))

In [13]:
# round 3 cleaning: expand contractions: i'd, you've, you're etc
import contractions
def clean_round3(text):
    # expand contractions
    text = contractions.fix(text)
    return text

round3 = lambda x: clean_round3(x)

In [14]:
data_clean3 = []
for comment in data_clean2:
    data_clean3.append(round3(comment))

*because of the cleanning of repeated characters affecting words with double letters (addict, struggle), do stemming and lemmatizing first, then clean repeated chars*

In [15]:
# another round with lemmatization and stemming
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

def clean_round3a(text):
    # stemming
    # stemmer = PorterStemmer()
    # text = ' '.join([stemmer.stem(word) for word in text.split()])
    # lemmatization
    lemmatizer = WordNetLemmatizer()
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

    return text

round3a = lambda x: clean_round3a(x)


In [16]:
data_clean3a = []
for comment in data_clean3:
    data_clean3a.append(round3a(comment))

In [67]:
# grab the 500 most common words in all the comments to use to populate a
# list of words to be excluded in the modify function
# def get_top_n_words(corpus, n=None):
#     vectorizer = CountVectorizer(max_features=n, stop_words='english')
#     vectorizer.fit_transform(corpus)
#     bag_of_words = vectorizer.vocabulary_
#     # sort the words by their frequency
#     bag_of_words = sorted(bag_of_words.items(), key=lambda x: x[1], reverse=True)
#     return bag_of_words

# top = get_top_n_words(data_clean3a, n=500)

In [17]:
# remove accented characters
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

# clean and remove repeated characters
# def cleaning_repeating_char(text):
#     return re.sub(r'(.)1+', r'1', text)

In [54]:
do_not_mod_ls = ['good', 'spelling', 'telling', 'addiction','addicts', 'finally', 'all', 'personally', 'struggles', 'really', 'cassie', 'free', 
                 'feeling', 'elliot', 'maddy', 'literally', 'getting', 'better', 'actually', 'totally', 'telling', 'supposed', 'teen',
                 'stuff', 'sorry', 'soon', 'sells', 'sees', 'schools', 'rooms', 'redeeming', 'reddit', 'putting', 'pretty', 'official', 'need',
                 'messages', 'matters', 'looks', 'little', 'kills', 'issues', 'horror', 'hell', 'happens', 'happy', 'gonna', 'getting', 'especially',
                 'different', 'classic', 'businesses', 'attention', 'basically', 'apples', 'weeks', 'streets', 'needles', 'planning']

from operator import contains
def modify(s):
    # split comments into words
    comment = []
    for word in s.split():
        if any(word in x for x in do_not_mod_ls if contains(x, word)):
            comment.append(word)
        else:
            word = re.sub(r'([a-z])\1+', r'\1', word)
            comment.append(word)
    # join the words back together
    comment = ' '.join(comment)
    return comment
    
# print(modify('good'))
# print(modify('waaayyy'))

In [55]:
# modify('switching to needle is baaaaaddddddd she manipulated her into needle use planning to pimp her out to pay her debt dark af')

'switching to needle is bad she manipulated her into needle use planning to pimp her out to pay her debt dark af'

In [56]:
# remove repeating characters and non unicode characters
def clean_round4(text):
    # text = remove_accented_chars(text)
    # text = cleaning_repeating_char(text)
    # remove non unicode characters
    text = re.sub('[^\x00-\x7F]+', '', text)
    # remove repeating characters
    text = modify(text)
    text = text.strip()
    return text

round4 = lambda x: clean_round4(x)

In [57]:
data_clean4 = []
for comment in data_clean3:
    data_clean4.append(round4(comment))

there are a lot of words with similar roots: play, played, playing

also, somehow a lot of random posts ended up in the clean set:  
- ' shiping usa only payment paypal fampf buyer pay fe for paypal gampsdecants ml or ml htpsimgurcomalhnzx htpsimgurcomaiozvyq of botles and respective level measured with a syringe for reference ptfe tape betwen botle and nozle thread and parafilm around where nozle and botle met decants are individualy buble wraped before going inside zip bag and then inside buble mailersthanks for loking please chatpm if interested or if you have any questionsampxb htpsimgurcomalhnzx link for a photo of botlesmlmlgardenia antiguaorangerie venisepivoine suzhourose darabierogue malachitesable nuitvert malachiteampxb htpsimgurcomasvqrut link for a photo of botlesmlml rue cambonbeigebois de ilescoromandelcristale edtcristale eau vertela pausale liono edtno edpno poudreno sycomoreampxb htpsimgurcomausue link for a photo of botlesmlmlambre nuitbois dargentcuir canagebalade sauvagebele de jourdioramoureau noirefeve delicieusegrand balgris diorholy peonyjasmin de angesla cole noiremilylaforetmitzahnew lok oud ispahanoud rosewodpatchouli imperialpurple oudsantal noirvanila dioramavetiverampxb htpsimgurcomagoari link for a photo of botlesmlmlvelvet amber sunvelvet desert oudvelvet exotic leathervelvet incensovelvet mimosa blomvelvet tender oudampxb htpsimgurcomaksojbo link for a photo of botlesmlmlmlcarnal flowerlys mediteranethe monthe nightampxb htpsimgurcomazdbkwra link for a photo of botlesmlmla chant for the nympha song for the rosethe eye of the tigerthe voice of the snakeampxb htpsimgurcomahipx link for a photo of botlesmlmlangelique noirebois darmeniecuir belugaencens mythiquejoyeuse tubereuseneroli outrenoirsantal royalshalimar milsime vanila planifoliampxb htpsimgurcomalibgsh link for a photo of botlesmlmlamyris homeaqua universalis fortegrand soiroud extraitoud satin modampxb htpsimgurcomaeysuwsh link for a photo of botlesmlmlbabylondark lightday for nightdesert serenademarienbadmiracle of the roseampxb htpsimgurcomagdqne link for a photo of botlesmlmlambre sultanborneo cuir mauresquede profundisfumerie turquetubereuse crimineleampxb htpsimgurcomajxlzlf link for a photo of botlesmlml rue de belechase saint place sulpiceatlas gardencabancaftancapelinexquisite embroiderymagnificent goldsaharienesplendid wodtrenchtuxedoveloursvinyle'

**Organize data**  

1. Corpus = `data_clean4`
2. TDM

In [58]:
data_clean4_df = pd.DataFrame(data_clean4)

In [59]:
# remove any empty rows
data_clean4_df = data_clean4_df[data_clean4_df[0] != '']
data_clean4_df = data_clean4_df[data_clean4_df[0] != ' ']

In [60]:
# pickle it
data_clean4_df.to_pickle('../dat/corpus.pkl')

In [61]:
# create document-term matrix
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean4_df[0])
data_tdm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_tdm_index = data_tdm.index
data_tdm

Unnamed: 0,ab,aback,abaedefabdfef,abafbfbedbada,abandon,abandoned,abandoning,abandonment,abandons,abashed,...,zqcsrpwsge,zqnuhckwdqwrhkuo,zrue,zs,zshwbhethehenozxfyqg,zsmkbrmwngzsibrntkt,zsuzsana,zurich,zwhnrmujykdxmntiub,zy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19283,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19284,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19285,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19286,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


meh, not sure how i feel about stemming and lemmatizing.
creates weird words that i think are more noisy than helpful.

In [62]:
# pickle for later use
data_tdm.to_pickle('../dat/tdm.pkl')