### Discovering impact of the Series 'Euphoria' through NLP
#### Analysis based on posts and comments on the `r/euphoria` subreddit

Analysis plan:  

0. identify question(s) and data sources
1. clean data to get it into a standard format for further analysis
   1. corpus (collection of texts) - to dataframe using `pandas`
   2. document-term matrix - clean, tokenize, tdm
2. EDA
3. topic modeling based on comments of drug-related posts on r/euphoria
4. network analysis between users of drug and euphoria communities
5. aspect based sentiment analysis of euphoria drug-related comments
6. topic modeling based on comments that refer to euphoria on r/opioids, r/cannabis, r/benzodiazepenes

#### 0. What is our question?

*What about the drug portrayal on HBO's Euphoria makes it engaging to fans?*

I: the question, "what makes Euphoria engaged in conversations around drugs?"  
O: data that cleaned, organized, in a standard format that can be used in future analysis  

**Data Source(s)**  
1. Reddit - `r/euphoria`
   1. comments
   2. building keyword list for filetering comments  

**Limit Scope**
- Using library `praw`, pull 'top' and 'hot' headlines from subreddit
- Filter headlines based on if any keyword appeared in the headline
  - keyword list was created from a brief manual examination of headlines for repeated mentions of certain substances
- Filter drug-related headlines based on the number of comments and overall relevance 
  - number of comments provided indication of how popular the headline was

**Data Gathering**
- `praw`
  - wrapper around reddit api
- `pickle`
  - saving data for later
- `pandas`
  - exporting data to csv


In [1]:
# load libraries
import pandas as pd
import numpy as np
import pickle

In [13]:
# load data
comments = pd.read_csv('../dat/all_comments.csv')

In [14]:
# remove rows where body = '[deleted]'
comments = comments[comments['body'] != '[deleted]']

In [7]:
# group comments based on post id and return dictionary of post id and comments

def combine_text(comments):
    comments = comments.groupby('post').agg({'body': lambda x: ' '.join(x)})
    return comments

In [15]:
comments_combined = combine_text(comments)

In [None]:
# comments_combined['body'][0]

#### 1. Clean data  

Common steps:
- remove punctuation  
- lowercase letters  
- remove numbers

Future steps after tokenization:  
- stem
- lemmatize
- combine phrases like 'thank you' to bigrams

In [16]:
# load libraries
from string import punctuation
import re
import unicodedata

In [17]:
def clean_round1(text):
    text = text.lower()
    # text in squre brackets
    text = re.sub('\[.*?\]', '', text)
    # punctuation
    text = re.sub('[%s]' % re.escape(punctuation), '', text)
    # remove numbers
    text = re.sub('\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_round1(x)

In [19]:
data_clean = pd.DataFrame(comments_combined['body'].apply(round1))

In [20]:
# round two of cleaning: remove line breaks, emojis, quptes, etc.
def clean_round2(text):
    text = re.sub('\n', '', text)
    text = re.sub('[''""...]', '', text)
    # remove emojis
    text = ''.join(c for c in text if not unicodedata.combining(c))
    return text

round2 = lambda x: clean_round2(x)

In [21]:
data_clean2 = pd.DataFrame(data_clean['body'].apply(round2))

In [44]:
# remove punctuation, lowercase, remove numbers
punctuations_list = string.punctuation

def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)

# clean and remove numbers
def cleaning_numbers(data):
    return re.sub('[0-9]+', '', data)

In [45]:
# put it together
def clean_text(text):
    text = text.lower()
    text = cleaning_punctuations(text)
    text = cleaning_numbers(text)
    return text

In [49]:
comments['clean_body'] = comments['body'].apply(clean_text)

**Organize data**  

1. Corpus = `data_clean2`
2. TDM

In [22]:
# corpus
# add column with meaningfull post identifier

titles = ['Likely to try drugs', 'Elliots response to free drugs', 'Realistic portrayal of withdrawal']
data_clean2['post_q'] = titles

In [23]:
# pickle it
data_clean2.to_pickle('../dat/corpus.pkl')

In [24]:
# create document-term matrix
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data_cv = cv.fit_transform(data_clean2['body'])
data_tdm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_tdm_index = data_tdm.index
data_tdm



Unnamed: 0,aa,aana,ab,aback,abby,abhorrence,ability,able,about,above,...,zealand,zendaya,zendayas,zero,zoloft,zombie,zone,zoo,zooming,zs
0,1,0,0,0,1,0,0,9,41,1,...,1,8,1,0,0,0,1,0,1,1
1,2,1,1,1,0,0,2,6,107,1,...,0,2,0,1,0,1,0,0,0,0
2,3,0,1,0,0,1,0,14,127,0,...,0,4,0,3,1,1,1,1,0,0


#### 1. Topic modeling of comments from 3 posts in the r/euphoria subreddit  
N ~ 1,709 comments
1. Question: Does euphoria make you less likely to try drugs? Or are you more curious than you were before?
2. Not enough people are talking about Elliot's response to Rue telling him about her plan to get "free" drugs from Laurie
3. As an ex-opioid addict, Zendaya's withdrawal scenes are the most realistic portrayal I've ever seen before. her acting is phenomenal.

In [36]:
# load libraries
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models.ldamulticore import LdaMulticore

**Clean text**

In [40]:
# using regex - clean and remove URLs
def cleaning_URLs(text):
    return re.sub('((www.[^s]+)|(https?://[^s]+))',' ',text)

# remove accented characters
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

# remove punctuation
import string
english_punctuations = string.punctuation
punctuations_list = english_punctuations
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)

# remove stopwords
STOPWORDS = set(stopwords.words('english'))
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

# clean and remove repeated characters
def cleaning_repeating_char(text):
    return re.sub(r'(.)1+', r'1', text)

# clean and remove numbers
def cleaning_numbers(data):
    return re.sub('[0-9]+', '', data)

tokenizer=ToktokTokenizer()

In [41]:
# text preprocessing - clean comments for non-ascii characters, remove stopwords, lemmatize
# tokenize comments

def clean_text(comment):
    doc = cleaning_URLs(comment)
    doc = remove_accented_chars(doc)
    doc = cleaning_punctuations(doc)
    doc = cleaning_numbers(doc)
    doc = cleaning_numbers(doc)
    doc = tokenizer.tokenize(doc)
    doc = cleaning_stopwords(doc)
    doc = WordNetLemmatizer().lemmatize(doc)
    doc = PorterStemmer().stem(doc)
    return doc

In [27]:
sample = comments['body'].sample(n=5)

In [42]:
clean_text(sample)

TypeError: expected string or bytes-like object

In [34]:
# delete tokens column
comments.drop(columns=['tokens'], inplace=True)

In [35]:
# apply clean_text to comments
comments['tokens'] = comments['body'].apply(clean_text)

In [30]:
# get dictionary
dictionary = Dictionary(comments['tokens'])

TypeError: decoding to str: need a bytes-like object, list found

In [11]:
# implement LDA
def lda_model(corpus, dictionary, num_topics=10, passes=20):
    """
    Create LDA model
    """
    lda_model = LdaMulticore(corpus=corpus,
                             id2word=dictionary,
                             num_topics=num_topics,
                             passes=passes,
                             workers=2)
    return lda_model

In [None]:
# apply LDA to comments
test = 