## Discovering the impact of the series 'Euphoria' through NLP
### Analysis based on posts and comments on the `r/euphoria` subreddit

#### Question & Data Cleaning

Analysis plan:  

0. identify question(s) and data sources
1. clean data to get it into a standard format for further analysis
   1. corpus (collection of texts) - to dataframe using `pandas`
   2. document-term matrix - clean, tokenize, tdm
2. EDA
3. topic modeling based on comments of drug-related posts on r/euphoria
4. network analysis between users of drug and euphoria communities
5. aspect based sentiment analysis of euphoria drug-related comments
6. topic modeling based on comments that refer to Euphoria on r/opioids, r/cannabis, r/benzodiazepines

#### 0. What is our question?

*What about the drug portrayal on HBO's Euphoria makes it engaging to fans?*

Input: the question, "What about Euphoria's drug portrayal gets fans engaged in conversations around drugs?"  
Output: data that is cleaned, organized, and in a standard format that can be used in later analysis  

**Data Source(s)**  
1. Reddit - `r/euphoria`
   1. comments
   2. building keyword list for filtering comments  

**Limit Scope**
- Using library `praw`, pull 'top' and 'hot' headlines from subreddit
- Keep only headlines that contain at least one keyword
  - keyword list was created from a brief manual examination of headlines for repeated mentions of certain substances
- Filter drug-related headlines based on the number of comments and overall relevance 
  - number of comments provided indication of how popular the headline was
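
The two filters above can be sketched in plain Python. This is only an illustration: the keyword list, the comment counts, and the `min_comments` threshold are made up, and the real headlines would come from `praw` rather than a hard-coded list.

```python
import re

# Hypothetical keyword list and headlines, for illustration only
KEYWORDS = ["drug", "fentanyl", "withdrawal", "relapse"]

headlines = [
    {"title": "Realistic portrayal of withdrawal", "num_comments": 412},
    {"title": "Favorite outfit this season?", "num_comments": 950},
    {"title": "Rue's relapse scene", "num_comments": 12},
]

# One case-insensitive pattern that matches any keyword as a substring
pattern = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def filter_headlines(items, min_comments=100):
    """Keep headlines that mention a keyword and have enough comments."""
    return [
        h for h in items
        if pattern.search(h["title"]) and h["num_comments"] >= min_comments
    ]

selected = filter_headlines(headlines)
```

Substring matching means "drug" also catches "drugs"; the final relevance check was still done by hand.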

**Data Gathering**
- `praw`
  - wrapper around reddit api
- `pickle`
  - saving data for later
- `pandas`
  - exporting data to csv


In [1]:
# load libraries
import pandas as pd
import numpy as np
import pickle

In [13]:
# load data
comments = pd.read_csv('../dat/all_comments.csv')

In [14]:
# remove rows where body = '[deleted]'
comments = comments[comments['body'] != '[deleted]']

In [7]:
# group comments by post id and return a DataFrame (indexed by post id) with each post's comments joined into one string

def combine_text(comments):
    comments = comments.groupby('post').agg({'body': lambda x: ' '.join(x)})
    return comments

In [15]:
comments_combined = combine_text(comments)

In [None]:
# comments_combined['body'][0]

#### 1. Clean data  

Common steps:
- remove punctuation  
- lowercase letters  
- remove numbers

Future steps after tokenization:  
- stem
- lemmatize
- combine phrases like 'thank you' to bigrams
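
As a rough idea of the bigram step, here is a minimal sketch that joins known word pairs after tokenization. The phrase list and the `merge_phrases` helper are hypothetical; a real list would come from inspecting the corpus or from a phrase-detection tool.

```python
# Hypothetical phrase list, for illustration only
PHRASES = [("thank", "you"), ("season", "two")]

def merge_phrases(tokens, phrases=PHRASES):
    """Join known adjacent word pairs into single underscore-joined tokens."""
    pairs = set(phrases)
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip both words of the matched pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "thank you for posting this".split()
merged = merge_phrases(tokens)
```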

In [16]:
# load libraries
from string import punctuation
import re
import unicodedata

In [17]:
def clean_round1(text):
    text = text.lower()
    # text in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # punctuation
    text = re.sub('[%s]' % re.escape(punctuation), '', text)
    # remove numbers (any token containing a digit)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_round1(x)

In [19]:
data_clean = pd.DataFrame(comments_combined['body'].apply(round1))

In [27]:
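# round two of cleaning: remove line breaks and combining marks
def clean_round2(text):
    # line breaks are deleted outright, so words on either side of a break get joined
    text = re.sub('\n', '', text)
    # text = re.sub('[''""...]', '', text)
    # drop Unicode combining marks (e.g. stray accents); most emoji have combining class 0 and are kept
    text = ''.join(c for c in text if not unicodedata.combining(c))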
    return text

round2 = lambda x: clean_round2(x)

In [28]:
data_clean2 = pd.DataFrame(data_clean['body'].apply(round2))
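
`unicodedata.combining` only flags combining marks, and most emoji characters have combining class 0, so they pass through round two. If stripping emoji is wanted, one possible approach is sketched below. It assumes that dropping every character in Unicode category `So` (Symbol, other) is acceptable, which also removes symbols such as © and °; the `strip_emoji` helper is hypothetical and not part of the pipeline above.

```python
import unicodedata

def strip_emoji(text):
    """Drop 'So' (Symbol, other) characters plus common emoji modifiers."""
    extras = {'\u200d', '\ufe0f'}  # zero-width joiner, variation selector-16
    return ''.join(
        c for c in text
        if unicodedata.category(c) != 'So'
        and c not in extras
        and not ('\U0001F3FB' <= c <= '\U0001F3FF')  # skin-tone modifiers
    )

sample = "rue \U0001F62D deserved better \u2764\ufe0f"
cleaned = strip_emoji(sample)
```

Since `CountVectorizer`'s default token pattern only keeps word characters, leftover emoji would not become terms anyway; this matters mainly if the cleaned text is reused elsewhere.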

In [30]:
# round 3 cleaning: expand contractions: i'd, you've, you're etc
import contractions
def clean_round3(text):
    # expand contractions
    text = contractions.fix(text)
    return text

round3 = lambda x: clean_round3(x)

In [31]:
data_clean3 = pd.DataFrame(data_clean2['body'].apply(round3))
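
One thing to watch with this ordering: round one already stripped punctuation, so apostrophes are gone by the time contractions are expanded. The toy sketch below shows why order matters for any expander keyed on apostrophe forms. The mini lookup is hypothetical and merely stands in for a real contraction table; it says nothing about which apostrophe-free forms the `contractions` package itself recognizes.

```python
import re
from string import punctuation

# Hypothetical mini lookup, standing in for a real contraction table
CONTRACTIONS = {"you're": "you are", "i'd": "i would", "don't": "do not"}

def expand(text):
    """Replace whole-token contractions found in the mini lookup."""
    return ' '.join(CONTRACTIONS.get(tok, tok) for tok in text.split())

def strip_punct(text):
    """Remove ASCII punctuation, as round one does."""
    return re.sub('[%s]' % re.escape(punctuation), '', text)

raw = "you're right, i'd watch it again"

expand_first = strip_punct(expand(raw))  # apostrophes still present when expanding
strip_first = expand(strip_punct(raw))   # apostrophes already gone, nothing matches
```

Expanding first gives "you are right i would watch it again", while stripping first leaves "youre right id watch it again".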

In [44]:
# another round with lemmatization and stemming
# from nltk.stem import WordNetLemmatizer
# from nltk.stem.porter import PorterStemmer

# def clean_round4(text):
#     # lemmatization
#     # lemmatizer = WordNetLemmatizer()
#     # text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
#     # stemming
#     stemmer = PorterStemmer()
#     text = ' '.join([stemmer.stem(word) for word in text.split()])
#     return text

# round4 = lambda x: clean_round4(x)


In [45]:
# data_clean4 = pd.DataFrame(data_clean3['body'].apply(round4))

**Organize data**  

1. Corpus = `data_clean3`
2. TDM
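
For intuition about what the matrix holds, here is a minimal standard-library sketch: one row per post, one column per term, each cell a count. The toy corpus is made up, and unlike `CountVectorizer` this version does not drop stop words or single-character tokens.

```python
from collections import Counter

# Toy corpus (hypothetical): one cleaned string per post
docs = {
    "post_a": "rue relapse rue withdrawal",
    "post_b": "withdrawal scene felt real",
}

# Count terms per document
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Sorted vocabulary across all documents becomes the column order
vocab = sorted(set().union(*counts.values()))

# One row per document, one column per term
dtm = {name: [c[term] for term in vocab] for name, c in counts.items()}
```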

In [48]:
# corpus
# add column with meaningful post identifier

titles = ['Likely to try drugs', 'Elliots response to free drugs', 'Realistic portrayal of withdrawal']
data_clean3['post_q'] = titles

In [49]:
# pickle it
data_clean3.to_pickle('../dat/corpus.pkl')

In [50]:
# create document-term matrix
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean3['body'])
data_tdm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_tdm_index = data_tdm.index
data_tdm

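|   | aa | aana | ab | aback | abby | abhorrence | ability | able | abroad | absolute | ... | zealand | zendaya | zendayas | zero | zoloft | zombie | zone | zoo | zooming | zs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 9 | 0 | 4 | ... | 1 | 8 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 1 | 2 | 1 | 1 | 1 | 0 | 0 | 2 | 6 | 0 | 0 | ... | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 14 | 1 | 4 | ... | 0 | 4 | 0 | 3 | 1 | 1 | 1 | 1 | 0 | 0 |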


Meh, not sure how I feel about stemming and lemmatizing.
They create weird words that I think are more noisy than helpful.

In [51]:
# pickle for later use
data_tdm.to_pickle('../dat/tdm.pkl')