# Project 3 - Reddit webscraping
by Liyena Yusoff

## Contents:
* Problem Statement
* Datasets
* Data Import & cleaning
* Exploratory Data Analysis
* Next Steps

## Background
Over the past year, Netflix's stock price has been falling due to a shortage of the in-demand films that drive viewership and subscriptions (abcnews, 2023). As a fallout of the Hollywood actors' and writers' strikes, Netflix has produced fewer original shows, and subscription sign-ups are lower than in previous years.

## Problem Statement
As the marketing team at Netflix, our primary objective is to boost website traffic and increase the number of sign-ups for our streaming service. To achieve this, we aim to develop a machine learning classifier that distinguishes Reddit posts from the Netflix subreddit from posts from the Disney+ subreddit. The model will identify words and phrases associated specifically with Netflix, allowing us to understand what sets us apart in public perception.

## Goal

The goal of the project is to utilize the unique words and insights for the company's marketing campaigns to amplify our unique selling points, thereby driving more traffic to our website and increasing subscriber sign-ups and viewership.

## Stakeholders

1. Netflix Marketing Team
2. Netflix Content Team

## Success metrics
* F1-score
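
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The counts are made-up illustrative values, not results from this project, and `f1_from_counts` is a hypothetical helper name.

```python
def f1_from_counts(tp, fp, fn):
    """Return precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, with Netflix posts treated as the positive class
precision, recall, f1 = f1_from_counts(tp=80, fp=20, fn=10)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because F1 balances precision and recall, it penalises a model that favours one subreddit at the expense of the other.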

## Datasets

* [`Netflix_reddit_submissions.csv`](datasets/Netflix_reddit_submissions.csv)
* [`DisneyPlus_reddit_submissions.csv`](datasets/DisneyPlus_reddit_submissions.csv)

# 1. Import Data

In [2]:
import pandas as pd

from bs4 import BeautifulSoup
import spacy
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag
import re 

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


In [3]:
nf = pd.read_csv('datasets/Netflix_reddit_submissions.csv')
dp = pd.read_csv('datasets/DisneyPlus_reddit_submissions.csv')

In [4]:
nf.head()

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,post_type,domain,created_utc,pinned,locked,stickied
0,/r/Netflix Discord Server,We are pleased to announce we have affiliated ...,427,0.97,182,N3DSdude,Announcement,3,False,False,text,self.netflix,1619278000.0,False,False,True
1,Netflix Announces Plans to Crack Down on Passw...,> **Any post relating to this thread will now ...,661,0.94,3069,UniversallySecluded,Megathread,0,False,False,text,self.netflix,1675331000.0,False,False,True
2,Hope this is cool to share here: I was the art...,,8,0.67,2,DamienTorres,,0,False,False,link,reddit.com,1694479000.0,False,False,False
3,Any tips for de-morbiding your feed?,This is on behalf of a friend although my feed...,9,0.76,3,Scully__,,0,False,False,text,self.netflix,1694473000.0,False,False,False
4,Netflix Wrapping Up Anna Kendrick’s Serial Kil...,,4,1.0,0,misana123,,0,False,False,link,variety.com,1694481000.0,False,False,False


In [5]:
nf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3456 entries, 0 to 3455
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   title                3456 non-null   object 
 1   selftext             2103 non-null   object 
 2   ups                  3456 non-null   int64  
 3   upvote_ratio         3456 non-null   float64
 4   num_comments         3456 non-null   int64  
 5   author               3456 non-null   object 
 6   link_flair_text      217 non-null    object 
 7   awards               3456 non-null   int64  
 8   is_original_content  3456 non-null   bool   
 9   is_video             3456 non-null   bool   
 10  post_type            3456 non-null   object 
 11  domain               3456 non-null   object 
 12  created_utc          3456 non-null   float64
 13  pinned               3456 non-null   bool   
 14  locked               3456 non-null   bool   
 15  stickied             3456 non-null   b

# 2. Data Cleaning

1. Remove Duplicates
2. Fill in missing flair text
3. Drop rows with empty `selftext`
4. Reset Index
5. Get readable time
6. Include a `subreddit` column to differentiate the data between Netflix and DisneyPlus

In [6]:
nf.head(30)

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,post_type,domain,created_utc,pinned,locked,stickied
0,/r/Netflix Discord Server,We are pleased to announce we have affiliated ...,427,0.97,182,N3DSdude,Announcement,3,False,False,text,self.netflix,1619278000.0,False,False,True
1,Netflix Announces Plans to Crack Down on Passw...,> **Any post relating to this thread will now ...,661,0.94,3069,UniversallySecluded,Megathread,0,False,False,text,self.netflix,1675331000.0,False,False,True
2,Hope this is cool to share here: I was the art...,,8,0.67,2,DamienTorres,,0,False,False,link,reddit.com,1694479000.0,False,False,False
3,Any tips for de-morbiding your feed?,This is on behalf of a friend although my feed...,9,0.76,3,Scully__,,0,False,False,text,self.netflix,1694473000.0,False,False,False
4,Netflix Wrapping Up Anna Kendrick’s Serial Kil...,,4,1.0,0,misana123,,0,False,False,link,variety.com,1694481000.0,False,False,False
5,Why did netflix remove some profile pictures?,Was using one of the one piece profile picture...,11,0.64,6,ironshadowy,,0,False,False,text,self.netflix,1694448000.0,False,False,False
6,One Piece Issue?,I saw episode one and two earlier and they wer...,2,1.0,0,ussjtrunksftw,,0,False,False,text,self.netflix,1694478000.0,False,False,False
7,Looking for tv shows that its story is based i...,"Hello, as the titles says looking for any tv s...",19,0.72,45,shaoOOlin,,0,False,False,text,self.netflix,1694419000.0,False,False,False
8,Dear Child,"I just finished watching ""Dear Child"" on Netfl...",0,0.4,2,Psychological-Ant562,,0,False,False,text,self.netflix,1694471000.0,False,False,False
9,Anyone still having Billing through Apple?,I know it has been a long time since Netflix s...,0,0.5,0,bblunt29,,0,False,False,text,self.netflix,1694461000.0,False,False,False


In [7]:
nf[nf['post_type'] == 'link'].isnull().sum()

title                     0
selftext               1326
ups                       0
upvote_ratio              0
num_comments              0
author                    0
link_flair_text        1263
awards                    0
is_original_content       0
is_video                  0
post_type                 0
domain                    0
created_utc               0
pinned                    0
locked                    0
stickied                  0
dtype: int64

In [8]:
nf.shape

(3456, 16)

## i. Cleaning Netflix Data

The `link_flair_text` column is categorical and contains post flairs. Flairs let the subreddit moderators ('mods' for short) and community members visually tag content. They are useful because they categorize posts by content type, and these categories can double as subtopics within the community.

In the Netflix subreddit, flairs are rarely used: only 217 of the 3,456 posts carry one (for example _'Announcement'_ and _'Megathread'_), while the Disney+ subreddit uses at least 20 different flairs. We will impute the null values in the `link_flair_text` column with _'others'_.

As we will be analysing the text columns and words, we will drop the rows with empty `selftext`.

In [9]:
def clean_data(df):
    
    # drop the duplicated rows
    df = df.drop_duplicates()
    
    # fill missing flair text with 'others'
    df['link_flair_text'] = df['link_flair_text'].fillna('others')
    
    # drop nan rows
    df.dropna(inplace=True)
    
    # reset_index
    df.reset_index(drop=True, inplace=True)
    
    return df

In [10]:
nf = clean_data(nf)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['link_flair_text'] = df['link_flair_text'].fillna('others')
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(inplace=True)
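
The warnings above come from modifying the result of `drop_duplicates()` in place. A minimal sketch of one way to avoid them, assuming the same cleaning steps are wanted, is to take an explicit copy and reassign instead of using `inplace=True`. The name `clean_data_no_warning` and the toy DataFrame are illustrative only.

```python
import pandas as pd

def clean_data_no_warning(df):
    # Explicit copy so later assignments act on an independent DataFrame
    df = df.drop_duplicates().copy()
    # Fill missing flair text with 'others'
    df['link_flair_text'] = df['link_flair_text'].fillna('others')
    # Drop rows that still have missing values, then renumber the rows
    return df.dropna().reset_index(drop=True)

# Made-up rows: one exact duplicate, one post without selftext
toy = pd.DataFrame({
    'title': ['a', 'a', 'b', 'c'],
    'selftext': ['x', 'x', None, 'z'],
    'link_flair_text': ['News', 'News', None, None],
})
cleaned = clean_data_no_warning(toy)
print(cleaned)
```

The result is the same as with `clean_data`; only the way the intermediate DataFrame is handled differs.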


In [11]:
nf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1913 entries, 0 to 1912
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   title                1913 non-null   object 
 1   selftext             1913 non-null   object 
 2   ups                  1913 non-null   int64  
 3   upvote_ratio         1913 non-null   float64
 4   num_comments         1913 non-null   int64  
 5   author               1913 non-null   object 
 6   link_flair_text      1913 non-null   object 
 7   awards               1913 non-null   int64  
 8   is_original_content  1913 non-null   bool   
 9   is_video             1913 non-null   bool   
 10  post_type            1913 non-null   object 
 11  domain               1913 non-null   object 
 12  created_utc          1913 non-null   float64
 13  pinned               1913 non-null   bool   
 14  locked               1913 non-null   bool   
 15  stickied             1913 non-null   b

### Clean up the columns

Convert the `created_utc` to `datetime` format

In [12]:
nf['readable_time'] = pd.to_datetime(nf['created_utc'], unit='s')
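
With `unit='s'`, the values are read as seconds since 1970-01-01 00:00:00 UTC. As a cross-check, the same conversion can be sketched with the standard library. The value used here is the `created_utc` of the first row as displayed in `nf.head()`, which appears to be rounded in the display, so the printed time can differ by a few minutes from the `readable_time` stored in the DataFrame.

```python
from datetime import datetime, timezone

# created_utc stores seconds since the Unix epoch (UTC)
created_utc = 1619278000.0  # displayed (apparently rounded) value of the first row
readable = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(readable.strftime('%Y-%m-%d %H:%M:%S'))
```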

In [13]:
nf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1913 entries, 0 to 1912
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   title                1913 non-null   object        
 1   selftext             1913 non-null   object        
 2   ups                  1913 non-null   int64         
 3   upvote_ratio         1913 non-null   float64       
 4   num_comments         1913 non-null   int64         
 5   author               1913 non-null   object        
 6   link_flair_text      1913 non-null   object        
 7   awards               1913 non-null   int64         
 8   is_original_content  1913 non-null   bool          
 9   is_video             1913 non-null   bool          
 10  post_type            1913 non-null   object        
 11  domain               1913 non-null   object        
 12  created_utc          1913 non-null   float64       
 13  pinned               1913 non-nul

Add a new column to indicate that the data comes from the Netflix subreddit

In [14]:
# 1 for netflix
nf['subreddit'] = 1

### ii. Combining the `title` and `selftext` columns

We combine these two columns for easier text processing and modeling in the later part.

In [73]:
# a function that combines the title and text columns into a new column

def combine_text_col(df, title_col, selftext_col):
    
    # use the column names passed in rather than hard-coded ones
    df['text'] = df[title_col] + " " + df[selftext_col]
    
    return df
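
A note on why the rows with empty `selftext` are dropped before this step: in pandas, concatenating strings with `+` yields a missing value whenever either side is missing. The sketch below shows this on made-up rows. Filling with an empty string is shown only as an alternative and is not what this notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['One Piece Issue?', 'Link-only post'],
    'selftext': ['I saw episode one', None],
})

# String concatenation propagates missing values
combined = toy['title'] + " " + toy['selftext']
print(combined.isna().tolist())

# If link-only posts were to be kept, the gap could be filled first
combined_filled = toy['title'] + " " + toy['selftext'].fillna('')
print(combined_filled.tolist())
```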

In [16]:
# combining text columns
combine_text_col(nf,'title','selftext')

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,post_type,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text
0,/r/Netflix Discord Server,We are pleased to announce we have affiliated ...,427,0.97,182,N3DSdude,Announcement,3,False,False,text,self.netflix,1.619278e+09,False,False,True,2021-04-24 15:24:04,1,/r/Netflix Discord Server We are pleased to an...
1,Netflix Announces Plans to Crack Down on Passw...,> **Any post relating to this thread will now ...,661,0.94,3069,UniversallySecluded,Megathread,0,False,False,text,self.netflix,1.675331e+09,False,False,True,2023-02-02 09:35:27,1,Netflix Announces Plans to Crack Down on Passw...
2,Any tips for de-morbiding your feed?,This is on behalf of a friend although my feed...,9,0.76,3,Scully__,others,0,False,False,text,self.netflix,1.694473e+09,False,False,False,2023-09-11 23:00:51,1,Any tips for de-morbiding your feed? This is o...
3,Why did netflix remove some profile pictures?,Was using one of the one piece profile picture...,11,0.64,6,ironshadowy,others,0,False,False,text,self.netflix,1.694448e+09,False,False,False,2023-09-11 15:57:38,1,Why did netflix remove some profile pictures? ...
4,One Piece Issue?,I saw episode one and two earlier and they wer...,2,1.00,0,ussjtrunksftw,others,0,False,False,text,self.netflix,1.694478e+09,False,False,False,2023-09-12 00:19:47,1,One Piece Issue? I saw episode one and two ear...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1908,"As I have two different wifis in my home, Netf...",I live in Argentina where they are currently a...,787,0.95,162,franchuv17,others,0,False,False,text,self.netflix,1.661973e+09,False,False,False,2022-08-31 19:11:25,1,"As I have two different wifis in my home, Netf..."
1909,Blockbuster is so disappointing,"Like a lot of people, I was so excited for thi...",766,0.95,187,Phillies059,others,0,False,False,text,self.netflix,1.667768e+09,False,False,False,2022-11-06 21:01:05,1,Blockbuster is so disappointing Like a lot of ...
1910,"Has anyone else watched Love and Monsters, I t...",\nThought it would be a Sci Fi original level ...,772,0.96,165,C1-10PTHX1138,others,2,False,False,text,self.netflix,1.618929e+09,False,False,False,2021-04-20 14:25:36,1,"Has anyone else watched Love and Monsters, I t..."
1911,Why does Netflix only suggest shitty movies wh...,When I look at the front page of netflix it se...,765,0.94,111,LeSpatula,others,0,False,False,text,self.netflix,1.659830e+09,False,False,False,2022-08-06 23:56:56,1,Why does Netflix only suggest shitty movies wh...


We examine a few of the texts to preview what they look like.

In [17]:
test = nf['text'][0]
test

'/r/Netflix Discord Server We are pleased to announce we have affiliated with https://discord.gg/Netflix which will be the subreddit Discord server for the Netflix subreddit! \n\nFeel free to join the server and talk about everything Netflix related, including shows on Netflix as well :).'

In [18]:
test2 = nf['text'][1001]
test2

"So...Cloverfield: Paradox is at 16% on Rotten Tomatoes...I'm now convinced there is a dedicated brigade to undermine Netflix original movies. Run, studio Cowards, Run. This is beyond ridiculous."

### iii. Process the `text` column

From the previews, we want to apply these steps to the combined texts:
1. Remove any HTML markup and URLs in the text
2. Remove non-letters such as digits, punctuation and symbols (e.g. '!' and '%')
3. Convert the text to lower case
4. Remove English stopwords such as 'he', 'she', 'of'

In [19]:
# Function to convert a raw text to a string of words

def clean_text(raw_text):
    
    # 1. Remove HTML (explicit parser avoids the parser-guessing warning).
    text = BeautifulSoup(raw_text, "html.parser").get_text()
    
    # 2. Remove urls (raw string so \S is read as a regex escape).
    no_urls = re.sub(r"http\S+", " ", text)
    
    # 3. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", no_urls)
    
    # 4. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 5. Stopwords to be removed.
    stops = set(stopwords.words('english'))
    
    # 6. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]
    
    # 7. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)

In [20]:
clean_text(test)

'r netflix discord server pleased announce affiliated subreddit discord server netflix subreddit feel free join server talk everything netflix related including shows netflix well'

**How does the _clean_text_ function work?**

| Text   | Processed Text |
|---------|---------------|
|/r/Netflix Discord Server We are pleased to announce we have affiliated with https://discord.gg/Netflix which will be the subreddit Discord server for the Netflix subreddit! \n\nFeel free to join the server and talk about everything Netflix related, including shows on Netflix as well :).| r netflix discord server pleased announce affiliated subreddit discord server netflix subreddit feel free join server talk everything netflix related including shows netflix well|
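
To make the intermediate steps visible, here is a reduced sketch of the same pipeline using only the standard library. The stopword set is a tiny hand-picked stand-in for NLTK's English list (an assumption for illustration), the HTML step is left out, and `clean_text_steps` is a hypothetical name.

```python
import re

# Tiny stand-in for stopwords.words('english'); illustration only
TOY_STOPS = {"we", "are", "to", "have", "with", "the", "for", "be", "will", "which"}

def clean_text_steps(raw_text):
    no_urls = re.sub(r"http\S+", " ", raw_text)        # drop URLs
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_urls)  # keep letters only
    words = letters_only.lower().split()               # lower-case and tokenize
    kept = [w for w in words if w not in TOY_STOPS]    # remove stopwords
    return no_urls, words, " ".join(kept)

sample = "We are pleased to announce we have affiliated with https://discord.gg/Netflix!"
no_urls, words, cleaned = clean_text_steps(sample)
print(no_urls)
print(words)
print(cleaned)
```

Two side effects are worth keeping in mind: the URL pattern also consumes punctuation attached to the link, and replacing non-letters with spaces splits tokens such as '/r/Netflix' into 'r' and 'netflix', as seen in the processed text above.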

In [23]:
# Function that gets the pos_tag for a given word. 
# The pos_tag needs to be passed together with the given word into the lemmatizer
# in order to effectively lemmatize all words besides nouns

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [24]:
lemmatizer = WordNetLemmatizer()

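# Function to convert a raw text to a string of lemmatized words

def clean_text_lem(raw_text):
    
    # 1. Remove HTML (explicit parser avoids the parser-guessing warning).
    text = BeautifulSoup(raw_text, "html.parser").get_text()
    
    # 2. Remove urls (raw string so \S is read as a regex escape).
    no_urls = re.sub(r"http\S+", " ", text)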
    
    # 3. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", no_urls)
    
    # 4. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 5. Lemmatize
    lem_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]
    
    # 6. Stopwords to be removed.
    stops = set(stopwords.words('english'))
    
    # 7. Remove stopwords.
    meaningful_words = [w for w in lem_words if w not in stops]
    
    # 8. Join the words back into one string separated by space, 
    # and return the result.
    return(" ".join(meaningful_words))

In [25]:
clean_text_lem(test)

'r netflix discord server pleased announce affiliate subreddit discord server netflix subreddit feel free join server talk everything netflix related include show netflix well'

**How does the _clean_text_lem_ function work?**

| Text   | Processed Text | Processed Lemmatized Text  |
|---------|---------------|----------------------------|
|/r/Netflix Discord Server We are pleased to announce we have affiliated with https://discord.gg/Netflix which will be the subreddit Discord server for the Netflix subreddit! \n\nFeel free to join the server and talk about everything Netflix related, including shows on Netflix as well :).| r netflix discord server pleased announce affiliated subreddit discord server netflix subreddit feel free join server talk everything netflix related including shows netflix well| r netflix discord server pleased announce affiliate subreddit discord server netflix subreddit feel free join server talk everything netflix related include show netflix well|


Words such as 'affiliated', 'including' and 'shows' are lemmatized to 'affiliate', 'include' and 'show'.

In [74]:
# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Process a text string
text = clean_text(test)
doc = nlp(text)

# Accessing token information
for token in doc:
    print(token.text, token.pos_, token.dep_)

r NOUN compound
netflix PROPN compound
discord NOUN compound
server NOUN nmod
pleased ADJ amod
announce NOUN npadvmod
affiliated VERB ROOT
subreddit NOUN compound
discord NOUN compound
server NOUN compound
netflix PROPN compound
subreddit NOUN nsubj
feel VERB ccomp
free ADJ amod
join NOUN compound
server NOUN compound
talk NOUN compound
everything PRON compound
netflix PROPN nsubj
related VERB advcl
including VERB prep
shows VERB pobj
netflix NOUN dobj
well ADV advmod


In [75]:
nlp = spacy.load("en_core_web_sm")

# Define the lemmatization function
def text_lem_spacy(text):
    doc = nlp(text)
    lemmatized_text = ' '.join([token.lemma_ for token in doc])
    return lemmatized_text

In [28]:
text_lem_spacy(clean_text(test))

'r netflix discord server pleased announce affiliate subreddit discord server netflix subreddit feel free join server talk everything netflix relate include show netflix well'

**How does _text_lem_spacy_ compare to the nltk lemmatization?**

| Processed Text | Processed Lemmatized Text  | Processed Lemmatized Text (spacy) |
|---------------|----------------------------|-----------------------------------|
| r netflix discord server pleased announce affiliated subreddit discord server netflix subreddit feel free join server talk everything netflix related including shows netflix well| r netflix discord server pleased announce affiliate subreddit discord server netflix subreddit feel free join server talk everything netflix related include show netflix well|r netflix discord server pleased announce affiliate subreddit discord server netflix subreddit feel free join server talk everything netflix relate include show netflix well|


The word 'related' is lemmatized to 'relate' by spaCy but is left unchanged by the NLTK lemmatization function.

In [29]:
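# Function to convert a raw text to a string of stemmed words
p_stemmer = PorterStemmer()
def clean_text_stem(raw_text):
    
    # 1. Remove HTML (explicit parser avoids the parser-guessing warning).
    text = BeautifulSoup(raw_text, "html.parser").get_text()
    
    # 2. Remove urls (raw string so \S is read as a regex escape).
    no_urls = re.sub(r"http\S+", " ", text)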
    
    # 3. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", no_urls)
    
    # 4. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 5. Stem
    stem_words = [p_stemmer.stem(word) for word in words]
    
    # 6. Stopwords to be removed.
    stops = set(stopwords.words('english'))
    
    # 7. Remove stopwords.
    meaningful_words = [w for w in stem_words if w not in stops]
    
    # 8. Join the words back into one string separated by space, 
    # and return the result.
    return(" ".join(meaningful_words))

In [30]:
clean_text_stem(test)

'r netflix discord server pleas announc affili subreddit discord server netflix subreddit feel free join server talk everyth netflix relat includ show netflix well'

**How does _stemming_ work?**

| Processed Text | Processed Lemmatized Text  | Processed Stemmed Text |
|---------------|----------------------------|-----------------------------------|
| r netflix discord server pleased announce affiliated subreddit discord server netflix subreddit feel free join server talk everything netflix related including shows netflix well| r netflix discord server pleased announce affiliate subreddit discord server netflix subreddit feel free join server talk everything netflix related include show netflix well|r netflix discord server pleas announc affili subreddit discord server netflix subreddit feel free join server talk everyth netflix relat includ show netflix well|




After stemming, many words are cut off and are no longer complete words (for example 'pleas', 'announc' and 'everyth'). We will therefore lemmatize the processed text with the spaCy lemmatizer rather than apply NLTK stemming.

In [None]:
# mapping the functions onto the netflix dataframe

In [32]:
nf['proc_text'] = nf['text'].map(clean_text)



In [33]:
nf['proc_text_lem'] = nf['proc_text'].map(text_lem_spacy)

In [34]:
nf.head()

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,...,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text,proc_text,proc_text_lem
0,/r/Netflix Discord Server,We are pleased to announce we have affiliated ...,427,0.97,182,N3DSdude,Announcement,3,False,False,...,self.netflix,1619278000.0,False,False,True,2021-04-24 15:24:04,1,/r/Netflix Discord Server We are pleased to an...,r netflix discord server pleased announce affi...,r netflix discord server pleased announce affi...
1,Netflix Announces Plans to Crack Down on Passw...,> **Any post relating to this thread will now ...,661,0.94,3069,UniversallySecluded,Megathread,0,False,False,...,self.netflix,1675331000.0,False,False,True,2023-02-02 09:35:27,1,Netflix Announces Plans to Crack Down on Passw...,netflix announces plans crack password sharing...,netflix announce plan crack password sharing m...
2,Any tips for de-morbiding your feed?,This is on behalf of a friend although my feed...,9,0.76,3,Scully__,others,0,False,False,...,self.netflix,1694473000.0,False,False,False,2023-09-11 23:00:51,1,Any tips for de-morbiding your feed? This is o...,tips de morbiding feed behalf friend although ...,tip de morbide feed behalf friend although fee...
3,Why did netflix remove some profile pictures?,Was using one of the one piece profile picture...,11,0.64,6,ironshadowy,others,0,False,False,...,self.netflix,1694448000.0,False,False,False,2023-09-11 15:57:38,1,Why did netflix remove some profile pictures? ...,netflix remove profile pictures using one one ...,netflix remove profile picture use one one pie...
4,One Piece Issue?,I saw episode one and two earlier and they wer...,2,1.0,0,ussjtrunksftw,others,0,False,False,...,self.netflix,1694478000.0,False,False,False,2023-09-12 00:19:47,1,One Piece Issue? I saw episode one and two ear...,one piece issue saw episode one two earlier fi...,one piece issue see episode one two early fine...


In [76]:
nf.describe(include='all')



Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,...,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text,proc_text,proc_text_lem
count,1913,1913,1913.0,1913.0,1913.0,1913.0,1913,1913.0,1913,1913,...,1913,1913.0,1913,1913,1913,1913,1913.0,1913,1913,1913
unique,1683,1689,,,,1488.0,11,,1,2,...,14,,1,2,2,1691,,1690,1690,1690
top,One Piece,This is on behalf of a friend although my feed...,,,,,others,,False,False,...,self.netflix,,False,False,False,2023-09-10 17:43:08,,Any tips for de-morbiding your feed? This is o...,tips de morbiding feed behalf friend although ...,tip de morbide feed behalf friend although fee...
freq,4,3,,,,143.0,1814,,1913,1901,...,1872,,1913,1903,1911,3,,3,3,3
first,,,,,,,,,,,...,,,,,,2011-06-23 20:49:52,,,,
last,,,,,,,,,,,...,,,,,,2023-09-12 00:19:47,,,,
mean,,,281.329848,0.621422,63.62206,,,0.113434,,,...,,1632904000.0,,,,,1.0,,,
std,,,1742.226681,0.191734,280.994206,,,0.585023,,,...,,85398050.0,,,,,0.0,,,
min,,,0.0,0.06,0.0,,,0.0,,,...,,1308862000.0,,,,,1.0,,,
25%,,,0.0,0.5,3.0,,,0.0,,,...,,1592033000.0,,,,,1.0,,,


In [87]:
nf[nf[['title','selftext']].duplicated(keep=False)].sort_values(by=['proc_text_lem'])

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,...,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text,proc_text,proc_text_lem
223,Can't access DVD.COM,Anyone experienced this (screen below). I'm no...,5,0.65,4,00derek,others,0,False,False,...,self.netflix,1.693064e+09,False,False,False,2023-08-26 15:30:20,1,Can't access DVD.COM Anyone experienced this (...,access dvd com anyone experienced screen compu...,access dvd com anyone experience screen comput...
496,Can't access DVD.COM,Anyone experienced this (screen below). I'm no...,6,0.67,4,00derek,others,0,False,False,...,self.netflix,1.693064e+09,False,False,False,2023-08-26 15:30:20,1,Can't access DVD.COM Anyone experienced this (...,access dvd com anyone experienced screen compu...,access dvd com anyone experience screen comput...
582,Access Thailand Netflix from the US.,I’m trying to access Thai Netflix from the US ...,0,0.50,4,Tokyo_Hardnutz,others,0,False,False,...,self.netflix,1.691860e+09,False,False,False,2023-08-12 17:12:36,1,Access Thailand Netflix from the US. I’m tryin...,access thailand netflix us trying access thai ...,access thailand netflix we try access thai net...
376,Access Thailand Netflix from the US.,I’m trying to access Thai Netflix from the US ...,0,0.45,4,Tokyo_Hardnutz,others,0,False,False,...,self.netflix,1.691860e+09,False,False,False,2023-08-12 17:12:36,1,Access Thailand Netflix from the US. I’m tryin...,access thailand netflix us trying access thai ...,access thailand netflix we try access thai net...
67,“Your account can’t be used on this account”,I have traveled to a different country and I c...,1,0.56,2,theonlysisterfister,others,0,False,False,...,self.netflix,1.694112e+09,False,False,False,2023-09-07 18:40:03,1,“Your account can’t be used on this account” I...,account used account traveled different countr...,account use account travel different country u...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
556,Word of caution: Check for cancellation first,Apologies if this has been discussed here ad n...,0,0.40,4,mcknuckle,others,0,False,False,...,self.netflix,1.692291e+09,False,False,False,2023-08-17 16:42:26,1,Word of caution: Check for cancellation first ...,word caution check cancellation first apologie...,word caution check cancellation first apology ...
853,Zambian Girl Power On NETFLIX\n\nThere’s a gro...,\n\nThere’s a growing thirst for African films...,2,0.53,0,AfricanStream,others,0,False,True,...,v.redd.it,1.689972e+09,False,False,False,2023-07-21 20:39:22,1,Zambian Girl Power On NETFLIX\n\nThere’s a gro...,zambian girl power netflix growing thirst afri...,zambian girl power netflix grow thirst african...
1171,Zambian Girl Power On NETFLIX\n\nThere’s a gro...,\n\nThere’s a growing thirst for African films...,3,0.54,0,AfricanStream,others,0,False,True,...,v.redd.it,1.689972e+09,False,False,False,2023-07-21 20:39:22,1,Zambian Girl Power On NETFLIX\n\nThere’s a gro...,zambian girl power netflix growing thirst afri...,zambian girl power netflix grow thirst african...
560,Zombieverse Netflix Review,This show is absolutely horrible I'm debating ...,5,0.73,4,One-Office-8127,others,0,False,False,...,self.netflix,1.692245e+09,False,False,False,2023-08-17 03:56:57,1,Zombieverse Netflix Review This show is absolu...,zombieverse netflix review show absolutely hor...,zombieverse netflix review show absolutely hor...
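
The pairs above share the same `title`, `selftext`, author and timestamp but differ in `ups` or `upvote_ratio`, which is why `drop_duplicates()` on all columns kept both copies. If such rows should count as duplicates, one option is to deduplicate on the text columns only. The sketch below uses made-up rows modelled on the output above and is not a step taken in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ["Can't access DVD.COM", "Can't access DVD.COM", 'Zombieverse Netflix Review'],
    'selftext': ['Anyone experienced this', 'Anyone experienced this', 'This show is'],
    'ups': [5, 6, 5],
})

# Rows differ in 'ups', so a full-row comparison finds no duplicates
print(toy.duplicated().sum())

# Comparing only the text columns flags the repeated post
deduped = toy.drop_duplicates(subset=['title', 'selftext'], keep='first').reset_index(drop=True)
print(len(deduped))
```

Which copy to keep (`keep='first'` here) is a judgment call, since the copies carry different vote counts.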


## iv. Data Cleaning - Disney+

In [35]:
dp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3915 entries, 0 to 3914
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   title                3915 non-null   object 
 1   selftext             2000 non-null   object 
 2   ups                  3915 non-null   int64  
 3   upvote_ratio         3915 non-null   float64
 4   num_comments         3915 non-null   int64  
 5   author               3915 non-null   object 
 6   link_flair_text      3729 non-null   object 
 7   awards               3915 non-null   int64  
 8   is_original_content  3915 non-null   bool   
 9   is_video             3915 non-null   bool   
 10  post_type            3915 non-null   object 
 11  domain               3915 non-null   object 
 12  created_utc          3915 non-null   float64
 13  pinned               3915 non-null   bool   
 14  locked               3915 non-null   bool   
 15  stickied             3915 non-null   b

In [36]:
dp = clean_data(dp)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['link_flair_text'] = df['link_flair_text'].fillna('others')
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(inplace=True)


In [37]:
dp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1798 entries, 0 to 1797
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   title                1798 non-null   object 
 1   selftext             1798 non-null   object 
 2   ups                  1798 non-null   int64  
 3   upvote_ratio         1798 non-null   float64
 4   num_comments         1798 non-null   int64  
 5   author               1798 non-null   object 
 6   link_flair_text      1798 non-null   object 
 7   awards               1798 non-null   int64  
 8   is_original_content  1798 non-null   bool   
 9   is_video             1798 non-null   bool   
 10  post_type            1798 non-null   object 
 11  domain               1798 non-null   object 
 12  created_utc          1798 non-null   float64
 13  pinned               1798 non-null   bool   
 14  locked               1798 non-null   bool   
 15  stickied             1798 non-null   b

In [38]:
dp['link_flair_text'].unique()

array([':Tech: Tech Support', ':Thread: Mega Thread',
       ':Like: Recommendation', ':Discussion: Discussion',
       ':Question: Question', ':Watch: What Should I Watch?',
       ':New: New on Disney+!', ':Review: Review', ':News: News Article',
       ':Trailer: Official Trailer', ':Art: Fan Art', 'Mega Thread',
       ':Mod: Mod Post', 'Review', 'Discussion', ':WORLD: Global',
       'Question', 'Disney+ Service', ':WORLD: All', 'others', ':US: US',
       'DisneyPlus', 'Rumor', 'Recommendation', 'News', 'DisneyPlus Star',
       'Europe', 'Missing Movie/Show', 'Tech Issue', 'Announcement',
       'Missing/Out of Order Episode', ':FI: FI', 'North America',
       'Star Wars', ':snoo_thoughtful: Discussion', ':WORLD: World',
       ':UK: UK', 'Technical Issue', 'Fox', 'Oceania',
       'What Should I Watch?', ':CH: CH', 'Removed: Rule 8', ':CA: CA',
       'Disney', 'Latin America', 'Asia', ':AU: AU',
       'National Geographic', 'Film Discussion Thread', 'Marvel',
       'Origina

### Clean the texts

Convert the `created_utc` to `datetime` format

In [39]:
dp['readable_time'] = pd.to_datetime(dp['created_utc'], unit='s')

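As a small illustration of the conversion above, the sketch below applies `pd.to_datetime` with `unit='s'` to a toy frame. The values are made-up epoch seconds, not rows from the dataset, and `toy` is a hypothetical name.

```python
import pandas as pd

# Toy epoch timestamps in seconds since 1970-01-01 UTC (not taken from the dataset)
toy = pd.DataFrame({'created_utc': [0.0, 86400.0]})

# unit='s' tells pandas the numbers are seconds rather than nanoseconds
toy['readable_time'] = pd.to_datetime(toy['created_utc'], unit='s')
print(toy)
```

Without `unit='s'`, pandas would interpret the numbers as nanoseconds, so every post would appear to be dated within seconds of 1970-01-01.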
In [40]:
dp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1798 entries, 0 to 1797
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   title                1798 non-null   object        
 1   selftext             1798 non-null   object        
 2   ups                  1798 non-null   int64         
 3   upvote_ratio         1798 non-null   float64       
 4   num_comments         1798 non-null   int64         
 5   author               1798 non-null   object        
 6   link_flair_text      1798 non-null   object        
 7   awards               1798 non-null   int64         
 8   is_original_content  1798 non-null   bool          
 9   is_video             1798 non-null   bool          
 10  post_type            1798 non-null   object        
 11  domain               1798 non-null   object        
 12  created_utc          1798 non-null   float64       
 13  pinned               1798 non-nul

Add a new column to indicate that the dataset is from Disney+

In [41]:
# 0 for disneyplus
dp['subreddit'] = 0

In [42]:
dp.head()

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,post_type,domain,created_utc,pinned,locked,stickied,readable_time,subreddit
0,This is the Weekly Tech Support Thread,All posts regarding tech support belong here.\...,3,0.81,26,AutoModerator,:Tech: Tech Support,0,False,False,text,self.DisneyPlus,1694020000.0,False,False,True,2023-09-06 17:04:21,0
1,Ahsoka - Episodes 1 and 2 Megathread,Ahsoka is (almost) here!\n\nStart streaming th...,19,0.99,27,anonRedd,:Thread: Mega Thread,0,False,False,text,self.DisneyPlus,1692730000.0,False,False,True,2023-08-22 18:38:53,0
2,For a while I was very reticent about watching...,"… I loved it. It’s not the original, it’s a re...",65,0.66,95,kindaweird0,:Like: Recommendation,0,False,False,link,i.redd.it,1694429000.0,False,False,False,2023-09-11 10:35:02,0
3,You guys think that Hulu would be like Star fo...,"Personally, I am very excited about the merger...",2,1.0,0,thekirasquad,:Discussion: Discussion,0,False,False,text,self.DisneyPlus,1694483000.0,False,False,False,2023-09-12 01:36:00,0
4,Can Disney+ run 1080p or 4K on a PC now?,I keep finding threads that are 2-3 year olds ...,3,0.81,4,ArsenalThePhoenix,:Question: Question,0,False,False,text,self.DisneyPlus,1694462000.0,False,False,False,2023-09-11 19:59:45,0


In [None]:
# combining the title and selftext columns

In [43]:
combine_text_col(dp, 'title', 'selftext')

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,post_type,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text
0,This is the Weekly Tech Support Thread,All posts regarding tech support belong here.\...,3,0.81,26,AutoModerator,:Tech: Tech Support,0,False,False,text,self.DisneyPlus,1.694020e+09,False,False,True,2023-09-06 17:04:21,0,This is the Weekly Tech Support Thread All pos...
1,Ahsoka - Episodes 1 and 2 Megathread,Ahsoka is (almost) here!\n\nStart streaming th...,19,0.99,27,anonRedd,:Thread: Mega Thread,0,False,False,text,self.DisneyPlus,1.692730e+09,False,False,True,2023-08-22 18:38:53,0,Ahsoka - Episodes 1 and 2 Megathread Ahsoka is...
2,For a while I was very reticent about watching...,"… I loved it. It’s not the original, it’s a re...",65,0.66,95,kindaweird0,:Like: Recommendation,0,False,False,link,i.redd.it,1.694429e+09,False,False,False,2023-09-11 10:35:02,0,For a while I was very reticent about watching...
3,You guys think that Hulu would be like Star fo...,"Personally, I am very excited about the merger...",2,1.00,0,thekirasquad,:Discussion: Discussion,0,False,False,text,self.DisneyPlus,1.694483e+09,False,False,False,2023-09-12 01:36:00,0,You guys think that Hulu would be like Star fo...
4,Can Disney+ run 1080p or 4K on a PC now?,I keep finding threads that are 2-3 year olds ...,3,0.81,4,ArsenalThePhoenix,:Question: Question,0,False,False,text,self.DisneyPlus,1.694462e+09,False,False,False,2023-09-11 19:59:45,0,Can Disney+ run 1080p or 4K on a PC now? I kee...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1793,Disney should now release all of the original ...,Now that Disney is making the remarkable decis...,264,0.92,146,4KBlurayAvenger,:Discussion: Discussion,0,False,False,link,i.redd.it,1.692669e+09,False,False,False,2023-08-22 01:44:47,0,Disney should now release all of the original ...
1794,Disney CEO Bob Iger Is Open To Selling Hulu,Sharing it here just for the sake of some inte...,239,0.98,111,HumanOrAlien,:News: News Article,0,False,False,link,deadline.com,1.675961e+09,False,False,False,2023-02-09 16:39:17,0,Disney CEO Bob Iger Is Open To Selling Hulu Sh...
1795,"I wish I could remove stuff from ""continue wat...",Its baffling because every other streaming ser...,228,0.98,45,Mythdon-,:Discussion: Discussion,0,False,False,text,self.DisneyPlus,1.682673e+09,False,False,False,2023-04-28 09:13:33,0,"I wish I could remove stuff from ""continue wat..."
1796,Futurama Returns after almost 20 years!!,My favorite show probably ever has returned!! ...,211,0.90,54,UnrealityPsychosis,:New: New on Disney+!,0,False,False,link,i.redd.it,1.690290e+09,False,False,False,2023-07-25 13:07:26,0,Futurama Returns after almost 20 years!! My fa...

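`combine_text_col` is defined earlier in the notebook and is not reproduced here. Judging only from the `text` column in the output above (title, a space, then selftext), a helper of this kind might look roughly like the sketch below. The name `combine_text_col_sketch`, the `new_col` parameter and the missing-value handling are assumptions, and unlike the notebook's helper this version returns a new frame instead of modifying its input.

```python
import pandas as pd

def combine_text_col_sketch(df, col_a, col_b, new_col='text'):
    """Hypothetical stand-in for the notebook's combine_text_col helper."""
    out = df.copy()
    # Treat missing values as empty strings so the concatenation never yields NaN
    out[new_col] = out[col_a].fillna('') + ' ' + out[col_b].fillna('')
    # Remove the stray space left behind when one side was empty
    out[new_col] = out[new_col].str.strip()
    return out

toy = pd.DataFrame({'title': ['Zootopia question', 'Help'],
                    'selftext': ['Why was it renamed?', None]})
combined = combine_text_col_sketch(toy, 'title', 'selftext')
print(combined['text'].tolist())
```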

In [44]:
dp['text'].head()

0    This is the Weekly Tech Support Thread All pos...
1    Ahsoka - Episodes 1 and 2 Megathread Ahsoka is...
2    For a while I was very reticent about watching...
3    You guys think that Hulu would be like Star fo...
4    Can Disney+ run 1080p or 4K on a PC now? I kee...
Name: text, dtype: object

Examining some of the texts

In [45]:
dp['text'][0]

"This is the Weekly Tech Support Thread All posts regarding tech support belong here.\n\nExamples of tech support questions are:  \n\n\n* How do I cancel?\n* Why does the app crash on my Fire Stick/Roku/Apple TV?\n* Why don't the subtitles work correctly?\n* I am being overcharged for my subscription.\n\nBrowse other tech support posts [here](https://old.reddit.com/r/DisneyPlus/search?q=tech+support&restrict_sr=on&sort=relevance&t=all)."

In [46]:
dp['text'][1001]

'I watched Zootopia for the first time over the weekend, and honestly I’m mad I didn’t watch it sooner. I haven’t seen many recent Disney movies, and didn’t understand why they would consider adding a Zootopia area to Animal Kingdom, but now I get it. Great movie, and the part where Judy is riding the train into Zootopia was surprisingly beautiful.'

In [None]:
# mapping the functions onto the disney+ dataframe

In [48]:
dp['proc_text'] = dp['text'].map(clean_text)



In [49]:
dp['proc_text_lem'] = dp['proc_text'].map(text_lem_spacy)

In [50]:
dp.head()

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,...,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text,proc_text,proc_text_lem
0,This is the Weekly Tech Support Thread,All posts regarding tech support belong here.\...,3,0.81,26,AutoModerator,:Tech: Tech Support,0,False,False,...,self.DisneyPlus,1694020000.0,False,False,True,2023-09-06 17:04:21,0,This is the Weekly Tech Support Thread All pos...,weekly tech support thread posts regarding tec...,weekly tech support thread post regard tech su...
1,Ahsoka - Episodes 1 and 2 Megathread,Ahsoka is (almost) here!\n\nStart streaming th...,19,0.99,27,anonRedd,:Thread: Mega Thread,0,False,False,...,self.DisneyPlus,1692730000.0,False,False,True,2023-08-22 18:38:53,0,Ahsoka - Episodes 1 and 2 Megathread Ahsoka is...,ahsoka episodes megathread ahsoka almost start...,ahsoka episode megathread ahsoka almost start ...
2,For a while I was very reticent about watching...,"… I loved it. It’s not the original, it’s a re...",65,0.66,95,kindaweird0,:Like: Recommendation,0,False,False,...,i.redd.it,1694429000.0,False,False,False,2023-09-11 10:35:02,0,For a while I was very reticent about watching...,reticent watching movie yesterday thanks final...,reticent watch movie yesterday thank finally l...
3,You guys think that Hulu would be like Star fo...,"Personally, I am very excited about the merger...",2,1.0,0,thekirasquad,:Discussion: Discussion,0,False,False,...,self.DisneyPlus,1694483000.0,False,False,False,2023-09-12 01:36:00,0,You guys think that Hulu would be like Star fo...,guys think hulu would like star us merger one ...,guy think hulu would like star we merger one a...
4,Can Disney+ run 1080p or 4K on a PC now?,I keep finding threads that are 2-3 year olds ...,3,0.81,4,ArsenalThePhoenix,:Question: Question,0,False,False,...,self.DisneyPlus,1694462000.0,False,False,False,2023-09-11 19:59:45,0,Can Disney+ run 1080p or 4K on a PC now? I kee...,disney run p k pc keep finding threads year ol...,disney run p k pc keep find thread year old ba...

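The `proc_text` column above suggests that `clean_text` (defined earlier in the notebook) lower-cases the text, keeps only letters and drops stop words. The sketch below reproduces that behaviour for one title under those assumptions. It is not the notebook's actual function: `clean_text_sketch` and the tiny `STOP_WORDS` set are hypothetical, and the notebook imports NLTK's much larger stop-word list.

```python
import re

# Tiny illustrative stop-word list (an assumption, not NLTK's full English list)
STOP_WORDS = {'a', 'an', 'and', 'are', 'can', 'i', 'is', 'it',
              'now', 'on', 'or', 'the', 'this', 'to'}

def clean_text_sketch(text):
    """Hypothetical approximation of the notebook's clean_text helper."""
    text = re.sub(r'https?://\S+', ' ', text)  # drop URLs
    text = re.sub(r'[^a-zA-Z]', ' ', text)     # keep letters only
    tokens = text.lower().split()              # lower-case and tokenize on whitespace
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

print(clean_text_sketch('Can Disney+ run 1080p or 4K on a PC now?'))
```

Stripping digits explains fragments such as `p` and `k` in the processed text: they are what remains of `1080p` and `4K`.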

## 3. Combine the DataFrames

In [51]:
data = pd.concat([nf,dp], axis = 0)

In [52]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3711 entries, 0 to 1797
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   title                3711 non-null   object        
 1   selftext             3711 non-null   object        
 2   ups                  3711 non-null   int64         
 3   upvote_ratio         3711 non-null   float64       
 4   num_comments         3711 non-null   int64         
 5   author               3711 non-null   object        
 6   link_flair_text      3711 non-null   object        
 7   awards               3711 non-null   int64         
 8   is_original_content  3711 non-null   bool          
 9   is_video             3711 non-null   bool          
 10  post_type            3711 non-null   object        
 11  domain               3711 non-null   object        
 12  created_utc          3711 non-null   float64       
 13  pinned               3711 non-nul

In [53]:
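# displays a re-indexed copy; `data` itself keeps its original index unless the result is assigned back
data.reset_index(drop=True)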

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,...,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text,proc_text,proc_text_lem
0,/r/Netflix Discord Server,We are pleased to announce we have affiliated ...,427,0.97,182,N3DSdude,Announcement,3,False,False,...,self.netflix,1.619278e+09,False,False,True,2021-04-24 15:24:04,1,/r/Netflix Discord Server We are pleased to an...,r netflix discord server pleased announce affi...,r netflix discord server pleased announce affi...
1,Netflix Announces Plans to Crack Down on Passw...,> **Any post relating to this thread will now ...,661,0.94,3069,UniversallySecluded,Megathread,0,False,False,...,self.netflix,1.675331e+09,False,False,True,2023-02-02 09:35:27,1,Netflix Announces Plans to Crack Down on Passw...,netflix announces plans crack password sharing...,netflix announce plan crack password sharing m...
2,Any tips for de-morbiding your feed?,This is on behalf of a friend although my feed...,9,0.76,3,Scully__,others,0,False,False,...,self.netflix,1.694473e+09,False,False,False,2023-09-11 23:00:51,1,Any tips for de-morbiding your feed? This is o...,tips de morbiding feed behalf friend although ...,tip de morbide feed behalf friend although fee...
3,Why did netflix remove some profile pictures?,Was using one of the one piece profile picture...,11,0.64,6,ironshadowy,others,0,False,False,...,self.netflix,1.694448e+09,False,False,False,2023-09-11 15:57:38,1,Why did netflix remove some profile pictures? ...,netflix remove profile pictures using one one ...,netflix remove profile picture use one one pie...
4,One Piece Issue?,I saw episode one and two earlier and they wer...,2,1.00,0,ussjtrunksftw,others,0,False,False,...,self.netflix,1.694478e+09,False,False,False,2023-09-12 00:19:47,1,One Piece Issue? I saw episode one and two ear...,one piece issue saw episode one two earlier fi...,one piece issue see episode one two early fine...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3706,Disney should now release all of the original ...,Now that Disney is making the remarkable decis...,264,0.92,146,4KBlurayAvenger,:Discussion: Discussion,0,False,False,...,i.redd.it,1.692669e+09,False,False,False,2023-08-22 01:44:47,0,Disney should now release all of the original ...,disney release original movies series removed ...,disney release original movie series remove di...
3707,Disney CEO Bob Iger Is Open To Selling Hulu,Sharing it here just for the sake of some inte...,239,0.98,111,HumanOrAlien,:News: News Article,0,False,False,...,deadline.com,1.675961e+09,False,False,False,2023-02-09 16:39:17,0,Disney CEO Bob Iger Is Open To Selling Hulu Sh...,disney ceo bob iger open selling hulu sharing ...,disney ceo bob iger open sell hulu sharing sak...
3708,"I wish I could remove stuff from ""continue wat...",Its baffling because every other streaming ser...,228,0.98,45,Mythdon-,:Discussion: Discussion,0,False,False,...,self.DisneyPlus,1.682673e+09,False,False,False,2023-04-28 09:13:33,0,"I wish I could remove stuff from ""continue wat...",wish could remove stuff continue watching baff...,wish could remove stuff continue watch baffle ...
3709,Futurama Returns after almost 20 years!!,My favorite show probably ever has returned!! ...,211,0.90,54,UnrealityPsychosis,:New: New on Disney+!,0,False,False,...,i.redd.it,1.690290e+09,False,False,False,2023-07-25 13:07:26,0,Futurama Returns after almost 20 years!! My fa...,futurama returns almost years favorite show pr...,futurama return almost year favorite show prob...

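A note on indexing: `pd.concat` keeps each frame's original row labels, and `reset_index` returns a new frame rather than changing `data` in place. The toy example below (made-up frames `a` and `b`) sketches the difference and shows `ignore_index=True` as one way to renumber during concatenation.

```python
import pandas as pd

a = pd.DataFrame({'subreddit': [1, 1]})
b = pd.DataFrame({'subreddit': [0, 0, 0]})

stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())      # labels repeat: [0, 1, 0, 1, 2]

stacked.reset_index(drop=True)     # result is discarded, `stacked` is unchanged
print(stacked.index.tolist())      # still [0, 1, 0, 1, 2]

renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```

Repeated labels matter later on: label-based lookups such as `stacked.loc[0]` return more than one row.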

In [None]:
# checking for data imbalance

In [54]:
data['subreddit'].value_counts()

1    1913
0    1798
Name: subreddit, dtype: int64

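Class balance is often easier to read as proportions than as raw counts. The sketch below rebuilds a label series with the same counts as the output above (1913 and 1798) and uses `value_counts(normalize=True)`; the series `labels` is constructed for illustration only.

```python
import pandas as pd

# Rebuild a label column with the counts reported above
labels = pd.Series([1] * 1913 + [0] * 1798, name='subreddit')

# normalize=True returns each class's share of the rows instead of the count
shares = labels.value_counts(normalize=True)
print(shares.round(3))
```

With roughly 52% Netflix posts and 48% Disney+ posts, the two classes are close to balanced at this stage.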
In [89]:
data[data[['title','selftext']].duplicated(keep='first')].sort_values(by=['proc_text_lem'])

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,...,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text,proc_text,proc_text_lem
802,"ABC, Hulu, YouTube & Roku To Present Episodes ...","ABC: Saturday, June 24 at 8:00pm ET/PT – Episo...",68,0.93,8,UltimatePixarFan,:News: News Article,0,False,False,...,press.disneyplus.com,1.687419e+09,False,False,False,2023-06-22 07:22:35,0,"ABC, Hulu, YouTube & Roku To Present Episodes ...",abc hulu youtube roku present episodes critica...,abc hulu youtube roku present episode critical...
965,Abc on Uk Disney Plus,So i've just found out that 9-1-1 was cancelle...,6,0.76,10,AG171996,:Question: Question,0,False,False,...,self.DisneyPlus,1.682972e+09,False,False,False,2023-05-01 20:09:53,0,Abc on Uk Disney Plus So i've just found out t...,abc uk disney plus found cancelled fox moving ...,abc uk disney plus find cancel fox move channe...
496,Can't access DVD.COM,Anyone experienced this (screen below). I'm no...,6,0.67,4,00derek,others,0,False,False,...,self.netflix,1.693064e+09,False,False,False,2023-08-26 15:30:20,1,Can't access DVD.COM Anyone experienced this (...,access dvd com anyone experienced screen compu...,access dvd com anyone experience screen comput...
582,Access Thailand Netflix from the US.,I’m trying to access Thai Netflix from the US ...,0,0.50,4,Tokyo_Hardnutz,others,0,False,False,...,self.netflix,1.691860e+09,False,False,False,2023-08-12 17:12:36,1,Access Thailand Netflix from the US. I’m tryin...,access thailand netflix us trying access thai ...,access thailand netflix we try access thai net...
411,“Your account can’t be used on this account”,I have traveled to a different country and I c...,0,0.50,2,theonlysisterfister,others,0,False,False,...,self.netflix,1.694112e+09,False,False,False,2023-09-07 18:40:03,1,“Your account can’t be used on this account” I...,account used account traveled different countr...,account use account travel different country u...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1171,Zambian Girl Power On NETFLIX\n\nThere’s a gro...,\n\nThere’s a growing thirst for African films...,3,0.54,0,AfricanStream,others,0,False,True,...,v.redd.it,1.689972e+09,False,False,False,2023-07-21 20:39:22,1,Zambian Girl Power On NETFLIX\n\nThere’s a gro...,zambian girl power netflix growing thirst afri...,zambian girl power netflix grow thirst african...
688,Zombies reanimated question,Does anyone know if they’re doing a full narra...,5,0.86,0,UV-SkillCityProds,:Discussion: Discussion,0,False,False,...,self.DisneyPlus,1.691162e+09,False,False,False,2023-08-04 15:08:44,0,Zombies reanimated question Does anyone know i...,zombies reanimated question anyone know full n...,zombie reanimate question anyone know full nar...
560,Zombieverse Netflix Review,This show is absolutely horrible I'm debating ...,5,0.73,4,One-Office-8127,others,0,False,False,...,self.netflix,1.692245e+09,False,False,False,2023-08-17 03:56:57,1,Zombieverse Netflix Review This show is absolu...,zombieverse netflix review show absolutely hor...,zombieverse netflix review show absolutely hor...
682,"Why ""Zootopia"" become ""Zootropolis""",on Disney+?,7,0.74,8,ThenAdhesiveness1863,:Discussion: Discussion,0,False,False,...,self.DisneyPlus,1.691353e+09,False,False,False,2023-08-06 20:18:02,0,"Why ""Zootopia"" become ""Zootropolis"" on Disney+?",zootopia become zootropolis disney,zootopia become zootropolis disney


In [55]:
data.sort_values(by='author', ascending=False).head(20)

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,...,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text,proc_text,proc_text_lem
1327,i am one episode in black mirror and it entire...,black mirror has always been about the future ...,0,0.48,17,zzcool,others,0,False,False,...,self.netflix,1686971000.0,False,False,False,2023-06-17 03:04:37,1,i am one episode in black mirror and it entire...,one episode black mirror entirely lost made gr...,one episode black mirror entirely lose make gr...
956,try canceling you may realize you don't need it,try canceling Netflix you may realize you don'...,2,0.52,31,zzcool,others,0,False,False,...,self.netflix,1688586000.0,False,False,False,2023-07-05 19:42:42,1,try canceling you may realize you don't need i...,try canceling may realize need try canceling n...,try canceling may realize need try cancel netf...
977,been a subscriber since day 1 and i finally ca...,Netflix was an amazing innovation at the time ...,0,0.47,20,zzcool,others,0,False,False,...,self.netflix,1684006000.0,False,False,False,2023-05-13 19:24:47,1,been a subscriber since day 1 and i finally ca...,subscriber since day finally canceled netflix ...,subscriber since day finally cancel netflix am...
842,Zom 100 episode release days,Hello there! So I’ve been watching this show a...,5,0.86,4,zslayer89,others,0,False,False,...,self.netflix,1690083000.0,False,False,False,2023-07-23 03:24:35,1,Zom 100 episode release days Hello there! So I...,zom episode release days hello watching show b...,zom episode release day hello watch show blast...
651,Netflix randomly dropping in quality while bin...,"So I’m rewatching cobra kai, and everything i...",2,0.58,7,zslayer89,others,0,False,False,...,self.netflix,1691440000.0,False,False,False,2023-08-07 20:32:01,1,Netflix randomly dropping in quality while bin...,netflix randomly dropping quality binging rewa...,netflix randomly drop quality binging rewatche...
1564,Why is Netflix so lazy with thriller plots? (E...,I'd just started watching El Silencio and the ...,0,0.5,1,zeinterwebz,others,0,False,False,...,self.netflix,1685126000.0,False,False,False,2023-05-26 18:26:04,1,Why is Netflix so lazy with thriller plots? (E...,netflix lazy thriller plots el silencio starte...,netflix lazy thriller plot el silencio start w...
1488,"Someone down to watch ""don't look up"" now and ...",Hope I'll find someone to do this. We could do...,2,0.53,3,zayane_,others,0,False,False,...,self.netflix,1640533000.0,False,False,False,2021-12-26 15:41:34,1,"Someone down to watch ""don't look up"" now and ...",someone watch look talk via text would love ch...,someone watch look talk via text would love ch...
710,"Today, they finally added X-Men: Evolution to ...","Subtitles in Danish, Swedish, Norwegian and Fi...",55,0.92,4,zakawer2,:New: New on Disney+!,0,False,False,...,self.DisneyPlus,1690356000.0,False,False,False,2023-07-26 07:12:18,0,"Today, they finally added X-Men: Evolution to ...",today finally added x men evolution disney den...,today finally add x man evolution disney denma...
220,"Today, they finally added X-Men: Evolution to ...","Subtitles in Danish, Swedish, Norwegian and Fi...",53,0.92,4,zakawer2,:New: New on Disney+!,0,False,False,...,self.DisneyPlus,1690356000.0,False,False,False,2023-07-26 07:12:18,0,"Today, they finally added X-Men: Evolution to ...",today finally added x men evolution disney den...,today finally add x man evolution disney denma...
1350,Why does Netflix hate Sci-Fi? They've got one ...,"Really, what's going on at Netflix where they ...",4,0.54,22,zakats,others,0,False,False,...,self.netflix,1577336000.0,False,False,False,2019-12-26 04:58:47,1,Why does Netflix hate Sci-Fi? They've got one ...,netflix hate sci fi got one two good shows res...,netflix hate sci fi get one two good show rest...


Sorting by author reveals rows where the text columns are exactly the same and were posted by the same author (for example, the two "X-Men: Evolution" posts by `zakawer2`, which differ only in `ups`).

In [90]:
data2 = data.drop_duplicates(subset=['proc_text_lem'], keep='first')

In [91]:
data2 = data2.reset_index(drop=True)

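The toy example below sketches why the de-duplication is done on a text column rather than on whole rows. The frame is made up, loosely modelled on the repeated post seen above where only `ups` differs.

```python
import pandas as pd

toy = pd.DataFrame({
    'author': ['a', 'a', 'b'],
    'title': ['X-Men added', 'X-Men added', 'Help'],
    'ups': [55, 53, 2],
})

# Whole-row comparison misses the repeat because `ups` differs
print(toy.duplicated().sum())                             # 0

# Restricting the comparison to selected columns finds it
print(toy.duplicated(subset=['author', 'title']).sum())   # 1

deduped = (toy.drop_duplicates(subset=['author', 'title'], keep='first')
              .reset_index(drop=True))
print(len(deduped))                                       # 2
```

`keep='first'` retains the first occurrence of each repeated group, which here is the row with 55 upvotes.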
In [None]:
# checking for data imbalance after removing duplicated texts

In [92]:
data2['subreddit'].value_counts()

1    1690
0    1332
Name: subreddit, dtype: int64

In [106]:
data2[data2[['title','selftext']].duplicated(keep='first')].sort_values(by=['title'])

Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,...,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text,proc_text,proc_text_lem


## 4. Only text

Create a separate dataframe that keeps only the text columns, together with the `readable_time`, `subreddit` and `author` columns.

In [96]:
columns = ['readable_time','subreddit', 'author','title', 'selftext', 'text', 'proc_text', 'proc_text_lem']
reddit_text = data2[columns]
reddit_text

Unnamed: 0,readable_time,subreddit,author,title,selftext,text,proc_text,proc_text_lem
0,2021-04-24 15:24:04,1,N3DSdude,/r/Netflix Discord Server,We are pleased to announce we have affiliated ...,/r/Netflix Discord Server We are pleased to an...,r netflix discord server pleased announce affi...,r netflix discord server pleased announce affi...
1,2023-02-02 09:35:27,1,UniversallySecluded,Netflix Announces Plans to Crack Down on Passw...,> **Any post relating to this thread will now ...,Netflix Announces Plans to Crack Down on Passw...,netflix announces plans crack password sharing...,netflix announce plan crack password sharing m...
2,2023-09-11 23:00:51,1,Scully__,Any tips for de-morbiding your feed?,This is on behalf of a friend although my feed...,Any tips for de-morbiding your feed? This is o...,tips de morbiding feed behalf friend although ...,tip de morbide feed behalf friend although fee...
3,2023-09-11 15:57:38,1,ironshadowy,Why did netflix remove some profile pictures?,Was using one of the one piece profile picture...,Why did netflix remove some profile pictures? ...,netflix remove profile pictures using one one ...,netflix remove profile picture use one one pie...
4,2023-09-12 00:19:47,1,ussjtrunksftw,One Piece Issue?,I saw episode one and two earlier and they wer...,One Piece Issue? I saw episode one and two ear...,one piece issue saw episode one two earlier fi...,one piece issue see episode one two early fine...
...,...,...,...,...,...,...,...,...
3017,2019-11-12 10:48:09,0,Lycanvenom,To Everyone With Samsung TVs,"I wanted to drop it in the launch thread, but ...",To Everyone With Samsung TVs I wanted to drop ...,everyone samsung tvs wanted drop launch thread...,everyone samsung tvs want drop launch thread c...
3018,2019-11-12 12:13:51,0,anakinfan8,All of the first six Star Wars films have the ...,The 2015 digital versions of the films/the new...,All of the first six Star Wars films have the ...,first six star wars films th century fox logo ...,first six star war film th century fox logo be...
3019,2023-01-26 20:04:49,0,sickfuck3000,'Percy Jackson and the Olympians' Casts Lance ...,https://variety.com/2023/tv/news/percy-jackson...,'Percy Jackson and the Olympians' Casts Lance ...,percy jackson olympians casts lance reddick ze...,percy jackson olympians casts lance reddick ze...
3020,2020-03-07 14:13:36,0,makenzie71,My tv is only 720p but disney+ is streaming at...,disney+ is streaming at nearly 7gb an hour to ...,My tv is only 720p but disney+ is streaming at...,tv p disney streaming k disney streaming nearl...,tv p disney stream k disney stream nearly gb h...


In [97]:
def char_word_count(df):
    
    df['char_length'] = [len(s) for s in df['proc_text']]
    df['word_count'] = [len(s.split()) for s in df['proc_text']]

In [98]:
char_word_count(reddit_text)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['char_length'] = [len(s) for s in df['proc_text']]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['word_count'] = [len(s.split()) for s in df['proc_text']]

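The `SettingWithCopyWarning` above appears because `reddit_text` was created by selecting columns from another DataFrame and then had new columns assigned to it. One common way to avoid it, sketched below on a made-up frame, is to take an explicit `.copy()` of the selection. The sketch also uses the vectorized `.str` accessors as an alternative to the list comprehensions; the names `source` and `subset` are hypothetical.

```python
import pandas as pd

source = pd.DataFrame({'proc_text': ['one piece issue', 'help'],
                       'ups': [2, 5]})

# .copy() makes the selection an independent DataFrame, so adding columns
# to it is unambiguous and pandas has no reason to warn
subset = source[['proc_text']].copy()

subset['char_length'] = subset['proc_text'].str.len()
subset['word_count'] = subset['proc_text'].str.split().str.len()
print(subset)
```

The original `source` frame is left untouched, which is usually what is wanted when building a text-only view for analysis.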

In [99]:
reddit_text.head()

Unnamed: 0,readable_time,subreddit,author,title,selftext,text,proc_text,proc_text_lem,char_length,word_count
0,2021-04-24 15:24:04,1,N3DSdude,/r/Netflix Discord Server,We are pleased to announce we have affiliated ...,/r/Netflix Discord Server We are pleased to an...,r netflix discord server pleased announce affi...,r netflix discord server pleased announce affi...,178,24
1,2023-02-02 09:35:27,1,UniversallySecluded,Netflix Announces Plans to Crack Down on Passw...,> **Any post relating to this thread will now ...,Netflix Announces Plans to Crack Down on Passw...,netflix announces plans crack password sharing...,netflix announce plan crack password sharing m...,576,78
2,2023-09-11 23:00:51,1,Scully__,Any tips for de-morbiding your feed?,This is on behalf of a friend although my feed...,Any tips for de-morbiding your feed? This is o...,tips de morbiding feed behalf friend although ...,tip de morbide feed behalf friend although fee...,301,44
3,2023-09-11 15:57:38,1,ironshadowy,Why did netflix remove some profile pictures?,Was using one of the one piece profile picture...,Why did netflix remove some profile pictures? ...,netflix remove profile pictures using one one ...,netflix remove profile picture use one one pie...,174,24
4,2023-09-12 00:19:47,1,ussjtrunksftw,One Piece Issue?,I saw episode one and two earlier and they wer...,One Piece Issue? I saw episode one and two ear...,one piece issue saw episode one two earlier fi...,one piece issue see episode one two early fine...,217,34


In [100]:
reddit_text.sort_values(by='word_count', ascending=False)[['subreddit', 'proc_text_lem','word_count']].head(20)

Unnamed: 0,subreddit,proc_text_lem,word_count
1900,0,rank disney film come disney world early year ...,1191
1594,1,list netflix original tv show imdb rating dram...,1038
416,1,glamorous netflix review partner watch entire ...,987
2087,0,disney concept suite life j outline j detweile...,899
2185,0,dr doom disney series thought input suggestion...,776
2246,0,thought snow white seven dwarf lately binge wa...,774
2728,0,mandalorian disney plus series review new high...,768
2119,0,thought sleep beauty overview walt disney exci...,673
1775,0,thought big hero overview disney acquire marve...,636
2308,0,live netherlands terrible experience disney ap...,609


Saving the data as csv files.

In [107]:
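# save cleaned data to csv
# note: `data` is the combined frame before de-duplication (3711 rows);
# `reddit_text` is built from the de-duplicated `data2`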
data.to_csv('datasets/cleaned_data.csv', index=False)
reddit_text.to_csv('datasets/cleaned_data_text_only.csv', index=False)

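One thing to keep in mind when these files are read back in a later notebook: CSV stores `readable_time` as plain text, so it is not a datetime column after a default `read_csv`. The sketch below round-trips a made-up one-row frame through an in-memory buffer to show the effect of `parse_dates`; the in-memory buffer stands in for the files on disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({'readable_time': pd.to_datetime(['2023-09-06 17:04:21']),
                    'subreddit': [0]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                                   # dates come back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['readable_time'])   # dates restored

print(plain['readable_time'].dtype)
print(parsed['readable_time'].dtype)
```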
In [102]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3711 entries, 0 to 1797
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   title                3711 non-null   object        
 1   selftext             3711 non-null   object        
 2   ups                  3711 non-null   int64         
 3   upvote_ratio         3711 non-null   float64       
 4   num_comments         3711 non-null   int64         
 5   author               3711 non-null   object        
 6   link_flair_text      3711 non-null   object        
 7   awards               3711 non-null   int64         
 8   is_original_content  3711 non-null   bool          
 9   is_video             3711 non-null   bool          
 10  post_type            3711 non-null   object        
 11  domain               3711 non-null   object        
 12  created_utc          3711 non-null   float64       
 13  pinned               3711 non-nul

In [104]:
data2.describe(include='all')


Unnamed: 0,title,selftext,ups,upvote_ratio,num_comments,author,link_flair_text,awards,is_original_content,is_video,...,domain,created_utc,pinned,locked,stickied,readable_time,subreddit,text,proc_text,proc_text_lem
count,3022,3022,3022.0,3022.0,3022.0,3022.0,3022,3022.0,3022,3022,...,3022,3022.0,3022,3022,3022,3022,3022.0,3022,3022,3022
unique,3006,3018,,,,2520.0,76,,1,2,...,28,,1,2,2,3022,,3022,3022,3022
top,Help,"For a thriller, it didn’t possess many suspens...",,,,,others,,False,False,...,self.netflix,,False,False,False,2021-04-24 15:24:04,,/r/Netflix Discord Server We are pleased to an...,r netflix discord server pleased announce affi...,r netflix discord server pleased announce affi...
freq,5,2,,,,232.0,1669,,3022,3013,...,1657,,3022,3005,3018,1,,1,1,1
first,,,,,,,,,,,...,,,,,,2011-06-23 20:49:52,,,,
last,,,,,,,,,,,...,,,,,,2023-09-12 01:36:00,,,,
mean,,,200.746856,0.642283,46.325943,,,0.081403,,,...,,1634395000.0,,,,,0.559232,,,
std,,,1395.59064,0.197625,225.065141,,,0.479288,,,...,,73189600.0,,,,,0.496561,,,
min,,,0.0,0.06,0.0,,,0.0,,,...,,1308862000.0,,,,,0.0,,,
25%,,,0.0,0.5,3.0,,,0.0,,,...,,1590358000.0,,,,,0.0,,,


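Despite removing the duplicated texts, the summary table still shows repeated values: `title` has 3006 unique values and `selftext` has 3018 across 3022 rows. These are repeats within a single column (for example, the title "Help" appears 5 times). The check in `In [106]` found no rows that share both `title` and `selftext`, and `proc_text_lem` is unique for every row.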
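
A per-column unique count below the row count does not by itself mean that whole rows are duplicated. The made-up frame below sketches the distinction using `nunique` and `duplicated`.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Help', 'Help', 'One Piece Issue?'],
    'selftext': ['App crashes on my TV', 'Cannot log in',
                 'Episode two will not play'],
})

# A title can repeat while each post body is different
print(toy.nunique())

# No row repeats the same (title, selftext) pair
print(toy.duplicated(subset=['title', 'selftext']).sum())
```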

#### Next part:

Next, we will proceed to analyse the data.

[Part II - Exploratory Data Analysis](EDA.ipynb)