<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

## 04 Model Demonstration

> SG-DSI-41 Group 01: Daryl Chia, Germaine Choo, Sharifah Nurulhuda, Tan Wei Chiong

---

## 01 Import Libraries

In [1]:
# Import libraries, modules, and functions:
import os
import joblib
import pandas as pd
import re

from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

## 02 Load & Check Data

### 02.01 Load & Check Data

In [2]:
# Load data:
comments = pd.read_csv('../data/sample_data.csv')

# Increase data frame column width:
pd.set_option('display.max_colwidth', 400)

# Check data:
print(comments.shape)
comments.head()

(10, 1)


Unnamed: 0,comment_text
0,"You can bounce back. 7-8 years ago I was a homeless drug addict who was done trying also and it took jail to get me back on track. Now Im happily married, have a great job making great money. I had to hit rock bottom and experience homelessness to get my shit together. Being homeless is probably the worst experience I have ever gone through and it made me realize things can always get worse. W..."
1,What kids? I can't afford to have kids.
2,"My mom thinks you just string together the first letters and everyone just knows what you meant. She'll be on her way to my house, and I get a text that says something like IOMWBFTDDIHTSTPADYWAS?C?\n My mom would have you believe that everyone knows what she meant. I'll help y'all on this first one because you're new. I'm on my way but fuck these damn diuretics I have to stop to pee again, do ..."
3,"Lots of teen angst going on in here. I like following younger genz stuff to stay connected but it’s the dooming is definitely a pattern I see on Reddit and TikTok.\n I genuinely think younger genZ lost a lot more to the pandemic than we did. It seems like some never recovered and it the negativity just continues. I see so many people say they don’t have friends, don’t go out, and only do solo ..."
4,"It’s a Reddit problem. Take all the posts you see with a grain of salt. A ton of the people on the entire website are chronically online, especially the younger ones"


### 02.02 Check Data Information & for Null Values

In [3]:
# Check data info:
comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   comment_text  10 non-null     object
dtypes: object(1)
memory usage: 212.0+ bytes


In [4]:
# Check data for null values:
comments.isnull().sum()

comment_text    0
dtype: int64

In [5]:
# Check data for null values in column 'comment_text':
comments[comments['comment_text'].isnull()]

Unnamed: 0,comment_text


### 02.03 Check for Substrings to Clean

In [6]:
# Check for data in column 'comment_text' that contain substring '[deleted]':
print(comments['comment_text'].str.contains(r'\[deleted\]').value_counts())
comments[comments['comment_text'].str.contains(r'\[deleted\]') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [7]:
# Check for data in column 'comment_text' that contain substring '[removed]':
print(comments['comment_text'].str.contains(r'\[removed\]').value_counts())
comments[comments['comment_text'].str.contains(r'\[removed\]') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [8]:
# Check for data in column 'comment_text' that contain bot comments:
print(comments['comment_text'].str.contains(r'I am a bot\.|I\'m a bot\.').value_counts())
comments[comments['comment_text'].str.contains(r'I am a bot\.|I\'m a bot\.') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text
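
In a regular expression an unescaped `.` matches any character, so whether the full stop in a bot-signature pattern is escaped changes which comments get flagged. A minimal sketch using only the standard `re` module shows the difference; the sample strings are made up for illustration and are not from the data set:

```python
import re

# Hypothetical sample comments, made up for illustration.
samples = [
    "I am a bot. Beep boop.",
    "I am a bother to my friends",
]

loose = re.compile(r"I am a bot.|I'm a bot.")     # '.' matches any character
strict = re.compile(r"I am a bot\.|I'm a bot\.")  # '\.' matches a literal full stop

loose_hits = [bool(loose.search(s)) for s in samples]
strict_hits = [bool(strict.search(s)) for s in samples]

print(loose_hits)   # [True, True]
print(strict_hits)  # [True, False]
```

The loose pattern also flags "I am a bother", which is an ordinary comment rather than a bot signature.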


In [9]:
# Check for data in column 'comment_text' that contain line or tab characters:
print(comments['comment_text'].str.contains(r'\r|\t|\n').value_counts())
comments[comments['comment_text'].str.contains(r'\r|\t|\n') == True].head()

comment_text
False    8
True     2
Name: count, dtype: int64


Unnamed: 0,comment_text
2,"My mom thinks you just string together the first letters and everyone just knows what you meant. She'll be on her way to my house, and I get a text that says something like IOMWBFTDDIHTSTPADYWAS?C?\n My mom would have you believe that everyone knows what she meant. I'll help y'all on this first one because you're new. I'm on my way but fuck these damn diuretics I have to stop to pee again, do ..."
3,"Lots of teen angst going on in here. I like following younger genz stuff to stay connected but it’s the dooming is definitely a pattern I see on Reddit and TikTok.\n I genuinely think younger genZ lost a lot more to the pandemic than we did. It seems like some never recovered and it the negativity just continues. I see so many people say they don’t have friends, don’t go out, and only do solo ..."


In [10]:
# Check for data in column 'comment_text' that contain substring 'This post was mass deleted and anonymized with [Redact](https://redact.dev)':
print(comments['comment_text'].str.contains(r'This post was mass deleted and anonymized with \[Redact\]\(https://redact.dev\)').value_counts())
comments[comments['comment_text'].str.contains(r'This post was mass deleted and anonymized with \[Redact\]\(https://redact.dev\)') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [11]:
# Check for data in column 'comment_text' that contain website urls:
print(comments['comment_text'].str.contains(r'http[s]?\://\S+').value_counts())
comments[comments['comment_text'].str.contains(r'http[s]?\://\S+') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [12]:
# Check for data in column 'comment_text' that contain HTML hexadecimal codes:
print(comments['comment_text'].str.contains(r'\&\#x\S+\;').value_counts())
comments[comments['comment_text'].str.contains(r'\&\#x\S+\;') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [13]:
# Check for data in column 'comment_text' that contain zero-width joiner characters (used in emoji sequences):
print(comments['comment_text'].str.contains(r'\u200d').value_counts())
comments[comments['comment_text'].str.contains(r'\u200d') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [14]:
# Check for data in column 'comment_text' that contain gif image codes:
print(comments['comment_text'].str.contains(r'!\[gif\]\(giphy\|\w+(?:\|downsized)?\)').value_counts())
comments[comments['comment_text'].str.contains(r'!\[gif\]\(giphy\|\w+(?:\|downsized)?\)') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [15]:
# Check for data in column 'comment_text' that contain reddit subreddit references:
print(comments['comment_text'].str.contains(r'\s[/]?r/\S+').value_counts())
comments[comments['comment_text'].str.contains(r'\s[/]?r/\S+') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [16]:
# Check for data in column 'comment_text' that contain reddit username references:
print(comments['comment_text'].str.contains(r'\s[/]?u/\S+').value_counts())
comments[comments['comment_text'].str.contains(r'\s[/]?u/\S+') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [17]:
# Check for data in column 'comment_text' that contain hashtags:
print(comments['comment_text'].str.contains(r'#\S+').value_counts())
comments[comments['comment_text'].str.contains(r'#\S+') == True].head()

comment_text
False    10
Name: count, dtype: int64


Unnamed: 0,comment_text


In [18]:
# Check for data in column 'comment_text' that contain numbers, including shorthand decade references:
print(comments['comment_text'].str.contains(r'[0-9]+[s]?').value_counts())
comments[comments['comment_text'].str.contains(r'[0-9]+[s]?') == True].head()

comment_text
False    7
True     3
Name: count, dtype: int64


Unnamed: 0,comment_text
0,"You can bounce back. 7-8 years ago I was a homeless drug addict who was done trying also and it took jail to get me back on track. Now Im happily married, have a great job making great money. I had to hit rock bottom and experience homelessness to get my shit together. Being homeless is probably the worst experience I have ever gone through and it made me realize things can always get worse. W..."
5,"Yeah lmao it’s funny to watch. My ass moved out at 17 (essentially 16) and been working 60 hour weeks, often across multiple jobs to pay bills, while doing highschool online trying to graduate. And then I go onto reddit and see a bunch of 13-15 year olds talking about how hard it’ll be in the future, even though I guarantee they have their parents full support in life. 30% of Gen Z (adults) ar..."
7,"I was skeptical, but took a chance on the Black Friday deal. Haven’t used another cup since it arrived. Everything I drink from it tastes immaculate. My 3rd and 4th cups are in the mail. The Hydration+ and Tea Infusers are great too. Couldn’t recommend more highly."


In [19]:
# Check for data in column 'comment_text' that contain bracket characters:
print(comments['comment_text'].str.contains(r'\(|\)|\[|\]').value_counts())
comments[comments['comment_text'].str.contains(r'\(|\)|\[|\]') == True].head()

comment_text
False    9
True     1
Name: count, dtype: int64


Unnamed: 0,comment_text
5,"Yeah lmao it’s funny to watch. My ass moved out at 17 (essentially 16) and been working 60 hour weeks, often across multiple jobs to pay bills, while doing highschool online trying to graduate. And then I go onto reddit and see a bunch of 13-15 year olds talking about how hard it’ll be in the future, even though I guarantee they have their parents full support in life. 30% of Gen Z (adults) ar..."


In [20]:
# Check for data in column 'comment_text' that contain punctuations, excluding apostrophes:
print(comments['comment_text'].str.contains(r'[^a-zA-Z0-9\'\’\′\ʼ\s]').value_counts())
comments[comments['comment_text'].str.contains(r'[^a-zA-Z0-9\'\’\′\ʼ\s]') == True].head()

comment_text
True    10
Name: count, dtype: int64


Unnamed: 0,comment_text
0,"You can bounce back. 7-8 years ago I was a homeless drug addict who was done trying also and it took jail to get me back on track. Now Im happily married, have a great job making great money. I had to hit rock bottom and experience homelessness to get my shit together. Being homeless is probably the worst experience I have ever gone through and it made me realize things can always get worse. W..."
1,What kids? I can't afford to have kids.
2,"My mom thinks you just string together the first letters and everyone just knows what you meant. She'll be on her way to my house, and I get a text that says something like IOMWBFTDDIHTSTPADYWAS?C?\n My mom would have you believe that everyone knows what she meant. I'll help y'all on this first one because you're new. I'm on my way but fuck these damn diuretics I have to stop to pee again, do ..."
3,"Lots of teen angst going on in here. I like following younger genz stuff to stay connected but it’s the dooming is definitely a pattern I see on Reddit and TikTok.\n I genuinely think younger genZ lost a lot more to the pandemic than we did. It seems like some never recovered and it the negativity just continues. I see so many people say they don’t have friends, don’t go out, and only do solo ..."
4,"It’s a Reddit problem. Take all the posts you see with a grain of salt. A ton of the people on the entire website are chronically online, especially the younger ones"


In [21]:
# Check for data in column 'comment_text' that contain extra whitespaces:
print(comments['comment_text'].str.contains(r'\s{2,}').value_counts())
comments[comments['comment_text'].str.contains(r'\s{2,}') == True].head()

comment_text
False    8
True     2
Name: count, dtype: int64


Unnamed: 0,comment_text
2,"My mom thinks you just string together the first letters and everyone just knows what you meant. She'll be on her way to my house, and I get a text that says something like IOMWBFTDDIHTSTPADYWAS?C?\n My mom would have you believe that everyone knows what she meant. I'll help y'all on this first one because you're new. I'm on my way but fuck these damn diuretics I have to stop to pee again, do ..."
3,"Lots of teen angst going on in here. I like following younger genz stuff to stay connected but it’s the dooming is definitely a pattern I see on Reddit and TikTok.\n I genuinely think younger genZ lost a lot more to the pandemic than we did. It seems like some never recovered and it the negativity just continues. I see so many people say they don’t have friends, don’t go out, and only do solo ..."
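
The checks in this section all follow the same shape, so they could also be run in one loop. The sketch below is one possible way to do that, not the notebook's method: the `sample` series and the `patterns` dictionary labels are hypothetical, and only a subset of the patterns is included.

```python
import pandas as pd

# Hypothetical stand-in for the 'comment_text' column.
sample = pd.Series(
    ["Check r/GenZ for more", "Moved out at 17 (essentially 16)", "No issues here"],
    name="comment_text",
)

# Label -> regex; a subset of the patterns checked above (labels are made up).
patterns = {
    "deleted": r"\[deleted\]",
    "line_or_tab": r"\r|\t|\n",
    "url": r"http[s]?\://\S+",
    "subreddit": r"\s[/]?r/\S+",
    "number": r"[0-9]+[s]?",
    "bracket": r"\(|\)|\[|\]",
}

# Count the rows that match each pattern.
match_counts = {
    label: int(sample.str.contains(pattern, regex=True).sum())
    for label, pattern in patterns.items()
}
print(match_counts)
```

A dictionary of counts makes it easy to see at a glance which cleaning steps will actually change the data.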


## 03 Clean Data

### 03.01 Drop Relevant Rows

In [22]:
# Count data rows:
comments.shape[0]

10

In [23]:
# Drop rows with null values in column 'comment_text':
comments = comments[comments['comment_text'].notnull()]
comments.shape[0]

10

In [24]:
# Drop rows with substring '[deleted]' in column 'comment_text':
comments = comments[comments['comment_text'].str.contains(r'\[deleted\]') == False]
comments.shape[0]

10

In [25]:
# Drop rows with substring '[removed]' in column 'comment_text':
comments = comments[comments['comment_text'].str.contains(r'\[removed\]') == False]
comments.shape[0]

10

In [26]:
# Drop rows with substring 'I am a bot.' or 'I'm a bot.' in column 'comment_text':
comments = comments[comments['comment_text'].str.contains(r'I am a bot\.|I\'m a bot\.') == False]
comments.shape[0]

10

### 03.02 Replace Substrings for Relevant Rows

In [27]:
# Replace line or tab characters in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'\r|\t|\n', repl=' ', regex=True)

In [28]:
# Replace redaction substrings in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'This post was mass deleted and anonymized with \[Redact\]\(https://redact.dev\)', repl=' ', regex=True)

In [29]:
# Replace website urls in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'http[s]?\://\S+', repl=' ', regex=True)

In [30]:
# Replace html hexadecimal codes in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'\&\#x[A-F0-9]+\;', repl=' ', regex=True)

In [31]:
# Replace zero-width joiner characters (used in emoji sequences) in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'\u200d', repl=' ', regex=True)
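
The zero-width joiner is a single invisible character (U+200D), which is different from the five visible characters `u200d`. The following sketch, using only the standard `re` module and made-up sample strings, illustrates which pattern matches which text:

```python
import re

ZWJ = "\u200d"  # zero-width joiner, used to glue emoji sequences together

# Hypothetical samples: one with a real joiner, one with the literal text 'u200d'.
with_joiner = f"family: \U0001F468{ZWJ}\U0001F469"
with_literal = "the code u200d appears here"

joiner_pattern = re.compile(r"\u200d")  # re resolves \u200d to the joiner character
literal_pattern = re.compile("u200d")   # matches the five literal characters

results = {
    "joiner_in_joiner_text": bool(joiner_pattern.search(with_joiner)),
    "joiner_in_literal_text": bool(joiner_pattern.search(with_literal)),
    "literal_in_joiner_text": bool(literal_pattern.search(with_joiner)),
    "literal_in_literal_text": bool(literal_pattern.search(with_literal)),
}
print(results)
```

Only the pattern with the backslash finds the joiner inside an emoji sequence.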

In [32]:
# Replace .gif image codes in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'!\[gif\]\(giphy\|\w+(\|downsized)?\)', repl=' ', regex=True)

In [33]:
# Replace reddit subreddit references in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'\s[/]?r/\S+', repl=' ', regex=True)

In [34]:
# Replace reddit username references in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'\s[/]?u/\S+', repl=' ', regex=True)

In [35]:
# Replace numbers, including decade references, in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat='[0-9]+[s]?', repl=' ', regex=True)

In [36]:
# Replace bracket characters in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'\(|\)|\[|\]', repl=' ', regex=True)

In [37]:
# Replace punctuation, except apostrophes, in column 'comment_text' with ' ':
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'[^a-zA-Z0-9\'\’\′\ʼ\s]', repl=' ', regex=True)

### 03.03 Others

In [38]:
# Standardise all apostrophe variants (’, ′, ʼ) to the plain ASCII apostrophe ('):
comments['comment_text'] = comments['comment_text'].str.replace(pat=r'[\’\′\ʼ]', repl="'", regex=True)

In [39]:
# Remove leading and trailing whitespaces in column 'comment_text':
comments['comment_text'] = comments['comment_text'].str.strip()
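
The replacement steps above can also be read as one ordered pipeline applied to a single string. The sketch below is an illustration with the standard `re` module only, not the notebook's own code: `SUBSTITUTIONS` and `clean_comment` are hypothetical names, the sample sentence is made up, and the zero-width joiner is written here as `\u200d` so that it targets the joiner character itself.

```python
import re

# Ordered (pattern, replacement) pairs mirroring the cleaning cells above.
SUBSTITUTIONS = [
    (r"\r|\t|\n", " "),                                   # line and tab characters
    (r"This post was mass deleted and anonymized with "
     r"\[Redact\]\(https://redact.dev\)", " "),           # redaction notices
    (r"http[s]?\://\S+", " "),                            # website urls
    (r"\&\#x[A-F0-9]+\;", " "),                           # HTML hexadecimal codes
    (r"\u200d", " "),                                     # zero-width joiner
    (r"!\[gif\]\(giphy\|\w+(?:\|downsized)?\)", " "),     # gif image codes
    (r"\s[/]?r/\S+", " "),                                # subreddit references
    (r"\s[/]?u/\S+", " "),                                # username references
    (r"[0-9]+[s]?", " "),                                 # numbers and decades
    (r"\(|\)|\[|\]", " "),                                # bracket characters
    (r"[^a-zA-Z0-9\'\’\′\ʼ\s]", " "),                     # other punctuation
    (r"[\’\′\ʼ]", "'"),                                   # apostrophe variants
]


def clean_comment(text):
    """Apply every substitution in order, then strip outer whitespace."""
    for pattern, repl in SUBSTITUTIONS:
        text = re.sub(pattern, repl, text)
    return text.strip()


cleaned = clean_comment("It’s on r/GenZ (since the 90s): https://example.com\nSee you!")
print(cleaned.split())  # ["It's", 'on', 'since', 'the', 'See', 'you']
```

Order matters here: URLs and subreddit references are removed before the general punctuation step, which would otherwise break them into fragments that the earlier patterns no longer match. Runs of spaces are left inside the cleaned string, as in the cells above.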

In [40]:
# Check for data in column 'comment_text' that are empty strings:
print(comments[comments['comment_text'] == '']['comment_text'].value_counts())
comments[comments['comment_text'] == ''].head()

Series([], Name: count, dtype: int64)


Unnamed: 0,comment_text


In [41]:
# Drop rows with empty strings in column 'comment_text':
comments = comments[comments['comment_text'] != '']
comments.shape[0]

10

In [42]:
# Check data:
print(comments.shape)
comments.head()

(10, 1)


Unnamed: 0,comment_text
0,You can bounce back years ago I was a homeless drug addict who was done trying also and it took jail to get me back on track Now Im happily married have a great job making great money I had to hit rock bottom and experience homelessness to get my shit together Being homeless is probably the worst experience I have ever gone through and it made me realize things can always get worse W...
1,What kids I can't afford to have kids
2,My mom thinks you just string together the first letters and everyone just knows what you meant She'll be on her way to my house and I get a text that says something like IOMWBFTDDIHTSTPADYWAS C My mom would have you believe that everyone knows what she meant I'll help y'all on this first one because you're new I'm on my way but fuck these damn diuretics I have to stop to pee again do y...
3,Lots of teen angst going on in here I like following younger genz stuff to stay connected but it's the dooming is definitely a pattern I see on Reddit and TikTok I genuinely think younger genZ lost a lot more to the pandemic than we did It seems like some never recovered and it the negativity just continues I see so many people say they don't have friends don't go out and only do solo h...
4,It's a Reddit problem Take all the posts you see with a grain of salt A ton of the people on the entire website are chronically online especially the younger ones


In [43]:
# Reset data frame index:
comments.reset_index(drop=True, inplace=True)

## 04 Preprocess Data

### 04.01 Stem Comment Words

In [44]:
# Define stemmer function:
def stemmer(row):
    '''Applies `PorterStemmer()` to each token in a row's 'comment_text'
    and returns the stems as one space-separated string (with a trailing space).
    '''
    # initializes PorterStemmer object
    stem = PorterStemmer()

    # extracts document
    document = row['comment_text']

    # splits document up into tokens using RegexpTokenizer
    re_tokenizer = RegexpTokenizer(pattern=r"(?u)\b(?:\w\w+|i|I)(?:[\'\'\′\ʼ](?:s|t|m|re|ve|d|ll))?\b")
    lst_of_tokens = re_tokenizer.tokenize(document)

    # initialize return value
    stemmed_document = ''

    # applies PorterStemmer to the list of tokens
    for token in lst_of_tokens:
        stemmed_document += f'{stem.stem(token)} '
    
    return stemmed_document
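
To see which tokens the pattern in `stemmer` keeps, it can be run directly with the standard `re` module. This is a sketch on a made-up sentence, and it assumes that `re.findall` returns the same tokens as `RegexpTokenizer.tokenize` for a pattern without capturing groups:

```python
import re

# Same pattern as in `stemmer` above, copied verbatim.
TOKEN_PATTERN = r"(?u)\b(?:\w\w+|i|I)(?:[\'\'\′\ʼ](?:s|t|m|re|ve|d|ll))?\b"

sample = "I'm sure y'all can't afford a cup"  # hypothetical sentence
tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)  # ["I'm", 'sure', 'all', "can't", 'afford', 'cup']
```

Single-character words other than "i" and "I" are dropped, which is why "a" disappears and "y'all" is reduced to "all". Contractions with the listed suffixes, such as "I'm" and "can't", stay intact as one token.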

In [45]:
# Stem comments:
comments['comment_text'] = comments.apply(stemmer, axis=1)
comments

Unnamed: 0,comment_text
0,you can bounc back year ago i wa homeless drug addict who wa done tri also and it took jail to get me back on track now im happili marri have great job make great money i had to hit rock bottom and experi homeless to get my shit togeth be homeless is probabl the worst experi i have ever gone through and it made me realiz thing can alway get wors whatev you'r go thru it help to have support net...
1,what kid i can't afford to have kid
2,my mom think you just string togeth the first letter and everyon just know what you meant she'll be on her way to my hous and i get text that say someth like iomwbftddihtstpadywa my mom would have you believ that everyon know what she meant i'll help all on thi first one becaus you'r new i'm on my way but fuck these damn diuret i have to stop to pee again do you want snicker coke
3,lot of teen angst go on in here i like follow younger genz stuff to stay connect but it' the doom is definit pattern i see on reddit and tiktok i genuin think younger genz lost lot more to the pandem than we did it seem like some never recov and it the neg just continu i see so mani peopl say they don't have friend don't go out and onli do solo hobbi etc it genuin concern i wa extrem social in...
4,it' reddit problem take all the post you see with grain of salt ton of the peopl on the entir websit are chronic onlin especi the younger one
5,yeah lmao it' funni to watch my ass move out at essenti and been work hour week often across multipl job to pay bill while do highschool onlin tri to graduat and then i go onto reddit and see bunch of year old talk about how hard it'll be in the futur even though i guarante they have their parent full support in life of gen adult are homeown the most of ani gener other than boomer by their mil...
6,i just bought one i love for way cheaper than your i like the look of your but love the look of my new tumbler better so sorri your ridicul price lose and i win
7,i wa skeptic but took chanc on the black friday deal haven't use anoth cup sinc it arriv everyth i drink from it tast immacul my rd and th cup are in the mail the hydrat and tea infus are great too couldn't recommend more highli
8,if it wa cheaper your friend would be mad you didn't spend more money also we'r the first cup that ever fit in hand and that took lot of money to develop
9,if we made them cheaper they wouldn't feel like expens gift and they wouldn't fit so perfectli in our hand what not sell point if you'r tri to justifi the cost qualiti in an ad


### 04.02 Word Vectorize Comments

In [46]:
# Import fitted 'TF-IDF' word vectorizer:
tvec = joblib.load("./tvec.pkl")

In [47]:
# Transform stemmed comments into TF-IDF features using the fitted vectorizer:
comments_vec = pd.DataFrame(tvec.transform(comments['comment_text']).toarray(),
                            columns=tvec.get_feature_names_out())
comments_vec.shape

(10, 50000)

## 05 Predict Classifier Group

### 05.01 Generation Predictions

In [48]:
# Import fitted logistic regression model:
# Ridge (l2), α=1:
model_lr = joblib.load("./model_lr.pkl")

In [49]:
# Generate classifier predictions:
model_lr.predict(comments_vec)

array([0, 1, 0, 1, 1, 1, 0, 0, 0, 0])

In [50]:
# Create results column in data frame 'comments':
comments['generation'] = model_lr.predict(comments_vec)

In [51]:
# Map binary predictions in column 'generation' to generation labels (1: 'Z', 0: 'Y'):
comments['generation'] = comments['generation'].map({1: 'Z', 0: 'Y'})
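
`Series.map` with a dictionary returns `NaN` for any value that has no key in the dictionary, so a quick check after mapping can confirm that every prediction received a label. A minimal sketch with made-up predictions (the `predictions` series is hypothetical and does not come from the model):

```python
import pandas as pd

# Hypothetical binary predictions standing in for the model output.
predictions = pd.Series([0, 1, 0, 1])
label_map = {1: 'Z', 0: 'Y'}

labels = predictions.map(label_map)
unmapped = int(labels.isna().sum())  # values that had no key in label_map

print(labels.tolist())  # ['Y', 'Z', 'Y', 'Z']
print(unmapped)         # 0
```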

In [52]:
# Check data:
print(comments.shape)
comments.head()

(10, 2)


Unnamed: 0,comment_text,generation
0,you can bounc back year ago i wa homeless drug addict who wa done tri also and it took jail to get me back on track now im happili marri have great job make great money i had to hit rock bottom and experi homeless to get my shit togeth be homeless is probabl the worst experi i have ever gone through and it made me realiz thing can alway get wors whatev you'r go thru it help to have support net...,Y
1,what kid i can't afford to have kid,Z
2,my mom think you just string togeth the first letter and everyon just know what you meant she'll be on her way to my hous and i get text that say someth like iomwbftddihtstpadywa my mom would have you believ that everyon know what she meant i'll help all on thi first one becaus you'r new i'm on my way but fuck these damn diuret i have to stop to pee again do you want snicker coke,Y
3,lot of teen angst go on in here i like follow younger genz stuff to stay connect but it' the doom is definit pattern i see on reddit and tiktok i genuin think younger genz lost lot more to the pandem than we did it seem like some never recov and it the neg just continu i see so mani peopl say they don't have friend don't go out and onli do solo hobbi etc it genuin concern i wa extrem social in...,Z
4,it' reddit problem take all the post you see with grain of salt ton of the peopl on the entir websit are chronic onlin especi the younger one,Z


In [53]:
# Create output folder and export data frame 'comments' as 'results.csv':
os.makedirs('../data', exist_ok=True)
comments.to_csv('../data/results.csv', index=False)
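
Creating the output folder and writing into it are safest when both steps use the same path. The sketch below shows one way to do this with `pathlib`, run inside a temporary directory so that nothing is written to the project; the folder layout and file contents are placeholders for illustration:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output location inside a throwaway directory.
    output_dir = Path(tmp) / "data"
    output_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)  # second call is a no-op

    # Build the file path from the same directory object that was created.
    output_path = output_dir / "results.csv"
    output_path.write_text("comment_text,generation\nwhat kid,Z\n")
    created = output_path.exists()

print(created)  # True
```

With `exist_ok=True` the directory call can be repeated safely, so no separate existence check is needed.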