# Identifying Precise Forecasters on r/Wallstreetbets
**BrainStation Data Science Bootcamp - Capstone Project**

**Author: L Gavrilova**

**Date: 15 October 2023**

# Notebook 2 - Text Cleaning 

## Table of Contents

1. [Introduction](#1.-Introduction)

Removing website links <br>
Filtering out emojis by creating a new column <br>
2. Spellcheck.

. [Conclusion](#5.-Conclusion)

## 1. Text Cleaning

In [1]:
# Importing several libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (6.0, 4.0) #setting figure size
import seaborn as sns
import matplotlib.dates as mdates
import os
import emoji
import re
import pyspellchecker

ModuleNotFoundError: No module named 'emoji'

In [18]:
DATAFILE = '../data/merged_data.csv'

In [19]:
df = pd.read_csv(DATAFILE)

  df = pd.read_csv(DATAFILE)


### Step 1. Removing website links

In [43]:
# This code uses the re.sub function from the re module to remove all website links by replacing them with empty spaces
df_clean = df.copy()

# Function to clean text
def purge_url(text):
    # Define url in regex
    url_pattern = r'https?://\S+|www\.\S+'
    # Remove special characters and symbols
    purged_text = re.sub(url_pattern, '', text)
    return purged_text

# Clean the 'title' column
df_clean['title'] = df_clean['title'].apply(purge_url)

# Clean the 'selftext' column
df_clean['selftext'] = df_clean['selftext'].apply(purge_url)

In [44]:
# Create a copy of the initial df and new empty columns 'Tagging' and 'Hashtags' in the DataFrame
# These two new columns, 'Tagging' and 'Hashtags', extract mentions and hashtags from the 'selftext' column 
# and format them as comma-separated lists in their respective columns. This can be useful for analyzing and categorizing the content of the posts.
# I borrowed this approach from a capstone project on sentiment done by another BrainStation graduate Mohamed Imran (June 2023).  

df_clean['Tagging'] = ""
df_clean['Hashtags'] = ""

# Iterate through each row in the DataFrame
for index, row in df_clean.iterrows():
    selftext = row['selftext']  # Get the 'selftext' content for the current row
    words = selftext.split()  # Split the selftext into words

    # Initialize empty lists to store mentions and hashtags
    mentions = []
    hashtags = []

    # Iterate through each word in the posts
    for word in words:
        if word.startswith('@'):
            mentions.append(word)  # Include '@' symbol
        elif word.startswith('#'):
            hashtags.append(word)  # Include '#' symbol

    # Join the mentions and hashtags lists into comma-separated strings
    tagging_str = ', '.join(mentions)
    hashtags_str = ', '.join(hashtags)

    # Update the 'Tagging' and 'Hashtags' columns in the DataFrame
    df_clean.at[index, 'Tagging'] = tagging_str
    df_clean.at[index, 'Hashtags'] = hashtags_str

# Replace an empty field with 'None'

In [45]:
# Checking the new columns
df_clean['Tagging'].value_counts()
#df_clean['Hashtags'].value_counts()

Tagging
                               1411436
@                                 3134
@, @                               607
@, @, @                            218
@, @, @, @                         115
                                ...   
@, @, @, @, @, @, @, @, @8k          1
@6k, @2.5k, @5.3k                    1
@0.75                                1
@1.29,                               1
@Joe                                 1
Name: count, Length: 1140, dtype: int64

### Filtering out `emojis` by creating a new column

In [47]:


def extract_and_replace_emojis(df, text_column_name='selftext', emoji_column_name=''):
    # Initialize an empty column for extracted emojis
    df[emoji_column_name] = ''

    # Function to extract and replace emojis in a text
    def process_text(text):
        emojis = []
        # Define a regular expression pattern to detect emojis
        emoji_pattern = r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U0001FB00-\U0001FBFF\U0001F004]+'
        
        # Find all emojis in the text using the regex pattern
        emojis = re.findall(emoji_pattern, text)
        
        # Replace emojis with their descriptions (e.g., 🚀 -> 'rocket')
        for emoji in emojis:
            text = text.replace(emoji, emoji_description(emoji))
        
        return text, emojis

    # Apply the processing function to the specified column
    df[[text_column_name, emoji_column_name]] = df[text_column_name].apply(process_text).apply(pd.Series)

    return df

# Function to map emojis to their descriptions (add more as needed)
def emoji_description(emoji):
    emoji_map = {
        "🚀": "rocket",
        # Add more emoji mappings here
    }
    return emoji_map.get(emoji, emoji)  # Default to original emoji if not in mapping

# Example usage:
test = pd.DataFrame({'selftext': ["WE ARE PLUS 430!!!'Hello 🚀 world! 👋', 🚀🌙🦍 '🐱 Meow! 🐶 🚀 Let's go to the moon! 🦍🚀 #apes #rocket  WE ARE DOING IT. NOW KEEP HOLDING AND DONT GIVE INTO WHAT THEY WANT. IT WILL BE 600 BEFORE THE DAYS OUT AND THEN MONDAY IT WILL OPEN AT $1000 🤞 IF NOT THEN JUST A LITTLE LESS. BUY BUY BUY PEOPLE THIS IS STILL ONLY THE BEGINNING!!! AND DONT SELL WHEN YOU SEE THE DROP THIS MORNING THIS IS WHAT THEY WANT YOU TO DO AND THEN THEY WILL MAKE IT SO YOU CAN BUY ANYMORE STOCKS OR GAME STOP LIKE THEY DONE YESTERDAY BUT WE ARE NOT STUPID!! HOLDING AND NOT SELLING AND BUYING MORE HERE WHEN LAME ASS ROBINHOOD OPENS. HOLD HOLD HOLD. TO THE MOON🚀🚀🚀🚀🚀🚀🚀"]})
test = extract_and_replace_emojis(test)
print(test)


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               selftext  \
0  WE ARE PLUS 430!!!'Hello rocket world! 👋', rocket🌙🦍 '🐱 Meow! 🐶 rocket Let's go to the moon! 🦍rocket #apes #rocket  WE ARE DOING IT. NOW KEEP HOLDING AND DONT GIVE INTO WHAT THEY WANT. IT WILL BE 600 BEFORE THE DAYS OUT AND THEN MONDAY IT WILL OPEN AT $1000 🤞 IF NOT THEN JUST A LITTLE LESS. BUY BUY

import pandas as pd
import emoji

def extract_and_replace_emojis(df, text_column_name='selftext', emoji_column_name='emojis'):
    # Initialize an empty column for extracted emojis
    df[emoji_column_name] = ''

    # Function to extract and replace emojis in a text
    def process_text(text):
        emojis = []
        replaced_text = emoji.demojize(text)  # Replace emojis with words
        for word in replaced_text.split():
            if word.startswith(':rocket:'):
                emojis.append('rocket')
                replaced_text = replaced_text.replace(word, 'rocket')  # Replace rocket emojis with 'rocket' word
        return replaced_text, emojis

    # Apply the processing function to the specified column
    df[[text_column_name, emoji_column_name]] = df[text_column_name].apply(process_text).apply(pd.Series)

    return df



# Example usage:
# Assuming df is your DataFrame with a 'selftext' column
test = pd.DataFrame({'selftext': ["WE ARE PLUS 430!!! 🌙  🦍 WE ARE DOING IT. NOW KEEP HOLDING AND DONT GIVE INTO WHAT THEY WANT. IT WILL BE 600 BEFORE THE DAYS OUT AND THEN MONDAY IT WILL OPEN AT $1000 🤞 IF NOT THEN JUST A LITTLE LESS. BUY BUY BUY PEOPLE THIS IS STILL ONLY THE BEGINNING!!! AND DONT SELL WHEN YOU SEE THE DROP THIS MORNING THIS IS WHAT THEY WANT YOU TO DO AND THEN THEY WILL MAKE IT SO YOU CAN BUY ANYMORE STOCKS OR GAME STOP LIKE THEY DONE YESTERDAY BUT WE ARE NOT STUPID!! HOLDING AND NOT SELLING AND BUYING MORE HERE WHEN LAME ASS ROBINHOOD OPENS. HOLD HOLD HOLD. TO THE MOON🚀🚀🚀🚀🚀🚀🚀"]})
test = extract_and_replace_emojis(test)
print(test)


In [48]:
# Call the function to extract and replace emojis from 'title' and 'selftext'
df_clean = extract_and_replace_emojis(df_clean, text_column_name='title', emoji_column_name='emoji_title')
df_clean = extract_and_replace_emojis(df_clean, text_column_name='selftext', emoji_column_name='emoji_selftext')

# Now df contains the modified data with emojis extracted and replaced


In [49]:
# Checking the new columns
df_clean.sample().T
df_clean['emoji_title'].value_counts()

emoji_title
[]              1247412
[🚀🚀🚀]             14501
[🚀]               12843
[🚀🚀]               5385
[🚀🚀🚀🚀]             5144
                 ...   
[🏳, 🌈🐻, 🚀🚀🚀]          1
[💎🤲🦍, 🦍🤲💎]            1
[🌕🌕🌕🔥🔥]               1
[🤣🦍🍌]                 1
[📈🌒]                  1
Name: count, Length: 40243, dtype: int64

In [50]:
# Checking the new columns
df_clean.sample().T
df_clean['emoji_selftext'].value_counts()

emoji_selftext
[]                                                              1382795
[🚀🚀🚀]                                                              1168
[🚀]                                                                1164
[💎🙌]                                                                895
[🦍]                                                                 586
                                                                 ...   
[🌈, 🐻, 🚀, 🚀, 🚀🚀🚀🚀🚀🚀🚀🚀]                                                1
[👍🏻, 😂, 😉, 🙌🏻💎, 😉, 🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🚀🙌🏻💎🙌🏻💎🙌🏻💎🙌🏻💎🚀🚀🚀🚀, 🦍, 😉, 🌈🐻, 😎]          1
[😎🚀🚀🚀🚀🚀🚀💎🙌]                                                           1
[💎👐, 🤔]                                                               1
[🧻🤚, 💎💎🤲, 💎🤲, 🚀🚀🚀🚀🚀🚀, 🌑]                                              1
Name: count, Length: 15536, dtype: int64

### Spellchecking 

In [60]:
from spellchecker import SpellChecker

spell = SpellChecker()

# Combine 'title' and 'selftext' columns into a single 'text' column
df_clean['text'] = df_clean['title'] + ' ' + df_clean['selftext']

# Drop rows with missing or empty 'text' values
df_clean.dropna(subset=['text'], inplace=True)

# Spellcheck and correct the 'text' column
df_clean['corrected_text'] = df_clean['text'].apply(lambda x: ' '.join([spell.correction(word) for word in x.split()]))

# Now you have a new 'corrected_text' column with spell-checked text


# Assuming you have a DataFrame named df_clean with 'title' and 'selftext' columns

# Now, we combine the text from 'title' and 'selftext' columns into a single column named 'text'
df_clean['text'] = df_clean['title'] + ' ' + df_clean['selftext']

# Next, we go through each row of the 'text' column
for index, row in df_clean.iterrows():
    text = row['text']  # Get the text content for the current row
    
    if isinstance(text, str):  # Check if the text is a valid string
        # Split the text into words (like breaking a sentence into words)
        words = text.split()
        
        # Initialize an empty list to store corrected words
        corrected_words = []
        
        # Now, we check and correct the spelling for each word
        for word in words:
            corrected_word = spell.correction(word)  # Correct the spelling of the word
            corrected_words.append(corrected_word)    # Add the corrected word to the list
            
        # Join the corrected words back into a single string
        corrected_text = ' '.join(corrected_words)
        
        # Replace the original 'text' column with the corrected text
        df_clean.at[index, 'text'] = corrected_text
    else:
        # If the text is not a string (e.g., None or empty), simply skip it
        pass

# Print the corrected text for the first row
print(df_clean['text'].iloc[0])


TypeError: sequence item 6: expected str instance, NoneType found

# Define the emojis pattern  - code by Mohammed Imran - maybe needs to be deleted 

 
emojis_pattern = r'(:\) |:\]|: \)|=\)|:d|;d|:\(|:\[|:- \))'

# Add a new column to store the presence of emojis
emojis = df_clean['title'].str.extractall(emojis_pattern).groupby(level=0).agg(','.join)
df_clean['emojis'] = emojis if not emojis.empty else None
df_clean['emojis'] = df_clean['Emojis'].fillna('None')

### Remove all non-word characters 

In [None]:
import re
# This code uses the re.sub function from the re module to replace all non-word characters (anything that is not a letter, digit, or underscore) 
# with a single space. It then removes extra spaces and converts the text to lowercase to obtain a cleaned version with only words.

# Function to clean text
def clean_text(text):
    # Remove special characters and symbols
    cleaned_text = re.sub(r'[^\w\s]', '', text)
    
    # Remove extra spaces
    cleaned_text = ' '.join(cleaned_text.split())
    
    # Convert to lowercase
    cleaned_text = cleaned_text.lower()
    
    return cleaned_text

# Clean the 'title' column
df_clean['title'] = df_clean['title'].apply(clean_text)

# Clean the 'selftext' column
df_clean['selftext'] = df_clean['selftext'].apply(clean_text)

In [None]:
# to be deleted
df_clean['rockets'].value_counts

# sanity check
df_clean.sample(5).T

Unnamed: 0,336687,156524,613391,598684,1188697
id,m549x6,mu7sit,m7sov3,nd46ng,lyic5k
author,vvelouriaa,CthuluThePotato,carlos_tak,[deleted],-5H4D0W
created,2021-03-14 21:07:14,2021-04-19 19:00:11,2021-03-18 15:00:37,2021-05-15 17:37:54,2021-03-05 18:22:37
pinned,0,0,0,0,0
removed,0,0,0,1,0
deleted,0,0,0,1,0
is_self,1,1,1,1,0
is_video,0,0,0,0,0
title,is it better to invest in a high fee simple ira or taxable brokerage account,unpopular opinion technical indicators dont mean shit for gamestop,a poem by john greenleaf whittier that has helped me through this rollercoaster of emotions that has been holding gme,tips for taking a mini work sabbatical,is it just me
link_flair_text,Retirement,🐵 Discussion 💬,💎🙌,Other,Valuation


### cleaning special characters - not sure this is needed?

In [None]:
df_clean.drop(df_clean[df_clean['selftext'].str.contains(r'ñ||å||§|ù||¹|¡|³|ã|©|®|â|¬|î|±|ä|°|ð|ç||é|ì|²|\
                                    ¢|×|¨|æ|¸|ë|ê|»|¶|à|¼|¾||£|')].index, inplace=True)

In [58]:
# Let's check if each 'selftext' entry contains at least one character that is not a lowercase letter or a space.

df_clean[df_clean['selftext'].str.contains(r'[^a-z0-9 ]')].sample(1).T

Unnamed: 0,217211
id,pprrrf
author,rozodots
created,2021-09-17 02:46:53
pinned,0
removed,0
deleted,0
is_self,1
is_video,0
title,How do you screen/identify companies worth investing? How do they end up on your stocks radar?
link_flair_text,Advice Request


In [53]:
desired_id = 'maco48'  # Replace 'your_desired_id' with the actual ID you're looking for
selected_row_after = df_clean[df_clean['id'] == desired_id]

if not selected_row_after.empty:
    print(selected_row_after[['selftext', 'title']])
else:
    print("No record found with the specified 'id'.")


                                                                                                                                                                                selftext  \
503468  Hello guys, I've made some clarifications on some intresting points that some apes made. I've also added some usefull information that certain poeple made in the comments.\n\n[   

                                                    title  
503468  GameStop's voting date - June 10th 2021* (edited)  


In [54]:
desired_id = 'maco48'  # Replace 'your_desired_id' with the actual ID you're looking for
selected_row_before = df[df['id'] == desired_id]

if not selected_row_before.empty:
    print(selected_row_before[['selftext', 'title']])
else:
    print("No record found with the specified 'id'.")


                                                                                                                                                                                                                                                                                                                                                                                                                                                                              selftext  \
503468  Hello guys, I've made some clarifications on some intresting points that some apes made. I've also added some usefull information that certain poeple made in the comments.\n\n[https://www.reddit.com/r/GME/comments/m9enm6/i\_believe\_that\_the\_next\_annual\_date\_is\_june\_10/?utm\_source=share&utm\_medium=web2x&context=3](https://www.reddit.com/r/GME/comments/m9enm6/i_believe_that_the_next_annual_date_is_june_10/?utm_source=share&utm_medium=web2x&context=3)   

                                                   

In [55]:
# sanity check
df_clean[df_clean['title'] == '']

Unnamed: 0,id,author,created,pinned,removed,deleted,is_self,is_video,title,link_flair_text,...,num_comments,num_crossposts,selftext,shortlink,FolderName,date,Tagging,Hashtags,emoji_title,emoji_selftext
27940,om3gcx,Glittering-Pie6039,2021-07-17 12:38:49,0,0,0,0,0,,📱 Social Media 🐦,...,82,0,,https://redd.it/om3gcx,gme,2021-07-17,,,[],[]
37962,lmptkh,LiftUpVets,2021-02-18 15:54:10,0,1,0,0,0,,News,...,61,0,,https://redd.it/lmptkh,wallstreetbets,2021-02-18,,,[],[]
103148,ltf38z,merdock_69,2021-02-27 03:23:20,0,0,0,0,0,,💎🙌,...,23,0,,https://redd.it/ltf38z,gme,2021-02-27,,,[],[]
127218,mirux3,foknrekt,2021-04-02 19:39:54,0,0,0,0,0,,News 📰,...,19,1,,https://redd.it/mirux3,gme,2021-04-02,,,[],[]
139154,lstzd6,jlvillaraza,2021-02-26 09:19:07,0,0,0,0,0,,Hedge Fund Tears,...,17,0,,https://redd.it/lstzd6,gme,2021-02-26,,,[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1409355,l7ikqp,jmvp,2021-01-29 04:00:33,0,1,0,1,0,,Discussion,...,0,0,[removed],https://redd.it/l7ikqp,wallstreetbets,2021-01-29,,,[],[]
1412043,l7fite,[deleted],2021-01-29 01:41:11,0,1,0,1,0,,News,...,0,0,[removed],https://redd.it/l7fite,wallstreetbets,2021-01-29,,,[],[]
1414607,l7h2p7,lefty11235,2021-01-29 02:52:32,0,1,0,1,0,,Meme,...,0,0,[removed],https://redd.it/l7h2p7,wallstreetbets,2021-01-29,,,[],[]
1414948,l7gwmo,laydra100,2021-01-29 02:45:03,0,1,0,1,0,,News,...,0,0,[removed],https://redd.it/l7gwmo,wallstreetbets,2021-01-29,,,[],[]


### Allocate to a cluster based on the preferred type of post

In [None]:
df_agg_by_date['link_flair_text'] == ""

Discussion                         268004
Meme                               126793
YOLO                               125227
News                                78455
Gain                                68969
DD                                  39078
Loss                                31760
Technical Analysis                  13242
Chart                                8844
Daily Discussion                      600
Shitpost                              551
Donation                              258
Earnings Thread                        80
Weekend Discussion                     55
Mods                                   52

SyntaxError: invalid syntax (2046917487.py, line 3)

## Frequency of posting by various accounts  -FILTER by name

## 5. Conclusion

To conclude this notebook, it involves data collection, cleaning and some preliminary exploration of dataset by going through EDA techniques in order to understand the data and plot meaningful insights. 

In my next notebook, I will be going for a preprocessing and set a base model using Logistic Regression