# Identifying Precise Forecasters on r/Wallstreetbets
**BrainStation Data Science Bootcamp - Capstone Project**

**Author: L Gavrilova**

**Date: 6 November 2023**

# Notebook 2B - Reddit datasets - Text Cleaning 

## 2.0. Introduction


This notebook closely follows the notebook that cleans the labelled dataset.

In this notebook (2B) I perform the following steps:

1. I load the Reddit dataset and merge the 'title' and 'selftext' fields into one new column named 'text', so that only one column needs processing instead of two. 'selftext' is missing in about 29% of rows, whereas 'title' has no missing values, and the two fields do not differ meaningfully in what they convey, so it makes sense to combine them.
2. I repeat the cleaning steps that I designed for the labelled dataset: removing website links, hashtags and mentions, dropping empty rows, extracting emojis into a separate column while replacing them in the text with short descriptions, and applying the WSB dictionary.

The result is a CSV file that is ready for further machine processing.

## 2.1. Cleaning Text

### 2.1.1. Data Loading and Basic Checks

In [3]:
# Standard Libraries for data manipulation
import pandas as pd
import numpy as np

# Regular Expressions Library
import re

# Emoji Handling Library
import emoji

In [4]:
df = pd.read_csv('../data/reddit_GMEonly_cleaned.csv')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155167 entries, 0 to 155166
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   155167 non-null  object 
 1   author               155167 non-null  object 
 2   created              155167 non-null  object 
 3   removed              155167 non-null  int64  
 4   deleted              155167 non-null  int64  
 5   is_self              155167 non-null  int64  
 6   is_video             155167 non-null  int64  
 7   title                155167 non-null  object 
 8   link_flair_text      155167 non-null  object 
 9   upvote_ratio         155167 non-null  float64
 10  score                155167 non-null  int64  
 11  num_comments         155167 non-null  int64  
 12  selftext             110596 non-null  object 
 13  shortlink            155167 non-null  object 
 14  FolderName           155167 non-null  object 
 15  word_count_selfte

In [6]:
df.sample(5)

Unnamed: 0,id,author,created,removed,deleted,is_self,is_video,title,link_flair_text,upvote_ratio,score,num_comments,selftext,shortlink,FolderName,word_count_selftext,word_count_title,date
94147,pu4hfs,ercanbas,2021-09-23 20:54:32,1,0,1,0,GME and ComputerShare,Discussion,1.0,1,1,[removed],https://redd.it/pu4hfs,wallstreetbets,1,3,2021-09-23
142739,l76uyz,[deleted],2021-01-28 19:55:04,1,1,0,0,"Public allowing sale of GME, AMC, KOSS once more",News,1.0,1,0,[deleted],https://redd.it/l76uyz,wallstreetbets,1,9,2021-01-28
123093,l70ccz,SenateMajorityLeader,2021-01-28 15:51:23,0,0,1,0,Is WSB down again???,none,0.75,2,0,I think Robinhood got hit with a lawsuit so th...,https://redd.it/l70ccz,gme,25,4,2021-01-28
155153,l7ga3v,[deleted],2021-01-29 02:15:32,1,1,0,0,We respect gamestop because their company's hi...,Discussion,1.0,1,0,[deleted],https://redd.it/l7ga3v,wallstreetbets,1,38,2021-01-29
132324,l7z5ea,GotSodium,2021-01-29 17:55:05,1,0,0,0,ROBINHOOD REDUCED THE AMOUNT OF GME YOU CAN OW...,Discussion,1.0,1,0,,https://redd.it/l7z5ea,wallstreetbets,1,21,2021-01-29


In [7]:
df['id'].nunique() == df.shape[0]

True

In [8]:
df.isna().sum()/df.shape[0]

id                     0.000000
author                 0.000000
created                0.000000
removed                0.000000
deleted                0.000000
is_self                0.000000
is_video               0.000000
title                  0.000000
link_flair_text        0.000000
upvote_ratio           0.000000
score                  0.000000
num_comments           0.000000
selftext               0.287245
shortlink              0.000000
FolderName             0.000000
word_count_selftext    0.000000
word_count_title       0.000000
date                   0.000000
dtype: float64

In [9]:
df_clean = df.copy() 

In [10]:
# Concatenate 'title' and 'selftext' into a new column 'text'
df_clean['text'] = df_clean['title'] + ' ' + df_clean['selftext'].fillna('')

# Check for missing values in the new 'text' column
missing_values = df_clean['text'].isnull().sum()

# Print the number of missing values
print(f"Number of missing values in 'text': {missing_values}")

# Drop the legacy columns 'selftext' and 'title'
df_clean = df_clean.drop(columns=['title', 'selftext'])

Number of missing values in 'text': 0
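
The `fillna('')` call matters because concatenating a string with a missing value yields a missing value, which would wipe out the title of every post that has no body. A minimal sketch on a toy frame (the two rows and the names `toy`, `naive` and `merged` are invented for illustration):

```python
import pandas as pd

# Toy frame (hypothetical rows): the second post has no body text
toy = pd.DataFrame({
    "title": ["GME to the moon", "Is WSB down again???"],
    "selftext": ["Holding since January", None],
})

# Without fillna, a missing 'selftext' makes the whole result missing
naive = toy["title"] + " " + toy["selftext"]

# With fillna(''), the title survives even when 'selftext' is missing
merged = toy["title"] + " " + toy["selftext"].fillna("")

print(naive.isna().tolist())
print(merged.tolist())
```

Note that a post without a body still ends with the separator space, so a merged value is never a strictly empty string.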


In [11]:
# Function to clean text
def purge_content(text):
    if pd.isna(text):
        return text  # Return NaN as it is
    text_without_urls = re.sub(r'https?://\S+|www\.\S+', '', text)
    text_without_hashtags = re.sub(r'#\S+', '', text_without_urls)
    text_without_mentions = re.sub(r'@\S+', '', text_without_hashtags)
    clean_text = re.sub(r'\n+', ' ', text_without_mentions)

    return clean_text

# Use the apply function to clean the merged 'text' column
df_clean.loc[:, 'text'] = df_clean['text'].apply(purge_content)
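
To see what the four substitutions do in practice, here is a small sketch that applies the same patterns to an invented post. `clean_sample` is a hypothetical standalone copy of the regex steps (without the NaN check, so it runs without pandas), and the sample strings are made up:

```python
import re

def clean_sample(text):
    """Standalone copy of the four regex steps used in purge_content."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove links
    text = re.sub(r'#\S+', '', text)                   # remove hashtags
    text = re.sub(r'@\S+', '', text)                   # remove mentions
    return re.sub(r'\n+', ' ', text)                   # line breaks -> one space

sample = "Read https://redd.it/abc123 now!\n\n$GME #YOLO @someone holding"
cleaned = clean_sample(sample)
print(repr(cleaned))

# A post that consisted only of a link keeps its separator space
link_only = clean_sample("https://redd.it/abc123 ")
print(repr(link_only))
```

Two things are worth noticing: cashtags such as `$GME` are kept, and the removed tokens leave runs of spaces behind, because none of the four patterns collapses whitespace.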

In [12]:
pd.set_option('display.max_colwidth', None)

In [13]:
# sanity check
df_clean[df_clean['text'] == '']

Unnamed: 0,id,author,created,removed,deleted,is_self,is_video,link_flair_text,upvote_ratio,score,num_comments,shortlink,FolderName,word_count_selftext,word_count_title,date,text


In [14]:
# Drop rows where the 'text' column is an empty string
df_clean = df_clean[df_clean['text'] != '']
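
Because every merged value contains at least the separator space, an exact comparison with `''` cannot flag a post whose content was removed entirely by the cleaning step. A possible stricter check, sketched on a toy frame (the rows and the names `toy`, `is_blank` and `toy_nonblank` are assumptions for illustration, not part of the pipeline above):

```python
import pandas as pd

# Toy frame (hypothetical rows); the second value mimics a link-only post after cleaning
toy = pd.DataFrame({"text": ["GME to the moon ", " ", "hold  the line"]})

# An exact comparison with '' misses the whitespace-only row
exact_matches = int((toy["text"] == "").sum())
print(exact_matches)

# Stripping surrounding whitespace first catches it
is_blank = toy["text"].str.strip() == ""
print(int(is_blank.sum()))

# Keep only the rows that still contain some text
toy_nonblank = toy[~is_blank].copy()
print(toy_nonblank["text"].tolist())
```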

In [15]:
# Optional checkpoint (disabled): save the cleaned dataset as a new csv file to be used in other notebooks
# df_clean.to_csv('../data/02_reddit_GMEonly_cleaned2.csv', index=False)

### 2.1.2. Filtering out emojis by creating a new column

In [16]:
# Function to map emojis to their descriptions
# (the parameter is named emoji_str so that it does not shadow the imported emoji module)
def emoji_description(emoji_str):
    emoji_map = {
        "🚀": " super optimistic, ",
        "🦍": " brotherhood, ",
        "🤞": " hope, ",
        "🌙": " very optimistic, ",
        "🌕": " very optimistic, ",
        "💎🤚🏼": " patient investors, ",
        "💎🖐": " patient investors, ",
        "💎🙌": " patient investors, ",
        "🙌": " patient investors, ",
        "💎": " patient investors, ",
        "🧻🤚🏼": " impatient investors, ",
        "🧻🖐": " impatient investors, ",
        # Add more mappings as needed
    }
    # If the full emoji string is in the map, return the description.
    # Note: process_text calls this function one character at a time,
    # so only the single-character keys are matched from there.
    if emoji_str in emoji_map:
        return emoji_map[emoji_str]
    # If not, look up each character separately and join the descriptions
    # (characters that are not in the mapping contribute an empty string)
    return ''.join(emoji_map.get(char, '') for char in emoji_str)

def extract_and_replace_emojis(df, text_column_name='text', emoji_column_name='emoji_text'):
    # Compile the emoji pattern once instead of once per row
    emoji_pattern = re.compile(r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U0001FB00-\U0001FBFF\U0001F004]+')

    # Function to extract and replace emojis in a text
    def process_text(text):
        # Find all runs of emojis in the text using the regex pattern
        emoji_matches = emoji_pattern.findall(text)
        emojis_extracted = ''
        text_with_replaced_emojis = text

        # Iterate over the found emoji runs
        for emoji_str in emoji_matches:
            # Handle each character of the run separately
            for emoji_char in emoji_str:
                emoji_desc = emoji_description(emoji_char)  # Description, or '' if the emoji is not mapped
                text_with_replaced_emojis = text_with_replaced_emojis.replace(emoji_char, emoji_desc, 1)
                emojis_extracted += emoji_char + ' '  # Add space to separate emojis

        # Return the modified text and the extracted emojis
        return text_with_replaced_emojis, emojis_extracted.strip()

    # Apply the processing function to the specified column
    result = df[text_column_name].apply(process_text)
    df[text_column_name] = result.apply(lambda x: x[0])

    # Store the extracted emojis in their own column if a column name is provided
    if emoji_column_name:
        df[emoji_column_name] = result.apply(lambda x: x[1])

    return df

In [17]:
# Applying the function to extract and replace emojis from the 'text' column
df_clean = df_clean.copy()
df_clean = extract_and_replace_emojis(df_clean, text_column_name='text', emoji_column_name='emoji_text')

In [18]:
# Checking the new column with emojis extracted from the text
df_clean.sample().T
df_clean['emoji_text'].value_counts()

emoji_text
                                           118163
🚀 🚀 🚀                                        3202
🚀                                            2139
🚀 🚀 🚀 🚀                                      1311
🚀 🚀                                          1185
                                            ...  
💲 💎 💎                                           1
🤤 💎 💎 🚀 🚀 🚀                                     1
🌖 🦧 🍌 🚀 💎 🙌                                     1
💎 💎 💎 💎 💎 💎 🙏 🏼 🙏 🏼 🙏 🏼 🙏 🏼 💎 💎 💎 💎 💎 💎         1
🥝 🚀 🚀 🚀                                         1
Name: count, Length: 12881, dtype: int64
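
The counts above show skin-tone modifiers such as 🏼 separated from their base emoji, because the lookup works one code point at a time. For the same reason, multi-character keys such as "💎🙌" are never reached by a per-character lookup. One possible alternative, shown here only as a sketch, is to match whole sequences with the longest keys tried first. The names `toy_map`, `sequence_pattern` and `describe_emojis` are hypothetical, and `toy_map` is a small subset of the mapping defined earlier:

```python
import re

# Subset of the notebook's emoji_map, used here only for illustration
toy_map = {
    "💎🙌": " patient investors, ",
    "💎": " patient investors, ",
    "🙌": " patient investors, ",
    "🚀": " super optimistic, ",
}

# Longest keys first, so "💎🙌" is tried before "💎" or "🙌" alone
keys = sorted(toy_map, key=len, reverse=True)
sequence_pattern = re.compile("|".join(re.escape(k) for k in keys))

def describe_emojis(text):
    """Replace each mapped emoji sequence with its description."""
    return sequence_pattern.sub(lambda m: toy_map[m.group()], text)

described = describe_emojis("GME 💎🙌🚀")
print(repr(described))
```

With this approach "💎🙌" produces a single description instead of two, which may or may not be the desired behaviour for downstream sentiment scoring.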

### 2.1.3. Replacing slang with a custom-made "WSB Dictionary"

In [19]:
# Load the WSB lingo dictionary
wsb_dict_df = pd.read_csv('../data/WSB_dictionary.csv')

# Convert the DataFrame to a dictionary
wsb_dict = dict(zip(wsb_dict_df['WSB lingo'], wsb_dict_df['English']))

# Build the whole-word pattern once rather than on every call;
# longer entries come first so that shorter overlapping entries do not shadow them
wsb_keys = sorted(wsb_dict.keys(), key=len, reverse=True)
wsb_pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in wsb_keys) + r')\b')

# Function to replace WSB lingo with English
def replace_wsb_lingo(text):
    # Replace each matched term with its English equivalent (the match is case-sensitive)
    return wsb_pattern.sub(lambda x: wsb_dict[x.group()], text)

# Apply the function to the 'text' column
df_clean['text'] = df_clean['text'].apply(replace_wsb_lingo)
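
The behaviour of the whole-word replacement can be checked on a toy dictionary. The three entries below are hypothetical and are not taken from `WSB_dictionary.csv`; `toy_dict`, `toy_pattern` and `replace_toy_lingo` are names introduced only for this sketch, which also assumes that longer entries should be tried first:

```python
import re

# Hypothetical entries for illustration; the real ones live in WSB_dictionary.csv
toy_dict = {
    "tendies": "profits",
    "diamond hands": "holding through volatility",
    "hodl": "hold",
}

# Longest phrases first, wrapped in word boundaries so only whole words match
keys = sorted(toy_dict, key=len, reverse=True)
toy_pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b')

def replace_toy_lingo(text):
    """Replace each whole-word dictionary term with its translation."""
    return toy_pattern.sub(lambda m: toy_dict[m.group()], text)

translated = replace_toy_lingo("hodl for tendies, diamond hands only")
print(translated)

# Upper-case variants and longer words are left untouched
untouched = replace_toy_lingo("HODL and hodling")
print(untouched)
```

The second call shows two limits of this kind of pattern: it is case-sensitive, and the word boundaries prevent matches inside longer words.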

### 2.1.4. Examples of texts before and after the cleaning steps

In [20]:
original_with_index = df.loc[2026]
print(original_with_index)

clean_with_index = df_clean.loc[2026]
print(clean_with_index)

id                                                                                lb2mfm
author                                                                    scott_doge_wow
created                                                              2021-02-02 18:26:07
removed                                                                                1
deleted                                                                                0
is_self                                                                                0
is_video                                                                               0
title                  My friend got his boomer mom to buy some $GME. 💎🤲 ANYONE CAN HOLD
link_flair_text                                                                     YOLO
upvote_ratio                                                                        0.96
score                                                                               7993
num_comments         

In [21]:
original_with_index = df.loc[3386]
print(original_with_index)

clean_with_index = df_clean.loc[3386]
print(clean_with_index)

id                                                     l8wp15
author                                             davidr2448
created                                   2021-01-30 21:52:05
removed                                                     0
deleted                                                     0
is_self                                                     0
is_video                                                    1
title                  Current Situation 😂📈🚀🍿🔥 $GME $AMC $NOK
link_flair_text                                          none
upvote_ratio                                             0.98
score                                                    1125
num_comments                                              156
selftext                                                  NaN
shortlink                              https://redd.it/l8wp15
FolderName                                                gme
word_count_selftext                                         1
word_cou

## 2.2. Exporting the cleaned dataset to a .csv file

In [22]:
# Save the cleaned DataFrame as a new csv file to be used in later notebooks
df_clean.to_csv('../data/02_reddit_GMEonly_wo_emoji.csv', index=False)
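
One thing to keep in mind when this file is loaded in a later notebook: posts without emojis hold an empty string in 'emoji_text', and pandas reads empty CSV fields back as NaN by default. A minimal round-trip sketch using an in-memory buffer (the toy rows and the names `toy`, `default_read` and `kept_read` are invented for illustration):

```python
import io
import pandas as pd

# Toy frame (hypothetical rows): the second post contains no emojis
toy = pd.DataFrame({
    "text": ["GME super optimistic", "GME and ComputerShare"],
    "emoji_text": ["🚀", ""],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns the empty field into NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read["emoji_text"].isna().tolist())

# keep_default_na=False keeps it as an empty string
buffer.seek(0)
kept_read = pd.read_csv(buffer, keep_default_na=False)
print(kept_read["emoji_text"].tolist())
```

Whether to pass `keep_default_na=False` or to call `fillna('')` after loading depends on how the other columns should treat missing values.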