# Identifying Precise Forecasters on r/Wallstreetbets
**BrainStation Data Science Bootcamp - Capstone Project**

**Author: L Gavrilova**

**Date: 6 November 2023**

# Notebook 2A - Labelled dataset - Text Cleaning 

## 2.0. Introduction

Notebooks 2A and 2B clean the text of the two datasets used in this project: the labelled dataset (Notebook 2A) and the reddit dataset (Notebook 2B).

In this Notebook 2A I perform the following steps:

1. I load the labelled dataset and perform the following text cleaning steps:

* removing rows with missing values
* removing stray whitespace (in the labels) and newline characters (in the text)
* correcting labels in the target column
* removing website links (URLs), hashtags (#) and mentions (@)
* saving the resulting dataset to a csv file.

2. I then do an additional cleaning step: emojis are taken out of the text and isolated in a separate column, and the emojis covered by my mapping are replaced in the text with a short description. This information can be useful during feature engineering, so I don’t want to lose this indicator of sentiment.

3. Finally, slang words and emojis used inside the wallstreetbets community and not obvious to outsiders are replaced with normal English words and phrases. For that, I created a csv file named "WSB dictionary" in which I mapped the WSB slang to the corresponding common English words.

The result is a csv file that is ready for the machine learning steps that follow.

## 2.1. Cleaning Text

### 2.1.1. Data Loading and Basic Checks

In [32]:
# Standard Libraries for data manipulation
import pandas as pd
import numpy as np

# Regular Expressions Library
import re

# Emoji Handling Library
import emoji

In [33]:
df = pd.read_csv('../data/annotation file 3600 done 1142022.csv')

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5020 entries, 0 to 5019
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   link_id    5001 non-null   object
 1   parent_id  5001 non-null   object
 2   User       5001 non-null   object
 3   Text       5001 non-null   object
 4   Intent     5001 non-null   object
 5   Support    5001 non-null   object
dtypes: object(6)
memory usage: 235.4+ KB


In [35]:
df.sample(5)

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
63,t3_l6y2hy,t3_l6y2hy,MrStaraptor,"Fuck Robinhood \nFuck D1 Capital \nFuck Melvin Capital \nFuck Wall Street \nFUCK EM ALL \nHOLD AMC, HOLD GME DON'T LET THE FIRE DIE OUT",u,y
1847,t3_kwirha,t3_kwirha,Spanky_Stonks,Fuck yeah! Melvin Capital can suck it 📈🚀 GME to the moon!! ☄️,u,y
3900,t3_l8509l,t3_l8509l,Spicy_Yasuo,"Dumbass question I know, but, where if even possible can I buy $gme? I cant find a single app that I have used in the past that is allowing gme trading atm.",y,u
3795,t3_m5tffe,t1_gr2q3fj,FreshestCremeFraiche,GME,u,y
4799,t3_lra5cg,t3_lra5cg,Jd562310,GME,u,y


In [36]:
df.describe()

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
count,5001,5001,5001,5001,5001,5001
unique,1948,3153,4662,4952,6,4
top,t3_ladzdt,t3_ladzdt,AutoModerator,GME,u,y
freq,66,46,14,22,3246,2473


In [37]:
df['link_id'].nunique() == df.shape[0]

False
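
The check above returns `False` because several comments can share one `link_id` (on Reddit it identifies the submission, not the comment), so it cannot serve as a row key. A minimal sketch of how duplicated rows could be counted instead, on a small made-up frame; the values and the choice of `User` plus `Text` as a duplicate key are assumptions for illustration, not something this notebook does:

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook
toy = pd.DataFrame({
    'link_id': ['t3_a', 't3_a', 't3_b', 't3_b'],
    'User':    ['u1',   'u2',   'u3',   'u3'],
    'Text':    ['GME',  'GME',  'hold', 'hold'],
})

# link_id repeats, so it does not identify a row on its own
link_id_is_unique = toy['link_id'].is_unique

# Rows that are exact copies of an earlier row (all columns equal)
n_exact_duplicates = int(toy.duplicated().sum())

# Rows where the same user posted the same text more than once
n_user_text_duplicates = int(toy.duplicated(subset=['User', 'Text']).sum())

print(link_id_is_unique, n_exact_duplicates, n_user_text_duplicates)
```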

In [38]:
df.isna().sum()/df.shape[0]

link_id      0.003785
parent_id    0.003785
User         0.003785
Text         0.003785
Intent       0.003785
Support      0.003785
dtype: float64

In [39]:
df[ df['Text'].isna() ]

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
5001,,,,,,
5002,,,,,,
5003,,,,,,
5004,,,,,,
5005,,,,,,
5006,,,,,,
5007,,,,,,
5008,,,,,,
5009,,,,,,
5010,,,,,,


In [40]:
# Dropping rows that have NaN values
df = df.dropna()

In [41]:
df[ df['Text'].isna() ]

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
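
The missing values shown above sit in rows that are empty in every column, which is why a plain `dropna()` is enough here. As a hedged alternative, the sketch below drops only fully empty rows with `how='all'` and then renumbers the index; the toy values are made up, and resetting the index is optional (it is not done in this notebook).

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: rows 1 and 3 are empty in every column
toy = pd.DataFrame({
    'Text':   ['GME', np.nan, 'hold', np.nan],
    'Intent': ['u',   np.nan, 'y',    np.nan],
})

# how='all' removes only the rows in which every column is missing
toy_no_empty = toy.dropna(how='all')

# reset_index(drop=True) renumbers the remaining rows 0..n-1
toy_no_empty = toy_no_empty.reset_index(drop=True)

print(toy_no_empty)
```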


In [42]:
df['Intent'].value_counts()

Intent
u     3246
y      983
m      370
i      318
n       83
 u       1
Name: count, dtype: int64

In [43]:
# Replacing ' u' with 'u' in the 'Intent' column
df['Intent'] = df['Intent'].str.replace(' u', 'u', regex=False)
# checking again:
value_counts = df['Intent'].value_counts() 
print(value_counts)

Intent
u    3247
y     983
m     370
i     318
n      83
Name: count, dtype: int64
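
The label fix above targets one specific value (`' u'`). A more general sketch, assuming the only valid `Intent` labels are the five seen in the counts above, strips whitespace from every label and then lists anything that is still unexpected. The toy series and the name `ALLOWED_INTENT` are illustrative.

```python
import pandas as pd

# Assumed set of valid labels, taken from the value counts above
ALLOWED_INTENT = {'u', 'y', 'm', 'i', 'n'}

# Toy labels with hypothetical problems: stray spaces and an upper-case letter
toy_labels = pd.Series(['u', ' u', 'y ', 'M', 'n'])

# strip() removes leading and trailing whitespace from every label
normalised = toy_labels.str.strip()

# Anything outside the allowed set needs a manual look
unexpected = normalised[~normalised.isin(ALLOWED_INTENT)]

print(normalised.tolist())
print(unexpected.tolist())
```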


In [44]:
df_clean = df.copy() 

In [45]:
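# Function to clean text: removes URLs, hashtags and mentions, and turns newlines into spaces
def purge_content(text):
    text_without_urls = re.sub(r'https?://\S+|www\.\S+', '', text)
    text_without_hashtags = re.sub(r'#\S+', '', text_without_urls)
    text_without_mentions = re.sub(r'@\S+', '', text_without_hashtags)
    clean_text = re.sub(r'\n+', ' ', text_without_mentions)

    return clean_text

# Apply the function to the whole column at once. This replaces the row-by-row
# chained assignment df_clean['Text'][i] = ..., which triggers SettingWithCopyWarning
# and only works while the index is a contiguous 0..n-1 range.
df_clean['Text'] = df_clean['Text'].apply(purge_content)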
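
To make the effect of the substitutions concrete, here is a small self-contained sketch on a made-up comment. `purge_content` is repeated with the same four regular expressions; `collapse_whitespace` is a hypothetical extra step, not part of this notebook, that shows how the double spaces left behind by the removals could be tidied.

```python
import re

def purge_content(text):
    """Same four substitutions as in the cell above."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'#\S+', '', text)                   # hashtags
    text = re.sub(r'@\S+', '', text)                   # mentions
    return re.sub(r'\n+', ' ', text)                   # newlines -> one space

def collapse_whitespace(text):
    # Hypothetical extra step: squeeze runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

sample = "Check https://example.com now #GME @user\n\nHOLD"
purged = purge_content(sample)
tidied = collapse_whitespace(purged)

print(repr(purged))  # 'Check  now   HOLD'
print(repr(tidied))  # 'Check now HOLD'
```

Note that the removed pieces leave their surrounding spaces behind, which is why `purged` contains runs of two and three spaces.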

In [46]:
pd.set_option('display.max_colwidth', None)

In [47]:
# sanity check
df_clean[df_clean['Text'] == '']

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
402,t3_l66caa,t1_gkyxvml,EllipticalOrbitMan,,i,i
1264,t3_l0mc06,t1_gju8jei,wolfiasty,,i,i
2187,t3_l6kqyk,t1_gl17oj6,EconomicallyLiterate,,i,i
4157,t3_khq3x2,t1_ggo6fi4,JonBoy82,,i,i
4265,t3_lat43j,t1_glq69gz,Free_Joty,,i,i


In [48]:
# Drop rows where the 'Text' column is an empty string
df_clean = df_clean[df_clean['Text'] != '']
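
The comparison with `''` catches only texts that are exactly empty. A text that consisted of a link followed by a space, for instance, would survive as a whitespace-only string. A minimal sketch of a stricter filter, on toy data with hypothetical values:

```python
import pandas as pd

# Toy column: one empty string, one whitespace-only string, two real texts
toy = pd.DataFrame({'Text': ['GME', '', '   ', ' hold ']})

# Exact comparison, as in the cell above: only '' is caught
exact_empty = toy['Text'] == ''

# Stripping first also catches whitespace-only strings
blank = toy['Text'].str.strip() == ''

toy_non_blank = toy[~blank]

print(int(exact_empty.sum()), int(blank.sum()))
print(toy_non_blank['Text'].tolist())
```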

In [49]:
# Recording the cleaned dataset as a new csv file to be used in other notebooks.
# df_clean (not df) is saved, because df still contains the links, hashtags and mentions.
df_clean.to_csv('../data/labelled_dataset_cleaned.csv', index=False)

### 2.1.2. Filtering out emojis by creating a new column

In [50]:
# Function to map emojis to their descriptions
# (the parameter is called emoji_str so that it does not shadow the imported emoji module)
def emoji_description(emoji_str):
    emoji_map = {
        "🚀": " super optimistic, ",
        "🦍": " brotherhood, ",
        "🤞": " hope, ",
        "🌙": " very optimistic, ",
        "🌕": " very optimistic, ",
        "💎🤚🏼": " patient investors, ",
        "💎🖐": " patient investors, ",
        "💎🙌": " patient investors, ",
        "🙌": " patient investors, ",
        "💎": " patient investors, ",
        "🧻🤚🏼": " impatient investors, ",
        "🧻🖐": " impatient investors, ",
        # Add more mappings as needed
    }
    # If the full emoji string is in the map, return the description
    if emoji_str in emoji_map:
        return emoji_map[emoji_str]
    # If not, split any combined emojis and look up their individual descriptions
    else:
        return ''.join(emoji_map.get(char, '') for char in emoji_str)  # Default to empty string if not in mapping

def extract_and_replace_emojis(df, text_column_name='Text', emoji_column_name='emoji_text'):
    # Initialize an empty column for extracted emojis if a column name is provided
    if emoji_column_name:
        df[emoji_column_name] = ''

    # Function to extract and replace emojis in a text
    def process_text(text):
        emoji_pattern = re.compile(r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U0001FB00-\U0001FBFF\U0001F004]+')

        # Find all emojis in the text using the regex pattern
        emoji_matches = emoji_pattern.findall(text)
        emojis_extracted = ''
        text_with_replaced_emojis = text

        # Iterate over the found emojis
        for emoji_str in emoji_matches:
            # For each emoji in the emoji string
            for emoji_char in emoji_str:
                emoji_desc = emoji_description(emoji_char)  # Get description for individual emoji
                text_with_replaced_emojis = text_with_replaced_emojis.replace(emoji_char, emoji_desc, 1)
                emojis_extracted += emoji_char + ' '  # Add space to separate emojis

        # Return the modified text and the extracted emojis
        return text_with_replaced_emojis, emojis_extracted.strip()

    # Apply the processing function to the specified column and create new columns for text and emojis
    result = df[text_column_name].apply(process_text)
    df[text_column_name] = result.apply(lambda x: x[0])
    
    if emoji_column_name:
        df[emoji_column_name] = result.apply(lambda x: x[1])

    return df
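
One detail of the code above: `process_text` passes emojis to `emoji_description` one character at a time, so the combined keys in the map (such as "💎🙌") are never looked up as a unit, and "💎🙌" produces the description twice. The sketch below shows one possible way to match combined sequences first. It is an alternative under stated assumptions, not the method used in this notebook: the reduced `emoji_map`, `sequence_pattern` and `replace_known_emojis` are illustrative names.

```python
import re

# Reduced, illustrative mapping with a few of the entries used above
emoji_map = {
    "💎🙌": " patient investors, ",
    "💎": " patient investors, ",
    "🙌": " patient investors, ",
    "🚀": " super optimistic, ",
}

# Longest keys first, so that "💎🙌" is tried before "💎" on its own
keys_longest_first = sorted(emoji_map, key=len, reverse=True)
sequence_pattern = re.compile('|'.join(re.escape(k) for k in keys_longest_first))

def replace_known_emojis(text):
    # Each match is a key of emoji_map, so the lookup cannot fail
    return sequence_pattern.sub(lambda m: emoji_map[m.group()], text)

replaced = replace_known_emojis("GME 💎🙌🚀")
print(repr(replaced))
```

With this ordering the pair "💎🙌" yields a single " patient investors, " instead of two. Emojis that are not in the mapping are left in the text by this sketch, whereas the notebook's function removes them.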

In [51]:
# Applying the function to extract and replace emojis from 'Text' column
df_clean = extract_and_replace_emojis(df_clean, text_column_name='Text', emoji_column_name='emoji_text')

In [52]:
# Checking the new column with emojis extracted from the text
df_clean.sample().T
df_clean['emoji_text'].value_counts()

emoji_text
                                                                     4254
🚀 🚀 🚀                                                                  66
🚀                                                                      50
🚀 🚀                                                                    28
🚀 🚀 🚀 🚀                                                                26
                                                                     ... 
🧻 🤚 🏼 🧻 🤚 🏼 🧻 🤚 🏼 🧻 🤚 🏼 💎 🤚 🏼 💎 🤚 🏼 💎 🤚 🏼 💎 🤚 🏼 💎 🚀 🚀 🚀 🚀 🚀 🚀 🚀 🚀       1
💎 🚀 🚀 🌙                                                                 1
🚀 💪 🏋 💎                                                                 1
📝 👋 💎 👏 🚀 🚀 🚀 🌈 🐻 📉 🚀 🚀 🚀 🌕 🔥 🔥                                         1
🚀 🚀 🚀 🖐 💎 🖐 💵 🖐 🍿 🍗 🚀 🦍 🌚 🚀 🚀                                           1
Name: count, Length: 383, dtype: int64

In [53]:
df_clean.sample(3)

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support,emoji_text
823,t3_kwe7q7,t1_gj5gnv5,rustyham,GME. Friday a bunch of puts will be exercised and price goes up more,u,u,
3714,t3_l6er79,t1_gl0b033,raahiv,Pretty sure they blocked GME and AMC,u,u,
631,t3_l8fqua,t1_glcdrl4,Vicous,"Honestly. I know GameStop has been a joke for quite some time but I feel like they've taken their lumps and they can turn things around. And I hope they do because I've met some really nice people at their stores and it's suck to see them all lose their jobs. GameStop should consider being gaming lounges rather than pure retailers and even during this pandemic I bet that'd be damn popular. It'd be like your local comic shop but with the ability to truly deck their stores out with official game and pop culture merch and throw in some sponsored game tournaments, like, some Super Nintendo Land / Pokemon Center shit mixed with a coffee-shop-like setting. Since now we all technically own so much of the company, we may as well be on the board of directors and rally behind this. So yeah, I like this stock.",u,y,


### 2.1.3. Replacing WSB slang with custom made "WSB Dictionary"

In [54]:
# Load the WSB lingo dictionary
wsb_dict_df = pd.read_csv('../data/WSB_dictionary.csv')

# Convert the DataFrame to a dictionary
wsb_dict = dict(zip(wsb_dict_df['WSB lingo'], wsb_dict_df['English']))

# Build the regex once; \b ... \b makes it match only whole words
wsb_pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in wsb_dict.keys()) + r')\b')

# Function to replace WSB lingo with English
def replace_wsb_lingo(text):
    # Replace each occurrence of a lingo term with its English equivalent
    return wsb_pattern.sub(lambda x: wsb_dict[x.group()], text)

# Apply the function to the 'Text' column
df_clean['Text'] = df_clean['Text'].apply(replace_wsb_lingo)
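
The replacement above is case-sensitive and tries the dictionary keys in file order. The following sketch, on a toy dictionary, shows two refinements that may or may not be wanted here: sorting keys longest first, so that a multi-word term wins over a shorter term it contains, and matching regardless of case. The entries in `toy_wsb_dict` are hypothetical and are not taken from `WSB_dictionary.csv`.

```python
import re

# Hypothetical entries for illustration only (keys are lower-case)
toy_wsb_dict = {
    'tendies': 'profits',
    'diamond hands': 'patient investors',
    'diamond': 'gem',
}

# Longest keys first: 'diamond hands' must be tried before 'diamond'
keys = sorted(toy_wsb_dict, key=len, reverse=True)
toy_pattern = re.compile(
    r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b',
    flags=re.IGNORECASE,
)

def replace_toy_lingo(text):
    # Lower-case the match before the lookup, since the keys are lower-case
    return toy_pattern.sub(lambda m: toy_wsb_dict[m.group().lower()], text)

result = replace_toy_lingo("Diamond hands get tendies, not pretendies")
print(result)
```

The word boundary `\b` keeps "pretendies" intact, because "tendies" inside it does not start at a word boundary.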

### 2.1.4. Examples of texts before and after the cleaning steps

In [55]:
original_with_index = df.loc[2026]
print(original_with_index)

clean_with_index = df_clean.loc[2026]
print(clean_with_index)

link_id                                                                                                                                                                                                                                                                                                                                             t3_l6cb1x
parent_id                                                                                                                                                                                                                                                                                                                                           t3_l6cb1x
User                                                                                                                                                                                                                                                                                                        

In [56]:
original_with_index = df.loc[3986]
print(original_with_index)

clean_with_index = df_clean.loc[3986]
print(clean_with_index)

link_id                                            t3_l8ynt4
parent_id                                          t3_l8ynt4
User                                       wowexcellentstuff
Text         did NOT read. $GME to mf Andromeda 🚀🚀🚀🌌🌌\n\n💎🤲💎
Intent                                                     u
Support                                                    y
Name: 3986, dtype: object
link_id                                                                                                                                   t3_l8ynt4
parent_id                                                                                                                                 t3_l8ynt4
User                                                                                                                              wowexcellentstuff
Text          did NOT read. $GME to mf Andromeda  super optimistic,  super optimistic,  super optimistic,   patient investors,  patient investors, 
Intent          

In [57]:
original_with_index = df.loc[3386]
print(original_with_index)

clean_with_index = df_clean.loc[3386]
print(clean_with_index)

link_id                                    t3_kkwy50
parent_id                                  t3_kkwy50
User                                SnooMacarons1548
Text         GME🚀🚀🚀\n\nIt's a money printing company
Intent                                             u
Support                                            y
Name: 3386, dtype: object
link_id                                                                                        t3_kkwy50
parent_id                                                                                      t3_kkwy50
User                                                                                    SnooMacarons1548
Text          GME super optimistic,  super optimistic,  super optimistic,  It's a money printing company
Intent                                                                                                 u
Support                                                                                                y
emoji_text                

## 2.2. Conclusion

In [58]:
# Recording the cleaned dataset as a new csv file to be used in further steps:
# Save the DataFrame to a CSV file
df_clean.to_csv('../data/labelled_dataset_wo_emoji.csv', index=False)
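
One thing to keep in mind when this file is loaded in later notebooks: most rows have an empty string in `emoji_text`, and `pd.read_csv` turns empty fields into `NaN` by default. The sketch below demonstrates this on a toy frame written to an in-memory buffer instead of a file; the values are made up, and whether `NaN` or `''` is preferable downstream is a design choice.

```python
import io
import pandas as pd

# Toy frame with hypothetical values: the second row has no emojis
toy = pd.DataFrame({
    'Text': ['GME super optimistic,', 'hold'],
    'emoji_text': ['🚀', ''],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string
buffer.seek(0)
kept_empty = pd.read_csv(buffer, keep_default_na=False)

print(int(default_read['emoji_text'].isna().sum()))
print(kept_empty['emoji_text'].tolist())
```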