# Identifying Precise Forecasters on r/Wallstreetbets
**BrainStation Data Science Bootcamp - Capstone Project**

**Author: L Gavrilova**

**Date: 6 November 2023**

# Notebook 2 - Labelled dataset - Text Cleaning 

## 2.0. Table of Contents

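1. [Cleaning Text](#2.1.-Cleaning-Text)

Data loading and basic checks <br>
Removing website links, hashtags and mentions <br>
Filtering out emojis by creating a new column <br>
Replacing slang with a custom-made "WSB Dictionary" <br>
Examples of texts before and after the cleaning steps

2. [Conclusion](#2.2.-Conclusion)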

## 2.1. Cleaning Text

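This section loads the labelled dataset, runs basic checks, and cleans the comment text by removing links, hashtags, mentions and newlines, mapping emojis to short text descriptions, and replacing WSB slang with plain English.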

### 2.1.1. Data Loading and Basic Checks

In [45]:
# Standard Libraries for data manipulation
import pandas as pd
import numpy as np

# Regular Expressions Library
import re

# Emoji Handling Library
import emoji

In [46]:
df = pd.read_csv('../data/annotation file 3600 done 1142022.csv')

In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5020 entries, 0 to 5019
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   link_id    5001 non-null   object
 1   parent_id  5001 non-null   object
 2   User       5001 non-null   object
 3   Text       5001 non-null   object
 4   Intent     5001 non-null   object
 5   Support    5001 non-null   object
dtypes: object(6)
memory usage: 235.4+ KB


In [48]:
df.sample(5)

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
1075,t3_l692dj,t3_l692dj,Logical_Psychology55,Gotta tell ya: this is just the beginning. Dont look at the chart and perceive gain or losses as the chart moves. Take care of yourself and put the screen aside. The whole investment world is learning about why not to short a stock with over 100% short interest. A hard lesson that holding GME is teaching everyone. The outcome of this will be legendary.,u,y
1344,t3_kxsd2p,t3_kxsd2p,SideOk3956,"What is dead may never die; rise, GME!",u,y
4271,t3_lsc5ji,t3_lsc5ji,Hurdrs123,"GME only thing making me money today, too bad it’s still on 200 shares I’m 💼 holding from 280 🤡",y,u
573,t3_lacs3f,t3_lacs3f,Kureist,All in on $GME you stupid cunt.,y,y
4783,t3_l6hopd,t1_gl1j9qe,randomperson0284,"Yeah there is a ton of reading involved, but statistically 95% of options expire worthless and people lose money. \n\nThis sub has like over a million new people many of whom think this gamestop movement is a normal market occurrence lol its bizarre to read. I mean i hope you can double or triple what you have, idk. Gl",u,u


In [49]:
df.describe()

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
count,5001,5001,5001,5001,5001,5001
unique,1948,3153,4662,4952,6,4
top,t3_ladzdt,t3_ladzdt,AutoModerator,GME,u,y
freq,66,46,14,22,3246,2473


In [50]:
df['link_id'].nunique() == df.shape[0]

False

In [51]:
df.isna().sum()/df.shape[0]

link_id      0.003785
parent_id    0.003785
User         0.003785
Text         0.003785
Intent       0.003785
Support      0.003785
dtype: float64

In [52]:
df[ df['Text'].isna() ]

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
5001,,,,,,
5002,,,,,,
5003,,,,,,
5004,,,,,,
5005,,,,,,
5006,,,,,,
5007,,,,,,
5008,,,,,,
5009,,,,,,
5010,,,,,,


In [53]:
# Dropping rows that have NaN values
df = df.dropna()

In [54]:
df[ df['Text'].isna() ]

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
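The missing rows shown above are blank in every column, so `dropna()` with its default settings removes exactly those rows here. As a minimal sketch on a made-up frame (the `toy` data is an assumption for illustration only), `how="all"` makes that intent explicit and would keep any row that is only partially labelled:

```python
import numpy as np
import pandas as pd

# Toy data: row 1 is blank in every column, row 2 is missing only its label
toy = pd.DataFrame({
    "Text": ["GME", np.nan, "hold"],
    "Intent": ["y", np.nan, np.nan],
})

# how="all" drops only the rows in which every column is missing
only_blank_rows_removed = toy.dropna(how="all")

# The default (how="any") also drops rows with a single missing value
any_missing_removed = toy.dropna()

print(len(only_blank_rows_removed), len(any_missing_removed))
```

Which variant is preferable depends on whether partially annotated rows should survive into the later notebooks.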


In [55]:
df['Intent'].value_counts()

Intent
u     3246
y      983
m      370
i      318
n       83
 u       1
Name: count, dtype: int64

In [56]:
# Stripping stray whitespace so that ' u' is counted as 'u' in the 'Intent' column
df['Intent'] = df['Intent'].str.strip()
# Checking again:
value_counts = df['Intent'].value_counts()
print(value_counts)

Intent
u    3247
y     983
m     370
i     318
n      83
Name: count, dtype: int64
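A stray label such as `' u'` is easy to miss in a long table of counts. The sketch below, on a small made-up Series, strips whitespace from every label and then compares the result with the set of labels seen in the counts above. The names `labels`, `expected` and `unexpected` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy labels, including one with a leading space
labels = pd.Series(["u", "y", " u", "m", "i", "n"], name="Intent")

# Strip leading and trailing whitespace from every label in one pass
normalised = labels.str.strip()

# Labels observed in the value counts of the annotated dataset
expected = {"u", "y", "m", "i", "n"}

# Anything left over points to a typo or an unknown category
unexpected = set(normalised.unique()) - expected

print(normalised.value_counts().to_dict())
print(unexpected)
```

The same kind of check could be run on the `Support` column, assuming its set of valid labels is known.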


In [57]:
df_clean = df.copy() 

In [58]:
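# Function to clean text
def purge_content(text):
    # Remove URLs (http://, https:// and www. links)
    text_without_urls = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove hashtags
    text_without_hashtags = re.sub(r'#\S+', '', text_without_urls)
    # Remove @mentions
    text_without_mentions = re.sub(r'@\S+', '', text_without_hashtags)
    # Collapse runs of newlines into a single space
    clean_text = re.sub(r'\n+', ' ', text_without_mentions)

    return clean_text

# Apply the cleaner to every row; .apply avoids chained assignment
# (df_clean['Text'][i] = ...), which can trigger SettingWithCopyWarning
df_clean['Text'] = df_clean['Text'].apply(purge_content)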
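To see what these patterns do to a single comment, here is a small self-contained sketch using the same four regular expressions on made-up strings (`sample` and `link_only` are invented for illustration). It also shows one way a row can end up as an empty string: a comment that consists of nothing but a link.

```python
import re

def purge_content(text: str) -> str:
    """Remove URLs, hashtags and @mentions, then flatten newlines to spaces."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'#\S+', '', text)
    text = re.sub(r'@\S+', '', text)
    return re.sub(r'\n+', ' ', text)

sample = "GME to the moon\n\nsee https://example.com/dd #yolo @someone"
link_only = "https://example.com/chart"

cleaned = purge_content(sample)
print(repr(cleaned))                   # removed tokens leave their spaces behind
print(repr(" ".join(cleaned.split()))) # optional whitespace normalisation
print(repr(purge_content(link_only)))  # a link-only comment becomes ''
```

Note that the removed tokens leave their surrounding spaces in place, so a final whitespace normalisation step may be worth adding if downstream tools are sensitive to it.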

In [59]:
pd.set_option('display.max_colwidth', None)

In [60]:
# sanity check
df_clean[df_clean['Text'] == '']

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support
402,t3_l66caa,t1_gkyxvml,EllipticalOrbitMan,,i,i
1264,t3_l0mc06,t1_gju8jei,wolfiasty,,i,i
2187,t3_l6kqyk,t1_gl17oj6,EconomicallyLiterate,,i,i
4157,t3_khq3x2,t1_ggo6fi4,JonBoy82,,i,i
4265,t3_lat43j,t1_glq69gz,Free_Joty,,i,i


In [61]:
# Drop rows where the 'Text' column is an empty string
df_clean = df_clean[df_clean['Text'] != '']

In [62]:
# Recording the cleaned dataset as a new csv file to be used in other notebooks
# (df_clean holds the cleaned text; df is the uncleaned original)
df_clean.to_csv('../data/labelled_dataset_cleaned.csv', index=False)

### 2.1.2. Filtering out `emojis` by creating a new column

In [63]:
# Function to map emojis to their descriptions
def emoji_description(emoji_char):  # not named `emoji`, which would shadow the imported module
    emoji_map = {
        "🚀": " super optimistic, ",
        "🦍": " brotherhood, ",
        "🤞": " hope, ",
        "🌙": " very optimistic, ",
        "🌕": " very optimistic, ",
        "💎🤚🏼": " patient investors, ",
        "💎🖐": " patient investors, ",
        "💎🙌": " patient investors, ",
        "🙌": " patient investors, ",
        "💎": " patient investors, ",
        "🧻🤚🏼": " impatient investors, ",
        "🧻🖐": " impatient investors, ",
        # Add more mappings as needed
    }
    # If the full emoji is in the map, return the description
    if emoji_char in emoji_map:
        return emoji_map[emoji_char]
    # If not, split any combined emojis and look up their individual descriptions
    else:
        return ''.join([emoji_map.get(char, '') for char in emoji_char])  # Default to empty string if not in mapping

def extract_and_replace_emojis(df, text_column_name='Text', emoji_column_name='emoji_text'):
    # Initialize an empty column for extracted emojis if a column name is provided
    if emoji_column_name:
        df[emoji_column_name] = ''

    # Function to extract and replace emojis in a text
    def process_text(text):
        emoji_pattern = re.compile(r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U0001FB00-\U0001FBFF\U0001F004]+')

        # Find all emojis in the text using the regex pattern
        emoji_matches = emoji_pattern.findall(text)
        emojis_extracted = ''
        text_with_replaced_emojis = text

        # Iterate over the found emojis
        for emoji_str in emoji_matches:
            # For each emoji in the emoji string
            for emoji_char in emoji_str:
                emoji_desc = emoji_description(emoji_char)  # Get description for individual emoji
                text_with_replaced_emojis = text_with_replaced_emojis.replace(emoji_char, emoji_desc, 1)
                emojis_extracted += emoji_char + ' '  # Add space to separate emojis

        # Return the modified text and the extracted emojis
        return text_with_replaced_emojis, emojis_extracted.strip()

    # Apply the processing function to the specified column and create new columns for text and emojis
    result = df[text_column_name].apply(process_text)
    df[text_column_name] = result.apply(lambda x: x[0])
    
    if emoji_column_name:
        df[emoji_column_name] = result.apply(lambda x: x[1])

    return df
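One thing to be aware of: `process_text` looks up emojis one code point at a time, so multi-character keys such as `"💎🙌"` are never matched as a unit. The sketch below shows one possible alternative, under the assumption that matching the longest mapped sequence first is the desired behaviour. `EMOJI_MAP`, `EMOJI_SEQ_PATTERN` and `describe_emojis` are hypothetical names, and the mapping is only a subset of the one above.

```python
import re

# Subset of the mapping used above; descriptions are copied unchanged
EMOJI_MAP = {
    "🚀": " super optimistic, ",
    "💎🙌": " patient investors, ",
    "💎": " patient investors, ",
    "🙌": " patient investors, ",
    "🧻🖐": " impatient investors, ",
}

# Longest keys first, so that "💎🙌" wins over "💎" followed by "🙌"
_keys = sorted(EMOJI_MAP, key=len, reverse=True)
EMOJI_SEQ_PATTERN = re.compile("|".join(re.escape(k) for k in _keys))

def describe_emojis(text: str) -> str:
    """Replace every mapped emoji or emoji sequence with its description."""
    return EMOJI_SEQ_PATTERN.sub(lambda m: EMOJI_MAP[m.group()], text)

print(repr(describe_emojis("GME 💎🙌🚀")))
print(repr(describe_emojis("🧻🖐 sold")))
```

With this approach `"💎🙌"` yields a single "patient investors" description rather than two. Unlike the function above, emojis that are not in the map are left in the text instead of being removed.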

In [1]:
# Applying the function to extract and replace emojis from 'Text' column
df_clean = extract_and_replace_emojis(df_clean, text_column_name='Text', emoji_column_name='emoji_text')

In [65]:
# Checking the new column with emojis extracted from the text
df_clean['emoji_text'].value_counts()

emoji_text
                                                                     4254
🚀 🚀 🚀                                                                  66
🚀                                                                      50
🚀 🚀                                                                    28
🚀 🚀 🚀 🚀                                                                26
                                                                     ... 
🧻 🤚 🏼 🧻 🤚 🏼 🧻 🤚 🏼 🧻 🤚 🏼 💎 🤚 🏼 💎 🤚 🏼 💎 🤚 🏼 💎 🤚 🏼 💎 🚀 🚀 🚀 🚀 🚀 🚀 🚀 🚀       1
💎 🚀 🚀 🌙                                                                 1
🚀 💪 🏋 💎                                                                 1
📝 👋 💎 👏 🚀 🚀 🚀 🌈 🐻 📉 🚀 🚀 🚀 🌕 🔥 🔥                                         1
🚀 🚀 🚀 🖐 💎 🖐 💵 🖐 🍿 🍗 🚀 🦍 🌚 🚀 🚀                                           1
Name: count, Length: 383, dtype: int64

In [66]:
df_clean.sample(3)

Unnamed: 0,link_id,parent_id,User,Text,Intent,Support,emoji_text
4781,t3_m3f9fp,t3_m3f9fp,jessekg,Opposite of a ladder attack — the Gamma swarm lol [,i,i,
3853,t3_lnqgz8,t1_go1wkpr,skwolf522,Everytime i think this whole gme saga has peaked.... shit like this happens.,u,y,
306,t3_ladzdt,t3_ladzdt,Wolf_Of_1337_Street,"so many people here are sitting on life-changing profits from GME but they are refusing to take profit because somehow this got spun into class warfare vs. ""the hedge funds."" It's really sad to see people missing out on this money for no reason. GME cannot stay at this level forever.",u,n,


### 2.1.3. Replacing slang with a custom-made "WSB Dictionary"

In [80]:
# Load the WSB lingo dictionary
wsb_dict_df = pd.read_csv('../data/WSB_dictionary.csv')

# Convert the DataFrame to a dictionary
wsb_dict = dict(zip(wsb_dict_df['WSB lingo'], wsb_dict_df['English']))

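# Compile the whole-word pattern once instead of rebuilding it for every row;
# longer entries come first so multi-word lingo is matched before its parts
wsb_keys = sorted(wsb_dict.keys(), key=len, reverse=True)
wsb_pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in wsb_keys) + r')\b')

# Function to replace WSB lingo with English
def replace_wsb_lingo(text):
    # Replace each matched lingo term with its English equivalent
    return wsb_pattern.sub(lambda x: wsb_dict[x.group()], text)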

# Apply the function to the 'Text' column
df_clean['Text'] = df_clean['Text'].apply(replace_wsb_lingo)
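The effect of the `\b` word boundaries is easiest to see on a toy example. The entries in `toy_dict` below are hypothetical and are not taken from `WSB_dictionary.csv`; `replace_lingo` is likewise an illustrative name.

```python
import re

# Hypothetical entries for illustration; the real ones live in WSB_dictionary.csv
toy_dict = {
    "tendies": "profits",
    "DD": "due diligence",
    "diamond hands": "patient investors",
}

# Longest entries first so multi-word phrases are tried before shorter ones
keys = sorted(toy_dict, key=len, reverse=True)
toy_pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b')

def replace_lingo(text: str) -> str:
    """Swap whole-word lingo terms for their plain-English equivalents."""
    return toy_pattern.sub(lambda m: toy_dict[m.group()], text)

result = replace_lingo("Did my DD, diamond hands until tendies. ADD more?")
print(result)
```

Because of the word boundaries, the `DD` inside `ADD` is left alone. The match is case-sensitive, so `Tendies` or `dd` would not be replaced unless the dictionary lists those spellings too.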

### 2.1.4. Examples of texts before and after the cleaning steps

In [79]:
original_with_index = df.loc[2026]
print(original_with_index)

clean_with_index = df_clean.loc[2026]
print(clean_with_index)

link_id      t3_l6cb1x
parent_id    t3_l6cb1x
User

In [77]:
original_with_index = df.loc[3986]
print(original_with_index)

clean_with_index = df_clean.loc[3986]
print(clean_with_index)

link_id                                            t3_l8ynt4
parent_id                                          t3_l8ynt4
User                                       wowexcellentstuff
Text         did NOT read. $GME to mf Andromeda 🚀🚀🚀🌌🌌\n\n💎🤲💎
Intent                                                     u
Support                                                    y
Name: 3986, dtype: object
link_id                                                                                                                                   t3_l8ynt4
parent_id                                                                                                                                 t3_l8ynt4
User                                                                                                                              wowexcellentstuff
Text          did NOT read. $GME to mf Andromeda  super optimistic,  super optimistic,  super optimistic,   patient investors,  patient investors, 
Intent          

In [76]:
original_with_index = df.loc[3386]
print(original_with_index)

clean_with_index = df_clean.loc[3386]
print(clean_with_index)

link_id                                    t3_kkwy50
parent_id                                  t3_kkwy50
User                                SnooMacarons1548
Text         GME🚀🚀🚀\n\nIt's a money printing company
Intent                                             u
Support                                            y
Name: 3386, dtype: object
link_id                                                                                        t3_kkwy50
parent_id                                                                                      t3_kkwy50
User                                                                                    SnooMacarons1548
Text          GME super optimistic,  super optimistic,  super optimistic,  It's a money printing company
Intent                                                                                                 u
Support                                                                                                y
emoji_text                

## 2.2. Conclusion

In [72]:
# Recording the cleaned dataset as a new csv file to be used in future:
# Save the DataFrame to a CSV file
df_clean.to_csv('../data/labelled_dataset_wo_emoji.csv', index=False)

## 2.3. Optional

In [73]:
# Optional, disabled by default: a more aggressive cleaning pass.
# Everything except letters (punctuation, digits) is stripped and whitespace is collapsed.
# NB: emojis are stripped as well!

if False:

    cleaned_df = df.copy()

    # First, replace newline characters with a space so that adjacent words are not glued together
    cleaned_df["Text"] = cleaned_df["Text"].str.replace("\n", " ", regex=False)

    # Then, replace every non-alphabetic character with a space
    cleaned_df["Text"] = cleaned_df["Text"].str.replace(r"[^a-zA-Z]", " ", regex=True)

    # Then, collapse runs of whitespace into a single space
    cleaned_df["Text"] = cleaned_df["Text"].str.replace(r"\s+", " ", regex=True)

    # Finally, strip leading and trailing whitespace
    cleaned_df["Text"] = cleaned_df["Text"].str.strip()

    df = cleaned_df.copy()

In [75]:
# Alternative, more verbose version of purge_content (disabled by default)

if False:
    df_clean = df.copy()

    # Function to clean text
    def purge_content(text):
        # Define patterns for URLs, hashtags, mentions, and newlines
        url_pattern = r'https?://\S+|www\.\S+'
        hashtag_pattern = r'#\S+'
        mention_pattern = r'@\S+'
        newline_pattern = r'\n+'

        # Remove URLs
        purged_text = re.sub(url_pattern, '', text)
        # Remove hashtags
        purged_text = re.sub(hashtag_pattern, '', purged_text)
        # Remove mentions
        purged_text = re.sub(mention_pattern, '', purged_text)
        # Replace runs of newlines with a single space
        purged_text = re.sub(newline_pattern, ' ', purged_text)

        return purged_text

    # Clean the 'Text' column
    df_clean['Text'] = df_clean['Text'].apply(purge_content)