# Text Preprocessing

by Michael Hunziker

## Summary
In this notebook we preprocess our dataset for our downstream NLP tasks.
This preprocessing includes:

*   Determining whether we need to clean the text
*   Cleaning the text
*   Saving the cleaned version

</br>

<a href="https://colab.research.google.com/github/miam-bonbon/assignment-adv-nlp/blob/main/adv_nlp_assignment_mh_01_text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we load the data from the following paper:

@article{naseem2021covidsenti,
  title={COVIDSenti: A Large-Scale Benchmark Twitter Data Set for COVID-19 Sentiment Analysis},
  author={Naseem, Usman and Razzak, Imran and Khushi, Matloob and Eklund, Peter W and Kim, Jinman},
  journal={IEEE Transactions on Computational Social Systems},
  year={2021},
  publisher={IEEE}
}

Let's do some imports

In [None]:
%%capture

!pip install 'fhnw-nlp-utils>=0.8.0,<0.9.0'

from fhnw.nlp.utils.processing import parallelize_dataframe
from fhnw.nlp.utils.processing import is_iterable
from fhnw.nlp.utils.storage import download
from fhnw.nlp.utils.storage import save_dataframe
from fhnw.nlp.utils.storage import load_dataframe


import pandas as pd
import numpy as np

In [None]:
from fhnw.nlp.utils.system import set_log_level
from fhnw.nlp.utils.system import system_info

set_log_level()
print(system_info())

OS name: posix
Platform name: Linux
Platform release: 6.1.85+
Python version: 3.10.12
CPU brand: Intel(R) Xeon(R) CPU @ 2.20GHz
CPU cores: 1
RAM: 12.67GB total and 11.02GB available
Tensorflow version: 2.17.0
GPU is available


In [None]:
# DEV flag: when True, the cleaned data is also saved to Google Drive in the last cell
DEV = True

In [None]:
# !rm -r "data"

download("https://github.com/usmaann/COVIDSenti/raw/refs/heads/main/COVIDSenti.csv", "data/COVIDSenti.csv")
data = load_dataframe("data/COVIDSenti.csv")

In [None]:
print(data.shape)
data.head(3)

(90000, 2)


Unnamed: 0,tweet,label
0,Coronavirus | Human Coronavirus Types | CDC ht...,neu
1,"@shehryar_taseer That’s 💯 true , \nCorona...",neu
2,"TLDR: Not SARS, possibly new coronavirus. Diff...",neg


We have some cleaning to do. First, let's remove the tweets with neutral sentiment.

In [None]:
data = data[data["label"] != "neu"]
print(data.shape)
data.head(3)

(22615, 2)


Unnamed: 0,tweet,label
2,"TLDR: Not SARS, possibly new coronavirus. Diff...",neg
8,@tezuma75 Why #CCP keep on saying unknown caus...,neg
11,I always feel weird hoping for another coronav...,neg


In [None]:
from bs4 import BeautifulSoup
import re

def clean_text(text, keep_punctuation=False):
    """Cleans text by removing html tags, non ascii chars, digits and optionally punctuation

    Parameters
    ----------
    text : str
        The text to clean
    keep_punctuation : bool
        Defines if punctuation should be kept

    Returns
    -------
    str
        The cleaned text
    """
    # remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()

    # remove non-ASCII characters
    text = ''.join(ch for ch in text if ord(ch) < 128)

    # remove literal escape sequences such as "\x80" and leftover "/>"
    # (before digits and punctuation are stripped, otherwise the patterns no longer match)
    text = re.sub(r'\\x[0-9a-fA-F]{2}', '', text)
    text = text.replace('/>', '')

    # remove digits
    text = re.sub(r'\d+', '', text)

    # optionally remove punctuation
    if not keep_punctuation:
        text = re.sub(r'[^\w\s]', '', text)

    # replace line breaks with spaces
    text = text.replace('\n', ' ').replace('\r', ' ')

    return text
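To make the effect of the individual steps visible, here is a minimal, standard-library-only sketch of the main steps applied to a made-up tweet. `clean_text_sketch`, `strip_tags` and the sample string are assumptions for illustration, and `html.parser` stands in for BeautifulSoup, so edge cases may differ from `clean_text` above.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(text):
    """Hypothetical stand-in for BeautifulSoup(...).get_text()."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


def clean_text_sketch(text, keep_punctuation=False):
    text = strip_tags(text)
    # drop everything outside the ASCII range (emoji, curly quotes, ...)
    text = text.encode("ascii", "ignore").decode("ascii")
    # drop digits, including digits inside usernames
    text = re.sub(r"\d+", "", text)
    if not keep_punctuation:
        text = re.sub(r"[^\w\s]", "", text)
    return text.replace("\n", " ").replace("\r", " ")


# made-up sample, not taken from the dataset
sample = "@tezuma75 That\u2019s <b>100</b>% true,\nCorona 2020!"
print(repr(clean_text_sketch(sample, keep_punctuation=True)))
print(repr(clean_text_sketch(sample)))
```

Note that stripping digits also shortens usernames and hashtags, and that removed characters can leave double spaces behind.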

In [None]:
%%time

# drop in case of re-execution
data = data.drop(["cleaned_tweet"], axis=1, errors='ignore')
data = parallelize_dataframe(data, clean_text, field_read="tweet", field_write="cleaned_tweet", keep_punctuation=True)

# Displaying the first few rows of the updated dataframe
print(data.head())

                                                tweet label  \
2   TLDR: Not SARS, possibly new coronavirus. Diff...   neg   
8   @tezuma75 Why #CCP keep on saying unknown caus...   neg   
11  I always feel weird hoping for another coronav...   neg   
16  @KariDebbink @Vineet321 The Frieman Scary Scal...   neg   
18  Crap, a quick blast search suggests the Wuhan ...   neg   

                                        cleaned_tweet  
2   TLDR: Not SARS, possibly new coronavirus. Diff...  
8   @tezuma Why #CCP keep on saying unknown cause ...  
11  I always feel weird hoping for another coronav...  
16  @KariDebbink @Vineet The Frieman Scary Scale m...  
18  Crap, a quick blast search suggests the Wuhan ...  
CPU times: user 2.05 s, sys: 4.92 ms, total: 2.05 s
Wall time: 2.66 s


Let's double-check that no `/>` remnants are left:

In [None]:
data[data["cleaned_tweet"].str.contains("/>", na=False)]

Unnamed: 0,tweet,label,cleaned_tweet


Next, we check that every cleaned tweet still contains at least one letter:

In [None]:
data[~data["cleaned_tweet"].str.contains("[A-Za-z]", na=False)]

Unnamed: 0,tweet,label,cleaned_tweet


In [None]:
for col in data.columns:
    print(col, data[col].isnull().sum())

tweet 0
label 0
cleaned_tweet 0


No empty cells, perfect

As a final check, we confirm once more that no cleaned tweet is left without letters:

In [None]:
data[~data["cleaned_tweet"].str.contains("[A-Za-z]", na=False)]

Unnamed: 0,tweet,label,cleaned_tweet


Let's find non-English text

In [None]:
!pip install fasttext

import fasttext

pretrained_model = "fasttext/supervised-models/lid.176.ftz"
download(url="https://dl.fbaipublicfiles.com/"+pretrained_model, path = pretrained_model)
model = fasttext.load_model(pretrained_model)



In [None]:
def predict_lang(text):
    """Predicts the language of a text

    Uses the global fastText `model` loaded above.

    Parameters
    ----------
    text : str
        The text whose language should be predicted (must not contain line breaks)

    Returns
    -------
    str
        The predicted language (e.g. en, de, fr, it, es, ru, ...)
    """
    predictions = model.predict(text)
    # predictions is a (labels, probabilities) tuple; labels look like "__label__en"
    predicted_language = predictions[0][0].replace("__label__", "")
    return predicted_language
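fastText's `predict` also returns a probability for each label, which could be used to flag uncertain predictions. The following is a minimal sketch of that idea, not part of the original pipeline: `FakeLangModel` is a hypothetical stand-in that only mimics the return shape of the real model, and `predict_lang_with_confidence`, `min_confidence` and `fallback` are illustrative names.

```python
class FakeLangModel:
    """Hypothetical stand-in mimicking the (labels, probabilities) result of fastText."""

    def predict(self, text, k=1):
        # like fastText, refuse input that spans several lines
        if "\n" in text:
            raise ValueError("predict processes one line at a time (remove '\\n')")
        if " der " in f" {text.lower()} ":
            return (("__label__de",), [0.91])
        return (("__label__en",), [0.55])


def predict_lang_with_confidence(text, model, min_confidence=0.0, fallback="unknown"):
    """Returns (language, probability); uses `fallback` below `min_confidence`."""
    labels, probabilities = model.predict(text.replace("\n", " "), k=1)
    probability = float(probabilities[0])
    if probability < min_confidence:
        return fallback, probability
    return labels[0].replace("__label__", ""), probability


fake_model = FakeLangModel()
print(predict_lang_with_confidence("Der Virus", fake_model))
print(predict_lang_with_confidence("stay safe\nplease", fake_model, min_confidence=0.8))
```

With the real model loaded above, the same function could be called with `model` instead of `fake_model`; a suitable threshold would have to be chosen by inspecting the data.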

In [None]:
%%time
data = parallelize_dataframe(data, predict_lang, n_jobs=5, field_read="cleaned_tweet", field_write="lang")

CPU times: user 181 ms, sys: 144 ms, total: 325 ms
Wall time: 1.38 s


In [None]:
data.head(3)

Unnamed: 0,tweet,label,cleaned_tweet,lang
2,"TLDR: Not SARS, possibly new coronavirus. Diff...",neg,"TLDR: Not SARS, possibly new coronavirus. Diff...",en
8,@tezuma75 Why #CCP keep on saying unknown caus...,neg,@tezuma Why #CCP keep on saying unknown cause ...,en
11,I always feel weird hoping for another coronav...,neg,I always feel weird hoping for another coronav...,en


In [None]:
# get a summary of the top 5 languages
print(data["lang"].value_counts().head(5))

lang
en    22516
de       21
es       19
fr        8
it        7
Name: count, dtype: int64
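The counts above can be put in relation to the 22,615 tweets that remained after dropping the neutral class. This is a small arithmetic sketch using only the numbers printed in this notebook; it assumes that every row received exactly one language label, which matches the null check earlier.

```python
total_tweets = 22615  # rows left after removing the neutral class
top5 = {"en": 22516, "de": 21, "es": 19, "fr": 8, "it": 7}

non_english = total_tweets - top5["en"]          # tweets in any other language
beyond_top5 = total_tweets - sum(top5.values())  # tweets outside the five languages shown
english_share = top5["en"] / total_tweets

print(f"non-English tweets: {non_english}")
print(f"tweets outside the top 5 languages: {beyond_top5}")
print(f"share of English tweets: {english_share:.2%}")
```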


Almost all tweets are English (22,516 of 22,615), with a handful in other languages. Let's keep only the English ones.

In [None]:
# keep only the English tweets
data = data[data["lang"] == "en"]

In [None]:
# and recheck
data["lang"].value_counts()

lang,count
en,22516


In [None]:
data.head(3)

Unnamed: 0,tweet,label,cleaned_tweet,lang
2,"TLDR: Not SARS, possibly new coronavirus. Diff...",neg,"TLDR: Not SARS, possibly new coronavirus. Diff...",en
8,@tezuma75 Why #CCP keep on saying unknown caus...,neg,@tezuma Why #CCP keep on saying unknown cause ...,en
11,I always feel weird hoping for another coronav...,neg,I always feel weird hoping for another coronav...,en


We also have to remove all usernames, since they would distort our data.

We might also remove hashtags; let's revisit that later.

In [None]:
import re

def remove_usernames(text):
    """Removes Twitter @usernames from the text"""
    return re.sub(r'@\w+', '', text)

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered frame
data = data.assign(cleaned_tweet=data['cleaned_tweet'].apply(remove_usernames))
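For the open question about hashtags, two options could look like the sketch below: keeping the hashtag word and dropping only the `#`, or removing the whole hashtag. Both helper names are hypothetical and neither is applied to the data in this notebook.

```python
import re


def strip_hashtag_symbol(text):
    """Keeps the hashtag word but drops the leading '#'."""
    return re.sub(r"#(\w+)", r"\1", text)


def remove_hashtags(text):
    """Drops the whole hashtag, including its word."""
    return re.sub(r"#\w+", "", text)


sample = "Why #CCP keep on saying unknown cause"
print(strip_hashtag_symbol(sample))
print(remove_hashtags(sample))
```

Keeping the word preserves potentially sentiment-bearing tokens, while removing the whole hashtag can leave gaps in the sentence; which variant works better would have to be evaluated on the downstream task.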


... and links

In [None]:
import re

def remove_links(text):
    # Match full http(s) URLs as well as a dangling "http" left over from truncated links
    return re.sub(r'https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|http(?=\b)', '', text)

data['cleaned_tweet'] = data['cleaned_tweet'].apply(remove_links)
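The following sketch shows what the pattern above matches on two made-up strings, and adds one optional step that is not part of the original pipeline: collapsing the extra spaces that removed usernames and links may leave behind. `remove_links_and_squeeze` is a hypothetical helper name.

```python
import re

# same pattern as in remove_links; inside the character class, "$-_" is a range
# from "$" to "_" and therefore also covers "/", ":", "?", "=" and similar characters
URL_PATTERN = re.compile(
    r'https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|http(?=\b)'
)


def remove_links_and_squeeze(text):
    """Removes links, then collapses runs of whitespace into single spaces."""
    without_links = URL_PATTERN.sub("", text)
    return " ".join(without_links.split())


full_link = "update https://t.co/abc123 stay safe"  # made-up example
dangling = "read more http"                         # truncated link

print(repr(URL_PATTERN.sub("", full_link)))         # link removed, double space remains
print(repr(remove_links_and_squeeze(full_link)))
print(repr(remove_links_and_squeeze(dangling)))
```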

In [None]:
data[data['cleaned_tweet'].str.contains("http")].head(3)

Unnamed: 0,tweet,label,cleaned_tweet,lang


In [None]:
data.head(3)

Unnamed: 0,tweet,label,cleaned_tweet,lang
2,"TLDR: Not SARS, possibly new coronavirus. Diff...",neg,"TLDR: Not SARS, possibly new coronavirus. Diff...",en
8,@tezuma75 Why #CCP keep on saying unknown caus...,neg,Why #CCP keep on saying unknown cause of pneu...,en
11,I always feel weird hoping for another coronav...,neg,I always feel weird hoping for another coronav...,en


In [None]:
%%time

if DEV:
    # Colab-only import, so it is kept inside the DEV branch
    from google.colab import drive

    # Mount Google Drive
    drive.mount('/content/drive')
    output_file_path = "/content/drive/MyDrive/COVIDSenti_cleaned.parq"  # copy on Google Drive

    # Save the DataFrame to Parquet format
    data.to_parquet(output_file_path)

save_dataframe(data, "data/COVIDSenti_cleaned.parq")  # local copy; we save it to GitHub and load it from there later

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
CPU times: user 512 ms, sys: 30.7 ms, total: 543 ms
Wall time: 2.53 s
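Because `google.colab` only exists inside Colab, the save step could be made portable by detecting the environment first. This is a minimal sketch under that assumption; `running_in_colab` and `choose_output_paths` are hypothetical helpers, and the paths are the ones used in the cell above.

```python
from pathlib import Path


def running_in_colab():
    """Returns True if the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401
    except ImportError:
        return False
    return True


def choose_output_paths(dev, in_colab, filename="COVIDSenti_cleaned.parq"):
    """Always returns the local path; adds the Google Drive path for DEV runs in Colab."""
    paths = [Path("data") / filename]
    if dev and in_colab:
        paths.append(Path("/content/drive/MyDrive") / filename)
    return paths


for path in choose_output_paths(dev=True, in_colab=running_in_colab()):
    print(path)
```

Mounting the drive and writing the Parquet files would then only happen for the paths returned here.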
