### NLP Pipeline

Let's start by importing the basic libraries:

In [1]:
import os

import numpy as np
import pandas as pd
# import tensorflow as tf

import warnings
warnings.filterwarnings("ignore")

Now let's read the training data into a DataFrame:

In [2]:
train_df = pd.read_csv('./data/drug_review_train.csv')
train_df.head()

Unnamed: 0.1,Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,0,89879,Cyclosporine,keratoconjunctivitis sicca,"""i have used restasis for about a year now and...",2.0,"April 20, 2013",69,147
1,1,143975,Etonogestrel,birth control,"""my experience has been somewhat mixed. i have...",7.0,"August 7, 2016",4,136
2,2,106473,Implanon,birth control,"""this is my second implanon would not recommen...",1.0,"May 11, 2016",6,140
3,3,184526,Hydroxyzine,anxiety,"""i recommend taking as prescribed, and the bot...",10.0,"March 19, 2012",124,104
4,4,91587,Dalfampridine,multiple sclerosis,"""i have been on ampyra for 5 days and have bee...",9.0,"August 1, 2010",101,74
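
The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above are typical of a DataFrame index that was written to CSV and read back, possibly twice. How this particular file was produced is an assumption. A minimal sketch with toy data (not the real dataset) shows the effect and how `index=False` avoids it:

```python
import io
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({'patient_id': [89879, 143975], 'rating': [2.0, 7.0]})

# Default behaviour: the index is written as an unnamed first column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))      # ['Unnamed: 0', 'patient_id', 'rating']

# With index=False the round trip preserves the original columns.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)
print(list(without_index.columns))   # ['patient_id', 'rating']
```

If the file cannot be regenerated, dropping the extra columns after loading is another option.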


Installing the libraries needed for preprocessing:

In [ ]:
!python.exe -m pip install --upgrade pip
!pip install nltk --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install contractions --upgrade --quiet

!pip install unidecode --upgrade --quiet
!pip install textblob --upgrade --quiet
!pip install pyspellchecker --upgrade --quiet

Let's create a new DataFrame for the preprocessed data:

In [9]:
prep_df = pd.DataFrame()

prep_df['patient_id'] = train_df['patient_id']
prep_df['review'] = train_df['review']
prep_df['drugName'] = train_df['drugName'].apply(lambda x: x.lower())

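Relabel the rating column into three categories (Negative, Neutral, Positive). Note that `prep_df.head()` further down shows no `rating_category` column: judging by the cell numbers, this cell ran before `prep_df` was re-created, so re-run it afterwards if the labels are needed.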

In [4]:
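def relabel_rating(rating):
    """Map a 0-10 rating to a sentiment label.

    Assumes whole-number ratings; anything else (e.g. 4.5 or NaN) returns None.
    """
    if 0 <= rating <= 4:
        return 'Negative'
    elif 5 <= rating <= 7:
        return 'Neutral'
    elif 8 <= rating <= 10:
        return 'Positive'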

prep_df['rating_category'] = train_df['rating'].apply(relabel_rating)
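
The thresholds in `relabel_rating` only line up for whole-number ratings; a value such as 4.5 falls between the branches and yields `None`. As a hedged alternative, the same bins can be written as a table of upper bounds so that fractional values are covered too. The name `relabel_rating_binned` is hypothetical and not part of the notebook:

```python
from bisect import bisect_left

# Inclusive upper bounds of the Negative and Neutral bins; above 7 is Positive.
UPPER_BOUNDS = [4, 7]
LABELS = ['Negative', 'Neutral', 'Positive']

def relabel_rating_binned(rating):
    """Hypothetical table-driven variant of relabel_rating."""
    # Out-of-range values and NaN fail this comparison and map to None.
    if not 0 <= rating <= 10:
        return None
    return LABELS[bisect_left(UPPER_BOUNDS, rating)]

print([relabel_rating_binned(r) for r in (1.0, 4.0, 4.5, 7.0, 8.0, 10.0)])
# ['Negative', 'Negative', 'Neutral', 'Neutral', 'Positive', 'Positive']
```

For whole-number ratings this agrees with the function above; which bin a fractional rating should fall into is a design choice, not something the source specifies.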

Now let's import libraries for text preprocessing:

In [27]:
import re
import string
import contractions

from unidecode import unidecode

import nltk
from nltk.corpus import stopwords

from nltk.corpus import wordnet
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

from bs4 import BeautifulSoup
from textblob import TextBlob

nltk.download("stopwords")
sw_nltk = stopwords.words('english')
nltk.download("wordnet")
nltk.download('averaged_perceptron_tagger_eng')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Khrystyna\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Khrystyna\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Khrystyna\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


These are functions for lemmatization with POS tagging.

This step is useful with vectorization techniques such as TF-IDF or Word2Vec.

It is NOT intended for Transformers such as BERT.

In [6]:
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


# Lemmatize with POS tagging
def lemmatize_with_pos(text):
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    lemmatized_words = [
        lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags
    ]
    return ' '.join(lemmatized_words)
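
The POS mapping matters because `WordNetLemmatizer.lemmatize` treats a word as a noun unless told otherwise, so verbs and adjectives would often be left unchanged. The sketch below illustrates only the tag mapping, without NLTK, assuming the WordNet constants are the single-letter strings `'a'`, `'v'`, `'n'` and `'r'`. The name `penn_to_wordnet` is hypothetical:

```python
# Assumption: wordnet.ADJ == 'a', wordnet.VERB == 'v',
# wordnet.NOUN == 'n', wordnet.ADV == 'r'.
PENN_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def penn_to_wordnet(tag):
    """Hypothetical dict-based equivalent of get_wordnet_pos; defaults to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], 'n')

# Penn Treebank tags of the kind pos_tag emits.
for tag in ('JJ', 'VBD', 'NNS', 'RB', 'IN'):
    print(tag, '->', penn_to_wordnet(tag))
```

Tags outside the four groups, such as `IN` for prepositions, fall back to noun, matching the `else` branch of `get_wordnet_pos`.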

Let's implement a class that bundles all the text preprocessing steps: lowercasing, removal of numbers, punctuation and HTML tags, diacritics replacement, contraction expansion, spellchecking, stopword removal and lemmatization. Each method returns `self`, so the steps can be chained.

In [37]:
class Pipeline:
    def __init__(self, X):
        self.X = X
        
    def to_lower(self):
        # Let's check if first element is a list
        if isinstance(self.X.iloc[0], list):     
            self.X = self.X.apply(lambda tokens: [token.lower() for token in tokens])
        else:
            self.X = self.X.apply(lambda x: x.lower())
        print("Lowercase done")
        return self
    
    def remove_numbers(self):
        if isinstance(self.X.iloc[0], list):
            self.X = self.X.apply(lambda tokens: [re.sub(r'\d+', '', token) for token in tokens])
        else:
            self.X = self.X.apply(lambda x: re.sub(r'\d+', '', x))
        print("Numbers removal done")
        return self

    def remove_dots(self):
        if isinstance(self.X.iloc[0], list):     
            self.X = self.X.apply(lambda tokens: [re.sub("[.]", "", token) for token in tokens])
        else:
            self.X = self.X.apply(lambda x: re.sub("[.]", "", x))
        print("Dots removal done")
        return self
    
    def remove_punctuation(self):
        # '!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~' 32 punctuations in python string module
        if isinstance(self.X.iloc[0], list):     
            self.X = self.X.apply(lambda tokens: [re.sub('[%s]' % re.escape(string.punctuation), '', token) for token in tokens])
        else:
            self.X = self.X.apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))
        print("Punctuation removal done")
        return self
    
    def remove_multi_whitespace(self):
        if isinstance(self.X.iloc[0], list):     
            self.X = self.X.apply(lambda tokens: [re.sub(' +', ' ', token) for token in tokens])
        else:
            self.X = self.X.apply(lambda x: re.sub(' +', ' ', x))
        print("Multi whitespaces removal done")
        return self
    
    def expand_contractions(self):
        if isinstance(self.X.iloc[0], list):
            self.X = self.X.apply(
                lambda tokens: [contractions.fix(str(token)) for token in tokens if isinstance(token, str)]
            )
        else: 
            self.X = self.X.apply(
                lambda x: " ".join([contractions.fix(str(word)) for word in x.split() if isinstance(word, str)])
            )
        print("Contractions expand done")
        return self

    # Is this step usable for current dataset?
    def remove_html_tags(self):
        self.X = self.X.apply(
            lambda x: BeautifulSoup(x, 'html.parser').get_text())
        print("HTML tags removal done")
        return self

    def replace_diacritics(self):
        def process_tokens(tokens):
            try:
                return [unidecode(str(token)) for token in tokens]
            except Exception as e:
                print(f"Error processing tokens: {tokens}. Error: {e}")
                return tokens
    
        if isinstance(self.X.iloc[0], list):
            self.X = self.X.apply(lambda tokens: process_tokens(tokens) if isinstance(tokens, list) else tokens)
        else:
            self.X = self.X.apply(lambda x: unidecode(str(x)) if isinstance(x, str) else str(x))
        
        print("Diacritics replacement done")
        return self
    
    def spellcheck(self):
        self.X = self.X.apply(lambda tokens: ' '.join(tokens))
        self.X = self.X.apply(lambda x: str(TextBlob(x).correct()))  
        self.X = self.X.apply(lambda x: x.split()) 
        print("Spellcheck done")
        return self
    
    # Will NOT be used for Transformers
    def remove_stopwords(self):
        # Possible to add custom stopwords
        # new_stopwords = ['drugs']
        # sw_nltk.extend(new_stopwords)
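        # Possible to keep words that are stopwords by default, e.g. 'not'.
        # A local set is built so the global sw_nltk list is not mutated
        # (sw_nltk.remove('not') would raise ValueError on a second call).
        stop_words = set(sw_nltk) - {'not'}
        if isinstance(self.X.iloc[0], list):
            self.X = self.X.apply(lambda tokens: [word for word in tokens if word not in stop_words])
        else:
            self.X = self.X.apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))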
        print("Stopwords removal done")
        return self
    
    # Will NOT be used for Transformers
    def lemmatize(self):
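        # lemmatize_with_pos expects a string, so token lists are joined first
        self.X = self.X.apply(
            lambda x: lemmatize_with_pos(" ".join(x) if isinstance(x, list) else x)
        )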
        print("Lemmatization done")
        return self
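
One pitfall worth illustrating when customizing a shared stop-word list: calling `list.remove('not')` on it more than once raises `ValueError`, because the word is already gone after the first call. The sketch below uses a toy list rather than the NLTK one, and both function names are hypothetical:

```python
# Toy stand-in for the NLTK stop-word list (illustrative only).
toy_stopwords = ['i', 'have', 'not', 'the']

def remove_stopwords_inplace(tokens):
    toy_stopwords.remove('not')                  # mutates the shared list
    return [t for t in tokens if t not in toy_stopwords]

def remove_stopwords_safe(tokens, keep=('not',)):
    stop_set = set(toy_stopwords) - set(keep)    # fresh set on every call
    return [t for t in tokens if t not in stop_set]

tokens = ['i', 'have', 'not', 'slept']

# The set-based version can be called any number of times.
print(remove_stopwords_safe(tokens))     # ['not', 'slept']
print(remove_stopwords_safe(tokens))     # ['not', 'slept']

# The in-place version works once, then fails.
print(remove_stopwords_inplace(tokens))  # ['not', 'slept']
try:
    remove_stopwords_inplace(tokens)
    second_call_failed = False
except ValueError:
    second_call_failed = True
print('second in-place call failed:', second_call_failed)
```

A set also makes each membership test constant-time, which may help on a dataset of this size.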

#### !!! Important !!!

1. Simple word-vectorization techniques such as TF-IDF and Word2Vec benefit from lemmatization.
2. Topic modeling benefits from lemmatization.
3. Sentiment analysis can sometimes be hurt by lemmatization, and certainly by the removal of certain stop words.
4. It has been observed empirically that lemmatizing sentences degrades the accuracy of pre-trained language models such as BERT.

Source: [Elegant Text Pre-Processing with NLTK in sklearn Pipeline](https://towardsdatascience.com/elegant-text-pre-processing-with-nltk-in-sklearn-pipeline-d6fe18b91eb8)

Now let's preprocess our data and save it to a .csv file:

In [38]:
text_preprocessor = Pipeline(train_df['review'].apply(lambda x: x.split()))

prep_df['review'] = text_preprocessor.to_lower().remove_numbers().remove_dots().remove_punctuation().remove_multi_whitespace().X

Lowercase done
Numbers removal done
Dots removal done
Punctuation removal done
Multi whitespaces removal done
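
To make the effect of these five steps concrete, here is a minimal trace on a single made-up review, using the same regular expressions as the `Pipeline` methods but on a plain Python list instead of a pandas Series:

```python
import re
import string

# Made-up sample sentence, split on whitespace like the reviews above.
tokens = "I took 2 pills... It's GREAT!".split()

tokens = [t.lower() for t in tokens]                 # to_lower
tokens = [re.sub(r'\d+', '', t) for t in tokens]     # remove_numbers
tokens = [re.sub('[.]', '', t) for t in tokens]      # remove_dots
punct = '[%s]' % re.escape(string.punctuation)
tokens = [re.sub(punct, '', t) for t in tokens]      # remove_punctuation
tokens = [re.sub(' +', ' ', t) for t in tokens]      # remove_multi_whitespace
print(tokens)      # ['i', 'took', '', 'pills', 'its', 'great']

# Optional extra step (not in the original pipeline): drop empty tokens.
non_empty = [t for t in tokens if t]
print(non_empty)   # ['i', 'took', 'pills', 'its', 'great']
```

Two things stand out in this sketch: a token that consisted only of digits survives as an empty string, and `it's` has already become `its` before any contraction handling takes place. Since `string.punctuation` contains the dot, `remove_dots` is also covered by `remove_punctuation`.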


#### !!! Important !!!
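Check the order of the following steps: `remove_punctuation` has already stripped apostrophes by this point, so `expand_contractions` sees tokens such as `dont` rather than `don't`.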

In [40]:
prep_df['review'] = text_preprocessor.expand_contractions().replace_diacritics().X
# .spellcheck() is left out of the chain: it takes too long to run

Contractions expand done
Diacritics replacement done
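
Why the order matters can be sketched with a toy contraction map. This is a hypothetical two-entry dictionary, not the `contractions` library, and whether that library also recognizes apostrophe-less forms is not checked here:

```python
import re
import string

# Hypothetical stand-in for a real contractions dictionary.
TOY_CONTRACTIONS = {"don't": "do not", "it's": "it is"}
PUNCT = '[%s]' % re.escape(string.punctuation)

def expand(tokens):
    return [TOY_CONTRACTIONS.get(t, t) for t in tokens]

def strip_punct(tokens):
    return [re.sub(PUNCT, '', t) for t in tokens]

tokens = "i don't think it's working".split()

# Punctuation first: the apostrophes are gone, so nothing matches the map.
punct_first = expand(strip_punct(tokens))
print(punct_first)    # ['i', 'dont', 'think', 'its', 'working']

# Contractions first: both forms are expanded before punctuation is stripped.
expand_first = strip_punct(expand(tokens))
print(expand_first)   # ['i', 'do not', 'think', 'it is', 'working']
```

Under this assumption, expanding contractions before removing punctuation would be the safer order.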


In [41]:
prep_df.head()

Unnamed: 0,patient_id,review,drugName
0,89879,i have used restasis for about a year now and ...,cyclosporine
1,143975,my experience has been somewhat mixed i have b...,etonogestrel
2,106473,this is my second implanon would not recommend...,implanon
3,184526,i recommend taking as prescribed and the bottl...,hydroxyzine
4,91587,i have been on ampyra for days and have been s...,dalfampridine


In [45]:
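# index=False keeps an extra "Unnamed: 0" column from appearing when the file is read back
prep_df.to_csv('prep_data/drug_review_train_prep.csv', index=False)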

In [46]:
prep_df['review'] = text_preprocessor.remove_stopwords().lemmatize().X

Stopwords removal done
Lemmatization done


In [47]:
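# index=False keeps an extra "Unnamed: 0" column from appearing when the file is read back
prep_df.to_csv('prep_data/drug_review_train_prep_full.csv', index=False)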