### Data Cleaning and Pre-processing 

- Organization of the file data-cleaning-and-pre-processing.ipynb:

Code 1: Imports

Code 2: Read the dataset

Code 3: Corrupted rows (rows without the attribute condition)

Code 4: Function to clean and preprocess text (attribute review)

Code 5: Visualisation of one review (raw and cleaned) at position 7

Code 6: Remove the attribute 'rating' and replace 'review' with the processed text

Code 7: Export the ds to a CSV file

Code 8: Drop the attribute rating from the raw CSV file






In [1]:
# Code 1: Imports

import pandas as pd
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from string import punctuation
import re
import textwrap 

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Home\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Home\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Home\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Home\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Code 2: Read the dataset 

ds = pd.read_csv(r'C:\Users\Home\Desktop\TESE\MEIC-TFM\MEIC-TFM\drugsComTrain_raw_or.csv')
#ds = pd.read_csv(r'C:\Users\Home\Desktop\TESE\MEIC-TFM\MEIC-TFM\drugsComTest_raw.csv')

#ds = ds.head(500)
print(ds.shape)



(161297, 7)


In [4]:
# Code 3: Corrupted rows (rows without the attribute condition)

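# Build the mask once so the count and the filter use exactly the same pattern.
# regex=False matches the text literally (otherwise '.' would match any character).
corrupted_mask = ds['condition'].str.contains("users found this comment helpful.", na=False, regex=False)
num_corrupted_reviews = corrupted_mask.sum()
print("Number of Corrupted Reviews: ", num_corrupted_reviews)

# Removing corrupted rows based on the attribute condition
# (.copy() so that columns can be added later without a SettingWithCopyWarning)
ds = ds[~corrupted_mask].copy()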



Number of Corrupted Reviews:  900
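
The filtering step can be tried in isolation on a toy frame. The sketch below is illustrative only: the three rows are made up, and the only library call it relies on is pandas' `str.contains` with its `na` and `regex` parameters.

```python
import pandas as pd

# Hypothetical toy data: one clean row, one corrupted row, one missing value
toy = pd.DataFrame({
    "condition": [
        "Depression",
        "12 users found this comment helpful.",
        None,
    ]
})

pattern = "users found this comment helpful."

# regex=False treats the pattern as plain text;
# na=False makes missing values count as non-matches instead of propagating NaN
mask = toy["condition"].str.contains(pattern, na=False, regex=False)

num_corrupted = int(mask.sum())
cleaned = toy[~mask]

print(num_corrupted)
print(len(cleaned))
```

Note that the row with a missing condition is kept by this filter; only rows whose condition contains the marker text are dropped.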


In [24]:
# Code 4: Function to clean and preprocess text (attribute review)

# Function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun

# Build these once rather than on every call to preprocess_text
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Strip numeric HTML character references such as &#039; (they are removed, not decoded)
    text = re.sub(r'&#[0-9]+;', '', text)

    # Remove punctuation
    text = ''.join(char for char in text if char not in punctuation)

    # Tokenization
    tokens = word_tokenize(text)

    # Removing stopwords
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]

    # Applying POS tagging
    pos_tagged_tokens = pos_tag(filtered_tokens)

    # Lemmatization with POS tags ('tag' avoids shadowing the imported pos_tag function)
    lemmatized_tokens = [LEMMATIZER.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in pos_tagged_tokens]

    # Return the processed tokens as a string
    return ' '.join(lemmatized_tokens)

# Apply the preprocessing function to every value in the 'review' column
ds['processed_review'] = ds['review'].apply(preprocess_text)

# 'ds' now has an additional column 'processed_review' with the cleaned text
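
The regular expression in `preprocess_text` only matches numeric references such as `&#039;`. A named entity such as `&amp;` would pass through and, once punctuation is removed, leave the residue `amp`. The following sketch compares that behaviour with decoding entities first via the standard library's `html.unescape`. It is an alternative to consider, not what the notebook does; the sample string, both function names, and the whitespace collapsing are assumptions added for illustration.

```python
import html
import re
from string import punctuation


def strip_numeric_refs(text):
    """Adapted from the notebook: drop numeric references, then punctuation."""
    text = re.sub(r"&#[0-9]+;", "", text)
    text = "".join(ch for ch in text if ch not in punctuation)
    return " ".join(text.split())  # collapse repeated spaces (added for readability)


def decode_entities(text):
    """Alternative: decode every HTML entity first, then drop punctuation."""
    text = html.unescape(text)
    text = "".join(ch for ch in text if ch not in punctuation)
    return " ".join(text.split())


sample = "I&#039;m fine &amp; well"  # made-up sample text
print(strip_numeric_refs(sample))  # Im fine amp well
print(decode_entities(sample))     # Im fine well
```

For numeric references the two approaches give the same result here, since the decoded apostrophe is removed as punctuation anyway; they differ only when named entities are present.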

 

In [27]:
# Code 5: Visualisation of one review (raw and cleaned) at position 7 (the eighth row)
long_review1 = ds['review'].values[7] 

long_review2 = ds['processed_review'].values[7] 

# Print a wrapped text for better visualisation
print('Raw review:')
wrapped_text = textwrap.fill(long_review1, width=100)
print(wrapped_text)
print('Processed review:')
wrapped_text = textwrap.fill(long_review2, width=100)
print(wrapped_text)

Raw review:
"Abilify changed my life. There is hope. I was on Zoloft and Clonidine when I first started Abilify
at the age of 15.. Zoloft for depression and Clondine to manage my complete rage. My moods were out
of control. I was depressed and hopeless one second and then mean, irrational, and full of rage the
next. My Dr. prescribed me 2mg of Abilify and from that point on I feel like I have been cured
though I know I&#039;m not.. Bi-polar disorder is a constant battle. I know Abilify works for me
because I have tried to get off it and lost complete control over my emotions. Went back on it and I
was golden again.  I am on 5mg 2x daily. I am now 21 and better than I have ever been in the past.
Only side effect is I like to eat a lot."
Processed review:
abilify change life hope zoloft clonidine first start abilify age 15 zoloft depression clondine
manage complete rage mood control depress hopeless one second mean irrational full rage next dr
prescribe 2mg abilify point feel like cure t

In [28]:
# Code 6: Remove the attribute 'rating' and update the attribute 'review'

ds['review'] = ds['processed_review']
ds.drop(columns=['processed_review'], inplace=True)
ds.drop(columns=['rating'], inplace=True)


In [29]:
# Code 7: Export the ds to a CSV file
csv_path = r'C:\Users\Home\Desktop\TESE\MEIC-TFM\MEIC-TFM\drugsComTrain_cleaned.csv'
#csv_path = r'C:\Users\Home\Desktop\TESE\MEIC-TFM\MEIC-TFM\drugsComTest_cleaned.csv'

ds.to_csv(csv_path, index=False)
print(f"DataFrame exported to '{csv_path}' successfully.")


DataFrame exported to 'C:\Users\Home\Desktop\TESE\MEIC-TFM\MEIC-TFM\drugsComTrain_cleaned.csv' successfully.
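
One thing worth checking after the export is how empty reviews survive a round trip. A review that consisted only of stopwords and punctuation may come out of `preprocess_text` as an empty string, and pandas reads an empty CSV field back as NaN by default. The sketch below shows this on a hypothetical two-row frame written to an in-memory buffer; the column values are made up and `drugName` is used only as an example column name.

```python
import io

import pandas as pd

# Hypothetical frame; the second review is empty after cleaning
toy = pd.DataFrame({
    "drugName": ["Abilify", "Zoloft"],
    "review": ["abilify change life", ""],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns the empty field into NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["review"].isna().tolist())
print(literal_read["review"].tolist())
```

Whether to keep such rows, drop them, or read them back with `keep_default_na=False` depends on what the downstream code expects.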


In [33]:
# Code 8: Drop the attribute rating from the raw CSV file

ds = pd.read_csv(r'C:\Users\Home\Desktop\TESE\MEIC-TFM\MEIC-TFM\drugsComTrain_raw_or.csv')
ds.drop(columns=['rating'], inplace=True)

csv_path = r'C:\Users\Home\Desktop\TESE\MEIC-TFM\MEIC-TFM\drugsComTrain_raw.csv'
#csv_path = r'C:\Users\Home\Desktop\TESE\MEIC-TFM\MEIC-TFM\drugsComTest_raw.csv'

ds.to_csv(csv_path, index=False)
print(f"DataFrame exported to '{csv_path}' successfully.")

DataFrame exported to 'C:\Users\Home\Desktop\TESE\MEIC-TFM\MEIC-TFM\drugsComTrain_raw.csv' successfully.
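
Codes 7 and 8 switch between the train and test files by commenting lines in and out. One possible way to avoid that is a small helper that builds the file names from the split name. The sketch below is only a suggestion: `dataset_paths` and the `C:\data` directory are hypothetical, and it assumes both splits follow the `drugsCom<Split>_raw.csv` / `drugsCom<Split>_cleaned.csv` naming seen above.

```python
from pathlib import PureWindowsPath


def dataset_paths(base_dir, split):
    """Return (raw, cleaned) paths for a split such as 'Train' or 'Test'."""
    base = PureWindowsPath(base_dir)
    raw = base / f"drugsCom{split}_raw.csv"
    cleaned = base / f"drugsCom{split}_cleaned.csv"
    return raw, cleaned


# 'C:\data' is a placeholder directory, not the path used in the notebook
raw_path, cleaned_path = dataset_paths(r"C:\data", "Test")
print(raw_path)
print(cleaned_path)
```

`PureWindowsPath` is used here only so the sketch behaves the same on any operating system; inside the notebook, `pathlib.Path` would be the more natural choice.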
