#   NATURAL LANGUAGE PROCESSING PROJECT
###   Text Data Analysis and Preprocessing
-   Overview
    -   This project applies Natural Language Processing (NLP) to customer reviews. The primary objective is to analyze the reviews and extract insights that can inform decisions and strategy.

-   The Project Scope encompasses several key NLP tasks and techniques:

    -   Text Cleaning: We begin by preprocessing the raw review data. This includes removing unnecessary elements such as punctuation, HTML tags, URLs, emojis, and stop words. This step reduces noise and standardizes the text, which makes it easier to analyze.

    -   Text Analysis: Through various methods, including word count analysis and stop word frequency analysis, we seek to understand the patterns and trends in the text data. We also examine the occurrence of URLs in reviews, which can signify external references or resources.

    -   Data Visualization: Leveraging tools like matplotlib and seaborn, we visualize our findings through graphs and charts. This not only aids in better comprehension of the data but also highlights key patterns and anomalies that might warrant further investigation.

    -   Lemmatization: We apply lemmatization, which reduces words to their base or dictionary form (the lemma). Unlike stemming, which strips suffixes by rule, lemmatization maps each word to a valid dictionary entry, giving a more accurate and meaningful representation.

-   Import the necessary libraries.

In [None]:
# Import libraries
import re
import warnings
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used in this notebook
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

plt.style.use('ggplot')
warnings.filterwarnings("ignore")

-   Load the Trustpilot reviews dataset for analysis.

In [None]:
# Load the dataset
file_path = "C:/Users/Oby/Desktop/Data Science Portfolio/trust_pilot_reviews_data_2022_06.csv"
data = pd.read_csv(file_path)
data.head(5)

-   Get a basic understanding of the dataset by exploring its size, shape, and overall structure.

In [None]:
# Get a basic understanding of the data's size, shape, and structure:
# column types and non-null counts first, then the (rows, columns) shape
data.info()
print(data.shape)


-   Remove columns that are not needed for the analysis. This simplifies the dataset and makes it easier to work with.

In [None]:
# Remove unwanted columns from the original dataset
data.drop(['company_url', 'trustpilot_url', 'description', 'author_name', 'reviewed_at', 'name', 'scraped_at', 'uniq_id'], axis=1, inplace=True)

-   Check for missing values.

In [None]:
# Check for missing values in the dataset
data.isna().sum()
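
As a rough illustration of what this check reports, the sketch below runs it on a made-up three-row frame (the values are invented, not taken from the dataset). It also shows one possible way to handle missing text, filling it with empty strings; that choice is an assumption for illustration, not a step this notebook performs.

```python
import pandas as pd

# Toy frame standing in for the reviews data (values are made up for illustration)
sample = pd.DataFrame({
    "review_title": ["Great", None, "Poor support"],
    "review_text": ["Fast delivery", "Never arrived", None],
    "rating": [5, 1, 2],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# One option (an assumption, not the notebook's method): treat missing text as empty
filled = sample.fillna({"review_title": "", "review_text": ""})
print(filled.isna().sum().sum())
```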

### Analyzing the Distribution of Ratings
-   Use the value_counts() method on the 'rating' column to understand the distribution of customer ratings in our dataset. This method provides a count of the number of occurrences of each unique value in the column, which in this case corresponds to the different ratings given by customers.

In [None]:
# Get the value counts of the 'rating' column, ordered by rating rather than by frequency
rating_counts = data['rating'].value_counts().sort_index()

# Create a bar chart
plt.figure(figsize=(10, 6))
rating_counts.plot(kind='bar', color='green')

# Adding titles and labels
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')

# Show the plot
plt.show()

### Framing a Binary Classification Problem
-   Treat this as a binary classification problem: ratings from 1 to 3 are labeled 'Bad' and ratings of 4 or 5 are labeled 'Good'.

In [None]:
# Create a new column 'sentiment' based on the rating values
data['sentiment'] = data['rating'].apply(lambda x: 'Bad' if x <= 3 else 'Good')

# Now, 'data' includes a 'sentiment' column with "Bad" or "Good" labels
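
To make the threshold concrete, here is a minimal sketch on a made-up frame containing one row per rating (the frame and the names `ratings`, `labels` and `share` are illustrative assumptions). It applies the same rule and then uses `value_counts(normalize=True)` to express the split as proportions.

```python
import pandas as pd

# One row per possible rating (toy data, not the real reviews)
ratings = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})

# Same rule as above: 1 to 3 is 'Bad', 4 and 5 are 'Good'
ratings["sentiment"] = ratings["rating"].apply(lambda x: "Bad" if x <= 3 else "Good")
labels = ratings["sentiment"].tolist()

# Proportion of each label instead of raw counts
share = ratings["sentiment"].value_counts(normalize=True)
print(labels)
print(share)
```

Note that a rating of exactly 3 falls on the 'Bad' side of the boundary.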


### Visualizing the Sentiment Distribution
-   Visualize the distribution of the two sentiment classes in the dataset. The sentiment distribution shows the overall trend in the data, such as whether 'Good' or 'Bad' reviews are more prevalent and how imbalanced the two classes are.

In [None]:
# Count the occurrences of each sentiment
sentiment_counts = data['sentiment'].value_counts()

# Create a pie chart
plt.figure(figsize=(8, 5))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=140)

# Add a title
plt.title('Sentiment Distribution')

# Show the plot
plt.show()


## REVIEW TEXT ANALYSIS
-   Data exploration and visualisation: Explore the dataset and understand the context behind each sentiment category

-   Group the dataset by the 'sentiment' column and then display a select number of review texts from each group. This approach helps to get a qualitative feel for the data, providing insights into the nature and tone of reviews associated with different sentiments.

In [None]:
# Group the dataset by "sentiment"
grouped = data.groupby('sentiment')

# Define the number of reviews to display for each sentiment
reviews_to_display = 5

# Display the first 5 review texts for each sentiment
for sentiment, group in grouped:
    print(f"Sentiment: {sentiment}\n")
    for i, review_text in enumerate(group['review_text']):
        if i >= reviews_to_display:
            break  # Stop after displaying the first 5 reviews
        print(review_text)
    print("\n")


In [None]:
# Get the average number of words in the review text
totalreviews = list(data['review_text'])
length = []
for review in totalreviews:
    # str() guards against non-string values such as NaN
    length.append(len(str(review).split()))

print("On average, a review has about", sum(length) / len(length), "words")


In [None]:
# Get the average number of words in the review title
totalreviews = list(data['review_title'])
length = []
for title in totalreviews:
    # str() guards against non-string values such as NaN
    length.append(len(str(title).split()))

print("On average, a review title has about", sum(length) / len(length), "words")

-   Visualise the variation in the length of reviews across different sentiment categories. Analyzing the number of words per review can provide insights into whether certain sentiments are associated with longer or shorter reviews.

In [None]:
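# Count words per review; missing reviews count as zero words instead of raising an error
data['words per review'] = data['review_text'].fillna('').astype(str).str.split().str.len()
data.boxplot('words per review', by='sentiment', grid=False, showfliers=False, color='blue')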
plt.suptitle('')
plt.xlabel('')
plt.show()
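
A numeric summary can complement the box plot. The sketch below, run on a made-up four-row frame (the texts and the name `median_words` are illustrative assumptions), computes the median word count per sentiment with `groupby`.

```python
import pandas as pd

# Toy reviews, including one missing text (values are made up)
toy = pd.DataFrame({
    "review_text": [
        "Great service",
        "Item never arrived and support ignored me",
        None,
        "Love it",
    ],
    "sentiment": ["Good", "Bad", "Bad", "Good"],
})

# Missing text becomes an empty string, which splits into zero words
toy["words per review"] = toy["review_text"].fillna("").str.split().str.len()

# Median word count per sentiment class
median_words = toy.groupby("sentiment")["words per review"].median()
print(median_words)
```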

### Word Count Analysis for Review Text
-   Analyze the word count in review texts and compare how they vary between 'Good' and 'Bad' sentiments. This analysis can reveal whether there is a tendency for reviews with a particular sentiment to be longer or shorter in terms of word count. 

In [None]:
# WORD COUNT ANALYSIS FOR REVIEW TEXT
# Function to count words in a text sample
def count_words(text):
    # str() guards against non-string values such as NaN
    return len(str(text).split())

def plot_count(count_ones, count_zeros, title_1, title_2, subtitle):
    # Left panel: count_ones with title_1; right panel: count_zeros with title_2
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(count_ones, kde=True, stat='density', ax=ax1, color='blue')
    ax1.set_title(title_1)
    sns.histplot(count_zeros, kde=True, stat='density', ax=ax2, color='red')
    ax2.set_title(title_2)
    fig.suptitle(subtitle)
    plt.show()

# Calculate word counts for 'Good' sentiment (copy to avoid SettingWithCopyWarning)
good_sentiment_data = data[data['sentiment'] == 'Good'].copy()
good_sentiment_data['word_count'] = good_sentiment_data['review_text'].apply(count_words)

# Calculate word counts for 'Bad' sentiment
bad_sentiment_data = data[data['sentiment'] == 'Bad'].copy()
bad_sentiment_data['word_count'] = bad_sentiment_data['review_text'].apply(count_words)

# Plot the word count distributions (histograms with density curves) side by side
plot_count(good_sentiment_data['word_count'], bad_sentiment_data['word_count'], 'Good Sentiment', 'Bad Sentiment', 'Word Count Analysis For Review Text')



### Stop Word Count Analysis in Review Text
-   Analyse the presence of stop words in review texts and how they vary between 'Good' and 'Bad' sentiments. Stop words are commonly used words in a language (like "the", "is", "in", etc.) that are often filtered out in NLP tasks due to their low informational content. However, the frequency of stop words can sometimes provide insights into the writing style or the nature of the text.

In [None]:
# Build the stop word set once instead of on every call
stop_words = set(stopwords.words('english'))

# Function to count stop words in a text sample
def count_stop_words(text):
    # str() guards against non-string values such as NaN
    return sum(1 for word in str(text).split() if word.lower() in stop_words)

# Calculate stop word counts for 'Good' sentiment
good_sentiment_data['stop_word_count'] = good_sentiment_data['review_text'].apply(count_stop_words)

# Calculate stop word counts for 'Bad' sentiment
bad_sentiment_data['stop_word_count'] = bad_sentiment_data['review_text'].apply(count_stop_words)

# Create a plot using the plot_count function
plot_count(good_sentiment_data['stop_word_count'], bad_sentiment_data['stop_word_count'], 'Good Sentiment', 'Bad Sentiment', 'Stop Word Count Analysis')
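
Raw stop word counts tend to grow with review length, so a ratio can be easier to compare across sentiments. The sketch below is self-contained and uses a deliberately tiny, hypothetical stop word list in place of NLTK's English list; `stop_word_ratio` is an illustrative helper, not part of this notebook.

```python
# Hypothetical, deliberately tiny stop word list; the notebook uses NLTK's English list
STOP_WORDS = {"the", "is", "in", "and", "was", "a", "to"}

def count_stop_words(text):
    """Return how many whitespace-separated tokens of `text` are stop words."""
    return sum(1 for word in str(text).split() if word.lower() in STOP_WORDS)

def stop_word_ratio(text):
    """Return the share of tokens that are stop words (0.0 for empty text)."""
    words = str(text).split()
    return count_stop_words(text) / len(words) if words else 0.0

example = "The delivery was quick and the driver is friendly"
print(count_stop_words(example), round(stop_word_ratio(example), 2))
```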

In [None]:
# COMMON STOP WORDS IN THE REVIEW TEXT DATA
stop_words = set(stopwords.words('english'))

# Function returning the most common stop words across a column of texts
def top_stop_words(texts, n=50):
    tokens = " ".join(texts.astype(str)).lower().split()
    # Keep only tokens that are stop words before counting
    return Counter(token for token in tokens if token in stop_words).most_common(n)

# Get the most common stop words (top 50) for 'Good' sentiment
top_stop_words_good = top_stop_words(good_sentiment_data['review_text'])

# Get the most common stop words (top 50) for 'Bad' sentiment
top_stop_words_bad = top_stop_words(bad_sentiment_data['review_text'])

# Extract words and counts for plotting
x_good, y_good = zip(*top_stop_words_good)
x_bad, y_bad = zip(*top_stop_words_bad)

# Create bar charts
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.barh(x_good, y_good, color='green')
plt.title("Most Common Stop Words in 'Good' Sentiment")
plt.xlabel('Frequency')
plt.ylabel('Words')

plt.subplot(122)
plt.barh(x_bad, y_bad, color='red')
plt.title("Most Common Stop Words in 'Bad' Sentiment")
plt.xlabel('Frequency')
plt.ylabel('Words')

plt.tight_layout()
plt.show()
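
For a ranking of stop words specifically, only tokens that belong to the stop word set should reach the `Counter`; otherwise frequent content words can enter the list. A minimal sketch on two made-up reviews, again with a tiny hypothetical stop word list rather than NLTK's:

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list (assumption for illustration)
STOP_WORDS = {"the", "is", "in", "and", "was", "a", "to"}

reviews = [
    "The parcel was late and the box was damaged",
    "Great price and the staff is helpful",
]

# Lowercase and tokenize every review, then count only the stop words
tokens = [word.lower() for review in reviews for word in review.split()]
stop_counts = Counter(token for token in tokens if token in STOP_WORDS)

top = stop_counts.most_common(2)
print(top)
```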

-   Analyze the occurrence of URLs in review texts and compare the average number of URLs per review between 'Good' and 'Bad' sentiments.

In [None]:
# URL ANALYSIS IN REVIEW TEXT
# Function to count URLs in a text sample
def count_urls(text):
    # 'https' contains 'http', so a single substring check covers both
    return len([x for x in str(text).lower().split() if 'http' in x])

# Calculate URL counts for 'Good' sentiment
good_sentiment_data['count_good_urls'] = good_sentiment_data['review_text'].apply(count_urls)

# Calculate URL counts for 'Bad' sentiment
bad_sentiment_data['count_bad_urls'] = bad_sentiment_data['review_text'].apply(count_urls)

# Bar Chart for Average URL Counts
avg_good_urls = good_sentiment_data['count_good_urls'].mean()
avg_bad_urls = bad_sentiment_data['count_bad_urls'].mean()

plt.figure(figsize=(10, 6))
plt.bar(['Good Sentiment', 'Bad Sentiment'], [avg_good_urls, avg_bad_urls], color=['blue', 'red'])
plt.xlabel('Sentiment')
plt.ylabel('Average URL Count')
plt.title('Average URL Counts in Review Texts by Sentiment')
plt.show()
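
A substring check for `http` also matches ordinary words that happen to contain it. As a possible refinement, the sketch below counts URLs with a regular expression that requires a scheme (`http://` or `https://`) or a leading `www.`; the pattern and sample strings are assumptions for illustration and will not catch every URL form.

```python
import re

# Matches tokens starting with http://, https:// or www. (a simple heuristic)
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def count_urls(text):
    """Return the number of URL-like tokens found in `text`."""
    return len(URL_PATTERN.findall(str(text)))

with_links = "See https://example.com/help and www.example.org for details"
without_links = "No links here, just http as a word"

print(count_urls(with_links), count_urls(without_links))
```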

## REVIEW TITLE ANALYSIS
-   Explore the review titles grouped by sentiment to understand the themes and expressions commonly used in the titles of reviews with different sentiments. 

In [None]:
# Group the dataset by "sentiment"
grouped = data.groupby('sentiment')

# Define the number of review titles to display for each sentiment
reviews_to_display = 10

# Display the first 10 review titles
for sentiment, group in grouped:
    print(f"Sentiment: {sentiment}\n")
    for i, review_title in enumerate(group['review_title']):
        if i >= reviews_to_display:
            break  # Stop after displaying the first 10 titles
        print(review_title)
    print("\n")


### Word Count Analysis for Review Title
-   Analyze the word count in review titles and compare how they vary between 'Good' and 'Bad' sentiments. This analysis can reveal whether there is a tendency for titles with a particular sentiment to be longer or shorter in terms of word count.

In [None]:
# Function to count words in a text sample
def count_words(text):
    # str() guards against non-string values such as NaN
    return len(str(text).split())

# Calculate word counts for 'Good' sentiment (copy to avoid SettingWithCopyWarning)
good_sentiment_data = data[data['sentiment'] == 'Good'].copy()
good_sentiment_data['word_count'] = good_sentiment_data['review_title'].apply(count_words)

# Calculate word counts for 'Bad' sentiment
bad_sentiment_data = data[data['sentiment'] == 'Bad'].copy()
bad_sentiment_data['word_count'] = bad_sentiment_data['review_title'].apply(count_words)

# Create a plot using the plot_count function
plot_count(good_sentiment_data['word_count'], bad_sentiment_data['word_count'], 'Good review', 'Bad review', 'Review Title Word Count Analysis')

### Common Stop Words in the Review Title
-   Analyse the presence of stop words in review titles and how they vary between 'Good' and 'Bad' sentiments. Stop words are commonly used words in a language (like "the", "is", "in", etc.) that are often filtered out in NLP tasks due to their low informational content. However, the frequency of stop words can sometimes provide insights into the writing style or the nature of the text.

In [None]:
# Function to count stop words in a text sample
def count_stop_words(text):
    stop_words = set(stopwords.words('english'))
    # str() guards against non-string values such as NaN
    words = str(text).split()
    return sum(1 for word in words if word.lower() in stop_words)

# Calculate stop word counts for 'Good' sentiment
good_sentiment_data['stop_word_count'] = good_sentiment_data['review_title'].apply(count_stop_words)

# Calculate stop word counts for 'Bad' sentiment
bad_sentiment_data['stop_word_count'] = bad_sentiment_data['review_title'].apply(count_stop_words)

# Create a plot using the plot_count function
plot_count(good_sentiment_data['stop_word_count'], bad_sentiment_data['stop_word_count'], 'Good review title', 'Bad review title', 'Stop Word Count Analysis in Review Title')

In [None]:
# COMMON STOP WORDS IN THE REVIEW TITLE DATA
stop_words = set(stopwords.words('english'))

# Function returning the most common stop words across a column of texts
def top_stop_words(texts, n=50):
    tokens = " ".join(texts.astype(str)).lower().split()
    # Keep only tokens that are stop words before counting
    return Counter(token for token in tokens if token in stop_words).most_common(n)

# Get the most common stop words (top 50) for 'Good' sentiment
top_stop_words_good = top_stop_words(good_sentiment_data['review_title'])

# Get the most common stop words (top 50) for 'Bad' sentiment
top_stop_words_bad = top_stop_words(bad_sentiment_data['review_title'])

# Extract words and counts for plotting
x_good, y_good = zip(*top_stop_words_good)
x_bad, y_bad = zip(*top_stop_words_bad)

# Create bar charts
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.barh(x_good, y_good, color='green')
plt.title("Most Common Stop Words in 'Good' Review Title")
plt.xlabel('Frequency')
plt.ylabel('Words')

plt.subplot(122)
plt.barh(x_bad, y_bad, color='red')
plt.title("Most Common Stop Words in 'Bad' Review Title")
plt.xlabel('Frequency')
plt.ylabel('Words')

plt.tight_layout()
plt.show()


### Cleaning the data
-   Text Cleaning for Review Data
    -   Preprocess the text data in the 'review_title' and 'review_text' columns of the dataset. Text cleaning is an essential preprocessing step in Natural Language Processing (NLP): it removes noise and irrelevant content, making the text more uniform and easier to analyze.

-   The text cleaning function involves several steps:
    -   Removing Punctuation and HTML Tags: Regular expressions remove punctuation and HTML markup, which makes the text cleaner and more consistent.
    -   Removing URLs and Emojis: URLs and emojis are also removed, as they are often not needed for typical text analysis tasks.
    -   Filtering Out Stop Words: Stop words (common words that usually don't carry much meaning, like 'the', 'is', etc.) are removed. This step focuses the analysis on more meaningful words in the text.
    -   Converting to Lowercase: The text is converted to lowercase to ensure uniformity, as 'Word' and 'word' should be considered the same word.

In [None]:
# Build the stop word set once, outside the function
stop_words = set(stopwords.words('english'))

# Define a function to clean text
def clean_text(text):
    # Treat missing values (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''

    # Remove HTML tags and URLs first, while '<', '>' and '://' are still present;
    # stripping punctuation earlier would leave tag names and URL fragments behind
    text = re.sub(r'<.*?>', ' ', text)  # Remove HTML syntaxes
    text = ' '.join(word for word in text.split() if not word.lower().startswith(('http://', 'https://')))  # Remove URLs

    # Remove Emojis
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    text = re.sub(r'&', ' and ', text)  # Replace '&' with 'and'
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove punctuations

    # Convert to lowercase and remove stopwords
    words = text.lower().split()
    return ' '.join(word for word in words if word not in stop_words)

# Apply the clean_text function to both 'review_title' and 'review_text' columns
data['review_title'] = data['review_title'].apply(clean_text)
data['review_text'] = data['review_text'].apply(clean_text)
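
The order of the cleaning steps matters: if punctuation is stripped before tags and URLs, the `<`, `>` and `://` characters that the later patterns rely on are already gone. The self-contained sketch below illustrates this on one made-up string; `clean_sketch`, the tiny stop word list and the sample text are assumptions for illustration, not the notebook's function or data.

```python
import re

# Hypothetical, deliberately tiny stop word list (the notebook uses NLTK's)
STOP_WORDS = {"the", "is", "a", "and", "at"}

def clean_sketch(text):
    text = re.sub(r"<.*?>", " ", text)                    # HTML tags first
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)   # then URLs
    text = text.replace("&", " and ")                     # '&' becomes 'and'
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)            # punctuation and emojis
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

raw = "<p>Great service & fast delivery! \U0001F600 See https://example.com/x</p>"

# Stripping punctuation first leaves tag letters and a fused URL behind
punct_first = re.sub(r"[^a-zA-Z0-9\s]", "", raw)

cleaned = clean_sketch(raw)
print(punct_first)
print(cleaned)
```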

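### Lemmatization of Review Texts and Review Titles
-   Lemmatization is a common preprocessing step for Natural Language Processing (NLP) tasks. It reduces words to their dictionary form, known as the lemma. Unlike stemming, which strips suffixes by rule, lemmatization looks words up in a lexicon (here WordNet), so the result is a valid word. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed via its `pos` argument, so with the default call used here verbs such as 'running' are left unchanged.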

In [None]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a copy of the data for lemmatization
lemmatization_data = data.copy()

# Function for lemmatization
def lemmatize_text(text):
    words = text.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

# Apply lemmatization to the copy of the data
lemmatization_data['review_title'] = lemmatization_data['review_title'].apply(lemmatize_text)
lemmatization_data['review_text'] = lemmatization_data['review_text'].apply(lemmatize_text)

# Display the first 5 rows of the lemmatized data
lemmatization_data.head(5)
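
To illustrate the difference between the two approaches without any downloads, the toy sketch below contrasts a dictionary lookup with naive suffix stripping. Both functions and the four-entry table are hypothetical stand-ins: `toy_lemmatize` is not WordNet and `toy_stem` is not a real stemming algorithm.

```python
# Toy lookup table standing in for a real lexicon such as WordNet (hypothetical entries)
LEMMAS = {
    "mice": "mouse",
    "geese": "goose",
    "studies": "study",
    "deliveries": "delivery",
}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMAS.get(word, word)

def toy_stem(word):
    """Naive suffix stripping, for contrast only (not a real stemmer)."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for word in ("studies", "mice", "deliveries"):
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix-stripping version produces fragments such as "stud", while the lookup returns valid words and can handle irregular plurals like "mice".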

### Exporting Lemmatized Data to a CSV File
After completing the lemmatization of the review texts, export this processed data to a CSV file. This step is crucial for preserving the lemmatized data, making it easily accessible for future analysis or use in other applications.

In [None]:
lemmatization_data.to_csv('lemmatized_data.csv', index=False)
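
A small round-trip check, sketched here on a made-up two-row frame and a temporary file (both assumptions, not the project's data), shows one thing to watch for when reloading the export: a review that became an empty string during cleaning is read back as NaN by default, unless `keep_default_na=False` is passed to `pd.read_csv`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one review that is empty after cleaning (values are made up)
df = pd.DataFrame({"review_text": ["great service", ""], "rating": [5, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lemmatized_sample.csv")
    df.to_csv(path, index=False)

    # Default behaviour: empty fields are parsed as NaN
    default_read = pd.read_csv(path)

    # Keep empty strings as empty strings
    kept_read = pd.read_csv(path, keep_default_na=False)

print(default_read["review_text"].isna().sum())
print(kept_read["review_text"].tolist())
```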