**Task 1:** In this task, you will be performing sentiment analysis on Yelp reviews using the VADER sentiment analyzer. Here's what you need to do:

1. **Load the Data**: Load the Yelp reviews from a CSV file (`yelp_reviews.csv`) into a pandas DataFrame.

2. **Initialize the Sentiment Analyzer**: Set up the VADER sentiment analyzer to measure the sentiment of the reviews.

3. **Create a Sentiment Function**: Write a function that calculates the positive, negative, neutral, and overall sentiment scores for each review using VADER.

4. **Apply the Sentiment Analysis**: Apply this function to all the reviews in the dataset and add the sentiment scores as new columns to your DataFrame.

5. **Save Your Work**: Save the updated DataFrame with the sentiment scores to a new CSV file called `yelp_reviews_vader.csv`.

The goal is to analyze Yelp reviews for sentiment and store the results.


In [None]:
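# Import necessary libraries
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon if it is missing; SentimentIntensityAnalyzer() raises LookupError without it
nltk.download('vader_lexicon', quiet=True)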

# Load the Yelp reviews dataset (relative path; adjust it if the CSV is stored elsewhere)
reviews_df = pd.read_csv("yelp_reviews.csv")

#Initialize VADER sentiment analyzer 
sia = SentimentIntensityAnalyzer() 

#Function to calculate VADER sentiment for each review
def get_vader_sentiment(review):
    #Calculate sentiment using sia.polarity_scores(review)
    #Return the positive, negative, neutral, and compound sentiment scores
    sentiment = sia.polarity_scores(review)  
    return sentiment['pos'], sentiment['neg'], sentiment['neu'], sentiment['compound']  

# Apply the VADER sentiment analysis to the 'Review' column
# apply() yields one (pos, neg, neu, compound) tuple per review; build a DataFrame from those
# tuples and assign its four columns to the new score columns in one step.
# (Tuple-unpacking a DataFrame iterates over its column labels, not its columns.)
scores = reviews_df['Review'].apply(get_vader_sentiment).tolist()
reviews_df[['pos', 'neg', 'neu', 'compound']] = pd.DataFrame(scores, index=reviews_df.index)

# Save the updated DataFrame with the VADER sentiment scores to 'yelp_reviews_vader.csv'
reviews_df.to_csv("yelp_reviews_vader.csv", index=False)


**Task 2:** In this task, you will be working on preprocessing text and performing sentiment analysis using custom rules and WordNet similarity. Here's what you need to do:

1. **Preprocess the Text**: Implement a function to clean up the review text. This involves tokenizing the text, removing stopwords, and lemmatizing (reducing words to their base form).

2. **Extract Phrases**: Write a function that identifies and extracts specific types of phrases from the text, based on rules like "adjective followed by noun" or "adverb followed by verb."

3. **Calculate Word Similarity**: Use WordNet to calculate the similarity between words. This will help in determining whether the extracted phrases are more similar to positive or negative words.

4. **Analyze Sentiment**: For each review, analyze the sentiment by checking if the extracted phrases are more similar to positive or negative reference words. Then assign a score and label (positive/negative) to each review.

5. **Apply Sentiment Analysis**: Apply your sentiment analysis function to all the reviews in the dataset and save the results (sentiment score and label) to a new CSV file.

The goal is to preprocess the text, extract meaningful phrases, and determine the sentiment of each review using custom rules and word similarity.
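
The phrase rules from step 2 can be tried out on tokens that are already POS-tagged. The snippet below is a minimal sketch, not the lab's reference solution: `extract_phrase_pairs` and the sample sentence are hypothetical, and the tags are written by hand instead of coming from `nltk.pos_tag`, so it runs without any NLTK data.

```python
# Hypothetical helper: applies the five two-word rules to (word, tag) pairs.
ADVERBS = {"RB", "RBR", "RBS"}
NOUNS = {"NN", "NNS"}
VERBS = {"VB", "VBD", "VBN", "VBG"}


def extract_phrase_pairs(tagged):
    """Return two-word phrases whose Penn Treebank tags match rules 1-5."""
    phrases = []
    # Stop at len - 1 so the final pair of tokens is also examined.
    for i in range(len(tagged) - 1):
        word1, tag1 = tagged[i]
        word2, tag2 = tagged[i + 1]
        # Tag of the word after the pair, or None at the end of the list.
        tag3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        not_before_noun = tag3 not in NOUNS

        if tag1.startswith("JJ") and tag2 in NOUNS:  # Rule 1
            phrases.append(f"{word1} {word2}")
        elif tag1 in ADVERBS and tag2.startswith("JJ") and not_before_noun:  # Rule 2
            phrases.append(f"{word1} {word2}")
        elif tag1.startswith("JJ") and tag2.startswith("JJ") and not_before_noun:  # Rule 3
            phrases.append(f"{word1} {word2}")
        elif tag1 in NOUNS and tag2.startswith("JJ") and not_before_noun:  # Rule 4
            phrases.append(f"{word1} {word2}")
        elif tag1 in ADVERBS and tag2 in VERBS:  # Rule 5
            phrases.append(f"{word1} {word2}")
    return phrases


# Hand-tagged sample (an assumption; real tags would come from nltk.pos_tag).
sample = [("great", "JJ"), ("food", "NN"), ("really", "RB"), ("friendly", "JJ")]
phrases = extract_phrase_pairs(sample)
print(phrases)  # ['great food', 'really friendly']
```

Note that "really friendly" is only found because the loop also visits the last pair of tokens, where there is no third tag to check.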


In [None]:
# Incomplete Text Preprocessing and Sentiment Analysis Script for Students

# TODO: Import necessary libraries (Hint: You'll need nltk for lemmatization and stop words, pandas for data handling, and WordNet for semantic similarity)

# Initialize the lemmatizer and stop words list
lemmatizer = None  # TODO: Initialize WordNetLemmatizer()
stop_words = None  # TODO: Get the list of stopwords from nltk

def preprocess_text(text):
    """
    Preprocesses the input text by tokenizing, removing stopwords, and lemmatizing.
    """
    # TODO: Tokenize the text (Hint: Use word_tokenize)
    tokens = None  # Replace None with tokenization code

    # TODO: Convert tokens to lowercase and remove non-alphanumeric characters
    # Filter out stop words and lemmatize the remaining tokens
    tokens = None  # Replace None with list comprehension for lowercasing, filtering, and lemmatization
    return tokens

def extract_phrases(tokens):
    """
    Extracts phrases from the tokens based on the provided rules.
    """
    # TODO: POS-tag the tokens (Hint: Use pos_tag)
    pos_tagged = None  # Replace None with POS tagging code

    phrases = []
    # TODO: Implement rules to extract phrases (use the given rules 1-5 as hints)
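    for i in range(len(pos_tagged) - 1):  # len - 1 so the final pair of tokens is also checked
        word1, tag1 = pos_tagged[i]
        word2, tag2 = pos_tagged[i + 1]
        # tag3 is the tag of the word after the pair, or None when the pair ends the token list
        tag3 = pos_tagged[i + 2][1] if i + 2 < len(pos_tagged) else None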

        # Rule 1: JJ followed by NN or NNS
        if tag1.startswith('JJ') and (tag2 == 'NN' or tag2 == 'NNS'):
            phrases.append(f'{word1} {word2}')
        
        # Rule 2: RB, RBR, or RBS followed by JJ, not followed by NN or NNS
        elif (tag1 == 'RB' or tag1 == 'RBR' or tag1 == 'RBS') and tag2.startswith('JJ') and (tag3 != 'NN' and tag3 != 'NNS'):
            phrases.append(f'{word1} {word2}')
        
        # Rule 3: JJ followed by JJ, not followed by NN or NNS
        elif tag1.startswith('JJ') and tag2.startswith('JJ') and (tag3 != 'NN' and tag3 != 'NNS'):
            phrases.append(f'{word1} {word2}')
        
        # Rule 4: NN or NNS followed by JJ, not followed by NN or NNS
        elif (tag1 == 'NN' or tag1 == 'NNS') and tag2.startswith('JJ') and (tag3 != 'NN' and tag3 != 'NNS'):
            phrases.append(f'{word1} {word2}')
        
        # Rule 5: RB, RBR, or RBS followed by VB, VBD, VBN, or VBG
        elif (tag1 == 'RB' or tag1 == 'RBR' or tag1 == 'RBS') and (tag2 == 'VB' or tag2 == 'VBD' or tag2 == 'VBN' or tag2 == 'VBG'):
            phrases.append(f'{word1} {word2}')

    return phrases

def wordnet_similarity(word1, word2):
    """
    Calculates the similarity between two words using WordNet.
    """
    # TODO: Get synsets for the two words using wn.synsets
    synsets1 = None  # Replace None with code to get synsets for word1
    synsets2 = None  # Replace None with code to get synsets for word2

    if synsets1 and synsets2:
        # TODO: Calculate similarity using wn.wup_similarity
        return None  # Replace None with the similarity calculation

    return 0  # Return 0 if no similarity is found

# Define positive and negative reference words
positive_refs = ["delicious", "tasty", "amazing", "great", "wonderful", "fantastic", "excellent"]
negative_refs = ["disgusting", "bad", "terrible", "awful", "horrible", "inedible", "poor"]

def semantic_orientation(phrase):
    """
    Calculates the semantic orientation of a phrase by comparing it with
    multiple positive and negative reference words.
    """
    # TODO: Calculate the average similarity with positive and negative reference words
    pos_score = None  # Replace None with similarity calculation for positive reference words
    neg_score = None  # Replace None with similarity calculation for negative reference words

    # Return the difference to get the orientation score
    return pos_score - neg_score

def analyze_sentiment(document):
    """
    Analyzes the sentiment of a document based on its phrases' semantic orientation
    and returns both the sentiment score and the sentiment label.
    """
    tokens = preprocess_text(document)
    phrases = extract_phrases(tokens)
    total_orientation = 0

    for phrase in phrases:
        total_orientation += semantic_orientation(phrase)

    # TODO: Assign sentiment label based on total_orientation
    sentiment_label = None  # Replace None with logic for assigning 'Positive' or 'Negative' label

    return total_orientation, sentiment_label  # Return both the sentiment score and label

# TODO: Load the dataset (Hint: Use pandas to read 'yelp_reviews.csv')
data = None  # Replace None with pandas code to load the CSV file

# TODO: Apply sentiment analysis to the 'Review' column and calculate both the score and label
# Hint: Use the apply method with a lambda function to apply analyze_sentiment to each review
data[['Sentiment Score', 'Sentiment Label']] = None  # Replace None with code to apply sentiment analysis

# TODO: Save the results to a new CSV file (Hint: Use pandas to_csv method)
data.to_csv(None, index=False)  # Replace None with the correct filename


**Task 3:** In this task, you will preprocess Yelp reviews and calculate the Euclidean distances between them based on their term frequencies. Here's what you need to do:

1. **Preprocess the Text**: Implement a function that cleans up the review text by removing punctuation, converting it to lowercase, tokenizing it, and removing common stopwords. You will also lemmatize the words to reduce them to their base form.

2. **Apply Preprocessing**: Apply this text preprocessing function to each review in the dataset to prepare the text for analysis.

3. **Convert to Term Matrix**: Convert the cleaned review text into a numerical format (term frequency matrix) using `CountVectorizer`, which will help you calculate distances between the reviews.

4. **Calculate Euclidean Distances**: Use the term matrix to compute the Euclidean distances between each pair of reviews, which will help show how similar or different the reviews are from one another.

5. **Create and Save a DataFrame**: Convert the distance results into a DataFrame so it’s easier to view and save this as a CSV file.

The goal is to process the text data and then compute how close or far apart the reviews are from each other based on the words they contain.
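
To make steps 3 and 4 concrete, here is a small pure-Python sketch of what `CountVectorizer` and `euclidean_distances` compute conceptually. The three reviews are made up, and `term_counts` and `euclidean` are hypothetical helper names; the lab itself is expected to use the sklearn functions.

```python
import math
from collections import Counter

# Toy, already-cleaned reviews (made up for illustration).
docs = ["good pizza good service", "good pizza", "slow service"]

# Shared vocabulary, so every review maps to a vector of the same length.
vocab = sorted({word for doc in docs for word in doc.split()})


def term_counts(doc):
    """Return the term-frequency vector of `doc` over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def euclidean(u, v):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


vectors = [term_counts(doc) for doc in docs]
# Pairwise distance matrix: one row and one column per review.
distances = [[euclidean(u, v) for v in vectors] for u in vectors]

print(vocab)
for row in distances:
    print([round(d, 3) for d in row])
```

The matrix is symmetric with zeros on the diagonal. Because raw counts are compared, a review that repeats words sits farther from a short review even when both use the same vocabulary.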


In [None]:
# Incomplete Text Preprocessing and Distance Calculation Script for Students

# TODO: Import necessary libraries (Hint: You'll need nltk for lemmatization and stop words, pandas for data handling, re for regular expressions, and sklearn for vectorization and distance calculations)

# Step 1: Initialize the lemmatizer and stop words
lemmatizer = None  # TODO: Initialize WordNetLemmatizer
stop_words = None  # TODO: Get the set of English stop words using nltk

# Step 2: Preprocess the text
def preprocess_text(text):
    """
    Preprocesses the input text by tokenizing, removing stopwords, lemmatizing, and cleaning punctuation.
    """
    # TODO: Remove special characters (Hint: Use re.sub to remove anything that's not a letter or space)
    text = None  # Replace None with text cleaning code

    # TODO: Convert the text to lowercase
    text = None  # Replace None with code to convert text to lowercase

    # TODO: Tokenize the text (Hint: Use word_tokenize from nltk)
    tokens = None  # Replace None with tokenization code

    # TODO: Remove stopwords and lemmatize the tokens
    tokens = None  # Replace None with list comprehension for lemmatization and stopword removal
    
    # Return the processed text as a single string
    return ' '.join(tokens)

# TODO: Load the dataset (Hint: Use pandas to read 'yelp_reviews.csv')
data = None  # Replace None with code to load the dataset

# Step 3: Apply text preprocessing to the 'Review' column
# TODO: Use the apply method to apply preprocess_text to each review
data['Cleaned_Reviews'] = None  # Replace None with the code to apply preprocessing

# Step 4: Convert documents to a term frequency matrix using CountVectorizer
# TODO: Initialize the CountVectorizer and fit it to the 'Cleaned_Reviews' column
vectorizer = None  # Replace None with CountVectorizer initialization
term_matrix = None  # Replace None with the fit_transform code for the Cleaned_Reviews

# Step 5: Calculate the pairwise Euclidean distances between documents
# TODO: Use euclidean_distances from sklearn to calculate the distances between the term frequency vectors
distances = None  # Replace None with the distance calculation code

# Step 6: The pairwise distances are already in matrix format (one row and one column per review)
# Step 7: Convert the matrix into a DataFrame for easier viewing
# TODO: Create a DataFrame from the distances matrix (Hint: Use pandas DataFrame and set the index and columns to data.index)
distance_df = None  # Replace None with DataFrame creation code

# Step 8: Save the DataFrame to a CSV file
# TODO: Save the DataFrame to a CSV file (Hint: Use to_csv method)
distance_df.to_csv(None, index=True)  # Replace None with the correct file name


**Task 4:** In this task, you will preprocess Yelp reviews and calculate the Cosine distances between them based on their term frequencies. Here's what you need to do:

1. **Preprocess the Text**: Write a function to clean up the review text by removing punctuation, converting it to lowercase, tokenizing it into individual words, and removing stopwords. You’ll also lemmatize the words to reduce them to their base forms.

2. **Apply Preprocessing**: Apply the text preprocessing function to the reviews in the dataset, so the text is cleaned and ready for analysis.

3. **Convert to Term Matrix**: Convert the cleaned review text into a numerical format (term frequency matrix) using `CountVectorizer`. This step will help us compare the text data between different reviews.

4. **Calculate Cosine Similarities**: Use the term matrix to calculate the Cosine similarity between each pair of reviews. This measures how similar the reviews are based on their word usage.

5. **Convert Similarities to Distances**: Convert the Cosine similarities into Cosine distances (1 - Cosine Similarity). Distances will help you see how far apart the reviews are from one another.

6. **Create and Save a DataFrame**: Convert the distance results into a DataFrame so it’s easier to read, and save it as a CSV file.

The goal is to preprocess the review text and then measure how similar or different the reviews are using Cosine distance.
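
Steps 4 and 5 can be illustrated without sklearn. The sketch below uses made-up reviews and hypothetical helper names (`term_counts`, `cosine_similarity_py`); it mirrors what `cosine_similarity` followed by `1 - similarity` produces, under the assumption that reviews are plain term-count vectors.

```python
import math
from collections import Counter

# Toy, already-cleaned reviews (made up for illustration).
docs = ["good pizza", "good pizza good pizza", "slow service"]
vocab = sorted({word for doc in docs for word in doc.split()})


def term_counts(doc):
    """Return the term-frequency vector of `doc` over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def cosine_similarity_py(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # An empty review has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0


vectors = [term_counts(doc) for doc in docs]
similarities = [[cosine_similarity_py(u, v) for v in vectors] for u in vectors]
# Cosine distance = 1 - cosine similarity.
cosine_distances = [[1.0 - s for s in row] for row in similarities]

for row in cosine_distances:
    print([round(d, 3) for d in row])
```

The first two reviews differ only in length, so their cosine distance is approximately 0, while their Euclidean distance would not be. Reviews with no words in common have a cosine distance of 1.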


In [None]:
# Incomplete Text Preprocessing and Cosine Distance Calculation Script for Students

# TODO: Import necessary libraries (Hint: You'll need nltk for lemmatization and stop words, pandas for data handling, re for regular expressions, and sklearn for vectorization and similarity calculations)

# Step 1: Initialize the lemmatizer and stop words
lemmatizer = None  # TODO: Initialize WordNetLemmatizer
stop_words = None  # TODO: Get the set of English stop words using nltk

# Step 2: Preprocess the text
def preprocess_text(text):
    """
    Preprocesses the input text by tokenizing, removing stopwords, lemmatizing, and cleaning punctuation.
    """
    # TODO: Remove special characters (Hint: Use re.sub to remove anything that's not a letter or space)
    text = None  # Replace None with text cleaning code

    # TODO: Convert the text to lowercase
    text = None  # Replace None with code to convert text to lowercase

    # TODO: Tokenize the text (Hint: Use word_tokenize from nltk)
    tokens = None  # Replace None with tokenization code

    # TODO: Remove stopwords and lemmatize the tokens
    tokens = None  # Replace None with list comprehension for lemmatization and stopword removal
    
    # Return the processed text as a single string
    return ' '.join(tokens)

# TODO: Load the dataset (Hint: Use pandas to read 'yelp_reviews.csv')
data = None  # Replace None with code to load the dataset

# Step 3: Apply text preprocessing to the 'Review' column
# TODO: Use the apply method to apply preprocess_text to each review
data['Cleaned_Reviews'] = None  # Replace None with the code to apply preprocessing

# Step 4: Convert documents to a term frequency matrix using CountVectorizer
# TODO: Initialize the CountVectorizer and fit it to the 'Cleaned_Reviews' column
vectorizer = None  # Replace None with CountVectorizer initialization
term_matrix = None  # Replace None with the fit_transform code for the Cleaned_Reviews

# Step 5: Calculate the pairwise Cosine similarities between documents
# TODO: Use cosine_similarity from sklearn to calculate the similarity between the term frequency vectors
cosine_similarities = None  # Replace None with the similarity calculation code

# Step 6: Convert Cosine similarities to Cosine distances (1 - Cosine Similarity)
# TODO: Subtract the cosine similarities from 1 to get distances
cosine_distances = None  # Replace None with the code to calculate cosine distances

# Step 7: Convert the pairwise distances into a matrix format and then to a DataFrame for easier viewing
# TODO: Create a DataFrame from the cosine distances matrix (Hint: Use pandas DataFrame and set the index and columns to data.index)
cosine_distance_df = None  # Replace None with DataFrame creation code

# Step 8: Save the DataFrame to a CSV file
# TODO: Save the DataFrame to a CSV file (Hint: Use to_csv method)
cosine_distance_df.to_csv(None, index=True)  # Replace None with the correct file name


**Task 5:** In this task, you will be using a k-NN (k-Nearest Neighbors) model to classify the sentiment of Yelp reviews as either positive or negative. Here's what you need to do:

1. **Load the Data**: Load the Yelp reviews dataset from a CSV file into a DataFrame so you can work with it.

2. **Convert Ratings to Sentiment**: Convert the ratings into binary sentiment labels (0 for negative, 1 for positive). You will define positive reviews as those with a rating of 3.5 or higher.

3. **Preprocess the Text**: Write a function to clean the review text by converting it to lowercase, removing punctuation, and getting rid of numbers.

4. **Vectorize the Reviews**: Convert the cleaned review text into a numerical format using `CountVectorizer`, which turns the text into a set of features based on word frequencies.

5. **Split the Data**: Divide the dataset into training and testing sets (80% for training, 20% for testing). This will help you train the model and evaluate its performance.

6. **Train the k-NN Model**: Set up the k-NN classifier with 5 neighbors and cosine distance as the metric (`metric='cosine'`), and train it using the training data.

7. **Evaluate the Model**: Use the model to predict sentiment on the test set and then evaluate how well it performed using accuracy score and a classification report (which shows precision, recall, and F1-score).

8. **Predict a New Review**: Finally, test your model by predicting the sentiment of a new review (e.g., "The product was absolutely terrible").

The goal is to classify the sentiment of reviews using a machine learning model and evaluate its accuracy.
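
The mechanics of steps 2, 3, 6, and 8 can be sketched in plain Python. This is an illustration under stated assumptions, not the sklearn pipeline the lab asks for: the five reviews and ratings are made up, `clean`, `bag_of_words`, `cosine_distance`, and `knn_predict` are hypothetical helpers, and `k=3` is used instead of 5 because the toy set is so small.

```python
import math
import re
from collections import Counter

# Toy labelled reviews (made up for illustration): (text, rating).
reviews = [
    ("Great food and great service!", 5.0),
    ("Absolutely terrible, cold food.", 1.0),
    ("Terrible service, never again.", 2.0),
    ("Great place, friendly staff.", 4.0),
    ("Food was great, 10/10.", 4.5),
]


def clean(text):
    """Lowercase the text, then replace punctuation and digits with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def bag_of_words(text):
    """Word counts of the cleaned text."""
    return Counter(clean(text).split())


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def knn_predict(text, train, k=3):
    """Majority label among the k training reviews closest to `text`."""
    query = bag_of_words(text)
    ranked = sorted(train, key=lambda item: cosine_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Ratings of 3.5 or higher become 1 (positive), everything else 0 (negative).
train = [(bag_of_words(text), int(rating >= 3.5)) for text, rating in reviews]

prediction = knn_predict("The product was absolutely terrible", train, k=3)
print(prediction)  # 0 (negative)
```

In this toy run one of the three nearest neighbours is a positive review that matches only on the word "was", which shows why the template hints at `stop_words='english'` when vectorizing.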


In [None]:
# Incomplete k-NN Sentiment Classification Script for Students

# TODO: Import necessary libraries (Hint: You'll need pandas for data handling, re for regular expressions, and sklearn for machine learning functions)

# Step 1: Load your dataset
# TODO: Use pandas to load the 'yelp_reviews.csv' file into a DataFrame
reviews_df = None  # Replace None with the code to load the dataset

# Step 2: Convert ratings to binary sentiment (0 for < 3.5, 1 for >= 3.5)
# TODO: Use apply to create a new column 'Sentiment' based on the 'Rating' column
reviews_df['Sentiment'] = None  # Replace None with the lambda function for binary sentiment classification

# Step 3: Preprocess the text data (cleaning reviews)
def preprocess_text(text):
    """
    Preprocesses the input text by converting to lowercase, removing punctuation, and removing numbers.
    """
    # TODO: Convert text to lowercase
    text = None  # Replace None with the code to convert text to lowercase
    
    # TODO: Remove punctuation (Hint: Use re.sub to remove non-alphanumeric characters)
    text = None  # Replace None with the regex for removing punctuation
    
    # TODO: Remove numbers
    text = None  # Replace None with the regex for removing numbers
    
    return text

# TODO: Apply preprocessing to the 'Review' column (Hint: Use apply method to call preprocess_text for each review)
reviews_df['Cleaned_Review'] = None  # Replace None with the code to apply text preprocessing

# Step 4: Vectorize the cleaned reviews using CountVectorizer
# TODO: Initialize CountVectorizer (Hint: Use stop_words='english' to remove common stopwords)
vectorizer = None  # Replace None with CountVectorizer initialization

# TODO: Fit the vectorizer to the 'Cleaned_Review' column and transform the reviews into a feature matrix
X = None  # Replace None with the code to vectorize the cleaned reviews
y = reviews_df['Sentiment']  # Target variable (Sentiment)

# Step 5: Split the data into training and test sets (80% train, 20% test)
# TODO: Use train_test_split to split X and y into training and test sets (Hint: Set test_size=0.2 and random_state=42)
X_train, X_test, y_train, y_test = None  # Replace None with the train_test_split code

# Step 6: Apply k-NN model
# TODO: Initialize and train a k-NN classifier (Hint: Set n_neighbors=5 and metric='cosine')
knn = None  # Replace None with the k-NN initialization and training code

# Step 7: Evaluate the model
# TODO: Predict the sentiment on the test set
y_pred = None  # Replace None with the code to predict on X_test

# TODO: Generate a classification report and accuracy score (Hint: Use classification_report and accuracy_score from sklearn)
classification_report_output = None  # Replace None with the code to generate the classification report
accuracy = None  # Replace None with the code to calculate accuracy

# TODO: Print the classification report and accuracy
print("Classification Report:\n", None)  # Replace None with classification report variable
print(f"Accuracy: {None}")  # Replace None with accuracy variable

# Step 8: Predict sentiment for a new document
new_document = ["The product was absolutely terrible"]

# TODO: Vectorize the new document using the same vectorizer
new_doc_vector = None  # Replace None with the code to vectorize the new document

# TODO: Predict the sentiment of the new document
predicted_sentiment = None  # Replace None with the code to predict sentiment of new document
