# Sentiment Scores Pipeline
#### Tokenize Reviews and Generate Sentiment Scores
This notebook loads review data from the ODS schema, cleans the comments, and calculates a sentiment score for each review. The pipeline contains the following steps:
- Review data is fetched from the ODS (Operational Data Store) in Snowflake.
- Reviews are cleaned and tokenized.
- Sentiment scores are calculated for each review.
- Processed data is uploaded to the FEATURE_STORE schema in Snowflake.

In [1]:
import pandas as pd
import os
from datetime import datetime
from dotenv import load_dotenv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import sys

# Get the current working directory
current_directory = os.getcwd()

# Add the parent directory to sys.path
sys.path.append(os.path.abspath(os.path.join(current_directory, '..')))

# Import the helper_functions module
from helper_functions import connect_to_snowflake, get_data, write_to_snowflake

In [2]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('vader_lexicon')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\susac\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\susac\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\susac\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\susac\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\susac\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [3]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

## 1. Import Data
#### 1.1 Connect to Snowflake using schema ODS

In [5]:
conn = connect_to_snowflake(schema_name='ODS')

Successfully connected to Snowflake schema ODS


## 2. Setup functions
#### 2.1 Comment cleaning function

In [6]:
# Text cleaning function
def clean_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Tokenize
    words = word_tokenize(text)
    
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    
    # Lemmatize
    words = [lemmatizer.lemmatize(word) for word in words]
        
    return ' '.join(words)
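
To make the first cleaning steps concrete, the sketch below applies the same punctuation regex and lowercasing to a sample string using only the standard library. The `normalize` helper and its handling of missing values are assumptions added for illustration; `clean_text` as written expects every comment to be a string and would raise a `TypeError` on `None` or `NaN`.

```python
import math
import re


def normalize(value):
    """Hypothetical helper: strip punctuation and lowercase, tolerating missing values."""
    # Treat None and float NaN (how pandas represents missing text) as empty comments
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ''
    # Same pattern as clean_text: drop everything that is not a word character or whitespace
    text = re.sub(r'[^\w\s]', '', str(value))
    return text.lower()


print(normalize("Great place, wasn't it?"))  # great place wasnt it
print(repr(normalize(None)))                 # ''
```

The apostrophe is removed before tokenization, so a contraction such as "wasn't" becomes "wasnt". Whether that form still matches an entry in the stop-word list is worth checking against the installed NLTK data.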

#### 2.2 Sentiment analyzer function

In [31]:
# Initialize VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Function to compute sentiment score
def get_sentiment_score(text):
    return sid.polarity_scores(text)['compound']


def apply_sentiment_score(df):
    # Apply sentiment analysis to each cleaned comment
    df['sentiment_score'] = df['clean_comments'].apply(get_sentiment_score)

    # Classify sentiment based on score: >= 0.05 positive, <= -0.05 negative, otherwise neutral
    df['sentiment'] = df['sentiment_score'].apply(
        lambda x: 'positive' if x >= 0.05 else ('negative' if x <= -0.05 else 'neutral')
    )

    # Return the DataFrame so callers can write `df = apply_sentiment_score(df)`
    return df
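
The thresholds in the lambda can be easier to read and test as a named function. The sketch below is one way to write it; `classify_sentiment` is a hypothetical name, and the 0.05 cutoff is taken from the code above.

```python
def classify_sentiment(score, threshold=0.05):
    """Hypothetical helper mirroring the lambda: map a compound score to a label."""
    if score >= threshold:
        return 'positive'
    if score <= -threshold:
        return 'negative'
    return 'neutral'


# Show the label on each side of both boundaries
for score in (0.6, 0.05, 0.0, -0.05, -0.3):
    print(score, classify_sentiment(score))
```

With pandas, the same function could be applied as `df['sentiment_score'].apply(classify_sentiment)`.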


#### 2.3 Create pipeline to process reviews to Snowflake
Processed data will be uploaded to the schema FEATURE_STORE.

In [54]:
def process_and_upload_sentiment_scores(markets):
    """
    Processes sentiment scores for each market and uploads results to Snowflake.

    Parameters:
    - markets (List of strings): List of markets to process.
    """
    # Start as True so the table is created (overwritten) when the first market is written
    create_table = True

    for market in markets:
    
        print(f"Running pipeline for market: {market}")

        # Fetch data for the current market
        print('Fetching data')
        sql_query = f"SELECT * FROM REVIEWS WHERE market = '{market}'"
        df = get_data(sql_query, conn)

        print('Applying clean_text function')
        # Clean text data
        df['clean_comments'] = df['comments'].apply(clean_text)

        print('Generating sentiment scores')
        # Apply sentiment scores
        df = apply_sentiment_score(df)

        # Capitalize column names prior to writing to Snowflake
        df.columns = [col.upper() for col in df.columns]

        print('Writing to Snowflake')
        # Write to Snowflake
        write_to_snowflake(df_name=df,conn=conn, schema_name='FEATURE_STORE',
                           table_name='REVIEWS_SENTIMENT_SCORES', overwrite_table=create_table)
        
        print('----------------------------------------------------------------')

        # After the first market, switch to append mode so later markets are added to the table
        create_table = False
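
The query inside the loop builds SQL by interpolating `market` into an f-string. That works for a hard-coded list, but binding the value as a parameter avoids quoting problems if the list ever comes from an external source. The sketch below shows the idea with the standard-library `sqlite3` module as a stand-in for Snowflake. The in-memory table, its rows, and the `fetch_reviews` helper are made up for illustration, and whether `get_data` accepts bind parameters is not shown in this notebook.

```python
import sqlite3

# In-memory stand-in for the ODS REVIEWS table (illustrative rows only)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute("CREATE TABLE REVIEWS (market TEXT, comments TEXT)")
demo_conn.executemany(
    "INSERT INTO REVIEWS VALUES (?, ?)",
    [('albany', 'Lovely stay'), ('chicago', 'Too noisy'), ('albany', 'Would return')],
)


def fetch_reviews(connection, market):
    """Hypothetical helper: fetch one market's reviews using a bound parameter."""
    # The ? placeholder is filled in by the driver, so the value is never spliced into the SQL text
    cursor = connection.execute(
        "SELECT market, comments FROM REVIEWS WHERE market = ?", (market,)
    )
    return cursor.fetchall()


rows = fetch_reviews(demo_conn, 'albany')
print(len(rows))  # 2
```

Placeholder syntax differs between database drivers, so check the connector documentation before adapting this pattern.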


## 3. Apply processing pipeline
The first iteration of the loop creates the table and loads the first market's data into Snowflake. Each subsequent iteration appends the next market's data to the same table.
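
The create-then-append behaviour can be sketched without a database. In the example below, `fake_write` is a hypothetical stand-in for `write_to_snowflake` that only records whether each call would overwrite or append; it is not the helper's real implementation.

```python
calls = []


def fake_write(table_name, overwrite_table):
    """Hypothetical stand-in for write_to_snowflake: record the write mode only."""
    calls.append((table_name, 'overwrite' if overwrite_table else 'append'))


def run_for_markets(markets):
    # The first market creates (overwrites) the table; later markets append to it
    for index, market in enumerate(markets):
        fake_write('REVIEWS_SENTIMENT_SCORES', overwrite_table=(index == 0))


run_for_markets(['albany', 'chicago', 'seattle'])
print([mode for _, mode in calls])  # ['overwrite', 'append', 'append']
```

Assuming `overwrite_table=True` replaces any existing table, rerunning the whole pipeline would replace earlier results rather than duplicate them.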

In [55]:
markets = ['albany', 'chicago', 'los-angeles', 'new-york-city', 'san-francisco', 'seattle', 'washington-dc']

# The function uploads its results and returns nothing, so no assignment is needed
process_and_upload_sentiment_scores(markets)

Running pipeline for market: albany
Fetching data
Applying clean_text function
Generating sentiment scores
Writing to Snowflake
Table REVIEWS_SENTIMENT_SCORES created in schema FEATURE_STORE at 2024-08-11 17:2026
----------------------------------------------------------------
Running pipeline for market: chicago
Fetching data
Applying clean_text function
Generating sentiment scores
Writing to Snowflake
Data appended to table REVIEWS_SENTIMENT_SCORES in schema FEATURE_STORE at 2024-08-11 17:2227
----------------------------------------------------------------
Running pipeline for market: los-angeles
Fetching data
Applying clean_text function
Generating sentiment scores
Writing to Snowflake
Data appended to table REVIEWS_SENTIMENT_SCORES in schema FEATURE_STORE at 2024-08-11 17:3148
----------------------------------------------------------------
Running pipeline for market: new-york-city
Fetching data
Applying clean_text function
Generating sentiment scores
Writing to Snowflake
Data ap

Close connection to Snowflake.

In [56]:
conn.close()