### Extracting and Aggregating Sentiments on Targeted Product Features with Large Language Models 

### Introduction
In this notebook, we aim to extract valuable insights from chat history conversations by analyzing sentiments related to various product features. We will query chat history, leverage Fabric's built-in Azure OpenAI capabilities, identify product mentions, and determine the sentiments expressed around each product feature. This analysis will help us understand customer perceptions and preferences, enabling us to make informed decisions and enhance our product development strategies.

Additionally, we will aggregate the counts for each feature and sentiment to provide a comprehensive overview of the data. This aggregation will allow us to quantify the number of mentions, likes, and dislikes for each product feature, as well as the overall sentiment distribution.

#### Prerequisites
You need the following services to run this notebook.

- Microsoft Fabric with F64 Capacity

#### 1) Load the mirrored container into a Spark DataFrame

The following cell imports the modules and functions needed to work with Spark DataFrames, perform data transformations, and call Azure OpenAI services.

In [None]:
from datetime import datetime

from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F
# `sum` and `max` are aliased so they do not shadow Python's built-ins
from pyspark.sql.functions import (col, explode, from_json, sum as _sum, lower, trim,
                                    udf, max as _max, expr, lit, when, array_contains)
from pyspark.sql.types import (ArrayType, MapType, StringType, StructType, StructField)

from synapse.ml.services.openai import OpenAIChatCompletion


#### Dataset
The dataset used in this demo is stored in CosmosDB. To facilitate analysis, the data is mirrored into OneLake and brought into the Lakehouse via a shortcut. This approach ensures that we are not duplicating any data, maintaining a single copy in CosmosDB while providing seamless access for processing and analysis in the Lakehouse environment.
The synchronization between CosmosDB and the Lakehouse happens in under a minute, which allows near real-time analysis for many applications.


Let's load the chat history data from the mirrored database into a Spark DataFrame.

In [None]:
chat_history_container_name = 'chat_history'
#df = spark.sql(f"SELECT * FROM fabcondemo.{chat_history_container_name} WHERE _ts > {max_last_analyzed_ts}")
df = spark.sql(f"SELECT * FROM fabcondemo.{chat_history_container_name}")
display(df.limit(2))
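
The commented-out query above hints at an incremental load that reads only documents newer than a watermark. A minimal sketch of building either query is shown below. `build_chat_query` is a hypothetical helper, and the sketch assumes `_ts` is the CosmosDB last-modified timestamp (epoch seconds) carried through the mirror.

```python
from typing import Optional

def build_chat_query(container: str, last_ts: Optional[int] = None,
                     database: str = "fabcondemo") -> str:
    """Return a full-load query, or an incremental one when a watermark is given."""
    query = f"SELECT * FROM {database}.{container}"
    if last_ts is not None:
        # int() guards against interpolating a non-numeric value into the SQL text
        query += f" WHERE _ts > {int(last_ts)}"
    return query

full_query = build_chat_query("chat_history")
incremental_query = build_chat_query("chat_history", last_ts=1700000000)
```

The resulting string could be passed to `spark.sql(...)` in place of the hard-coded query; where the watermark is stored between runs is left open here.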

#### 2) Extract sentiments around product features through Fabric's built-in Azure OpenAI using the SynapseML library


In this section, we demonstrate how to use the SynapseML library to call Azure OpenAI at scale. The process involves creating a user-defined function (UDF) that wraps each conversation in the chat-message format, defining the schema for the resulting messages, and applying the transformation to the DataFrame. This approach ensures scalability and efficient processing of large datasets.

NOTE: Before the advent of large language models (LLMs), extracting structured information from chat data was a complex and time-consuming task. It typically involved several steps and required expertise in natural language processing (NLP) techniques such as named entity recognition (NER), sentiment analysis, and feature extraction.

1. **Named Entity Recognition (NER)**: Identifying entities like product names, brands, and types required training NER models on annotated datasets. This process involved manually labeling a large amount of data, training the model, and continuously fine-tuning it to improve accuracy.

2. **Sentiment Analysis**: Determining the sentiment around each product mentioned in the conversation required separate sentiment analysis models. These models also needed to be trained on labeled data and fine-tuned to handle the nuances of human language.

3. **Feature Extraction**: Extracting features that are liked or disliked about each product involved additional NLP techniques. This often required custom rule-based systems or machine learning models to identify and categorize features mentioned in the text.

4. **Integration and Transformation**: Combining the outputs of these different models into a structured format like JSON required additional code to integrate the results and transform them into the desired format.

With LLMs, the workflow is significantly simplified:

- **Unified Model**: LLMs can perform multiple NLP tasks within a single model. This means that the same model can handle NER, sentiment analysis, and feature extraction without the need for separate models for each task.
- **Prompt Engineering**: Instead of training and fine-tuning multiple models, you can use prompt engineering to guide the LLM to generate the desired output. This involves crafting a prompt that instructs the model on what information to extract and how to format it.
- **Reduced Manual Effort**: LLMs can understand and generate human-like text, reducing the need for extensive manual labeling and fine-tuning. This makes it easier to adapt the model to new tasks and domains.
- **Scalability**: LLMs can process large volumes of data quickly and accurately, making it easier to scale the extraction process to handle more conversations and extract more detailed information.

In summary, LLMs simplify the workflow by providing a unified model that can handle multiple NLP tasks, reducing the need for manual effort and making it easier to scale the extraction process. This allows you to focus on crafting effective prompts and integrating the model's output into your analysis.
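
As an illustration of the prompt-engineering point above, the sketch below assembles the system prompt from the same label lists that the notebook uses later for flattening and aggregation, so the prompt and the downstream columns cannot drift apart. `build_system_prompt`, `SENTIMENTS`, `FEATURES`, and `EXAMPLE` are hypothetical names introduced for this sketch; it is one possible way to structure the prompt, not the method used in this notebook.

```python
import json

SENTIMENTS = ["enthusiastic", "neutral", "obstructionist"]
FEATURES = ["design", "usability", "color", "size", "price", "brand"]

# One template record, taken from the example embedded in the notebook's prompt
EXAMPLE = [{
    "product": {"type": "Climbing", "brand": "Gravitator",
                "name": "Gravity Beam Climbing Rope"},
    "sentiment": "enthusiastic",
    "features": {"like": ["design", "usability", "color", "size", "price", "brand"],
                 "dislike": []},
}]

def build_system_prompt(sentiments, features, example):
    """Assemble the instruction text from the allowed label lists."""
    return (
        "You are an expert business analyst. Given a conversation between a "
        "client and a chatbot, identify the products discussed, the sentiment "
        "around each product, and the features that are liked or disliked. "
        "When identifying a product, separate type, brand, and name. "
        f"Choices for sentiment: {', '.join(sentiments)}. "
        f"Choices for product features: {', '.join(features)}. "
        "Return a JSON list only, following this template: "
        f"{json.dumps(example)}"
    )

system_prompt = build_system_prompt(SENTIMENTS, FEATURES, EXAMPLE)
```

Serializing the template with `json.dumps` also avoids hand-escaping quotes inside the prompt string.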

In [None]:
# Build a single chat message in the format expected by the chat completion API
def make_message(role, content):
    return {"role": role, "content": content, "name": role}

prompt = "You are an expert business analyst. Given a conversation between a client and a chatbot, you identify the list of products discussed, the sentiment around each product, and the features that are liked or disliked, and provide the results as JSON.\
            When identifying a product, separate type, brand, and name. \
            Choices for sentiment: enthusiastic, neutral, obstructionist. \
            Choices for product features (liked or disliked): design, usability, color, size, price, brand. \
            Please return a list of JSON objects only. Please use the example below as a template: \
            [{\"product\": {\"type\": \"Climbing\", \"brand\": \"Gravitator\", \"name\": \"Gravity Beam Climbing Rope\"}, \"sentiment\": \"enthusiastic\", \"features\": {\"like\": [\"design\", \"usability\", \"color\", \"size\", \"price\", \"brand\"], \"dislike\": []}}, {\"product\": {\"type\": \"Footwear\", \"brand\": \"Raptor Elite\", \"name\": \"Trek Xtreme Hiking Shoes\"}, \"sentiment\": \"enthusiastic\", \"features\": {\"like\": [\"design\", \"usability\", \"color\", \"size\", \"price\", \"brand\"], \"dislike\": []}}, {\"product\": {\"type\": \"Navigation\", \"brand\": \"AirStrider\", \"name\": \"VenturePro GPS Watch\"}, \"sentiment\": \"enthusiastic\", \"features\": {\"like\": [\"design\", \"usability\", \"price\", \"brand\"], \"dislike\": []}}]\
            "

# UDF to transform user request into the desired list of dictionaries
def transform_to_messages(messages):
    system_message = make_message("system", prompt)
    user_message = make_message("user", messages)
    return [system_message, user_message]

# Define the schema for the list of dictionaries
message_schema = ArrayType(
    StructType([
        StructField("role", StringType(), True),
        StructField("content", StringType(), True),
        StructField("name", StringType(), True)
    ])
)

# Register the UDF
transform_udf = udf(transform_to_messages, message_schema)

# Apply the UDF to the DataFrame
df_transformed = df.withColumn("messages", transform_udf(col("Messages")))

# Show the transformed DataFrame
display(df_transformed.limit(2))

The following code snippet shows how to run chat completion over all of the chat conversations.

In [None]:
chat_completion = (
    OpenAIChatCompletion()
    .setDeploymentName("gpt-4-32k") # deploymentName could be one of {gpt-35-turbo-0125 or gpt-4-32k}
    .setMessagesCol("messages")
    .setErrorCol("error")
    .setOutputCol("chat_completions")
)

df_analyzed = chat_completion.transform(df_transformed).select(
        "messages", "chat_completions.choices.message.content")

display(df_analyzed.limit(2))

#### 3) Defining and Applying a Schema for Targeted Sentiment Extraction from LLM Responses

The following line of code adds a new column called "json_string" to the df_analyzed DataFrame. This new column contains the concatenated values from the "content" column, combined into a single string without any separators. This can be useful for further processing or analysis where a single string representation of the content is needed.

In [None]:
df_parsed = df_analyzed.withColumn("json_string", F.concat_ws("", col("content")))

In [None]:
display(df_parsed.limit(2))

In the following section, we define the schema for parsing JSON data and apply it to the DataFrame. This schema outlines the structure of the JSON data, including product details, sentiment, and features. By using this schema, we can transform the JSON strings into structured data, making it easier to analyze and process.



In [None]:
# Define schema for JSON parsing
schema = ArrayType(
    StructType([
        StructField("product", StructType([
            StructField("type", StringType(), True),
            StructField("brand", StringType(), True),
            StructField("name", StringType(), True)
        ]), True),
        StructField("sentiment", StringType(), True),
        StructField("features", StructType([
            StructField("like", ArrayType(StringType()), True),
            StructField("dislike", ArrayType(StringType()), True)
        ]), True)
    ])
)

In [None]:
# Parse JSON string into array of structs
df_parsed_schema = df_parsed.withColumn("parsed_json", from_json(col("json_string"), schema))
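
A model response is not guaranteed to be clean JSON: it may be wrapped in a Markdown code fence or contain an apology instead of data, and `from_json` yields null for strings it cannot parse. The pure-Python sketch below shows one way such responses could be cleaned and validated before parsing. `parse_llm_json` is a hypothetical helper, not part of this notebook or of SynapseML, and the validation rules are assumptions based on the prompt.

```python
import json

ALLOWED_SENTIMENTS = {"enthusiastic", "neutral", "obstructionist"}
REQUIRED_KEYS = {"product", "sentiment", "features"}
FENCE = "`" * 3  # Markdown code-fence marker

def parse_llm_json(raw: str) -> list:
    """Return the well-formed product records in a response, or [] on failure."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) ...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ... and everything from the closing fence onward
        text = text.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only records with the expected keys and a known sentiment label
    return [
        r for r in data
        if isinstance(r, dict)
        and REQUIRED_KEYS.issubset(r)
        and r["sentiment"] in ALLOWED_SENTIMENTS
    ]
```

Logic like this could be wrapped in a UDF and applied to `json_string` ahead of `from_json`, or used offline to measure how often responses fail to parse.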

Next, we explode the parsed JSON column to create a new row for each element in the array, allowing for more granular analysis of the data.

In [None]:
# Explode the parsed JSON column
df_exploded = df_parsed_schema.withColumn("exploded", explode(col("parsed_json")))

In [None]:
display(df_exploded.limit(2))

With the data now structured, we continue by flattening the nested fields. Promoting each nested field to its own top-level column puts a product, its sentiment, and its features on a single flat row, which simplifies analysis and processing.

We start by extracting the like and dislike features from the exploded JSON column.

In [None]:
# Promote the nested `features` struct to a top-level column
df_exploded = df_exploded.withColumn("features", col("exploded.features"))

In [None]:
# Extract `like` and `dislike` features
df_exploded = df_exploded.withColumn("like_features", col("features.like"))
df_exploded = df_exploded.withColumn("dislike_features", col("features.dislike"))


We iterate through the list of features and create two columns per feature: `liked_<feature>` is 1 if the feature appears in the like list and 0 otherwise, and `disliked_<feature>` is 1 if it appears in the dislike list and 0 otherwise. These binary flags refine the dataset and make it easier to analyze specific aspects of product features.

In [None]:
features_list = ["price", "size", "usability", "color", "brand", "design"]

for feature in features_list:
    df_exploded = df_exploded.withColumn(
        f"liked_{feature}",
        when(array_contains(col("like_features"), lit(feature)), 1).otherwise(0)
    ).withColumn(
        f"disliked_{feature}",
        when(array_contains(col("dislike_features"), lit(feature)), 1).otherwise(0)
    )
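
To make the flag columns concrete, here is the same logic for a single record in plain Python. `feature_flags` is a hypothetical helper used only for illustration; treating a missing list as empty mirrors how the `when(...).otherwise(0)` expression falls through to 0 when the array is null.

```python
features_list = ["price", "size", "usability", "color", "brand", "design"]

def feature_flags(like, dislike):
    """Mirror the liked_*/disliked_* columns for one exploded record."""
    like, dislike = set(like or []), set(dislike or [])
    flags = {}
    for feature in features_list:
        flags[f"liked_{feature}"] = int(feature in like)
        flags[f"disliked_{feature}"] = int(feature in dislike)
    return flags

# A record that likes design and price but dislikes size
row = feature_flags(like=["design", "price"], dislike=["size"])
```

Here `row` has twelve keys, with `liked_design`, `liked_price`, and `disliked_size` set to 1 and every other flag set to 0.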


We continue flattening the nested information by extracting the sentiment and encoding it as one binary column per sentiment value.

In [None]:
df_exploded = df_exploded.withColumn("sentiment", col("exploded.sentiment"))

In [None]:
sentiment_list = ["enthusiastic", "neutral", "obstructionist"]

for sent in sentiment_list:
    df_exploded = df_exploded.withColumn(
        sent,
        F.when(F.col("sentiment") == sent, 1).otherwise(0)
    )

In [None]:
display(df_exploded.limit(1))

In [None]:
# Extract fields from product
df_exploded = df_exploded \
    .withColumn("product_type", col("exploded.product.type")) \
    .withColumn("product_brand", col("exploded.product.brand")) \
    .withColumn("product_name", col("exploded.product.name"))
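
The flattening performed across the cells above can be summarized for a single parsed record as follows. This is a plain-Python sketch with a hypothetical `flatten_record` helper; the sample record comes from the example embedded in the prompt.

```python
def flatten_record(record: dict) -> dict:
    """Promote nested product, sentiment, and feature fields to one flat row."""
    product = record.get("product") or {}
    features = record.get("features") or {}
    return {
        "product_type": product.get("type"),
        "product_brand": product.get("brand"),
        "product_name": product.get("name"),
        "sentiment": record.get("sentiment"),
        "like_features": features.get("like") or [],
        "dislike_features": features.get("dislike") or [],
    }

record = {
    "product": {"type": "Navigation", "brand": "AirStrider",
                "name": "VenturePro GPS Watch"},
    "sentiment": "enthusiastic",
    "features": {"like": ["design", "usability", "price", "brand"], "dislike": []},
}
flat = flatten_record(record)
```

Missing nested fields come back as `None` or an empty list rather than raising, which is comparable to Spark returning null for an absent struct field.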

#### 4) Integrating Product Catalog to Retrieve Product IDs

In the next section, we join the data from the product catalog. In real-world applications, we may not always have access to the product IDs being used. Therefore, we retrieve the product catalog data to enrich our dataset with additional product information.

In [None]:
df_product_catalog = spark.sql("SELECT * FROM fabcondemo.product_catalog")
display(df_product_catalog.limit(1))

Now let's join with the product catalog to identify the product IDs associated with each product.

In [None]:
# Case-insensitive, whitespace-trimmed exact match on the product name, written as a SQL expression
df_joined = df_exploded.join(
    df_product_catalog,
    (expr("lower(trim(product_name)) = lower(trim(name))")),
    "left"
)
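
The join condition is an exact match after normalization, not a fuzzy match, so a name the model abbreviates or misspells will not find a catalog row and the left join leaves its catalog columns null. The sketch below reproduces that behavior in plain Python. The product IDs are hypothetical, and `str.strip()` only approximates SQL `trim`, which removes spaces rather than all whitespace.

```python
def normalize_name(name):
    """Approximate lower(trim(name)); None stays None, like SQL NULL."""
    return None if name is None else name.strip().lower()

catalog = {  # hypothetical product_id -> catalog name
    "p-1": "Gravity Beam Climbing Rope",
    "p-2": "Trek Xtreme Hiking Shoes",
}
lookup = {normalize_name(name): pid for pid, name in catalog.items()}

def match_product(extracted_name):
    """Return the catalog id for an extracted name, or None when unmatched."""
    return lookup.get(normalize_name(extracted_name))

matched = match_product("  trek xtreme hiking shoes ")
unmatched = match_product("Trek Xtreme Shoes")
```

In this sketch `matched` resolves to `"p-2"` while `unmatched` is `None`, which is why counting rows with a null product ID after the join is a useful quality check.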


In [None]:
# Drop unwanted columns (e.g., _ts) from the product catalog DataFrame
df_joined = df_joined.drop(df_product_catalog["_ts"])

In [None]:
display(df_joined.limit(2))

#### 5) Aggregating and Summarizing Product Feature Sentiments: Calculating Counts and Sums by Product ID

Next, we build aggregation expressions for each feature flag and each sentiment, plus a count of rows for each product ID. Aggregating these columns quantifies the number of mentions, likes, and dislikes for each product feature, as well as the overall sentiment distribution, and provides a comprehensive overview of the data.

In [None]:
from pyspark.sql.functions import expr, count, sum as spark_sum

# Define the features and their column names
features = {
    "price": ["liked_price", "disliked_price"],
    "size": ["liked_size", "disliked_size"],
    "usability": ["liked_usability", "disliked_usability"],
    "color": ["liked_color", "disliked_color"],
    "brand": ["liked_brand", "disliked_brand"],
    "design": ["liked_design", "disliked_design"]
}

# Initialize an empty list to collect the aggregation expressions
aggregate_exprs = []

for feature, cols in features.items():
    liked_col, disliked_col = cols
    aggregate_exprs.append(
        spark_sum(when(col(liked_col) == 1, 1).otherwise(0)).alias(f"sum_liked_{feature}")
    )
    aggregate_exprs.append(
        spark_sum(when(col(disliked_col) == 1, 1).otherwise(0)).alias(f"sum_disliked_{feature}")
    )

# Add aggregation expressions for each sentiment
for sentiment in sentiment_list:
    aggregate_exprs.append(
        spark_sum(when(col(sentiment) == 1, 1).otherwise(0)).alias(f"sum_{sentiment}")
    )
    
# Add the count of rows for each product_id
aggregate_exprs.append(
    count("*").alias("sum_mentions")
)

# Aggregate counts for each feature
aggregated_counts = df_joined.groupBy("id").agg(*aggregate_exprs)

# Show the new DataFrame
display(aggregated_counts.limit(2))
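
As a worked example of what the grouped aggregation computes, the sketch below sums a few flag columns per product over three hypothetical flattened rows. The rows and IDs are made up for illustration and only a subset of the flag columns is shown.

```python
from collections import defaultdict

rows = [  # hypothetical flattened rows: one per product mention
    {"id": "p-1", "liked_price": 1, "disliked_size": 0, "enthusiastic": 1},
    {"id": "p-1", "liked_price": 0, "disliked_size": 1, "enthusiastic": 0},
    {"id": "p-2", "liked_price": 1, "disliked_size": 0, "enthusiastic": 1},
]
flag_columns = ["liked_price", "disliked_size", "enthusiastic"]

totals = defaultdict(lambda: defaultdict(int))
for r in rows:
    bucket = totals[r["id"]]
    for column in flag_columns:
        bucket[f"sum_{column}"] += r[column]   # equivalent of sum(...) per group
    bucket["sum_mentions"] += 1                # equivalent of count("*") per group
```

Product `p-1` ends up with two mentions, one price like, and one size dislike; `p-2` has a single enthusiastic mention.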


In [None]:
# Rename the column 'id' to 'product_id'
aggregated_counts = aggregated_counts.withColumnRenamed("id", "product_id")
display(aggregated_counts.limit(2))


To optionally write our analysis back to CosmosDB, each record needs a unique `id`. We create a user-defined function (UDF) that generates universally unique identifiers (UUIDs).

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import uuid

# Define a UDF to generate UUIDs
def generate_uuid():
    return str(uuid.uuid4())

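# Mark the UDF as non-deterministic so Spark does not assume repeated calls return the same value
uuid_udf = udf(generate_uuid, StringType()).asNondeterministic()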
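
A random `uuid4` gives every run a fresh set of identifiers, so rerunning the notebook produces new documents rather than updating earlier ones. If stable identifiers are preferred, one alternative (not what this notebook does) is to derive the identifier from the product ID with `uuid5`. The namespace value below is a hypothetical choice, and whether a repeated `id` overwrites or conflicts depends on the write configuration of the target store.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes
ANALYTICS_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "chat-analytics.example")

def stable_id(product_id) -> str:
    """Derive a repeatable document id from the product id."""
    return str(uuid.uuid5(ANALYTICS_NAMESPACE, str(product_id)))

first = stable_id("p-1")
second = stable_id("p-1")
other = stable_id("p-2")
```

The same product always maps to the same identifier, while different products map to different ones.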

By initializing all product counts with zero, we ensure that even products not mentioned in recent conversations are included in the dataset. This step allows us to later combine these initialized counts with the actual mentions from the chat data, ensuring a comprehensive analysis.

Additionally, we add a UUID column to uniquely identify each record.

In [None]:
selected_columns = df_product_catalog.select("Id", "Type", "Brand", "Name").withColumnRenamed("Id", "product_id")
# # Convert the id column to string
# selected_columns = selected_columns.withColumn("id", col("id").cast("string"))
# Add new columns with default values
df_analytics_counts = selected_columns.withColumn("sum_liked_price", lit(0)) \
    .withColumn("sum_disliked_price", lit(0)) \
    .withColumn("sum_liked_size", lit(0)) \
    .withColumn("sum_disliked_size", lit(0)) \
    .withColumn("sum_liked_usability", lit(0)) \
    .withColumn("sum_disliked_usability", lit(0)) \
    .withColumn("sum_liked_color", lit(0)) \
    .withColumn("sum_disliked_color", lit(0)) \
    .withColumn("sum_liked_brand", lit(0)) \
    .withColumn("sum_disliked_brand", lit(0)) \
    .withColumn("sum_liked_design", lit(0)) \
    .withColumn("sum_disliked_design", lit(0)) \
    .withColumn("sum_enthusiastic", lit(0)) \
    .withColumn("sum_neutral", lit(0)) \
    .withColumn("sum_obstructionist", lit(0)) \
    .withColumn("sum_mentions", lit(0))

# Add a UUID column to your DataFrame
df_analytics_counts = df_analytics_counts.withColumn("id", uuid_udf())
# Show the resulting DataFrame
display(df_analytics_counts.limit(2))

Next, we perform an outer join between the product catalog initialization data and the aggregated counts. This ensures that even products not mentioned in recent conversations are included in the analysis results. We update the columns with the aggregated counts, combining the initialized counts with the actual mentions from the chat data. This step provides a comprehensive overview of the data, facilitating a complete view of the product sentiments.

In [None]:
from pyspark.sql.functions import col, coalesce, lit

# Define the features and their column names (same as before)
features = {
    "price": ["liked_price", "disliked_price"],
    "size": ["liked_size", "disliked_size"],
    "usability": ["liked_usability", "disliked_usability"],
    "color": ["liked_color", "disliked_color"],
    "brand": ["liked_brand", "disliked_brand"],
    "design": ["liked_design", "disliked_design"]
}

# Define sentiment columns and their aggregations
sentiment_list = ["enthusiastic", "neutral", "obstructionist"]

# Define the columns to update
update_columns = {
    **{f"sum_liked_{feature}": f"sum_liked_{feature}" for feature in features.keys()},
    **{f"sum_disliked_{feature}": f"sum_disliked_{feature}" for feature in features.keys()},
    **{f"sum_{sentiment}": f"sum_{sentiment}" for sentiment in sentiment_list},
    # Carry the mention count through as well; it is initialized and aggregated above
    "sum_mentions": "sum_mentions",
}

# Perform the outer join
combined_df = df_analytics_counts.alias("analytics") \
    .join(aggregated_counts.alias("agg"), "product_id", "outer") \
    .select(
        col("analytics.id"),
        col("analytics.Type"),
        col("analytics.Brand"),
        col("analytics.Name"),
        *[
            (coalesce(col(f"analytics.{col_name}"), lit(0)) + coalesce(col(f"agg.{col_name}"), lit(0))).alias(col_name)
            for col_name in update_columns.keys()
        ]
    )

# Show the resulting DataFrame
display(combined_df.limit(2))
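
The effect of the outer join followed by `coalesce(..., 0) + coalesce(..., 0)` can be seen on a small example: every catalog product gets a row, and products without aggregated counts keep their zero defaults. The catalog IDs and counts below are hypothetical.

```python
catalog_ids = ["p-1", "p-2", "p-3"]  # hypothetical catalog products
aggregated = {                        # hypothetical counts from the chat data
    "p-1": {"sum_liked_price": 1, "sum_neutral": 2},
}
columns = ["sum_liked_price", "sum_neutral"]

combined = {
    # 0 is the initialized value; a missing aggregate is treated as 0 as well
    pid: {c: 0 + aggregated.get(pid, {}).get(c, 0) for c in columns}
    for pid in catalog_ids
}
```

Products `p-2` and `p-3` appear with all-zero counts, while `p-1` carries its aggregated values.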


#### 6) Storing Aggregated Analysis Results in a Lakehouse Table

In this final step, we save the combined DataFrame as a Lakehouse table. We use the Delta format to ensure efficient storage and querying. By appending the data to the existing table, we maintain a comprehensive record of the analysis results across runs. Storing this data in a lakehouse allows us to easily access it for future reporting purposes.

In [None]:
path = 'chat_analysis_results'
#combined_df.write.format("parquet").mode("overwrite").save(path)
combined_df.write.mode("append").format("delta").saveAsTable(path)

#### 7) Optional: Update chat_analytics container in the database 

In addition to storing the data in a lakehouse for future reporting purposes, we can optionally write the results back to the original CosmosDB account. By maintaining a copy of the data in CosmosDB, we can leverage its capabilities for fast and scalable data access, while also benefiting from the analytical capabilities of the lakehouse.

We will use the CosmosDB Spark OLTP connector (`cosmos.oltp`) to interact with CosmosDB. To do this, we need to specify the connection string in the hidden cell below:

nosql_conn_string = 'AccountEndpoint=ENDPOINT;AccountKey=KEY'


In [None]:
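endpoint = 'https://yourcosmosdbaccount.documents.azure.com:443/'
key = 'Account Key'  # placeholder; avoid committing a real key to the notebook
# Build the connection string from the values above instead of a fixed literal
nosql_conn_string = f'AccountEndpoint={endpoint};AccountKey={key}'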
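
If only the connection string is available, the endpoint and key can be recovered from it. The sketch below uses a hypothetical helper, `parse_cosmos_conn_string`, and a made-up key; it splits each segment on the first `=` only, since account keys are base64 text that commonly ends in `=` padding.

```python
def parse_cosmos_conn_string(conn: str) -> dict:
    """Split 'AccountEndpoint=...;AccountKey=...' into its named parts."""
    parts = {}
    for segment in conn.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing semicolon
        # partition() splits on the first '=' and leaves later ones in the value
        name, _, value = segment.partition("=")
        parts[name.strip()] = value.strip()
    return parts

# Placeholder values for illustration only
conn = ("AccountEndpoint=https://yourcosmosdbaccount.documents.azure.com:443/;"
        "AccountKey=abc123==;")
parsed = parse_cosmos_conn_string(conn)
```

The parsed values could then populate `endpoint` and `key` for the connector configuration.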

In [None]:
db_name_analytics = 'chatbot'
#container_name_tracker = "chat_tracker"
container_name_analysis = 'chat_analytics' 


config_analysis = {
    "spark.cosmos.accountEndpoint": endpoint,
    "spark.cosmos.accountKey": key,
    "spark.cosmos.database": db_name_analytics,
    "spark.cosmos.container": container_name_analysis 
}


# Configure the Catalog API
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", config_analysis["spark.cosmos.accountEndpoint"])
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", config_analysis["spark.cosmos.accountKey"])


In [None]:
display(combined_df.limit(2))

In [None]:
combined_df_filtered = combined_df.filter(col('id').isNotNull())
display(combined_df_filtered.limit(5))

In [None]:
# Write the aggregated analysis results to the CosmosDB container
combined_df_filtered.write\
  .format("cosmos.oltp") \
  .options(**config_analysis) \
  .mode("APPEND") \
  .save()