# Sentiment Analysis with mixed languages on Azure Synapse Analytics

This notebook demonstrates the use of cognitive services with mixed languages for sentiment analysis.
As Greek is not currently supported for sentiment analysis with the language cognitive service, we use a translator service to first translate Greek to English.

This is a Synapse Spark notebook and must be attached to an Azure Synapse Spark pool to run.
As a prerequisite, I provisioned a Language service and a Translator service (both Cognitive Services) and created a linked service for each in Synapse, named "TextAnalytics1" and "translator2" respectively:

| Translator Service | Language Service |
| :-: | :-: |
| !["translator2"](img/translator_SynLS.PNG) | !["textAnalytics1"](img/textanalytics2_SynLS.PNG) |


In [1]:
#imports and definitions
import mmlspark
from mmlspark.cognitive import *
from notebookutils import mssparkutils
from pyspark.sql.functions import col, flatten, arrays_zip, explode, expr, lit, concat, array, collect_list

#the relevant linked services should have been created in Synapse already
translate_service_name = "translator2" #translator service
sentiment_service_name = "TextAnalytics1" #language service



In addition, I uploaded a mock input file (tab-delimited text) called "synapse_comments_tab.txt" to Azure Data Lake.
The file has a "comment" column with mock customer comments in both English and Greek, and a "timestamp" column.
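
For readers who want to reproduce the input, here is a minimal sketch of how a file with this layout could be generated locally. The sample rows and the timestamp format are made up for illustration (the real file's contents are not shown in this notebook), and uploading the file to the data lake is a separate step.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; the values and the timestamp format are assumptions.
rows = [
    {"comment": "Great service, thank you!", "timestamp": "2022-01-01 10:00:00"},
    {"comment": "Πολύ κακή εξυπηρέτηση.", "timestamp": "2022-01-01 10:05:00"},
]

# Write to a temporary local path with a header row and tab delimiter.
path = os.path.join(tempfile.mkdtemp(), "synapse_comments_tab.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["comment", "timestamp"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm that header and delimiter round-trip.
with open(path, encoding="utf-8", newline="") as f:
    loaded = list(csv.DictReader(f, delimiter="\t"))
```

Writing with UTF-8 encoding matters here because the Greek comments are not representable in ASCII.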

In [2]:
#read the mock input file into a data frame
df = ( spark.read    
    .option("header", "true")  
    .option("delimiter", "\t")
    .csv('abfss://matt454-adlfs-eastus@matt454adleastus.dfs.core.windows.net/synapse/workspaces/matt454-synapse-eastus/Raw_Data/synapse_comments_tab.txt')
)

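#reformat the data frame as a single row with arrays of strings to use with translator service
#both arrays are built in one aggregation instead of joining two single-row frames without a key
#(note that collect_list skips nulls, so both columns should be non-null to stay aligned)
detectDF = df.agg(
    collect_list('comment').alias('text'),
    collect_list('timestamp').alias('timestamp')
)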


Now we can call the translator service. I first detect the language of each comment, so that only the Greek comments are sent for translation.

In [3]:
#detect the language of each comment

## call translator service (language detection)
detect = (Detect()
    .setLinkedService(translate_service_name)
    .setTextCol("text")
    .setOutputCol("result"))

## keep original text and detected language
df_detected = ( 
    detect
    .transform(detectDF) 
    .withColumn("language", col("result.language")) 
    .select("text","language","timestamp") 
    #and format them into separate rows
    .withColumn("tmp", arrays_zip("text", "language","timestamp")) 
    .withColumn("tmp", explode("tmp")) 
    .select(col("tmp.text"), col("tmp.language"),col("tmp.timestamp")) 
    )

## keep rows in greek to translate to english
df_greek = df_detected.where(df_detected.language=="el")

## keep rows in english and 'translate to english'
df_english2english = ( df_detected
    .where(df_detected.language=="en")
    .select(col("text"),col("text").alias("translation"),col("timestamp"))
)
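
The reshaping done so far (collecting the columns into arrays, zipping and exploding them back into rows, then splitting by language) can be mimicked in plain Python. This is only an illustrative sketch: the rows and the detected language codes are hypothetical stand-ins, and no service is called.

```python
# Hypothetical rows standing in for the input data frame.
rows = [
    ("Great service!", "2022-01-01 10:00:00"),
    ("Πολύ κακή εξυπηρέτηση.", "2022-01-01 10:05:00"),
]

# collect_list: fold each column into one list, giving a single "row" of arrays.
text = [comment for comment, _ in rows]
timestamp = [ts for _, ts in rows]

# Stand-in for the detection result: one language code per comment, in order.
# A real run gets these from the translator service.
language = ["en", "el"]

# arrays_zip + explode: pair the lists element-wise, one record per element.
detected = [
    {"text": t, "language": lang, "timestamp": ts}
    for t, lang, ts in zip(text, language, timestamp)
]

# Route by detected language, as the two where(...) filters do.
greek = [r for r in detected if r["language"] == "el"]
english = [
    {"text": r["text"], "translation": r["text"], "timestamp": r["timestamp"]}
    for r in detected
    if r["language"] == "en"
]
```

The element-wise pairing only gives correct records when the arrays have the same length and order, which is why the text, language, and timestamp arrays must stay aligned.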




Now we send the df_greek data frame, which contains the Greek records, for translation.
We then union the translated df_greek2english with df_english2english (in which the original English comment is copied as is into the "translation" column) to prepare the data frame for sentiment analysis.

In [4]:

translate = (Translate()
    .setLinkedService(translate_service_name)
    .setTextCol("text")
    .setFromLanguage("el")
    .setToLanguage(["en"])
    .setOutputCol("translation")
    .setConcurrency(5))
## translate greek to english and format
df_greek2english = (
  translate
      .transform(df_greek)
      .withColumn("translation", flatten(col("translation.translations")))
      .withColumn("translation", col("translation.text"))
      #.withColumn("translation",expr("substring(translation, 3, length(translation))"))
      .withColumn("translation",explode("translation"))
      .select("text","translation","timestamp")
)

## Union greek and english
df_translated = ( df_english2english
  .union(df_greek2english)
)
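
To make the flatten / explode chain in this cell easier to follow, here is a plain-Python sketch of the same unwrapping. It assumes the "translation" column mirrors the JSON shape returned by the Translator service (a list of results, each holding a "translations" list); the sample value is made up and is not real service output.

```python
from itertools import chain

# Hypothetical translator output for one comment (assumed shape, made-up value).
translation = [
    {"translations": [{"text": "Very bad service.", "to": "en"}]},
]

# translation.translations -> a list of lists, one inner list per result
nested = [result["translations"] for result in translation]

# flatten -> a single list of translation entries
flat = list(chain.from_iterable(nested))

# .text -> keep only the translated strings
texts = [entry["text"] for entry in flat]

# explode -> one output record per translated string
records = [{"translation": t} for t in texts]
```

With a single target language, each comment yields exactly one translated string, so the explode step does not multiply rows.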


We detect the sentiment of the records in df_translated and store the result (original comment, translated comment, sentiment category, and timestamp) in a table in the data lake, so that it can be used for reporting, for example in Power BI.

In [5]:
#sentiment analysis

df_translated = ( df_translated
    .withColumn("lang",lit("EN-US"))
)

sentiment = (TextSentiment()
    .setLinkedService(sentiment_service_name)
    .setTextCol("translation")
    .setOutputCol("sentiment")
    .setErrorCol("error")
    .setLanguageCol("lang"))


spark.sql("DROP TABLE IF EXISTS Sentiment_mockup")
( sentiment
    .transform(df_translated)
    .select("timestamp","text","translation", col("sentiment")[0]
        .getItem("sentiment")
        .alias("sentiment"))
    .write.format("parquet").saveAsTable('Sentiment_mockup')
)
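
The expression `col("sentiment")[0].getItem("sentiment")` picks the sentiment label out of a nested result. A plain-Python sketch of that lookup follows; the result shape (a list of entries, each with a "sentiment" field) and the sample values are assumptions for illustration, and `first_sentiment` is a hypothetical helper, not part of any library.

```python
def first_sentiment(result):
    """Return the label of the first entry, or None if there is no entry."""
    if not result:
        return None
    return result[0].get("sentiment")


# Hypothetical results: one populated, one empty (e.g. when the call failed).
ok_result = [{"sentiment": "negative", "confidenceScores": {"negative": 0.9}}]
empty_result = []

label = first_sentiment(ok_result)
missing = first_sentiment(empty_result)
```

Rows with a missing label are worth checking against the "error" column, which the sentiment transformer is configured to populate.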




In [6]:
# we can easily see the contents of the output table using Spark SQL
display(spark.sql("select * from Sentiment_mockup"))
