# AI / ML Easy Button!
## *Cortex SQL & Python with Audience Segment Data*
Welcome to the AI/ML Easy Button. This notebook lets you explore Snowflake's out-of-the-box Cortex functions and understand how they work with custom prompts in both SQL and Python.  

[LLM Docs](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions)

![](https://www.fatherhood.org/hs-fs/hubfs/Images/Blog/easy-button.png?width=585&name=easy-button.png)

In [None]:
# Import Python packages. Install them using the Packages menu in the upper right. This notebook needs snowflake-ml-python.
import streamlit as st
import pandas as pd
import json

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()
from snowflake.snowpark import functions as F
from snowflake.cortex import Complete, Sentiment, Summarize, Translate
from snowflake.ml.modeling.preprocessing import MinMaxScaler






In [None]:
--first create your own schema to work in!
create schema if not exists MY_NAME ;
use schema MY_NAME;
-- The schema shown below should be the one you just created (replace MY_NAME above with your own name)
select current_database(), current_schema();

### Translate  
https://docs.snowflake.com/en/sql-reference/functions/translate-snowflake-cortex

In [None]:
-- TRANSLATE
-- Translate text

-- Call the built-in SQL function with the text, the source language, and the target language
select
    snowflake.cortex.translate(
        'Health-conscious parents, aged 30-45, suburban residents, focused on organic foods and fitness activities.',
        'en',
        'es');

In [None]:
-- SQL Translate from Any Text to English...leave from language blank to infer it
select
    snowflake.cortex.translate(
        'Jóvenes profesionales, de 25 a 35 años, habitantes urbanos, interesados en los gadgets tecnológicos y la vida sostenible.',
        '',
        'en');

In [None]:
#PYTHON IS EASY TOO...Now we use the same function directly from Python
Translate(
    "Padres conscientes de la salud, entre 30 y 45 años, residentes suburbanos, se centraron en alimentos orgánicos y actividades físicas.",
    "",
    "en"
)

### Sentiment 
https://docs.snowflake.com/en/sql-reference/functions/sentiment-snowflake-cortex

In [None]:
select
    REVIEW_TEXT,
    snowflake.cortex.sentiment(
        snowflake.cortex.translate( ---chained together, translate and sentiment
        REVIEW_TEXT,
        '',
        'en'
        ) 
    ) as review_sentiment
from
    LAB_DATA.PUBLIC.REVIEWS limit 100

### COMPLETE  
https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex

In [None]:
-- SQL to leverage an LLM... it's this easy! And you can switch between models in seconds.
select
   AUDIENCE_DESC, snowflake.cortex.complete(
        'llama3-8b',
        'Write a  marketing email for a new computer for this audience : ' || AUDIENCE_DESC
    ) from  LAB_DATA.PUBLIC.SEGMENTS limit 1

---can also imagine joining the above to a product table to include the product details for a very customized email

In [None]:
# Similar functionality is available through the Python functions
pd_df = session.table("LAB_DATA.PUBLIC.SEGMENTS").limit(1).to_pandas()
Complete('llama3-8b','Write a  marketing email for a new computer for this audience : ' + pd_df["AUDIENCE_DESC"][0])

In [None]:
--can classify your text data, can also easily extract data from text
select
   AUDIENCE_DESC, snowflake.cortex.complete(
        'llama3-8b',
        '[INST]
### 
Based on the following audience description, would this audience be interested in a computer? Answer with only one of the following words - \
"Likely" or "Unlikely" or "Unsure". Make sure there is no additional text.
Review -
###' || AUDIENCE_DESC) as CLEANED_REVIEWS
     from  LAB_DATA.PUBLIC.SEGMENTS limit 5
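
Even with a strict prompt, a model may return the label with stray whitespace, quotes, or a trailing period, which would break an exact comparison such as `== 'Likely'`. The following is a minimal sketch of one way to clean the labels in Python before filtering. The helper name `normalize_label` and the fallback to "Unsure" are assumptions for illustration, not part of the Cortex API.

```python
# Hypothetical helper: map raw LLM output onto one of the three allowed labels.
ALLOWED_LABELS = ("Likely", "Unlikely", "Unsure")

def normalize_label(raw, default="Unsure"):
    """Return the matching allowed label, or `default` if the output is anything else."""
    # Drop surrounding whitespace, then surrounding quotes and periods.
    cleaned = raw.strip().strip('."\'').lower()
    for label in ALLOWED_LABELS:
        # Exact match on purpose: "unlikely" contains "likely", so substring checks would misfire.
        if cleaned == label.lower():
            return label
    return default

print(normalize_label(' "Likely". '))
print(normalize_label("unlikely\n"))
print(normalize_label("I think they are likely to buy"))
```

On a pandas dataframe this could be applied with something like `pd_df['CLEANED_REVIEWS'].map(normalize_label)` before comparing against `'Likely'`.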

In [None]:
# Combinations are possible here as well: find the audiences likely to buy a computer (maybe with some other filtering)
# and write an email to them.

# SQL cell results can be accessed from Python as dataframes.
# This assumes the SQL cell above is named cleaned_data.
pd_df = cleaned_data.to_pandas()
filtered_pd_df = pd_df[ (pd_df['AUDIENCE_DESC'].str.contains('active')) & (pd_df['CLEANED_REVIEWS'] == 'Likely') ]
Complete('llama3-8b','Write a  marketing email for a new computer for this audience : ' + filtered_pd_df['AUDIENCE_DESC'].iloc[0])


In [None]:
# Python version over a table. This returns a Snowpark DataFrame, which you can convert to a pandas DataFrame with .to_pandas().
# Here we ask for a text message instead of an email.
df = session.table("LAB_DATA.PUBLIC.SEGMENTS").limit(3).select(
    F.col("AUDIENCE_DESC"),
    Complete(
        "mistral-large",
        F.concat(
            F.lit("""Write a marketing SMS for a new computer for this audience : 
            """),
            F.col("AUDIENCE_DESC"),
            F.lit(" "))
    ).alias("SMS_MESSAGE")
).limit(3)
df.show()

In [None]:
###easily save to table in SQL or Python
df.write.mode("overwrite").save_as_table("SMS_MESSAGES_TO_SEND")

In [None]:
select * from SMS_MESSAGES_TO_SEND limit 5;

### EMBED_TEXT
https://docs.snowflake.com/sql-reference/functions/embed_text-snowflake-cortex

In [None]:
-- Snowflake's state-of-the-art embedding model
select
    snowflake.cortex.embed_text_768(
        'snowflake-arctic-embed-m',
        'Kids who love blue shirts'
    );

In [None]:
--vector embed all the segments in seconds
create or replace table SEGMENTS_EMBED as select
    audience_desc,
    snowflake.cortex.embed_text_768(
        'snowflake-arctic-embed-m',
        audience_desc
    ) as summary_embedding
from
    LAB_DATA.PUBLIC.SEGMENTS

What text data do you have at Simon?  Have you explored vector use cases?

### VECTOR DISTANCE CALCULATIONS  
https://docs.snowflake.com/en/sql-reference/functions/vector_cosine_similarity
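
As a rough illustration of what a cosine similarity score means, here is a minimal pure-Python sketch. It is not Snowflake's implementation, and the three-dimensional vectors are toy values chosen for the example (the embeddings in this notebook have 768 dimensions). Scores close to 1 mean the vectors point in the same direction, 0 means they are unrelated, and -1 means they point in opposite directions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, for illustration only)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Note that the score depends only on direction, not on vector length, which is why the first pair scores (approximately) 1 even though one vector is twice as long as the other.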

In [None]:
select
    vector_cosine_similarity(
        snowflake.cortex.embed_text_768('snowflake-arctic-embed-m', 'California Contemporary style'),
        snowflake.cortex.embed_text_768('snowflake-arctic-embed-m', 'California Contemporary style homes')
    );

In [None]:
select 
    audience_desc,
    vector_cosine_similarity(
        summary_embedding,
        snowflake.cortex.embed_text_768('snowflake-arctic-embed-m', 'Hipster Men living in the suburbs')
    ) as similarity
from
    SEGMENTS_EMBED
order by
    similarity desc
limit 10;

In [None]:
---check for potential audience overlap
select 
    a.audience_desc, b.audience_desc,
    vector_cosine_similarity(
        a.summary_embedding,
        b.summary_embedding
    ) as similarity
from
    SEGMENTS_EMBED a
    cross join SEGMENTS_EMBED b  
where 
1=1
and similarity < 1
and a.AUDIENCE_DESC < b.AUDIENCE_DESC
order by
    similarity desc
limit 10;

## Question: 
In the above SQL, how would I list the least similar audiences?

Vector embeddings can also be used to give 'guardrails' to an LLM. Feeding the LLM the most similar/relevant text gives the model better context and reduces hallucinations.

## Retrieval-Augmented Generation (RAG) 
RAG is the process of optimizing the output of a large language model so that it references an authoritative knowledge base outside of its training data before generating a response.
We do this in two steps when querying the LLM:
1. First we take the question and get the relevant data using vector cosine similarities (as shown above)
2. Next we use that relevant data in the prompt to the LLM with our question
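
The two steps above can be sketched in plain Python, without Snowflake, to show the shape of the logic. Everything in this block is a toy assumption: the documents, the two-dimensional "embeddings", and the helper names `retrieve` and `build_prompt` are hypothetical, and a real run would use Cortex embeddings and send the prompt to an LLM.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(question_vec, documents, k=2):
    """Step 1: return the k documents whose embeddings are most similar to the question."""
    ranked = sorted(
        documents,
        key=lambda d: cosine_similarity(question_vec, d["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, context_docs):
    """Step 2: place the retrieved text in the prompt together with the question."""
    context = "\n".join(d["text"] for d in context_docs)
    return (
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>\nAnswer: "
    )

# Hypothetical toy corpus with made-up 2-d embeddings
toy_documents = [
    {"text": "Doc about racing games.", "embedding": [1.0, 0.0]},
    {"text": "Doc about crime games set in the 1980s.", "embedding": [0.0, 1.0]},
    {"text": "Doc about open-world crime games.", "embedding": [0.3, 0.9]},
]
toy_question = "What year is the crime game set in?"
toy_question_vec = [0.1, 1.0]  # stand-in for an embedding of the question

top_docs = retrieve(toy_question_vec, toy_documents, k=2)
toy_prompt = build_prompt(toy_question, top_docs)
print(toy_prompt)
```

Only the two most relevant documents reach the prompt, so the model answers from that context rather than from whatever it happens to remember.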

In [None]:
#question = """What Cities does GTA V Take Place In?"""
#question = """Is Lebron James featured in any NBA 2K games?"""
question = """What year is Grand Theft Auto: Vice City set in?"""

In [None]:
# Step 1: get the relevant data
# Pass the question as a bind parameter so quotes in the question cannot break the SQL
relevant_titles = session.sql("""
   select distinct
            title, summary,
            vector_cosine_similarity(
                summary_embedding,
                snowflake.cortex.embed_text_768(
                  'snowflake-arctic-embed-m',
                  ?
                )
            ) as similarity
        from
            LAB_DATA.PUBLIC.GAMES_EMBED
  order by
      similarity desc
  limit 10""", params=[question])
relevant_titles.show()

In [None]:
#step 2 feed to the LLM using a specific prompt
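# Include each summary along with its title so the model has the text it needs to answer
info = '. | '.join([f"{row['TITLE']}: {row['SUMMARY']}" for row in relevant_titles.collect()]).replace("'", "")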
prompt = f"""
            You are a video game expert. Please provide knowledge and guidance to the questions in the tags <question> and </question> based on the provided 
            context found between the tags <context> and </context>.

            <context>
            '{info}'
            </context>
            <question>
            '{question}'
            </question>
            Answer: """
query = """
      select
          snowflake.cortex.complete(
              ?, 
              ?
          ) as response
      """
complete = session.sql(query, params=['mistral-large', prompt])
with st.chat_message(name="Assistant"):
    st.write(complete.collect()[0][0])

## Lookalike audiences through Vector Distance
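
The cells in this section scale each customer attribute to the 0 to 1 range, weight the spending score by 3, and rank customers by L2 (Euclidean) distance. As a minimal sketch of that idea in pure Python, the block below does the same on four made-up customers. The data and the helper names `min_max_scale`, `l2_distance`, and `lookalikes` are hypothetical, and this is not how `MinMaxScaler` or `VECTOR_L2_DISTANCE` are implemented internally.

```python
import math

def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers: id -> (age, income, spending_score)
toy_customers = {
    1: (25, 40000, 80),
    2: (30, 50000, 90),
    3: (60, 120000, 20),
    4: (32, 52000, 85),
}
ids = list(toy_customers)
columns = list(zip(*(toy_customers[i] for i in ids)))  # one tuple per attribute
scaled_columns = [min_max_scale(col) for col in columns]
weights = [1.0, 1.0, 3.0]  # spending score counts 3x, mirroring the notebook

# One weighted, normalized profile vector per customer
profiles = {
    cid: [w * col[row] for w, col in zip(weights, scaled_columns)]
    for row, cid in enumerate(ids)
}

def lookalikes(target_id, profiles):
    """Return (distance, customer_id) pairs for every other customer, closest first."""
    return sorted(
        (l2_distance(profiles[target_id], vec), cid)
        for cid, vec in profiles.items()
        if cid != target_id
    )

ranked = lookalikes(2, profiles)
print(ranked)
```

Without the scaling step, income would dominate the distance simply because its raw values are thousands of times larger than the other attributes.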

In [None]:
import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(42)

# Generate random data for 100 rows
data = {
    'CUSTOMER_ID': np.arange(1, 101),
    'AGE': np.random.randint(18,80, size=100),
    'INCOME': np.random.randint(30000, 150000, size=100),
    'SPENDING_SCORE': np.random.randint(1, 100, size=100),
    'LIFETIME_VALUE': np.random.randint(50, 500, size=100),
    'DAYS_SINCE_LAST_PURCHASE': np.random.randint(1, 365, size=100)
}
df = pd.DataFrame(data)
sp_df = session.create_dataframe(df)


sp_df.write.mode("overwrite").save_as_table("CUSTOMERS")

In [None]:
select * from CUSTOMERS limit 3;

In [None]:
sp_df = session.table("CUSTOMERS")
columns = sp_df.columns
input_cols = [col for col in columns if col != "CUSTOMER_ID"]
mms = MinMaxScaler(input_cols=input_cols, output_cols=input_cols)
df = mms.fit(sp_df).transform(sp_df)
cust_normal = df.join(sp_df, ["CUSTOMER_ID"],lsuffix="_NORM")
cust_normal.write.mode("overwrite").save_as_table("CUSTOMER_NORMAL")
cust_normal.show(10)



In [None]:
-- SPENDING_SCORE_NORM is multiplied by 3 so that it weighs more heavily in the distance calculation
create or replace table CUSTOMER_EMBED as select *,
    ["AGE_NORM", "INCOME_NORM", "SPENDING_SCORE_NORM" * 3, "LIFETIME_VALUE_NORM", "DAYS_SINCE_LAST_PURCHASE_NORM"]::VECTOR(FLOAT, 5) as profile_embed
from CUSTOMER_NORMAL;

In [None]:
select * from CUSTOMER_EMBED limit 3;

In [None]:
select
    best_customer.CUSTOMER_ID SEARCH_CUST_ID,
    lookalike_customer.CUSTOMER_ID LOOKALIKE_CUST_ID,
    VECTOR_L2_DISTANCE(
        best_customer.PROFILE_EMBED,
        lookalike_customer.PROFILE_EMBED
    )::NUMBER(10,5) as distance,
    best_customer.AGE,
    lookalike_customer.AGE as AGE_LIKE,
    best_customer.INCOME,
    lookalike_customer.INCOME as INCOME_LIKE,
    best_customer.SPENDING_SCORE,
    lookalike_customer.SPENDING_SCORE as SPENDING_SCORE_LIKE,
    best_customer.LIFETIME_VALUE,
    lookalike_customer.LIFETIME_VALUE as LIFETIME_VALUE_LIKE,
    best_customer.DAYS_SINCE_LAST_PURCHASE,
    lookalike_customer.DAYS_SINCE_LAST_PURCHASE as DAYS_SINCE_LAST_PURCHASE_LIKE
from
    CUSTOMER_EMBED best_customer
    cross join CUSTOMER_EMBED lookalike_customer
where
    best_customer.CUSTOMER_ID = 2
    -- exclude only the search customer itself, so every other customer is a candidate
    and best_customer.CUSTOMER_ID <> lookalike_customer.CUSTOMER_ID
order by
    distance asc
limit 20;

# Fine Tuning is in Private Preview!
Snowflake is making this easy too: you supply your training data to the model via a table and you get an additional version of the model, fine-tuned to your task.
Fine-tuning can be the best combo of results and cost. Don't have training data? Use a larger, more expensive LLM to label a set of data, then train a smaller, cheaper model on those labels. Boom: cheap and effective.
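
To make the "label with a big model, train a small one" idea concrete, here is a minimal sketch of assembling training rows in Python. The `fake_teacher` function is a hypothetical stand-in for a call to a larger LLM, and `build_training_rows` is a made-up helper. The `prompt` and `completion` column names are an assumption about what the fine-tuning feature expects, so check the docs linked below before relying on them.

```python
def build_training_rows(texts, teacher_label, instruction):
    """Pair each prompt with the label produced by the (more expensive) teacher model."""
    rows = []
    for text in texts:
        prompt = f"{instruction}\n{text}"
        rows.append({"prompt": prompt, "completion": teacher_label(prompt)})
    return rows

def fake_teacher(prompt):
    # Hypothetical stand-in for a large-model call; a real run would call an LLM here.
    return "Likely" if "tech" in prompt.lower() else "Unsure"

instruction = "Would this audience be interested in a computer? Answer Likely, Unlikely, or Unsure."
audiences = [
    "Young professionals interested in tech gadgets",
    "Retirees who enjoy gardening",
]
training_rows = build_training_rows(audiences, fake_teacher, instruction)
for row in training_rows:
    print(row)
```

Rows shaped like this could then be written to a table (for example with `session.create_dataframe(...)` and `save_as_table(...)`, as used earlier in this notebook) and supplied as the training data.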

[Snowflake Fine Tuning Docs](https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-finetuning?_fsi=wWDeqSCS&_fsi=wWDeqSCS)

[Snowflake Fine Tuning Workshop](https://quickstarts.snowflake.com/guide/finetuning_llm_using_snowflake_cortex_ai/index.html?index=..%2F..index#2)


# UI to Easily Create Chat Bots with (RAG & Search) is coming!
At Summit, our head of product randomly chose someone from the audience to create a chatbot in Snowflake. They had only logged into Snowflake 7 times and were able to create a RAG chatbot using the new UI in minutes.