# Using a language model in Snowflake

Example notebook showing how to use a large language model inside Snowflake.
We need version 4.14.1 of the Python `transformers` package specifically, because that version is also present in the Snowpark Anaconda channel.

Install:
pip install transformers==4.14.1 --user
pip install torch

Basically, we perform the following steps:
* load a model with the transformers package
* dump the model to disk with joblib
* create a STAGE in Snowflake and upload the dumped model there
* write a Python UDF that imports the model and scores the text in a table


In [1]:
#### function to see the current Snowflake Environment Details
def current_snowflake_env():
    snowflake_environment = session.sql('select current_user(), current_role(), current_database(), current_schema(), current_version(), current_warehouse()').collect()
    print('User                     : {}'.format(snowflake_environment[0][0]))
    print('Role                     : {}'.format(snowflake_environment[0][1]))
    print('Database                 : {}'.format(snowflake_environment[0][2]))
    print('Schema                   : {}'.format(snowflake_environment[0][3]))
    print('Warehouse                : {}'.format(snowflake_environment[0][5]))
    print('Snowflake version        : {}'.format(snowflake_environment[0][4]))


To connect to Snowflake, put the connection parameters in a config file (connection_config_trial.py) like this:

```
connection_parameters = {
    "account": "ORGNAME-ACCOUNTNAME", 
    "user": "snowflaketrialuser",
    "password": "Yourpassword!0",
    "warehouse": "COMPUTE_WH",
    "role": "ACCOUNTADMIN",
    "database": "SNOWFLAKE_SAMPLE_DATA",
    "schema": "TPCH_SF10"
}
```
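
A password in a Python file is easy to leak through version control. Below is a minimal sketch of an alternative, assuming the credentials are exported as environment variables. The variable names (`SNOWFLAKE_ACCOUNT` and so on) and the helper `connection_parameters_from_env` are made up for this example; Snowpark does not read them by itself.

```python
import os

# Hypothetical mapping from Snowpark connection keys to environment variable names.
ENV_VARS = {
    "account": "SNOWFLAKE_ACCOUNT",
    "user": "SNOWFLAKE_USER",
    "password": "SNOWFLAKE_PASSWORD",
    "warehouse": "SNOWFLAKE_WAREHOUSE",
    "role": "SNOWFLAKE_ROLE",
    "database": "SNOWFLAKE_DATABASE",
    "schema": "SNOWFLAKE_SCHEMA",
}


def connection_parameters_from_env(environ=None):
    """Build the connection dict from environment variables, failing early on gaps."""
    environ = os.environ if environ is None else environ
    missing = [name for name in ENV_VARS.values() if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: environ[name] for key, name in ENV_VARS.items()}
```

The resulting dict has the same keys as `connection_parameters` above, so it could be passed to `Session.builder.configs(...)` in the same way.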

If you don't have a Snowflake environment, sign up for one. It takes about 5 minutes to get a 30-day trial with 400 credits: [Snowflake trial](https://signup.snowflake.com/)

In [28]:
import pandas as pd

from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

### import connection parameters such as account, user, password, warehouse, database, schema
from connection_config_trial import connection_parameters

#### Set up a connection with Snowflake using snowpark and see the current environment details
session = Session.builder.configs(connection_parameters).create()
current_snowflake_env()

User                     : SNOWFLAKETRIALUSER
Role                     : ACCOUNTADMIN
Database                 : SNOWFLAKE_SAMPLE_DATA
Schema                   : TPCH_SF10
Warehouse                : COMPUTE_WH
Snowflake version        : 7.19.2


## Set up DWH (in trial account)

Create a database to hold the Netflix data, our LLM and the Python UDF.

In [31]:
session.sql(query="CREATE OR REPLACE database netflix").collect()
session.sql(query="USE SCHEMA netflix.public").collect()

[Row(status='Statement executed successfully.')]

In [47]:
### get the Netflix data from my GitHub repo
nflx = pd.read_csv('https://raw.githubusercontent.com/longhowlam/python_hobby_stuff/master/netflix.csv')
nflx.sample(4)

Unnamed: 0,SHOW_ID,TYPE,TITLE,DIRECTOR,CAST,COUNTRY,DATE_ADDED,RELEASE_YEAR,RATING,DURATION,LISTED_IN,DESCRIPTION
7355,s7356,TV Show,Unorthodox,,"Shira Haas, Amit Rahav, Jeff Wilbusch, Alex Re...",Germany,"March 26, 2020",2020,TV-MA,1 Season,"International TV Shows, TV Dramas",A Hasidic Jewish woman in Brooklyn flees to Be...
7770,s7771,Movie,Zinzana,Majid Al Ansari,"Ali Suliman, Saleh Bakri, Yasa, Ali Al-Jabri, ...","United Arab Emirates, Jordan","March 9, 2016",2015,TV-MA,96 min,"Dramas, International Movies, Thrillers",Recovering alcoholic Talal wakes up inside a s...
1506,s1507,Movie,Counterfeiting in Suburbia,Jason Bourque,"Sarah Butler, Larissa Albuquerque, Kayla Walla...",Canada,"July 1, 2018",2018,TV-14,88 min,"Dramas, Thrillers",Two teenagers print counterfeit money in their...
6370,s6371,Movie,The Forgotten,Oliver Frampton,"Clem Tibber, Elarica Johnson, James Doherty, S...",United Kingdom,"September 11, 2017",2014,TV-MA,89 min,Horror Movies,After a teenager goes to live with his father ...


In [35]:
## create a snowflake table

# quote_identifiers – By default, identifiers, specifically database, schema, table and column names (from DataFrame.columns) will be quoted. 
# If set to False, identifiers are passed on to Snowflake without quoting, i.e. identifiers will be coerced to uppercase by Snowflake.

session.write_pandas(nflx, table_name="netflix_movies", quote_identifiers = False, auto_create_table = True, overwrite= True)

<snowflake.snowpark.table.Table at 0x27acec5bb50>
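
Because `quote_identifiers=False` is used here, the table written as `netflix_movies` ends up as `NETFLIX_MOVIES`, which is why the SQL further down can refer to `NETFLIX.PUBLIC.NETFLIX_MOVIES` and to columns like `TITLE` without quotes. The rule can be sketched in plain Python. This is a simplified model for illustration only (the helper `resolve_identifier` is made up, and the sketch ignores which characters are legal in unquoted identifiers):

```python
def resolve_identifier(name: str, quoted: bool) -> str:
    """Approximate the name Snowflake stores for an identifier."""
    if quoted:
        # Quoted identifiers keep their exact case.
        return name
    # Unquoted identifiers are folded to uppercase.
    return name.upper()


stored_unquoted = resolve_identifier("netflix_movies", quoted=False)
stored_quoted = resolve_identifier("netflix_movies", quoted=True)
print(stored_unquoted)  # NETFLIX_MOVIES
print(stored_quoted)    # netflix_movies
```

With quoting left on, the lowercase name would have to be written as `"netflix_movies"` (with the double quotes) in every later query.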

## Facebook's language model
Download Facebook's bart-large-mnli language model, see [here](https://huggingface.co/facebook/bart-large-mnli)

In [8]:
from transformers import pipeline
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

## takes some time.



### Example of classifying a movie description

In [9]:
sequence_to_classify = "In a postapocalyptic world, rag-doll robots hide in fear from dangerous machines out to exterminate them, until a brave newcomer joins the group"
movie_genres = [
    "Action",
    "Comedy",
    "Drama",
    "Thriller",
    "Horror",
    "Science Fiction",
    "Romance",
    "Adventure",
    "Fantasy",
    "Documentary"
]

### single-label mode (the default): the labels compete and the scores sum to 1
classifier(sequence_to_classify, movie_genres)

{'sequence': 'In a postapocalyptic world, rag-doll robots hide in fear from dangerous machines out to exterminate them, until a brave newcomer joins the group',
 'labels': ['Science Fiction',
  'Action',
  'Documentary',
  'Thriller',
  'Adventure',
  'Fantasy',
  'Horror',
  'Romance',
  'Drama',
  'Comedy'],
 'scores': [0.2539590001106262,
  0.20386196672916412,
  0.11335238069295883,
  0.10166659206151962,
  0.09251900017261505,
  0.0795774832367897,
  0.0608665831387043,
  0.038360532373189926,
  0.033384546637535095,
  0.022451993077993393]}

In [45]:
sequence_to_classify = nflx.DESCRIPTION[922]
movie_genres = [
    "Action",
    "Comedy",
    "Drama",
    "Thriller",
    "Horror",
    "Science Fiction",
    "Romance",
    "Adventure",
    "Fantasy",
    "Documentary"
]

### multi_label=True: each label is scored independently, so several labels can score high
classifier(sequence_to_classify, movie_genres, multi_label=True)

{'sequence': 'In his final recorded special, the iconoclastic comedian channels Goat Boy and tackles provocative topics like British porn, pot and the priesthood.',
 'labels': ['Comedy',
  'Action',
  'Adventure',
  'Documentary',
  'Fantasy',
  'Drama',
  'Thriller',
  'Romance',
  'Horror',
  'Science Fiction'],
 'scores': [0.8249779343605042,
  0.16978448629379272,
  0.12359555810689926,
  0.07502390444278717,
  0.04510419815778732,
  0.018783850595355034,
  0.014635768719017506,
  0.0082959970459342,
  0.00552849005907774,
  0.0016239113174378872]}
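
The two outputs above differ in how the scores are normalized: the ten scores of the first call add up to about 1.0, while those of the `multi_label=True` call add up to about 1.29. Below is a minimal sketch of the difference, assuming the pipeline works as described in the Hugging Face documentation: one softmax across the entailment logits of all labels in single-label mode, and a per-label softmax over contradiction versus entailment in multi-label mode. The logits are invented for illustration and the function names are not part of `transformers`.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(entailment_logits):
    """Labels compete: one softmax across all labels, so the scores sum to 1."""
    return softmax(entailment_logits)


def multi_label_scores(logit_pairs):
    """Labels are independent: softmax over (contradiction, entailment) per label."""
    return [softmax([contradiction, entailment])[1]
            for contradiction, entailment in logit_pairs]


# Hypothetical (contradiction, entailment) logits for three labels.
pairs = [(-1.0, 2.0), (0.5, 0.3), (2.0, -1.5)]
single = single_label_scores([entailment for _, entailment in pairs])
multi = multi_label_scores(pairs)
```

In single-label mode a high score for one genre necessarily lowers the others, so it suits "pick exactly one genre". In multi-label mode each score reads as "how well does this genre fit on its own".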

In [10]:
## dump the model to disk
import joblib
joblib.dump(classifier, 'bart-large-mnli.joblib')

['bart-large-mnli.joblib']

### Create stage
We do this in the netflix database that we just created in the Snowflake environment.

In [36]:
session.sql("CREATE STAGE IF NOT EXISTS NETFLIX.PUBLIC.ZERO_SHOT_CLASSIFICATION").collect()  

[Row(status='Stage area ZERO_SHOT_CLASSIFICATION successfully created.')]

In [37]:
### now put the model that we dumped earlier into the snowflake STAGE
session.file.put(
   'bart-large-mnli.joblib',
   stage_location = 'NETFLIX.PUBLIC.ZERO_SHOT_CLASSIFICATION',
   overwrite=True,
   auto_compress=False
)

[PutResult(source='bart-large-mnli.joblib', target='bart-large-mnli.joblib', source_size=1630942026, target_size=1630942032, source_compression='NONE', target_compression='NONE', status='UPLOADED', message='')]

### Create a UDF so that we can use the language model in Snowflake

In [38]:
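# Caching the model
import cachetools
import sys
import joblib

# Memoize the loader so the large model file is read only once per Python process
@cachetools.cached(cache={})
def read_model():
   import joblib 
   # Inside a UDF, Snowflake exposes the directory with the imported files through this option
   import_dir = sys._xoptions.get("snowflake_import_directory")
   if import_dir:
       # Load the model
       return joblib.load(f'{import_dir}/bart-large-mnli.joblib')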
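
The effect of the caching decorator can be shown without Snowflake or the real model. The sketch below swaps in `functools.lru_cache` from the standard library for `cachetools.cached`, so it runs without extra packages, and uses a stand-in loader (`read_model_stub`, made up for this example) that records how often it really runs:

```python
import functools

load_calls = []


@functools.lru_cache(maxsize=None)
def read_model_stub():
    """Stand-in for read_model(): pretend to do an expensive load and record it."""
    load_calls.append("loaded")
    return {"name": "fake-model"}


first = read_model_stub()
second = read_model_stub()
print(len(load_calls))  # 1
print(first is second)  # True
```

The second call returns the cached object instead of loading again. For the UDF this means the 1.6 GB joblib file is deserialized once rather than for every batch, as long as the batches are handled by the same Python process.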

In [44]:
from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import StringType, PandasSeriesType
@pandas_udf(  
       name='NETFLIX.PUBLIC.classify_movie_into_genre',
       session=session,
       is_permanent=True,
       replace=True,
       imports=[
           '@ZERO_SHOT_CLASSIFICATION/bart-large-mnli.joblib'
       ],
       input_types=[PandasSeriesType(StringType())],
       return_type=PandasSeriesType(StringType()),
       stage_location='NETFLIX.PUBLIC.ZERO_SHOT_CLASSIFICATION',
       packages=['joblib',  'cachetools==4.2.2', 'transformers==4.14.1'],
       max_batch_size=10
   )
def get_review_classification(sentences: pd.Series) -> pd.Series:
    # Classify using the available categories
    movie_genres = [
        "Action",
        "Comedy",
        "Drama",
        "Thriller",
        "Horror",
        "Science Fiction",
        "Romance",
        "Adventure",
        "Fantasy",
        "Documentary"
    ]
    classifier = read_model()

    # Apply the model
    predictions = []
    for sentence in sentences:
       result = classifier(sentence, movie_genres)
       if 'scores' in result and 'labels' in result:
           category_idx = pd.Series(result['scores']).idxmax()
           predictions.append(result['labels'][category_idx])
       else:
           predictions.append(None)
    return pd.Series(predictions)

The version of package joblib in the local environment is 1.2.0, which does not fit the criteria for the requirement joblib. Your UDF might not work when the package version is different between the server and your local environment
The version of package cachetools in the local environment is 5.3.0, which does not fit the criteria for the requirement cachetools==4.2.2. Your UDF might not work when the package version is different between the server and your local environment
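
The selection step inside the UDF picks the label with the highest score. It can be pulled out into a small helper and tried on a plain dict, without the model. This is only a sketch: the name `top_label` is made up, and returning `None` for an empty or missing result is an assumption about what is wanted here.

```python
def top_label(result):
    """Return the highest-scoring label from a zero-shot result dict, or None."""
    if not result or not result.get("labels") or not result.get("scores"):
        return None
    scores = result["scores"]
    # Index of the largest score; the first one wins on ties.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return result["labels"][best_index]


example = {
    "sequence": "rag-doll robots hide in fear from dangerous machines",
    "labels": ["Science Fiction", "Action", "Documentary"],
    "scores": [0.25, 0.20, 0.11],
}
print(top_label(example))  # Science Fiction
```

In the pipeline outputs shown earlier the labels are already sorted by descending score, so the top label is also the first one in the list. Searching for the maximum simply avoids relying on that ordering.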


In [40]:
### now you can call the classify_movie_into_genre UDF (the registered get_review_classification function) on data in SQL

SQL = """ 
SELECT
    TITLE,
    LISTED_IN,
    DESCRIPTION,
    classify_movie_into_genre(DESCRIPTION::VARCHAR)  as genre
FROM 
    NETFLIX.PUBLIC.NETFLIX_MOVIES 
WHERE TYPE = 'Movie'
LIMIT 100
"""

movies = session.sql(SQL)

In [43]:
movies.show()

---------------------------------------------------------------------------------------------------------------------------------------
|"TITLE"  |"LISTED_IN"                                         |"DESCRIPTION"                                       |"GENRE"          |
---------------------------------------------------------------------------------------------------------------------------------------
|7:19     |Dramas, International Movies                        |After a devastating earthquake hits Mexico City...  |Action           |
|23:59    |Horror Movies, International Movies                 |When an army recruit is found dead, his fellow ...  |Horror           |
|9        |Action & Adventure, Independent Movies, Sci-Fi ...  |In a postapocalyptic world, rag-doll robots hid...  |Science Fiction  |
|21       |Dramas                                              |A brilliant group of students become card-count...  |Action           |
|122      |Horror Movies, International Movies  

In [None]:
scored_movies_df = movies.to_pandas()

In [None]:
scored_movies_df

In [27]:
session.close()

In [46]:
# Local copy of the UDF logic: uses the classifier already in memory instead of read_model()
def get_review_classification_(sentences: pd.Series) -> pd.Series:
    # Classify using the available categories
    movie_genres = [
        "Action",
        "Comedy",
        "Drama",
        "Thriller",
        "Horror",
        "Science Fiction",
        "Romance",
        "Adventure",
        "Fantasy",
        "Documentary"
    ]


    # Apply the model
    predictions = []
    for sentence in sentences:
       result = classifier(sentence, movie_genres)
       if 'scores' in result and 'labels' in result:
           category_idx = pd.Series(result['scores']).idxmax()
           predictions.append(result['labels'][category_idx])
       else:
           predictions.append(None)
    return pd.Series(predictions)

In [49]:
get_review_classification_(nflx.DESCRIPTION[0:100])

0           Adventure
1              Action
2              Horror
3     Science Fiction
4              Action
           ...       
95             Action
96          Adventure
97        Documentary
98              Drama
99             Action
Length: 100, dtype: object
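
A quick way to check such a result is to count how often each genre is predicted. Below is a minimal sketch with the standard library. The helper `genre_distribution` is made up for this example, and the sample list is invented, not the real output above.

```python
from collections import Counter


def genre_distribution(predictions):
    """Return (genre, count, share) tuples, most frequent genre first."""
    counts = Counter(p for p in predictions if p is not None)
    total = sum(counts.values())
    return [(genre, n, n / total) for genre, n in counts.most_common()]


# Invented sample, standing in for the predicted genres.
sample = ["Action", "Horror", "Action", None, "Drama", "Action"]
for genre, n, share in genre_distribution(sample):
    print(f"{genre:<10} {n:>3} {share:.0%}")
```

It accepts any iterable of labels, so the pandas Series returned by `get_review_classification_` could be passed in directly.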