<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_api_llm_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# APIs and LLMs Tutorial

This notebook is intended as a template for the homework assignment on APIs and LLMs. You may use all of the code provided.

Make sure your runtime is set to a GPU or TPU!

## Import the Google Places API token

In [None]:
from google.colab import userdata
API_KEY = userdata.get('google_maps_api')

In [None]:
import requests

## Define the business name

This is the name of the business we will look up with the findplace endpoint.

In [None]:
BUSINESS_NAME = 'sweetaly gelato fifteenth and fifteenth utah'

## Use the findplace api endpoint

We need the place_id of the business, so we'll use the findplace endpoint to look up its unique identifier.

Note that searching by a common name may return multiple candidate places. We'll assume the first one is the correct business.

We send a GET request to a URL that contains the business name and the API key.

We then extract the JSON data from the response.

In [None]:
find_place_url = f"https://maps.googleapis.com/maps/api/place/findplacefromtext/json?input={BUSINESS_NAME}&inputtype=textquery&key={API_KEY}"
response = requests.get(find_place_url)
json_response = response.json()
json_response
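The business name contains spaces, so it can help to see how the query string is encoded. The following is a minimal sketch, not part of the original workflow: `build_find_place_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder. It only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

FIND_PLACE_ENDPOINT = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"

def build_find_place_url(business_name, api_key):
    """Return the findplace URL with every parameter percent-encoded."""
    params = {
        "input": business_name,
        "inputtype": "textquery",
        "key": api_key,
    }
    return f"{FIND_PLACE_ENDPOINT}?{urlencode(params)}"

# Placeholder key for illustration only
url = build_find_place_url("sweetaly gelato fifteenth and fifteenth utah", "YOUR_API_KEY")
print(url)
```

If you prefer to let requests do the encoding, you can pass the same dictionary as `params=` to `requests.get`.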

Get the list of candidates from the dictionary key 'candidates'. If the key isn't found, fall back to an empty list.

In [None]:
candidates = json_response.get('candidates', [])
candidates

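Assume the first element is the one we want. Return '' if the candidate list is empty or the place_id key is missing.

In [None]:
# Guard against an empty candidate list before indexing into it
place_id = candidates[0].get('place_id', '') if candidates else ''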
place_id
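To see how the lookup behaves when the search finds nothing, here is a small sketch that wraps the two steps in a function. The helper name `first_place_id` and both sample payloads are made up for illustration; they are not real API output.

```python
def first_place_id(payload):
    """Return the place_id of the first candidate, or '' if there is none."""
    candidates = payload.get("candidates", [])
    if not candidates:
        return ""
    return candidates[0].get("place_id", "")

# Invented payloads that mimic the shape of a findplace response
sample_hit = {"candidates": [{"place_id": "EXAMPLE_PLACE_ID"}], "status": "OK"}
sample_miss = {"candidates": [], "status": "ZERO_RESULTS"}

print(repr(first_place_id(sample_hit)))
print(repr(first_place_id(sample_miss)))
```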

OK, great. We have a place_id (Google's unique identifier for this business).

Let's use Google's API again to retrieve the reviews for the place.

## Retrieve Business Reviews

This time we reduce the number of steps and extract the reviews in a single line of code rather than several.

In [None]:
details_url = f"https://maps.googleapis.com/maps/api/place/details/json?place_id={place_id}&fields=reviews&key={API_KEY}"
response = requests.get(details_url)
reviews = response.json().get('result', {}).get('reviews', [])

In [None]:
reviews
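The chained `.get` calls return an empty list whenever a key is missing, which also hides failed requests. One possible variation, sketched below, checks the `status` field of the response first. The helper name `extract_reviews` and the sample payloads are assumptions made up for this example.

```python
def extract_reviews(payload):
    """Return the list of reviews, or [] if the request did not succeed."""
    if payload.get("status") != "OK":
        # A non-OK status means there is no usable result to read
        return []
    return payload.get("result", {}).get("reviews", [])

# Invented payloads that mimic the shape of a place details response
sample_ok = {
    "status": "OK",
    "result": {"reviews": [{"author_name": "A. Reviewer", "rating": 5, "text": "Loved the pistachio."}]},
}
sample_denied = {"status": "REQUEST_DENIED"}

print(len(extract_reviews(sample_ok)))
print(extract_reviews(sample_denied))
```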

# Hugging Face Setup

This code will load the Llama model. It will take a while because the model is quite large.

We import the required packages, read the token, and create a pipeline object. The pipeline specifies the task (text generation), which model to use, the access token, and that the GPU should be used if one is available (device_map).

In [None]:
import torch
from transformers import pipeline

HF_TOKEN = userdata.get('HF_TOKEN') # Your token must be in this secret.

pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto",token=HF_TOKEN)

## Process a single review

For now, let's process just one review to get comfortable with the workflow.

In [None]:
review = reviews[0]['text']
review

## LLM chat setups

Create a chat list that defines what we want the LLM to accomplish and what content it should work on.

The dictionary with role = system defines the behavior we intend.
The dictionary with role = user holds the review text to be evaluated.

In [None]:
chat = [
    {"role": "system", "content": "What specific flavors of gelato were mentioned? Return a list of strings with each flavor an element. If none were mentioned return a list with an element saying 'flavor-missing' only return the list. nothing else."},
    {"role": "user", "content": review}
]

chat

## Run the pipeline

The pipeline returns the whole conversation, with the model's reply appended as the last message. We take the content of that last message.

The reply is a STRING, even though it looks like a list. We'll need to convert it.

In [None]:
chat_response = pipe(chat, max_new_tokens=512)
flavor_list_string = chat_response[0]['generated_text'][-1]['content']
print(flavor_list_string)

In [None]:
type(flavor_list_string)
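The indexing `[0]['generated_text'][-1]['content']` is easier to follow with the structure in front of you. The sketch below uses a hand-written mock with the same shape as the pipeline output for a chat input. The messages in it are invented, and no model is called.

```python
# Hand-written stand-in for the pipeline output (one result for one chat)
mock_chat_response = [
    {
        "generated_text": [
            {"role": "system", "content": "Return the flavors mentioned as a list."},
            {"role": "user", "content": "The pistachio was great."},
            {"role": "assistant", "content": "['pistachio']"},
        ]
    }
]

# [0] picks the first result, [-1] picks the last message in the conversation
last_message = mock_chat_response[0]["generated_text"][-1]
print(last_message["role"])
print(last_message["content"])
```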

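We'll need a new module to convert the string into an actual Python list object. The ast module is part of the Python standard library, so there is nothing to install.

Its literal_eval function parses the string as a Python literal without executing arbitrary code. Here the literal is a list, so we get a list object back.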

In [None]:
import ast
flavor_list = ast.literal_eval(flavor_list_string)

In [None]:
flavor_list

In [None]:
type(flavor_list)
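The model does not always follow instructions, and `ast.literal_eval` raises an error when the reply is not a valid Python literal. A defensive version might look like the sketch below. The helper name `parse_flavor_list` is hypothetical, and falling back to `['flavor-missing']` is an assumption about what you want in that case.

```python
import ast

def parse_flavor_list(text):
    """Parse the model reply into a list of strings, with a safe fallback."""
    try:
        value = ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        # The reply was not a valid Python literal (for example, plain prose)
        return ["flavor-missing"]
    # Accept only a list whose elements are all strings
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    return ["flavor-missing"]

print(parse_flavor_list("['pistachio', 'stracciatella']"))
print(parse_flavor_list("Sure! Here is the list you asked for."))
```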

## Update the reviews with the chat response

Put the list of flavors into the reviews dictionary.

The extracted list of flavors must go into the review it came from. We pulled the first review, so we store the list in the first review.

In [None]:
reviews[0]['flavors'] = flavor_list

In [None]:
reviews[0]

In [None]:
import pandas as pd

In [None]:
reviews_df = pd.DataFrame(reviews)
reviews_df

## Explode the dataframe

That sounds dangerous. It's not. It simply turns each element of the list in the flavors column into its own row and repeats the other columns. So, we get a row for each flavor. This will be the format we need for inserting into the database.

In [None]:
exploded_df = reviews_df.explode('flavors')
exploded_df
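To make the row shape concrete, here is a plain-Python illustration of the same idea on a list of dictionaries. It is a simplified picture and not how pandas implements `explode`: the helper `explode_rows` is hypothetical, the sample rows are invented, and it uses None where pandas would use NaN.

```python
def explode_rows(rows, column):
    """Return one row per list element in `column`, copying the other fields."""
    exploded = []
    for row in rows:
        values = row.get(column)
        if not isinstance(values, list):
            values = [values]   # scalars and missing values stay as one row
        elif not values:
            values = [None]     # an empty list still produces one row
        for value in values:
            exploded.append({**row, column: value})
    return exploded

# Invented rows for illustration
sample_rows = [
    {"author_name": "A", "flavors": ["pistachio", "lemon"]},
    {"author_name": "B", "flavors": ["flavor-missing"]},
]

for row in explode_rows(sample_rows, "flavors"):
    print(row)
```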

In [None]:
from tqdm import tqdm

# How do we loop over all the reviews?

You can use a for loop to iterate over all of the reviews for the business.

In [None]:
for review in tqdm(reviews):
  print('do stuff!')
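One way to fill in the loop body is sketched below. It repeats the single-review steps for every review: build the chat, generate a reply, parse it, and store the result. The helper names (`build_chat`, `parse_flavor_list`, `add_flavors`) are hypothetical, and `fake_generate` is a stand-in so the sketch runs without loading the model. Treat it as a starting point under those assumptions rather than the required solution.

```python
import ast

SYSTEM_PROMPT = (
    "What specific flavors of gelato were mentioned? Return a list of strings "
    "with each flavor an element. If none were mentioned return a list with an "
    "element saying 'flavor-missing' only return the list. nothing else."
)

def build_chat(review_text):
    """Build the system + user messages for one review."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": review_text},
    ]

def parse_flavor_list(text):
    """Parse the reply into a list, falling back to ['flavor-missing']."""
    try:
        value = ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        return ["flavor-missing"]
    return value if isinstance(value, list) else ["flavor-missing"]

def add_flavors(reviews, generate):
    """Add a 'flavors' list to every review, using `generate` to call the model."""
    for review in reviews:
        chat_response = generate(build_chat(review.get("text", "")))
        reply = chat_response[0]["generated_text"][-1]["content"]
        review["flavors"] = parse_flavor_list(reply)
    return reviews

def fake_generate(chat):
    """Stand-in for the pipeline that always answers with one flavor."""
    reply = {"role": "assistant", "content": "['pistachio']"}
    return [{"generated_text": chat + [reply]}]

# Invented reviews for illustration
sample_reviews = [{"text": "Loved the pistachio."}, {"text": "Friendly staff."}]
add_flavors(sample_reviews, fake_generate)
print(sample_reviews)
```

With the real model you could pass `lambda chat: pipe(chat, max_new_tokens=512)` as `generate`, and wrap the list in `tqdm(...)` to get a progress bar.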