## Step 3: QA our subset with an LLM

* We let the LLM decide whether each description describes an activity (hiking, cycling, etc.)
* We use Llama 3 8B Instruct (previously Mistral 7B Instruct), run locally with Ollama (see https://ollama.com/)
* In a terminal, run `ollama run llama3`
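
Before running the cells below, it can help to confirm that the Ollama server is reachable and that the model has been pulled. The following is a minimal sketch, assuming Ollama listens on its default port and exposes its `/api/tags` endpoint for listing local models. The helper names `model_is_available` and `check_ollama` are hypothetical, and the sketch uses only the standard library.

```python
import json
from urllib.request import urlopen


def model_is_available(tags_payload, model='llama3'):
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [m.get('name', '') for m in tags_payload.get('models', [])]
    # Installed models are listed with a tag suffix, e.g. 'llama3:latest'
    return any(name == model or name.startswith(model + ':') for name in names)


def check_ollama(model='llama3', base_url='http://localhost:11434'):
    """Ask the local Ollama server (assumed default address) about `model`."""
    with urlopen(f'{base_url}/api/tags', timeout=5) as response:
        payload = json.load(response)
    return model_is_available(payload, model)
```

Calling `check_ollama()` raises a `URLError` if the server is not running, which is usually a clearer signal than a failure in the middle of a long `progress_apply`.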

In [1]:
import pandas as pd
import requests
import json

# For progress_apply() in pandas
from tqdm import tqdm
tqdm.pandas()

In [35]:
relevant_subset = pd.read_feather('./interim/relevant_subset.feather')

In [36]:
def ask_llm(prompt, limit=9999):
    """
    Send `prompt` to the LLM running locally via Ollama and return its reply.
    Temperature 0 and a fixed seed keep the output consistent between runs.
    `limit` caps the number of generated tokens (`num_predict`). Short limits
    are much faster, so consider setting it to 1 for True/False queries.
    """
    response = requests.post(
        url='http://localhost:11434/api/generate',
        json={
            'model': 'llama3',
            'prompt': prompt,
            'stream': False,
            'options': {
                'temperature': 0,
                'seed': 42,
                'num_predict': limit
            }
        }
    )
    # Fail loudly on HTTP errors rather than parsing an error body
    response.raise_for_status()

    return response.json().get('response', '').strip()

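def is_activity_description(text):
    """Ask the LLM whether `text` is an activity or route description.
    Returns the raw reply as a string ('True' or 'False'), not a bool."""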
    prompt = f"""
    Does the text in triple quotes represent a high-quality and insightful
    route or track description, or an activity description such as hiking,
    cycling, or racing? Respond with `True` or `False`.
    If you are unsure, say `False`.
    Text: ```{text}```
    """
    return ask_llm(prompt, 1)


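def contains_personal_info(text):
    """Ask the LLM whether `text` contains personal information.
    Returns the raw reply as a string ('True' or 'False'), not a bool."""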
    prompt = f"""
    Does the text in triple quotes contain any personal information,
    such as someone's address or name? Respond with `True` or `False`.
    If you are unsure, say `True`.
    Text: ```{text}```
    """
    return ask_llm(prompt, 1)
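
Both prompts wrap the description in triple backticks, so a description that itself contains backticks could blur where the quoted text ends. The following is a minimal sketch of one way to guard against that, assuming it is acceptable to replace backticks with single quotes before building the prompt. `quote_for_prompt` is a hypothetical helper, not part of the original notebook.

```python
def quote_for_prompt(text):
    """Wrap `text` in triple backticks after neutralising backticks inside it."""
    # Replace backticks so the only ones left are the delimiters added below
    cleaned = str(text).replace('`', "'").strip()
    return f'```{cleaned}```'
```

The result could stand in for the ```` ```{text}``` ```` part of either prompt.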

### Determine if GPX description resembles an activity description
We only want relevant descriptions that actually describe the activity.

In [37]:
relevant_subset['is_activity_description'] = (
    relevant_subset.description.progress_apply(is_activity_description)
)

100%|███████████████████████████████████████| 3096/3096 [24:56<00:00,  2.07it/s]
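
The run above takes roughly 25 minutes for 3,096 rows, and each call is meant to be deterministic (temperature 0, fixed seed). If the subset contains repeated descriptions (an assumption, not something checked above), caching answers by text avoids asking the model the same question twice. The following is a minimal sketch using `functools.lru_cache`. `fake_classifier` is a hypothetical stand-in for `is_activity_description`, so the example runs without Ollama.

```python
from functools import lru_cache

calls = []  # records every text that actually reaches the classifier


def fake_classifier(text):
    """Hypothetical stand-in for is_activity_description(); no LLM involved."""
    calls.append(text)
    return 'True' if 'hike' in text else 'False'


# Wrap the classifier so repeated texts are answered from memory
cached_classifier = lru_cache(maxsize=None)(fake_classifier)

descriptions = ['A 5 km hike', 'Untitled', 'A 5 km hike', 'Untitled']
labels = [cached_classifier(text) for text in descriptions]
```

Here the classifier is called only twice for four rows. In the notebook, a wrapper like this could be passed to `progress_apply` in place of the bare function.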


#### Let's look at what the LLM has decided

In [42]:
(
    relevant_subset
    .query('is_activity_description == "True" and description_lang == "en"')
    .sample(5)
    .description
    .to_list()
)

['Have a walk through the smallest city of Europe and discover its history and monuments.',
 'Route by MC Van Rossem & Thierry Dutrieux Our third Atomium ride will take us to Zemst, Rumst, Boom and Willebroek. In Boom we’ll pass by the Velodroom museum, nice street art and a hidden statue of the tin can man 🤖 We’ll be using mostly “jaagpads” along the canal and quiet country roads. This itinerary is flat but can be quite windy.',
 'This trail leading to the borough of Kukleny takes in significant industrial buildings evoking the glory of the iron and leather industries, as well as family houses, a funeral hall, and a Cubist vocational school.',
 'This route between Holborn and Angel stays away from the major sights and goes on quieter roads, including some interesting lesser known streets through Finsbury.',
 'The Gibloux botanical trail presents typical trees of our woods on a 5.4 km hike.']

In [48]:
is_activity_description('Note check opening times of Manor Park Cremitorium')

'False'

In [50]:
relevant_subset.is_activity_description.value_counts()

False    1574
True     1521
`           1
Name: is_activity_description, dtype: int64
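
The counts above include one stray backtick. This is presumably because the prompt shows the expected answers wrapped in backticks and `num_predict` is 1, so the single generated token was an opening backtick. The answers are also strings rather than booleans, so a value like this will not match a later `== "True"` or `== "False"` comparison. The following is a minimal sketch of a normaliser that maps raw answers to real booleans, assuming anything unrecognised should take a caller-chosen default. `parse_llm_bool` is a hypothetical helper.

```python
def parse_llm_bool(answer, default=False):
    """Map a raw LLM answer such as 'True', '`true`' or '`' to a bool."""
    cleaned = str(answer).strip().strip('`').strip().lower()
    if cleaned == 'true':
        return True
    if cleaned == 'false':
        return False
    # Anything else (empty string, stray punctuation) falls back to the default
    return default
```

`default=False` mirrors the "If you are unsure, say `False`" rule in `is_activity_description`. For `contains_personal_info`, `default=True` would be the cautious counterpart.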

In [49]:
# Save just in case
relevant_subset.to_feather('./interim/relevant_subset_llm_checkpoint.feather')

### For those that resemble an activity, check for personal info
We don't want any descriptions with personal information.

In [51]:
idx = relevant_subset.is_activity_description.eq('True')

relevant_subset.loc[idx, 'contains_personal_info'] = (
    relevant_subset.loc[idx, 'description'].progress_apply(contains_personal_info)
)

100%|███████████████████████████████████████| 1521/1521 [21:42<00:00,  1.17it/s]


In [52]:
relevant_subset.contains_personal_info.value_counts()

False    1483
True       38
Name: contains_personal_info, dtype: int64
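
Only the 1,521 rows flagged as activity descriptions were sent to the model, so the remaining rows hold `NaN` in `contains_personal_info`, and `value_counts()` omits `NaN` by default. The following is a minimal sketch on a toy frame (the data is made up for illustration) showing how `dropna=False` makes the unchecked rows visible.

```python
import pandas as pd

toy = pd.DataFrame({
    'description': ['Flat canal ride', 'asdf', 'Hike past a named house'],
    'is_activity_description': ['True', 'False', 'True'],
})

# Same masked-assignment pattern as above, with fixed answers instead of an LLM
idx = toy.is_activity_description.eq('True')
answers = pd.Series(['False', 'True'], index=toy.index[idx])
toy.loc[idx, 'contains_personal_info'] = answers

# dropna=False keeps the rows that were never checked in the tally
counts = toy.contains_personal_info.value_counts(dropna=False)
```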

In [57]:
# Save just in case
relevant_subset.to_feather('./interim/relevant_subset_llm_checkpoint.feather')

### 💾 Save final
We reset the index because Feather expects a default index.

In [58]:
(relevant_subset
 .reset_index(drop=True)  # Feather expects a default RangeIndex
 .to_feather('./interim/relevant_subset_llm.feather')
)
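
The note about the index matters mainly after filtering: selecting rows leaves gaps in the index, and pandas' `to_feather` has historically rejected a non-default index. The following is a minimal sketch, on made-up data, of filtering on both LLM columns and restoring a default index. The filter itself is an assumption about how the flags might be used downstream, not a step taken in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'description': ['asdf', 'Hike past a named house', 'Flat canal ride'],
    'is_activity_description': ['False', 'True', 'True'],
    'contains_personal_info': [None, 'True', 'False'],
})

# Hypothetical downstream filter: activity descriptions without personal info
keep = (
    toy.is_activity_description.eq('True')
    & toy.contains_personal_info.eq('False')
)
clean = toy[keep]

# Filtering keeps the original labels; reset_index restores 0..n-1
clean_default = clean.reset_index(drop=True)
```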