## Thinking through multi-concept queries

Let's say we have a system where someone can run free-text queries about medical practitioners and providers, and someone has a query like:

> "I am looking for a physical therapist who specializes in sports injuries, near Manhattan near SoHo"

Here, we have both medical and location concepts at once.

I got an interesting initial suggestion from ChatGPT: use `spacy` for entity extraction to cut out the location information, then treat everything else as medical. Interesting idea. Trying it here:

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")
query = "I am looking for a physical therapist who specializes in sports injuries, near Manhattan near SoHo"
doc = nlp(query)

medical_concept = []
location_concept = []

for ent in doc.ents:
    if ent.label_ in ["GPE", "LOC"]:  # Geopolitical Entity, Location
        location_concept.append(ent.text)
    else:
        medical_concept.append(ent.text)

medical_concept_str = " ".join(medical_concept)
location_concept_str = " ".join(location_concept)



In [3]:

medical_concept,  location_concept


([], ['Manhattan', 'SoHo'])

In [4]:
doc.ents

(Manhattan, SoHo)

Okay, so that was an interesting idea, but `medical_concept` comes back empty here because spaCy only tagged the two place names as entities. It looks like I ought to fill `medical_concept` with everything else in the initial query, even the parts that are not in `doc.ents`.
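A minimal sketch of that idea: cut the location spans out of the query by character offset and keep whatever remains as the non-location part. `split_query` is a hypothetical helper, not part of spaCy. With spaCy the spans would presumably come from `ent.start_char` and `ent.end_char`, but they are computed by hand here so the snippet runs without a model.

```python
import re


def split_query(text, location_spans):
    """Return (remainder, locations) given (start, end) character spans.

    Hypothetical helper: removes each span from `text` and collapses
    the whitespace left behind.
    """
    locations = [text[start:end] for start, end in location_spans]
    pieces = []
    cursor = 0
    for start, end in sorted(location_spans):
        pieces.append(text[cursor:start])  # text before this location
        cursor = end
    pieces.append(text[cursor:])  # text after the last location
    remainder = re.sub(r"\s+", " ", " ".join(pieces)).strip()
    return remainder, locations


query = "I am looking for a physical therapist who specializes in sports injuries, near Manhattan near SoHo"

# With spaCy, something like:
#   [(ent.start_char, ent.end_char) for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
# Here the spans are located by hand (assumption: each name occurs once).
spans = [
    (query.index(name), query.index(name) + len(name))
    for name in ("Manhattan", "SoHo")
]

remainder, locations = split_query(query, spans)
print(remainder)
print(locations)
```

The remainder still ends in "near near": the prepositions attached to the removed places stay behind, so this would need more cleanup before treating the rest as the medical concept.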

### Anyway, looking around, there is also a specialized medical project

https://github.com/allenai/scispacy

OK, trying it out:
```
pip install scispacy
```


and perhaps one of its models too?
```
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
```

In [7]:
nlp = spacy.load("en_core_web_sm")
query = "I am looking for some carbonara pasta , near Manhattan near NoHo"
doc = nlp(query)


In [8]:
doc.ents

(Manhattan, NoHo)

# 2024-06-09

## Try out zero-shot classification

In [9]:
from transformers import pipeline

# Load Hugging Face pipeline for zero-shot classification
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

query = "I am looking for a physical therapist who specializes in sports injuries, close by"

# Predefined location-related intents
candidate_labels = ["current location", "physical location", "explicit location"]
classification = classifier(query, candidate_labels)


In [10]:
!which python

/Users/michal/opt/miniconda3/envs/pandars310/bin/python


In [11]:
classification

{'sequence': 'I am looking for a physical therapist who specializes in sports injuries, close by',
 'labels': ['physical location', 'current location', 'explicit location'],
 'scores': [0.4983280599117279, 0.25559571385383606, 0.24607622623443604]}

### Let's try other classes, maybe for better results?

In [12]:
query = "I am looking for a physical therapist who specializes in sports injuries, close by"

# Predefined location-related intents
candidate_labels = ["relative location", "explicit location"]
classification = classifier(query, candidate_labels)
classification

{'sequence': 'I am looking for a physical therapist who specializes in sports injuries, close by',
 'labels': ['relative location', 'explicit location'],
 'scores': [0.811715841293335, 0.18828414380550385]}

In [13]:
query = "I am looking for a physical therapist who specializes in sports injuries, close by"

# Predefined location-related intents
candidate_labels = ["relative location", "explicit location", "no location provided"]
classification = classifier(query, candidate_labels)
classification

{'sequence': 'I am looking for a physical therapist who specializes in sports injuries, close by',
 'labels': ['relative location', 'explicit location', 'no location provided'],
 'scores': [0.8051754236221313, 0.18676704168319702, 0.008057523518800735]}

OK, this looks promising!?

In [14]:
query = "I am looking for a physical therapist who specializes in sports injuries"

# Predefined location-related intents
candidate_labels = ["relative location", "explicit location", "no location provided"]
classification = classifier(query, candidate_labels)
classification

{'sequence': 'I am looking for a physical therapist who specializes in sports injuries',
 'labels': ['explicit location', 'relative location', 'no location provided'],
 'scores': [0.6577290892601013, 0.2819722890853882, 0.060298554599285126]}

Haha, yeah, that was too good to be true. It is probably difficult to use negation in a class label?

In [15]:
query = "I am looking for a physical therapist who specializes in sports injuries"

# Predefined location-related intents
candidate_labels = ["relative location", "explicit location", "ambiguous location"]
classification = classifier(query, candidate_labels)
classification

{'sequence': 'I am looking for a physical therapist who specializes in sports injuries',
 'labels': ['explicit location', 'ambiguous location', 'relative location'],
 'scores': [0.5178123712539673, 0.2601984441280365, 0.22198916971683502]}

In [16]:
help(classifier)

Help on ZeroShotClassificationPipeline in module transformers.pipelines.zero_shot_classification object:

class ZeroShotClassificationPipeline(transformers.pipelines.base.ChunkPipeline)
 |  ZeroShotClassificationPipeline(args_parser=<transformers.pipelines.zero_shot_classification.ZeroShotClassificationArgumentHandler object at 0x7fad122e8b80>, *args, **kwargs)
 |  
 |  NLI-based zero-shot classification pipeline using a `ModelForSequenceClassification` trained on NLI (natural
 |  language inference) tasks. Equivalent of `text-classification` pipelines, but these models don't require a
 |  hardcoded number of potential classes, they can be chosen at runtime. It usually means it's slower but it is
 |  **much** more flexible.
 |  
 |  Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis
 |  pair and passed to the pretrained model. Then, the logit for *entailment* is taken as the logit for the candidate
 |  label being valid. Any
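
The help text describes each label being turned into a hypothesis and paired with the query as the premise. A rough sketch of what those pairs might look like, assuming the pipeline's default `hypothesis_template` of `"This example is {}."`. `build_nli_pairs` is a hypothetical stand-in for illustration, not the library's own function, and the second template is made up.

```python
def build_nli_pairs(sequence, labels, hypothesis_template="This example is {}."):
    """Pair the query (premise) with one hypothesis per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in labels]


query = "I am looking for a physical therapist who specializes in sports injuries"
labels = ["relative location", "explicit location", "no location provided"]

# Each hypothesis is what the NLI model is asked to judge against the query.
for premise, hypothesis in build_nli_pairs(query, labels):
    print(hypothesis)

# A hypothetical alternative template that reads more like a full sentence.
custom = build_nli_pairs(query, ["a location"], "This query mentions {}.")
print(custom[0][1])
```

Seen this way, a label like "no location provided" becomes the hypothesis "This example is no location provided.", which is an odd sentence for an entailment model to judge. That might be part of why the negated label scores so low.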

Hmm, try `multi_label=True`?

In [17]:
query = "I am looking for a physical therapist who specializes in sports injuries"

# Predefined location-related intents
candidate_labels = ["relative location", "explicit location", "no location provided"]
classification = classifier(query, candidate_labels, multi_label=True)
classification

{'sequence': 'I am looking for a physical therapist who specializes in sports injuries',
 'labels': ['explicit location', 'relative location', 'no location provided'],
 'scores': [0.64714115858078, 0.3713032305240631, 0.004426070488989353]}

In [19]:
sum(classification["scores"])

1.0228704595938325
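
The scores no longer sum to 1, which fits how I understand `multi_label=True`: each label is scored on its own (entailment vs. contradiction) rather than against the other labels. A small sketch of the two normalizations in plain Python. The logits are made-up numbers for illustration, not real model output.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (contradiction, entailment) logits per label.
logits = {
    "relative location": (-1.0, 0.5),
    "explicit location": (-0.5, 1.2),
    "no location provided": (3.0, -2.0),
}

# multi_label=False: one softmax over the entailment logits of all labels,
# so the labels compete and the scores sum to 1.
single = softmax([entail for _, entail in logits.values()])

# multi_label=True: a separate softmax over (contradiction, entailment)
# for each label, so every score is independent of the others.
multi = [softmax([contra, entail])[1] for contra, entail in logits.values()]

print(sum(single))
print(sum(multi))
```

So in multi-label mode two labels can both score high, or all of them can score low, which is what makes it usable for a "none of these" case.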

In [20]:
query = "I am looking for a physical therapist who specializes in sports injuries, near Manhattan near SoHo"

# Predefined location-related intents
candidate_labels = ["relative location", "explicit location", "no location provided"]
classification = classifier(query, candidate_labels, multi_label=True)
classification


{'sequence': 'I am looking for a physical therapist who specializes in sports injuries, near Manhattan near SoHo',
 'labels': ['relative location', 'explicit location', 'no location provided'],
 'scores': [0.9820536971092224, 0.9324496388435364, 0.00030656083254143596]}

In [21]:
query = "I am looking for a physical therapist who specializes in sports injuries, in Manhattan "

# Predefined location-related intents
candidate_labels = ["relative location", "explicit location", "no location provided"]
classification = classifier(query, candidate_labels, multi_label=True)
classification


{'sequence': 'I am looking for a physical therapist who specializes in sports injuries, in Manhattan ',
 'labels': ['relative location', 'explicit location', 'no location provided'],
 'scores': [0.9801641702651978, 0.923812210559845, 0.0002362721279496327]}

In [22]:
query = "I am looking for a physical therapist who specializes in sports injuries today "

# Predefined location-related intents
candidate_labels = ["relative location", "explicit location", "no location provided"]
classification = classifier(query, candidate_labels, multi_label=True)
classification


{'sequence': 'I am looking for a physical therapist who specializes in sports injuries today ',
 'labels': ['explicit location', 'relative location', 'no location provided'],
 'scores': [0.8038725256919861, 0.41341912746429443, 0.0036620250903069973]}

In [23]:
query = "I am looking for a physical therapist who specializes in sports injuries today "

# Broad topic labels this time, not location intents
candidate_labels = ["medical", "culinary", "business", "education", "athletic", "music", "rehabilitation"]
classification = classifier(query, candidate_labels, multi_label=True)
classification


{'sequence': 'I am looking for a physical therapist who specializes in sports injuries today ',
 'labels': ['athletic',
  'medical',
  'rehabilitation',
  'business',
  'education',
  'culinary',
  'music'],
 'scores': [0.9634179472923279,
  0.5759478807449341,
  0.4954245984554291,
  0.0016168535221368074,
  4.726956467493437e-05,
  4.686843021772802e-05,
  4.063392771058716e-05]}

OK, well, the simple categorization works, just not really the location part here. Hmm.
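
One way to read the `multi_label=True` location scores would be a plain cutoff. A sketch using the scores reported above in In [20] (location given) and In [22] (no location given). `labels_above` is a hypothetical helper, and the 0.9 threshold is an assumption picked only because it separates these two examples.

```python
def labels_above(classification, threshold):
    """Return the labels whose score is at or above `threshold`."""
    return [
        label
        for label, score in zip(classification["labels"], classification["scores"])
        if score >= threshold
    ]


# Scores copied from the In [20] output ("... near Manhattan near SoHo").
with_location = {
    "labels": ["relative location", "explicit location", "no location provided"],
    "scores": [0.9820536971092224, 0.9324496388435364, 0.00030656083254143596],
}

# Scores copied from the In [22] output ("... sports injuries today").
without_location = {
    "labels": ["explicit location", "relative location", "no location provided"],
    "scores": [0.8038725256919861, 0.41341912746429443, 0.0036620250903069973],
}

THRESHOLD = 0.9  # assumption: not validated beyond these two queries

print(labels_above(with_location, THRESHOLD))
print(labels_above(without_location, THRESHOLD))
```

The margin is thin: "explicit location" scores about 0.80 with no location in the query and about 0.93 with one, so a cutoff like this would need checking against many more queries before relying on it.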

In [24]:
query = "I am looking for a physical therapist who specializes in sports injuries today "

# Predefined location-related intents
candidate_labels = ["a location is provided", "there is no location provided"]
classification = classifier(query, candidate_labels, multi_label=True)
classification


{'sequence': 'I am looking for a physical therapist who specializes in sports injuries today ',
 'labels': ['a location is provided', 'there is no location provided'],
 'scores': [0.6782494783401489, 0.006060502957552671]}

In [25]:
query = "I am looking for a physical therapist who specializes with sports injuries today "

# Predefined location-related intents
candidate_labels = ["a location is provided", "there is no location provided"]
classification = classifier(query, candidate_labels, multi_label=True)
classification


{'sequence': 'I am looking for a physical therapist who specializes with sports injuries today ',
 'labels': ['a location is provided', 'there is no location provided'],
 'scores': [0.6726811528205872, 0.006378599908202887]}

In [26]:
nlp

<spacy.lang.en.English at 0x7fad19e0fd30>

# 2024-06-15

Look at these concepts again, actually.


In [29]:
doc, doc.ents

(I am looking for some carbonara pasta , near Manhattan near NoHo,
 (Manhattan, NoHo))

In [30]:

import spacy

nlp = spacy.load("en_core_web_sm")
query = "ok,  Manhattan near SoHo, philadelphia pa, kansas, nyc, 1st avenue and 39th street, 111 pilot road,"
doc = nlp(query)
print(doc.ents)

"""for ent in doc.ents:    
    if ent.label_ in ["GPE", "LOC"]:  # Geopolitical Entity, Location
"""


(Manhattan, SoHo, philadelphia, kansas, nyc,, 1st avenue, 39th, 111)


'for ent in doc.ents:    \n    if ent.label_ in ["GPE", "LOC"]:  # Geopolitical Entity, Location\n'

In [37]:
x = doc.ents[0]
x.doc, x.ents, x.label, x.label_, x.text

(ok,  Manhattan near SoHo, philadelphia pa, kansas, nyc, 1st avenue and 39th street, 111 pilot road,,
 [Manhattan],
 384,
 'GPE',
 'Manhattan')