## Classifier in Elasticsearch via MLT query

You can build a simple classifier in Elasticsearch without any ML models or external libraries. Using the More Like This (MLT) query, you can categorize a document by its similarity to labeled examples that are already in the index.

Based on this blog: https://www.elastic.co/blog/text-classification-made-easy-with-elasticsearch 

MLT Query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html

In [1]:
from elasticsearch import Elasticsearch, helpers
from getpass import getpass

# Connect to the Elastic Cloud deployment
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Create an Elasticsearch client using the provided credentials
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,  # Cloud ID, found under deployment management
    api_key=ELASTIC_API_KEY,  # API key for the deployment, found under Deployments - Security
)

from elasticsearch.client import MlClient


In [40]:
index_name = "20_news"

mappings = {
    "properties": {
        "description": {
            "type": "text",
            "analyzer": "english",
            "fielddata": True
        },
        "category": {
            "type": "text",
            "analyzer": "english",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 512
                }
            }
        },
        "name": {
            "type": "text",
            "analyzer": "english",
            "fielddata": True
        }
    }
}
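The field names in the mapping above (`description`, `category`, `name`) differ from the fields of the dataset loaded below (`text`, `label`, `label_text`). A minimal sketch of a mapping aligned with those dataset fields could look like the following. The name `news_mappings`, the analyzer, and the field types are assumptions for illustration, not part of the original notebook, and using such a mapping may change the scores and results shown later.

```python
# Hypothetical mapping aligned with the dataset fields (text, label, label_text).
# The field types and the analyzer are illustrative assumptions.
news_mappings = {
    "properties": {
        "text": {"type": "text", "analyzer": "english"},
        "label": {"type": "integer"},
        "label_text": {"type": "keyword"},
    }
}

# It could be supplied when the index is created, for example:
# client.indices.create(index="20_news", mappings=news_mappings)
```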

This example uses the 20 Newsgroups dataset, which consists of newsgroup posts labeled with one of 20 defined categories.

In [None]:
from datasets import load_dataset

dataset = load_dataset("SetFit/20_newsgroups")

In [68]:
dataset["train"][0]

{'text': 'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
 'label': 7,
 'label_text': 'rec.autos'}

In [41]:
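# Create the index. The `mappings` dict defined earlier is not passed here,
# so Elasticsearch maps the fields dynamically.
client.indices.create(index=index_name)

def generate_docs(data, index_name):
    """Yield one bulk action per dataset row, tagged with the target index."""
    for element in data:
        element.update({"_index": index_name})
        yield element

# Bulk-index the training split, then refresh so the documents are searchable
load = helpers.bulk(client, generate_docs(dataset["train"], index_name))
client.indices.refresh(index=index_name)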
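The bulk helper accepts action dictionaries in which `_index` names the target index and the document body is either the remaining keys or an explicit `_source`. As a sketch, an equivalent generator could build new dictionaries instead of updating the dataset rows in place. The name `generate_actions` and the sample row are hypothetical.

```python
def generate_actions(rows, index_name):
    """Yield one bulk action per row, leaving the input rows unmodified."""
    for row in rows:
        # Copy the row into an explicit _source so the original dict is untouched
        yield {"_index": index_name, "_source": dict(row)}


# Hand-written sample row, shaped like the dataset entries shown earlier
sample_rows = [{"text": "hello", "label": 0, "label_text": "alt.atheism"}]
actions = list(generate_actions(sample_rows, "20_news"))
```

With a live cluster, the generator would be passed to `helpers.bulk(client, generate_actions(dataset["train"], index_name))` in the same way as `generate_docs`.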

Here are the first few documents in the index, for an idea of what our data looks like:

In [69]:
response = client.search(index=index_name, size=3)
for hit in response["hits"]["hits"]:
    print(hit['_source'])

{'text': 'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.', 'label': 7, 'label_text': 'rec.autos'}
{'text': "A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge

We take the first entry as our query: its text becomes the "more like this" input, and we keep its label so we can check how many of the similar documents returned share the same category.

In [59]:
example = response["hits"]["hits"][0]["_source"]
print(example["text"])
print(example["label_text"])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
rec.autos


In [60]:
query = {
    "more_like_this": {
        "fields": [
            "text",
            "label_text"
        ],
        "like": example["text"],
        "min_term_freq": 1,
        "max_query_terms": 20
    }
}
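Because the same query shape is reused for every document to classify, it can help to wrap it in a small function. The sketch below does that. `build_mlt_query` is a hypothetical helper, its defaults mirror the values used above, and it searches only the `text` field by default, which is an assumption made here rather than the notebook's setting (the query above also lists `label_text`).

```python
def build_mlt_query(text, fields=("text",), min_term_freq=1, max_query_terms=20):
    """Build a more_like_this query body for the given free text."""
    return {
        "more_like_this": {
            "fields": list(fields),
            "like": text,
            "min_term_freq": min_term_freq,
            "max_query_terms": max_query_terms,
        }
    }


sample_query = build_mlt_query("It was a 2-door sports car")
```

The result can be passed to `client.search(index=index_name, query=sample_query)` in the same way as the hand-written `query`.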

In [73]:
response = client.search(index=index_name, query=query)
for hit in response["hits"]["hits"]:
    print(hit['_source']["label_text"])

rec.autos
rec.autos
rec.autos
misc.forsale
rec.autos
rec.autos
rec.autos
rec.autos
rec.autos
rec.motorcycles


Awesome! Most of the returned articles belong to the same category as the query, `rec.autos`. We can go a step further and pick a single "correct" label by summing the relevance scores per category and choosing the category with the highest total:

In [74]:
from operator import itemgetter

def get_best_category(response):
    """Return the category with the highest summed score, or None if there are no hits."""
    categories = {}
    for hit in response['hits']['hits']:
        score = hit['_score']
        category = hit['_source']["label_text"]
        # Accumulate the relevance score of every hit per category
        categories[category] = categories.get(category, 0) + score
    if not categories:
        return None
    sorted_categories = sorted(categories.items(), key=itemgetter(1), reverse=True)
    return sorted_categories[0][0]
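The function above weights each category by the summed relevance score of its hits. For comparison, a simpler option is a plain majority vote over the returned labels. The following is a sketch only: `majority_label` is a hypothetical helper, and `sample_response` is a hand-built stand-in that mirrors the labels printed earlier, with scores omitted.

```python
from collections import Counter


def majority_label(response):
    """Return the most frequent label among the hits, or None if there are none."""
    labels = [hit["_source"]["label_text"] for hit in response["hits"]["hits"]]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


# Hand-built response mirroring the ten labels printed above
sample_labels = ["rec.autos"] * 3 + ["misc.forsale"] + ["rec.autos"] * 5 + ["rec.motorcycles"]
sample_response = {
    "hits": {"hits": [{"_source": {"label_text": label}} for label in sample_labels]}
}
print(majority_label(sample_response))  # rec.autos
```

A majority vote ignores how similar each hit is, whereas summing scores lets a few very close matches outweigh many weak ones.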

In [75]:
get_best_category(response)

'rec.autos'
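Putting the pieces together, classifying a new piece of text amounts to running an MLT query and aggregating the scores of the hits. The sketch below assumes the index and field names used in this notebook. `classify_text` and `StubClient` are hypothetical: the stub stands in for a real Elasticsearch client so that the example runs without a cluster, and its canned hits and scores are made up for illustration.

```python
from collections import defaultdict


def classify_text(client, index_name, text, size=10):
    """Return the category with the highest summed MLT score, or None."""
    query = {
        "more_like_this": {
            "fields": ["text"],
            "like": text,
            "min_term_freq": 1,
            "max_query_terms": 20,
        }
    }
    response = client.search(index=index_name, query=query, size=size)
    totals = defaultdict(float)
    for hit in response["hits"]["hits"]:
        totals[hit["_source"]["label_text"]] += hit["_score"]
    if not totals:
        return None
    return max(totals, key=totals.get)


class StubClient:
    """Hypothetical stand-in for an Elasticsearch client, for illustration only."""

    def search(self, index, query, size):
        # Canned hits with made-up scores
        return {"hits": {"hits": [
            {"_score": 2.0, "_source": {"label_text": "rec.autos"}},
            {"_score": 1.5, "_source": {"label_text": "misc.forsale"}},
            {"_score": 1.0, "_source": {"label_text": "rec.autos"}},
        ]}}


predicted = classify_text(StubClient(), "20_news", "a two-door sports car")
print(predicted)  # rec.autos
```

With a real deployment, the `client` created at the top of the notebook would be passed in place of `StubClient()`.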