## Preprocessing the files

## Documentation

This Jupyter notebook preprocesses a dataset of diseases and their associated symptoms, generates embeddings for the symptoms with a pre-trained transformer model, and indexes the data into Elasticsearch for search and retrieval. The key steps are summarized below.

### 1. Preprocessing the Dataset
- The dataset is loaded from a CSV file (`dataset.csv`).
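- A new column `Combined_Symptoms` is created by joining every non-null value in a row into one comma-separated string. Because the join runs over the whole row, the string also contains the `Disease` value.
- The dataset is filtered to retain only the `Disease` and `Combined_Symptoms` columns.
- The data is grouped by `Disease`, and the distinct combined strings for each disease are joined. Deduplication happens per row string, so an individual symptom can still appear several times.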
- The processed data is saved to a new CSV file (`disease_symptoms.csv`).

### 2. Elasticsearch Setup
- Elasticsearch is configured with a secure connection using a password and CA certificate.
- An index (`disease_prediction1`) is created in Elasticsearch with mappings for `disease`, `symptoms`, and `embedding` fields. The `embedding` field is defined as a dense vector with 384 dimensions.

### 3. Embedding Generation
- A pre-trained transformer model (`sentence-transformers/all-MiniLM-L6-v2`) is used to generate embeddings for the symptoms.
- Mean pooling is applied to the token embeddings to produce one vector for each combined symptom string.

### 4. Indexing Data into Elasticsearch
- The processed dataset is indexed into Elasticsearch.
- Each document contains the disease name, combined symptoms, and the embedding vector.
- Bulk indexing catches `BulkIndexError` and prints each document that failed to index.

### 5. Searching for Symptoms
- A search function is implemented to find the most relevant diseases based on user input symptoms.
- The function generates an embedding for the user query and performs a cosine similarity search in Elasticsearch.
- The top-k most relevant results are returned, including the disease name, symptoms, and relevance score.

### 6. Dependencies
- The notebook uses the following Python libraries:
    - `pandas` for data manipulation.
    - `transformers` for loading the pre-trained model and tokenizer.
    - `torch` for tensor operations.
    - `elasticsearch` for interacting with the Elasticsearch instance.
    - `langchain-elasticsearch`, which is installed and imported but not called by the pipeline below.

### 7. Usage
- Run the cells sequentially to preprocess the data, set up Elasticsearch, index the data, and perform searches.
- Modify the `user_query` variable to test the search functionality with different symptoms.

This notebook provides a complete pipeline for disease prediction based on symptoms using Elasticsearch and transformer-based embeddings.


In [None]:
import pandas as pd
from elasticsearch import Elasticsearch, helpers
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

In [None]:
df = pd.read_csv("dataset.csv")

In [None]:
df

In [None]:
df.columns

In [None]:
df['Combined_Symptoms'] = df.apply(lambda row: ', '.join([str(val) for val in row if pd.notnull(val)]), axis=1)


In [None]:
# Keep only the disease label and the combined symptom string
df = df[["Disease", "Combined_Symptoms"]]

In [None]:
df.Disease.value_counts()

In [None]:
df_combined = df.groupby('Disease')['Combined_Symptoms'].apply(lambda x: ', '.join(x.unique())).reset_index()
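The `groupby` call above drops duplicate row strings, not duplicate symptoms, so the same symptom can appear many times in the aggregated text. A minimal sketch of per-symptom deduplication follows. The helper name `merge_symptoms` and its `drop` parameter are hypothetical and not part of the original notebook.

```python
def merge_symptoms(rows, drop=()):
    """Merge comma-separated strings, keeping each symptom once in first-seen order."""
    excluded = {value.strip().lower() for value in drop}
    seen = {}  # dicts preserve insertion order, so this acts as an ordered set
    for row in rows:
        for token in row.split(","):
            symptom = token.strip()
            if symptom and symptom.lower() not in excluded:
                seen.setdefault(symptom, None)
    return ", ".join(seen)


# Illustrative rows in the same shape as Combined_Symptoms (made-up sample)
rows = [
    "Common Cold,  chills,  cough",
    "Common Cold,  cough,  headache",
]
print(merge_symptoms(rows, drop=["Common Cold"]))
# chills, cough, headache
```

Under these assumptions, the helper could be passed to the existing aggregation as `.apply(merge_symptoms)` in place of the `lambda`.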

In [None]:
# Checkpoint: save the aggregated data so later cells can reload it
df_combined.to_csv("disease_symptoms.csv", index=False)

## Feeding Elasticsearch

In [None]:
df =  pd.read_csv("disease_symptoms.csv")

In [None]:
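import os
from elasticsearch import Elasticsearch

# Password for the 'elastic' user generated by Elasticsearch.
# Read here from an ELASTIC_PASSWORD environment variable (an assumed name)
# so the credential is not stored in the notebook.
# To retrieve the CA certificate:
#!docker cp elastic:/usr/share/elasticsearch/config/certs/http_ca.crt .
ELASTIC_PASSWORD = os.environ["ELASTIC_PASSWORD"]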

# Create the client instance
client = Elasticsearch(
    "https://localhost:9200",
    ca_certs="http_ca.crt",
    basic_auth=("elastic", ELASTIC_PASSWORD)
)

# Successful response!
client.info()
# {'name': 'instance-0000000000', 'cluster_name': ...}

In [None]:
!pip install langchain langchain-elasticsearch langchain-community tiktoken


In [None]:
from langchain_elasticsearch import ElasticsearchStore
from langchain_elasticsearch import SparseVectorStrategy
from elasticsearch import Elasticsearch, exceptions


In [None]:
client.info()

In [86]:
# Define the index name for symptoms
index_name = 'disease_prediction1'

es = client

def create_index():
    if not es.indices.exists(index=index_name):
        es.indices.create(
            index=index_name,
            body={
                "mappings": {
                    "properties": {
                        "disease": {"type": "text"},
                        "symptoms": {"type": "text"},
                        "embedding": {
                            "type": "dense_vector",
                            "dims": 384  # Size of the embedding vector, depends on your model
                        }
                    }
                }
            }
        )
    else:
        print(f"Index {index_name} already exists.")

In [87]:
create_index()
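The mapping fixes `dims` at 384, so every vector sent to the index has to have exactly that length. A small pre-flight check, sketched below, can surface a model/mapping mismatch before bulk indexing starts. The names `EXPECTED_DIMS` and `check_dims` are hypothetical helpers, not part of the original notebook.

```python
EXPECTED_DIMS = 384  # assumed to mirror the "dims" value in the index mapping


def check_dims(vectors, expected=EXPECTED_DIMS):
    """Raise ValueError if any vector's length differs from the mapped dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"vector {position} has {len(vector)} dims, expected {expected}"
            )
    return True


# Example with placeholder vectors instead of real embeddings
check_dims([[0.0] * 384, [0.1] * 384])
```

Calling it on the output of `generate_embeddings` before building the bulk actions would be one way to use it.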

In [88]:
def load_model():
    model_name = "sentence-transformers/all-MiniLM-L6-v2"  # Change to the model of your choice
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

In [89]:

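# Function to generate embeddings for symptoms
def generate_embeddings(texts, model, tokenizer):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over real tokens only: padding positions are masked out
    # so that shorter texts in a batch are not diluted by pad embeddings
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)
    return embeddings.numpy()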


In [90]:

# Function to index documents (diseases and symptoms) with embeddings
def index_documents(df):
    model, tokenizer = load_model()

    symptoms = df['Combined_Symptoms'].tolist()
    embeddings = generate_embeddings(symptoms, model, tokenizer)
    
    actions = []
    for i, (disease, symptom, embedding) in enumerate(zip(df['Disease'], symptoms, embeddings)):
        action = {
            "_op_type": "index",
            "_index": index_name,
            "_id": i,  # Optional, you can set a custom ID or let Elasticsearch auto-generate
            "_source": {
                "disease": disease,
                "symptoms": symptom,
                "embedding": embedding.tolist()  # Convert to list for JSON serialization
            }
        }
        actions.append(action)

    # Bulk indexing with error handling
    try:
        response = helpers.bulk(es, actions)
        print(f"Successfully indexed {len(actions)} documents.")
    except helpers.BulkIndexError as e:
        print(f"BulkIndexError: {len(e.errors)} documents failed to index.")
        for error in e.errors:
            print(error)


In [91]:
index_documents(df)

Successfully indexed 41 documents.
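`index_documents` encodes every symptom string in a single call, which is fine for 41 rows but can exhaust memory on a larger dataset. A minimal sketch of splitting the list into fixed-size chunks is shown below. The helper name `batched` and the batch size are assumptions for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder strings standing in for df['Combined_Symptoms'].tolist()
symptoms = [f"symptom text {i}" for i in range(5)]
sizes = [len(chunk) for chunk in batched(symptoms, batch_size=2)]
print(sizes)  # [2, 2, 1]
```

Each chunk could then be passed to `generate_embeddings`, with the resulting arrays concatenated before the bulk actions are built.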


In [92]:
index_name

'disease_prediction1'

In [97]:
# Function to search for symptoms based on user input
def search_symptoms(query, top_k=3):
    model, tokenizer = load_model()
    
    # Generate the embedding for the user query
    query_embedding = generate_embeddings([query], model, tokenizer)[0]

    # Search for the most similar symptoms in Elasticsearch
    script_query = {
        "script_score": {
            "query": {
                "match_all": {}  # Match all documents
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                "params": {
                    "query_vector": query_embedding.tolist()
                }
            }
        }
    }
    
    # Perform the search and get the top-k results
    response = es.search(index=index_name, body={
        "size": top_k,
        "query": script_query
    })

    # Parse the response
    results = []
    for hit in response['hits']['hits']:
        score = hit['_score']
        disease = hit['_source']['disease']
        symptoms = hit['_source']['symptoms']
        results.append({"disease": disease, "symptoms": symptoms, "score": score})

    return results
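The script adds 1.0 to the cosine similarity because Elasticsearch does not allow negative scores, so the returned `score` lies between 0 and 2 rather than between -1 and 1. The sketch below converts it back. The function name `score_to_cosine` and the sample hits are hypothetical.

```python
def score_to_cosine(score):
    """Undo the + 1.0 offset applied in the script_score query."""
    return score - 1.0


# Made-up results in the same shape that search_symptoms returns
hits = [
    {"disease": "A", "score": 1.75},
    {"disease": "B", "score": 0.5},
]
for hit in hits:
    hit["cosine"] = score_to_cosine(hit["score"])
    print(hit["disease"], hit["cosine"])
```

Reporting the raw cosine makes the value easier to compare against a similarity threshold chosen outside Elasticsearch.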

In [99]:

# create_index()

# User input for symptom search
user_query = "fever and sore throat"
top_k = 1  # Number of most relevant results to return

results = search_symptoms(user_query, top_k)

# Display the search results
for result in results:
    print(f"Disease: {result['disease']}")
    print(f"Symptoms: {result['symptoms']}")
    print(f"Relevance Score: {result['score']:.4f}")
    print("-" * 50)

Disease: Common Cold
Symptoms: Common Cold,  continuous_sneezing,  chills,  fatigue,  cough,  high_fever,  headache,  swelled_lymph_nodes,  malaise,  phlegm,  throat_irritation,  redness_of_eyes,  sinus_pressure,  runny_nose,  congestion,  chest_pain,  loss_of_smell,  muscle_pain, Common Cold,  chills,  fatigue,  cough,  high_fever,  headache,  swelled_lymph_nodes,  malaise,  phlegm,  throat_irritation,  redness_of_eyes,  sinus_pressure,  runny_nose,  congestion,  chest_pain,  loss_of_smell,  muscle_pain, Common Cold,  continuous_sneezing,  fatigue,  cough,  high_fever,  headache,  swelled_lymph_nodes,  malaise,  phlegm,  throat_irritation,  redness_of_eyes,  sinus_pressure,  runny_nose,  congestion,  chest_pain,  loss_of_smell,  muscle_pain, Common Cold,  continuous_sneezing,  chills,  cough,  high_fever,  headache,  swelled_lymph_nodes,  malaise,  phlegm,  throat_irritation,  redness_of_eyes,  sinus_pressure,  runny_nose,  congestion,  chest_pain,  loss_of_smell,  muscle_pain, Common