# Elasticsearch Implementation

## Goals
* Understand how Elasticsearch works
* Use the following fields: student names, mentor names, README summaries, report summaries, raw READMEs, raw reports, years, project title, domain
* Keyword search: student names, mentor names, domain, project title
* Semantic search: README summary, report summary, domain, project title
* Fuzzy matching / autocorrect
* Filtering

## CSVs to use

* overall_data.csv - Year, Domain, Project Title
* mentors.csv - Mentor
* students.csv - Students
* github.csv - readme raw, readme summarized
* report_contents.csv - raw and processed text



Things to note: don't forget about the language breakdown.


In [1]:
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
import numpy as np
from elasticsearch import Elasticsearch, helpers, exceptions
import pickle
# from sentence_transformers import SentenceTransformer



In [2]:

print(torch.backends.cudnn.enabled)
print(torch.cuda.is_available()) #We have GPU on deck and ready
print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")

True
True
CUDA device: NVIDIA GeForce RTX 3060 Laptop GPU


In [3]:
model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)



In [4]:
ovr_DF = pd.read_csv("../data/overall_data.csv", index_col= 0)
ovr_DF.head(3)

project_id,year_presented,domain,project_title
0,2020,Wikipedia & Social Analysis,Racial Bias in Film Awards Shows: Oscars & Gol...
1,2020,Wikipedia & Social Analysis,User Engagement in Wikipedia
2,2020,Wikipedia & Social Analysis,Investigating the Trustworthiness of Wikipedia...


In [5]:
mentor_DF = pd.read_csv("../data/mentors.csv")
mentor_DF.head(3)

Unnamed: 0,project_id,ucsd_or_industry,mentor_name
0,0,UCSD,Molly Roberts
1,1,UCSD,Molly Roberts
2,2,UCSD,Molly Roberts


In [6]:
students_DF = pd.read_csv("../data/students.csv")
students_DF.head(3)

Unnamed: 0,project_id,student
0,0,Rebecca Hu
1,0,Emily Kwan
2,0,Poonam Varkhedi


In [7]:
github_DF = pd.read_csv("../data/github.csv")
github_DF.head(5)
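# Assign the result back instead of calling fillna(inplace=True) on a column selection
github_DF["readme_summarized"] = github_DF["readme_summarized"].fillna("README summary not available")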

In [8]:
report_contents_DF = pd.read_csv("../data/report_contents.csv")
report_contents_DF.head(5)

Unnamed: 0,project_id,urls,text_raw,text_processed
0,27,https://dsc-capstone.org/projects-2020-2021/re...,"DSC180B Capstone Project Report\nJian Jiao, Zi...",This project report focuses on the development...
1,28,https://dsc-capstone.org/projects-2020-2021/re...,Neel Shah A15151631\n \n \nYuxuan Ma A15155201...,This paper explores the use of natural languag...
2,29,https://dsc-capstone.org/projects-2020-2021/re...,CoCoDroid: Detecting Malware By Building Commo...,The researchers developed a detection tool for...
3,30,https://dsc-capstone.org/projects-2020-2021/re...,Malware Detection\nYikai Hao\nUniversity of Ca...,The report discusses the importance of malware...
4,31,https://dsc-capstone.org/projects-2020-2021/re...,Machine Learning for Facial Analysis\nTing-Yan...,The paper discusses the use of machine learnin...


In [9]:
for i, row in ovr_DF.iterrows():
    # print(row)
    print(f"Project Title: {row['project_title']}")
    print(f"Domain: {row['domain']}")
    print(f"Year: {row['year_presented']}")

    # Mentor Portion
    mentor_subset_DF = mentor_DF[mentor_DF['project_id'] == i]
    industries = (",".join(list(set(mentor_subset_DF["ucsd_or_industry"].to_list()))))
    print(f"Industry: {industries}")

    mentors = (",".join(list(set(mentor_subset_DF["mentor_name"].fillna("Not Specified").to_list()))))
    print(f"Mentors: {mentors}")


    #Student
    student_subset_DF = students_DF[students_DF['project_id'] == i]
    students = (",".join(list(set(student_subset_DF["student"].fillna("Not Specified").to_list()))))
    print(f"Students: {students}")

    #Github
    if len(github_DF[github_DF["project_id"] == i]) == 1:
        readme_summary = str(github_DF[github_DF["project_id"] == i]["readme_summarized"])
    else:
        readme_summary = "README not available"
    print(f"Readme Summary: {(readme_summary)}")


    # Report
    if len(report_contents_DF[report_contents_DF["project_id"] == i]) == 1:
        report_summary = str(report_contents_DF[report_contents_DF["project_id"] == i]["text_processed"])
    else:
        report_summary = "Report Summary not available"
    print(f"Report Summary: {(report_summary)}")
    



    print("-" * 75)
    

Project Title: Racial Bias in Film Awards Shows: Oscars & Golden Globes 
Domain: Wikipedia & Social Analysis
Year: 2020
Industry: UCSD
Mentors: Molly Roberts
Students: Poonam Varkhedi,Rebecca Hu,Emily Kwan
Readme Summary: 0    # wiki-capstone\nPublic repository for DSC 180...
Name: readme_summarized, dtype: object
Report Summary: Report Summary not available
---------------------------------------------------------------------------
Project Title: User Engagement in Wikipedia 
Domain: Wikipedia & Social Analysis
Year: 2020
Industry: UCSD
Mentors: Molly Roberts
Students: Kenny Zhu,Jonathan Lin,Salma Shaikh
Readme Summary: 1    # DSC180B Wikipedia Engagement\n\nThis project...
Name: readme_summarized, dtype: object
Report Summary: Report Summary not available
---------------------------------------------------------------------------
Project Title: Investigating the Trustworthiness of Wikipedia and the Media in the Scope of COVID-19 
Domain: Wikipedia & Social Analysis
Year: 2020
Industr
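
The loop above filters each DataFrame once per project and joins names through a `set`, so the order of names can change between runs. A minimal sketch of an alternative is to group the rows by `project_id` once, keeping first-seen order. It uses plain dictionaries rather than pandas so it runs on its own; `collect_by_project` and the sample rows are hypothetical, with names taken from the `students.csv` preview.

```python
from collections import defaultdict


def collect_by_project(rows, value_key, missing="Not Specified"):
    """Group values by project_id, keeping first-seen order and dropping duplicates."""
    grouped = defaultdict(list)
    for row in rows:
        # None or an empty string falls back to the placeholder
        value = row.get(value_key) or missing
        if value not in grouped[row["project_id"]]:
            grouped[row["project_id"]].append(value)
    return {pid: ",".join(values) for pid, values in grouped.items()}


# Hypothetical rows shaped like students.csv
student_rows = [
    {"project_id": 0, "student": "Rebecca Hu"},
    {"project_id": 0, "student": "Emily Kwan"},
    {"project_id": 0, "student": "Poonam Varkhedi"},
    {"project_id": 1, "student": None},
]

students_by_project = collect_by_project(student_rows, "student")
print(students_by_project[0])
print(students_by_project[1])
```

With real DataFrames the rows could come from `students_DF.to_dict("records")`, although missing values would then arrive as NaN rather than `None` and would need a `fillna` first.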

In [10]:
# Run the command below to start a single-node Elasticsearch cluster
# docker run --rm -p 9200:9200 -p 9300:9300 -e "xpack.security.enabled=false" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.11.0

In [11]:
es = Elasticsearch("http://localhost:9200")
es.info().body

{'name': '4f69e1b63a57',
 'cluster_name': 'docker-cluster',
 'cluster_uuid': 'nH5Z0rYyRMyJOO368zysBg',
 'version': {'number': '8.11.0',
  'build_flavor': 'default',
  'build_type': 'docker',
  'build_hash': 'd9ec3fa628c7b0ba3d25692e277ba26814820b20',
  'build_date': '2023-11-04T10:04:57.184859352Z',
  'build_snapshot': False,
  'lucene_version': '9.8.0',
  'minimum_wire_compatibility_version': '7.17.0',
  'minimum_index_compatibility_version': '7.0.0'},
 'tagline': 'You Know, for Search'}

In [12]:
# delete model if already downloaded and deployed


In [13]:
# es.ml.put_

In [14]:
# Adapted from the Hugging Face tutorials
def cls_pooling(model_output):
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    # Pad and truncate so the inputs batch into a single tensor
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [15]:
# get_embeddings("Test sentence").detach().numpy()[0]

In [16]:
mappings = {
    "properties": {
        "year_presented": {"type": "integer"},
        "domain": {"type": "text"},
        "project_title": {"type": "text"},
        "project_title_vector": {"type": "dense_vector", "dims": 768, "similarity": "cosine"},
        "industry": {"type": "text"},
        "mentors": {"type": "text"},
        "members": {"type": "text"},
        "report_text_summarization": {"type": "text"},
        "readme_summarization": {"type": "text", "analyzer": "english"},
        "readme_vector": {"type": "dense_vector", "dims": 768, "similarity": "cosine"},
        "report_vector": {"type": "dense_vector", "dims": 768, "similarity": "cosine"}
    }
}

es.indices.create(index="capstones", mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'capstones'})
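
Re-running the cell above against a cluster that already holds a `capstones` index raises an error, because `indices.create` refuses to overwrite an existing index. A minimal sketch of a guard follows. `ensure_index` is a hypothetical helper, and it assumes the client exposes `indices.exists`, `indices.create` and `indices.delete` with an `index=` keyword, as the 8.x Python client does. The `FakeClient` classes are stand-ins that only exist so the helper can be exercised without a running cluster.

```python
def ensure_index(client, name, mappings, recreate=False):
    """Create `name` if it is missing; optionally drop and recreate it first.

    Returns True when an index was created, False when it was left untouched.
    """
    exists = bool(client.indices.exists(index=name))
    if exists and recreate:
        client.indices.delete(index=name)
        exists = False
    if not exists:
        client.indices.create(index=name, mappings=mappings)
        return True
    return False


class FakeIndices:
    """In-memory stand-in for `es.indices`, used only to run the helper offline."""

    def __init__(self):
        self.store = {}

    def exists(self, index):
        return index in self.store

    def create(self, index, mappings=None):
        self.store[index] = mappings

    def delete(self, index):
        del self.store[index]


class FakeClient:
    def __init__(self):
        self.indices = FakeIndices()


client = FakeClient()
first = ensure_index(client, "capstones", {"properties": {}})
second = ensure_index(client, "capstones", {"properties": {}})
third = ensure_index(client, "capstones", {"properties": {}}, recreate=True)
print(first, second, third)
```

Against the real cluster the call would presumably be `ensure_index(es, "capstones", mappings)`; passing `recreate=True` discards every indexed document, so it is only appropriate while iterating on the mapping.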

In [17]:
readme_vector_dict = pd.read_pickle("../data/readme_vector_dict.pkl")
report_vector_dict = pd.read_pickle("../data/report_vector_dict.pkl")
project_title_vector_dict = pd.read_pickle("../data/project_title_vector_dict.pkl")




for i, row in ovr_DF.iterrows():
    # print(row)

    # print(f"Project Title: {row['project_title']}")
    # print(f"Domain: {row['domain']}")
    # print(f"Year: {row['year_presented']}")

    # project_title_vector = get_embeddings(row['project_title']).detach().numpy()[0]
    project_title_vector = project_title_vector_dict[i] 

    # Mentor Portion
    mentor_subset_DF = mentor_DF[mentor_DF['project_id'] == i]
    industries = (",".join(list(set(mentor_subset_DF["ucsd_or_industry"].to_list()))))
    # print(f"Industry: {industries}")

    mentors = (",".join(list(set(mentor_subset_DF["mentor_name"].fillna("Not Specified").to_list()))))
    # print(f"Mentors: {mentors}")


    #Student
    student_subset_DF = students_DF[students_DF['project_id'] == i]
    students = (",".join(list(set(student_subset_DF["student"].fillna("Not Specified").to_list()))))
    # print(f"Students: {students}")

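    # GitHub README summary
    readme_rows = github_DF[github_DF["project_id"] == i]
    if len(readme_rows) == 1:
        # .iloc[0] extracts the text itself; str() on the Series would store its repr
        readme_summary = str(readme_rows["readme_summarized"].iloc[0])
    else:
        readme_summary = "README not available"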
    # print(f"Readme Summary: {(readme_summary)}")
    # readme_vector = get_embeddings(readme_summary).detach().numpy()[0]
    readme_vector = readme_vector_dict[i] 


    # Report summary
    report_rows = report_contents_DF[report_contents_DF["project_id"] == i]
    if len(report_rows) == 1:
        # .iloc[0] extracts the text itself; str() on the Series would store its repr
        report_summary = str(report_rows["text_processed"].iloc[0])
    else:
        report_summary = "Report Summary not available"
    
    # report_vector = get_embeddings(report_summary).detach().numpy()[0]
    report_vector = report_vector_dict[i]
    # print(f"Report Summary: {(report_summary)}")
    

    # "year_presented": {"type": "integer"},
    #         "domain": {"type": "text"},
    #         "project_title": {"type": "text"},
    #         "industry": {"type": "text"},
    #         "mentors": {"type": "text"},
    #         "members": {"type": "text"},
    #         "report_text_summarization": {"type": "text"},
    #         "readme_summarization": {"type": "text", "analyzer" : "english"}

    doc = {
        "year_presented": row['year_presented'],
        "domain": row["domain"],
        "project_title": row["project_title"],
        "project_title_vector": project_title_vector,
        "industry": industries,  # declared in the mapping and computed above
        "mentors": mentors,
        "members": students,
        "report_text_summarization": report_summary,
        "readme_summarization": readme_summary,
        "readme_vector": readme_vector,
        "report_vector": report_vector
    }
            
    es.index(index="capstones", id=i, document=doc)

    # print(f"Project Title: {row['project_title']}")
    # print(f"Project Title: {row['project_title']}")
    # print(f"Project Title: {row['project_title']}")
    # print("-" * 75)
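
The loop above sends one `es.index` request per project. The notebook already imports `helpers`, so one option is to batch the documents with `helpers.bulk`. The sketch below only builds the action dictionaries, so it runs without a cluster; `build_actions` and the two sample documents are hypothetical.

```python
def build_actions(docs_by_id, index_name="capstones"):
    """Yield one bulk action per document, in the dict shape helpers.bulk accepts."""
    for doc_id, doc in docs_by_id.items():
        yield {"_index": index_name, "_id": doc_id, "_source": doc}


# Hypothetical, trimmed-down documents keyed by project_id
docs_by_id = {
    0: {"project_title": "User Engagement in Wikipedia", "year_presented": 2020},
    1: {"project_title": "Servicechain.io", "year_presented": 2023},
}

actions = list(build_actions(docs_by_id))
print(len(actions), actions[0]["_id"], actions[0]["_index"])

# With a live cluster the generator could be passed straight through:
# helpers.bulk(es, build_actions(docs_by_id))
```

For 211 documents the difference is small, but a single bulk request avoids 211 round trips and scales better if raw README and report text are indexed later.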
    

In [18]:
len(readme_vector_dict)

211

In [19]:
len(report_vector_dict)

211

In [20]:
len(project_title_vector_dict)

211

In [21]:
# file = open("../data/readme_vector_dict.pkl", 'wb')

# # dump information to that file
# pickle.dump(readme_vector_dict, file)

# # close the file
# file.close()

In [22]:
# file = open("../data/project_title_vector_dict.pkl", 'wb')

# # dump information to that file
# pickle.dump(project_title_vector_dict, file)

# # close the file
# file.close()

In [23]:
# resp = es.search(
#     index="capstones",
#     query={
#             "bool": {
#                 "must": [{
#                     "multi_match": {
#                         "query": "Social Analysis",
#                         "fields" : ["project_title", "domain^2"]
#                     }
#                 }, {
#                     "query": {
#                         "mentors": "Justin Eldridge",
#                         "fuzziness" : "AUTO"
#                     }
#                 }]
#             }
#     }
# )
# resp.body
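
The commented-out query above puts a bare `"query"` object inside `must`, which is not an Elasticsearch query clause; a fuzzy match on one field is normally written as `match` with a nested `query` and `fuzziness`. A minimal sketch of a builder for that shape follows, with an optional year filter to cover the filtering goal. `build_keyword_query` is a hypothetical helper and the field names are taken from the mapping in this notebook.

```python
def build_keyword_query(text, mentor=None, year=None):
    """Build a bool query: multi_match on title/domain, optional fuzzy mentor match, optional year filter."""
    must = [
        {"multi_match": {"query": text, "fields": ["project_title", "domain^2"]}}
    ]
    if mentor is not None:
        # Fuzzy match on the mentors field tolerates small spelling mistakes
        must.append({"match": {"mentors": {"query": mentor, "fuzziness": "AUTO"}}})

    bool_query = {"must": must}
    if year is not None:
        # Filters restrict the result set without affecting the relevance score
        bool_query["filter"] = [{"term": {"year_presented": year}}]
    return {"bool": bool_query}


query = build_keyword_query("Social Analysis", mentor="Justin Eldridge", year=2020)
print(query)
```

The result would be passed as `es.search(index="capstones", query=query)`.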

In [24]:
response = es.search(
    index="capstones",
    knn={
      "field": "report_vector",
      "query_vector": get_embeddings("Crypto currency and blockchain").detach().numpy()[0],
      "k": 10,
      "num_candidates": 100
    }
)

response.body

{'took': 124,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 10, 'relation': 'eq'},
  'max_score': 0.72083926,
  'hits': [{'_index': 'capstones',
    '_id': '149',
    '_score': 0.72083926,
    '_source': {'year_presented': 2023,
     'domain': 'Finance and Blockchain',
     'project_title': 'Servicechain.io',
     'project_title_vector': [0.17259149253368378,
      0.030665665864944458,
      -0.28255271911621094,
      -0.06737937033176422,
      -0.0009485445916652679,
      -0.059149615466594696,
      0.5087307095527649,
      -0.10423146188259125,
      0.16391713917255402,
      0.10937920212745667,
      0.18004857003688812,
      0.08394765853881836,
      -0.006359661929309368,
      0.4690544009208679,
      0.14115047454833984,
      0.12588930130004883,
      0.013608008623123169,
      0.17137986421585083,
      0.40301790833473206,
      -0.21031051874160767,
      -0.06020835041999817,
      0.1409
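
Two things stand out in the response above: every hit carries its full 768-dimensional vectors in `_source`, and the score is not a raw cosine. For `dense_vector` fields with `"similarity": "cosine"`, Elasticsearch reports `_score = (1 + cosine) / 2`. The sketch below builds the search arguments with the vector fields excluded and converts a score back to a cosine. Both helpers are hypothetical, and `source_excludes` is assumed to be accepted by `es.search` as in the 8.x Python client.

```python
VECTOR_FIELDS = ["project_title_vector", "readme_vector", "report_vector"]


def knn_search_kwargs(query_vector, field="report_vector", k=10, num_candidates=100):
    """Keyword arguments for es.search that run a kNN query and omit vectors from _source."""
    return {
        "index": "capstones",
        "knn": {
            "field": field,
            # Plain floats keep the request JSON-serializable
            "query_vector": [float(x) for x in query_vector],
            "k": k,
            "num_candidates": num_candidates,
        },
        "source_excludes": VECTOR_FIELDS,
    }


def cosine_from_score(score):
    """Invert the cosine kNN scoring rule, _score = (1 + cosine) / 2."""
    return 2.0 * score - 1.0


kwargs = knn_search_kwargs([0.1, 0.2, 0.3], k=5)
print(kwargs["knn"]["k"], kwargs["source_excludes"])
print(round(cosine_from_score(0.72083926), 4))  # 0.4417
```

The top hit's `_score` of 0.72083926 therefore corresponds to a cosine similarity of roughly 0.44 between the query and the report vector. With a live cluster the call would be `es.search(**knn_search_kwargs(vector))`.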

In [None]:
resp = es.search(
    index="capstones",
    query={
        "multi_match": {
            "query": "Jastin Eldrige",  # misspelling of "Justin Eldridge" to exercise fuzzy matching
            "fields": ["mentors"],
            "fuzziness": "AUTO"
        }
    },
)
resp.body
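
To see why the misspelled query above can still match, it helps to count edits per term. With `"fuzziness": "AUTO"` Elasticsearch allows no edits for terms of 1 to 2 characters, one edit for 3 to 5 characters, and two edits for longer terms. The sketch below uses plain Levenshtein distance as an approximation (Elasticsearch also counts a transposition as a single edit by default); both functions are illustrative and not part of any library.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def auto_fuzziness(term):
    """Maximum edits allowed by the default AUTO setting for a term of this length."""
    n = len(term)
    if n <= 2:
        return 0
    return 1 if n <= 5 else 2


# Terms are compared in lowercase, as the standard analyzer lowercases text fields
for typed, indexed in [("jastin", "justin"), ("eldrige", "eldridge")]:
    distance = edit_distance(typed, indexed)
    print(typed, indexed, distance, distance <= auto_fuzziness(typed))
```

Both terms are one edit away from the indexed name and both are long enough to allow two edits, so the query is expected to match the mentor.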

In [None]:
example_query_1 = "Justin Eldridge"
example_query_2 = "Crypto Currency"

In [None]:
# TODO: figure out a better way to query multiple fields
# TODO: add the semantic (vector) search manually
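
One possible way to add the semantic results by hand is to run the keyword query and the kNN query separately and merge the two ranked lists of `_id` values. The sketch below uses reciprocal rank fusion, which is a technique swapped in here for illustration and not something the notebook has settled on. The function name, the constant `k=60` and the sample ids are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical hit ids, best first, from a keyword query and a kNN query
keyword_hits = ["149", "12", "7"]
semantic_hits = ["149", "7", "30"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that appear in both lists rise to the top, and because only ranks are used, the BM25 and cosine scores never have to be put on a common scale.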