# Similarity Modulation

Here we implement a similarity other than BM25, which is the default in Elasticsearch. You will implement a tf-idf similarity and test it with the same queries you used in phase 2, so that you can get a sense of how well your Elasticsearch tf-idf performs. Follow the instructions and fill in every place marked `# TODO`.  <br>
If you run into any problems, you can contact me on Telegram: @mahvash_sp

In [1]:
from elasticsearch import Elasticsearch, helpers
import json
import warnings



In [2]:
# import data in json format
file_name = 'IR_data_news_12k.json'

with open(file_name) as f:
    data = json.load(f)

In [3]:
# Filter warnings
warnings.filterwarnings('ignore')

In [4]:
# data keys
data['0'].keys()

dict_keys(['title', 'content', 'tags', 'date', 'url', 'category'])

After starting Elasticsearch on your machine (localhost:9200 is the default address), connect to it with the following code:


In [6]:
# Here we try to connect to Elastic
es = Elasticsearch("http://localhost:9200")

## Create tf-idf Index

### Create Index

In [7]:
# Name of index 
sm_index_name = 'tfidf_index'

In [8]:
# Delete the index if it already exists
if es.indices.exists(index=sm_index_name):
    es.indices.delete(index=sm_index_name)

# Create index    
es.indices.create(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'tfidf_index'})

### Add documents

Here we use the bulk doc formatter that was introduced in the first subsection of phase 3. <br>
You can find out more [here](https://stackoverflow.com/questions/61580963/insert-multiple-documents-in-elasticsearch-bulk-doc-formatter).

In [9]:

from elasticsearch.helpers import bulk

def bulk_sync():
    actions = [
        {
            '_index': sm_index_name,
            '_id':doc_id,
            '_source': doc
        } for doc_id,doc in data.items()
    ]
    bulk(es, actions)
    
    


In [10]:
# run the function to add documents
bulk_sync()

In [11]:
# Check index
es.count(index = sm_index_name)

ObjectApiResponse({'count': 8420, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}})

### Configuring a similarity

To configure a new similarity function, change the similarity through the settings API of the index. In Python this is done with the `put_settings` function. We override the `default` similarity so that Elasticsearch uses our similarity in place of BM25. The type of this similarity is set to `scripted` because tf-idf is no longer among the predefined similarity functions in Elasticsearch. Because the similarity is scripted, its source code must be written **by you** and passed to it.<br>
> For the changes to be applied, we first close the index, then change the settings, and then reopen it.<br>

Write the tf-idf code as a string and pass it as the value of the "source" key. <br>
You can find the variables available to your code [here](https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-similarity-context.html).
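
Before passing a script to the index settings, it can help to check the formula offline. The following is a minimal Python sketch, assuming one common tf-idf variant (square-root term frequency, smoothed logarithmic idf, and length normalization). The function name `tfidf_score` and the sample statistics are hypothetical. The argument names only mirror the Painless similarity-context variables `doc.freq`, `field.docCount`, `term.docFreq`, `doc.length`, and `query.boost`.

```python
import math


def tfidf_score(freq, doc_count, doc_freq, doc_length, boost=1.0):
    """Score one term in one document with a simple tf-idf variant.

    freq        -> doc.freq        (occurrences of the term in the document)
    doc_count   -> field.docCount  (documents that have the field)
    doc_freq    -> term.docFreq    (documents that contain the term)
    doc_length  -> doc.length      (number of tokens in the document's field)
    boost       -> query.boost
    """
    tf = math.sqrt(freq)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    norm = 1.0 / math.sqrt(doc_length)
    return boost * tf * idf * norm


# Hypothetical statistics, chosen only to make the arithmetic easy to follow.
rare_term = tfidf_score(freq=4, doc_count=999, doc_freq=9, doc_length=100)
common_term = tfidf_score(freq=4, doc_count=999, doc_freq=499, doc_length=100)
print(rare_term, common_term)
```

With the same frequency and document length, the rarer term gets the higher score, which is the behaviour expected from the idf factor.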

In [12]:
# TODO: write the tf-idf scoring script (Painless) here
source_code = "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"

In [13]:
# closing the index
es.indices.close(index=sm_index_name)

# applying the settings
es.indices.put_settings(index=sm_index_name, 
                            settings={
                                "similarity": {
                                      "default": {
                                        "type": "scripted",
                                        "script": {
                                          # TODO: pass the script source string here
                                          "source": source_code
                                        }
                                      }
                                }
                            }
                       )

# reopening the index
es.indices.open(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})

### Query

In this section you test your index with the same queries you used in phase 2. The goal is to observe how differently or similarly your Elasticsearch tf-idf implementation behaves.
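
One way to put a number on "how different or similar" two result lists are is the Jaccard overlap of their top-k entries. The sketch below is only an illustration: the helper name `jaccard_at_k` and the placeholder URLs are hypothetical and are not taken from the actual results.

```python
def jaccard_at_k(ranking_a, ranking_b, k=10):
    """Jaccard overlap of the top-k items of two ranked lists (order is ignored)."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    if not top_a and not top_b:
        # Two empty result lists are treated as identical.
        return 1.0
    return len(top_a & top_b) / len(top_a | top_b)


# Placeholder URL lists standing in for the results of two systems.
urls_system_a = ["url-1", "url-2", "url-3"]
urls_system_b = ["url-2", "url-3", "url-4"]
overlap = jaccard_at_k(urls_system_a, urls_system_b, k=3)
print(overlap)
```

A value of 1.0 means both systems returned the same set of documents in their top k, and 0.0 means they share none. The measure ignores rank order, so it is a coarse comparison.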

In [14]:
# Build the request body for a match query on the content field
def get_query(text):
    body = {
        "query": {
            "match": {
                "content": text
            }
        }
    }

    return body

In [24]:
queries = [
    "صهیونیست", " تحریم‌های هسته‌ای آمریکا علیه ایران"
]

In [25]:
all_res_tfidf = []


for q in queries:
    res_tfidf = es.search(index=sm_index_name, body=get_query(q), explain=True)
    all_res_tfidf.append(dict(res_tfidf))

In [26]:
for res, q in zip(all_res_tfidf, queries):
    print(q)
    for doc in res['hits']['hits']:
        print(doc['_source']['url'])
    print("----------------------------")

صهیونیست
https://www.farsnews.ir/news/14001002000158/محکومیت-حضور-کاروان-ورزشی-رژیم-صهیونستی-در-امارات-عکس
https://www.farsnews.ir/news/14001102000416/ورزشکار-کویتی-دست-رد-به-سینه-صهیونیست‌ها-زد-عکس
https://www.farsnews.ir/news/14001119000767/رونمایی-از-جایزه-خوش-خدمتی-سعودی‌ها-به-رژیم-صهیونیستی-در-جودو-عکس
https://www.farsnews.ir/news/14001006000161/اقدام-حمایتی-عمان-از-فلسطین-سرود-و-پرچم-رژیم-اشغالگر-بایکوت-شد-عکس
https://www.farsnews.ir/news/14000928000230/اتفاق-عجیب-در-ادامه-عادی-سازی-روابط-رژیم-اشغالگر-با-جهان-اسلام-مهدوی
https://www.farsnews.ir/news/14000929000460/موج-اعتراضات-از-ایران-به-عراق-رسید-محکومیت-یونس-محمود-مثل-مهدوی‌کیا
https://www.farsnews.ir/news/14001011000158/لابی-صهیونیست‌ها-در-ورزش-هشدار-جدی-به-مالزی-به-خاطر-حمایت-از-فلسطین
https://www.farsnews.ir/news/14000929000102/مهدوی‌کیا-حامل-پرچم-رژیم-اشغالگر-شد-خشم-صهیونیست‌ها-از-ستارگان
https://www.farsnews.ir/news/14001004000131/مهدوی‌کیا-به-جای-عذرخواهی-در-دام-صهیونیست‌ها-افتاد-سکوت-شائبه‌برانگیز
https://www.farsnews.i