```
Summary by:
- A41316 - Nguyễn Hữu Khoa
- A42718 - Lê Thảo Quyên
```

> **Note:** Much of the original source code is outdated. Some parameters and function calls have been changed to match Elasticsearch version 7.17.10.


### Step 1: Setting up Elasticsearch locally

- Download and install `elasticsearch` and `kibana` from https://www.elastic.co/downloads/elasticsearch and https://www.elastic.co/downloads/kibana 
- Unzip both archives and run the executables from each `bin` folder
- By default, Elasticsearch listens on port 9200 and Kibana on port 5601
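
Before creating any index, it can help to confirm that the node is actually reachable. The following is a minimal sketch using only the standard library. `cluster_info` is a hypothetical helper name, and it assumes the default unsecured local setup described above (`http://localhost:9200`, no authentication).

```python
import json
import urllib.error
import urllib.request


def cluster_info(base_url="http://localhost:9200", timeout=3.0):
    """Return the JSON document served at the Elasticsearch root URL.

    Returns None when the URL is invalid, nothing answers, or the
    response is not JSON. (Hypothetical helper, not part of the notebook.)
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # URLError/OSError: connection refused or timed out.
        # ValueError: malformed URL or a non-JSON body.
        return None
```

With a running node, `cluster_info()` should return a dictionary that includes the version number; `None` means nothing usable answered at that address.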

In [21]:
from datetime import datetime
from elasticsearch import Elasticsearch
import wikipedia
import wikipediaapi
import requests

In [5]:
# Elasticsearch client used to communicate with database
client = Elasticsearch('http://localhost:9200')
indexName = "medical" #index name
client.indices.create(index=indexName) # create index

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'medical'}

In [4]:
# indexName = "medical" #index name
# client.indices.delete(index=indexName) #delete an index

In [9]:
diseaseMapping = {
    'diseases': { # specify the mapping type here
        'properties': {
            'name': {'type': 'text'},
            'title': {'type': 'text'},
            'fulltext': {'type': 'text'}
        }
    }
}


In [11]:
client.indices.put_mapping(index=indexName, doc_type='diseases', body=diseaseMapping, include_type_name=True)  # put mapping

{'acknowledged': True}
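
Mapping types are deprecated in Elasticsearch 7.x, which is why the call above needs `include_type_name=True`. As a hedged alternative, the same fields can be declared without a type. This sketch assumes the `elasticsearch` Python client 7.x, where `indices.put_mapping` accepts `index` and `body`; `build_disease_mapping` and `apply_mapping` are illustrative names, not part of the original notebook.

```python
def build_disease_mapping():
    """Return a typeless mapping body with the same three text fields."""
    return {
        "properties": {
            "name": {"type": "text"},
            "title": {"type": "text"},
            "fulltext": {"type": "text"},
        }
    }


def apply_mapping(client, index_name):
    """Send the typeless mapping to an existing index.

    `client` is assumed to be an elasticsearch.Elasticsearch instance (7.x).
    No doc_type or include_type_name is passed here.
    """
    return client.indices.put_mapping(index=index_name, body=build_disease_mapping())
```

Under these assumptions, `apply_mapping(client, indexName)` would take the place of the typed `put_mapping` call.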

In [23]:
# A mapping cannot be removed on its own in Elasticsearch 7.x; to start over, delete and recreate the index:
# client.indices.delete(index=indexName)
# client.indices.create(index=indexName)

### Step 2 & 3: Data retrieval and preparation (Wikipedia API)

In [12]:
dl = wikipedia.page("Lists_of_diseases")
dl.links

['Airborne disease',
 'Contagious disease',
 'Cryptogenic disease',
 'Disease',
 'Disseminated disease',
 'Endocrine disease',
 'Environmental disease',
 'Eye disease',
 'Health On the Net Foundation',
 'Lifestyle disease',
 'List of abbreviations for diseases and disorders',
 'List of autoimmune diseases',
 'List of cancer types',
 'List of childhood diseases and disorders',
 'List of communication disorders',
 'List of diseases (0–9)',
 'List of diseases (A)',
 'List of diseases (B)',
 'List of diseases (C)',
 'List of diseases (D)',
 'List of diseases (E)',
 'List of diseases (F)',
 'List of diseases (G)',
 'List of diseases (H)',
 'List of diseases (I)',
 'List of diseases (J)',
 'List of diseases (K)',
 'List of diseases (L)',
 'List of diseases (M)',
 'List of diseases (N)',
 'List of diseases (O)',
 'List of diseases (P)',
 'List of diseases (Q)',
 'List of diseases (R)',
 'List of diseases (S)',
 'List of diseases (T)',
 'List of diseases (U)',
 'List of diseases (V)',
 'List o

In [26]:
wiki_wiki = wikipediaapi.Wikipedia('en')

diseaseListArray = []

for link in dl.links:
    if link.startswith("List of diseases"):
        try:
            page_title = link
            page = wiki_wiki.page(page_title)
            
            # Check if the page exists
            if page.exists():
                diseaseListArray.append(page)
            else:
                # Try alternative titles or variations
                variations = [
                    page_title,
                    f"List_of_diseases_{page_title[-1]}",
                    f"List_of_diseases_{page_title[-1].upper()}"
                ]
                for variation in variations:
                    alt_page = wiki_wiki.page(variation)
                    if alt_page.exists():
                        diseaseListArray.append(alt_page)
                        break
                else:
                    print(f"Page '{page_title}' does not exist. Skipping...")
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            print(f"An error occurred while fetching the page '{link}'. Skipping...")

print(diseaseListArray)


An error occurred while fetching the page 'List of diseases (M)'. Skipping...
[List of diseases (0–9) (id: 5450474, ns: 0), List of diseases (A) (id: 236329, ns: 0), List of diseases (B) (id: 236333, ns: 0), List of diseases (C) (id: 236335, ns: 0), List of diseases (D) (id: 236337, ns: 0), List of diseases (E) (id: 236338, ns: 0), List of diseases (F) (id: 236339, ns: 0), List of diseases (G) (id: 236340, ns: 0), List of diseases (H) (id: 236342, ns: 0), List of diseases (I) (id: 236344, ns: 0), List of diseases (J) (id: 236345, ns: 0), List of diseases (K) (id: 236346, ns: 0), List of diseases (L) (id: 236349, ns: 0), List of diseases (N) (id: 61784, ns: 0), List of diseases (O) (id: 61785, ns: 0), List of diseases (P) (id: 61786, ns: 0), List of diseases (Q) (id: 61787, ns: 0), List of diseases (R) (id: 61788, ns: 0), List of diseases (S) (id: 61789, ns: 0), List of diseases (T) (id: 61790, ns: 0), List of diseases (U) (id: 61791, ns: 0), List of diseases (V) (id: 61792, ns: 0), Lis

In [28]:
diseaseListArray[0].links

{'11 beta hydroxylase deficiency': 11 beta hydroxylase deficiency (id: ??, ns: 0),
 '11 beta hydroxysteroid dehydrogenase type 2 deficiency': 11 beta hydroxysteroid dehydrogenase type 2 deficiency (id: ??, ns: 0),
 '17-beta-hydroxysteroid dehydrogenase deficiency': 17-beta-hydroxysteroid dehydrogenase deficiency (id: ??, ns: 0),
 '17 alpha hydroxylase deficiency': 17 alpha hydroxylase deficiency (id: ??, ns: 0),
 '17 beta hydroxysteroide dehydrogenase deficiency': 17 beta hydroxysteroide dehydrogenase deficiency (id: ??, ns: 0),
 '17q21.31 microdeletion syndrome': 17q21.31 microdeletion syndrome (id: ??, ns: 0),
 '18-Hydroxylase deficiency': 18-Hydroxylase deficiency (id: ??, ns: 0),
 '18p deletion syndrome': 18p deletion syndrome (id: ??, ns: 0),
 '1p36 deletion syndrome': 1p36 deletion syndrome (id: ??, ns: 0),
 '2,8 dihydroxy-adenine urolithiasis': 2,8 dihydroxy-adenine urolithiasis (id: ??, ns: 0),
 '2-Hydroxyglutaricaciduria': 2-Hydroxyglutaricaciduria (id: ??, ns: 0),
 '2-Methyla

In [30]:
# checkList holds, for each disease list, the allowed first characters of a disease name; links that do not match are skipped
checkList = [["0","1","2","3","4","5","6","7","8","9"],["A"],["B"],["C"],["D"],["E"],["F"],["G"],["H"],["I"],["J"],["K"],["L"],["M"],["N"],["O"],["P"],["Q"],["R"],["S"],["T"],["U"],["V"],["W"],["X"],["Y"],["Z"]]
docType = 'diseases' # document type we will index

for diseaselistNumber, diseaselist in enumerate(diseaseListArray):  # loop through disease lists
    for disease in diseaselist.links:  # loop through lists of links for every disease list
        try:
            # first check if it is a disease, then index it
            if disease[0] in checkList[diseaselistNumber] and not disease.startswith("List"):
                currentPage = wikipedia.page(disease)
                client.index(
                    index=indexName,
                    id=disease,
                    document={
                        "name": disease,
                        "title": currentPage.title,
                        "fulltext": currentPage.content
                    }
                )
        except Exception as e:
            print(e)  # log the failure and continue with the next disease
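
The loop above pairs each list page with `checkList` by position. Because the page for "M" was skipped earlier, `diseaseListArray` has one entry fewer than `checkList`, so every list from "N" onward is compared against the wrong letter. One possible fix, sketched here under the assumption that every list page is titled `List of diseases (X)`, is to derive the allowed first characters from the page title instead. `allowed_initials` and `is_candidate` are hypothetical helper names.

```python
import re
import string

# Matches titles such as "List of diseases (A)" or "List of diseases (0–9)".
_LIST_TITLE = re.compile(r"^List of diseases \((.+)\)$")


def allowed_initials(list_title):
    """Return the set of first characters accepted for one list page."""
    match = _LIST_TITLE.match(list_title)
    if match is None:
        return set()
    label = match.group(1)
    if label[0].isdigit():
        # The numeric list covers every digit.
        return set(string.digits)
    return {label} if len(label) == 1 else set()


def is_candidate(name, initials):
    """Mirror the original check: right first character, not another list page."""
    return bool(name) and name[0] in initials and not name.startswith("List")
```

Inside the loop, `allowed_initials(diseaselist.title)` would then stand in for `checkList[diseaselistNumber]`, which keeps the check correct even when a list page is missing.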




Page id "4 hydroxyphenylacetic acidalia" does not match any pages. Try another id!
HTTPConnectionPool(host='en.wikipedia.org', port=80): Max retries exceeded with url: /w/api.php?prop=extracts%7Crevisions&explaintext=&rvprop=ids&titles=Achromatopsia&format=json&action=query (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x000001C88EDFA2C0>, 'Connection to en.wikipedia.org timed out. (connect timeout=None)'))
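
The timeout above is a transient network failure, so the affected disease is simply lost in the current loop. A small retry wrapper with exponential backoff is one way to soften this. The sketch below is generic and uses only the standard library; `call_with_retries` and its parameters are illustrative names, not something the `wikipedia` package provides.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on the given exception types.

    Waits base_delay, then 2 * base_delay, and so on between attempts,
    and re-raises the last error once `attempts` calls have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

In the indexing loop this could look like `call_with_retries(wikipedia.page, disease, retry_on=(requests.exceptions.ConnectionError,))`, assuming the timeout surfaces as a `requests` connection error as the traceback suggests. Errors such as a missing page are not retried and still reach the surrounding `except` block.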




  lis = BeautifulSoup(html).find_all('li')


"acme" may refer to: 
Acme (album)
Catullus 45
Acme Corporation
ACME Detective Agency
Acme Studios
The ACME Laboratories Ltd
Acme Motor Co
Acme Press
Acme Space
Acme Whistles
Rockstar North
Acme Aircraft Co
Acme Aircraft Corporation
Acme Boots
Acme Bread Company
Acme Brick
ACME Comics & Collectibles
ACME Communications
Acme Fresh Market
Acme Markets
Acme (automobile)
ACME Newspictures
Acme Packet
Acme Tackle Company
Acme Truck Line
Acme United Corporation
Air Craft Marine Engineering
Acme, Alberta
Acme, Indiana
Acme, Kansas
Acme, Louisiana
Acme, Michigan
Acme Township, Michigan
Acme, North Carolina
Acme Township, Hettinger County, North Dakota
Acme, Oklahoma
Acme, Pennsylvania
Acme, Texas
Acme, Washington
Acme, West Virginia
Acme Farm Supply Building
ACME Comedy Theatre
Acme (computer virus)
ACME (health software)
Acme (text editor)
Acme thread form
Acme zone
Arginine catabolic mobile element
Automatic Certificate Management Environment
Summit
Advisory Committee on Mathematics Educatio

KeyboardInterrupt: 
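
The run above was stopped by hand (`KeyboardInterrupt`). One possible way to make a re-run cheaper is to skip names that are already in the index before fetching them from Wikipedia again. This sketch assumes the 7.x Python client, whose `exists(index=..., id=...)` call returns a boolean, and relies on the fact that the loop uses the disease name as the document id. `pending_diseases` is a hypothetical helper.

```python
def pending_diseases(client, index_name, names):
    """Yield only the names that have no document in the index yet.

    `client` is assumed to be an elasticsearch.Elasticsearch instance (7.x).
    """
    for name in names:
        if not client.exists(index=index_name, id=name):
            yield name
```

Wrapping the inner loop as `for disease in pending_diseases(client, indexName, diseaselist.links):` would let an interrupted run continue roughly where it stopped, at the cost of one extra lightweight request per name.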