- Python Version - 3.9.x
- Recommendation (Optional)
- Create a virtual environment for Python 3.9, in case you have a different version installed, using the command below
conda create --name py39 python=3.9
- Install dependencies using the following command (requirements.txt is in the root folder)
pip install -r requirements.txt
- Create a folder named meta in the application root directory. This is where the indexes are cached for quick queries.
- Run app.py, e.g. python3 app.py test-data
- For each query, the application attempts to return the top 10 relevant documents with their title, filename, and similarity score.
Since the actual data is huge, I created a medium-sized version containing 500 files, along with its cached indexes. Files in the meta folder can be removed to reconstruct the indexes. To build a new index over the much bigger clinical trials data, use the link below to download the original dataset and delete the index files in meta/, so that the index is recreated.
- The meta folder is already present in the folder. You can use the already constructed index of the original set, or recreate it by not using the meta folder.
- Original dataset (3095 files) - https://drive.google.com/drive/folders/1UaO7pIfw8eSMYnussKsNpG6yeT9UmCva?usp=sharing
- nltk
- punkt
- wordnet
- stopwords
- pandas
- Tokenization
- tags extracted from XML
- brief_title
- official_title
- brief_summary
- location_countries/country
- detailed_description
- phase
- intervention/description
- intervention/intervention_type
- condition
- keyword
- Remove stop words
- Perform normalization (e.g. covid becomes covid19)
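
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: it assumes the tags are lowercase paths relative to the XML root, uses a regular expression in place of NLTK's tokenizer, and uses a tiny hand-written stop word set in place of the NLTK stopwords corpus. The names TAG_PATHS, STOP_WORDS, NORMALIZE, SAMPLE, extract_text and preprocess are hypothetical, and the covid-19 entry in the normalization table is an added assumption.

```python
import re
import xml.etree.ElementTree as ET

# Tag paths from the list above, relative to the root element.
# Lowercase spellings are an assumption about the XML files.
TAG_PATHS = [
    "brief_title",
    "official_title",
    "brief_summary",
    "location_countries/country",
    "detailed_description",
    "phase",
    "intervention/description",
    "intervention/intervention_type",
    "condition",
    "keyword",
]

# Tiny illustrative stop word set (stand-in for the NLTK stopwords corpus).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

# Hypothetical normalization table: variants mapped to one canonical term.
NORMALIZE = {"covid": "covid19", "covid-19": "covid19"}

# Made-up sample document, used only to demonstrate the functions.
SAMPLE = """<clinical_study>
  <brief_title>Trial of Remdesivir for COVID-19</brief_title>
  <brief_summary><textblock>A study in adults with covid.</textblock></brief_summary>
  <condition>Covid19</condition>
</clinical_study>"""


def extract_text(xml_string):
    """Concatenate the text of every listed tag found in one document."""
    root = ET.fromstring(xml_string)
    parts = []
    for path in TAG_PATHS:
        for element in root.findall(path):
            # itertext() also collects text nested in child elements.
            parts.append(" ".join(element.itertext()))
    return " ".join(parts)


def preprocess(text):
    """Lowercase, tokenize, normalize, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    tokens = [NORMALIZE.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

Calling preprocess(extract_text(SAMPLE)) gives the token list that would then be passed on to indexing. Normalizing before stop word removal keeps the mapping table independent of the stop word list.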
- inverted index
- Document id to document name index
- Save index for future use
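
One possible shape for the indexing steps above is sketched here. It assumes each document has already been reduced to a list of tokens, and that the indexes are cached with pickle in a single file inside meta/. The file name index.pkl and the functions build_indexes and load_or_build are hypothetical; the actual file names and format used by app.py may differ.

```python
import pickle
from collections import defaultdict
from pathlib import Path


def build_indexes(docs):
    """Build both indexes from a mapping of file name -> token list."""
    inverted = defaultdict(dict)  # term -> {doc_id: term frequency}
    doc_names = {}                # doc_id -> file name
    for doc_id, (name, tokens) in enumerate(sorted(docs.items())):
        doc_names[doc_id] = name
        for tok in tokens:
            inverted[tok][doc_id] = inverted[tok].get(doc_id, 0) + 1
    return dict(inverted), doc_names


def load_or_build(docs, meta_dir="meta", filename="index.pkl"):
    """Reuse the cached indexes if present, otherwise build and save them."""
    path = Path(meta_dir) / filename
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    indexes = build_indexes(docs)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(indexes, fh)
    return indexes
```

With this layout, deleting the cached file forces a rebuild on the next run, which matches the behaviour described for the meta folder.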
- Ranked retrieval using TF-IDF weights for terms
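
As a rough illustration of ranked retrieval, the sketch below scores documents by summing a log-scaled TF-IDF weight for each query term and returns the top 10. It assumes an inverted index of the form term -> {doc_id: term frequency} and a doc_id -> file name map. The function name rank and this particular weighting are assumptions; the application's similarity score may be computed differently (for example with cosine normalization).

```python
import math
from collections import defaultdict


def rank(query_tokens, inverted, doc_names, top_k=10):
    """Return up to top_k (file name, score) pairs, best match first."""
    n_docs = len(doc_names)
    scores = defaultdict(float)
    for term in query_tokens:
        postings = inverted.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        # Rare terms get a higher inverse document frequency.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    # Sort by descending score, breaking ties by document id.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_names[doc_id], score) for doc_id, score in ranked[:top_k]]
```

Note that with this weighting a term occurring in every document has an IDF of zero and so contributes nothing to the score.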
- https://clinicaltrials.gov/ct2/about-studies/glossary
- https://clinicaltrials.gov/ct2/html/images/info/public.xsd
- https://clinicaltrials.gov/ct2/results/map?cond=COVID-19&map=
- http://www.trec-cds.org/2019.html#documents