
Document text search engine

Documentation

About

A simple search engine for document texts. The data is stored in a database, and the search index lives in Elasticsearch.
The technical task can be found in tech-task.pdf.

Database structure:

  • id – unique identifier for every doc;
  • rubrics – array of headings;
  • text – text of the doc;
  • created_date – doc creation date.
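As an illustration only, a table matching this structure could be created roughly like this (the table name posts and the DATABASE_URL variable are assumptions, not taken from the project):

psql "$DATABASE_URL" <<'SQL'
CREATE TABLE IF NOT EXISTS posts (
  id           INTEGER PRIMARY KEY,  -- unique identifier for every doc
  rubrics      TEXT[],               -- array of headings
  text         TEXT NOT NULL,        -- text of the doc
  created_date TIMESTAMP             -- doc creation date
);
SQL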

Index structure:

  • id – identifier from db;
  • text – text from db structure.
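For illustration, the index could be created with a mapping like the one below (the index name posts and the default Elasticsearch port 9200 are assumptions; the database id would normally be carried as the Elasticsearch document _id):

curl -X PUT "http://localhost:9200/posts" -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "text": { "type": "text" }
    }
  }
}'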

Methods:

  • the service must accept an arbitrary text query as input, search the index for matching document text, and return the first 20 documents with all database fields, sorted by creation date;
  • delete a document from the database and the index by its id field (example requests are sketched after this list).
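A rough sketch of what these calls might look like with curl is shown below; the endpoint paths, HTTP methods, and port here are assumptions, so consult docs.json for the real API:

# search: return the first 20 matching documents, sorted by creation date
curl -X POST "http://localhost/search" -H 'Content-Type: application/json' -d '{"text": "arbitrary query"}'

# delete a document from the database and the index by id
curl -X DELETE "http://localhost/documents/42"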

Technical requirements:

  • README with deploy guide;
  • docs.json – service docs in OpenAPI format.

If you want to go the extra mile:

  • functional testing;
  • service runs in Docker;
  • asynchronous API calls.

How to run

If you want to change the default config settings, look at docker-compose.yml, Docker, and config/.env. The default dataset is stored in config/posts.csv, and the default config in config/env.

Clone the repository.

git clone https://github.com/lusm554/document-text-search-engine.git

Set up the default config.

cp config/env config/.env

Run the service:

chmod +x run.sh
./run.sh 

The service takes about 2 minutes to start (it imports data from Postgres into Elasticsearch), so run the tests only after the API is ready. You can check readiness in the Docker logs or just curl localhost (with the default config).
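For example, with the default config a readiness check could look roughly like this (the exact port and path depend on your config, so adjust as needed):

# follow the logs until the import from Postgres to Elasticsearch finishes
docker-compose logs -f

# the API is up once curl gets an HTTP status code instead of a connection error
curl -s -o /dev/null -w "%{http_code}\n" http://localhost/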

Testing

As mentioned above, the service must be ready before testing.
If you changed any config values in docker-compose.yml, Docker, or config/.env, check testing/main.py and adjust it if needed. Make sure you have requests and pytest installed, or just run pip install requests pytest.

chmod +x testing.sh
./testing.sh
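If you prefer to run the tests directly instead of through testing.sh, something like the following should work, assuming the tests live in testing/main.py as noted above:

pip install requests pytest
pytest testing/main.py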

Improvement

My thoughts on what can be improved in this service:

  • Use connection pools to reduce request time (at the moment I don't understand how to create a global pool object or how to work with asynchrony in Python); there is probably a known solution.
  • Probably put nginx in front of the service for high concurrency.
  • Use a production server that talks to Flask through the WSGI protocol (a sketch follows this list).
  • Use task queues to manage long-running jobs, such as searching documents by arbitrary text.
  • Add logging to make errors easier to find.
  • Use Quart instead of Flask for native async support.
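As a minimal sketch of the WSGI point above, the Flask app could be served with gunicorn roughly like this (the app:app entry point is a placeholder; replace it with the project's actual Flask module and application object):

pip install gunicorn
# 4 worker processes, listening on port 8000
gunicorn --workers 4 --bind 0.0.0.0:8000 app:app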

License

MIT