A powerful, AI-driven semantic search engine that helps you find relevant content across your documents using natural language queries. The application leverages advanced AI models and machine learning to understand the meaning of your documents and queries, going far beyond simple keyword matching to deliver intelligent, context-aware search results.
ai-docs-search/
├── src/ # Python source code
│ ├── __init__.py
│ ├── app.py # Main application entry point
│ ├── config.py # Configuration settings
│ ├── document_parser.py # Document processing and text extraction
│ ├── embedding_model.py # Text vectorization using transformers
│ └── vector_db.py # Database management and vector search
├── docker/ # Docker-related files
│ ├── Dockerfile
│ ├── docker-compose.yml
│ └── .dockerignore
├── docs/ # Documentation
│ ├── project_zvit.txt # Ukrainian project report
│ ├── project_zvit.pdf # Ukrainian project report (PDF)
│ └── project_report.txt # English project report
├── tests/ # Test files
├── test_docs/ # Sample test documents
├── data/ # Database and data files
├── setup.py # Package configuration
├── requirements.txt # Python dependencies
├── README.md # This file
└── .gitignore # Git ignore rules
-
Multiple File Format Support: Process various file types:
- Text files (.txt)
- CSV files (.csv)
- XML files (.xml)
- PDF files (.pdf)
-
AI-Powered Semantic Search: Find documents based on meaning and context using advanced AI models, not just keywords
-
Machine Learning Vector Search: Uses AI-generated embeddings and Python-based similarity calculation for intelligent results
-
Natural Language Understanding: AI processes your queries to understand intent and context
-
Progress Tracking: Visual progress bars for document processing
-
Docker Support: Easy deployment using Docker
-
AI-Powered Document Processing:
- Scans your specified folder for supported files
- Extracts text content from each file
- Breaks content into meaningful sentences using AI-powered text processing
- Generates high-dimensional vector embeddings using the all-MiniLM-L6-v2 AI model
- Stores AI-generated embeddings in a SQLite database
-
Intelligent AI Search:
- Takes your natural language query
- Uses AI to convert it to the same high-dimensional embedding space
- Employs machine learning algorithms to find semantically similar content
- Returns AI-ranked relevant files and matching sentences based on semantic similarity
- Python 3.8+: Core programming language
- sentence-transformers (all-MiniLM-L6-v2):
- Pre-trained AI model for generating semantic embeddings
- 384-dimensional vector space representation
- Optimized for semantic similarity tasks
- Efficient balance between performance and resource usage
- Based on BERT architecture but 40% smaller
- NLTK: For text processing and sentence segmentation
- SQLite: For storing document embeddings
- NumPy: For efficient vector similarity calculations
- PyPDF2: For PDF file processing
- Docker: For containerization and easy deployment
The application uses the all-MiniLM-L6-v2 model, which is a lightweight but powerful transformer model:
- Architecture: Based on BERT but optimized for sentence embeddings
- Vector Size: Produces 384-dimensional embeddings
- Use Case: Specifically trained for semantic similarity and search tasks
- Performance:
- Good balance between accuracy and speed
- Can process thousands of sentences efficiently
- Requires less memory than larger models like BERT-base
- Features:
- Language Understanding: Captures semantic meaning beyond simple keywords
- Cross-lingual Capabilities: Works with multiple languages
- Context Awareness: Understands words based on their context
- Efficient Inference: Optimized for production use
The model converts text into high-dimensional vectors where:
- Similar meanings are close together in vector space
- Different meanings are far apart
- Relationships between concepts are preserved
This AI-powered approach allows the application to:
- Find relevant content even when exact words don't match using machine learning
- Understand context and meaning through AI, not just keywords
- Handle variations in language and phrasing with intelligent natural language processing
- Provide human-like understanding of document content and search intent
-
Clone the repository:
git clone <repository-url> cd ai-docs-search
-
Build and run using Docker Compose:
# Build the Docker image docker compose -f docker/docker-compose.yml build # Run the application in interactive mode docker compose -f docker/docker-compose.yml run -i ai-docs-search
Note: We use
run -i
instead ofup
to ensure proper handling of interactive input for the search queries.When prompted for a folder path, you can use
test_docs
to try the included sample files or use path to your local dir with files.
-
Clone the repository:
git clone <repository-url> cd ai-docs-search
-
Create a virtual environment:
python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
-
Install the package:
pip install -e .
-
Run the application:
# Set Python path to include src directory export PYTHONPATH=src:$PYTHONPATH python src/app.py
When prompted for a folder path, you can use
test_docs
to try the included sample files or use path to your local dir with files.
-
Using Docker (Recommended):
docker compose -f docker/docker-compose.yml run -i ai-docs-search
-
Manual Installation:
export PYTHONPATH=src:$PYTHONPATH python src/app.py
Here's a complete example of using the application:
--- AI Document Search Application ---
Loading AI embedding model 'sentence-transformers/all-MiniLM-L6-v2'...
AI Model loaded. Embedding dimension: 384
Using AI-optimized Python-based similarity search for vector operations.
Enter the folder path to scan for files.
Folder path: test_docs
Scanning folder: /app/test_docs
--- Initializing Database ---
Processing CSV file: /app/test_docs/space_objects.csv
Processing TXT file: /app/test_docs/cats.txt
Processing PDF file: /app/test_docs/test_document.pdf
Processing TXT file: /app/test_docs/dogs.txt
Processing TXT file: /app/test_docs/space.txt
Processing XML file: /app/test_docs/sample_customer_orders.xml
Processing 112 sentences...
Vectorizing sentences: 100%|████████████| 112/112 [00:02<00:00, 37.70sent/s]
Database initialization complete.
Example 1: Finding Animal-Related Content
Enter your search query: give me files with animals
Searching for content related to: 'give me files with animals'
--- Top Relevant Files ---
- /app/test_docs/cats.txt
- /app/test_docs/dogs.txt
--- Top Matching Sentences ---
[File: dogs.txt] Sentence: 'Dogs are loyal pets....' (Distance: 0.6765)
[File: cats.txt] Sentence: 'Cats are independent animals....' (Distance: 0.6986)
Example 2: Finding Space-Related Content
Enter your search query: give me files with space objects
Searching for content related to: 'give me files with space objects'
--- Top Relevant Files ---
- /app/test_docs/space.txt
--- Top Matching Sentences ---
[File: space.txt] Sentence: 'He collected space memorabilia....' (Distance: 0.5963)
Example 3: No Results Found
Enter your search query: give me files with products
Searching for content related to: 'give me files with products'
No relevant files found.
- AI Distance Scores: Lower values indicate better AI-calculated matches (0.0 = perfect match, 1.0 = no similarity)
- AI Semantic Understanding: The AI system finds relevant content even when exact keywords don't match
- File Types: AI processes TXT, CSV, XML, and PDF files intelligently
- AI-Ranked Results: Shows all relevant files and the best AI-matching sentences from each
- Use descriptive phrases: "files about animals" works better than just "animals" for AI understanding
- Try different phrasings if you don't get expected results - the AI learns from natural language
- The AI system understands context, synonyms, and related concepts
- AI distance scores below 0.7 typically indicate good semantic matches
- Write queries as you would ask a human - the AI understands natural language
The repository includes sample test files in the test_docs/
directory:
- cats.txt - Information about cats and their behavior
- dogs.txt - Content about dogs as pets
- space.txt - Space-related content and astronomy
- space_objects.csv - Structured data about celestial objects
- test_document.pdf - Sample PDF document
- sample_customer_orders.xml - XML data with customer information
These files are perfect for testing the AI semantic search capabilities with natural language queries like:
- "give me files with animals" → AI finds cats.txt and dogs.txt
- "give me files with space objects" → AI finds space.txt and space_objects.csv
- "show me customer data" → AI finds sample_customer_orders.xml
- The application uses AI-powered Python-based similarity calculation with NumPy for accurate vector search
- AI processing speed depends on:
- Number and size of documents
- Available system resources for AI computations
- System memory for AI vector calculations and model operations
The application consists of main components located in the src/
directory:
src/app.py
: Main application entry point and orchestrationsrc/document_parser.py
: Document processing and text extractionsrc/embedding_model.py
: Text vectorization using transformerssrc/vector_db.py
: Database management and vector searchsrc/config.py
: Configuration settings and constants
Docker configuration is located in the docker/
directory:
docker/Dockerfile
: Container build instructionsdocker/docker-compose.yml
: Service orchestrationdocker/.dockerignore
: Files to exclude from Docker build context
Contributions are welcome! Please feel free to submit a Pull Request.
[Add your license here]