Use Gen AI and a vector database to search for patent similarities
Update 12-12-24: changed the similarity calculation from scikit-learn to a simple calculation compiled with numba, giving a ~100x speedup for similarity calculations, from about 8 minutes to about 4 seconds per million comparisons on the i5. Also added a tqdm progress bar.
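The update above does not include the code itself, so the following is only a minimal sketch of what a "simple calculation" compiled with numba might look like. The function name `cosine_to_all` is hypothetical, the inputs are assumed to be non-zero float32 vectors, and the snippet falls back to plain (slow) Python if numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs without numba (much slower).
    prange = range

    def njit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap


@njit(parallel=True, fastmath=True)
def cosine_to_all(query, matrix):
    """Cosine similarity between one query vector and every row of matrix."""
    n, d = matrix.shape
    out = np.empty(n, dtype=np.float32)

    # Norm of the query is computed once, outside the main loop.
    qnorm = 0.0
    for k in range(d):
        qnorm += query[k] * query[k]
    qnorm = np.sqrt(qnorm)

    # One pass per stored embedding: dot product and row norm together.
    for i in prange(n):
        dot = 0.0
        norm = 0.0
        for k in range(d):
            v = matrix[i, k]
            dot += query[k] * v
            norm += v * v
        out[i] = dot / (qnorm * np.sqrt(norm))
    return out
```

A call such as `cosine_to_all(query_vec, embeddings)` would return one score per stored embedding. The first call includes numba's compile time, so timings are best taken from a second call.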
Update 12-11-24: found a fix for getting GPU passthrough to work with Linux in ProxMox.
1. Make sure the CPU type for the VM is set to host.
2. Edit the Ollama service settings for the GPU: sudo systemctl edit ollama.service
3. Set CUDA_VISIBLE_DEVICES=0 for Ollama, save, and then restart Ollama with systemctl restart ollama
This definitely worked for me with Ollama version 0.4.7.
This is an idea to use natural language processing to compare patents and patent ideas. Using the data from PatentsView (https://patentsview.org/download/data-download-tables), downloaded the title and abstract of about 9 million patents (g_patent.tsv) and passed them through an embedding model to create the vectors. Initially tried a couple of different models from Hugging Face with varying success, finally settling on Ollama 3.1. This provides three sizes of embeddings: small (all-minilm, 384-dimensional vector), medium (nomic-embed-text, 768-dimensional), and large (mxbai-embed-large, 1024-dimensional). On the embedding creation side (create_ollama_embeddings.ipynb), looped through each patent title and abstract (separately), created the embedding, and wrote it to a vector database. Worked with the Milvus vector database and the vector-extended version of PostgreSQL. These were a bit too heavy, so switched to Chroma DB and also numpy arrays. The numpy arrays made it simpler to run the similarity testing (read_ollama_embeddings.ipynb) on multiple computers to understand the compute resources needed and the execution time.
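The embedding loop described above can be sketched roughly as follows. This is not the code from create_ollama_embeddings.ipynb. It assumes a local Ollama server on its default port exposing the `/api/embeddings` endpoint, and the helper names `ollama_embed` and `embed_texts` are hypothetical. The `embed` parameter lets the loop be tried without a running server.

```python
import json
import urllib.request

import numpy as np

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def ollama_embed(text, model="all-minilm"):
    """Request one embedding vector for text from the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]


def embed_texts(texts, embed=ollama_embed):
    """Embed each text in turn and stack the vectors into a float32 array."""
    vectors = [embed(t) for t in texts]
    return np.asarray(vectors, dtype=np.float32)
```

With a list of titles in hand, something like `np.save("title_embeddings.npy", embed_texts(titles))` would write the vectors to a numpy file (the file name here is only an example). Titles and abstracts would be embedded in separate passes, as described above.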
On the embedding creation side, used a local Nvidia A2000 (6GB) GPU with dual E5-2640v3 CPUs and 128GB of RAM, running Windows 10. An attempt was made with a virtualized (ProxMox) Ubuntu machine, which was not very successful. Apparently it is a known issue that virtualized Ollama does not use the GPU, hence the Windows machine (the 12-11-24 update above describes a fix). An i5-6500T machine with 64GB of RAM was also tested, with decent results. For embedding a small number of items, say fewer than 1000, an i5 is adequate, taking only a few minutes. For larger numbers of patents, a GPU is necessary.
Embedding times on the GPU system were a few seconds for 1000 items and from a few hours to 20 hours for one million embeddings (small to large models). For all 9 million, the time was approximately 6 days of continuous run time. File size is around 6 GB per million large embeddings, with the 9 million being about 60GB.
Interestingly, on the similarity testing side, the i5 machine took around 8 minutes per 1 million similarities with the original scikit-learn calculation (see the 12-12-24 update above for the faster version), with only a 10% time difference between 384-dimensional and 1024-dimensional calculations. Did not find a way to use the GPU to speed up similarity calculations, which is a subject for further exploration.
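For the search step itself, one CPU-only way to rank stored patents against a query embedding is a normalized matrix-vector product followed by a partial sort. The sketch below uses plain numpy and is not taken from read_ollama_embeddings.ipynb. The function name `top_k_similar` is hypothetical, and all vectors are assumed to be non-zero.

```python
import numpy as np


def top_k_similar(query, matrix, k=5):
    """Return (indices, scores) of the k rows most cosine-similar to query."""
    # Normalize rows and query so a dot product equals cosine similarity.
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = m @ q

    k = min(k, scores.shape[0])
    # argpartition picks the k largest scores without sorting everything.
    idx = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates, best match first.
    idx = idx[np.argsort(-scores[idx])]
    return idx, scores[idx]
```

The returned indices can be mapped back to patent IDs, provided the embeddings were saved in the same order as the source rows.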