Study Transcriber

This is a work in progress. The goal is to automatically transcribe lecture videos using Whisper, improve the transcripts with a local LLM, and embed the results into a vector database. The database can then be searched to quickly find lecture content for review.

Install packages:

# PyTorch, e.g. the CUDA 11.8 build:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install langchain pypdf fpdf sentence_transformers chromadb cryptography==3.1

Run semantic search

python ./semantic_search.py [OPTIONS] ./path/to/input_folder/

Options:

--types [type1] [type2] [...]: File types to index. Supported types: pdf, txt. Default: pdf txt.
--lang [language]: Language for stopword removal in the query. This is only relevant for the highlighting. Use the full language name, like english, german, etc. Supported languages: arabic, azerbaijani, basque, bengali, catalan, chinese, danish, dutch, english, finnish, french, german, greek, hebrew, hinglish, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovene, spanish, swedish, tajik, turkish.
--print: Also print the results to stdout.
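
The options above can be sketched as an argparse parser. This is only an illustration of the documented interface, not the actual code in semantic_search.py: the function name build_parser, the attribute name print_results, and the choice of None as the --lang default are assumptions (the README does not state a default language).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options documented in this README (sketch)."""
    parser = argparse.ArgumentParser(
        prog="semantic_search.py",
        description="Index a folder of documents and run semantic search.",
    )
    # Positional argument: the folder to index.
    parser.add_argument("input_folder", help="Folder containing the files to index")
    # One or more file types; only pdf and txt are documented as supported.
    parser.add_argument(
        "--types",
        nargs="+",
        choices=["pdf", "txt"],
        default=["pdf", "txt"],
        help="File types to index",
    )
    # Full language name, e.g. english or german. No default is documented,
    # so None is used here as a placeholder (assumption).
    parser.add_argument(
        "--lang",
        default=None,
        help="Language for stopword removal in the query",
    )
    # Stored as print_results to avoid shadowing the print builtin (assumption).
    parser.add_argument(
        "--print",
        dest="print_results",
        action="store_true",
        help="Also print the results to stdout",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

In this sketch, --types accepts a variable number of values, so the input folder should not come directly after them. Put another option in between or place the folder first, for example: python ./semantic_search.py ./path/to/input_folder/ --types pdf --lang german --print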

Note that highlighting is still experimental and is always done in green (as yellow is too mainstream).
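
To illustrate why --lang only matters for highlighting, here is a minimal sketch of stopword removal followed by keyword matching. It is not the logic in page_highlighter.py: the function names, the tiny EXAMPLE_STOPWORDS set, and the regex-based matching are all assumptions made for this example. A real run would use the full stopword list for the chosen language.

```python
import re

# Tiny illustrative stopword set (assumption); real lists are much longer.
EXAMPLE_STOPWORDS = {
    "english": {"a", "an", "and", "in", "is", "of", "the", "to", "what"},
}


def query_keywords(query: str, lang: str = "english") -> list[str]:
    """Lowercase the query, split it into words, and drop stopwords."""
    stopwords = EXAMPLE_STOPWORDS.get(lang, set())
    words = re.findall(r"\w+", query.lower())
    return [word for word in words if word not in stopwords]


def find_highlight_spans(text: str, keywords: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of whole-word keyword matches."""
    if not keywords:
        return []
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return [match.span() for match in pattern.finditer(text)]


if __name__ == "__main__":
    keywords = query_keywords("What is the definition of entropy?")
    print(keywords)
    print(find_highlight_spans("Entropy measures disorder.", keywords))
```

Without stopword removal, common words such as "the" or "is" would be highlighted on nearly every page, which is why the language setting affects only the highlighting and not the search itself.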
