Embedded-Subtitles-OCR

Some movies on YouTube have subtitles embedded at the bottom of the frame. We want to extract each subtitle from a movie together with its corresponding timestamp. We can approach the problem as follows:

1. Capture still images from the movie at regular intervals.
2. On each captured image, focus on the bottom part, where the subtitle appears.
3. Develop a machine learning method to recognize the text.

For the steps above:

Steps 1 & 2: Using Python and FFmpeg, capture an image from the video repeatedly at a fixed interval. Then crop the image using FFmpeg so that only the region where the subtitle is embedded remains.

Step 3: Use pytesseract (https://pypi.python.org/pypi/pytesseract) / tesseract-ocr (https://github.com/tesseract-ocr/) to extract the text from the image. Example: https://github.com/prabhakar267/ocr-convert-image-to-text
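The capture, crop, and OCR steps can be sketched roughly as below. This is a minimal illustration, not the code in this repository: the function names, the 0.5 s interval, and the assumption that the subtitle sits in the bottom 25% of the frame are all hypothetical and would need tuning per video. It relies on FFmpeg's `fps` and `crop` video filters and on `pytesseract.image_to_string`; the `ffmpeg` binary, Tesseract, Pillow, and pytesseract are assumed to be installed.

```python
import subprocess


def build_capture_command(video_path, out_pattern, interval_s=0.5, bottom_fraction=0.25):
    """Build an FFmpeg command that saves one cropped frame every `interval_s` seconds.

    `bottom_fraction` is an assumed share of the frame height that holds the subtitle.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if not 0 < bottom_fraction <= 1:
        raise ValueError("bottom_fraction must be in (0, 1]")
    rate = 1 / interval_s            # frames per second to keep
    top = 1 - bottom_fraction        # where the cropped strip starts, as a share of height
    # crop=width:height:x:y, with iw/ih being the input width and height
    video_filter = f"fps={rate:.6g},crop=iw:ih*{bottom_fraction:.4g}:0:ih*{top:.4g}"
    return ["ffmpeg", "-i", video_path, "-vf", video_filter, out_pattern]


def extract_frames(video_path, out_pattern, interval_s=0.5, bottom_fraction=0.25):
    """Run FFmpeg; raises CalledProcessError if FFmpeg exits with an error."""
    command = build_capture_command(video_path, out_pattern, interval_s, bottom_fraction)
    subprocess.run(command, check=True)


def ocr_image(image_path, lang="eng"):
    """Return the text Tesseract finds in one cropped frame."""
    # Imported here so the rest of the sketch works without the OCR dependencies.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path), lang=lang).strip()
```

A call such as `extract_frames("movie.mp4", "frames/%05d.png")` would write numbered PNG files (the `frames` directory must already exist), and `ocr_image("frames/00001.png")` would then read the first one. The `lang` argument takes the name of an installed Tesseract language pack, so other languages can be tried once the matching data is installed.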

The following points need consideration:

1. The frequency of image capture matters. It should be high enough that no subtitle is missed, but if it is too high, most frames will repeat the same subtitle and the redundant processing overhead will increase.
2. The timestamps for the beginning and end of each subtitle line should also be recorded. This is related to Point 1 above, since the capture interval limits how precise those timestamps can be.
3. Videos differ in quality; some are of VCD quality while others have high resolution. We may need to handle low-quality videos too.
4. We aim to extract Cantonese subtitles. Start with English and see how the technique can be applied to other languages.
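To illustrate Points 1 and 2, here is one possible way to turn per-frame OCR results into subtitle lines with start and end times. It is a sketch under assumptions: frame `n` is taken to cover the time span from `n * interval_s` to `(n + 1) * interval_s`, so timestamps are only as precise as the capture interval, and the names `merge_frames`, `same_subtitle`, and the 0.9 similarity threshold are hypothetical. Consecutive frames whose text is nearly identical are merged, which absorbs small OCR differences between frames showing the same subtitle.

```python
from difflib import SequenceMatcher


def same_subtitle(a, b, threshold=0.9):
    """Treat two OCR results as the same subtitle if they are nearly identical."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def merge_frames(frame_texts, interval_s, threshold=0.9):
    """Merge consecutive frames showing the same text into (start, end, text) tuples."""
    cues = []
    current = None  # [start, end, text] of the subtitle currently on screen
    for index, raw in enumerate(frame_texts):
        text = " ".join(raw.split())      # normalise whitespace and line breaks
        start = index * interval_s
        end = start + interval_s
        if current is not None and text and same_subtitle(current[2], text, threshold):
            current[1] = end              # same subtitle is still showing: extend it
            continue
        if current is not None:
            cues.append(tuple(current))   # subtitle changed or disappeared
            current = None
        if text:
            current = [start, end, text]  # a new subtitle starts in this frame
    if current is not None:
        cues.append(tuple(current))
    return cues


def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"
```

For example, with a 0.5 s interval, three consecutive frames reading "Hello there" starting at frame 1 become a single entry running from 0.5 s to 2.0 s. A shorter interval tightens the timestamps at the cost of more frames to OCR, which is the trade-off described in Point 1.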

