Converting subtitle files into txt files ready to use for corpus linguistics.

subtitles

The subtitles repository contains two things: a folder of cleaned Spanish subtitle files in .txt format from LOTR, Star Wars, Narcos, OITNB, GoT and HIMYM (these .txt files can be used as corpus material and are ready to be loaded into AntConc), and a Python script that converts a subtitle file (.srt, etc.) into a .txt file, keeping the actual subtitle text and removing:

  1. time stamps, e.g. `00:00:06,217 --> 00:00:07,633`

  2. subtitle tags, e.g. `{\an8}`

  3. HTML tags

  4. scene numbers (the numeric counter that precedes each subtitle), e.g. `3421`

  5. descriptive noise subtitles, e.g. `[breathing intensifies]`
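
The five removal steps above can be sketched roughly as follows. This is a minimal illustration that assumes SRT-style input; it is not the code in `subtitles.py`, and the function name `clean_subtitles` and the regular expressions are hypothetical choices made for this example.

```python
import re

# Hypothetical patterns for this sketch (not taken from subtitles.py).
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}.*$"
)
SCENE_NUMBER = re.compile(r"^\d+$")   # a line made only of digits
SUBTITLE_TAG = re.compile(r"\{[^}]*\}")  # e.g. {\an8}
HTML_TAG = re.compile(r"<[^>]+>")        # e.g. <i>, </i>
NOISE = re.compile(r"\[[^\]]*\]")        # e.g. [breathing intensifies]


def clean_subtitles(srt_text):
    """Return only the spoken subtitle text, one subtitle line per output line."""
    kept = []
    for raw in srt_text.splitlines():
        # Drop surrounding whitespace and a possible byte-order mark.
        line = raw.strip().lstrip("\ufeff")
        # Skip blank lines, scene numbers and time stamps entirely.
        if not line or SCENE_NUMBER.match(line) or TIMESTAMP.match(line):
            continue
        # Remove inline tags and bracketed noise descriptions.
        line = SUBTITLE_TAG.sub("", line)
        line = HTML_TAG.sub("", line)
        line = NOISE.sub("", line)
        # Collapse any whitespace left behind by the removals.
        line = " ".join(line.split())
        if line:
            kept.append(line)
    return "\n".join(kept)
```

Calling `clean_subtitles` on the text of a subtitle file returns a string that can be written straight to a .txt file. One limitation of this sketch: a subtitle line consisting only of digits (a year, for example) is indistinguishable from a scene number and is dropped.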