`script_scraper` - Parser for Film and TV Scripts

A lightweight Python parser, built in my spare time, as a potential upgrade to the parser currently used by the Geena Davis Institute's Spell Check for Bias tool.

Usage

>>> from script_scraper import open_pdf, script_scraper, word_and_sentence_count
>>> # open the PDF file
>>> pdf = open_pdf(path='~/Desktop/Inception.pdf')
>>> # run the analysis
>>> words_spoken = script_scraper(pdf=pdf,
...                               remove_first_line=False)
>>> # check words spoken for each character
>>> words_spoken['ATTENDANT']
['He was delirious. But he asked for', 'you by name. And...', 'Show him.']
>>> # get word and sentence count for each character
>>> word_count, sentence_count = word_and_sentence_count(words_spoken['ATTENDANT'])
>>> word_count, sentence_count
(13, 3)
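
To compare characters rather than look one up at a time, the output can be looped over. The snippet below is only a sketch: it assumes `words_spoken` behaves like a dictionary mapping each character name to a list of strings (as the lookup above suggests), and it counts words with a plain whitespace split instead of calling `word_and_sentence_count`, so totals may differ from the library's. The helper `rank_by_words_spoken` and the `EXAMPLE_CHARACTER` entry are made up for illustration.

```python
from typing import Dict, List, Tuple


def rank_by_words_spoken(words_spoken: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return (character, word_count) pairs, most words first.

    Hypothetical helper, not part of script_scraper: words are counted
    by splitting each line on whitespace.
    """
    counts = {
        character: sum(len(line.split()) for line in lines)
        for character, lines in words_spoken.items()
    }
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


# stand-in for the real output; EXAMPLE_CHARACTER and its line are invented
sample = {
    'EXAMPLE_CHARACTER': ['A made-up line.'],
    'ATTENDANT': ['He was delirious. But he asked for', 'you by name. And...', 'Show him.'],
}

ranking = rank_by_words_spoken(sample)
for character, word_count in ranking:
    print(f'{character}: {word_count}')
```

With real output, pass `words_spoken` in place of `sample`.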

Development

Begin by installing Docker, if you have not already. Once Docker is running, run development from within the Docker container:

# build the Docker image
docker build -t script_scraper .

# run the Docker container in interactive mode
docker run \
    -it \
    --rm \
    -v "${PWD}:/script_scraper" \
    -p 8888:8888 \
    script_scraper /bin/bash

# now in the container, launch JupyterLab...
jupyter lab --ip 0.0.0.0 --no-browser --allow-root --NotebookApp.token='' --NotebookApp.password=''

# ... or run the unit tests, if you'd like
pytest -v --cov-report term --cov=script_scraper

FAQ

My PDF has a watermark across every page. What can I do?

By default, these PDFs will not work in script_scraper. Here is how I have been able to run these documents through the library:

1. Use a tool such as this to convert the PDF to a Word document.
2. Open the document in Word, then save the document in XML format.
3. Open the XML file in a text editor, then find-and-replace the watermark text with an empty string.
4. Open the XML file back up in Word.
5. Save the Word document as a PDF for online use.

Now, you can run script_scraper on the edited PDF, which should no longer have the watermark.
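
The find-and-replace step in the text editor could probably be scripted. What follows is a rough sketch rather than anything shipped with the library: `strip_watermark_text` is a hypothetical helper, the file is assumed to be UTF-8, and the watermark is assumed to appear as one contiguous string in the XML. If Word splits the watermark across several elements, a plain replace like this will miss it.

```python
import tempfile
from pathlib import Path


def strip_watermark_text(xml_path: Path, watermark_text: str, out_path: Path) -> int:
    """Write a copy of the XML with every occurrence of the watermark text removed.

    Hypothetical helper, not part of script_scraper. Returns the number
    of occurrences that were removed.
    """
    xml = xml_path.read_text(encoding='utf-8')  # assumes a UTF-8 file
    occurrences = xml.count(watermark_text)
    out_path.write_text(xml.replace(watermark_text, ''), encoding='utf-8')
    return occurrences


# tiny made-up XML standing in for a script saved from Word
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'script.xml'
    cleaned = Path(tmp) / 'script_clean.xml'
    source.write_text('<p>SAMPLE COPY</p><p>Show him.</p><p>SAMPLE COPY</p>', encoding='utf-8')

    removed = strip_watermark_text(source, 'SAMPLE COPY', cleaned)
    cleaned_xml = cleaned.read_text(encoding='utf-8')

print(removed)
print(cleaned_xml)
```

The converting to Word and saving back to PDF steps would still need to be done by hand.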

If you have a better way to deal with this issue (that is hopefully more automated), feel free to make a PR!

nathancooperjones/script-scraper

Folders and files

Latest commit

History

Repository files navigation

script_scraper - Parser for Film and TV Scripts

Usage

Development

FAQ

Known Bugs / Issues Progress

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

`script_scraper` - Parser for Film and TV Scripts

Packages