Narrate Webcam as Sir David Attenborough project

This project is based on this one here. The original is in Python, so my main task was to get it working in JavaScript using Node.js.

I built this project because I wanted to play around with AI models a bit more. It uses three different models:

Llava-13b, an open-source vision model. It takes an image together with a prompt. In this case my prompt was "Describe the image", but you can also ask it more specific questions about the image.
Mistral-7b-instruct-v0.1, an open-source generative text model fine-tuned to follow instructions. It takes the image description from the previous step and changes the tone to sound like David Attenborough narrating a nature documentary.
ElevenLabs text-to-speech, which reads the text aloud in the voice of Sir David Attenborough. I got lucky and it already had a voice like his, but on a higher subscription tier you can also upload your own audio files and clone a voice for your purposes.

Examples

Here's a video screenshare demo of how it works. Click the "Capture Image" button and the captured image will appear on the screen. You'll get status updates showing which step it's on, then the resulting description will appear. The narration will play through your speakers.

Tech Stack

I used Node.js and Express to throw together a super simple backend. This serves the HTML page and has three different routes for capturing the image, getting the description and reading the text aloud. I broke it into three routes mostly so I could quickly provide status updates after each step, but you could easily just do it in one.
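The three-step flow with status updates could be driven from the page roughly as follows. This is a minimal sketch, not the code in app.mjs: the route paths (/capture, /describe, /narrate), the JSON passed between steps, and the function name runNarration are all assumptions made for illustration.

```javascript
// Hypothetical route paths; the real ones in app.mjs may differ.
const STEPS = [
  { path: '/capture', status: 'Capturing image...' },
  { path: '/describe', status: 'Generating description...' },
  { path: '/narrate', status: 'Reading narration aloud...' },
];

// Calls each route in order and reports progress through onStatus.
// fetchFn defaults to the global fetch (browsers, Node 18+) and can be
// replaced with a stub for testing.
async function runNarration(onStatus, fetchFn = fetch) {
  let result = {};
  for (const step of STEPS) {
    onStatus(step.status);
    const response = await fetchFn(step.path, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(result),
    });
    if (!response.ok) {
      throw new Error(`${step.path} failed with status ${response.status}`);
    }
    // Assumption: each route returns JSON that the next step can use.
    result = await response.json();
  }
  onStatus('Done');
  return result;
}
```

Collapsing this into a single route would remove the loop, at the cost of only being able to show one status message for the whole run.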

To take the webcam image I used node-webcam. As I needed to throw this together quickly for a hackathon, I opted to just take a webcam pic on button click rather than snap a pic every 5 seconds like the original. This is mostly because I am cheap. To do it like the original, you could just drop the API call into a setInterval.
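A timer-based version might look like the sketch below. It assumes a hypothetical async function (called task here) that runs the whole capture-and-narrate pipeline. The guard against overlapping runs is an extra precaution rather than something taken from the original, since one narration can easily take longer than five seconds.

```javascript
// Runs task every intervalMs, skipping a tick if the previous run is
// still in progress. Returns a function that stops the loop.
function startAutoCapture(task, intervalMs = 5000) {
  let busy = false;
  const timer = setInterval(async () => {
    if (busy) return; // previous narration has not finished yet
    busy = true;
    try {
      await task();
    } catch (err) {
      console.error('Capture failed:', err);
    } finally {
      busy = false;
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Calling startAutoCapture with your pipeline function starts the loop, and calling the returned function stops it. Keep in mind that every tick costs a round of model and text-to-speech calls.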

After taking the picture, it's important to resize the image to be smaller before passing it to the model that generates the description. This saves money. I used sharp for this, which is a Node-API module that converts large images in common formats to smaller, web-friendly JPEG, PNG, WebP, GIF and AVIF images of varying dimensions.
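A resize step with sharp could be sketched like this. The function name, the 512-pixel maximum width, and the JPEG quality of 80 are assumptions for illustration, not the values used in this project. sharp has to be installed separately (for example with yarn add sharp).

```javascript
// Shrinks an image so its width is at most maxWidth and writes a JPEG.
// sharp is loaded lazily, so this file can be loaded even before the
// package is installed; the call itself will reject without it.
async function shrinkForModel(inputPath, outputPath, maxWidth = 512) {
  const { default: sharp } = await import('sharp');
  await sharp(inputPath)
    // Keep the aspect ratio and never upscale a smaller image.
    .resize({ width: maxWidth, withoutEnlargement: true })
    .jpeg({ quality: 80 })
    .toFile(outputPath);
  return outputPath;
}
```

Smaller dimensions mean a smaller upload and, depending on how the vision model is billed, a cheaper request, though very small images may give vaguer descriptions.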

To play the sound, I used the play-sound package. This plays sound files from Node.js via your speakers. I liked this since it checks for a bunch of different audio players. In my case this allowed me to just drop it in and have it work.

Notes on potential improvements

The biggest struggle was getting the model to change the tone of the description to David Attenborough's. The prompt I had the most success with was: "Here is a description of an image after the word 'DESCRIPTION'. Change the tone of the existing description to sound like Sir David Attenborough narrating a nature documentary. Make it snarky and funny. Return only the narration as a string.\n\nDESCRIPTION:\n${description}".
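As a template literal, the prompt above could be wrapped in a small helper like this. The function name buildNarrationPrompt is hypothetical; only the prompt text comes from this README.

```javascript
// Builds the tone-change prompt around an image description.
function buildNarrationPrompt(description) {
  return (
    "Here is a description of an image after the word 'DESCRIPTION'. " +
    'Change the tone of the existing description to sound like ' +
    'Sir David Attenborough narrating a nature documentary. ' +
    'Make it snarky and funny. Return only the narration as a string.' +
    `\n\nDESCRIPTION:\n${description}`
  );
}
```

Keeping the prompt in one function makes it easier to experiment with wording without touching the route code.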

Still, this was very finicky. Sometimes the model returned the narration with different headers, so I did some parsing for that. Sometimes it returned the original description AND the new one, so I did some parsing for that too. It also almost always cut off mid-sentence. This is controlled by the max_new_tokens parameter, which I set to 150 for the demo (the default is 128). I didn't want it to get too expensive, since I was paying for the text-to-speech model after using up the free tier pretty quickly. I'd like to find a way to make the output more consistent and to always end in a full sentence.
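One possible way to handle all three problems is a small cleanup pass over the model output before it goes to text-to-speech. This is a sketch of the idea, not the parsing used in this project: it assumes headers look like an all-caps word followed by a colon (for example "NARRATION:"), that an echoed description appears verbatim, and all function names are hypothetical.

```javascript
// Removes one or more leading headers such as "NARRATION:".
function stripHeaders(text) {
  return text.replace(/^(?:\s*[A-Z][A-Z ]*:)+\s*/, '');
}

// Drops the original description if the model echoed it back verbatim.
function stripEcho(text, description) {
  return text.split(description).join('').trim();
}

// Cuts the text after the last ., ! or ? (plus any closing quote or
// bracket) so it ends on a full sentence. Returns the input unchanged
// when no sentence terminator is found.
function trimToLastSentence(text) {
  const match = text.match(/^[\s\S]*[.!?]["')\]]*/);
  return match ? match[0] : text;
}

// Full cleanup: remove the echo, then headers, then the cut-off tail.
function cleanNarration(raw, description) {
  return trimToLastSentence(stripHeaders(stripEcho(raw, description)));
}
```

Trimming to the last terminator is a heuristic: it can cut in the wrong place after an abbreviation such as "Mr." or a decimal number, and it shortens the narration rather than completing it. Raising max_new_tokens slightly and then trimming may give more complete results, at a higher text-to-speech cost.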

That's it! I hope this provides some entertainment. It was a good chance to work with some different AI models.
