TEDExtract

A small test crawler for TED talks. For each talk on ted.com, all available transcripts are downloaded and saved to a separate file.

Requirements

As listed in requirements.txt:

pandas
bs4

Usage

python3 ./main.py --output=/output/dir --max_pages=76 --delay=5

This saves all transcripts of all talks to the output directory, one file per talk, named $output/<TALK_NAME>.csv
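As a rough illustration of how the three flags in the usage line might be parsed, here is a minimal sketch using argparse. Only the flag names and the delay default of 10 come from this README; the function name build_parser, the help texts, the argument types and the default for max_pages are assumptions and may differ from what main.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the flag names and the delay default are from the README."""
    parser = argparse.ArgumentParser(description="Download TED talk transcripts.")
    parser.add_argument("--output", required=True,
                        help="directory that receives one CSV file per talk")
    parser.add_argument("--max_pages", type=int, default=None,
                        help="number of listing pages to crawl (default is an assumption)")
    parser.add_argument("--delay", type=int, default=10,
                        help="pause between crawling attempts (unit assumed to be seconds)")
    return parser


# Same arguments as in the usage line, passed as a list instead of via the shell.
args = build_parser().parse_args(["--output=/output/dir", "--max_pages=76", "--delay=5"])
print(args.output, args.max_pages, args.delay)  # prints: /output/dir 76 5
```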

The delay parameter sets the pause between crawling attempts so that the server does not respond with a 429 (Too Many Requests) error. The default value is 10. If such an error occurs, you don't have to start from the beginning: the script saves a backup pickle file and checks whether a talk's transcript has already been downloaded, so just restart the script.
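The delay-and-resume behaviour described above could look roughly like the following sketch. It is not the code from TEDExtract.py: the names load_done and crawl, the fetch callable, the layout of the pickle file (a plain set of talk names) and the assumption that the delay is given in seconds are all hypothetical.

```python
import pickle
import time
from pathlib import Path
from typing import Callable, Iterable, Set


def load_done(backup: Path) -> Set[str]:
    """Return the talk names recorded in the backup file, or an empty set."""
    if backup.exists():
        with backup.open("rb") as fh:
            return pickle.load(fh)
    return set()


def crawl(talks: Iterable[str], fetch: Callable[[str], str], out_dir: Path,
          backup: Path, delay: float = 10.0) -> int:
    """Fetch every talk not yet recorded as done; return how many were fetched."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = load_done(backup)
    fetched = 0
    for name in talks:
        if name in done:
            continue  # already saved by an earlier run
        text = fetch(name)  # may raise, for example on an HTTP 429 response
        (out_dir / f"{name}.csv").write_text(text, encoding="utf-8")
        done.add(name)
        with backup.open("wb") as fh:  # persist progress after every talk
            pickle.dump(done, fh)
        fetched += 1
        time.sleep(delay)  # throttle before the next request
    return fetched
```

Because progress is written after every talk, a run that aborts on an error leaves the backup file consistent, and calling crawl again with the same arguments only fetches the talks that are still missing.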

The additional script combine_csvs.py merges the per-talk files into a single CSV file and also writes one CSV file per language.

python3 ./combine_csvs.py --input_dir=/output/dir --outname=/final/output/dir/name

All files are then saved to the directory /final/output/dir with names starting with final: one file containing all talks (final.csv) and one file per language (final.en.csv, final.de.csv, etc.).
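To make the combining step concrete, here is a minimal sketch that uses only the standard csv module. It is an illustration, not the contents of combine_csvs.py (which may well use pandas, given requirements.txt). The function name combine and the assumption that every per-talk file has a header row with a language column named lang are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def combine(input_dir: Path, out_prefix: Path, lang_column: str = "lang") -> List[Path]:
    """Merge all CSV files in input_dir; write one combined file and one per language."""
    rows: List[Dict[str, str]] = []
    fields: List[str] = []
    for path in sorted(input_dir.glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            for name in reader.fieldnames or []:
                if name not in fields:
                    fields.append(name)  # union of all header columns, in first-seen order
            rows.extend(reader)

    # Group rows by the (assumed) language column.
    by_lang: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        by_lang[row.get(lang_column) or "unknown"].append(row)

    # <prefix>.csv holds everything, <prefix>.<lang>.csv holds one language each.
    targets = {out_prefix.with_name(out_prefix.name + ".csv"): rows}
    for lang, subset in by_lang.items():
        targets[out_prefix.with_name(f"{out_prefix.name}.{lang}.csv")] = subset

    out_prefix.parent.mkdir(parents=True, exist_ok=True)
    for target, subset in targets.items():
        with target.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, restval="",
                                    extrasaction="ignore")
            writer.writeheader()
            writer.writerows(subset)
    return list(targets)
```

Called as combine(Path("/output/dir"), Path("/final/output/dir/final")), this sketch would write final.csv plus one final.<lang>.csv for each value found in the language column.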

Repository: naetherm/TEDExtract
