GitHub - manishanker/scrape_toi: This repository contains Python code that can be used to scrape TOI (Times of India), a leading newspaper in India. It picks an article from the archive at random and scrapes it. The scraped data is then checked for bad words and cleaned using Catvar.

Scrape Data For Speech

This repo consists of scripts that can be used to scrape data from TOI (Times of India). The collected text can be used for speech training.

1. collect_news_data_toi.py

This is the main Python process. It uses Goose and BeautifulSoup to scrape the content and parse the divs of the page at a given URL.

You can find more information about Goose here (https://github.com/grangier/python-goose) and BeautifulSoup here (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
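
The link-collection step can be illustrated without any third-party dependency. The sketch below is not the code in collect_news_data_toi.py, which uses Goose and BeautifulSoup. It swaps in the standard library's html.parser as a stand-in, and the names LinkCollector and extract_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            # Keep first-seen order and skip duplicates.
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique absolute links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Fetching the page and extracting the article body are left out here. In the repository those parts are handled by Goose and BeautifulSoup.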

2. deamon_collect_urls.sh

This is a shell script that checks whether the above process is running and starts it if it is not. With this script, we can keep the Python process running continuously to collect a large amount of text data from newspapers.
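
For readers who prefer to stay in Python, the same keep-alive idea can be sketched as a small supervisor loop. This is a different technique from the shell script: instead of checking a process list, it starts the command as a child process and starts it again whenever it exits. The function name keep_running and its parameters are hypothetical.

```python
import subprocess
import sys
import time

# The scraper script named in this README, run with the current interpreter.
DEFAULT_COMMAND = [sys.executable, "collect_news_data_toi.py"]


def keep_running(command=DEFAULT_COMMAND, max_starts=None, poll_seconds=1.0):
    """Start `command`, and start it again whenever it exits.

    Returns the number of times the command was started. With
    max_starts=None the loop runs until it is interrupted.
    """
    starts = 0
    while max_starts is None or starts < max_starts:
        process = subprocess.Popen(command)
        starts += 1
        # Poll until the child exits, like a periodic "is it alive?" check.
        while process.poll() is None:
            time.sleep(poll_seconds)
    return starts
```

Calling keep_running() with no arguments supervises the scraper indefinitely. Passing max_starts bounds the number of restarts, which is useful for trying the loop out.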

3. log_collecturls_toi.txt

This is a text file kept for reference. The above Python script picks random dates (each an archive page like http://timesofindia.indiatimes.com/2011/11/28/archivelist/year-2011,month-11,starttime-40875.cms ) and collects the individual URLs from them. This log helps identify which random dates were picked and which URL each piece of text was extracted from.
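
How an archive URL for a random date might be built can be sketched as follows. The starttime value is an assumption inferred from the single example URL above: 40875 equals the number of days from 1899-12-30 to 2011-11-28, so the sketch uses that day count. Whether single-digit months and days are zero-padded in the path is not known from the example, and the helper names archive_url and random_day are hypothetical.

```python
import random
from datetime import date, timedelta

# Assumption: starttime counts days since 1899-12-30 (matches the one example).
ARCHIVE_EPOCH = date(1899, 12, 30)
BASE = "http://timesofindia.indiatimes.com"


def archive_url(day):
    """Build the archive-list URL for a given date."""
    starttime = (day - ARCHIVE_EPOCH).days
    return (
        f"{BASE}/{day.year}/{day.month}/{day.day}/archivelist/"
        f"year-{day.year},month-{day.month},starttime-{starttime}.cms"
    )


def random_day(first, last, rng=random):
    """Pick a date uniformly at random between first and last, inclusive."""
    span = (last - first).days
    return first + timedelta(days=rng.randint(0, span))
```

Passing a seeded random.Random instance as rng makes the picked dates reproducible, which complements the log file when tracing where a piece of text came from.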

4. TOI_CONTENT.txt

This text file has the extracted content. It has been cleaned using sed, and the delimiter (###########################################################) has been removed.
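
The cleaning itself was done with sed. A rough Python equivalent of the delimiter removal is sketched below, assuming each delimiter sits on its own line and consists only of '#' characters. The ten-character minimum and the name strip_delimiters are choices made for this illustration.

```python
import re

# Assumption: a delimiter is a line made up solely of ten or more '#' characters.
DELIMITER_LINE = re.compile(r"^\s*#{10,}\s*$")


def strip_delimiters(text):
    """Drop delimiter lines and return the remaining lines joined by newlines."""
    kept = [line for line in text.splitlines() if not DELIMITER_LINE.match(line)]
    return "\n".join(kept)
```

Lines that merely contain a '#' somewhere in ordinary text are left untouched.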

5. TOI_URLS.txt

This text file lists all the URLs from which information was extracted.

6. catvar/

This folder contains Catvar, the categorial variation database for English words. For instance, running CVsearch.pl kill finds the various forms of the word, such as killed and killer.

7. TOI_CONTENT_BADWORDS_REMOVED.txt

This file has the content with all the bad words removed.

8. remove_bad_words.py

A Python script that builds the set of bad words from a word list and then removes the sentences containing those words from the input passages.
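
A minimal sketch of this kind of filtering is shown below. It is not the code in remove_bad_words.py. The variants mapping stands in for word forms looked up in Catvar, sentence splitting is a simple punctuation-based assumption, and the function names are hypothetical.

```python
import re

# Assumption: sentences end with '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
WORD = re.compile(r"[a-z']+")


def expand_words(seed_words, variants):
    """Return the seed words plus their listed variants, lowercased."""
    blocked = set()
    for word in seed_words:
        blocked.add(word.lower())
        blocked.update(v.lower() for v in variants.get(word, ()))
    return blocked


def remove_sentences(passage, blocked):
    """Drop every sentence that contains a blocked word as a whole word."""
    sentences = SENTENCE_END.split(passage.strip())
    kept = [s for s in sentences if blocked.isdisjoint(WORD.findall(s.lower()))]
    return " ".join(kept)
```

Matching on whole words means that a blocked word such as kill does not remove a sentence that only contains skill.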

Files in the repository (6 commits):

catvar
.gitattributes
README.md
TOI_CONTENT.txt
TOI_CONTENT_BADWORDS_REMOVED.txt
TOI_URLS.txt
batch2.txt
clean_toi.py
clean_toi_news.py
collect_news_data_toi.py
deamon_collect_urls.sh
final
log_collecturls_toi.txt
remove_bad_words.py
