This is a Python package for retrieving textual content from online Bengali newspapers. It is aimed mainly at data scientists who need Bengali text for research purposes. The package was created for academic and non-commercial use only. The author does not encourage or suggest using it for anything other than academic research or experimental work.
The easiest way to install it is to run the following command from a Python 3 environment:
pip install banglanews
Alternatively, you can download the package banglanews-0.0.2.tar.gz and put it in a directory. From any Python 3 environment, open a terminal, go to the directory where you put the package, and install it using the following command:
pip install banglanews-0.0.2.tar.gz
Right now, the package supports only the leading Bengali newspaper 'Prothom Alo'. Import the module in your code as follows:
from banglanews import prothomalo
The package contains a single class named scraper. This is how you need to initialize it:
objScraper = prothomalo.scraper('2021-12-01','2021-12-05','D:\\Content')
1st argument: start_date = The date you want the scraper to start with.
2nd argument: end_date = The date you want the scraper to end with.
3rd argument: output_dir = The file system location where you want to dump the content.
Please note, all three arguments are mandatory. Also note, the dates must be in the YYYY-MM-DD format.
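As a minimal sketch, the two date strings could be checked before they are passed to the scraper. The helper name is_valid_date is hypothetical and is not part of banglanews; it relies only on datetime.strptime from the standard library. The check that start_date is not later than end_date is an assumption about what a sensible range looks like, not a documented requirement of the package.

```python
from datetime import datetime


def is_valid_date(text):
    """Return True if text is a real calendar date written as YYYY-MM-DD.

    Hypothetical helper for illustration; not part of banglanews.
    """
    # strptime alone accepts unpadded values such as '2021-12-1',
    # so the length check enforces the zero-padded 10-character form.
    if len(text) != 10:
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


start_date = "2021-12-01"
end_date = "2021-12-05"

dates_ok = is_valid_date(start_date) and is_valid_date(end_date)
# Zero-padded YYYY-MM-DD strings sort chronologically as plain strings.
# Assumption: start_date should not be later than end_date.
range_ok = dates_ok and start_date <= end_date
print(dates_ok, range_ok)
```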
The scraper class provides the following methods:
PrintContents: Writes individual articles to text files in the file system.
PrintURLs: Writes the headlines and URLs to a single pipe-delimited CSV file in the file system.
PrintComments: Writes the headlines, URLs, and comments to a single pipe-delimited CSV file in the file system.
1st argument (optional): search_text = The text you want to search for. If you do not pass any value, all the URLs between start_date and end_date will be searched.
objScraper.PrintURLs('করোনা')
The above call will create a .csv file in the output_dir location provided in the class initialization.
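Because the file is pipe-delimited rather than comma-delimited, it has to be read with an explicit delimiter. The following is a minimal sketch using the standard csv module. The name of the generated file and its exact column layout are not documented here, so the example reads an in-memory stand-in with an assumed two-column layout (headline, URL) and made-up rows.

```python
import csv
import io

# Stand-in for the generated file. The two-column layout (headline, URL)
# and the sample rows are assumptions made for illustration only.
sample = io.StringIO(
    "Headline one|https://example.com/article-1\n"
    "Headline two|https://example.com/article-2\n"
)

# delimiter="|" tells the csv module to split fields on the pipe character.
rows = list(csv.reader(sample, delimiter="|"))

for headline, url in rows:
    print(headline, "->", url)
```

To read a real file, the io.StringIO object could be replaced with something like open(path, encoding="utf-8", newline=""), where path points to the file in output_dir. The UTF-8 encoding is an assumption and may need adjusting.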
Latest version: 0.0.2
Previous versions: