neolithian/banglanews

banglanews

This is a Python package for extracting textual content from online Bengali newspapers. It is aimed mainly at data scientists who need Bengali text for research purposes. This package was created for academic AND non-commercial use ONLY. The author of this package does not encourage OR suggest using it for anything other than academic research or experimental work.

How to install

The easiest way to install it is from a Python 3 environment, using the following command:

pip install banglanews

Alternatively, you can download the package banglanews-0.0.2.tar.gz and put it in a directory. From any Python 3 environment, open a terminal, go to the directory that contains the package, and install it using the following command:

pip install banglanews-0.0.2.tar.gz
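
To confirm that either install route worked, you can ask Python which version it sees. The sketch below uses only the standard library (importlib.metadata, Python 3.8+); the helper name installed_version is made up for this example and is not part of banglanews.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("banglanews")
    if found is None:
        print("banglanews is not installed in this environment")
    else:
        print(f"banglanews {found} is installed")
```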

How to use it in the code

Right now the package supports only the leading Bengali newspaper 'Prothom Alo'. Import the module in your code as follows:

from banglanews import prothomalo

Initializing the class

The package contains a single class named scraper. Initialize it as follows:

objScraper = prothomalo.scraper('2021-12-01','2021-12-05','D:\\Content')

1st argument: start_date = The first date the scraper should collect content for.
2nd argument: end_date = The last date the scraper should collect content for.
3rd argument: output_dir = The file system location where the content will be written.

Please note, all three arguments are mandatory. Also note, the dates must be in the YYYY-MM-DD format.
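
Since the dates must follow the YYYY-MM-DD format, it can help to check the arguments before handing them to the scraper. The following is a minimal sketch that is independent of banglanews: the helpers parse_date and check_scraper_args are hypothetical names, and the rules that end_date must not precede start_date and that the output folder is created up front are assumptions of this example, not documented behaviour of the package.

```python
import re
from datetime import datetime
from pathlib import Path

# ASCII digits only, so that other digit scripts are rejected as well
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_date(text):
    """Parse a strict YYYY-MM-DD string; raise ValueError otherwise."""
    if not DATE_PATTERN.fullmatch(text):
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    # strptime also rejects impossible dates such as 2021-02-30
    return datetime.strptime(text, "%Y-%m-%d").date()


def check_scraper_args(start_date, end_date, output_dir):
    """Validate the three arguments and return them as a tuple of strings."""
    start = parse_date(start_date)
    end = parse_date(end_date)
    if end < start:
        # Assumption: a reversed date range is treated as a mistake
        raise ValueError("end_date is earlier than start_date")
    # Assumption: creating the folder in advance is harmless
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    return start_date, end_date, str(output_dir)
```

The returned tuple can then be unpacked into the constructor, for example prothomalo.scraper(*check_scraper_args('2021-12-01', '2021-12-05', r'D:\Content')). Writing the Windows path as a raw string avoids having to double the backslashes.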

Important methods:

PrintContents : Writes individual articles to text files in the file system
PrintURLs : Writes the headlines and URLs to a single pipe-delimited csv file in the file system
PrintComments : Writes the headlines, URLs and comments to a single pipe-delimited csv file in the file system
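
Once PrintContents has written its text files, a research script usually needs them back in memory. The sketch below is one way to do that with the standard library only. It assumes the article files end in .txt and are UTF-8 encoded; neither is stated here, so adjust the pattern and encoding parameters to match what you find in your output_dir. The function name load_articles is hypothetical.

```python
from pathlib import Path


def load_articles(output_dir, pattern="*.txt", encoding="utf-8"):
    """Return {file name: text} for every matching file directly inside output_dir."""
    articles = {}
    for path in sorted(Path(output_dir).glob(pattern)):
        if path.is_file():
            # Assumption: files are UTF-8; pass another encoding if they are not
            articles[path.name] = path.read_text(encoding=encoding)
    return articles
```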

A sample call

1st argument (optional): search_text = The text you want to search for. If you do not pass a value, all the URLs between start_date and end_date will be searched.

objScraper.PrintURLs('করোনা')

The above call creates a .csv file in the output_dir location given when the class was initialized.
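
Because the file is pipe-delimited rather than comma-separated, it needs an explicit delimiter when read back. Here is a minimal sketch using Python's csv module. The file name, the UTF-8 encoding, and whether the first row is a header are all assumptions to verify against your own output; read_pipe_csv is a hypothetical helper, not part of banglanews.

```python
import csv


def read_pipe_csv(csv_path, encoding="utf-8"):
    """Return the non-empty rows of a pipe-delimited file as lists of strings."""
    # newline="" lets the csv module handle line endings itself
    with open(csv_path, newline="", encoding=encoding) as handle:
        return [row for row in csv.reader(handle, delimiter="|") if row]
```

Each returned row is a list of column values, so for the PrintURLs output you would expect a headline and a URL per row. If headlines contain double quotes, the default quoting rules of csv.reader may need adjusting.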

Version information

Latest version: 0.0.2
Previous versions:

About

A web content scraper for Bengali newspapers, mainly created for researchers who need Bangla text datasets.
