forked from jplusplus/janus

A basic tool to retrieve the documents metadata from a domain name

pombredanne/janus

 
 

Janus

Extract metadata from PDFs, fast

Janus is a simple tool that extracts the metadata from every PDF file on a single domain. Type in a domain name, for instance "gov.uk", and get a list of all PDFs with their metadata (e.g. author, creation date and modification date). Metadata analysis is a great source of information for investigative journalists. In the future, Janus will include other data types and go further in the analysis by clustering metadata together (for example, grouping documents by the individuals who appear in their metadata). It was developed by Journalism++'s Pierre Bellon and by Leo Wallentin, who was an embedded news nerd there in June 2013.

How to install it

  • make sure Node.js is installed on your computer

  • get the sources (the trailing "janus" names the target directory, so that the cd below works): git clone https://github.com/jplusplus/documents-from-domains.git janus

  • install the dependencies

      cd janus
      npm install
    
  • copy the configuration file template

    cp config.template.json config.json 
    
  • then enter your Bing account key in config.json
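
The configuration step can be made safe to repeat. The following is a minimal sketch, not part of Janus: `ensure_config` is a hypothetical helper that copies the template only when config.json does not exist yet, so a key you have already entered is not overwritten. It assumes it is run from the project directory.

```shell
# Hypothetical helper, not part of Janus: create config.json from the
# template unless it already exists, so an edited key is never overwritten.
ensure_config() {
  if [ -f config.json ]; then
    echo "config.json already exists, leaving it untouched"
    return 0
  fi
  if [ ! -f config.template.json ]; then
    echo "config.template.json not found; run this from the project directory" >&2
    return 1
  fi
  cp config.template.json config.json
  echo "created config.json; add your Bing account key to it"
}
```

Call `ensure_config` once after `npm install`, then edit config.json by hand to add the key.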

Launch the application

You can launch it by executing coffee app.coffee, but I recommend using nodemon:

npm install -g nodemon
nodemon app.coffee

Troubleshooting

  • I get an error when I run npm install: you may have an older version of Node.js. Make sure node >= 9.4.1 is installed on your system.
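
The installed version can be compared against that minimum from the shell. This is a rough sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper, not part of Janus, and the 9.4.1 value is copied verbatim from the note above.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal
# to version $2. Relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="9.4.1"   # minimum stated in the troubleshooting note

if command -v node >/dev/null 2>&1; then
  # node --version prints something like "v10.2.0"; strip the leading "v"
  installed="$(node --version | sed 's/^v//')"
  if version_ge "$installed" "$required"; then
    echo "node $installed satisfies >= $required"
  else
    echo "node $installed is older than $required; please upgrade"
  fi
else
  echo "node is not installed or not on PATH"
fi
```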

TODO

  • handle image search
  • handle doc and docx search
