What is Meaningful Web?

We aim to extract structured information from a web resource:

url --> meaningfulweb engine --> structured information

Homepage:

http://www.meaningfulweb.org

Artifacts:

  1. meaningfulweb-opengraph.jar <-- Open Graph parser
  2. meaningfulweb-core.jar <-- core engine
  3. meaningfulweb-app.war <-- web application

Build:

Build and release are managed via Maven: http://maven.apache.org/

  1. build all: from the top-level meaningfulweb/ directory, run: mvn clean install
  2. start the webapp: from meaningfulweb-app/, run: mvn jetty:run

The application should be running at: http://localhost:8080/

The REST service should be available at: http://localhost:8080/get-meaning?url=xxx

Example:

http://localhost:8080/get-meaning?url=http://www.google.com
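
For illustration, a minimal sketch of calling the REST service from Java and printing the raw response. The class name GetMeaningClient is made up for this example, and the response is printed as-is since its exact format is not described here.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class GetMeaningClient {
    public static void main(String[] args) throws Exception {
        // URL-encode the page we want the service to analyze
        String target = URLEncoder.encode("http://www.google.com", "UTF-8");
        URL endpoint = new URL("http://localhost:8080/get-meaning?url=" + target);

        // read and print the raw response body
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(endpoint.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}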

Sample Code:

// extract the best image representing a URL

String url = "http://www.google.com";

MetaContentExtractor extractor = new MetaContentExtractor();
MeaningfulWebObject obj = extractor.extractFromUrl(url);

String bestImageURL = obj.getImage();       // best representative image
String title = obj.getTitle();              // page title
String description = obj.getDescription();  // page description
String domain = obj.getDomain();            // source domain

...

Bugs:

File bugs here
