ElasticSearch Rss River


RSS River for Elasticsearch

Welcome to the RSS River Plugin for Elasticsearch

Versions

RSS River Plugin    ElasticSearch
master (0.2.0)      0.90
0.1.0               0.90
0.0.6               0.19
0.0.5               0.18
0.0.4               0.18
0.0.3               0.18
0.0.2               0.17

Build Status

Thanks to CloudBees for the build status.

Getting Started

Installation

Just type:

$ bin/plugin -install fr.pilato.elasticsearch.river/rssriver/0.1.0

The plugin manager then downloads and installs the river:

-> Installing fr.pilato.elasticsearch.river/rssriver/0.1.0...
Trying http://download.elasticsearch.org/fr.pilato.elasticsearch.river/rssriver/rssriver-0.1.0.zip...
Trying http://search.maven.org/remotecontent?filepath=fr/pilato/elasticsearch/river/rssriver/0.1.0/rssriver-0.1.0.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/fr/pilato/elasticsearch/river/rssriver/0.1.0/rssriver-0.1.0.zip...
Downloading ......DONE
Installed rssriver

Creating an RSS river

First, create an index to store all the feed documents:

$ curl -XPUT 'localhost:9200/lemonde/' -d '{}'

Then create the river with the following properties:

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
        "name": "lemonde",
        "url": "http://www.lemonde.fr/rss/une.xml"
        }
    ]
  }
}'

This RSS feed follows the RSS 2.0 specification and provides a ttl entry, so the update rate is automatically adjusted to that value.

To set your own refresh rate (used when the feed provides no ttl), use the update_rate option, in milliseconds. To force that rate even when the feed provides a ttl, also set ignore_ttl to true:

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
        "name": "lemonde",
        "url": "http://www.lemonde.fr/rss/une.xml",
        "update_rate": 900000,
        "ignore_ttl": true
        }
    ]
  }
}'
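The interplay between ttl, update_rate and ignore_ttl described above can be sketched in a few lines of Python. This is only an illustration of the rule as this README states it, not the plugin's actual code: the function name and the DEFAULT_UPDATE_RATE_MS fallback are assumptions made for the example.

```python
from typing import Optional

# Hypothetical fallback used only by this sketch; the plugin's real default may differ.
DEFAULT_UPDATE_RATE_MS = 15 * 60 * 1000


def effective_update_rate_ms(
    ttl_minutes: Optional[int],
    update_rate: Optional[int] = None,
    ignore_ttl: bool = False,
) -> int:
    """Return the delay between two fetches, in milliseconds."""
    configured = update_rate if update_rate is not None else DEFAULT_UPDATE_RATE_MS
    # The feed's <ttl> (expressed in minutes by RSS 2.0) wins unless it is
    # missing or ignore_ttl forces the configured rate.
    if ignore_ttl or ttl_minutes is None:
        return configured
    return ttl_minutes * 60 * 1000
```

With the river above (update_rate of 900000 and ignore_ttl set to true), the sketch returns 900000 whatever ttl the feed announces.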

If you need to index several feeds, add them all to the feeds array:

Feed1

Feed2

$ curl -XPUT 'localhost:9200/actus/' -d '{}'

$ curl -XPUT 'localhost:9200/_river/actus/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
            "name": "lemonde",
            "url": "http://www.lemonde.fr/rss/une.xml",
            "update_rate": 900000
        }, {
            "name": "lefigaro",
            "url": "http://rss.lefigaro.fr/lefigaro/laune",
            "update_rate": 1800000,
            "ignore_ttl": true
        }
    ]
  }
}'
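When the list of feeds grows, it can be easier to generate the _meta body than to write it by hand. The following Python sketch builds the same JSON structure as the curl command above; build_river_meta is a hypothetical helper, and it only knows the four feed options shown in this README.

```python
import json


def build_river_meta(feeds):
    """Build the body sent to /_river/<name>/_meta for a list of feed dicts."""
    allowed = ("name", "url", "update_rate", "ignore_ttl")
    cleaned = []
    for feed in feeds:
        if "url" not in feed:
            raise ValueError("each feed needs a url")
        # Keep only the options shown in this README, in a stable order.
        cleaned.append({key: feed[key] for key in allowed if key in feed})
    return {"type": "rss", "rss": {"feeds": cleaned}}


meta = build_river_meta([
    {"name": "lemonde", "url": "http://www.lemonde.fr/rss/une.xml",
     "update_rate": 900000},
    {"name": "lefigaro", "url": "http://rss.lefigaro.fr/lefigaro/laune",
     "update_rate": 1800000, "ignore_ttl": True},
])
print(json.dumps(meta, indent=2))
```

The printed JSON can then be passed to curl with the -d option, as in the commands above.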

Working with mappings

When you create your index, you can specify the mapping you want to use as follows:

$ curl -XPUT 'http://localhost:9200/lefigaro/' -d '{}'

$ curl -XPUT 'http://localhost:9200/lefigaro/page/_mapping' -d '{
    "page" : {
        "properties" : {
            "feedname" : {"type" : "string"},
            "title" : {"type" : "string", "analyzer" : "french"},
            "description" : {"type" : "string", "analyzer" : "french"},
            "author" : {"type" : "string"},
            "link" : {"type" : "string"}
        }
    }
}'

The river will then use this mapping when you create it:

$ curl -XPUT 'localhost:9200/_river/lefigaro/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
            "url": "http://rss.lefigaro.fr/lefigaro/laune"
        }
    ]
  }
}'

Behind the scenes

RSS river downloads each RSS feed every update_rate milliseconds and checks whether there are new messages.

First, RSS river looks at the <channel> tag. It reads the optional <pubDate> tag and stores it in Elasticsearch so that it can be compared on the next run.
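The <pubDate> comparison can be pictured with the Python sketch below. It is a guess at the general idea, not the river's implementation: channel_has_news is a hypothetical name, and treating a missing <pubDate> as "always inspect the items" is an assumption, since this README lists that case as still to be checked.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def channel_has_news(pub_date: Optional[str], last_seen: Optional[datetime]) -> bool:
    """Decide whether the <item> entries of a channel should be examined."""
    # Assumption: with no <pubDate> or no stored date, always inspect the items.
    if pub_date is None or last_seen is None:
        return True
    # RSS 2.0 dates follow RFC 822, which email.utils can parse.
    return parsedate_to_datetime(pub_date) > last_seen
```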

Then, for each <item> tag, RSS river creates a new document of type page with the following properties:

XML Tag                  ES Mapping
<title>                  title
<description>            description
<author>                 author
<link>                   link
<geo:lat> <geo:long>     location
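To make the table concrete, here is a small Python sketch that turns one <item> element into a document with those fields. It is illustrative only: item_to_document is a hypothetical helper, the lat/lon shape of the location field and the feedname handling are assumptions, and the geo prefix is assumed to be bound to the W3C Basic Geo namespace.

```python
import xml.etree.ElementTree as ET

# W3C Basic Geo vocabulary namespace, commonly bound to the "geo" prefix.
GEO_NS = {"geo": "http://www.w3.org/2003/01/geo/wgs84_pos#"}


def item_to_document(item, feedname=None):
    """Turn one <item> element into a dict following the table above."""
    doc = {}
    if feedname is not None:
        doc["feedname"] = feedname
    for tag in ("title", "description", "author", "link"):
        text = item.findtext(tag)
        if text is not None:
            doc[tag] = text.strip()
    lat = item.findtext("geo:lat", namespaces=GEO_NS)
    lon = item.findtext("geo:long", namespaces=GEO_NS)
    if lat is not None and lon is not None:
        # Assumed shape for the location field: a lat/lon object.
        doc["location"] = {"lat": float(lat), "lon": float(lon)}
    return doc


SAMPLE = """<item xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <title>Example title</title>
  <description>Example description</description>
  <link>http://example.org/article</link>
  <geo:lat>48.85</geo:lat>
  <geo:long>2.35</geo:long>
</item>"""

doc = item_to_document(ET.fromstring(SAMPLE), feedname="example")
```

Tags that are absent from the item (here <author>) are simply left out of the document.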

The document ID is generated from the description using the UUID generator, so each message is indexed only once.
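The reason a description-derived ID prevents duplicates is that the same text always yields the same ID, so indexing the item again overwrites the existing document instead of adding a new one. A minimal Python sketch of that idea follows; the choice of a version 3 (MD5, name-based) UUID and of the URL namespace is an assumption made for the example, and the plugin's own generator may differ.

```python
import uuid


def document_id(description: str) -> str:
    """Derive a repeatable ID from an item's description."""
    # Assumption: a name-based UUID (version 3, MD5) under an arbitrary namespace.
    return str(uuid.uuid3(uuid.NAMESPACE_URL, description))
```

Two items with identical descriptions therefore collapse into a single document, which is also why the to-do list suggests using <guid> as the source text instead.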

Read RSS 2.0 Specification for more details about RSS channels.

To Do List

There are still many things to do:

  • As the <pubDate> tag is optional, check that RSS River works when it is missing and parses each feed message
  • Support more RSS <channel> sub-elements, such as <category>, <skipDays>, <skipHours>
  • Support more RSS <item> sub-elements, such as <category>, <enclosure>, <pubDate>
  • Support multiple channels (one per language, for instance)
  • Use <guid> as the text to encode when generating the ID