This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

Added integration tests for links
Gander was collecting the links of the main content but
we were not covering it by tests.

Improved README.md

Also removed the folha test, as it fails because of the
charset. I've tried all the charsets (including the one
specified in the HTML) but it keeps failing. As it does not
add much value, I've decided to remove it.
albertpastrana committed Aug 4, 2015
1 parent 55bf9c7 commit b360c03
Showing 3 changed files with 70 additions and 21 deletions.
3 changes: 3 additions & 0 deletions NOTICE
@@ -1,2 +1,5 @@
This product includes software developed by Intent HQ
(http://www.intenthq.com/).

This product includes software developed by Gravity.com
(http://www.gravity.com/).
43 changes: 42 additions & 1 deletion README.md
@@ -1,8 +1,49 @@
#Gander [![Circle CI](https://circleci.com/gh/intenthq/gander.svg?style=svg)](https://circleci.com/gh/intenthq/gander) [![Coverage Status](https://coveralls.io/repos/intenthq/gander/badge.svg?branch=master&service=github)](https://coveralls.io/github/intenthq/gander?branch=master)

Gander is a scala library that extract content from webpages.
**Gander is a scala library that extracts metadata and content from web pages.**

It is based on [Goose](https://github.com/GravityLabs/goose) with the idea to:
- Simplify its codebase by removing some of its functionality (like crawling; there are plenty of projects that do it well)
- Keep it alive (Goose has been inactive for several years now)
- Make its codebase more functional and take advantage of some of the newer Scala features

## What data does it extract?

Gander will try to extract three different kinds of data from a web page:
- Metadata: (title, meta description, meta keywords, language, canonical link, open graph data,
publish date)
- Main text for the page
- Links present in the main text of the page

## Using Gander

### Adding the dependency

The artefact is published on Maven Central. If you are using sbt, you just need to add
the following dependency (remember to replace 1.0 with the latest version):
```
"com.intenthq" % "gander" % "1.0"
```
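
In a `build.sbt` file, that line would typically be added to `libraryDependencies`. A minimal sketch, keeping the coordinates and the placeholder version from the line quoted in this README; if the published artefact name turns out to carry a Scala binary version suffix, sbt's `%%` operator would be needed in place of the first `%`:

```scala
// build.sbt (sketch): replace 1.0 with the latest published version
libraryDependencies += "com.intenthq" % "gander" % "1.0"
```
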
### In your code

Gander provides a single object with a single method to access its functionality,
and it is straightforward to use.

These three lines of code, for example, download the specified URL (using
Guava) and extract the page information from the raw HTML. Here `charset` is a
`java.nio.charset.Charset`, for example Guava's `Charsets.UTF_8`:
```scala
val url = "http://engineering.intenthq.com/2015/03/what-is-good-code-a-scientific-definition/"
val rawHTML = Resources.toString(new URL(url), charset)
println(Gander.extract(rawHTML))

```

You can find more examples in our tests.

## Collaborate & Philosophy
- Keep it simple and do one thing.
- Code that was doing other things (such as downloading) has been removed.
- Images were removed for simplicity; we may want to add them back in the future.
- The interface is simple enough to be used easily from Java as well.

Please, feel free to fork the repo and raise a PR.
45 changes: 25 additions & 20 deletions src/it/scala/com/intenthq/gander/GanderIT.scala
@@ -17,14 +17,15 @@ class GanderIT extends Specification {
}

def check(pageInfo: PageInfo, title: String, metaDescription: String, metaKeywords: String,
lang: Option[String], date: Option[String], content: String, url: String) = {
lang: Option[String], date: Option[String], content: String, url: String, links: Seq[Link]) = {
pageInfo.title must_== title
pageInfo.metaDescription must_== metaDescription
pageInfo.metaKeywords must_== metaKeywords
pageInfo.lang must_== lang
pageInfo.publishDate must_== date.map(DateTime.parse(_).toDate)
pageInfo.cleanedText.get must startWith(content)
pageInfo.canonicalLink.map( _ must_== url).getOrElse(1 must_== 1)
pageInfo.links must containAllOf(links)
}
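
The new `links` assertion reads as a subset check: every expected link must appear among the extracted ones, while additional extracted links are allowed. A minimal standalone sketch of that reading, without specs2, is given here. The `Link` field names (`text`, `target`) and the `containsAll` helper are hypothetical; only the two-argument `Link(...)` shape comes from the tests.

```scala
object LinkCheckSketch {
  // Hypothetical stand-in for Gander's Link; the field names are assumed.
  final case class Link(text: String, target: String)

  // True when every expected link is present among the extracted links.
  // Order and any additional extracted links are ignored.
  def containsAll(extracted: Seq[Link], expected: Seq[Link]): Boolean =
    expected.forall(extracted.contains)
}
```

Under this reading, an empty expected list such as the `links = List()` passed for lancenet and globoesporte constrains nothing.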

"intenthq" >> {
@@ -36,7 +37,9 @@ class GanderIT extends Specification {
metaDescription = "How would you define good code? This article gives a pseudo-scientific answer to that question after asking a sample of 65 developers that same question.",
metaKeywords = "",
lang = Some("en-GB"),
date = Some("2015-03-01"))
date = Some("2015-03-01"),
links = List(Link("Uncle Bob", "http://en.wikipedia.org/wiki/Robert_Cecil_Martin"),
Link("DRY", "http://en.wikipedia.org/wiki/Don%27t_repeat_yourself")))
}

"bbc" >> {
@@ -48,7 +51,10 @@ class GanderIT extends Specification {
metaDescription = "Disneyland Paris is facing a pricing probe following accusations that UK and German customers are being frozen out of promotions available in other European member states.",
metaKeywords = "",
lang = Some("en"),
date = None)
date = None,
links = List(Link("Financial Times said", "http://www.ft.com/cms/s/0/27e42c8e-351d-11e5-b05b-b01debd57852.html#axzz3hDFfsPCX"),
Link("said in a report", "http://www.ft.com/cms/s/0/27e42c8e-351d-11e5-b05b-b01debd57852.html#axzz3hDFfsPCX")))

}

"businessinsider" >> {
@@ -60,7 +66,8 @@ class GanderIT extends Specification {
metaDescription = "Here it is.",
metaKeywords = "",
lang = Some("en"),
date = None)
date = None,
links = List(Link("announcement", "http://www.businessinsider.com/federal-reserve-announcement-fomc-operation-twist-2011-9")))
}

"elpais" >> {
@@ -72,7 +79,13 @@ class GanderIT extends Specification {
metaDescription = "La Alianza se ha reunido este martes con carácter de urgencia a pedición de Ankara para tratar el avance del Estado Islámico",
metaKeywords = "otan, apoyar, cautela, ofensiva, turca, turco, yihadismo, alianza, haber, reunir, martes, urgencia, pedición, ankara, secretario, general, jens stoltenberg, resaltar, unidad, aliado",
lang = Some("es"),
date = Some("2015-07-29"))
date = Some("2015-07-29"),
links = List(Link("en su ofensiva contra el Estado Islámico", "http://internacional.elpais.com/internacional/2015/07/24/actualidad/1437717227_199769.html"),
Link("Jens Stoltenberg.", "http://elpais.com/tag/jens_stoltenberg/a/"),
Link("que este martes hizo estallar un tramo de un gasoducto procedente de Irán", "http://internacional.elpais.com/internacional/2015/07/28/actualidad/1438079899_805996.html"),
Link("onflicto entre Ankara y los simpatizantes del PKK", "http://internacional.elpais.com/internacional/2015/07/27/actualidad/1437986632_361510.html"),
Link("crear una zona libre de combatientes del EI", "http://internacional.elpais.com/internacional/2015/07/27/actualidad/1438026945_461718.html"),
Link("Ahmet Davutoglu", "http://elpais.com/tag/ahmet_davutoglu/a/")))
}

"corriere" >> {
@@ -84,7 +97,9 @@ class GanderIT extends Specification {
metaDescription = "Non si propone lo scioglimento ma si lascia aperta la possibilità di una «diversa valutazione»",
metaKeywords = "Ignazio Marino, Angelino Alfano",
lang = Some("it"),
date = None)
date = None,
links = List(Link("giunta guidata da Ignazio Marino", "http://roma.corriere.it/notizie/politica/15_luglio_28/giunta-marino-senatore-no-tav-esposito-assessore-trasporti-d0e76efa-34fe-11e5-984f-1e10ffe171ae.shtml")))

}

"lemonde" >> {
@@ -100,18 +115,6 @@ class GanderIT extends Specification {
pending
}

"folha" >> {
val url = "http://www1.folha.uol.com.br/esporte/2012/04/1070420-leao-critica-regulamento-do-paulista-e-poe-culpa-na-tv.shtml"
check(extract(url, Charsets.ISO_8859_1),
url = "http://www1.folha.uol.com.br/esporte/1070420-leao-critica-regulamento-do-paulista-e-poe-culpa-na-tv.shtml",
content = "Após retomar a liderança do Campeonato Paulista, com a vitória do São Paulo de virada por 4 a 2 sobre o Ituano",
title = "Leão critica regulamento do Paulista e põe culpa na TV",
metaDescription = "Após retomar a liderança do Campeonato Paulista, com a vitória do São Paulo de virada por 4 a 2 sobre o Ituano, o técnico Emerson Leão voltou a criticar a fórmula de disputa da competição e a FPF (Federação Paulista de Futebol), apontado a culpa para a emissora de televisão dona dos direitos de transmissão.",
metaKeywords = "São Paulo, Emerson Leão, Campeonato Paulista, FPF,, jornalismo, informação, economia, política, fotografia, imagem, noticiário, cultura, tecnologia, esporte, Brasil, internacional, geral, polícia, manchetes, loteria, loterias, resultados, opinião, análise, cobertura",
lang = None,
date = None)
}

"lancenet" >> {
val url = "http://www.lancenet.com.br/sao-paulo/Leao-Arena-Barueri-casa-Tricolor_0_675532605.html"
check(extract(url),
@@ -121,7 +124,8 @@ class GanderIT extends Specification {
metaDescription = "No próximo sábado, o São Paulo jogará, como mandante, na Arena Barueri diante do Mogi Mirim. Isso porque no estádio do Morumbi haverá, nesta ...",
metaKeywords = "Leao,Arena,Barueri,casa,Tricolor",
lang = Some("pt"),
date = Some("2012-04-03T18:30:00Z"))
date = Some("2012-04-03T18:30:00Z"),
links = List())
}

"globoesporte" >> {
@@ -133,7 +137,8 @@ class GanderIT extends Specification {
metaDescription = "Emerson Leão cobra liderança ao São Paulo (Foto: Mário Ângelo / Ag. Estado) Emerson Leão não foi ao campo na manhã desta terça-feira no centro de treinamento do São Paulo. Bem humorado e com roupa casual, preferiu acompanhar de longe ...",
metaKeywords = "notícias, notícia, são paulo",
lang = None,
date = Some("2012-04-01"))
date = Some("2012-04-01"),
links = List())
}

"opengraph" >> {