## Test summary

Overall: Fewer features than `newspaper`. It only extracts a single article at a time.

Advantages:

* Extracts the description and keywords from `meta` tags.

Disadvantages:

* `publish_date` extraction failed for every article in this initial test.

Other notes:

* The `cleaned_text` interface drops some content, and how it decides what to remove is opaque.

## Single article, official example

In [1]:
from goose3 import Goose
url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
g = Goose()
article = g.extract(url=url)

In [2]:
article.title

'Occupy London loses eviction fight'

In [3]:
article.meta_description

"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."

In [4]:
article.cleaned_text[:150]

"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid evictio"

In [7]:
article.top_image

## Single article, Chinese (BBC)

Setting `target_language` makes no difference here. With or without it, the library extracts only `title` and `meta_description`; `cleaned_text` comes back empty.

In [34]:
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
g = Goose({'use_meta_language': False, 'target_language':'zh-CN'})
article = g.extract(url=url)

In [35]:
article.title

'港特首梁振英就住宅违建事件道歉 - BBC中文网 - 两岸三地'

In [36]:
article.meta_description

'香港行政长官梁振英在各方压力下到立法会接受质询，就其大宅的违章建筑问题道歉。'

In [37]:
article.cleaned_text

''

In [38]:
article.authors

[]

In [39]:
article.top_image

In [40]:
article.publish_date

## Single article Chinese, official example

In [42]:
from goose3.text import StopWordsChinese
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=url)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/94/lf84ld9d4td2_td6nrr955nw0000gn/T/jieba.cache
Loading model cost 0.724 seconds.
Prefix dict has been built succesfully.


In [48]:
article.cleaned_text[:150]

'香港行政长官梁振英在各方压力下就其大宅的违章建筑（僭建）问题到立法会接受质询，并向香港民众道歉。\n\n梁振英在星期二（12月10日）的答问大会开始之际在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的意图和动机。\n\n一些亲北京阵营议员欢迎梁振英道歉，且认为应能获得香港民众接受，但这些议员也质问梁振英有'

In [44]:
article.title

'港特首梁振英就住宅违建事件道歉 - BBC中文网 - 两岸三地'

`newspaper` can parse the `publish_date` for this article, whereas the call below returns nothing.

In [45]:
article.publish_date

In [46]:
article.authors

[]

## Single article, Chinese (Sina blog)

In [49]:
from goose3.text import StopWordsChinese
url = 'http://blog.sina.com.cn/s/blog_aeb058020102xuwb.html'
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=url)

In [50]:
article.title

'震撼！99张老照片告诉你什么是郑州！_老照片图库_新浪博客'

The `cleaned_text` here drops some content from the original page. The result looks clean, but recall is a concern: it is unclear whether critical content gets removed, because how the filtering decides what to keep is not apparent.

In [51]:
article.cleaned_text

'但是却永远的停留在了历史的长河里，\n\n1923年的郑州火车站。（京汉站）这是留存最早的照片。隐约可见的天桥，连接着陇海车站\n\n1954年的郑州火车站，高大威严，广场开阔，已显出不凡的气势。当时为全国十大火车站之一\n\n1959年的郑州火车站。可见当时火车站广场的零乱、破旧。火车站对面是公共汽车站，中间的一排小屋是公共汽车调度室。图中近景处看似民房，实则餐馆、百货店\n\n20年代郑州站的月台上，旅客们在上郑州---汉口的火车\n\n1991年从南面远看郑州火车站，右下角是现在的银基\n\n它同时也使“二七塔”无可厚非地成为郑州市的标志性建筑。\n\n二七纪念塔的前身是一座木塔，高15米，是郑州市物资交流骡马大会的会标\n\n（在七十年代初被拆除了，原地点竖起了现在的二七纪念塔）\n\n70年代的二七广场，灯塔左边的语录牌处是进纪念塔的入口，灯塔右边的语录牌处是出口还有售票处，进口处向左走几步就是老“合记烩面馆”隔壁就是“二七公安分局”\n\n1973年在“二七广场”举行了一次欢送知识青年上山下乡的欢送大会\n\n二七塔下的大松柏是从远处移栽过来的\n\n郑州的巨大变化，不知不觉都在发生着，\n\n才蓦然发现，她早已不是原来的郑州了。\n\n老百货大楼，小时候去的最多的地方\n\n河南人民剧院永远从我们的视线里消失了\n\n郑州，这座古老而悠久的城市，\n\n她早已完成了新的蜕变，\n\n养育了一代又一代的郑州人，\n\n让我们一起为她而鼓掌，为她而期待。'
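One rough way to quantify the recall concern is to compare `cleaned_text` against a reference copy of the article body, paragraph by paragraph. The sketch below is not part of goose3; `split_paragraphs` and `paragraph_recall` are hypothetical helpers, the matching is a plain verbatim substring check, and the sample strings are placeholders rather than text from the page above.

```python
import re


def split_paragraphs(text):
    """Split on newlines and drop pieces that are empty after stripping."""
    return [p.strip() for p in re.split(r"\n+", text) if p.strip()]


def paragraph_recall(reference_text, cleaned_text):
    """Return (recall, missing), where recall is the share of reference
    paragraphs that appear verbatim in cleaned_text."""
    paragraphs = split_paragraphs(reference_text)
    if not paragraphs:
        return 1.0, []
    missing = [p for p in paragraphs if p not in cleaned_text]
    recall = 1 - len(missing) / len(paragraphs)
    return recall, missing


# Placeholder strings (assumption): in practice, `reference` would be a
# hand-copied article body and `cleaned` would be article.cleaned_text.
reference = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
cleaned = "First paragraph.\n\nThird paragraph."

recall, missing = paragraph_recall(reference, cleaned)
print(round(recall, 2), missing)
```

Verbatim matching is strict: a paragraph that the extractor only trims or re-spaces counts as missing, so the number is a lower bound on recall.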

In [52]:
article.publish_date

## Single article, Chinese (people.com.cn)

In [56]:
from goose3.text import StopWordsChinese
url = 'http://health.people.com.cn/n1/2018/0620/c14739-30068605.html'
g = Goose({'stopwords_class': StopWordsChinese, 'use_meta_language': False, 'target_language':'zh-CN'})
article = g.extract(url=url)

In [57]:
article.title

'6月应当心手足口病高发 家长及幼师要做好防范--人民健康网--人民网'

In [58]:
article.publish_date
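Since `publish_date` comes back empty here, one possible workaround is to read the date from the URL path when the site encodes it there. This is a sketch under that assumption, not goose3 functionality: `date_from_url` and `URL_DATE_PATTERNS` are hypothetical names, and the two patterns only cover the `/YYYY/MM/DD/` and `/YYYY/MMDD/` layouts seen in some of the URLs in this notebook.

```python
import re
from datetime import date

# Assumed URL layouts, tried in order: /YYYY/MM/DD/ and /YYYY/MMDD/
URL_DATE_PATTERNS = (
    re.compile(r"/(\d{4})/(\d{2})/(\d{2})/"),
    re.compile(r"/(\d{4})/(\d{2})(\d{2})/"),
)


def date_from_url(url):
    """Return a date parsed from the URL path, or None if no pattern fits."""
    for pattern in URL_DATE_PATTERNS:
        match = pattern.search(url)
        if match is None:
            continue
        year, month, day = (int(part) for part in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            # The digits matched but are not a real calendar date;
            # fall through to the next pattern.
            continue
    return None


print(date_from_url("http://health.people.com.cn/n1/2018/0620/c14739-30068605.html"))
print(date_from_url("http://blog.sina.com.cn/s/blog_aeb058020102xuwb.html"))
```

The first call prints `2018-06-20` and the second prints `None`, because the Sina blog URL carries no date. The BBC URL used earlier also yields `None`, since its layout matches neither pattern.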

In [59]:
article.meta_description

'“全省监测数据表明，近期湖南手足口病发病人数猛增。”昨日，湖南省疾控中心传染病防治科副科长邓志红主任医师介绍，6月份是手足口病高发季，家长及幼师应帮助孩子有效预防手足口病。\u3000\u3000邓志红表示，手足口病'

In [60]:
article.meta_keywords

'手足口病 家长 柯萨奇病毒 患儿 幼师 复课 发热 学龄前儿童 疑似病例 确诊'

In [62]:
article.authors

[]

In [63]:
article.meta_lang
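The field-by-field checks repeated in the cells above could be condensed into one helper that reports which attributes came back empty. The following is a minimal sketch: `field_coverage` and `FIELDS` are hypothetical names, and the `stub` object is a stand-in with made-up values so the example runs without a network call. With goose3 one would pass the object returned by `g.extract(url=url)` instead.

```python
from types import SimpleNamespace

# Attribute names inspected in this notebook
FIELDS = (
    "title", "meta_description", "meta_keywords", "cleaned_text",
    "authors", "publish_date", "top_image", "meta_lang",
)


def field_coverage(article, fields=FIELDS):
    """Map each field name to True if the article has a non-empty value."""
    return {name: bool(getattr(article, name, None)) for name in fields}


# Stand-in article (assumption): only title and description are populated.
stub = SimpleNamespace(
    title="some title",
    meta_description="some description",
    meta_keywords="",
    cleaned_text="",
    authors=[],
    publish_date=None,
    top_image=None,
    meta_lang=None,
)

coverage = field_coverage(stub)
empty_fields = [name for name, found in coverage.items() if not found]
print(empty_fields)
```

Running the helper over every test URL would turn the summary at the top into a small table of which fields each site yields.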