
# Txt Werk - Neofonie Text Analysis Tool with RSS documents


Txt Werk is the text analysis tool from Neofonie GmbH. It annotates German and English text with information about the named entities it recognizes.

For more information, visit the Txt Werk website at http://www.txtwerk.de/.


This notebook demonstrates some simple uses of the Txt Werk API.

No special prerequisites are necessary.


NOTE: The notebook expects a file txt_werk_apikey.py containing a valid Txt Werk API key in the same directory as the notebook.
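A missing key file is easier to diagnose if it is checked for before the client is created. The following is a minimal sketch of such a check; the helper name `find_key_file` is hypothetical, and the sketch makes no assumption about what the file contains beyond its name.

```python
from pathlib import Path


def find_key_file(directory=".", filename="txt_werk_apikey.py"):
    """Return the path of the key file, or raise a helpful error if it is missing."""
    path = Path(directory) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {filename} in {path.parent.resolve()}; "
            "create it and put your Txt Werk API key in it."
        )
    return path
```

Calling `find_key_file()` at the top of the notebook would fail early with a readable message instead of an import error further down.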



## Using TxtWerkClient for requests to the Txt Werk service.



#### Txt Werk API call using the TxtWerkClient.


In [1]:
from txtwerk_client import TxtWerkClient
from IPython.display import display, HTML

text = "Angela Merkel wurde am 17. Juli 1954 in Hamburg als Angela Dorothea Kasner geboren."

txt_werk_client = TxtWerkClient()
txt_werk_response = txt_werk_client.check_text(text)

print("\nEntities from Txt Werk:\n\n" + str(txt_werk_client.format_entities(txt_werk_response['entities']))+ "\n")
annotatedText = txt_werk_client.check_text_html_annotated(text)


display(HTML(annotatedText))



Entities from Txt Werk:

[ PERSON, "Angela Merkel", "Angela Merkel", https://www.wikidata.org/wiki/Q567, [0,13], 47.60983657836914 ]
[ CONCEPT, "17. Juli", "17. Juli", https://www.wikidata.org/wiki/Q2729, [23,31], 39.16166687011719 ]
[ PLACE, "Hamburg", "Hamburg", https://www.wikidata.org/wiki/Q1055, [40,47], 39.6832389831543 ]
[ PERSON, "None", "Angela Dorothea Kasner", None, [52,74], 75.0 ]
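Each entity line carries a pair of numbers before the final score. Judging from the printed values, these appear to be half-open character offsets into the input text; this is an inference from the output above, not documented behaviour. The sketch below copies the offsets by hand and slices the input text with them.

```python
text = "Angela Merkel wurde am 17. Juli 1954 in Hamburg als Angela Dorothea Kasner geboren."

# (type, start, end) copied by hand from the printed entity list.
spans = [
    ("PERSON", 0, 13),
    ("CONCEPT", 23, 31),
    ("PLACE", 40, 47),
    ("PERSON", 52, 74),
]

# Slice the original text with each offset pair to recover the surface form.
surfaces = [(entity_type, text[start:end]) for entity_type, start, end in spans]
for entity_type, surface in surfaces:
    print(f"{entity_type:8} {surface}")
```

Each slice reproduces the surface form shown in the corresponding entity line, which supports the half-open reading of the offsets.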




# Package RSS


A simple Python script that loads RSS feeds in order to generate data from German newspapers.


## Prerequisites

Download Anaconda for Python 3, for example one of:

- Anaconda3-4.1.1-Linux-x86_64.sh
- Anaconda3-4.1.1-Linux-x86.sh

Copy the files to a destination directory of your choice.

Install the required Python 3 packages:

- pip install feedparser
- pip install boilerpipe3
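To confirm that both packages are visible to the notebook kernel, a quick import check can help. This is a small sketch using only the standard library; it assumes that boilerpipe3 is imported under the module name `boilerpipe`, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: the boilerpipe3 package installs a module called "boilerpipe".
required = ["feedparser", "boilerpipe"]
print("Missing modules:", missing_modules(required) or "none")
```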



In [2]:
import json
from rss import RSS


#### Collecting news data from RSS feeds.

* The RSS class fetches the current feed of an RSS source and stores both the HTML files and the boilerpipe-extracted content in files.

* The RSS loader creates a temporary directory inside the system's temp directory and saves all files there.

* On every call to the update function, the loader retrieves the current RSS feed and checks whether there are new documents to fetch. If so, it fetches the new pages and stores them in that directory.

* The loader also extracts the plain text from the HTML files using the boilerpipe3 classes and stores the extracts in the temporary directory.

* A call to getExtracts() returns all extracted contents, or only a given number of them, for analysis with Txt Werk.
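The behaviour described in the list above can be sketched roughly as follows. This is an illustrative stand-in, not the actual RSS class: the class name `FeedStoreSketch`, its method signatures and the hashed file names are all assumptions. Downloading and text extraction are passed in as callables so that the sketch runs without network access; in practice the entries might come from `feedparser.parse(url).entries` and the extraction from boilerpipe3.

```python
import hashlib
import tempfile
from pathlib import Path


class FeedStoreSketch:
    """Illustrative stand-in for the RSS class; not its actual implementation."""

    def __init__(self, name):
        # One temporary directory per source, created in the system temp directory.
        self.directory = Path(tempfile.mkdtemp(prefix=f"rss_{name}_"))

    def _path_for(self, link, suffix):
        # Hash the link so that any URL maps to a safe, stable file name.
        digest = hashlib.sha1(link.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}{suffix}"

    def update(self, entries, fetch_html, extract_text):
        """Store pages not seen before; return the links that were newly fetched."""
        new_links = []
        for entry in entries:
            html_path = self._path_for(entry["link"], ".html")
            if html_path.exists():
                continue  # already stored by an earlier update
            page = fetch_html(entry["link"])
            html_path.write_text(page, encoding="utf-8")
            text_path = self._path_for(entry["link"], ".txt")
            text_path.write_text(extract_text(page), encoding="utf-8")
            new_links.append(entry["link"])
        return new_links

    def get_extracts(self, limit=None):
        """Return the stored plain-text extracts, optionally capped at `limit`."""
        paths = sorted(self.directory.glob("*.txt"))
        if limit is not None:
            paths = paths[:limit]
        return [path.read_text(encoding="utf-8") for path in paths]
```

A second call to `update` with the same entries fetches nothing, which corresponds to skipping posts that are already stored.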



#### Creating the RSS feeds to be used in subsequent calls.

We create several sources to have different data to work with.

* presseportal - the dpa frontend.
* spiegelTop - the Spiegel top news.
* spiegelEil - the Spiegel breaking news.
* spiegel - all Spiegel news.


In [3]:
print("###################################        Presseportal         ###########################################")
presseportal = RSS("presseportal", "http://www.presseportal.de/rss/presseportal.rss2")
presseportal.update()

print("###################################    Spiegel Top-Meldungen    ###########################################")
spiegelTop = RSS("spiegelTop", "http://www.spiegel.de/schlagzeilen/tops/index.rss")
spiegelTop.update()

print("####################################    Spiegel Eilmeldungen    ###########################################")
spiegelEil = RSS("spiegelEil", "http://www.spiegel.de/schlagzeilen/eilmeldungen/index.rss")
spiegelEil.update()

print("####################################    Alle Spiegel-Meldungen   ##########################################")
spiegel = RSS("spiegel", "http://www.spiegel.de/schlagzeilen/index.rss")
spiegel.update()


###################################        Presseportal         ###########################################
Fetching http://www.presseportal.de/rss/presseportal.rss2

New Post: Studie: Pidbull wirkt effizient gegen potenzialinduzierte Degradation (PID)
	http://www.presseportal.de/pm/120958/3480108

Skipping Post: SWR Fernsehen Programmhinweise und -änderungen von Freitag, 11.11.16 (Woche 45) bis Montag, 19.12.16 (Woche 51)
	http://www.presseportal.de/pm/7169/3480084

Skipping Post: Neue Action-Serie mit Martial-Arts-Elementen bei RTL II: "Into The Badlands" (FOTO)
	http://www.presseportal.de/pm/6605/3480083

Skipping Post: Puma Energy übernimmt BP-Terminal in Nordirland
	http://www.presseportal.de/pm/116390/3480068

Skipping Post: SKODA wächst im Oktober um 10,6 Prozent (FOTO)
	http://www.presseportal.de/pm/28249/3480062

Skipping Post: Neues Medikament bei fortgeschrittenem, metastasiertem Brustkrebs /
Pfizer erhält EU-Zulassung für Brustkrebsmedikament Ibrance®
	http://www.presseport
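The four sources follow the same create-then-update pattern, so they could also be declared as data and set up in a loop. The sketch below is one possible arrangement; the names `FEEDS` and `create_sources` are hypothetical, and the source class is passed in as a parameter so the helper does not depend on the `rss` module directly.

```python
FEEDS = {
    "presseportal": "http://www.presseportal.de/rss/presseportal.rss2",
    "spiegelTop": "http://www.spiegel.de/schlagzeilen/tops/index.rss",
    "spiegelEil": "http://www.spiegel.de/schlagzeilen/eilmeldungen/index.rss",
    "spiegel": "http://www.spiegel.de/schlagzeilen/index.rss",
}


def create_sources(feeds, source_factory):
    """Create one source object per feed and update each of them once."""
    sources = {}
    for name, url in feeds.items():
        source = source_factory(name, url)
        source.update()
        sources[name] = source
    return sources
```

In this notebook the call would be `sources = create_sources(FEEDS, RSS)`, after which `sources["presseportal"]` takes the place of the `presseportal` variable.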


#### Sending the extracted contents of the RSS feeds to Txt Werk.



###### Analyzing Presseportal news.
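The cell in this section prints each response with `json.dumps`, which by default escapes non-ASCII characters, so German umlauts show up as `\u00fc` and similar sequences. Passing `ensure_ascii=False` keeps them readable. The sketch below uses a shortened, hand-written response dictionary purely for illustration.

```python
import json

# Hand-written stand-in for a response; only the "text" key is shown.
response = {"text": "Puma Energy übernimmt BP-Terminal in Nordirland"}

escaped = json.dumps(response, indent=4)
readable = json.dumps(response, indent=4, ensure_ascii=False)

print(escaped)   # umlauts appear as \u00fc escape sequences
print(readable)  # umlauts appear as written
```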


In [4]:
## Load 10 extracts from the temporary directory.
extracts = presseportal.getExtracts(10)

## Iterate over the news articles and print the raw Txt Werk response for each.
if extracts is not None:
    for extract in extracts:
        print("---------------------------        Annotated Text        ----------------------------------------")
        txt_werk_response = txt_werk_client.check_text(extract)

        print("\nResponse from Txt Werk:\n\n" + json.dumps(txt_werk_response, indent=4) + "\n")

Anzahl an Extrakten: 24
---------------------------        Annotated Text        ----------------------------------------

Response from Txt Werk:

{
    "text": "Puma Energy \u00fcbernimmt BP-Terminal in Nordirland\n10.11.2016 \u2013 13:16\nSingapur (ots/PRNewswire) - Das Unternehmen verf\u00fcgt nun \u00fcber das 100. Brennstoffterminal in seinem globalen Speichernetzwerk\nPuma Energy, das global integrierte Energieunternehmen im Midstream- und Downstream-Sektor, gab heute die Vertragsunterzeichnung f\u00fcr den Kauf des BP Massenspeicher-Terminals in Belfast, Nordirland, bekannt. Mit dem neuen Terminal verf\u00fcgt Puma Energy nun \u00fcber 100 Massenspeicher-Terminals und somit \u00fcber ein Gesamtspeichervolumen von 7,9 Millionen m3. Dieser Kauf ist der n\u00e4chste nach der \u00dcbernahme des 1,4 Millionen m3 fassenden Milford Heaven Terminals im Jahr 2015. Das Wachstum von Puma Energy auf dem europ\u00e4ischen Markt sowie die Versorgung mit qualitativ hochwertigem Brennstoff in 


###### Analyzing and annotating Presseportal news.


In [5]:
## Load extracts from temporary file system.
extracts = presseportal.getExtracts(10)

## Iterate over the news articles, generating annotations for the text.
if extracts is not None:
    for extract in extracts:
        print("---------------------------        Annotated Text        ----------------------------------------")
        annotated_text = txt_werk_client.check_text_html_annotated(extract)
        
        display(HTML(annotated_text))
        

Anzahl an Extrakten: 24
---------------------------        Annotated Text        ----------------------------------------
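How `check_text_html_annotated` builds its HTML is not shown in this notebook. As a rough idea of what such an annotation step can look like, the sketch below wraps character spans in `<mark>` elements. The function name `annotate_html`, the `(start, end, label)` span format and the choice of `<mark>` are all assumptions made for illustration.

```python
import html


def annotate_html(text, spans):
    """Wrap each (start, end, label) span of `text` in a <mark> element.

    Spans are assumed to be non-overlapping, half-open character ranges.
    """
    parts = []
    position = 0
    for start, end, label in sorted(spans):
        # Escape the plain text between the previous span and this one.
        parts.append(html.escape(text[position:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">{html.escape(text[start:end])}</mark>'
        )
        position = end
    # Escape whatever follows the last span.
    parts.append(html.escape(text[position:]))
    return "".join(parts)
```

The returned string can be passed to `display(HTML(...))` in the same way as the client's own output.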



###### Analyzing 10 Spiegel news articles.


In [6]:
## Load 10 extracts from temporary file system.
extracts = spiegel.getExtracts(10)

## Iterate over the news articles, generating annotations for the text.
if extracts is not None:
    for extract in extracts:
        print("---------------------------        Annotated Text        ----------------------------------------")
        annotatedText = txt_werk_client.check_text_html_annotated(extract)
        display(HTML(annotatedText))


Anzahl an Extrakten: 34
---------------------------        Annotated Text        ----------------------------------------
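The loops in this notebook stop at the first request that raises an exception. When many extracts are processed, it can be more convenient to collect failures and carry on. The helper below is a sketch of that idea; the name `analyze_all` is hypothetical, and because the exception types raised by the client are not known here, it catches `Exception` broadly.

```python
def analyze_all(extracts, analyze):
    """Apply `analyze` to every extract; collect results and failures separately."""
    results, failures = [], []
    for index, extract in enumerate(extracts or []):
        try:
            results.append(analyze(extract))
        except Exception as error:  # broad on purpose: the client's exception types are unknown
            failures.append((index, repr(error)))
    return results, failures
```

With the objects from this notebook, a call might look like `results, failures = analyze_all(spiegel.getExtracts(10), txt_werk_client.check_text)`.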
