# News-Stream Example Queries


Solr queries can be run from the Solr query page at

http://hdp-node06.neofonie.de:8983/solr/#/hackathon_shard3_replica2/query .

A Banana dashboard with many prepared charts and preloaded data from the News-Stream system is available at:

https://nstr.neofonie.de/dev/#/dashboard/solr/Hackathon .


This notebook shows some example queries to give an idea of the data in the News-Stream project and how to access it.


First we import what we need from the Python standard library.


In [None]:
from itertools import chain
import urllib.parse  # quote_plus lives in the urllib.parse submodule



## Querying Data from News-Stream



Please supply the user ID and the password for retrieving data from the News-Stream system.

First, some helper functions make the request parameters in the rest of the notebook more readable.


In [None]:
## Load credentials
try :
    from credentials import dpa as auth
except ImportError :
    raise RuntimeError("Credentials must be supplied as dict in credentials.py. See example_credentials.py or use this as a template: dpa=dict(login='user',password='secret')")

select = "https://"+auth['login']+":"+auth['password']+"@nstr.neofonie.de/solr-dev/hackathon/select?"
print("\nUsing as base url for News-Stream:" + select + "\n")

############################################################

default_params = { 'rows': '3', 'wt': 'json', 'indent': 'on'}

def enc_query(params):
    # Encode the given parameters, then append any defaults not overridden.
    q = ''
    for k, v in params.items():
        q += str(k) + "=" + urllib.parse.quote_plus(str(v)) + "&"
    for def_k, def_v in default_params.items():
        if def_k not in params:
            q += str(def_k) + "=" + urllib.parse.quote_plus(str(def_v)) + "&"
    return q

def exec_query(query):
    encoded = enc_query(query)
    print(select + encoded)
    !curl -k "{select + encoded}"
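
As a point of comparison, the same kind of encoding can be sketched with `urllib.parse.urlencode` from the standard library, which applies `quote_plus` to each value by default. The base URL and the `build_url` helper below are hypothetical stand-ins and not part of the News-Stream system; the notebook itself builds its base URL from the credentials.

```python
from urllib.parse import urlencode

# Hypothetical stand-ins for illustration only.
EXAMPLE_BASE = "https://example.invalid/solr/select?"
EXAMPLE_DEFAULTS = {"rows": "3", "wt": "json", "indent": "on"}

def build_url(params, base=EXAMPLE_BASE, defaults=EXAMPLE_DEFAULTS):
    """Merge defaults with the given parameters and URL-encode the result."""
    merged = {**defaults, **params}  # explicit parameters override defaults
    return base + urlencode(merged)

url = build_url({"q": '"Hillary Clinton" AND "Donald Trump"', "rows": "0"})
print(url)
```

Unlike `enc_query`, this variant leaves no trailing `&` at the end of the URL.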



The Banana dashboard can also be opened directly with the credentials embedded in the URL:


In [None]:
print('\nhttps://'+auth['login']+':'+auth['password']+'@nstr.neofonie.de/dev/#/dashboard/solr/Hackathon\n')



## Examples Fetching Data with Search Words



All queries can also be run from the command line via curl.

All available fields are documented in the GitHub repository:

[EnglischHowToHackathon.md](https://github.com/dpa-newslab/tickertools2016/blob/master/neofonie/EnglischHowToHackathon.md)
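
Because the queries request `wt=json`, the output can be post-processed in Python. The following is a minimal sketch, assuming the usual Solr JSON layout with a `response` object holding `numFound` and `docs`. The sample text is made up for illustration, and `summarize` is a hypothetical helper; whether `title` is single-valued in the News-Stream schema is also an assumption.

```python
import json

# Made-up sample in the shape of a Solr JSON response; not real data.
sample_response = """
{
  "responseHeader": {"status": 0, "QTime": 12},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"title": "Example headline one"},
      {"title": "Example headline two"}
    ]
  }
}
"""

def summarize(raw_json):
    """Return the total hit count and the titles of the returned documents."""
    body = json.loads(raw_json)["response"]
    titles = [doc.get("title") for doc in body["docs"]]
    return body["numFound"], titles

num_found, titles = summarize(sample_response)
print(num_found, titles)
```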


#### Searchword: "Hillary Clinton" - All Data


In [None]:
exec_query({'q': 'Hillary Clinton'})



#### Searchword: "Hillary Clinton OR Donald Trump" - All Data


In [None]:
exec_query({'q': 'Hillary Clinton OR Donald Trump'})



#### Searchword: "Hillary Clinton" AND "Donald Trump" - Just title and text


In [None]:
exec_query(
        {
            'q': '"Hillary Clinton" AND "Donald Trump"', 
            'fl': 'title text',
        })



#### Searchword: "Hillary Clinton" AND "Donald Trump" - Titles only, for English-language articles


In [None]:
exec_query(
        {
            'q': '"Hillary Clinton" AND "Donald Trump"',
            'fq': 'language: en AND sourceId:neofonie',
            'fl': 'title',
            'sort': 'publicationDate DESC',
            'rows': '10'
        })



#### Using Meta Information and some semantics of Solr search queries

In the next queries we set the number of results to zero, because we are only interested in the meta information, such as the number of matching documents.

Each of the following three examples returns a different number of results, depending on the semantics of the search query.

* In the first example the query tokens are OR'ed, and we get all results containing any of them.
* In the second example AND binds only its two neighbouring tokens, so Solr interprets the query as "text:hillary +text:clinton +text:donald text:trump": "clinton" and "donald" are required, while "hillary" and "trump" are optional.
* In the third example we search for exact matches of the phrases "Hillary Clinton" AND "Donald Trump".

Most of the time you want the third query, which returns results that mention both politicians.
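
The third form depends on each name being wrapped in double quotes. A small sketch of how such a query string could be assembled is shown here; `phrase_query` is a hypothetical helper and not part of this notebook or of Solr.

```python
def phrase_query(*phrases, operator="AND"):
    """Quote each phrase and join the phrases with a boolean operator."""
    quoted = ['"' + p.replace('"', '\\"') + '"' for p in phrases]
    return (" " + operator + " ").join(quoted)

# Parameters for a count-only query on both names.
count_query = {
    "q": phrase_query("Hillary Clinton", "Donald Trump"),
    "rows": "0",
}
print(count_query["q"])
```

The resulting dictionary has the same shape as the ones passed to `exec_query` in the cells of this section.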


In [None]:
exec_query(
        {
            'q': 'Hillary Clinton Donald Trump', 
            'rows': '0'
        })


In [None]:
exec_query(
        {
            'q': 'Hillary Clinton AND Donald Trump', 
            'rows': '0'
        })


In [None]:
exec_query(
        {
            'q': '"Hillary Clinton" AND "Donald Trump"', 
            'rows': '0'
        })



#### Documents about "Washington" from Neofonie's news crawl not older than 24 hours


The following query returns results for all news articles containing the search term 'Washington'.

The results also include matches such as 'Kamasi Washington' or 'Washington Redskins'.
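
The "not older than 24 hours" restriction is expressed as a Solr range filter on `publicationDate` using date math. A minimal sketch of how such a filter string could be generated follows; `date_range_fq` and `last_hours_fq` are hypothetical helpers, and the field and source names are taken from the queries in this notebook.

```python
def date_range_fq(field, start, end):
    """Build a Solr range clause of the form field:[start TO end]."""
    return "{}:[{} TO {}]".format(field, start, end)

def last_hours_fq(hours, field="publicationDate", source="neofonie"):
    """Filter on one source and on documents from the last `hours` hours."""
    range_clause = date_range_fq(
        field, "NOW/HOUR-{}HOUR".format(hours), "NOW/HOUR+1HOUR")
    return "+sourceId:{} +{}".format(source, range_clause)

fq = last_hours_fq(24)
print(fq)
```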

In [None]:
exec_query(
        {
            'q': 'Washington', 
            'fq': '+sourceId:neofonie +publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR]'
        })



The following search, by contrast, narrows the results to articles containing the entity with the label 'Washington', which may better match the original intention of searching for the American capital in the news.

Please see the next section for more examples using named entities.


In [None]:
exec_query(
        {
            'q': 'entityLabels: Washington', 
            'fq': '+sourceId:neofonie +publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR]'
        })




#### Hourly document count for "Hillary Clinton" from Neofonie's news crawl, not older than 24 hours


In [None]:
exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton"',
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'rows': '0',
            'facet': 'true',
            'facet.range': 'publicationDate',
            'facet.range.start': 'NOW/HOUR-24HOUR',
            'facet.range.end': 'NOW/HOUR+1HOUR',
            'facet.range.gap': '+1HOUR'
        })
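
The hourly counts come back under `facet_counts.facet_ranges`. The sketch below assumes Solr's default JSON layout, in which the counts are a flat list alternating between bucket start and count. The sample values are made up, and `pair_counts` is a hypothetical helper.

```python
def pair_counts(flat):
    """Turn [key1, n1, key2, n2, ...] into [(key1, n1), (key2, n2), ...]."""
    return list(zip(flat[0::2], flat[1::2]))

# Made-up fragment in the shape of a range facet result; not real data.
sample = {
    "facet_counts": {
        "facet_ranges": {
            "publicationDate": {
                "counts": ["2016-06-01T10:00:00Z", 4,
                           "2016-06-01T11:00:00Z", 7],
                "gap": "+1HOUR",
            }
        }
    }
}

ranges = sample["facet_counts"]["facet_ranges"]
hourly = pair_counts(ranges["publicationDate"]["counts"])
busiest = max(hourly, key=lambda item: item[1])  # bucket with the most documents
print(hourly, busiest)
```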




## Examples fetching data based on named entities


#### Fetch Top 5 news with NER annotations for "Hillary Clinton" AND "Donald Trump"

In [None]:
exec_query(
        {
            'q': 'entityLabels: "Hillary Clinton" AND entityLabels: "Donald Trump"', 
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'fl': 'neoUrl title entityLabels',
            'sort': 'publicationDate DESC',
            'rows': '5',
        })


#### Fetch TOP 5 news for "Volkswagen"

In [None]:
exec_query(
        {
            'q': 'entityLabels: Volkswagen', 
            'fl': 'title',
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'sort': 'publicationDate DESC',
            'rows': '5',
        })


#### Fetch TOP 5 news for the last two hours with recognized Organisations

In [None]:
exec_query(
        {
            'q': 'entityTypes: ORGANISATION', 
            'fl': 'neoUrl title entityRfc4180',
            'fq': '+publicationDate:[NOW/HOUR-2HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'sort': 'publicationDate DESC',
            'rows': '5',
        })


#### Fetch TOP 5 news for which CRF recognized persons that are not already known as named entities.

In [None]:
exec_query(
        {
            'q': 'unknownTypes: PERSON', 
            'fl': 'neoUrl title entityRfc4180',
            'fq': '+publicationDate:[NOW/HOUR-2HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'sort': 'publicationDate DESC',
            'rows': '5',
        })




## Examples fetching data with facets



#### Number of documents from the different News-Stream sources


In [None]:
exec_query(
        {
            'q': '*', 
            'fq': '+publicationDate:[NOW/HOUR-30DAY TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'neoPublicationName',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
        })
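
Most field-facet queries in this section repeat the same parameters. One possible way to reduce the repetition is a small helper that returns them as a dictionary; `field_facet_params` is a hypothetical name, and the parameter values simply mirror the ones used in this notebook.

```python
def field_facet_params(field, limit=None, method="enum"):
    """Return the count-only facet parameters for a single field."""
    params = {
        "rows": "0",              # no documents, only facet counts
        "facet": "true",
        "facet.field": field,
        "facet.missing": "true",  # also count documents without the field
        "facet.sort": "count",
        "facet.method": method,
    }
    if limit is not None:
        params["facet.limit"] = str(limit)
    return params

query = {"q": "*", **field_facet_params("neoPublicationName")}
print(query)
```

The resulting dictionary could then be passed to `exec_query` like any of the hand-written ones.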


#### Counts of news per hour containing the search term "Hillary Clinton" in the last 24 hours.

In [None]:

exec_query(
        {
            'q': 'Hillary Clinton', 
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'fl': 'title',
            'rows': '0',
            'facet': 'true',
            'facet.range': 'publicationDate',
            'facet.range.start': 'NOW/HOUR-24HOUR',
            'facet.range.end': 'NOW/HOUR+1HOUR',
            'facet.range.gap': '+1HOUR'
        })


#### Count news grouped by language for the search term "Hillary Clinton" OR "Donald Trump".

In [None]:
exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq':'publicationDate:[NOW/DAY-3DAY TO NOW/DAY+1DAY]',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'language',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'fcs'
        })


#### Counting all occurrences of named entities in news which contain NEs "Hillary Clinton" OR "Donald Trump"

In [None]:
exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq':'publicationDate:[NOW/DAY-3DAY TO NOW/DAY+1DAY]',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'knownSurfaceforms',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })


#### Counting all persons recognized by CRF but not known as named entities, in news which contain NEs "Hillary Clinton" OR "Donald Trump"

In [None]:
exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq':'publicationDate:[NOW/DAY-3DAY TO NOW/DAY+1DAY]',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'unknownPersons',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })



## Examples for selecting dpa data


#### Loading dpa-News from News-Stream

In [None]:
exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
         })


#### Loading dpa-News from News-Stream with dpa specific fields

In [None]:
exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'fl': 'id dpaId publicationDate title mlRessort dpaIndustries',
            'sort': 'publicationDate DESC',
            'rows': '5'
        })


#### Aggregation of dpa news on category 'mlIndustries'

* FIN -> Asset management, financial service providers
* AUT -> Automotive and supplier industry (cars & trucks, spare parts, tires)
* BAN -> Banks
* CON -> Construction
* PER -> Clothing, cosmetics
* MIN -> Mining, raw material extraction (coal, diamonds, gold, platinum, precious metals)
* EQI -> Investment holding companies
* EQN -> Exchange-traded funds (ETF, etc.)
* CHM -> Chemicals, plastics
* CMP -> Computers, hardware, software, semiconductors, components
* ELU -> Electricity suppliers
* ELE -> Electronics, electrical equipment, components
* AEG -> Renewable energy
* HTH -> Healthcare, medical technology, hospital supplies
* BEV -> Beverages (beer, wine, distilleries, soft drinks)
* TRN -> Freight transport, logistics
* HOU -> Household goods, furniture, homes
* PRO -> Real estate
* REF -> Food and pharmaceutical retail
* ASS -> Life insurers
* ENG -> Mechanical engineering, heavy-current engineering, environmental technology
* MET -> Metal processing and extraction, non-ferrous metals
* INL -> Conglomerates, packaging industry
* FOO -> Food (producers, incl. agricultural industry)
* RET -> Non-food retail, consumer service providers
* PAP -> Paper, pulp, wood
* PHA -> Pharmaceuticals, biotechnology
* DEF -> Defence industry, aircraft manufacturers
* INS -> Property insurance and reinsurance
* SOF -> Software, IT consulting, internet, portal operators
* TOB -> Tobacco industry
* TEL -> Telephone companies (fixed line)
* MOB -> Telephone companies (mobile)
* LEI -> Tourism, airlines, rail (passenger transport)
* SVS -> Business service providers
* CSM -> Consumer goods, cosmetics, soap, tradesmen's supplies, furniture, household appliances, consumer electronics
* MED -> Publishers, broadcasting, information services, newspapers, books, advertising
* UTI -> Utilities (gas, water etc.)
* OIL -> Oil, oil exploration, gas
* OES -> Oil plant engineering, pipelines

In [None]:
exec_query(
        {
            'q': 'Siemens',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'mlIndustries',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })


#### Aggregation of dpa news on category 'dpaRessort'

pl = politics ("politik"), wi = business ("wirtschaft"), rs = editorial service ("redaktioneller service"), vm = miscellaneous ("vermischtes"), ku = culture ("kultur"), sp = sports ("sport")
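
The facet result for `dpaRessort` contains only the short codes. A minimal sketch of how they could be turned into readable labels is given below, assuming Solr's default flat JSON list for field facets, with `null` (Python `None`) as the key for documents without a ressort. The English labels are glosses of the codes listed above, the sample counts are made up, and `label_ressort_counts` is a hypothetical helper.

```python
RESSORT_LABELS = {
    "pl": "politics",
    "wi": "business",
    "rs": "editorial service",
    "vm": "miscellaneous",
    "ku": "culture",
    "sp": "sports",
}

def label_ressort_counts(flat, labels=RESSORT_LABELS):
    """Map a flat [code, count, ...] facet list to {label: count}."""
    result = {}
    for code, count in zip(flat[0::2], flat[1::2]):
        if code is None:
            name = "(no ressort)"          # bucket added by facet.missing=true
        else:
            name = labels.get(code, code)  # unknown codes are kept as-is
        result[name] = count
    return result

sample_counts = ["pl", 40, "wi", 5, None, 2]  # made-up values
labelled = label_ressort_counts(sample_counts)
print(labelled)
```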

In [None]:
exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'dpaRessort',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })


#### Aggregation of dpa news on category 'dpaServices'



Abbreviations for dpa services:

* dpasrv:bdt -> Basisdienst (basic news service)
* afxsrv:ADE -> AFX Kompakt
* edi-bid -> part of the Basisdienst
* dpasrv:hfk -> radio service / short news service, a subset of the Basisdienst
* Services with the wap- prefix are variants of the respective regional service (Landesdienst).

The regional services (Landesdienste) are assigned as follows:

* bwg: Baden-Württemberg
* brb: Berlin / Brandenburg
* rhs: Rheinland-Pfalz / Saarland
* bay: Bayern
* hsh: Hamburg / Schleswig-Holstein
* nwf: Nordrhein-Westfalen
* san: Sachsen
* aht: Sachsen-Anhalt
* hes: Hessen
* mbv: Mecklenburg-Vorpommern
* thg: Thüringen
* nsb: Niedersachsen / Bremen


In [None]:
exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'dpaServices',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })


#### Aggregation of dpa news on category 'dpaKeywords'

In [None]:
exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'dpaKeywords',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
