# News-Stream Example Queries


Solr queries can be run from the Solr search page at

http://hdp-node06.neofonie.de:8983/solr/#/hackathon_shard3_replica2/query .

A Banana dashboard with many prepared charts and preloaded data from the News-Stream system is available at:

https://nstr.neofonie.de/dev/#/dashboard/solr/Hackathon .


In this notebook we show some example queries to give you an idea of the data in the News-Stream project and easy access to it.


First we import what we need from Python.




## Querying Data from News-Stream



Please fill in the user ID and the password for retrieving data from the News-Stream system.

The next cell first defines some helper functions that make the request parameters in the rest of the notebook more readable.


In [None]:
import json
from newsstream_client import NewsStreamClient

newsstreamClient = NewsStreamClient()

# Shorthand for the client's query method
exec_query = newsstreamClient.exec_query

def dump(jsonData):
    print(json.dumps(jsonData, indent=4))



The following cell prints the URL of the Banana dashboard with your credentials embedded:


In [None]:
print('\nhttps://'+newsstreamClient.auth['login']+':'+newsstreamClient.auth['password']+'@nstr.neofonie.de/dev/#/dashboard/solr/Hackathon\n')

#### Importing NVD3 for graphical output

* pip install python-nvd3



In [None]:
import datetime
import time
import random
from IPython import display as d
import nvd3
nvd3.ipynb.initialize_javascript(use_remote=True)
#help(nvd3.ipynb.initialize_javascript)




## Examples Fetching Data with Search Words



All queries can also be run from the command line with curl.
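As a rough sketch of what such a request could look like, the snippet below URL-encodes a parameter dictionary and prints a matching curl command. The endpoint is an assumption derived from the admin page address at the top of this notebook (the standard Solr `/select` handler of the same core); `build_query_url`, `build_curl_command`, the `wt=json` parameter and the `USER`/`PASSWORD` placeholders are illustrative and may differ from what `NewsStreamClient` actually does.

```python
from urllib.parse import urlencode

# Assumed endpoint: the standard Solr /select handler of the core named above.
SOLR_SELECT_URL = "http://hdp-node06.neofonie.de:8983/solr/hackathon_shard3_replica2/select"


def build_query_url(params, base_url=SOLR_SELECT_URL):
    """Return the GET URL for a dictionary of Solr parameters."""
    query = dict(params)
    query.setdefault("wt", "json")  # ask Solr for a JSON response
    # Sort the parameters so the generated URL is deterministic.
    return base_url + "?" + urlencode(sorted(query.items()))


def build_curl_command(params, user="USER", password="PASSWORD"):
    """Return a curl command line; USER and PASSWORD are placeholders."""
    return "curl -u {}:{} '{}'".format(user, password, build_query_url(params))


print(build_curl_command({"q": '"Hillary Clinton"', "rows": "0"}))
```

Single-quoting the URL in the shell keeps characters such as `&` and `[` from being interpreted before curl sees them.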

All available fields are documented in the GitHub repository:

[EnglischHowToHackathon](https://github.com/dpa-newslab/tickertools2016/blob/master/neofonie/EnglischHowToHackathon.md)


#### Searchword: "Hillary Clinton" - All Data


In [None]:
dump(exec_query({'q': 'Hillary Clinton'}))



#### Searchword: "Hillary Clinton OR Donald Trump" - All Data


In [None]:
dump(exec_query({'q': 'Hillary Clinton OR Donald Trump'}))



#### Searchword: "Hillary Clinton" AND "Donald Trump" - Just title and text


In [None]:
dump(exec_query(
        {
            'q': '"Hillary Clinton" AND "Donald Trump"', 
            'fl': 'title,text',
        }))



#### Searchword: "Hillary Clinton" AND "Donald Trump" - Titles only, for English-language articles


In [None]:
dump(exec_query(
        {
            'q': '"Hillary Clinton" AND "Donald Trump"',
            'fq': 'language: en AND sourceId:neofonie',
            'fl': 'title',
            'sort': 'publicationDate DESC',
            'rows': '10'
        }))



#### Using Meta Information and some semantics of Solr search queries

In the next queries we set the number of returned rows to zero, because we are only interested in the meta information, such as the number of hits.

Each of the following three examples returns a different number of results, depending on the semantics of the search query.

* In the first example the query terms are OR'ed, so we get all documents containing any of the query tokens.
* In the second example Solr parses the query as "text:hillary +text:clinton +text:donald text:trump", so only the two terms next to AND are required.
* In the third example we search for exact matches of both phrases, "Hillary Clinton" AND "Donald Trump".

Most of the time you want the third query for results which match both politicians.
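To avoid forgetting the quotes, the phrase form of a query can be generated instead of typed. The helper below is a minimal sketch; `phrase_query` is a hypothetical name and not part of `newsstream_client`.

```python
def phrase_query(phrases, operator="AND", field=None):
    """Quote each phrase and join the phrases with a boolean operator."""
    clauses = []
    for phrase in phrases:
        # Escape embedded double quotes so the phrase stays one unit.
        quoted = '"{}"'.format(phrase.replace('"', '\\"'))
        clauses.append("{}:{}".format(field, quoted) if field else quoted)
    return " {} ".format(operator).join(clauses)


print(phrase_query(["Hillary Clinton", "Donald Trump"]))
print(phrase_query(["Hillary Clinton", "Donald Trump"],
                   operator="OR", field="entityLabels"))
```

The result can be passed directly as the `q` parameter, for example `exec_query({'q': phrase_query([...]), 'rows': '0'})`.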


In [None]:
dump(exec_query(
        {
            'q': 'Hillary Clinton Donald Trump', 
            'rows': '0'
        }))


In [None]:
dump(exec_query(
        {
            'q': 'Hillary Clinton AND Donald Trump', 
            'rows': '0'
        }))


In [None]:
dump(exec_query(
        {
            'q': '"Hillary Clinton" AND "Donald Trump"', 
            'rows': '0'
        }))



#### Documents about "Washington" from Neofonie's news crawl not older than 24 hours


The following query returns all news articles containing the search term 'Washington'.

The results include terms such as 'Kamasi Washington' and 'Washington Redskins'.

In [None]:
dump(exec_query(
        {
            'q': 'Washington', 
            'fq': '+sourceId:neofonie +publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR]'
        }))



The following search, in contrast, narrows the results down to articles containing the entity labeled 'Washington', which may better match the original intention of searching for the American capital in the news.

See the next chapter for more examples using named entities.
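The time window in these filter queries is written in Solr date math: `NOW/HOUR` rounds the current time down to the full hour, and `-24HOUR` or `+1HOUR` shift it. Since the same pattern recurs throughout the notebook, it could be generated by a small helper. This is only a sketch; `recent_filter` is a hypothetical name, and the field names are taken from the queries in this notebook.

```python
def recent_filter(hours, field="publicationDate", source_id=None):
    """Build an fq string restricting `field` to the last `hours` hours."""
    # The range [start TO end] is inclusive on both ends.
    clause = "+{}:[NOW/HOUR-{}HOUR TO NOW/HOUR+1HOUR]".format(field, int(hours))
    if source_id:
        clause += " +sourceId:{}".format(source_id)
    return clause


print(recent_filter(24, source_id="neofonie"))
print(recent_filter(2))
```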


In [None]:
dump(exec_query(
        {
            'q': 'entityLabels: Washington', 
            'fq': '+sourceId:neofonie +publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR]'
        }))




#### Hourly document counts for "Hillary Clinton" from Neofonie's news crawl, not older than 24 hours


In [None]:
timeCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton"', 
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'rows': '0',
            'facet': 'true',
            'facet.range': 'publicationDate',
            'facet.range.start': 'NOW/HOUR-24HOUR',
            'facet.range.end': 'NOW/HOUR+1HOUR',
            'facet.range.gap': '+1HOUR'
        })

dump(timeCounts)
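The `counts` entry of a range facet is a flat list that alternates bucket start and document count. The sketch below pairs the two up and picks the busiest hour. The sample list is hand-written to mimic that shape and is not real News-Stream output; `pair_facet_counts` is a hypothetical helper name.

```python
def pair_facet_counts(flat_counts):
    """Turn [label, count, label, count, ...] into [(label, count), ...]."""
    return list(zip(flat_counts[::2], flat_counts[1::2]))


# Made-up sample in the shape of
# timeCounts['facet_counts']['facet_ranges']['publicationDate']['counts']
sample = [
    "2016-06-01T10:00:00Z", 3,
    "2016-06-01T11:00:00Z", 0,
    "2016-06-01T12:00:00Z", 7,
]

pairs = pair_facet_counts(sample)
busiest = max(pairs, key=lambda pair: pair[1])
print(pairs)
print("Busiest hour:", busiest)
```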



##### Area chart of the hourly distribution for the selected news.


In [None]:
from nvd3 import stackedAreaChart

# Solr returns the range facet as a flat list: [timestamp, count, timestamp, count, ...]
timeCountList = timeCounts['facet_counts']['facet_ranges']['publicationDate']['counts']

# Parse the timestamps (even positions) and convert them to epoch milliseconds for NVD3.
# The loop variable is called ts so that it cannot clobber the IPython display alias d.
timeTuple = [datetime.datetime.strptime(str(ts), "%Y-%m-%dT%H:%M:%SZ") for ts in timeCountList[::2]]
timeTuple = [time.mktime(ts.timetuple()) for ts in timeTuple]
timeTuple = [int(t1) * 1000 for t1 in timeTuple]

name = "News for \"Hillary Clinton\" per hour for the last day"
hourlyDocsChart = nvd3.stackedAreaChart(name=name,height=450,width=500, use_interactive_guideline=False, x_is_date=True)
hourlyDocsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")
hourlyDocsChart.add_serie(name="Hourly documents", y=timeCountList[1::2], x=timeTuple)
hourlyDocsChart




## Examples fetching data based on named entities


#### Fetch Top 5 news with NER annotations for "Hillary Clinton" AND "Donald Trump"

In [None]:
dump(exec_query(
        {
            'q': 'entityLabels: "Hillary Clinton" AND entityLabels: "Donald Trump"', 
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'fl': 'neoUrl,title,entityLabels',
            'sort': 'publicationDate DESC',
            'rows': '5',
        }))


#### Fetch TOP 5 news for "Volkswagen"

In [None]:
dump(exec_query(
        {
            'q': 'entityLabels: Volkswagen', 
            'fl': 'title',
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'sort': 'publicationDate DESC',
            'rows': '5',
        }))


#### Fetch TOP 5 news for the last two hours with recognized Organisations

In [None]:
dump(exec_query(
        {
            'q': 'entityTypes: ORGANISATION', 
            'fl': 'neoUrl title entityRfc4180',
            'fq': '+publicationDate:[NOW/HOUR-2HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'sort': 'publicationDate DESC',
            'rows': '5',
        }))


#### Fetch TOP 5 news for which CRF recognized persons that are not already known as named entities.

In [None]:
dump(exec_query(
        {
            'q': 'unknownTypes: PERSON', 
            'fl': 'neoUrl title entityRfc4180',
            'fq': '+publicationDate:[NOW/HOUR-2HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'sort': 'publicationDate DESC',
            'rows': '5',
        }))




## Examples fetching data with facets



#### Number of documents from the different News-Stream sources


In [None]:
sourceDistributionCounts = exec_query(
        {
            'q': '*', 
            'fq': '+publicationDate:[NOW/HOUR-30DAY TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'neoPublicationName',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
        })
dump(sourceDistributionCounts)
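With `facet.missing` set to `true`, the flat list of a field facet ends with one extra pair whose label is `None`: the number of documents that have no value in the field. This is why the chart cells in this notebook drop the last two list elements. The sketch below separates that bucket explicitly instead of slicing; the sample data is made up and `split_missing` is a hypothetical helper name.

```python
def split_missing(flat_counts):
    """Separate the facet.missing bucket (label None) from the named buckets."""
    pairs = list(zip(flat_counts[::2], flat_counts[1::2]))
    named = [(label, count) for label, count in pairs if label is not None]
    missing = sum(count for label, count in pairs if label is None)
    return named, missing


# Made-up sample in the shape of
# sourceDistributionCounts['facet_counts']['facet_fields']['neoPublicationName']
sample = ["Source A", 120, "Source B", 45, None, 8]

named, missing = split_missing(sample)
print(named)
print("Documents without a value:", missing)
```

Unlike the `[0:-2]` slice, this also works when `facet.missing` is switched off, because nothing is removed if no `None` label is present.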



##### Bar chart for the news distribution in News-Stream.


In [None]:
from nvd3 import discreteBarChart

sourceDistributionCountList = sourceDistributionCounts['facet_counts']['facet_fields']['neoPublicationName']
sourceDistributionCountList = sourceDistributionCountList[0:-2]
print("\n" + str(sourceDistributionCountList) + "\n")

name = 'News distribution in sources of News-Stream in the last 30 days'
sourceDistributionCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=900)
sourceDistributionCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")
sourceDistributionCountsChart.add_serie(y=list(reversed(sourceDistributionCountList[1::2])), x=list(reversed(sourceDistributionCountList[::2])))
sourceDistributionCountsChart


#### Counts of news per hour containing the search term "Hillary Clinton" in the last 24 hours.

In [None]:
dump(exec_query(
        {
            'q': 'Hillary Clinton', 
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'fl': 'title',
            'rows': '0',
            'facet': 'true',
            'facet.range': 'publicationDate',
            'facet.range.start': 'NOW/HOUR-24HOUR',
            'facet.range.end': 'NOW/HOUR+1HOUR',
            'facet.range.gap': '+1HOUR'
        }))


#### Count news grouped by language for the search term "Hillary Clinton" OR "Donald Trump".

In [None]:
languageCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'language',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'fcs'
        })
dump(languageCounts)



##### Pie chart for the language distribution in selected news.


In [None]:
from nvd3 import pieChart

languageCountList = languageCounts['facet_counts']['facet_fields']['language']
languageCountList = languageCountList[0:-2]
print("\n" + str(languageCountList) + "\n")

name = 'Language distribution (entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump")'
languageDistChart = nvd3.pieChart(name=name, color_category='category20c', height=450, width=450)
languageDistChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

# Add the series; the tooltip suffix labels the values as document counts
extra_serie = {"tooltip": {"y_start": "", "y_end": " documents"}}
languageDistChart.add_serie(y=languageCountList[1::2], x=languageCountList[::2], extra=extra_serie)
languageDistChart

#### Counting all occurrences of named entities in news which contain NEs "Hillary Clinton" OR "Donald Trump"

In [None]:
surfaceCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq':'publicationDate:[NOW/DAY-7DAY TO NOW/DAY+1DAY]',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'knownSurfaceforms',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(surfaceCounts)



##### Bar chart for the top ranking surface forms in the selected news.


In [None]:
from nvd3 import discreteBarChart

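surfaceCountList = surfaceCounts['facet_counts']['facet_fields']['knownSurfaceforms']
# Drop the trailing facet.missing bucket (label None), as in the other chart cells
surfaceCountList = surfaceCountList[0:-2]
print("\n" + str(surfaceCountList) + "\n")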

name = 'Top ranking surfaces in selected news'
surfaceCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
surfaceCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")
surfaceCountsChart.add_serie(y=list(reversed(surfaceCountList[1::2])), x=list(reversed(surfaceCountList[::2])))
surfaceCountsChart


#### Counting all persons recognized by CRF in news which contain NEs "Hillary Clinton" OR "Donald Trump"

In [None]:
crfSurfaceCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq':'publicationDate:[NOW/DAY-3DAY TO NOW/DAY+1DAY]',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'unknownPersons',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(crfSurfaceCounts)



##### Bar chart for the top ranking unknown surface forms in the selected news, which were generated with CRF.


In [None]:
from nvd3 import discreteBarChart

crfSurfaceCountList = crfSurfaceCounts['facet_counts']['facet_fields']['unknownPersons']
crfSurfaceCountList = crfSurfaceCountList[0:-2]
print("\n" + str(crfSurfaceCountList) + "\n")

name = 'Top ranking unknown surfaces (CRF) in selected news'
crfSurfaceCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
crfSurfaceCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

crfSurfaceCountsChart.add_serie(y=list(reversed(crfSurfaceCountList[1::2])), x=list(reversed(crfSurfaceCountList[::2])))
crfSurfaceCountsChart



## Examples for selecting dpa data


#### Loading dpa-News from News-Stream

In [None]:
dump(exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
         }))


#### Loading dpa-News from News-Stream with dpa specific fields

In [None]:
dump(exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'fl': 'id dpaId publicationDate title mlRessort dpaIndustries',
            'sort': 'publicationDate DESC',
            'rows': '5'
        }))


#### Aggregation of dpa news on category 'mlIndustries'

Industry codes:

* FIN -> Asset management, financial service providers
* AUT -> Automotive and supplier industry (cars & trucks, spare parts, tires)
* BAN -> Banks
* CON -> Construction
* PER -> Clothing, cosmetics
* MIN -> Mining, raw material extraction (coal, diamonds, gold, platinum, precious metals)
* EQI -> Investment holding companies
* EQN -> Exchange-traded funds (ETF, etc.)
* CHM -> Chemicals, plastics
* CMP -> Computers, hardware, software, semiconductors, components
* ELU -> Electricity suppliers
* ELE -> Electronics, electrical engineering, components
* AEG -> Renewable energy
* HTH -> Healthcare, medical technology, hospital supplies
* BEV -> Beverages (beer, wine, distilleries, soft drinks)
* TRN -> Freight transport, logistics
* HOU -> Household goods, furniture, homes
* PRO -> Real estate
* REF -> Food and pharmaceutical retail
* ASS -> Life insurers
* ENG -> Mechanical engineering, power engineering, environmental technology
* MET -> Metal processing and extraction, non-ferrous metals
* INL -> Conglomerates, packaging industry
* FOO -> Food (producers, incl. agricultural industry)
* RET -> Non-food retail, consumer service providers
* PAP -> Paper, cellulose, wood
* PHA -> Pharmaceuticals, biotechnology
* DEF -> Defense industry, aircraft manufacturers
* INS -> Property insurance and reinsurance
* SOF -> Software, IT consulting, internet, portal operators
* TOB -> Tobacco industry
* TEL -> Telephone companies (fixed line)
* MOB -> Telephone companies (mobile)
* LEI -> Tourism, airlines, rail (passenger transport)
* SVS -> Business service providers
* CSM -> Consumer goods, cosmetics, soap, craft supplies, furniture, household appliances, consumer electronics
* MED -> Publishers, broadcasting, information services, newspapers, books, advertising
* UTI -> Utilities (gas, water, etc.)
* OIL -> Oil, oil exploration, gas
* OES -> Oil plant engineering, pipelines

In [None]:
industryCategoryCounts = exec_query(
        {
            'q': 'Siemens',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'mlIndustries',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(industryCategoryCounts)



Bar chart for the top ranking industry categories for "Siemens".


In [None]:
from nvd3 import discreteBarChart

industryCategoryCountList = industryCategoryCounts['facet_counts']['facet_fields']['mlIndustries']
industryCategoryCountList = industryCategoryCountList[0:-2]
print("\n" + str(industryCategoryCountList) + "\n")

name = 'Top ranking industry categories for "Siemens"'
industryCategoryCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
industryCategoryCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

industryCategoryCountsChart.add_serie(y=list(reversed(industryCategoryCountList[1::2])), x=list(reversed(industryCategoryCountList[::2])))
industryCategoryCountsChart


#### Aggregation of dpa news on category 'dpaRessort'

Ressort codes: pl="politik" (politics), wi="wirtschaft" (business), rs="redaktioneller service" (editorial service), vm="vermischtes" (miscellaneous), ku="kultur" (culture), sp="sport" (sports)
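For charts it can be handy to show the full ressort names instead of the two-letter codes. The sketch below applies the mapping listed above to a flat facet list. `RESSORT_NAMES` and `label_ressorts` are hypothetical names, and the sample counts are made up.

```python
# Mapping of dpaRessort codes to names, as listed above.
RESSORT_NAMES = {
    "pl": "politik",
    "wi": "wirtschaft",
    "rs": "redaktioneller service",
    "vm": "vermischtes",
    "ku": "kultur",
    "sp": "sport",
}


def label_ressorts(flat_counts, names=RESSORT_NAMES):
    """Split a flat facet list into readable labels and their counts."""
    # Unknown codes are kept unchanged rather than dropped.
    labels = [names.get(code, code) for code in flat_counts[::2]]
    counts = list(flat_counts[1::2])
    return labels, counts


# Made-up sample in the shape of a dpaRessort field facet
sample = ["pl", 12, "wi", 3, "xx", 1]
labels, counts = label_ressorts(sample)
print(labels)
print(counts)
```

The two lists can be passed as `x` and `y` to `add_serie` in place of the raw slices.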

In [None]:
dpaRessortCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'dpaRessort',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(dpaRessortCounts)



Pie chart of the ressorts for the news matching the query entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump".

* pl="politik" (politics), wi="wirtschaft" (business), rs="redaktioneller service" (editorial service), vm="vermischtes" (miscellaneous), ku="kultur" (culture), sp="sport" (sports)


In [None]:
from nvd3 import pieChart

dpaRessortCountList = dpaRessortCounts['facet_counts']['facet_fields']['dpaRessort']
dpaRessortCountList = dpaRessortCountList[0:-2]
print("\n" + str(dpaRessortCountList) + "\n")

name = 'Distribution of the ressorts for selected news (entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump")'
dpaRessortsChart = nvd3.pieChart(name=name, color_category='category20c', height=450, width=450)
dpaRessortsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

# Add the series; the tooltip suffix labels the values as document counts
extra_serie = {"tooltip": {"y_start": "", "y_end": " documents"}}
dpaRessortsChart.add_serie(y=dpaRessortCountList[1::2], x=dpaRessortCountList[::2], extra=extra_serie)
dpaRessortsChart

#### Aggregation of dpa news on category 'dpaServices'


Abbreviations for dpa services:

* dpasrv:bdt -> Basisdienst (basic service)
* afxsrv:ADE -> AFX Kompakt
* edi-bid -> part of the Basisdienst
* dpasrv:hfk -> radio service / short news service, a subset of the Basisdienst
* Codes with the wap- prefix are variants of the respective regional service (Landesdienst).

The regional services (Landesdienste) are assigned as follows:

* bwg: Baden-Württemberg
* brb: Berlin / Brandenburg
* rhs: Rheinland-Pfalz / Saarland
* bay: Bayern
* hsh: Hamburg / Schleswig-Holstein
* nwf: Nordrhein-Westfalen
* san: Sachsen
* aht: Sachsen-Anhalt
* hes: Hessen
* mbv: Mecklenburg-Vorpommern
* thg: Thüringen
* nsb: Niedersachsen / Bremen


In [None]:
dpaServicesCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'dpaServices',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(dpaServicesCounts)



Bar chart of the dpa services with news matching the query entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump".


In [None]:
from nvd3 import discreteBarChart

dpaServicesCountList = dpaServicesCounts['facet_counts']['facet_fields']['dpaServices']
dpaServicesCountList = dpaServicesCountList[0:-2]
print("\n" + str(dpaServicesCountList) + "\n")

name = 'Top ranking dpa services in selected news'
dpaServicesCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
dpaServicesCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

dpaServicesCountsChart.add_serie(y=list(reversed(dpaServicesCountList[1::2])), x=list(reversed(dpaServicesCountList[::2])))
dpaServicesCountsChart


#### Aggregation of dpa news on category 'dpaKeywords'

In [None]:
dpaKeywordsCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'dpaKeywords',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(dpaKeywordsCounts)



Bar chart of the dpa keywords for the news matching the query entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump".


In [None]:
from nvd3 import discreteBarChart

dpaKeywordsCountList = dpaKeywordsCounts['facet_counts']['facet_fields']['dpaKeywords']
dpaKeywordsCountList = dpaKeywordsCountList[0:-2]
print("\n" + str(dpaKeywordsCountList) + "\n")

name = 'Top ranking dpa keywords in selected news'
dpaKeywordsCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
dpaKeywordsCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

dpaKeywordsCountsChart.add_serie(y=list(reversed(dpaKeywordsCountList[1::2])), x=list(reversed(dpaKeywordsCountList[::2])))
dpaKeywordsCountsChart
