# A quick guide to searching with the Impresso library

## What is this notebook about?

This notebook is a quick guide to searching with the [Impresso Python Library](https://impresso.github.io/impresso-py/). With it, you can find content items in Impresso using a variety of filters. It introduces new users to the capabilities of the library, and it also works as a cheatsheet that you can refer to whenever you need a quick solution.

Please note that some news headlines (titles) and transcripts may be shown as **[redacted]**, depending on the copyright access your user plan allows.

---

## What will you learn?

In this notebook, you will learn how to:

- Navigate the Impresso corpus via the API
- Use the different search functionalities available in the Impresso Python library
- Write complex queries (including AND and OR queries)

This notebook will guide you through these core functionalities and help you get familiar with the Impresso library capabilities.

---
## Prerequisites

Install the `impresso` Python library (for example with `pip install impresso`), then run the following cell to import it and connect to Impresso. You may need to restart the kernel to use updated packages. To do so on Google Colab, go to *Runtime* and select *Restart session*.

In [None]:
from impresso import connect

impresso = connect()

### `term`

Start by searching for content items containing the keyword 'impresso'.
> In Impresso, a **Content Item** is the smallest unit of editorial content within a newspaper or radio collection. This can be an article (for newspapers) or a show or episode (for radio programs). Content items also vary by type, including articles, advertisements, tables, images, and more. Please note that when a newspaper title has no segmentation (OLR), its content items correspond to pages.

**Important note:** the output includes the option 'See this result in the Impresso App'. Clicking the link opens the same result you retrieved in this notebook in the interface of the Impresso App. This is part of the Impresso project's effort to integrate the workflows of the Web App and the Datalab.

In [None]:
impresso.search.find(term="impresso")

### `with_text_contents`

Retrieve only content items that contain textual data.

In [None]:
impresso.search.find(term="impresso", with_text_contents=True)

### `title`
Retrieve only content items that have the keyword 'impresso' in the title (news headline).

In [None]:
impresso.search.find(title="impresso")

## Complex term requests


### `AND` queries
Find content items whose title contains all of the given terms.

In [None]:
from impresso import AND

impresso.search.find(title=AND("homme", "femme"))

### `OR` queries
Find content items whose title contains at least one of the given terms.

In [None]:
from impresso import OR

impresso.search.find(title=OR("homme", "femme"))

### Inverted search (everything excluding term A __OR__ term B)

To find all content items with the word "luddite" in the title that mention neither "textile" nor "machine", put `~` before **OR**.

In [None]:
from impresso import OR

impresso.search.find(title="luddite", term=~OR("textile", "machine"))
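The `~OR(...)` filter reads as "neither ... nor ...". The sketch below illustrates that logic in plain Python, independently of the Impresso library (where matching happens on the server). The `mentions` helper and the sample texts are hypothetical and exist only for this illustration; the point is that excluding "A or B" is the same as requiring "not A and not B" (De Morgan's law).

```python
def mentions(text: str, word: str) -> bool:
    """Naive, case-insensitive whole-word check (illustration only)."""
    return word.lower() in text.lower().split()


# Hand-written sample texts, not real Impresso content.
samples = [
    "the luddite riots destroyed a machine",
    "luddite protests spread in the textile towns",
    "a luddite pamphlet was printed in london",
]

# Keep texts that mention neither 'textile' nor 'machine'.
kept = [
    t for t in samples
    if not (mentions(t, "textile") or mentions(t, "machine"))
]

# De Morgan: not (A or B) is equivalent to (not A) and (not B).
kept_alt = [
    t for t in samples
    if not mentions(t, "textile") and not mentions(t, "machine")
]
```

Both lists contain only the third sample, which mentions neither excluded word.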

### Complex combination of terms

The following cell searches for content items that meet all of the following conditions:

* mentioning 'hitler' AND 'stalin'
* also mentioning either 'molotow' OR 'ribbentrop'
* and not mentioning 'churchill'

In [None]:
from impresso import AND, OR

impresso.search.find(term=AND("hitler", "stalin") & OR("molotow", "ribbentrop") & ~OR("churchill"))
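To see how `&` and `~` can combine conditions like these, here is a toy model in plain Python. It is not the Impresso library's implementation: the `Query` class and the `all_of` / `any_of` helpers are hypothetical names invented for this sketch, and matching is a naive whole-word check on lowercase text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Query:
    """Toy boolean query wrapping a predicate over a set of words."""
    test: Callable[[set], bool]

    def __and__(self, other: "Query") -> "Query":
        # q1 & q2 matches only when both queries match.
        return Query(lambda words: self.test(words) and other.test(words))

    def __invert__(self) -> "Query":
        # ~q matches exactly when q does not.
        return Query(lambda words: not self.test(words))

    def matches(self, text: str) -> bool:
        return self.test(set(text.lower().split()))


def all_of(*terms: str) -> Query:
    """Hypothetical stand-in for an AND of terms."""
    return Query(lambda words: all(t in words for t in terms))


def any_of(*terms: str) -> Query:
    """Hypothetical stand-in for an OR of terms."""
    return Query(lambda words: any(t in words for t in terms))


query = (
    all_of("hitler", "stalin")
    & any_of("molotow", "ribbentrop")
    & ~any_of("churchill")
)
```

With this toy query, `query.matches("hitler stalin ribbentrop")` is true, while adding "churchill" to the text or dropping both "molotow" and "ribbentrop" makes it false.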

### `front_page`

Retrieve only content items published on a newspaper's front page.

In [None]:
impresso.search.find(term="impresso", front_page=True)

### `entity_id`

Search by entity ID. Every entity in the Impresso corpus has a specific ID. You can use that ID to retrieve the content items in which the entity is mentioned.

In [None]:
impresso.search.find(entity_id="aida-0001-54-Switzerland")

But where do I find the entity ID? Search for the entity using the cell below; the `entity_id` is shown in the first column.

PS: Several entities mention Switzerland. We are looking for the country, so we use the first one in the output. The others refer to cities or cantons.

In [None]:
impresso.entities.find(term="Switzerland")

You can retrieve all content items that mention both the entities Switzerland **AND** Albert Einstein.

In [None]:
impresso.search.find(entity_id=AND("aida-0001-54-Switzerland", "aida-0001-50-Albert_Einstein"))

Also, find all content items that mention either Switzerland **OR** Albert Einstein.

In [None]:
impresso.search.find(entity_id=OR("aida-0001-54-Switzerland", "aida-0001-50-Albert_Einstein"))

### `newspaper_id`

Retrieve content items published by specific newspapers. In the case below, either EXP (L'Express) or GDL (Gazette de Lausanne).

In [None]:
impresso.search.find(term="independence", newspaper_id=OR("EXP", "GDL"))

But how do I find the newspapers' acronyms?

Use the facet search method to retrieve all newspapers that are relevant to your keyword search.

In [None]:
df_newspapers = impresso.search.facet("newspaper", term="independence", limit=100)
df_newspapers.df

### `DateRange`

By using `DateRange`, you can delimit a timeframe for your search. In the example below, we will find content items mentioning the word 'independence', published between 21st May 1921 and 2nd January 2001. 

In [None]:
from impresso import DateRange

impresso.search.find(term="independence", date_range=DateRange("1921-05-21", "2001-01-02"))

You can also search for content items published outside the date range by putting `~` before `DateRange`.

In [None]:
from impresso import DateRange

impresso.search.find(term="independence", date_range=~DateRange("1921-05-21", "2001-01-02"))
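As a plain-Python illustration of what "inside" and "outside" the range mean, the sketch below uses only the standard `datetime` module, not the Impresso library. It assumes both bounds are inclusive, which this notebook does not confirm for the Impresso API; the helper names are hypothetical.

```python
from datetime import date

# Same ISO-formatted bounds as in the cells above.
start = date.fromisoformat("1921-05-21")
end = date.fromisoformat("2001-01-02")


def in_range(d: date) -> bool:
    """True when d lies between start and end (bounds assumed inclusive)."""
    return start <= d <= end


def outside_range(d: date) -> bool:
    """Mirror of the inverted (~) range: everything not in the range."""
    return not in_range(d)
```

For example, `in_range(date(1950, 1, 1))` is true, while `outside_range(date(1900, 1, 1))` and `outside_range(date(2001, 1, 3))` are both true.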

### `language`

Search for the term "Paris" in content items written in German or English.

In [None]:
impresso.search.find(term="Paris", language=OR("de", "en"))

And now search for the word "banana" in any language **except** English or German.

In [None]:
impresso.search.find(term="banana", language=~OR("de", "en"))

### `topic_id`

Find content items that match either of the two topics.

In [None]:
impresso.search.find(topic_id=OR("tm-fr-all-v2.0_tp07_fr", "tm-fr-all-v2.0_tp48_fr")) 

But how do I know the `topic_id`?

You can search for a specific term using the facet search method. The `topic_id` is displayed in the first column. See the example below:

In [None]:
impresso.search.facet("topic", term="Paris")

### `collection_id`
You can display all content items you saved in one of your collections.

In [None]:
impresso.search.find(collection_id="ADD_COLLECTION_ID_HERE")

To find the id of one of your collections, you can use the code below:

In [None]:
impresso.search.facet("collection")

### `country`

Find all content items published in either of the two specified countries.

In [None]:
impresso.search.find(term="Schengen", country=OR("FR", "CH"))

### `partner_id`

Limit search to content items provided by a specific partner of the Impresso project.

In [None]:
impresso.search.find(term="Schengen", partner_id="Migros")

### `text_reuse_cluster_id`

Find all content items that are part of a specific text reuse cluster.

In [None]:
from impresso import OR
impresso.search.find(text_reuse_cluster_id=OR("tr-nobp-all-v01-c29"))

## Facets

Facet search returns statistics about your search results based on Impresso metadata. The facet search method accepts the same filters as the search method.

Some facets have already been used above. Here are more:
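Conceptually, a facet groups matching items by one metadata field and counts each group. The sketch below shows that idea with the standard library only; the records are hand-written placeholders, not Impresso data, and `count_by` is a hypothetical helper rather than part of the Impresso library.

```python
from collections import Counter

# Hand-written placeholder records standing in for search results.
items = [
    {"year": 1921, "language": "fr"},
    {"year": 1921, "language": "de"},
    {"year": 1950, "language": "fr"},
]


def count_by(records, field):
    """Count how many records share each value of `field`."""
    return Counter(record[field] for record in records)


by_year = count_by(items, "year")
by_language = count_by(items, "language")
```

Here `by_year` counts two items for 1921 and one for 1950, and `by_language` counts two items for "fr" and one for "de".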

### Date range

Get the number of content items that mention 'impresso', published on a particular date.

In [None]:
impresso.search.facet("daterange", term="impresso")

### Year

Get the number of content items that mention 'impresso', published during a particular year.

In [None]:
impresso.search.facet("year", term="impresso")

### Content length

Get the number of content items that mention 'impresso', grouped by content length.

Results are grouped into buckets of 100 words. For example, 0 means content items containing between 0 and 100 words.

In [None]:
impresso.search.facet("contentLength", term="impresso") 
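The bucket arithmetic can be sketched as follows. This is an assumption about how the grouping works, based only on the description above: buckets are taken to be 100 words wide and labeled by their lower bound, and the handling of exact multiples of 100 is a guess. `length_bucket` is a hypothetical helper.

```python
def length_bucket(word_count: int, width: int = 100) -> int:
    """Return the lower bound of the bucket a word count falls into.

    Assumes half-open buckets: [0, 100), [100, 200), and so on.
    """
    return (word_count // width) * width
```

Under that assumption, an item of 42 words lands in bucket 0 and an item of 257 words in bucket 200.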

### Month

Get the number of content items that mention 'impresso', published during a particular month.

PS: Months are represented by numbers in the 'value' column (1 = January, and so on).

In [None]:
impresso.search.facet("month", term="impresso") 
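If you prefer month names to numbers, Python's standard `calendar` module can do the translation. This is a small sketch, assuming the facet values are integers from 1 to 12 as described above; `month_label` is a hypothetical helper.

```python
import calendar


def month_label(value: int) -> str:
    """Translate a month number (1-12) into its English name."""
    return calendar.month_name[value]


labels = [month_label(v) for v in (1, 2, 12)]
```

Note that `calendar.month_name` follows the current locale; with Python's default locale the names are in English.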

### Country

Get the number of content items that mention 'impresso', grouped by country of publication.

In [None]:
impresso.search.facet("country", term="impresso")

### Type

Get the number of content items, grouped by type of item. The cell below applies no term filter; pass `term="impresso"` to count only items that mention 'impresso'.

Dictionary:

* ad = advertisement
* ar = article
* ob = obituary

In [None]:
impresso.search.facet("type")
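The type codes can be turned into readable labels with a small lookup table. The sketch below covers only the three codes listed in the dictionary above; any other code is returned unchanged rather than guessed. `TYPE_LABELS` and `type_label` are hypothetical names.

```python
# Only the codes documented in this notebook; others are left as-is.
TYPE_LABELS = {
    "ad": "advertisement",
    "ar": "article",
    "ob": "obituary",
}


def type_label(code: str) -> str:
    """Return the readable label for a type code, or the code itself."""
    return TYPE_LABELS.get(code, code)
```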

### Topic

Find the topics that content items mentioning 'pomme' are related to.

In [None]:
impresso.search.facet("topic", term="pomme")

### Collection

Find which of your collections contain content items mentioning 'pomme'. This only works if you have a collection in Impresso.

In [None]:
impresso.search.facet("collection", term="pomme")

### Newspaper

Find the newspapers in which content items mentioning the keyword 'Schengen' have been published.

In [None]:
impresso.search.facet("newspaper", term="Schengen")

### Language

Find the languages of content items mentioning the keyword 'Schengen'.

In [None]:
impresso.search.facet("language", term="Schengen")

### Person

Find persons mentioned in content items that contain the keyword 'Schengen'. The `offset` argument skips the first results, so the cell below starts after the first 7140 persons.

In [None]:
impresso.search.facet("person", term="Schengen", offset=7140)

### Location

Find locations mentioned in content items that mention 'Schengen'. As with persons, `offset` skips the first results, here the first 3310 locations.

In [None]:
impresso.search.facet("location", term="Schengen", offset=3310)
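Large facets such as persons and locations are easiest to read page by page. The sketch below shows a generic `limit` / `offset` loop in plain Python. It does not call Impresso: `fake_fetch` and its data are stand-ins invented for this example, and `iter_pages` is a hypothetical helper. With the real library, the fetch function could wrap a facet call that passes `limit` and `offset`, but how to detect the last page there is an assumption you would need to verify.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch: Callable[[int, int], List[str]], limit: int = 100
) -> Iterator[List[str]]:
    """Yield successive pages until the fetch function returns nothing."""
    offset = 0
    while True:
        page = fetch(offset, limit)
        if not page:
            break
        yield page
        offset += limit


# Stand-in data source: 250 made-up values instead of a real request.
FAKE_VALUES = [f"value-{i}" for i in range(250)]


def fake_fetch(offset: int, limit: int) -> List[str]:
    return FAKE_VALUES[offset:offset + limit]


pages = list(iter_pages(fake_fetch, limit=100))
```

With 250 values and a limit of 100, the loop yields three pages of 100, 100 and 50 values.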

### NAG

Find all entities without a known entity type (not tagged as person or location, for example) mentioned in content items whose titles contain the keywords 'homme' and 'femme'.

In [None]:
from impresso import AND
impresso.search.facet("nag", title=AND("homme", "femme"))

### Access rights

Get access rights of content items mentioning 'pomme'.

In [None]:
impresso.search.facet("accessRight", term="pomme")

### Partner

Get Impresso partners that provided content items mentioning 'pomme'.

In [None]:
impresso.search.facet("partner", term="pomme")

## Conclusion
That's it for now! Next, you can explore:

- The [Introduction to the Impresso Python Library](https://github.com/impresso/impresso-datalab-notebooks/blob/main/starter/basics_ImpressoAPI.ipynb) notebook, which demonstrates further uses of the Impresso library, including managing collections and much more.
- The [documentation](https://impresso.github.io/impresso-py/) of the Impresso Python library, if you want to learn more about it.

---
## Project and License info

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2024 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>