# A quick guide to searching with the Impresso library

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/starter/search_ImpressoAPI.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## What is this notebook about?

This notebook is a quick guide to searching with the [Impresso Python Library](https://impresso.github.io/impresso-py/). With it, you will be able to find content items in Impresso using a variety of filters. It introduces new users to the capabilities of the library, and it also works as a cheatsheet that you can refer to whenever you need a quick solution.

Please note that some news headlines (titles) and transcripts might be shown as **[redacted]**, depending on the copyright access your user plan allows.

## Why is this useful?

While the [Impresso Web App](https://impresso-project.ch/app/) provides users with a powerful user-friendly interface, accessing Impresso data via the API allows for more flexibility in your research pipelines. This means you can, for example, create visualisations beyond those provided by the interface, such as the one described in detail in the Datalab notebook [Visualising Place Entities on Maps](https://github.com/impresso/impresso-datalab-notebooks/blob/main/explore-vis/place-entities_map.ipynb).

## What will you learn?

In this notebook, you will learn how to:

- Navigate the Impresso corpus via the API
- Use the different search functionalities available in the Impresso Python library
- Write complex queries (including AND and OR queries)

This notebook will guide you through these core functionalities and help you get familiar with the Impresso library capabilities.

## Prerequisites

Run the following cell to install the `impresso` Python library. You may need to restart the kernel to use updated packages. To do so on Google Colab, go to *Runtime* and select *Restart session*.

In [None]:
%pip install -q impresso

## Initialising an Impresso Client

By running the following cell, we create an instance of the Impresso client and authenticate it with the Impresso API.

> The `impresso` variable stores an instance of `ImpressoClient`, which establishes a connection to the API using your authentication token. With this object, you can interact with the API to perform operations such as searching for content items, retrieving entities, and fetching facets.

The following command will prompt you to enter your Impresso token if you have not authenticated recently (the token expires after 8 hours).

To get an Impresso API token, go to [Impresso Datalab](https://impresso-project.ch/datalab/) and select *Get API Token* in the menu.

In [None]:
from impresso import connect

impresso = connect()

### `term`

Start by searching for content items containing the keyword 'titanic'.
> In Impresso, a **Content Item** is the smallest unit of editorial content within a newspaper or radio collection. This can be an article (for newspapers) or a radio show or episode (for radio programs). Content items can also vary by type, including articles, advertisements, tables, images, and more. Please note that when a newspaper has no segmentation (OLR), the content items for that title correspond to pages.

**Important note:** in the output, you will find the option 'See this result in the Impresso App'. Clicking on the link opens the same result you retrieved in this notebook in the interface of the Impresso Web App. This is part of the Impresso project's effort to integrate the workflow of the Web App with the Datalab.

By default, results are limited to 100. You can use the `limit` parameter to retrieve more results, e.g. `limit=300`. The maximum is 1000.

In [None]:
impresso.search.find(term="titanic")
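
The `limit` bounds mentioned above can be enforced before a request is sent. The following is a minimal sketch, not part of the Impresso library: `clamp_limit` and `MAX_LIMIT` are hypothetical helper names, and the lower bound of 1 is an assumption.

```python
MAX_LIMIT = 1000  # maximum stated in this notebook


def clamp_limit(requested: int) -> int:
    """Return `requested` clamped to the range 1..MAX_LIMIT (lower bound of 1 is an assumption)."""
    return max(1, min(requested, MAX_LIMIT))


# Intended use, once `impresso = connect()` has been run:
# impresso.search.find(term="titanic", limit=clamp_limit(300))
print(clamp_limit(300))   # 300
print(clamp_limit(5000))  # 1000
```

Clamping silently lowers a value that is too large. If you would rather be told about it, raise an error in the helper instead.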

### `with_text_contents`

Retrieve only content items that contain textual data. 

In [None]:
impresso.search.find(term="titanic", with_text_contents=True)

### `title`
Retrieve only content items that have the keyword 'titanic' in the title (news headline).

In [None]:
impresso.search.find(title="titanic")

## Complex term requests


### `AND` queries
Find content items that contain all of the given terms in the news headline.

In [None]:
from impresso import AND

impresso.search.find(title=AND("titanic", "oscars"))

### `OR` queries
Find content items that contain either of the given terms in the news headline.

In [None]:
from impresso import OR

impresso.search.find(title=OR("titanic", "naufrage"))

### Inverted search (everything excluding term A __OR__ term B)

To find all content items that contain the word "titanic" in the title and mention neither "film" nor "pellicule", you can use `~` before `OR`.

In [None]:
from impresso import OR

impresso.search.find(title="titanic", term=~OR("film", "pellicule"))

### Complex combination of terms

The following cell searches for all content items that meet all of the following conditions:

* mentioning 'titanic' AND '1912'
* also mentioning either 'eisberg' OR 'iceberg'
* and not mentioning 'film' nor 'pellicule'

In [None]:
from impresso import AND, OR

impresso.search.find(term=AND("titanic", "1912") & OR("eisberg", "iceberg") & ~OR("film", "pellicule"))
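
To make the Boolean logic of this query easier to follow, here is a small local sketch that applies the same three conditions to plain strings. It only illustrates how `AND`, `OR` and `~` combine. It does not reproduce how the Impresso index matches terms: the `matches` function and the case-insensitive substring matching are assumptions made for this example.

```python
def matches(text: str) -> bool:
    """Mirror AND("titanic", "1912") & OR("eisberg", "iceberg") & ~OR("film", "pellicule")."""
    lowered = text.lower()
    has_all = all(t in lowered for t in ("titanic", "1912"))        # AND
    has_any = any(t in lowered for t in ("eisberg", "iceberg"))     # OR
    has_excluded = any(t in lowered for t in ("film", "pellicule")) # OR to be negated
    return has_all and has_any and not has_excluded


print(matches("The Titanic struck an iceberg in 1912"))     # True
print(matches("Titanic (1912): a film about the iceberg"))  # False, mentions 'film'
print(matches("The Titanic sank in 1912"))                  # False, no 'iceberg' or 'eisberg'
```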

### `front_page`

Retrieve only content items published on the newspaper's front page.

In [None]:
impresso.search.find(term="titanic", front_page=True)

### `entity_id`

Search by entity ID. All entities in the Impresso corpus have a specific id. You can use that id to retrieve content items where this specific entity is mentioned. 

In [None]:
impresso.search.find(entity_id="aida-0001-50-James_Cameron")

But where do I find the entity id? You can simply make a search for the entity using the cell below and the `entity_id` will be shown in the first column.

PS: Several entities (10) match "James Cameron". We are looking for the film director, so we use the first one in the output. You can check whom each result refers to by clicking on 'See this result in the Impresso App'.

In [None]:
impresso.entities.find(term="James Cameron")

You can retrieve all content items that mention both the entities James Cameron **AND** Leonardo DiCaprio.

In [None]:
impresso.search.find(entity_id=AND("aida-0001-50-James_Cameron", "aida-0001-50-Leonardo_DiCaprio"))

You can also find all content items that mention the word 'titanic' and either of the entities James Cameron **OR** Barbara Stanwyck (actress in the film [Titanic (1953)](https://en.wikipedia.org/wiki/Titanic_(1953_film))).

In [None]:
impresso.search.find(term="titanic", entity_id=OR("aida-0001-50-James_Cameron", "aida-0001-50-Barbara_Stanwyck"))

### `newspaper_id`

Retrieve content items that have been published by specific newspapers. In the example below, these are either EXP (L'Express) or GDL (Gazette de Lausanne).

In [None]:
impresso.search.find(term="titanic", newspaper_id=OR("EXP", "GDL"))

But how do I find the newspapers' acronyms? 

You can simply use the facet search method to retrieve all newspapers that are relevant to your keyword search.

In [None]:
df_newspapers = impresso.search.facet("newspaper", term="titanic", limit=100)
df_newspapers.df

### `DateRange`

By using `DateRange`, you can delimit a timeframe for your search. In the example below, you will find content items mentioning the word 'titanic', published between 15th April 1912 (date of the accident) and 1st January 1913. 

In [None]:
from impresso import DateRange

impresso.search.find(term="titanic", date_range=DateRange("1912-04-15", "1913-01-01"))

You can also search for content items published outside the date range by using the `~` operator.

In [None]:
from impresso import DateRange

impresso.search.find(term="titanic", date_range=~DateRange("1912-04-15", "1913-01-01"))
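
As a rough local illustration of what the two queries above select, the sketch below checks individual dates against the same interval using only the standard library. It does not call the Impresso API. The `in_range` helper is hypothetical, and treating both boundary dates as included is an assumption, not documented behaviour of `DateRange`.

```python
from datetime import date

START = date.fromisoformat("1912-04-15")
END = date.fromisoformat("1913-01-01")


def in_range(day: str) -> bool:
    """True if `day` (YYYY-MM-DD) lies between START and END, both included (assumption)."""
    return START <= date.fromisoformat(day) <= END


for day in ("1912-04-16", "1997-12-19"):
    # Dates inside the interval correspond to DateRange, all others to ~DateRange.
    print(day, "DateRange" if in_range(day) else "~DateRange")
```

`date.fromisoformat` also raises a `ValueError` for malformed dates, which is a cheap way to catch typos before sending a query.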

### `language`

Search for the term "titanic" in content items written in German or English.

In [None]:
impresso.search.find(term="titanic", language=OR("de", "en"))

### `collection_id`
You can display all content items you saved in one of your collections.

In [None]:
impresso.search.find(collection_id="ADD_COLLECTION_ID_HERE")

To find the id of one of your collections, you can use the code below:

In [None]:
impresso.search.facet("collection")

### `country`

Find all content items published in either of the two specified countries.

In [None]:
impresso.search.find(term="titanic", country=OR("FR", "CH"))

### `partner_id`

Limit search to content items provided by a specific partner of the Impresso project.

In [None]:
impresso.search.find(term="titanic", partner_id="Migros")

## Facets

Facet search is a way to get statistics for your search using Impresso metadata. The facet search method accepts the same parameters as the search method.

Some facets have already been used above. Here you will find more:

### Date range

Get the number of content items that mention 'titanic', published on a particular date.

In [None]:
impresso.search.facet("daterange", term="titanic")

### Year

Get the number of content items that mention 'titanic', published during a particular year.

In [None]:
impresso.search.facet("year", term="titanic")

### Content length

Get the number of content items that mention 'titanic', grouped by content length.

Results are grouped into buckets of 100 words, so 0 stands for content items containing between 0 and 100 words.

In [None]:
impresso.search.facet("contentLength", term="titanic") 
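
The bucket values can be turned into word ranges with a small helper. This is a sketch under an assumption: it presumes that each facet value is the lower bound of its bucket in words (0, 100, 200, ...), which the notebook only confirms for the value 0. `bucket_range` and `BUCKET_WIDTH` are hypothetical names.

```python
BUCKET_WIDTH = 100  # words per bucket, as described in this notebook


def bucket_range(bucket: int) -> tuple[int, int]:
    """Return (lower, upper) word counts for a bucket, assuming `bucket` is the lower bound."""
    return bucket, bucket + BUCKET_WIDTH


print(bucket_range(0))    # (0, 100)
print(bucket_range(300))  # (300, 400)
```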

### Month

Get the number of content items that mention 'titanic', published during a particular month.

PS: Months are represented by numbers in the column 'value' (1 = January, 2 = February, and so on).

In [None]:
impresso.search.facet("month", term="titanic") 
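
If you prefer month names to numbers, the standard library can do the translation. A minimal sketch, assuming the facet values are integers from 1 to 12. `month_label` is a hypothetical helper, not part of the Impresso library.

```python
import calendar


def month_label(value: int) -> str:
    """Translate a month number (1-12) into its name (English under Python's default locale)."""
    if not 1 <= value <= 12:
        raise ValueError(f"month must be between 1 and 12, got {value}")
    return calendar.month_name[value]


print(month_label(1))   # January
print(month_label(12))  # December
```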

### Country

Get the number of content items that mention 'titanic', grouped by country of publication.

In [None]:
impresso.search.facet("country", term="titanic")

### Type

Get the number of items that mention 'titanic', grouped by type of item.

Dictionary:

* ad = advertisement
* ar = article
* ob = obituary

In [None]:
impresso.search.facet("type", term="titanic")
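
The dictionary above can be kept as a small lookup table to make facet results easier to read. This sketch only covers the three codes listed in this notebook. Other codes may exist, and the `type_label` helper (a hypothetical name) returns them unchanged.

```python
TYPE_LABELS = {
    "ad": "advertisement",
    "ar": "article",
    "ob": "obituary",
}


def type_label(code: str) -> str:
    """Return the readable label for a type code, or the code itself if it is not listed."""
    return TYPE_LABELS.get(code, code)


print(type_label("ar"))  # article
print(type_label("xx"))  # xx (made-up code, returned unchanged)
```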

### Topic

Find topics that the content items mentioning 'titanic' are related to.

In [None]:
impresso.search.facet("topic", term="titanic")

### Collection

Find which of your collections store content items containing 'titanic'. This only works if you have at least one collection in Impresso.

In [None]:
impresso.search.facet("collection", term="titanic")

### Newspaper

Find the newspapers in which content items mentioning the keyword 'titanic' have been published.

In [None]:
impresso.search.facet("newspaper", term="titanic")

### Language

Find the languages of content items mentioning the keyword 'titanic'.

In [None]:
impresso.search.facet("language", term="titanic")

### Location

Get the number of content items that mention 'titanic', grouped by location.

In [None]:
impresso.search.facet("location", term="titanic")

### Access rights

Get access rights of content items mentioning 'titanic'.

In [None]:
impresso.search.facet("accessRight", term="titanic")

### Partner

Get Impresso partners that provided content items mentioning 'titanic'.

In [None]:
impresso.search.facet("partner", term="titanic")

## Conclusion
That's it for now! Next, you can explore:

- the [Introduction to the Impresso Python Library](https://github.com/impresso/impresso-datalab-notebooks/blob/main/starter/basics_ImpressoAPI.ipynb) notebook, which demonstrates how to use the Impresso Library further, including managing collections and much more.
- the [documentation of the Impresso Python library](https://impresso.github.io/impresso-py/), if you want to learn more about it.

---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)
**Writing - Original draft:** Roman Kalyakin. **Conceptualization:** Roman Kalyakin, Maud Ehrmann. **Software:** Roman Kalyakin. **Writing - Review & Editing:** Caio Mello. **Validation:** Sarah Oberbichler. **Datalab editorial board:** Caio Mello (Managing), Pauline Conti, Emanuela Boros, Marten Düring, Juri Opitz, Martin Grandjean, Estelle Bunout. **Data curation & Formal analysis:** Maud Ehrmann, Emanuela Boros, Pauline Conti, Simon Clematide, Juri Opitz, Andrianos Michail. **Methodology:** Roman Kalyakin. **Supervision:** Marten Düring, Maud Ehrmann. **Funding acquisition:** Maud Ehrmann, Simon Clematide, Marten Düring, Raphaëlle Ruppen Coutaz.

<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="Open In Colab"/>
</a> 

This notebook is published under the [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/).

For feedback on this notebook, please send an email to info@impresso-project.ch

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.


---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
