
# 1. API Search on the DDB Zeitungsportal

We start by querying the DDB Zeitungsportal (the newspaper portal of the Deutsche Digitale Bibliothek), using its newspaper-issues index, for pages from 1914–1918 that combine snow or mountain terms with military terms. Specifically, a page must contain at least one of "schnee" (snow), "berg" (mountain), or "gebirg" (mountain range) and at least one of "truppe" (troop) or "verlust" (loss). Such co-occurrences might indicate reports about snow-related events affecting soldiers during World War I, especially in mountainous combat zones.

This search aims to identify reports that discuss natural hazards (such as snow, avalanches, or harsh winter conditions) in relation to military units and human losses, allowing us to study how newspapers portrayed such incidents during wartime.
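
The query described above can also be assembled from term lists, which makes it easier to swap keywords in and out. The following is a minimal sketch, not part of the original notebook: the helper names `or_group` and `build_query` are hypothetical, while the field names (`type`, `publication_date`, `plainpagefulltext`) are the ones used in the notebook's query.

```python
from datetime import date


def or_group(field, terms):
    """Return a Solr clause that matches any of `terms` in `field`."""
    return f"{field}:({' OR '.join(terms)})"


def build_query(term_groups, start, end, field="plainpagefulltext"):
    """AND together one OR-group per list in `term_groups`, plus a date range."""
    date_range = (
        f"publication_date:[{start.isoformat()}T00:00:00Z "
        f"TO {end.isoformat()}T23:59:59Z]"
    )
    clauses = ["type:page", date_range]
    clauses += [or_group(field, terms) for terms in term_groups]
    return " AND ".join(clauses)


# Snow/mountain terms in the first group, military terms in the second.
query = build_query(
    [["schnee", "berg", "gebirg"], ["truppe", "verlust"]],
    start=date(1914, 1, 1),
    end=date(1918, 12, 31),
)
print(query)
```

Passing three groups, for example `[["schnee"], ["berg", "gebirg"], ["truppe", "verlust"]]`, would give a stricter query in which "schnee" itself must appear on the page.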



In [None]:
!pip install pysolr
!pip install pandas


In [None]:
import pysolr
import pandas as pd

# Define the API endpoint for the newspaper-issues index
solr_url = 'https://api.deutsche-digitale-bibliothek.de/2/search/index/newspaper-issues'

# Initialize the pysolr client
solr = pysolr.Solr(solr_url, timeout=60)

# Construct the query:
# - 'type:page' restricts the search to individual pages
# - 'publication_date' covers the WWI period (1914-1918)
# - the first 'plainpagefulltext' clause matches snow/mountain terms,
#   the second matches war-related terms; a page must satisfy both
# - optionally, a clause such as 'zdb_id:2149754-0' could be added to target
#   a specific newspaper (adjust the ID as appropriate)
# 'rows' caps the number of returned pages at 1000.
q = {
    'q': 'type:page AND publication_date:[1914-01-01T00:00:00Z TO 1918-12-31T23:59:59Z] '
         'AND plainpagefulltext:(schnee OR berg OR gebirg) AND plainpagefulltext:(truppe OR verlust)',
    'rows': 1000
}

# Execute the query
results = solr.search(**q)

# Convert the results to a DataFrame
df_api = pd.DataFrame(results.docs)
print("API results from DDB Zeitungsportal:")
print(df_api.head())


API results from DDB Zeitungsportal:
                                                  id  pagenumber  \
0  7NG6C2BV6HOFASHTA4RDUYKYZP6ZVNKI-ALTO10329294_...           2   
1  OI7NATFMPYUF5AIXAPVEFDIBDWWTYWMO-ALTO6403343_D...           2   
2  TNAEUQ572KWXMWY6XQ5WOJH4ZZUDNRJ4-ALTO10329645_...           6   
3  WARFVPB4RX2UIQU3NLCUGKZP34KZGEEK-ALTO2509215_D...           1   
4  CMPBYSKJJVFDJCBFBEFT3AUT4XM5637E-ALTO10119147_...           1   

                                         paper_title  \
0                 Hamborner Volks-Zeitung. 1911-1929   
1      Gelsenkirchener allgemeine Zeitung. 1904-1943   
2                 Hamborner Volks-Zeitung. 1911-1929   
3  Karlsruher Tagblatt, Unterhaltungs-Beilage zum...   
4       Rheinischer Merkur : Kölnische Landeszeitung   

                    provider_ddb_id  \
0  VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW   
1  4EV676FQPACNVNHFEJHGKUY55BXC3QMB   
2  VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW   
3  INLVDM4I3AMZLTG6AE6C5GZRJKGOF75K   
4  VKNQFFAKOR4XZWJJKUX
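
The query above requests at most 1000 rows, so any further matching pages are not retrieved. One way to collect more would be to request the results in batches. The sketch below is only an illustration under assumptions: `fetch_all` and `fake_search` are hypothetical names, and it assumes the endpoint honors Solr's standard `start` and `rows` parameters, which has not been verified here. The search function is passed in as an argument so that the logic can be tried offline.

```python
def fetch_all(search_fn, query, page_size=1000, max_docs=None):
    """Collect documents page by page until a page comes back short or empty."""
    docs = []
    start = 0
    while True:
        page = list(search_fn(query, start=start, rows=page_size))
        docs.extend(page)
        if len(page) < page_size:
            break  # last page reached
        if max_docs is not None and len(docs) >= max_docs:
            break  # enough documents collected
        start += page_size
    return docs if max_docs is None else docs[:max_docs]


# Stand-in for a real search call: 25 dummy documents, served in slices.
def fake_search(query, start=0, rows=10):
    data = [{"id": i} for i in range(25)]
    return data[start:start + rows]


all_docs = fetch_all(fake_search, "type:page", page_size=10)
print(len(all_docs))
```

With pysolr, `solr.search` could be passed as `search_fn`, since it accepts additional Solr parameters as keyword arguments; whether the DDB endpoint pages results this way is an assumption to check against its documentation.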

# 2. Semantic Search to Identify New Keywords and Filter Articles
Next, we apply a semantic search pipeline using a transformer model to find semantically related keywords and to further filter the articles based on how they discuss the loss of life.

### 2.1 Discovering New Relevant Keywords
For example, we can take a target term like “schnee” (snow) or a combined query phrase such as “gebirg und truppenverlust” (mountains and troop loss) and use a transformer model to find semantically similar words in our corpus.

This may reveal additional keywords that newspapers used at the time to describe such events: terms that frame them as natural (e.g., snow, mountain, fate) or that hint at military consequences or responsibility (e.g., troop loss, suffering, command decisions).

By exploring these related terms, we can better understand how the press discussed snow-related incidents during wartime, especially in alpine combat zones, and whether it portrayed them as inevitable natural tragedies or connected them to wartime decisions and structures.
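
The idea behind this keyword discovery is to rank candidate words by the cosine similarity between their embedding and the embedding of the target term. The following is a minimal sketch with made-up three-dimensional vectors; the function names and the toy vocabulary are illustrative assumptions, and real embeddings would come from a transformer model rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(target_vec, word_vecs, top_k=3):
    """Return the `top_k` (word, similarity) pairs, most similar first."""
    scored = [(word, cosine(target_vec, vec)) for word, vec in word_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up vectors, for illustration only.
toy_vectors = {
    "lawine": [0.9, 0.1, 0.2],
    "truppe": [0.2, 0.9, 0.1],
    "ernte": [0.1, 0.1, 0.9],
}
target = [1.0, 0.2, 0.1]  # stands in for the embedding of "schnee"

ranking = rank_by_similarity(target, toy_vectors, top_k=2)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

Because cosine similarity depends only on the angle between vectors, words whose vectors point in a similar direction as the target rank highest regardless of vector length.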

### 2.2 Document-Level Semantic Filtering
We can also use document-level semantic search to prioritize articles that discuss loss of life in snow-related incidents during wartime in mountainous regions.

For example, a query such as “Verlust von Soldaten im Gebirge durch Schnee” (loss of soldiers in the mountains due to snow) or “Truppen verschüttet in den Alpen” (troops buried in the Alps) can help identify documents that go beyond simple mentions of snow or mountains and instead focus on human suffering and military impact.

This filtering highlights the most relevant texts: those that not only mention environmental conditions such as snow or terrain, but also connect them, directly or indirectly, to military presence, troop movements, or casualties. These texts offer insight into how such events were framed and understood at the time.
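
Once each page has a similarity score against the query, the filtering step amounts to keeping the pages above a cut-off and sorting them. The sketch below illustrates this with hypothetical page identifiers and scores that are not taken from the corpus; `filter_and_rank` is an illustrative name, and the default cut-off of 0.4 mirrors the threshold used in this notebook's code.

```python
def filter_and_rank(doc_ids, scores, threshold=0.4):
    """Keep documents scoring strictly above `threshold`, most similar first."""
    kept = [
        (doc_id, score)
        for doc_id, score in zip(doc_ids, scores)
        if score > threshold
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical identifiers and scores, for illustration only.
doc_ids = ["page-a", "page-b", "page-c", "page-d"]
scores = [0.35, 0.62, 0.41, 0.40]

selected = filter_and_rank(doc_ids, scores)
for doc_id, score in selected:
    print(doc_id, score)
```

The threshold is a judgment call: a higher value yields fewer but more closely matching pages, while a lower value keeps more borderline candidates for manual reading.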

In [None]:
!pip install --upgrade torch
!pip install --upgrade transformers
!pip install --upgrade sentence-transformers


In [None]:
import pysolr
import pandas as pd
import re
from collections import Counter
from sentence_transformers import SentenceTransformer, util
import torch

# Text preprocessing: lowercase the text and keep only German letters and whitespace
def preprocess_text(text):
    if pd.isna(text):
        return ""
    text = str(text).lower()
    text = re.sub(r'[^a-zäöüß\s]', '', text)
    return text

# Apply the preprocessing to the page texts:
# use 'plainpagefulltext' if present, otherwise fall back to 'title'
if 'plainpagefulltext' in df_api.columns:
    df_api['processed_text'] = df_api['plainpagefulltext'].apply(preprocess_text)
else:
    df_api['processed_text'] = df_api['title'].apply(preprocess_text)

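# Extract the unique words of a text
def get_unique_words(text):
    return list(set(text.split()))

all_words = []
for text in df_api['processed_text']:
    all_words.extend(get_unique_words(text))

# Each page contributes a word at most once, so these counts are document frequencies
word_freq = Counter(all_words)
unique_words = list(word_freq)

# Filter words by minimum frequency.
# Note: filtered_words is only reported here; the embedding step below still uses unique_words.
min_freq = 5
filtered_words = [word for word, freq in word_freq.items() if freq >= min_freq]

# Output labels stay in German to match the recorded output:
# "Number of words before filtering" / "after filtering (at least N occurrences)"
print(f"Anzahl der Wörter vor Filterung: {len(word_freq)}")
print(f"Anzahl der Wörter nach Filterung (mindestens {min_freq} Vorkommen): {len(filtered_words)}")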

# Load the transformer model for the word-similarity analysis
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_word = SentenceTransformer('sentence-transformers/LaBSE', device=device)

# Target term for the semantic word search (adapted): "burial of troops"
target_term = "verschüttung von truppen"
target_embedding = model_word.encode([target_term], batch_size=32, show_progress_bar=True)
word_embeddings = model_word.encode(unique_words, batch_size=32, show_progress_bar=True)

# Compute the cosine similarity between the target term and every word
similarities = util.cos_sim(target_embedding, word_embeddings)[0].tolist()
word_sim_df = pd.DataFrame({
    'word': unique_words,
    'similarity': similarities
})

# Show the 20 most similar terms
top_similar = word_sim_df.sort_values('similarity', ascending=False).head(20)
print("Neue relevante Schlüsselwörter:")  # "New relevant keywords:"
print(top_similar)

# Load the model for document comparison
model_doc = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Semantic search query (adapted): "dead soldiers after snowfall in the mountains"
semantic_query = "tote soldaten nach schneefall im gebirge"

# Compute embeddings for the pages and the query.
# Inputs longer than the model's maximum sequence length are truncated,
# so long pages are only partly embedded.
article_embeddings = model_doc.encode(df_api['processed_text'].tolist(), convert_to_tensor=True)
query_embedding = model_doc.encode(semantic_query, convert_to_tensor=True)

# Compute the similarity between each page and the query
similarities = util.cos_sim(query_embedding, article_embeddings)[0]
df_api['similarity'] = similarities.cpu().numpy()

# Keep the relevant pages (similarity > 0.4), most similar first
filtered_articles = df_api[df_api['similarity'] > 0.4].sort_values('similarity', ascending=False)

# Choose the display columns depending on which title field is present
if 'paper_title' in filtered_articles.columns:
    display_columns = ['id', 'paper_title', 'similarity']
else:
    display_columns = ['id', 'title', 'similarity']

print("Top semantically relevant articles on troop loss in snow-covered mountain regions during wartime:")
print(filtered_articles[display_columns].to_string(index=False))


Anzahl der Wörter vor Filterung: 307772
Anzahl der Wörter nach Filterung (mindestens 5 Vorkommen): 32859



Neue relevante Schlüsselwörter:
                         word  similarity
50560   truppenverschtebungen    0.869153
238043      truppenentfernung    0.848471
124587   truppenverchiebungen    0.845450
11334   truppenverschiebungen    0.833620
244831       truppenabziehung    0.832317
297974       truppengattungen    0.827912
207673         truppengattung    0.825703
72075        truppensendungen    0.825510
205804       truppennachschub    0.822368
172748     truppenbedörsnisse    0.818187
53109      truppenabtoilungen    0.815697
143912         truppenübungen    0.813099
97872        truppenverbänden    0.807375
124896     truppenausladungen    0.806530
263680     truppenabbeilungen    0.804243
244555       truppeneinheilen    0.801981
4590             truppenzügen    0.799326
84580        truppenlandungen    0.796704
208348         truppenführung    0.796223
273482    truppenvertärkungen    0.796041



Top semantically relevant articles on troop loss in snow-covered mountain regions during wartime:
                                                       id                   paper_title  similarity
G2LL4R6KDU3EJ7RYJEJLJR6DD7MDXINN-ALTO9168840_DDB_FULLTEXT  Kölnische Zeitung. 1803-1945    0.617723
YOJQ5RRLBVUYGDRLMQ7P4ZWLTZN2HID6-ALTO3829078_DDB_FULLTEXT

This filtering helps pinpoint which articles discuss the loss of life in avalanches. By examining their language, one can assess whether they frame the events as unavoidable acts of nature or attribute them, subtly or overtly, to military circumstances or enemy actions.