# SGV 12 Keyword Analysis

This notebook explores different methods for analyzing the keywords in the Ernst Brunner collection.

In [7]:
import pandas as pd
from collections import Counter
import json

# Path to the JSON export (a single file, despite the variable name)
input_dir = "./export/sgv-12_keywords_ids.json"

## Load the Objects with Keywords

In [8]:
with open(input_dir, "r") as file:
    data = json.load(file)

print("Available images with keywords:", len(data))

# Convert the data to a pandas DataFrame
df = pd.DataFrame(data)
print(df)

Available images with keywords: 22364
      schema:identifier                      schema:about
0         SGV_12N_07501             [Murbacherstrasse 31]
1         SGV_12N_07502             [Murbacherstrasse 31]
2         SGV_12N_07503             [Murbacherstrasse 31]
3         SGV_12N_07773  [Valorisierung: Museum Burgrain]
4         SGV_12N_07833                           [Suone]
...                 ...                               ...
22359     SGV_12N_24898                        [Friedhof]
22360     SGV_12N_24899                        [Friedhof]
22361     SGV_12N_24900                        [Friedhof]
22362     SGV_12N_24971                        [Friedhof]
22363     SGV_12N_24972                        [Friedhof]

[22364 rows x 2 columns]


## Co-occurrence Analysis

A matrix that compares every keyword with every other keyword and counts how often each pair is assigned to the same image.

In [17]:
# The DataFrame from which the co-occurrence matrix is built:
# either all 1578 unique keywords or only the top n
cooc_df = df
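
The cell above keeps all keywords. Restricting the analysis to the top n could look roughly like the following sketch. It assumes that "top" means "assigned to the most images"; `toy_df`, `top_n` and `toy_top_df` are hypothetical names, and the values are made-up stand-in data rather than the collection itself.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for df (hypothetical values)
toy_df = pd.DataFrame({
    "schema:about": [
        ["Winter", "Schnee"],
        ["Schnee", "Dorf"],
        ["Winter", "Schnee", "Haus"],
    ],
})

top_n = 2  # hypothetical cut-off

# Count how many rows mention each keyword
keyword_counts = Counter(kw for kws in toy_df["schema:about"] for kw in kws)
top_keywords = {kw for kw, _ in keyword_counts.most_common(top_n)}

# Keep only the top keywords in each row; work on a copy so toy_df stays intact
toy_top_df = toy_df.copy()
toy_top_df["schema:about"] = toy_top_df["schema:about"].apply(
    lambda kws: [kw for kw in kws if kw in top_keywords]
)
print(toy_top_df)
```

Assigning the filtered frame to `cooc_df` instead of `df` would then leave the rest of the notebook unchanged.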

In [18]:
# Get a list of all unique keywords

# Step 1: Create an empty list to store all keywords
all_keywords = []

# Step 2: Loop through each row in the DataFrame
for row in cooc_df['schema:about']:
    # Step 3: Add the keywords from this row to the all_keywords list
    all_keywords.extend(row)

# Step 4: Remove duplicate keywords by converting the list to a set
unique_keywords = set(all_keywords)

# Step 5: Convert the set back to a list
keyword_list = list(unique_keywords)
print("unique keywords", len(unique_keywords))

# Create a square DataFrame to hold the co-occurrence counts.
# Rows (index) and columns are both labeled with the keywords in keyword_list,
# and every cell starts at 0.
co_occurrence_matrix = pd.DataFrame(0, index=keyword_list, columns=keyword_list)

unique keywords 1578
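
The list of unique keywords can also be built without an explicit loop. The following is a minimal sketch, assuming every `schema:about` cell holds a list of strings; `toy_df` and `toy_keywords` are hypothetical names and the values are made-up stand-in data.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values)
toy_df = pd.DataFrame({
    "schema:identifier": ["A", "B", "C"],
    "schema:about": [["Winter", "Schnee"], ["Schnee"], ["Dorf", "Haus", "Winter"]],
})

# explode() gives every keyword its own row; unique() drops the duplicates
toy_keywords = sorted(toy_df["schema:about"].explode().unique())
print("unique keywords", len(toy_keywords))
```

Sorting the result is optional, but it gives the matrix rows and columns a stable order from one run to the next, which a plain `set` does not guarantee.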


1. Outer Loop (for keywords in cooc_df['schema:about']):

    This loop goes through each row of the column cooc_df['schema:about'].
    Each row holds the list of keywords for one image (for example ["apple", "banana", "orange"]).
    In each iteration, the variable keywords is the keyword list of one image.

2. First Inner Loop (for i in range(len(keywords))):

    This loop iterates over the keywords list by position (index).
    len(keywords) gives the total number of keywords in that list.
    The variable i is the position of the first keyword of a pair.

3. Second Inner Loop (for j in range(i + 1, len(keywords))):

    This loop starts at the keyword after position i and goes through the rest of the list.
    The variable j is the position of the second keyword of a pair.
    Starting at i + 1 ensures that each pair of positions is visited only once and that a keyword is never paired with itself, provided the list contains no duplicate keywords.

4. Keyword Pairing (kw1, kw2 = keywords[i], keywords[j]):

    The variables kw1 and kw2 are assigned the two keywords at positions i and j in the list.
    For example, if keywords = ["apple", "banana", "orange"], i = 0 and j = 1, then kw1 = "apple" and kw2 = "banana".

5. Updating the Co-occurrence Matrix:

    co_occurrence_matrix.at[kw1, kw2] += 1: This increases the value in the cell at row kw1 and column kw2 by 1.
    co_occurrence_matrix.at[kw2, kw1] += 1: This does the same for the mirrored cell (row kw2, column kw1), which keeps the matrix symmetric. The co-occurrence of "apple" and "banana" is the same as the co-occurrence of "banana" and "apple".
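
The five steps above can be traced on a tiny example. This sketch uses two made-up keyword lists instead of the collection data; `toy_rows`, `toy_keywords` and `toy_matrix` are hypothetical names.

```python
import pandas as pd

# Two made-up keyword lists standing in for two images
toy_rows = [["apple", "banana", "orange"], ["apple", "banana"]]
toy_keywords = ["apple", "banana", "orange"]

# Square matrix of zeros, labeled with the keywords on both axes
toy_matrix = pd.DataFrame(0, index=toy_keywords, columns=toy_keywords)

for keywords in toy_rows:                      # step 1: one list per image
    for i in range(len(keywords)):             # step 2: first keyword of the pair
        for j in range(i + 1, len(keywords)):  # step 3: every later keyword
            kw1, kw2 = keywords[i], keywords[j]  # step 4: the pair itself
            toy_matrix.at[kw1, kw2] += 1         # step 5: count the pair ...
            toy_matrix.at[kw2, kw1] += 1         # ... and its mirror image

print(toy_matrix)
```

Here "apple" and "banana" appear together in both lists, so their two cells end at 2, while the pairs involving "orange" end at 1 and the diagonal stays at 0.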

In [19]:
# Iterate through the keywords of each file and update co-occurrence counts
for keywords in cooc_df['schema:about']:
    for i in range(len(keywords)):
        for j in range(i + 1, len(keywords)):
            kw1, kw2 = keywords[i], keywords[j]
            co_occurrence_matrix.at[kw1, kw2] += 1
            co_occurrence_matrix.at[kw2, kw1] += 1  # Ensure symmetry

# Display the co-occurrence matrix
print(co_occurrence_matrix)

                                                    Traubenlese  Obstbaum  \
Traubenlese                                                   0         0   
Obstbaum                                                      0         0   
Gamelle                                                       0         0   
Ruderboot                                                     0         0   
Baselstab                                                     0         0   
...                                                         ...       ...   
Weinherstellung                                               0         0   
Gabel                                                         0         0   
Zürcher Kantonale Landwirtschafts- und Gewerbea...            0         0   
Schlafzimmer                                                  0         0   
Sägen (Tätigkeit)                                             0         0   

                                                    Gamelle  Ruderboot  \
T
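
Updating single cells in nested loops can become slow when there are many rows. One possible alternative, sketched here and not the method used in this notebook, is to count the pairs with `Counter` and `itertools.combinations` first and fill the matrix afterwards. `toy_rows`, `pair_counts` and `toy_matrix` are hypothetical names, and the keyword lists are made-up stand-in data. Unlike the loop above, this version ignores a keyword that is repeated within one list.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Made-up keyword lists standing in for cooc_df['schema:about']
toy_rows = [["Winter", "Schnee", "Berg"], ["Winter", "Schnee"], ["Dorf", "Haus"]]

pair_counts = Counter()
for keywords in toy_rows:
    # sorted(set(...)) drops repeats within a list and fixes the order of each
    # pair, so ("Schnee", "Winter") is always counted under the same key
    pair_counts.update(combinations(sorted(set(keywords)), 2))

# Fill a symmetric matrix from the pair counts
toy_keywords = sorted({kw for kws in toy_rows for kw in kws})
toy_matrix = pd.DataFrame(0, index=toy_keywords, columns=toy_keywords)
for (kw1, kw2), count in pair_counts.items():
    toy_matrix.at[kw1, kw2] = count
    toy_matrix.at[kw2, kw1] = count

print(pair_counts.most_common(3))
```

The `pair_counts` mapping already holds every unordered pair exactly once, so it can also serve directly as input for a ranking or an export.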

In [20]:
# Collect the non-zero cells of co_occurrence_matrix as row/col/value records
matrix_data = []

# Convert the DataFrame to a suitable JSON format
for row_keyword in co_occurrence_matrix.index:
    for col_keyword in co_occurrence_matrix.columns:
        value = int(co_occurrence_matrix.at[row_keyword, col_keyword])  # Convert value to int
        if value > 0:  # Optionally filter out zero counts
            matrix_data.append({
                "row": row_keyword,
                "col": col_keyword,
                "value": value
            })
    
# Sort the matrix data by value in descending order
matrix_data_sorted = sorted(matrix_data, key=lambda x: x['value'], reverse=True)

# Export the matrix data to a JSON file
with open('./export/sgv-12-keywords_cooccurence_all.json', 'w') as f:
    json.dump(matrix_data_sorted, f, indent=4)

# Print a snippet of the matrix data for confirmation
print(json.dumps(matrix_data_sorted[:5], indent=4))

[
    {
        "row": "Winter",
        "col": "Schnee",
        "value": 1081
    },
    {
        "row": "Schnee",
        "col": "Winter",
        "value": 1081
    },
    {
        "row": "Dorf",
        "col": "Haus",
        "value": 879
    },
    {
        "row": "Haus",
        "col": "Dorf",
        "value": 879
    },
    {
        "row": "Landschaft",
        "col": "Berg",
        "value": 856
    }
]
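
Because the matrix is symmetric, every pair appears twice in the export, as with "Winter"/"Schnee" and "Schnee"/"Winter" above. If one record per pair is enough, the list could be reduced as in the following sketch. `toy_records` and `unique_pairs` are hypothetical names; the records repeat the first four entries of the output above.

```python
# The first four records from the printed output above
toy_records = [
    {"row": "Winter", "col": "Schnee", "value": 1081},
    {"row": "Schnee", "col": "Winter", "value": 1081},
    {"row": "Dorf", "col": "Haus", "value": 879},
    {"row": "Haus", "col": "Dorf", "value": 879},
]

# Keep only the record whose row label sorts before its column label;
# the mirrored record fails this test and is dropped
unique_pairs = [rec for rec in toy_records if rec["row"] < rec["col"]]

for rec in unique_pairs:
    print(rec["row"], rec["col"], rec["value"])
```

The comparison works on plain string order, so which of the two mirrored records survives depends only on the spelling of the keywords, not on their frequency.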
