## RQ2 Text Analysis

General plan:
- Follow a similar approach to the Chai-Allah "Data Mining..." paper to create clusters from the data
    - Just use tokenisation first, if 2 big then use lemmatisation
    - for now, just use all words (don't worry about filtering for high frequency words)
    - Use a German version of the word2vec model (see what the baroni one is based on and try to find a German equivalent)
    - use k-means or ward's for the clustering

Open Questions
- How should I handle placenames? See if they form clusters? Filter using gazetteer?
- **HOW DO I CONNECT TO CLASS 3 AND 6 AREAS?** Do I run them separately (but then end up with different clusters for each) or do I run the model on everything to create a classifier and then run that on the two separate corpora? I think the latter makes more sense but I don't really understand how this would work.


Steps:
1. Initial Cleaning
2. Language Handling 
3. Tokenisation
4. Pre-processing (filtering stop words etc, lower-case)
5. Semantic analysis (word2vec)
6. Clustering


In [70]:
# SETUP

# Import packages
import pandas as pd
import pickle

import spacy 
from langdetect import detect
from deep_translator import GoogleTranslator
from collections import Counter
from gensim.models import KeyedVectors  # requires scipy version 1.12 (anything newer doesn't work)

from sklearn.cluster import KMeans


### Step 1: Initial Cleaning

General text preparation to get the data in a format which works for translation and for use with spacy.

In [None]:
# STEP 1: LOAD & CLEAN

# Load the master CSV from rq2_step1_data_collection
master = pd.read_csv("./processing/master.csv")

# Add ['None'] to any blank rows
# this is necessary for the next step, but then they will be removed later
master.fillna("['None']", inplace=True)

# Extract only the columns with text and combine them into a single column
# I am keeping the descriptions & photo captions separate from the comments as sometimes
# the comments are in a different language (so they need to be translated)
raw_text = pd.DataFrame()
raw_text["desc_capt"] = master["description text"] + " " + master["photo_captions"]
raw_text["comments"] = master["comments"] 

# Now remove all the ['None'] text from both columns
raw_text["desc_capt"] = raw_text["desc_capt"].str.replace(r"\['None'\]", "", regex=True)
raw_text["comments"] = raw_text["comments"].str.replace(r"\['None'\]", "", regex=True)

# Remove certain special characters: [, ], ', |, /, \ 
raw_text["desc_capt"] = raw_text["desc_capt"].str.replace(r"\[", "", regex=True)
raw_text["desc_capt"] = raw_text["desc_capt"].str.replace(r"\]", "", regex=True)
raw_text["desc_capt"] = raw_text["desc_capt"].str.replace(r"\'", "", regex=True)
raw_text["desc_capt"] = raw_text["desc_capt"].str.replace(r"\|", "", regex=True)
raw_text["desc_capt"] = raw_text["desc_capt"].str.replace(r"\\", "", regex=True)
raw_text["desc_capt"] = raw_text["desc_capt"].str.replace(r"\/", "", regex=True)

raw_text["comments"] = raw_text["comments"].str.replace(r"\[", "", regex=True)
raw_text["comments"] = raw_text["comments"].str.replace(r"\]", "", regex=True)
raw_text["comments"] = raw_text["comments"].str.replace(r"\'", "", regex=True)
raw_text["comments"] = raw_text["comments"].str.replace(r"\|", "", regex=True)
raw_text["comments"] = raw_text["comments"].str.replace(r"\\", "", regex=True)
raw_text["comments"] = raw_text["comments"].str.replace(r"\/", "", regex=True)

# This is to address a specific issue in one of the entries
raw_text["desc_capt"] = raw_text["desc_capt"].str.replace(r"\n", " ", regex=True)

# Create lists from the 2 columns
raw_text_p1 = raw_text["desc_capt"].astype(str).values.tolist()
raw_text_p2 = raw_text["comments"].astype(str).values.tolist()

# Convert entries which are just a space (" ") to be empty ("") - only needed for p1
raw_text_p1 = [x.strip(' ') for x in raw_text_p1]

# Remove all empty entries
raw_text_p1 = list(filter(None, raw_text_p1))
raw_text_p2 = list(filter(None, raw_text_p2))

# Combine into 1 list
raw_text_list = raw_text_p1 + raw_text_p2

# Check
raw_text_list

### Step 2: Language Handling: Translate to German

As the vast majority of text is in German, I will use this as the base language. I have the option now to either remove anything not in German or to translate it. Although translation is not ideal (maybe some of original meaning is lost/altered), I think this is a better option than just removing the other languages entirely. So here I will tranlsate everything into German. 

langdetect package: https://anaconda.org/conda-forge/langdetect 
- Use to check if already German

deep-translator package: https://pypi.org/project/deep-translator/
- Using Google Translate as it doesn't require API key and can auto detect the input language (DeepL requires API key and I don't think it has an auto-detect option)


In [None]:
# STEP 2: TRANSLATE TO GERMAN (RUN ONCE!)

# Create list for all translated text
raw_text_de = []

# Check language, translate as needed and append to list 
for text_chunk in raw_text_list:
    # Check language
    input_lang = detect(text_chunk) 
    # If already German, append to German list
    if input_lang == "de":
        raw_text_de.append(text_chunk)
    # If not German, translate and append to German list    
    else:
        # Translate using Google Translate, use auto-detection for input language
        translated_chunk = GoogleTranslator(source='auto', target='de').translate(text=text_chunk)
        raw_text_de.append(translated_chunk)

# Check
raw_text_de


In [4]:
detect('Zorge Foto, Foto, Foto, Foto, Foto, Foto')

'it'

In [5]:
detect('Photo')

'vi'

## PROBLEM: 
"Foto" sometimes gets converted to "Fotos"! This is because langDetect assumes "Foto" is Italian and then translates that to "Fotos" in German

Testing with "Photo" also shows that it outputs Vietnamese??

Solutions?
- different language detector?
- remove "Fotos" as well as "Foto" (as stop word)  -- for now this is what I've done
- remove "Foto" earlier on in pre-processing

In [7]:
# STEP 2: SAVE TRANSLATED TEXT

pickle.dump(raw_text_de, open("./processing/raw_text_de.p", "wb"))

In [8]:
# STEP 2: LOAD TRANSLATED TEXT

raw_text_de = pickle.load(open("./processing/raw_text_de.p", "rb"))

### Step 3: Tokenisation

German spacy model options: https://spacy.io/models/de 

de_core_news_sm (I've installed this one so far)
de_core_news_md
de_core_news_lg
de_dep_news_trf

Which is best to use?

**REVISIT THIS LATER**


In [None]:
# STEP 3: TOKENISATION

# Load the spacy model
nlp = spacy.load("de_core_news_sm")

# Create an empty list to store the tokens
doc = []

# Tokenise the raw_text input
for string in raw_text_de:
    doc.extend(nlp(string))

# Print the tokens to check
for token in doc:
    print(token)


Finndorff-Blockland-St
.
Jürgen-Ritterhude-Findorff
Pause
am
Wümme
Deich
,
Am
Wümme
Deich
,
Weitere
Picknick
Möglichkeit
mit
Bank
,
Foto
,
Gastronomie
mit
WC
,
Gastronomie
mit
WC
,
Foto
,
Weitere
Einkehr
Möglichkeit
mit
ECHTErrr
,
Foto
,
Kirche
St.
Jürgen
,
Möglichkeit
für
Picknick
auf
der
Wiese
!
,
Eiscafe
in
Ritterhude
,
Wümme
Brücke
,
Dammsiel
,
Seitenwechsel
über
die
Wümme
Flughafenrunde
Foto
,
Foto
,
Foto
,
Abzweig
zum
Park
links
der
Weser
,
ausgeschildert
!
!
,
Abzweig
zum
Hotel
Robben
,
Foto
,
Foto
,
Foto
,
Ochtumdeich
,
Fußweg
,
ein
Stück
schieben
,
wunderbarer
Weg
!
,
Abzeig
zu
Picknickplatz
,
Foto
,
Foto
,
Foto
,
Badestrand
,
Foto
Ohz
,
Cafe
,
Foto
WSV
Hasenbühren
-
Bremen
,
Stephaniebrücke
über
Strom
+
Huchting
Knotenpunkt
Abzweig
zur
Flughafenrunde
,
Unser
Weg
heute
Richtung
Centrum
,
Endpunkt
Stephaniebrücke
auf
der
Neustadtseite
Rund
um
Bremen
Start
,
Silbersee
mit
Bade-
 
Picknick
Möglichkeit
und
WC
,
Wassermühle
Barrien
,
Picknick
Platz
Radtour
Foto
,
Foto
,
Foto
,
Foto

### Step 4: Pre-processing 

Remove stop words, punctuation & numbers from the token list. 

Convert all to lower-case.

**Also:** 
- remove the word "Foto" as this is just placeholder text

In [None]:
# STEP 4: STOP WORDS ETC

# Add word "Foto" to stop list
nlp.vocab["Foto"].is_stop = True
nlp.vocab["Fotos"].is_stop = True

# Filter out tokens that are stop words (is_stop), puncutation (is_punct), 
# numbers (is_digit & like_num) OR spaces (is_space)
filtered_tokens = [token.text for token in doc if not token.is_stop | token.is_punct | 
                   token.is_digit | token.like_num | token.is_space]

# Empty list for lower-case versions
filtered_tokens_lc = []

# Convert to lower-case
for token in filtered_tokens:
    token_lc = token.lower()
    filtered_tokens_lc.append(token_lc)

# Check
for token in filtered_tokens_lc:
    print(token)

print(filtered_tokens_lc[0])

finndorff-blockland-st
jürgen-ritterhude-findorff
pause
wümme
deich
wümme
deich
picknick
möglichkeit
bank
gastronomie
wc
gastronomie
wc
einkehr
möglichkeit
echterrr
kirche
st.
jürgen
möglichkeit
picknick
wiese
eiscafe
ritterhude
wümme
brücke
dammsiel
seitenwechsel
wümme
flughafenrunde
abzweig
park
links
weser
ausgeschildert
abzweig
hotel
robben
ochtumdeich
fußweg
stück
schieben
wunderbarer
weg
abzeig
picknickplatz
badestrand
ohz
cafe
wsv
hasenbühren
bremen
stephaniebrücke
strom
+
huchting
knotenpunkt
abzweig
flughafenrunde
weg
richtung
centrum
endpunkt
stephaniebrücke
neustadtseite
bremen
start
silbersee
bade-
picknick
möglichkeit
wc
wassermühle
barrien
picknick
platz
radtour
alte
luege
zorge
clausthal
zellerfeld
nah
lautenthal
hauptschacht
lautenthal
maaßener
geipel
rundweg
runde
frühlingshaften
harlyberg
start-
endpunkt
parkplatz
auto
fahrrad
alten
forsthaus
nordwestlich
klostergut
wöltingerode
abzweig
mammutbaum
momentan
getarnt
gefällten
baum
bismarck-denkmal
blütenpracht
gottsched

In [35]:
# STEP 4: CHECKING SOME RESULTS? :) 

word_freq = Counter(filtered_tokens_lc)

common_words = word_freq.most_common(20)

common_words

[('hwn', 79),
 ('braunschweig', 65),
 ('weg', 37),
 ('harzer', 28),
 ('links', 27),
 ('rechts', 24),
 ('schöne', 22),
 ('brocken', 22),
 ('parkplatz', 21),
 ('wandernadel', 21),
 ('sachsenhagen', 21),
 ('start', 19),
 ('nummer', 19),
 ('wald', 18),
 ('st.', 17),
 ('bad', 16),
 ('tour', 16),
 ('export', 16),
 ('km', 15),
 ('route', 15)]

I think HWN stands for Harzer Wandernadel: a hiking badge system for the Harz mountains https://en.wikipedia.org/wiki/Harzer_Wandernadel

### STEP 5: Semantic analysis (word2vec)

Using the de_wiki word2vec model from https://sites.google.com/site/fritzgntr/software-resources/semantic_spaces because it is the closest one to the baroni model used in the Chai-allah paper but for German. 

The semantic spaces are provided in .rda format for R but can be export as a txt for use outside R using the following commands (in R). NOTE: I had to adjust from the instruction on the source website as it didn't account for use in gensim, which then meant I had problems with quotes and separators.

 load("C:/Users/ninam/Documents/UZH/04_Thesis/code/nm_forest_thesis/word2vec/de_wiki.rda")

 write.table(de_wiki, file = "C:/Users/ninam/Documents/UZH/04_Thesis/code/nm_forest_thesis/word2vec/de_wiki.txt", row.names = TRUE, col.names = FALSE, quote = FALSE, sep = " ")
 

 Some useful notes/resources:
 - "a word embedding refers to a vector representation of a particular word or phrase in a multidimensional space" (Generally this website is helpful: https://okan.cloud/posts/2022-05-02-text-vectorization-using-python-word2vec/)
 - https://medium.com/@dilip.voleti/classification-using-word2vec-b1d79d375381
 - https://medium.com/@denis.arvizu/text-clustering-using-word2vec-a89fbd9b9d0f

In [65]:
# STEP 5: MODEL PREP (RUN ONCE!)

# First I need to adjust the encoding and add the header information required by gensim
# Then I can save it in gensim format for easier use

# Count rows and vector size for header info
with open("./word2vec/de_wiki.txt", encoding="ISO-8859-1") as f:
    lines = f.readlines()

num_words = len(lines)
vector_size = len(lines[0].split()) - 1

# Write in utf-8 with header info
with open("./word2vec/de_wiki_utf8_header.txt", "w", encoding="utf-8") as f:
    f.write(f"{num_words} {vector_size}\n")
    f.writelines(lines)

# Load model (from new txt file) with gensim
model = KeyedVectors.load_word2vec_format("./word2vec/de_wiki_utf8_header.txt", binary=False)

# Save the model in optimised gensim format (this will make the loading faster for next time)
model.save("./word2vec/de_wiki_final.model")

In [None]:
# STEP 5: CHECKING MODEL

# Load model from model file
model = KeyedVectors.load('./word2vec/de_wiki.model')

# Which tokens are missing in model?
known_tokens = [token for token in filtered_tokens_lc if token in model]
print("Missing tokens:", [token for token in filtered_tokens_lc if token not in model])


Missing tokens: ['finndorff-blockland-st', 'jürgen-ritterhude-findorff', 'echterrr', 'st.', 'dammsiel', 'flughafenrunde', 'ochtumdeich', 'abzeig', 'ohz', 'hasenbühren', 'stephaniebrücke', '+', 'flughafenrunde', 'stephaniebrücke', 'neustadtseite', 'bade-', 'maaßener', 'harlyberg', 'start-', 'bismarck-denkmal', 'gottsched-platz', 'harlyburg', 'harlyturm', 'kräuter-august-höhle', 'mammutbaum-blick', 'mammutbaum-hütte', 'waldmännecken-höhle', 'sommerfreunde', 'mittelweg-lärchenweg', 'winsenluhe', 'leinemasch', 'obernfelde', 'trainen', 'ochtrup-laer', 'drilandsee', 'hafensänger', 'weser-ems', 'schienen-ersatzverkehr', 'reimersstraße', 'röseteich', '-harz', 'wald-', 'destedt', 'eleonorwald', 'markatal', 'produktionswaldweg', 'waldeslust', 'uuuuber', 'volkensen', 'meckelsen', 'veenland', 'på', 'cykel', 'überraschungen', 'streckenstrecken', 'varrying', '-landschaft', 'brocken-benno', 'brockengipfel', 'brockenstrasse', 'gipfelrundweg', 'oste-radweg', 'ochterhausen', 'ochterhausen', 'a2', 'grani

In [None]:
# STEP 5: CONVERT TOKENS TO VECTORS (CREATE WORD EMBEDDINGS)

# Load model from model file
model = KeyedVectors.load('./word2vec/de_wiki.model')

# Create empty list to store vectors
vectors = []

# Store the vectors for each token
for token in filtered_tokens:
    if token in model:
        vector = model[token]
        vectors.append(vector)


[array([-1.44818e-01,  3.08170e-02, -9.70480e-02,  7.86400e-03,
         8.95480e-02,  8.15770e-02,  4.09540e-02,  7.14900e-02,
         5.43100e-02, -2.46690e-02, -1.47108e-01,  2.49465e-01,
         3.97720e-02, -2.11505e-01,  1.88090e-02,  1.71335e-01,
        -3.43880e-02,  1.06663e-01, -3.25270e-02,  5.17860e-02,
         1.14431e-01,  2.47603e-01,  2.05802e-01,  3.07360e-02,
        -3.92850e-02,  1.40030e-01,  1.09702e-01,  6.20840e-02,
        -8.79650e-02,  3.24211e-01,  2.72725e-01,  5.86880e-02,
         2.05811e-01,  5.04680e-02, -4.52870e-02, -1.22580e-01,
        -2.19416e-01, -7.97830e-02, -4.18760e-02, -2.03307e-01,
         1.40132e-01, -2.50250e-02, -3.89870e-02, -1.17628e-01,
        -1.87234e-01, -5.51410e-02,  1.14580e-02, -1.97790e-02,
         1.35094e-01,  7.50880e-02, -1.23295e-01,  3.27260e-02,
         1.09390e-01, -1.78334e-01, -1.48147e-01, -2.78240e-02,
        -3.06106e-01,  1.14463e-01,  1.04188e-01,  9.41460e-02,
        -1.12331e-01,  1.15540e-02, -1.0

In [66]:
# TEST 

model["schöne"]

model.most_similar("schöne", topn=2)



[('wunderschöne', 0.7176626324653625), ('tolle', 0.6318803429603577)]

### Step 6: Clustering

I'll try it out with k-means first.

In [71]:
# Test from chatgpt

kmeans = KMeans(n_clusters=5)
kmeans.fit(vectors)

labels = kmeans.labels_

for token, label in zip(filtered_tokens_lc, labels):
    print(f"Token: {token}, Cluster: {label}")

Token: finndorff-blockland-st, Cluster: 2
Token: jürgen-ritterhude-findorff, Cluster: 2
Token: pause, Cluster: 1
Token: wümme, Cluster: 2
Token: deich, Cluster: 2
Token: wümme, Cluster: 2
Token: deich, Cluster: 2
Token: picknick, Cluster: 2
Token: möglichkeit, Cluster: 1
Token: bank, Cluster: 2
Token: gastronomie, Cluster: 2
Token: wc, Cluster: 2
Token: gastronomie, Cluster: 2
Token: wc, Cluster: 1
Token: einkehr, Cluster: 2
Token: möglichkeit, Cluster: 2
Token: echterrr, Cluster: 2
Token: kirche, Cluster: 2
Token: st., Cluster: 2
Token: jürgen, Cluster: 2
Token: möglichkeit, Cluster: 2
Token: picknick, Cluster: 1
Token: wiese, Cluster: 1
Token: eiscafe, Cluster: 2
Token: ritterhude, Cluster: 2
Token: wümme, Cluster: 2
Token: brücke, Cluster: 2
Token: dammsiel, Cluster: 2
Token: seitenwechsel, Cluster: 2
Token: wümme, Cluster: 2
Token: flughafenrunde, Cluster: 2
Token: abzweig, Cluster: 2
Token: park, Cluster: 2
Token: links, Cluster: 0
Token: weser, Cluster: 2
Token: ausgeschildert, C

