## RQ2 Text Analysis

General plan:
- Follow a similar approach to the Chai-Allah "Data Mining..." paper to create clusters from the data
    - Use tokenisation only at first; if the vocabulary is too big, add lemmatisation
    - For now, use all words (don't worry about filtering for high-frequency words)
    - Use a German word2vec model (see what the Baroni one is based on and try to find a German equivalent)
    - Use k-means or Ward's method for the clustering
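
To make the clustering step in the plan concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). The 2-D vectors in `toy_vectors` are made-up stand-ins for real word2vec embeddings, and `kmeans` is a hypothetical helper, not code from the paper. In practice a library implementation such as scikit-learn's `KMeans`, or `AgglomerativeClustering` with `linkage="ward"`, would likely be the better choice.

```python
import math
import random


def kmeans(vectors, k, n_iter=20, seed=0):
    """Plain k-means on a list of equal-length float lists.

    Returns (labels, centroids). Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Start from k randomly chosen data points
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Made-up 2-D "embeddings" (assumption: real ones would come from word2vec)
toy_vectors = {
    "Wald": [0.9, 0.1],
    "Baum": [0.8, 0.2],
    "Wiese": [0.85, 0.15],
    "Cafe": [0.1, 0.9],
    "Hotel": [0.2, 0.8],
    "Gastronomie": [0.15, 0.95],
}

words = list(toy_vectors)
labels, centroids = kmeans([toy_vectors[w] for w in words], k=2)
clusters = dict(zip(words, labels))
print(clusters)
```

With real embeddings the number of clusters `k` would need to be chosen separately, for example by comparing several values.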

Open Questions
- How should I handle placenames? See if they form clusters? Filter using gazetteer?
- **HOW DO I CONNECT TO CLASS 3 AND 6 AREAS?** Do I run them separately (but then end up with different clusters for each) or do I run the model on everything to create a classifier and then run that on the two separate corpora? I think the latter makes more sense but I don't really understand how this would work.
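
One possible reading of the second option: fit the clusters once on the combined corpus, then assign the words from each class to the nearest shared cluster and compare the resulting proportions. The sketch below assumes the centroids already exist; `shared_centroids`, `toy_vectors`, the token lists and both helper functions are hypothetical placeholders, not real data.

```python
import math
from collections import Counter


def nearest_cluster(vector, centroids):
    """Index of the centroid closest to `vector` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))


def cluster_profile(tokens, vectors, centroids):
    """Share of a corpus's tokens that falls into each shared cluster."""
    counts = Counter(
        nearest_cluster(vectors[t], centroids) for t in tokens if t in vectors
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: counts[c] / total for c in range(len(centroids))}


# Assumption: centroids were fitted once on the combined (class 3 + class 6) text
shared_centroids = [[0.85, 0.15], [0.15, 0.85]]

# Made-up 2-D stand-ins for word embeddings
toy_vectors = {
    "Wald": [0.9, 0.1],
    "Baum": [0.8, 0.2],
    "Wiese": [0.85, 0.15],
    "Cafe": [0.1, 0.9],
    "Hotel": [0.2, 0.8],
}

# Made-up token lists for the two area classes
class3_tokens = ["Wald", "Baum", "Wald", "Cafe"]
class6_tokens = ["Cafe", "Hotel", "Wiese", "Hotel"]

profile3 = cluster_profile(class3_tokens, toy_vectors, shared_centroids)
profile6 = cluster_profile(class6_tokens, toy_vectors, shared_centroids)
print(profile3, profile6)
```

Because both corpora are described with the same clusters, the two profiles are directly comparable, which is the difficulty with running the clustering separately for each class.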


Steps:
1. Initial Cleaning
2. Language Handling (skipped for now)
3. Tokenisation
4. Remove stop words and punctuation

In [103]:
# SETUP

# Import packages
import pandas as pd
import spacy 
from langdetect import detect
from deep_translator import GoogleTranslator
from collections import Counter

### Step 1: Initial Cleaning

General text preparation to get the data in a format which works for translation and for use with spacy.

In [118]:
# STEP 1: LOAD & CLEAN

# Load the master CSV from rq2_step1_data_collection
master = pd.read_csv("./processing/master.csv")

# Add ['None'] to any blank rows
# this is necessary for the next step, but then they will be removed later
master.fillna("['None']", inplace=True)

# Extract only the columns with text and combine them into a single column
# I am keeping the descriptions & photo captions separate from the comments as sometimes
# the comments are in a different language (so they need to be translated)
raw_text = pd.DataFrame()
raw_text["desc_capt"] = master["description text"] + " " + master["photo_captions"]
raw_text["comments"] = master["comments"] 

# Now remove all the ['None'] text from both columns, then remove
# certain special characters: [, ], ', |, /, \
for col in ["desc_capt", "comments"]:
    raw_text[col] = raw_text[col].str.replace(r"\['None'\]", "", regex=True)
    raw_text[col] = raw_text[col].str.replace(r"[\[\]'|\\/]", "", regex=True)

# This is to address a specific issue in one of the entries
raw_text["desc_capt"] = raw_text["desc_capt"].str.replace(r"\n", " ", regex=True)

# Create lists from the 2 columns
raw_text_p1 = raw_text["desc_capt"].astype(str).values.tolist()
raw_text_p2 = raw_text["comments"].astype(str).values.tolist()

# Convert entries which are just a space (" ") to be empty ("") - only needed for p1
raw_text_p1 = [x.strip(' ') for x in raw_text_p1]

# Remove all empty entries
raw_text_p1 = list(filter(None, raw_text_p1))
raw_text_p2 = list(filter(None, raw_text_p2))

# Combine into 1 list
raw_text_list = raw_text_p1 + raw_text_p2

# Check
raw_text_list

['Finndorff-Blockland-St. Jürgen-Ritterhude-Findorff Pause am Wümme Deich, Am Wümme Deich, Weitere Picknick Möglichkeit mit Bank, Foto, Gastronomie mit WC, Gastronomie mit WC, Foto, Weitere Einkehr Möglichkeit mit ECHTErrr, Foto, Kirche St. Jürgen, Möglichkeit für Picknick auf der Wiese!, Eiscafe in Ritterhude, Wümme Brücke, Dammsiel, Seitenwechsel über die Wümme',
 'Flughafenrunde Foto, Foto, Foto, Abzweig zum Park links der Weser , ausgeschildert!!, Abzweig zum Hotel Robben, Foto, Foto, Foto, Ochtumdeich, Fußweg, ein Stück schieben , wunderbarer Weg!, Abzeig zu Picknickplatz, Foto, Foto, Foto, Badestrand, Foto',
 'OHZ, Café, Foto',
 'WSV Hasenbühren - Bremen, Stephaniebrücke über Strom + Huchting Knotenpunkt Abzweig zur Flughafenrunde, Unser Weg heute Richtung Centrum, Endpunkt Stephaniebrücke auf der Neustadtseite',
 'Rund um Bremen Start, Silbersee mit Bade-  Picknick Möglichkeit und WC, Wassermühle Barrien , Picknick Platz',
 'Radtour Foto, Foto, Foto, Foto, Foto, Foto, Foto, Foto

### Step 2: Language Handling: Translate to German

As the vast majority of the text is in German, I will use German as the base language. I can now either remove anything not in German or translate it. Translation is not ideal (some of the original meaning may be lost or altered), but I think it is a better option than removing the other languages entirely. So here I will translate everything into German.

langdetect package: https://anaconda.org/conda-forge/langdetect 
- Use to check if already German

deep-translator package: https://pypi.org/project/deep-translator/
- Using Google Translate as it doesn't require API key and can auto detect the input language (DeepL requires API key and I don't think it has an auto-detect option)


In [None]:
# STEP 2: TRANSLATE TO GERMAN

# Create list for all translated text
raw_text_de = []

# Check language, translate as needed and append to list 
for text_chunk in raw_text_list:
    # Check language
    input_lang = detect(text_chunk) 
    # If already German, append to German list
    if input_lang == "de":
        raw_text_de.append(text_chunk)
    # If not German, translate and append to German list    
    else:
        # Translate using Google Translate, use auto-detection for input language
        translated_chunk = GoogleTranslator(source='auto', target='de').translate(text=text_chunk)
        raw_text_de.append(translated_chunk)

# Check
raw_text_de



['Finndorff-Blockland-St. Jürgen-Ritterhude-Findorff Pause am Wümme Deich, Am Wümme Deich, Weitere Picknick Möglichkeit mit Bank, Foto, Gastronomie mit WC, Gastronomie mit WC, Foto, Weitere Einkehr Möglichkeit mit ECHTErrr, Foto, Kirche St. Jürgen, Möglichkeit für Picknick auf der Wiese!, Eiscafe in Ritterhude, Wümme Brücke, Dammsiel, Seitenwechsel über die Wümme',
 'Flughafenrunde Foto, Foto, Foto, Abzweig zum Park links der Weser , ausgeschildert!!, Abzweig zum Hotel Robben, Foto, Foto, Foto, Ochtumdeich, Fußweg, ein Stück schieben , wunderbarer Weg!, Abzeig zu Picknickplatz, Foto, Foto, Foto, Badestrand, Foto',
 'Ohz, Cafe, Foto',
 'WSV Hasenbühren - Bremen, Stephaniebrücke über Strom + Huchting Knotenpunkt Abzweig zur Flughafenrunde, Unser Weg heute Richtung Centrum, Endpunkt Stephaniebrücke auf der Neustadtseite',
 'Rund um Bremen Start, Silbersee mit Bade-  Picknick Möglichkeit und WC, Wassermühle Barrien , Picknick Platz',
 'Radtour Foto, Foto, Foto, Foto, Foto, Foto, Foto, Foto

In [119]:
detect('Zorge Foto, Foto, Foto, Foto, Foto, Foto')

'it'

In [None]:
detect('Photo')

'en'

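## PROBLEM:
"Foto" sometimes gets converted to "Fotos"! This happens because langdetect classifies short, "Foto"-heavy strings as Italian (see the `'it'` result above), so they are sent to Google Translate, which then renders "Foto" as "Fotos" in German.

langdetect is also non-deterministic on short inputs unless `DetectorFactory.seed` is set, which would explain why testing with "Photo" gave Vietnamese in one run, although the run shown above returned `'en'`.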

Solutions?
- different language detector?
- remove "Fotos" as well as "Foto" (as stop word)
- remove "Foto" earlier on in pre-processing
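
The third option (removing "Foto" before language detection) could be sketched as below. `strip_placeholder` is a hypothetical helper that uses only the standard library; it assumes "Foto"/"Fotos" only ever appear as placeholder captions, so it would also delete the word from genuine sentences.

```python
import re

# Match "Foto" or "Fotos" as a whole word, in any capitalisation.
# Compound words such as "Fotostopp" are left alone.
PLACEHOLDER = re.compile(r"\bFotos?\b", flags=re.IGNORECASE)


def strip_placeholder(text):
    """Remove placeholder captions and tidy the separators left behind."""
    no_foto = PLACEHOLDER.sub("", text)
    # Collapse runs of commas (with any surrounding spaces) into one ", "
    commas = re.sub(r"\s*(?:,\s*)+", ", ", no_foto)
    # Collapse any remaining runs of whitespace into a single space
    spaces = re.sub(r"\s{2,}", " ", commas)
    # Drop leading and trailing spaces or commas
    return spaces.strip(" ,")


examples = [
    "Zorge Foto, Foto, Foto, Foto, Foto, Foto",
    "OHZ, Café, Foto",
    "Fotostopp am Deich, Foto",
]
cleaned = [strip_placeholder(e) for e in examples]
print(cleaned)
```

Applied to `raw_text_list` before Step 2, this would leave the language detector with only the real words in each entry. Note that it also normalises comma spacing as a side effect.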

### Step 3: Tokenisation

German spacy model options: https://spacy.io/models/de 

- de_core_news_sm (I've installed this one so far)
- de_core_news_md
- de_core_news_lg
- de_dep_news_trf

Which is best to use?

**REVISIT THIS LATER**


In [113]:
# STEP 3: TOKENISATION

# Load the spacy model
nlp = spacy.load("de_core_news_sm")

# Create an empty list to store the tokens
doc = []

# Tokenise the raw_text input
for string in raw_text_de:
    doc.extend(nlp(string))

# Print the tokens to check
for token in doc:
    print(token)


Finndorff-Blockland-St
.
Jürgen-Ritterhude-Findorff
Pause
am
Wümme
Deich
,
Am
Wümme
Deich
,
Weitere
Picknick
Möglichkeit
mit
Bank
,
Foto
,
Gastronomie
mit
WC
,
Gastronomie
mit
WC
,
Foto
,
Weitere
Einkehr
Möglichkeit
mit
ECHTErrr
,
Foto
,
Kirche
St.
Jürgen
,
Möglichkeit
für
Picknick
auf
der
Wiese
!
,
Eiscafe
in
Ritterhude
,
Wümme
Brücke
,
Dammsiel
,
Seitenwechsel
über
die
Wümme
Flughafenrunde
Foto
,
Foto
,
Foto
,
Abzweig
zum
Park
links
der
Weser
,
ausgeschildert
!
!
,
Abzweig
zum
Hotel
Robben
,
Foto
,
Foto
,
Foto
,
Ochtumdeich
,
Fußweg
,
ein
Stück
schieben
,
wunderbarer
Weg
!
,
Abzeig
zu
Picknickplatz
,
Foto
,
Foto
,
Foto
,
Badestrand
,
Foto
Ohz
,
Cafe
,
Foto
WSV
Hasenbühren
-
Bremen
,
Stephaniebrücke
über
Strom
+
Huchting
Knotenpunkt
Abzweig
zur
Flughafenrunde
,
Unser
Weg
heute
Richtung
Centrum
,
Endpunkt
Stephaniebrücke
auf
der
Neustadtseite
Rund
um
Bremen
Start
,
Silbersee
mit
Bade-
 
Picknick
Möglichkeit
und
WC
,
Wassermühle
Barrien
,
Picknick
Platz
Radtour
Foto
,
Foto
,
Foto
,
Foto

### Step 4: Filtering (stop words, punctuation, numbers)

Remove stop words, punctuation & numbers from the token list. 

**Also:** 
- remove the word "Foto" as this is just placeholder text

In [120]:
# STEP 4: STOP WORDS ETC

# Flag the placeholder words "Foto" and "Fotos" as stop words
# (this sets the flag for these exact spellings only)
nlp.vocab["Foto"].is_stop = True
nlp.vocab["Fotos"].is_stop = True

# Filter out tokens that are stop words (is_stop), punctuation (is_punct),
# numbers (is_digit & like_num) OR spaces (is_space)
filtered_tokens = [
    token.text for token in doc
    if not (token.is_stop or token.is_punct or token.is_digit
            or token.like_num or token.is_space)
]

# Check
for token in filtered_tokens:
    print(token)

Finndorff-Blockland-St
Jürgen-Ritterhude-Findorff
Pause
Wümme
Deich
Wümme
Deich
Picknick
Möglichkeit
Bank
Gastronomie
WC
Gastronomie
WC
Einkehr
Möglichkeit
ECHTErrr
Kirche
St.
Jürgen
Möglichkeit
Picknick
Wiese
Eiscafe
Ritterhude
Wümme
Brücke
Dammsiel
Seitenwechsel
Wümme
Flughafenrunde
Abzweig
Park
links
Weser
ausgeschildert
Abzweig
Hotel
Robben
Ochtumdeich
Fußweg
Stück
schieben
wunderbarer
Weg
Abzeig
Picknickplatz
Badestrand
Ohz
Cafe
WSV
Hasenbühren
Bremen
Stephaniebrücke
Strom
+
Huchting
Knotenpunkt
Abzweig
Flughafenrunde
Weg
Richtung
Centrum
Endpunkt
Stephaniebrücke
Neustadtseite
Bremen
Start
Silbersee
Bade-
Picknick
Möglichkeit
WC
Wassermühle
Barrien
Picknick
Platz
Radtour
Alte
Luege
Zorge
Clausthal
Zellerfeld
Nah
Lautenthal
Hauptschacht
Lautenthal
Maaßener
Geipel
Rundweg
Runde
frühlingshaften
Harlyberg
Start-
Endpunkt
Parkplatz
Auto
Fahrrad
Alten
Forsthaus
nordwestlich
Klostergut
Wöltingerode
Abzweig
Mammutbaum
Momentan
getarnt
gefällten
Baum
Bismarck-Denkmal
Blütenpracht
Gottsched
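
The output above still contains a bare `+`, which the spaCy flags used in the filter do not catch. A possible extra pass over the plain token strings is sketched below; `post_filter` and `EXTRA_STOP` are hypothetical names, and the rule "keep only tokens with at least one letter" is an assumption about what counts as a useful token.

```python
# Placeholder words to drop regardless of capitalisation (assumption)
EXTRA_STOP = {"foto", "fotos"}


def post_filter(tokens):
    """Drop placeholder words and tokens that contain no letters at all."""
    kept = []
    for tok in tokens:
        # Case-insensitive match against the extra stop words
        if tok.casefold() in EXTRA_STOP:
            continue
        # Symbols such as "+" have no alphabetic character
        if not any(ch.isalpha() for ch in tok):
            continue
        kept.append(tok)
    return kept


sample = ["Strom", "+", "Huchting", "FOTO", "Foto", "St.", "Bade-"]
result = post_filter(sample)
print(result)
```

Running this on `filtered_tokens` would remove the `+` and any differently capitalised "Foto" variants while keeping tokens such as `St.` and `Bade-`.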

In [123]:
word_freq = Counter(filtered_tokens)

common_words = word_freq.most_common(20)

common_words

[('HWN', 79),
 ('Braunschweig', 65),
 ('Weg', 36),
 ('Harzer', 28),
 ('Links', 23),
 ('Rechts', 22),
 ('Brocken', 22),
 ('Parkplatz', 21),
 ('Wandernadel', 21),
 ('Sachsenhagen', 21),
 ('Start', 19),
 ('Nummer', 19),
 ('Wald', 18),
 ('St.', 17),
 ('export', 16),
 ('Tour', 15),
 ('km', 15),
 ('Route', 15),
 ('Bad', 14),
 ('schöne', 14)]

I think HWN stands for Harzer Wandernadel, a hiking badge system for the Harz mountains (https://en.wikipedia.org/wiki/Harzer_Wandernadel). This would fit with "Harzer" and "Wandernadel" also appearing in the top 20.