## RQ3 Full Process

General plan:
- Load the tokens for classes 3 and 6 (created in RQ2)
- Assign each token its cluster from RQ2
- Isolate the clusters I'm interested in
- Compare those clusters between the two classes (word clouds, possibly a cluster membership metric)

To do: decide how to handle the max class. Should it be the max across all classes, the max across classes 3, 4, 5 and 6, or the max between classes 3 and 6 only?
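The three options for the max class can be compared side by side with a small sketch. It assumes that "max class" means the class with the highest score for an item and that per-class scores are available as a mapping; the `scores` values, the class numbers 1 to 6 and the helper `max_class` are all hypothetical and only illustrate how the choice of candidate set changes the answer.

```python
# Hypothetical per-class scores for a single item (illustration only;
# the real scores would come from wherever the classes were assigned).
scores = {1: 0.40, 2: 0.05, 3: 0.15, 4: 0.25, 5: 0.05, 6: 0.10}


def max_class(scores, candidates=None):
    """Return the class with the highest score.

    If candidates is given, only those classes are considered.
    """
    if candidates is not None:
        scores = {c: s for c, s in scores.items() if c in candidates}
    return max(scores, key=scores.get)


max_all = max_class(scores)                               # across all classes
max_3_to_6 = max_class(scores, candidates={3, 4, 5, 6})   # across classes 3 to 6
max_3_or_6 = max_class(scores, candidates={3, 6})         # between classes 3 and 6

print(max_all, max_3_to_6, max_3_or_6)
```

With these made-up scores each option picks a different class, which is why the decision matters before comparing classes 3 and 6.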

In [56]:
# SETUP

# Import packages
import pandas as pd
import pickle

import spacy
from deep_translator import GoogleTranslator
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud


### Step 1: Load the two token sets for classes 3 and 6

These were created as part of RQ2.

In [44]:
# STEP 1: LOAD CLASS TOKENS

# Use context managers so the pickle files are closed after loading
with open("./processing/final_tokens_c3.p", "rb") as f:
    filtered_tokens_lc_c3 = pickle.load(f)
with open("./processing/final_tokens_c6.p", "rb") as f:
    filtered_tokens_lc_c6 = pickle.load(f)


In [None]:
# STEP 1: CHECK GENERAL RESULTS

word_freq_c3 = Counter(filtered_tokens_lc_c3)
common_words_c3 = word_freq_c3.most_common(20)

word_freq_c6 = Counter(filtered_tokens_lc_c6)
common_words_c6 = word_freq_c6.most_common(20)

common_words_c3[0:9], common_words_c6[0:9]

([('wanderung', 9),
  ('weg', 6),
  ('to', 5),
  ('entlang', 4),
  ('rauhkopf', 4),
  ('jaegerkamp', 4),
  ('linderhof', 4),
  ('m', 3),
  ('bartholomä', 3)],
 [('weg', 155),
  ('blick', 111),
  ('aussicht', 100),
  ('parkplatz', 84),
  ('burg', 82),
  ('rechts', 70),
  ('wald', 69),
  ('entlang', 68),
  ('schöne', 68)])

### Step 2: Assign clusters

Assign each token its cluster from rq2_step2_text_analysis.ipynb.

In [None]:
# STEP 2: LOAD TOKEN/CLUSTER LOOKUP FROM RQ2

# Use a context manager so the pickle file is closed after loading
with open("./processing/token_cluster_all.p", "rb") as f:
    token_cluster_lookup = pickle.load(f)

token_cluster_lookup_dict = token_cluster_lookup.set_index("token").to_dict()["cluster"]

token_cluster_lookup_dict

{'breite': 8,
 'forstwege': 2,
 'pfade': 2,
 'halber': 13,
 'strecke': 2,
 'einkehren': 10,
 'montag': 7,
 'ruhetag': 2,
 'rinken': 5,
 'schauinsland': 5,
 'arten': 8,
 'menschen': 13,
 'geeignet': 8,
 'gipfel': 2,
 'abgerundet': 8,
 'bietet': 8,
 'guten': 8,
 'blick': 6,
 'schwarzen': 13,
 'dschungel': 0,
 'nördlichen': 4,
 'hang': 13,
 'schweizer': 7,
 'berge': 0,
 'liechtenstein': 7,
 'feldberg': 5,
 'm': 14,
 'schwarzes': 13,
 'deutschland': 8,
 'linke': 13,
 'gabelung': 1,
 'feldsee': 5,
 'see': 11,
 'bank': 7,
 'top': 7,
 'monolith': 4,
 'titisee': 5,
 'camping': 10,
 'teich': 11,
 'ruhestein': 5,
 'bartholomä': 5,
 'schönen': 6,
 'fuss': 2,
 'watzmann': 2,
 'ostwand': 4,
 'gleichem': 8,
 'weg': 12,
 'entlang': 2,
 'richtung': 12,
 'norden': 4,
 'beginnt': 8,
 'wald': 0,
 'zunehmend': 8,
 'ausgesetzter': 2,
 'wanderer': 2,
 'höhenangst': 13,
 'steig': 2,
 'erreicht': 8,
 'schließlich': 13,
 'aussichtspunkt': 2,
 'schöner': 6,
 'aussicht': 6,
 'königssee': 5,
 'minuten': 13,
 'sch

In [53]:
# STEP 2: MAP TOKENS TO CLUSTERS

# Create a df with tokens as one column
final_tokens_c3_df = pd.DataFrame({'token': filtered_tokens_lc_c3})
final_tokens_c6_df = pd.DataFrame({'token': filtered_tokens_lc_c6})

# Add cluster column using dictionary as lookup
final_tokens_c3_df["cluster"] = final_tokens_c3_df["token"].map(token_cluster_lookup_dict)
final_tokens_c6_df["cluster"] = final_tokens_c6_df["token"].map(token_cluster_lookup_dict)

# Remove NaN entries (these are ones that were missing from the model during vectorisation)
final_tokens_c3_df = final_tokens_c3_df.dropna()
final_tokens_c6_df = final_tokens_c6_df.dropna()

# Convert cluster number to int
final_tokens_c3_df["cluster"] = final_tokens_c3_df["cluster"].astype(int)
final_tokens_c6_df["cluster"] = final_tokens_c6_df["cluster"].astype(int)

final_tokens_c3_df

Unnamed: 0,token,cluster
0,breite,8
1,forstwege,2
2,pfade,2
3,halber,13
4,strecke,2
...,...,...
485,oktober,7
486,museum,4
487,jena,7
488,jena,7
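Before dropping the unmatched tokens, it may be worth checking how many there are, since a large share would change the comparison between classes. The sketch below uses toy data and a hypothetical helper `unmatched_tokens`; it relies only on the fact that a token missing from the lookup dictionary gets NaN from `map`.

```python
def unmatched_tokens(tokens, lookup):
    """Return the tokens that have no entry in the token-to-cluster lookup."""
    return [t for t in tokens if t not in lookup]


# Toy data for illustration only (not the notebook's real tokens)
example_tokens = ["weg", "blick", "xyzweg", "see", "blick"]
example_lookup = {"weg": 12, "blick": 6, "see": 11}

missing = unmatched_tokens(example_tokens, example_lookup)
share_missing = len(missing) / len(example_tokens)

print(missing, f"{share_missing:.0%}")
```

In the notebook the same check would be called as `unmatched_tokens(filtered_tokens_lc_c3, token_cluster_lookup_dict)`, and likewise for class 6.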


### Step 3: Isolate clusters of interest

I'm not yet sure of the final clusters, so the clusters of interest may change. The ones used here are placeholders so I can get the process ready.


In [59]:
# STEP 3: FILTER SPECIFIC CLUSTERS

# Currently extracting only clusters 0, 6 and 11 (placeholders)
clusters_of_interest = [0, 6, 11]
filtered_clusters_c3 = final_tokens_c3_df[final_tokens_c3_df["cluster"].isin(clusters_of_interest)]
filtered_clusters_c6 = final_tokens_c6_df[final_tokens_c6_df["cluster"].isin(clusters_of_interest)]

filtered_clusters_c3

Unnamed: 0,token,cluster
23,blick,6
25,dschungel,0
30,berge,0
40,see,11
41,see,11
48,teich,11
52,schönen,6
67,wald,0
77,schöner,6
78,aussicht,6


### Step 4: Compare Class 3 and 6 Results

Planned comparisons:
- Word clouds for each cluster of interest
- Possibly a within-cluster membership metric
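One possible membership metric, sketched here as an assumption rather than a settled choice, is the share of a class's tokens that fall into each cluster of interest. Using shares instead of raw counts matters because class 6 has far more tokens than class 3 (the top token counts in Step 1 are 155 versus 9). The helper `cluster_shares` and the toy cluster lists are hypothetical.

```python
from collections import Counter


def cluster_shares(clusters, clusters_of_interest):
    """Share of all tokens (repeats included) that fall in each cluster of interest."""
    counts = Counter(clusters)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in clusters_of_interest}
    return {c: counts[c] / total for c in clusters_of_interest}


# Toy cluster assignments for illustration only
c3_clusters = [0, 0, 6, 11, 2, 2, 2, 8]
c6_clusters = [0, 6, 6, 6, 11, 11, 2, 8, 8, 8]

shares_c3 = cluster_shares(c3_clusters, [0, 6, 11])
shares_c6 = cluster_shares(c6_clusters, [0, 6, 11])

for c in [0, 6, 11]:
    print(f"cluster {c}: class 3 = {shares_c3[c]:.3f}, class 6 = {shares_c6[c]:.3f}")
```

In the notebook the inputs would be the cluster columns of the full class DataFrames from Step 2, for example `cluster_shares(final_tokens_c3_df["cluster"].tolist(), [0, 6, 11])`, so that the denominator is every token with a cluster rather than only the filtered ones.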

In [None]:
# STEP 4: WORD CLOUDS

# Clusters to plot; these must match the clusters kept in Step 3
clusters_of_interest = [0, 6, 11]

# Count the token frequencies for one cluster
def get_cluster_freq(df, clus_num):
    token_list = df.loc[df["cluster"] == clus_num, "token"]
    return Counter(token_list)

# Create an empty list for storing the word clouds
all_wc = []

# For each cluster of interest, get the token frequencies with get_cluster_freq,
# pass them to the WordCloud generator and append the word cloud to the list.
# NOTE: the earlier loop over range(3) asked for clusters 0, 1 and 2. Clusters 1
# and 2 were removed in Step 3, so their frequencies were empty and WordCloud raised
# "ValueError: We need at least 1 word to plot a word cloud, got 0."
# IMPORTANT: EACH WORD CLOUD IS LIMITED TO THE TOP 20 TOKENS IN ITS CLUSTER
for clus_num in clusters_of_interest:
    freqs = get_cluster_freq(filtered_clusters_c6, clus_num)
    wc = WordCloud(background_color='white',
                   colormap='ocean',
                   max_words=20).generate_from_frequencies(freqs)
    all_wc.append(wc)

# Create titles that match the plotted clusters
all_titles = [f"Cluster {clus_num}" for clus_num in clusters_of_interest]

# Create a figure with 3 subplots (one per cluster)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 40))
axes = axes.flatten()

# Iterate through word clouds and titles to plot
for i, (wc, title) in enumerate(zip(all_wc, all_titles)):
    axes[i].imshow(wc)
    axes[i].axis('off')
    axes[i].set_title(title, fontsize=20, fontweight="bold")

# Display final figure
plt.show()