## RQ3 Full Process

General plan:
- Split the main corpus into two: one where the dominant consensus class is Non-Consensus (max_class=3) and one where the dominant consensus class is Full Consensus Forest (max_class=6).
- Perform the same text processing as in rq2_step2_text_analysis.ipynb to generate the tokens
- Assign each token its cluster from rq2_step2_text_analysis.ipynb
- Isolate the clusters I'm interested in 
- Compare those clusters (cluster membership metric?)
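
One possible way to make the last step concrete is to compare the share of tokens that falls into each cluster in the two corpora. The sketch below is only an illustration: the token lists and the `token_to_cluster` mapping are hypothetical toy data, not output from rq2_step2_text_analysis.ipynb, and the share-based metric is an assumption rather than a settled choice.

```python
import pandas as pd

# Hypothetical token -> cluster mapping (stand-in for the RQ2 clusters)
token_to_cluster = {"wald": 0, "baum": 0, "see": 1, "ufer": 1, "gipfel": 2}

# Hypothetical token lists for the two corpora
tokens_c3 = ["wald", "see", "see", "gipfel", "ufer"]
tokens_c6 = ["wald", "baum", "wald", "see", "baum"]


def cluster_shares(tokens, mapping):
    """Return the share of mapped tokens that belongs to each cluster."""
    clusters = pd.Series([mapping[t] for t in tokens if t in mapping])
    return clusters.value_counts(normalize=True).sort_index()


# Align the two corpora on cluster ID; clusters absent from a corpus get 0
shares = pd.DataFrame({
    "c3": cluster_shares(tokens_c3, token_to_cluster),
    "c6": cluster_shares(tokens_c6, token_to_cluster),
}).fillna(0.0)

# Positive values mean the cluster is more prominent in the class 6 corpus
shares["diff_c6_minus_c3"] = shares["c6"] - shares["c3"]
print(shares)
```

Shares are used instead of raw counts so that the two corpora can be compared even if they differ in size.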

To do: decide how to define the max class. Should it be the max across all classes, the max across classes 3, 4, 5 and 6, or the max between classes 3 and 6 only?
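
To see how the three options can differ, the sketch below computes the dominant class under each one for a toy table. The column names `class_1` to `class_6` and all values are hypothetical. The real master CSV may store the per-class values differently.

```python
import pandas as pd

# Hypothetical per-trail scores for each class; not real data
toy = pd.DataFrame({
    "class_1": [0.50, 0.05, 0.10],
    "class_2": [0.05, 0.05, 0.10],
    "class_3": [0.20, 0.40, 0.10],
    "class_4": [0.05, 0.10, 0.40],
    "class_5": [0.05, 0.10, 0.10],
    "class_6": [0.15, 0.30, 0.20],
})


def dominant_class(df, classes):
    """Return the class number with the largest value among `classes`."""
    cols = [f"class_{c}" for c in classes]
    # idxmax gives the winning column name; keep only the class number
    winners = df[cols].idxmax(axis=1)
    return winners.str.replace("class_", "", regex=False).astype(int)


# Option A: max across all classes
toy["max_all"] = dominant_class(toy, [1, 2, 3, 4, 5, 6])
# Option B: max across classes 3, 4, 5 and 6
toy["max_3456"] = dominant_class(toy, [3, 4, 5, 6])
# Option C: max between classes 3 and 6 only
toy["max_3_6"] = dominant_class(toy, [3, 6])

print(toy[["max_all", "max_3456", "max_3_6"]])
```

In this toy data the first trail is assigned to class 1 under option A but to class 3 under options B and C, and the third trail only lands in class 6 under option C. The choice therefore changes which trails end up in each corpus.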

In [None]:
# SETUP

# Import packages
import pandas as pd

import spacy 


### Step 1: Create separate corpora

Split the main corpus into two: one where the dominant consensus class is Non-Consensus (max_class=3) and one where the dominant consensus class is Full Consensus Forest (max_class=6).

In [None]:
# STEP 1: SEPARATE CORPORA

# Load the master CSV from rq2_step1_data_collection
master = pd.read_csv("./processing/master.csv")

# Separate the master corpus into 2 corpora for class 3 and 6
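# .copy() makes each corpus an independent DataFrame, so later edits do not trigger SettingWithCopyWarning
corpus_3 = master[master["max_class"] == 3].copy()
corpus_6 = master[master["max_class"] == 6].copy()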

### Step 2: Pre-processing

Same steps as in RQ2
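
As a reminder of what those steps do, here is a minimal sketch on two made-up rows. The column names match the ones used in this notebook, but the text values are invented, and the single character-class regex is a condensed stand-in for the individual replacements, not the exact RQ2 code.

```python
import pandas as pd

# Made-up rows; the second one has no captions
toy = pd.DataFrame({
    "description text": ["Weg | durch den Wald", "Kurze Runde"],
    "photo_captions": ["['Blick ins Tal', 'Am See']", None],
})

# Fill blanks with a placeholder so string concatenation does not produce NaN
toy = toy.fillna("['None']")

# Combine description and captions into one column
text = toy["description text"] + " " + toy["photo_captions"]

# Drop the placeholder, then strip list punctuation and other special characters
text = text.str.replace("['None']", "", regex=False)
text = text.str.replace(r"[\[\]'|\\/+=]", "", regex=True)

# Replace newlines with spaces and trim surrounding spaces
text = text.str.replace("\n", " ", regex=False).str.strip(" ")

# Keep only non-empty entries
cleaned = [t for t in text.tolist() if t]
print(cleaned)
```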

In [None]:
# STEP 2: CLEAN

# Add ['None'] to any blank rows
# This is necessary for the next step; the placeholder is removed again later
corpus_3 = corpus_3.fillna("['None']")
corpus_6 = corpus_6.fillna("['None']")

# Extract the description and captions and combine them into a single column
raw_text_c3 = pd.DataFrame()
raw_text_c3["desc_capt"] = corpus_3["description text"] + " " + corpus_3["photo_captions"]
raw_text_c6 = pd.DataFrame()
raw_text_c6["desc_capt"] = corpus_6["description text"] + " " + corpus_6["photo_captions"]

# Now remove all the ['None'] placeholder text from the combined column
raw_text_c3["desc_capt"] = raw_text_c3["desc_capt"].str.replace(r"\['None'\]", "", regex=True)
raw_text_c6["desc_capt"] = raw_text_c6["desc_capt"].str.replace(r"\['None'\]", "", regex=True)

# Remove certain special characters: [ ] ' | \ / + =
special_chars = r"[\[\]'|\\/+=]"
raw_text_c3["desc_capt"] = raw_text_c3["desc_capt"].str.replace(special_chars, "", regex=True)
raw_text_c6["desc_capt"] = raw_text_c6["desc_capt"].str.replace(special_chars, "", regex=True)

# Replace newlines with spaces (this addresses a specific issue in one of the entries)
raw_text_c3["desc_capt"] = raw_text_c3["desc_capt"].str.replace(r"\n", " ", regex=True)
raw_text_c6["desc_capt"] = raw_text_c6["desc_capt"].str.replace(r"\n", " ", regex=True)

# Create a list from the column
raw_text_c3_list = raw_text_c3["desc_capt"].astype(str).values.tolist()
raw_text_c6_list = raw_text_c6["desc_capt"].astype(str).values.tolist()

# Strip leading and trailing spaces, so entries which are just spaces (" ") become empty ("")
raw_text_c3_list = [x.strip(' ') for x in raw_text_c3_list]
raw_text_c6_list = [x.strip(' ') for x in raw_text_c6_list]

# Remove all empty entries
raw_text_c3_list = list(filter(None, raw_text_c3_list))
raw_text_c6_list = list(filter(None, raw_text_c6_list))


# Check
#raw_text_c6_list

In [None]:
# STEP 2: FILTER OUT SHORT TEXTS

# Load the spacy model
nlp = spacy.load("de_core_news_sm")

def filter_short_texts(raw_text_list, min_unique_tokens=3):
    """Keep only the texts with at least min_unique_tokens unique tokens."""
    # Tokenise the text for each trail & count the number of unique tokens for each trail
    unique_token_counts = []
    for trail_text in raw_text_list:
        doc = nlp(trail_text)
        tokens = [token.text.lower() for token in doc if not token.is_punct and not token.is_space]
        unique_token_counts.append(len(set(tokens)))

    # Combine results into a df
    raw_text_counts = pd.DataFrame({"text": raw_text_list, "unique_tokens": unique_token_counts})

    # Filter df to only include rows where unique_tokens >= min_unique_tokens
    raw_text_counts = raw_text_counts.loc[raw_text_counts["unique_tokens"] >= min_unique_tokens]

    # Return the text column as a list for use in the next steps
    return raw_text_counts["text"].astype(str).values.tolist()

# Apply the filter to both corpora (raw_text_list from RQ2 does not exist in this notebook)
raw_text_c3_3token_list = filter_short_texts(raw_text_c3_list)
raw_text_c6_3token_list = filter_short_texts(raw_text_c6_list)

# Check
raw_text_c3_3token_list
