## Exploring the Dataset of Comments
This code was used to explore a dataset of 25,970 comments left on fanfiction included in the existing *MythFic* dataset. The comment dataset contains 984,369 words in total. To protect the privacy of the fanfiction community, the comment data used in the *Catching Feelings* project will not be made available for reuse, but you can use the scraping code available on our GitHub to collect your own dataset and then explore it with this code.

This notebook explores the following questions:
- How many characters and words are in each comment? I also provide some descriptive statistics for these counts.
- Which languages are these comments predominantly written in?
- What are the most frequent words in the comments?
- What aspects do readers comment on? I explore this by looking at the most frequent words and identifying the most frequent noun chunks with spaCy.
- I also explore a Top2Vec topic model of all comments over 300 characters in length. Because this model is very large, I have not (yet) shared it online. Email the author(s) to receive a WeTransfer link.

This notebook also contains the filtering and selection of the comments for annotation.

**Disclaimer:** ChatGPT helped with some of the code in this notebook.

### Preliminaries

In [None]:
! pip install langid
! pip install langdetect
! pip install top2vec

In [1]:
# requirements
import csv
import pandas as pd
import spacy
from langdetect import detect
from collections import Counter
import langid
import numpy as np
import json
import os
import ipywidgets as widgets
from IPython.display import clear_output, display
from top2vec import Top2Vec
import matplotlib.pyplot as plt
from PIL import Image

In [2]:
from langid.langid import LanguageIdentifier, model
language_identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [6]:
# loading your data
df = pd.read_csv('filename.csv', sep=';')

In [None]:
# use this to check whether the data has been loaded
df.head()

In [None]:
# how many comments are in the dataframe?
len(df)

In [9]:
# make sure every comment is a string (missing values become the string 'nan')
df['Comment'] = df['Comment'].astype(str)

### Characters

In [10]:
# calculate how many characters are in each comment and add a characters-column
df['chars'] = df['Comment'].str.len()

In [None]:
# calculate some descriptive stats on the number of characters used in each comment
df['chars'].describe()

In [None]:
# how many characters in total?
df['chars'].sum()

### Wordcount

In [13]:
# calculate how many words are in each comment and add a wordcount-column
df['wordcount'] = df['Comment'].str.split().str.len()

In [None]:
# calculate some descriptive stats on the wordcount of each comment
df['wordcount'].describe()

In [None]:
# calculate total wordcount
df['wordcount'].sum()

### Most Frequent Words

In [None]:
strings = df["Comment"].astype(str)
Counter(" ".join(strings).split()).most_common(100)
# this is not very informative.
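
The raw counts above are dominated by function words and by case and punctuation variants. One possible refinement, sketched here, lowercases the text, tokenizes with a simple regular expression, and drops a small stop word list. The `STOPWORDS` set, the `top_words` helper, and the toy comments are illustrative assumptions, not part of the original analysis.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption); a fuller list,
# such as the one shipped with spaCy, would be a natural replacement.
STOPWORDS = {"the", "a", "an", "and", "i", "this", "is", "it", "so", "to", "of", "was"}

def top_words(comments, n=10):
    """Count lowercase word tokens across comments, skipping stop words."""
    counts = Counter()
    for comment in comments:
        tokens = re.findall(r"[a-z']+", comment.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

toy_comments = [
    "I love this story so much!",
    "The ending was perfect. Love it.",
    "This is a story of love and pain.",
]
print(top_words(toy_comments, n=3))
```

On the real data this could be called as `top_words(df["Comment"], n=100)`.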

### Language Identification

In [None]:
def identify_language(text):
    return language_identifier.classify(text)

def add_language_columns(df, text_column):
    # classify every text in text_column and store the label and its normalized probability
    languages = []
    confidences = []
    for text in df[text_column]:
        language, confidence = identify_language(text)
        languages.append(language)
        confidences.append(confidence)
    df['Language'] = languages
    df['Confidence'] = confidences
    return df

# Add language columns to the DataFrame, based on the 'Comment' column
df = add_language_columns(df, 'Comment')

print(df)

In [None]:
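# Here's some code to examine specific Language-labels
# == means check for exact label, != means check for absence of label
# as written, this counts the non-English comments shorter than 300 characters;
# change the operators or the threshold to examine other subsets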

len(
    df
    .loc[lambda df: (df['Language'] != 'en')]
    .loc[lambda df: (df['chars'] < 300)]
)

# Note: 24,444 comments in the *Catching Feelings* dataset were identified as English
# and 1,526 as non-English.
# For comments over 300 characters, it's 19,941 English and 1,476 non-English.
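
Counts of this kind (language label by length threshold) can also be produced in one step rather than by editing the filter by hand. A minimal sketch, assuming `Language` and `chars` columns like the ones created earlier; the toy data frame and the label strings are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the comment DataFrame (invented values).
toy = pd.DataFrame({
    "Language": ["en", "en", "fr", "en", "es"],
    "chars": [120, 450, 80, 310, 900],
})

# Turn both conditions into readable labels, then cross-tabulate them.
language_group = toy["Language"].eq("en").map({True: "English", False: "non-English"})
length_group = (toy["chars"] > 300).map({True: "over 300", False: "300 or fewer"})

summary = pd.crosstab(language_group, length_group)
print(summary)
```

Each cell of `summary` then holds the number of comments for one combination of language group and length group.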

In [None]:
df['Language'].value_counts()

In [None]:
df.sort_values(by=['Language'])

### Identifying the Most Common Noun Chunks with SpaCy

In [None]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Function to extract noun chunks from text
def extract_noun_chunks(text):
    doc = nlp(text)
    return [chunk.text for chunk in doc.noun_chunks]

# Apply noun chunk extraction to each row in the 'Comment' column
df['Noun_chunks'] = df['Comment'].apply(extract_noun_chunks)

# Print the DataFrame with the extracted noun chunks
print(df)

In [None]:
# Concatenate all noun chunks from all rows into a single list
all_noun_chunks = [chunk for row in df['Noun_chunks'] for chunk in row]

# Count the occurrences of each unique noun chunk
noun_chunk_counts = Counter(all_noun_chunks)

# Find the most frequent noun chunks
most_common_noun_chunks = noun_chunk_counts.most_common()

print(most_common_noun_chunks)
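
Counting the raw chunk strings treats variants such as "The story", "the story" and "a story" as different items. A possible normalization step, sketched here as an assumption rather than as the method used for the list below, lowercases each chunk and strips one leading determiner before counting. The `DETERMINERS` set, the `normalize_chunk` helper, and the toy chunks are hypothetical.

```python
from collections import Counter

# Leading determiners to strip (illustrative assumption, not exhaustive).
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}

def normalize_chunk(chunk):
    """Lowercase a noun chunk and drop one leading determiner, if present."""
    words = chunk.lower().split()
    # Keep single-word chunks intact so that a bare "This" is not emptied.
    if len(words) > 1 and words[0] in DETERMINERS:
        words = words[1:]
    return " ".join(words)

toy_chunks = ["The story", "the story", "a story", "Your writing", "the ending", "This"]
counts = Counter(normalize_chunk(c) for c in toy_chunks)
print(counts.most_common(3))
```

Whether merging variants is desirable depends on the question: "a happy ending" and "the ending" would still be counted separately, which may or may not be what you want.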

Some of the most frequent noun chunks that may point to aspects in the *Catching Feelings* data include:
- Achilles, Persephone, Hades, Patroclus (aspect: character)
- the story, the end, the ending, a happy ending, the plot (aspect: events)
- the world, Troy, the underworld (aspect: storyworld)
- your writing, writing, imagery (aspect: style)
- love, <3, my heart, tears, awe, fun, omg, pain (aspect: reader response)

#### You may want to save this dataframe with all the acquired metadata as a CSV file.

In [23]:
df.to_csv('comments+metadata.csv', index=False)

## Exploring a Top2Vec Topic Model of the Lengthier Comments

In [None]:
# this is a topic model of all comments over 300 characters long
# it was created with Top2Vec
# note: this rebinds the name `model`, which earlier held the langid model string

model = Top2Vec.load("filtered_comments")

In [None]:
model.get_num_topics()

In [None]:
model.get_topics(25)

In [None]:
# You can visualize any topic like this
# I don't really find these easy to interpret
model.generate_topic_wordcloud(6)

In [None]:
# and take a closer look at the words in any topic
# the slice [1:] skips the first topic, so with get_topics(2) this prints only the second one
topic_words, word_scores, topics = model.get_topics(2)
for words, scores, num in zip(topic_words[1:], word_scores[1:], topics[1:]):
    print(f"Topic {num}")
    for word, score in zip(words, scores):
        print(word, score)

In [None]:
# you can also see how many documents are in each topic
topic_sizes, topic_nums = model.get_topic_sizes()
for topic_size, topic_num in zip(topic_sizes[:45], topic_nums[:45]):
    print(f"Topic Num {topic_num} has {topic_size} documents.")

## Creating the Annotation Set for *Catching Feelings*

- filter by length: between 100 and 4,000 characters
  - this leads to a subset of 13,073 comments, or about half of the total
- keep comments identified as English with a confidence of at least 0.9
- randomly select 1,000 comments

### Length selection

In [25]:
# filter by length
length_filter = df[(df['chars'] >= 100) & (df['chars'] <= 4000)]

In [None]:
# check how many comments are left after the length filter
len(length_filter)

In [27]:
# keep only comments identified as English
language_filter = length_filter[length_filter['Language'] == 'en']

In [None]:
len(language_filter)

In [None]:
language_filter.head()

In [30]:
# keep only comments whose language label has a confidence of at least 0.9
probability_filter = language_filter[language_filter['Confidence'] >= 0.9]

In [None]:
len(probability_filter)

In [32]:
filtered_set = probability_filter

In [None]:
filtered_set.head()

## Now let's sample 1,000 comments and save them as txt files

In [None]:
random_sample_filtered_set = filtered_set.sample(n=1000)
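
As written, `sample(n=1000)` draws a different subset on each run. If the annotation set needs to be reproducible, pandas accepts a `random_state` seed. A minimal sketch on toy data; the seed value 42 and the toy frame are arbitrary assumptions, and the original selection was not necessarily seeded.

```python
import pandas as pd

# Toy stand-in for the filtered comments (invented values).
toy = pd.DataFrame({"Comment": [f"comment {i}" for i in range(20)]})

# The same seed yields the same rows on every run.
first_draw = toy.sample(n=5, random_state=42)
second_draw = toy.sample(n=5, random_state=42)
print(first_draw.index.tolist() == second_draw.index.tolist())
```

In the notebook this would amount to `filtered_set.sample(n=1000, random_state=42)`, with any fixed integer as the seed.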

In [None]:
random_sample_filtered_set.head()

In [None]:
# Create a folder named 'annotation_set' if it doesn't exist
folder_path = 'annotation_set'
os.makedirs(folder_path, exist_ok=True)

# Iterate over the 'Comment' column and write each comment to a separate text file
for index, comment in enumerate(random_sample_filtered_set['Comment']):
    file_name = f'comment_{index}.txt'
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(comment)

print("Text files have been created in the 'annotation_set' folder.")
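
The loop above names files by their position in the sample, so the link back to the original row is lost once the files are written. One way to keep it, sketched here under assumptions, is to write a small manifest alongside the text files. The `write_annotation_files` helper, the `manifest.csv` name, and the toy rows are hypothetical; in the notebook the pairs could come from `random_sample_filtered_set['Comment'].items()`.

```python
import csv
import os
import tempfile

def write_annotation_files(rows, folder_path):
    """Write each (row_id, comment) pair to its own file and record the mapping."""
    os.makedirs(folder_path, exist_ok=True)
    manifest_path = os.path.join(folder_path, "manifest.csv")
    with open(manifest_path, "w", newline="", encoding="utf-8") as manifest:
        writer = csv.writer(manifest)
        writer.writerow(["file_name", "row_id"])
        for position, (row_id, comment) in enumerate(rows):
            # Zero-padded names keep the files in order when sorted by name.
            file_name = f"comment_{position:04d}.txt"
            with open(os.path.join(folder_path, file_name), "w", encoding="utf-8") as f:
                f.write(comment)
            writer.writerow([file_name, row_id])
    return manifest_path

# Toy (row_id, comment) pairs; the ids stand in for original DataFrame index values.
toy_rows = [(1532, "I loved the ending!"), (87, "Achilles broke my heart.")]
out_dir = os.path.join(tempfile.mkdtemp(), "annotation_set")
manifest_path = write_annotation_files(toy_rows, out_dir)
print(sorted(os.listdir(out_dir)))
```

With the manifest in place, annotations made on `comment_0000.txt` can later be joined back to the metadata saved in the CSV.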