# Tidy Tuesday Project for September 17th, 2024

Hello there, folks!  I am coming at you with my first #tidytuesday project.  I am currently learning the [Julia](https://julialang.org/) programming language, and so that is what I am going to use in this notebook.

Please feel free to critique me if you are a Julia programmer!

## The Shakespeare Dialogue Dataset

Thanks to [nrennie](https://github.com/nrennie), we have access to a dataset that you can find [here](https://github.com/nrennie/shakespeare).

The dataset's author scraped the data from [shakespeare.mit.edu](https://shakespeare.mit.edu/).

Let's get to it!

### Setup

In [None]:
# import Pkg and then other required packages
using Pkg
Pkg.add(["CSV", "DataFrames", "HTTP", "Statistics", "StatsPlots", "Plots"])

# load
using CSV, DataFrames, HTTP, Statistics, StatsPlots, Plots

### Import data

In [None]:
# Read directly from GitHub
hamlet_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-17/hamlet.csv"
macbeth_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-17/macbeth.csv"
romeo_juliet_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-09-17/romeo_juliet.csv"

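# Create DataFrames
# Assumption: the files mark missing values as "NA"; treating those as `missing`
# lets numeric columns such as line_number parse as numbers rather than strings
hamlet = CSV.read(HTTP.get(hamlet_url).body, DataFrame; missingstring="NA")
macbeth = CSV.read(HTTP.get(macbeth_url).body, DataFrame; missingstring="NA")
romeo_juliet = CSV.read(HTTP.get(romeo_juliet_url).body, DataFrame; missingstring="NA");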


### Hamlet 📚

In [None]:
# Basic Description
describe(hamlet)

In [None]:
# Number of unique characters
unique_characters = unique(hamlet.character)
println("Number of unique characters: ", length(unique_characters))

In [None]:
# Most frequent characters
character_counts = combine(groupby(hamlet, :character), nrow => :Count)
sorted_counts = sort(character_counts, :Count, rev=true)
println("Most frequent characters:\n", first(sorted_counts, 10))
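
For anyone new to DataFrames.jl, the `groupby`/`combine` step above is just a count of rows per character. Here is a minimal sketch of the same idea in plain Julia, using a made-up `speakers` vector (hypothetical values, not the real dataset):

```julia
# Toy stand-in for the `character` column (hypothetical values)
speakers = ["Hamlet", "Horatio", "Hamlet", "Ophelia", "Hamlet", "Horatio"]

# Count how many rows each speaker has
counts = Dict{String, Int}()
for s in speakers
    counts[s] = get(counts, s, 0) + 1
end

# Sort the (speaker => count) pairs by count, largest first
ranked = sort(collect(counts), by = last, rev = true)

for (name, n) in ranked
    println(name, ": ", n)
end
```

The DataFrames version is still the better choice on the real data; this only shows what the counting amounts to.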

In [None]:
# Distribution of dialogue lengths (number of characters per row)
hamlet.dialogue_length = length.(hamlet.dialogue)

histogram(hamlet.dialogue_length,
    bins=20,
    title="Distribution of Dialogue Lengths in Hamlet",
    xlabel="Dialogue Length (characters)",
    ylabel="Count"
)

In [None]:
# Plot line number distribution by act and scene
scatter(hamlet.line_number,
    hamlet.act,
    group=hamlet.scene,
    legend=:topright,
    title="Line Number Distribution by Act and Scene in Hamlet",
    xlabel="Line Number",
    ylabel="Act"
)

### Macbeth 🗡️

In [None]:
# Basic Description
describe(macbeth)

In [None]:
# Number of unique characters
unique_characters_macbeth = unique(macbeth.character)
println("Number of unique characters: ", length(unique_characters_macbeth))

In [None]:
# Most frequent characters
character_counts_macbeth = combine(groupby(macbeth, :character), nrow => :Count)
sorted_counts_macbeth = sort(character_counts_macbeth, :Count, rev=true)
println("Most frequent characters:\n", first(sorted_counts_macbeth, 10))

In [None]:
# Distribution of dialogue lengths (number of characters per row)
macbeth.dialogue_length = length.(macbeth.dialogue)

histogram(macbeth.dialogue_length,
    bins=20,
    title="Distribution of Dialogue Lengths in Macbeth",
    xlabel="Dialogue Length (characters)",
    ylabel="Count"
)

In [None]:
# Plot line number distribution by act and scene
scatter(macbeth.line_number,
    macbeth.act,
    group=macbeth.scene,
    legend=:topright,
    title="Line Number Distribution by Act and Scene in Macbeth",
    xlabel="Line Number",
    ylabel="Act"
)

### Romeo & Juliet ❤️

In [None]:
# Basic Description
describe(romeo_juliet)

In [None]:
# Number of unique characters
unique_characters_rj = unique(romeo_juliet.character)
println("Number of unique characters: ", length(unique_characters_rj))

In [None]:
# Most frequent characters
character_counts_rj = combine(groupby(romeo_juliet, :character), nrow => :Count)
sorted_counts_rj = sort(character_counts_rj, :Count, rev=true)
println("Most frequent characters:\n", first(sorted_counts_rj, 10))

In [None]:
# Distribution of dialogue lengths (number of characters per row)
romeo_juliet.dialogue_length = length.(romeo_juliet.dialogue)

histogram(romeo_juliet.dialogue_length,
    bins=20,
    title="Distribution of Dialogue Lengths in Romeo & Juliet",
    xlabel="Dialogue Length (characters)",
    ylabel="Count"
)

In [None]:
# Plot line number distribution by act and scene
scatter(romeo_juliet.line_number,
    romeo_juliet.act,
    group=romeo_juliet.scene,
    legend=:topright,
    title="Line Number Distribution by Act and Scene in Romeo & Juliet",
    xlabel="Line Number",
    ylabel="Act"
)

### NLP Techniques and Suggested Methods
- **Text Preprocessing and Tokenization**: Use `WordTokenizers.jl` for breaking down the text into tokens (words).
- **Word Embeddings**: Use pre-trained word embeddings with `Embeddings.jl` or train your own with `Word2Vec.jl`.
- **Semantic Similarity**: Measure how similar different characters or dialogues are using embeddings.
- **Topic Modeling**: Use `TextAnalysis.jl` to identify topics in the dialogues.
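- **Named Entity Recognition (NER)**: Identify names of characters, locations, etc., using a pre-trained tagger such as the one in `TextModels.jl` (`Languages.jl` only supplies language data such as stop words, not a tagger).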

### Text Preprocessing and Tokenization

In [None]:
using WordTokenizers

# Example: Tokenize dialogues in Hamlet
tokens_hamlet = [tokenize(lowercase(dialogue)) for dialogue in hamlet.dialogue]

# Display a few tokenized dialogues
println("Sample Tokenized Dialogues:\n", tokens_hamlet[1:3])
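
If you want to see what tokenization boils down to without installing anything, here is a rough sketch in plain Julia. The `simple_tokenize` helper is my own illustrative name, and it is much cruder than `WordTokenizers.jl` (it simply drops punctuation and digits):

```julia
# Lowercase the text and keep runs of letters and apostrophes as tokens.
# This is a simplification: punctuation and digits are discarded entirely.
simple_tokenize(text::AbstractString) =
    [String(m.match) for m in eachmatch(r"[a-z']+", lowercase(text))]

sample = "To be, or not to be: that is the question."
tokens = simple_tokenize(sample)
println(tokens)
```

A real tokenizer also handles contractions, hyphens, and sentence boundaries, which matters for Shakespeare's spelling.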


### Word Embeddings
Using Pre-trained Word Embeddings

In [None]:
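using Embeddings

# Load pre-trained GloVe embeddings from a local file (adjust the path)
embtable = load_embeddings(GloVe{:en}, "path_to_pretrained_embeddings/glove.6B.100d.txt")

# Build a word => vector lookup; each column of embtable.embeddings is one word
embedding = Dict(word => embtable.embeddings[:, i] for (i, word) in enumerate(embtable.vocab))

# Get embedding for a word
word_embedding = embedding["king"]  # Example word

println("Word Embedding for 'king':\n", word_embedding)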
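
Pre-trained embeddings are usually stored as a vocabulary list plus a matrix with one vector per word. Here is a toy sketch of that layout and of looking a word up in it. The three-dimensional vectors are made up, and `vocab`, `vectors`, `word_index`, and `lookup` are illustrative names rather than part of any package:

```julia
# Toy vocabulary and a matrix with one column per word (values are made up)
vocab = ["king", "queen", "ghost"]
vectors = [0.9 0.8 0.1;
           0.1 0.2 0.7;
           0.3 0.4 0.9]

# Map each word to its column index for fast lookup
word_index = Dict(word => i for (i, word) in enumerate(vocab))

# Return the vector for a word, or `nothing` if it is out of vocabulary
lookup(word) = haskey(word_index, word) ? vectors[:, word_index[word]] : nothing

println(lookup("queen"))
println(lookup("hamlet"))
```

Returning `nothing` for unknown words is one design choice; with Shakespeare's vocabulary you should expect quite a few misses against a modern pre-trained table.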


### Training Word2Vec Embeddings

In [None]:
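using Word2Vec

# Word2Vec.jl trains from a text file, so write one tokenized dialogue per line
corpus_file = "hamlet_tokens.txt"
open(corpus_file, "w") do io
    for tokens in tokens_hamlet
        println(io, join(tokens, " "))
    end
end

# Train the model, then load the resulting word vectors
word2vec(corpus_file, "hamlet_vectors.txt", size=100, window=5, iter=5)
model = wordvectors("hamlet_vectors.txt")

# Find similar words to "king"
similar_words = cosine_similar_words(model, "king", 5)
println("Words similar to 'king':\n", similar_words)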


### Semantic Similarity

In [None]:
using Distances

# Function to compute average embedding of a dialogue
function average_embedding(dialogue, embedding)
    words = filter(word -> haskey(embedding, word), dialogue)
    if isempty(words)
        return zeros(100)  # Assuming embedding size is 100
    else
        return mean(embedding[w] for w in words)
    end
end

# Compute embeddings for each character
character_embeddings = Dict()
for character in unique(hamlet.character)
    dialogues = hamlet[hamlet.character .== character, :dialogue]
    tokens = [tokenize(lowercase(d)) for d in dialogues]
    avg_embed = mean(average_embedding(t, embedding) for t in tokens)
    character_embeddings[character] = avg_embed
end

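# Example: Compute cosine similarity between Hamlet and the King
# (adjust the keys to match the names exactly as they appear in hamlet.character)
# cosine_dist returns a distance, so similarity is 1 minus that value
similarity = 1 - cosine_dist(character_embeddings["HAMLET"], character_embeddings["KING"])
println("Semantic similarity between Hamlet and King: ", similarity)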
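
One thing that is easy to trip over: `cosine_dist` from `Distances.jl` returns a distance, and cosine similarity is 1 minus that distance. Here is a small sketch of both quantities written out with `LinearAlgebra`. The function names `cosine_similarity` and `cosine_distance` are my own, for illustration only:

```julia
using LinearAlgebra

# Cosine similarity: 1 for the same direction, 0 for orthogonal vectors
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# Cosine distance is 1 minus the similarity
cosine_distance(a, b) = 1 - cosine_similarity(a, b)

a = [1.0, 0.0]
b = [0.0, 1.0]
c = [2.0, 0.0]

println(cosine_similarity(a, c))  # a and c point the same way
println(cosine_distance(a, b))    # a and b are orthogonal
```

Note that vector length does not matter here, only direction, which is why averaging embeddings over many dialogue lines still gives a comparable result.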


### Topic Modeling
Latent Dirichlet Allocation (LDA):

In [None]:
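using TextAnalysis

# Build a corpus with one document per dialogue row
corpus = Corpus([StringDocument(lowercase(d)) for d in hamlet.dialogue])
update_lexicon!(corpus)

# Convert dialogues to a document-term matrix
dtm = DocumentTermMatrix(corpus)

# Fit LDA model with 5 topics (1000 iterations, alpha = 0.1, beta = 0.1)
n_topics = 5
topic_word, topic_doc = lda(dtm, n_topics, 1000, 0.1, 0.1)

# Display the top 10 words for each topic (topic_word is topics x terms)
println("Top words for each topic:\n")
for topic in 1:n_topics
    top_idx = partialsortperm(Vector(topic_word[topic, :]), 1:10, rev=true)
    println("Topic $topic: ", dtm.terms[top_idx])
end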
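
Since LDA works on a document-term matrix, it helps to see what that matrix actually is. Here is a minimal sketch that builds one by hand from three made-up, already tokenized "documents" (the text and the names `docs`, `terms`, `col`, and `dtm` are all just for illustration):

```julia
# Three toy documents, already tokenized (made-up text)
docs = [["the", "king", "is", "dead"],
        ["long", "live", "the", "king"],
        ["the", "queen", "lives"]]

# Sorted vocabulary and a word => column lookup
terms = sort(unique(vcat(docs...)))
col = Dict(t => j for (j, t) in enumerate(terms))

# Rows are documents, columns are terms, entries are counts
dtm = zeros(Int, length(docs), length(terms))
for (i, doc) in enumerate(docs), word in doc
    dtm[i, col[word]] += 1
end

println(terms)
println(dtm)
```

On the real play this matrix is large and mostly zeros, which is why packages store it in sparse form.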


### Named Entity Recognition (NER)
Identifying Named Entities

In [None]:
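using TextModels, WordTokenizers

# Languages.jl has no tagger, so this swaps in the pre-trained NER model from
# TextModels.jl (assumption: it is installed; it downloads weights on first use)
ner = NERTagger()

# Define a simple function to extract (token, tag) pairs for named entities
function extract_entities(text)
    tags = ner(text)
    pairs = zip(WordTokenizers.tokenize(text), tags)
    return [(w, t) for (w, t) in pairs if t in ("PER", "LOC", "ORG")]
end

# Extract entities from sample dialogues
entities = [extract_entities(d) for d in hamlet.dialogue[1:5]]
println("Entities in sample dialogues:\n", entities)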


### Summary of Techniques
- **Text Preprocessing and Tokenization**: Prepares the data for analysis.
- **Word Embeddings**: Encodes words into numerical vectors, capturing semantic meaning.
- **Semantic Similarity**: Quantifies similarity between characters' dialogues.
- **Topic Modeling**: Finds latent topics in dialogues, useful for discovering themes.
- **Named Entity Recognition**: Identifies important names, places, and organizations.

These techniques enable a deeper understanding of the text, exploring character relationships, thematic elements, and more subtle textual patterns. The choice of method depends on the specific questions you want to answer using this dataset.