# Data Mining with Professor Sloan Week 6

## Maggie Boles 

### From Blackboard: Using either spaCy or Gensim, create your own text summarization model using a very specific corpus (e.g., dialog from Star Wars). Be careful not to choose too big a corpus or too complex a model because it could take a very long time to build. If you want to go a step further, create a front end for it (using something like Flask).
If you use Gensim, be sure to use `pip install gensim==3.6.0`; otherwise, the summarization isn't available.

In [21]:
import spacy
import string
import re
from collections import Counter

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load the corpus from file
def load_corpus(file_path='howls_script.txt'):
    with open(file_path, 'r', encoding='utf-8-sig') as f:  # utf-8-sig drops a leading BOM if the file has one
        text = f.read()
    return text

# Preprocess the text: remove stop words, punctuation, numbers, lemmatize, remove extra whitespace
def preprocess_text(text):
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Process with spaCy
    doc = nlp(text)
    
    # Lemmatize, remove stop words and punctuation
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and token.text.strip()]
    
    # Remove extra whitespace and join tokens
    cleaned_text = ' '.join(tokens).strip()
    
    # Normalize whitespace
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    
    return cleaned_text

# Simple extractive summarization using word frequency
def summarize_text(text, summary_length=100):  # summary_length in words
    # Preprocess the text
    cleaned_text = preprocess_text(text)
    
    # Get word frequencies
    words = cleaned_text.split()
    freq = Counter(words)
    
    # Get sentences from original text (to preserve original structure)
    original_doc = nlp(text)
    sentences = [sent.text.strip() for sent in original_doc.sents]
    
    # Score sentences based on word frequency
    sent_scores = {}
    for sent in sentences:
        sent_words = preprocess_text(sent).split()
        score = sum(freq[word] for word in sent_words) / (len(sent_words) + 1)  # Avoid division by zero
        sent_scores[sent] = score
    
    # Select top sentences
    sorted_sents = sorted(sent_scores, key=sent_scores.get, reverse=True)[:int(len(sentences) * 0.2)]  # Top 20% sentences
    
    # Join and return summary, limited to approximately summary_length words
    summary = ' '.join(sorted_sents)
    summary_words = summary.split()[:summary_length]
    return ' '.join(summary_words)

# Main
if __name__ == '__main__':
    corpus = load_corpus()
    print("Full Script Length:", len(corpus.split()), "words")
    
    # Preprocess the entire script
    cleaned_corpus = preprocess_text(corpus)
    print("\nCleaned Text (after preprocessing):\n", cleaned_corpus[:500], "...")  # Preview first 500 chars
    
    # Generate summary (built from the original sentences, capped at 2000 words)
    summary = summarize_text(corpus, summary_length=2000)
    summary_word_count = len(summary.split())
    print("\nSummary (top", summary_word_count, "words after preprocessing):\n", summary)

Full Script Length: 14933 words

Cleaned Text (after preprocessing):
 howl moving castle character list sophie howl calcifer markl witch waste witch madam sulliman sulliman lettie turnip prince heen japanese script unknown transcriber kitsunekko.net english translation gogoanime subtitle transcribed organize kiriban scene small hat shop sophie work bunch hat bessie enter room bessie sophie close shop work come time sophie well finish fun bessie right suit let girl action woman shop follow bessie woman wait ready look okay look howl castle oh howl see close think  ...

Summary (top 2000 words after preprocessing):
 Sophie? Sophie What? Sophie It’s me, Sophie! Honey Sophie? Sophie! Sophie Sophie No, he wouldn’t. Sophie? Sophie What about you? Sophie What is this? Sophie I’ve had enough of this! Sophie put these here for me. Sophie! Sophie, do something! Sophie Why don’t you just give up? Sophie That is enough! Sophie What?! Sophie How? Sophie I can’t do this. Sophie You were alone? Soph

##### I don't know why, but this week took me a minute to get together. Gensim wasn't working for me no matter how many fixes I tried from online, so I pivoted to spaCy for the text summarization. The movie I chose was Howl's Moving Castle by Hayao Miyazaki, and I summarized the script of the English dub. It's kind of funny seeing it now: the summary, capped at 2000 words (`summary_length=2000` in the code above), is mostly Sophie and Howl, since I wanted to see what was most common.

##### For this assignment I did some preprocessing: removing numbers with regex, then using spaCy to lemmatize, remove stop words, exclude punctuation, and normalize whitespace. For the summarization, I split the original text into sentences with spaCy and scored each sentence by the frequency of its preprocessed words using `Counter` (the sum of the word frequencies divided by the sentence's word count plus one). Then I selected the top 20% of sentences, joined them into a summary, and capped it at 2000 words.
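To make the scoring step concrete, here is a minimal sketch on a few made-up lines. It uses only the standard library: a regex split and a tiny hand-written stop-word list stand in for spaCy's sentence segmentation and stop-word filtering, and there is no lemmatization, so it illustrates the arithmetic rather than the notebook's exact pipeline. The toy text, `STOP_WORDS`, `content_words`, and `score_sentences` are all assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lines, not taken from the actual script.
toy_text = "Sophie? Sophie! Sophie, wait for me. Howl left the castle at dawn with Calcifer."

# Tiny illustrative stop-word list (spaCy's real list is much larger).
STOP_WORDS = {"the", "at", "for", "me", "with"}

def content_words(s):
    """Lowercase, keep alphabetic tokens, drop stop words (no lemmatization)."""
    return [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]

def score_sentences(text):
    """Score each sentence as sum(word frequencies) / (word count + 1)."""
    # Naive sentence split: break after ., ? or ! followed by whitespace.
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    freq = Counter(content_words(text))
    scores = {}
    for sent in sentences:
        words = content_words(sent)
        scores[sent] = sum(freq[w] for w in words) / (len(words) + 1)
    return scores

scores = score_sentences(toy_text)
for sent, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.2f}  {sent}")
# 1.50  Sophie?
# 1.50  Sophie!
# 1.33  Sophie, wait for me.
# 0.83  Howl left the castle at dawn with Calcifer.
```

In this toy run the one-word lines score highest (3 / 2 = 1.5) because "sophie" is the most frequent word and the short sentences have almost nothing to dilute it. That may be part of why the real summary is dominated by short "Sophie" lines.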

##### I think I would've liked an abstractive summarization instead of going the route of picking sentences by their most frequent words. I'm not sure. It works, but it's not very meaningful. A lot of my issues with the assignment come from the frustration of Gensim not working the way I wanted, so I wanted to wrap this up and move on.
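Short of an abstractive model, one possible tweak to the same frequency approach is to skip sentences with very few content words and to output the selected sentences in their original order. The sketch below is a hypothetical variant, not what the notebook does: `summarize_in_order`, `top_n`, and `min_words` are names invented here, and a regex split and a small stop-word list again stand in for spaCy so the example runs without a model download.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not spaCy's list).
STOP_WORDS = {"the", "at", "for", "me", "with", "and"}

def content_words(s):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]

def summarize_in_order(text, top_n=2, min_words=3):
    """Pick the top_n highest-scoring sentences, then emit them in source order."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    freq = Counter(content_words(text))
    scored = []
    for idx, sent in enumerate(sentences):
        words = content_words(sent)
        if len(words) < min_words:
            continue  # skip very short lines such as a bare name
        score = sum(freq[w] for w in words) / (len(words) + 1)
        scored.append((score, idx))  # index keeps duplicate sentences distinct
    # Highest score first; ties go to the earlier sentence.
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:top_n]
    keep = sorted(idx for _, idx in best)  # restore original order
    return " ".join(sentences[i] for i in keep)

# Hypothetical toy lines, not taken from the actual script.
toy_text = (
    "Sophie? Sophie! Howl left the castle at dawn with Calcifer. "
    "Sophie cleaned the castle and fed Calcifer. Sophie, wait for me."
)
print(summarize_in_order(toy_text, top_n=2))
# Howl left the castle at dawn with Calcifer. Sophie cleaned the castle and fed Calcifer.
```

Setting `min_words=1` on the same toy text brings back the "Sophie? Sophie!" result, so the length filter is what changes the outcome here. Whether it helps on the full script would need to be checked by running it.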