## Introduction
<p><img src="https://assets.datacamp.com/production/project_1010/img/book_cover.jpg" alt="The book cover of Peter and Wendy" style="width:183px;height:253px;"></p>
<h3 id="flyawaywithpeterpan">Fly away with Peter Pan!</h3>
<p>Peter Pan has been the companion of many children and has come a long way, starting as a Christmas play and ending up as a Disney classic. Did you know that although the play was titled "Peter Pan, Or The Boy Who Wouldn't Grow Up", J. M. Barrie's novel was actually titled "Peter and Wendy"?</p>
<p>You're going to explore and analyze Peter Pan's text to answer the question in the instruction pane below. You are working with the text version available at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. Feel free to add as many cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at it!</p>
<p><strong>Note:</strong> If you haven't completed a DataCamp project before, you should check out the <a href="https://projects.datacamp.com/projects/33">Intro to Projects</a> first to learn about the interface. <a href="https://www.datacamp.com/courses/intermediate-importing-data-in-python">Intermediate Importing Data in Python</a> and <a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Introduction to Natural Language Processing in Python</a> teach the skills required to complete this project. Should you decide to use them, English stopwords have been downloaded from <code>nltk</code> and are available for you in your environment.</p>

In [6]:
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
import re

# Download stopwords 
nltk.download('stopwords')
from nltk.corpus import stopwords

# Fetch the text from Project Gutenberg 
url = "https://www.gutenberg.org/files/16/16-h/16-h.htm"  
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed

# Parse the HTML using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
text = soup.get_text()

# Clean the text: remove punctuation, normalize to lowercase, and split into words
cleaned_text = re.sub(r"[^\w\s]", "", text)  
cleaned_text = cleaned_text.lower()  
words = cleaned_text.split()  

# Remove stopwords
stop_words = set(stopwords.words("english"))
meaningful_words = [word for word in words if word not in stop_words]

# Count word frequencies
word_counts = Counter(meaningful_words)

# Get the 10 most common meaningful words
most_common_words = word_counts.most_common(10)
print("Top 10 most common meaningful words:", most_common_words)

# Define character names as single lowercase words so they match the tokens
# (multi-word names such as "Tinker Bell" are listed word by word)
character_names = ["peter", "wendy", "hook", "tinker", "bell", "darling", "john", "michael"]

# Check which of the most common words are character names
protagonists = [word for word, count in most_common_words if word in character_names]

# Print the answer (already stored in `protagonists`)
print("Protagonists among the top 10:", protagonists)


[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Top 10 most common meaningful words: [('peter', 382), ('said', 358), ('wendy', 333), ('would', 217), ('one', 211), ('hook', 153), ('could', 142), ('cried', 136), ('john', 127), ('time', 122)]
Protagonists among the top 10: ['peter', 'wendy', 'hook', 'john']
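
One caveat about the cleaning step in the cell above: it deletes every punctuation character before the stopword filter runs, so a contraction such as "wouldn't" becomes "wouldnt" and no longer matches a stopword entry that contains an apostrophe. The sketch below shows one possible variation, not the method used above, that keeps apostrophes inside words while tokenizing. The helpers `tokenize` and `top_words`, the `SAMPLE_STOPWORDS` set, and the sample sentence are all hypothetical and exist only so the example runs without downloading anything; in the project you would pass the `nltk` stopword set instead.

```python
import re
from collections import Counter

# Hypothetical mini stopword list for illustration only; in the project,
# use set(stopwords.words("english")) from nltk instead.
SAMPLE_STOPWORDS = {"the", "and", "would", "wouldn't", "he", "said", "that", "up"}


def tokenize(text):
    """Lowercase the text and return its words, keeping apostrophes inside words."""
    # Map the typographic apostrophe to the ASCII one so contractions match.
    normalized = text.lower().replace("\u2019", "'")
    # A word is a run of letters, optionally followed by apostrophe + letters.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", normalized)


def top_words(text, stop_words, n=10):
    """Return the n most common tokens that are not stopwords."""
    tokens = [token for token in tokenize(text) if token not in stop_words]
    return Counter(tokens).most_common(n)


sample = (
    "Peter said that he wouldn\u2019t grow up, and Peter wouldn't. "
    "Wendy said Peter would, and Wendy said that."
)
print(top_words(sample, SAMPLE_STOPWORDS, n=2))
# prints [('peter', 3), ('wendy', 2)]
```

Because this tokenizer splits words differently from the cell above, the counts it produces on the full text may differ from the printed output, so treat it as an experiment rather than a drop-in replacement.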
