# Introduction
<p><img src="https://assets.datacamp.com/production/project_1010/img/book_cover.jpg" alt="The book cover of Peter and Wendy" style="width:183px;height:253px;"></p>
<h3 id="flyawaywithpeterpan">Fly away with Peter Pan!</h3>
<p>Peter Pan has been the companion of many children and has come a long way, starting as a Christmas play and ending up as a Disney classic. Did you know that although the play was titled "Peter Pan, Or The Boy Who Wouldn't Grow Up", J. M. Barrie's novel was actually titled "Peter and Wendy"?</p>
<p>You're going to explore and analyze Peter Pan's text to answer the question in the instruction pane below. You are working with the text version available here at <a href="https://www.gutenberg.org/files/16/16-h/16-h.htm">Project Gutenberg</a>. Feel free to add as many cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> If you haven't completed a DataCamp project before, you should check out the <a href="https://projects.datacamp.com/projects/33">Intro to Projects</a> first to learn about the interface. <a href="https://www.datacamp.com/courses/intermediate-importing-data-in-python">Intermediate Importing Data in Python</a> and <a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Introduction to Natural Language Processing in Python</a> teach the skills required to complete this project. Should you decide to use them, English stopwords have been downloaded from <code>nltk</code> and are available for you in your environment.</p>

In [1]:
# Use this cell to begin your analysis, and add as many as you would like!
import requests
from bs4 import BeautifulSoup
import re

# Get Webpage
## Use the requests library

In [2]:
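URL = "https://www.gutenberg.org/files/16/16-h/16-h.htm"
# Fetch the page; the timeout keeps the cell from hanging on a slow connection
r = requests.get(URL, timeout=30)
# Stop here with an error if the server did not return a successful response
r.raise_for_status()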

## Parse the HTML page using BeautifulSoup
1. Open the source code of the web page in a browser
2. Understand the HTML code and tags
3. Figure out which content you need

In [3]:
# Name the parser explicitly so BeautifulSoup does not warn about guessing one
soup = BeautifulSoup(r.content, "html.parser")
#print(soup.prettify())

# Objective 1: get the content of all the chapters
1. The chapter text is inside div tags
2. Those div tags have the class chapter
3. The paragraphs themselves are inside p tags

In [4]:
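# Locate the first chapter div
content_div = soup.find('div', class_='chapter')
# Collect every p tag after that point that holds a single text string
# (find_all_next and string= replace the deprecated findAllNext and text=)
content_pTag = content_div.find_all_next('p', string=True)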

In [5]:
print(type(content_pTag))
print(type(content_pTag[0]))
print(type(content_pTag[0].text))

<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
<class 'str'>


In [6]:
print(content_pTag[0])

<p>
All children, except one, grow up. They soon know that they will grow up, and
the way Wendy knew was this. One day when she was two years old she was playing
in a garden, and she plucked another flower and ran with it to her mother. I
suppose she must have looked rather delightful, for Mrs. Darling put her hand
to her heart and cried, “Oh, why can’t you remain like this for
ever!” This was all that passed between them on the subject, but
henceforth Wendy knew that she must grow up. You always know after you are two.
Two is the beginning of the end.
</p>


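# Convert the list of bs4.element.Tag objects into a single string
1. Take the text of each tag (already a str)
2. Replace every non-word character (punctuation, \r, \n) with a space
3. Strip leading and trailing whitespace
4. Join the pieces to get a single string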

In [7]:
content_string = []
for p_tag in content_pTag:
    # Raw string avoids an invalid-escape warning; \W matches any non-word character
    content_string.append(re.sub(r'\W', ' ', p_tag.text).strip())
content = ' '.join(content_string)
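
One side effect of replacing every non-word character is that apostrophes disappear too, so a contraction such as "can’t" is split into two tokens. The sketch below reproduces the cleaning step on two short sample paragraphs; the function name `paragraphs_to_string` and the sample list are illustrative assumptions, not part of the notebook.

```python
import re


def paragraphs_to_string(paragraphs):
    """Replace non-word characters with spaces, trim, and join into one string."""
    cleaned = [re.sub(r"\W", " ", p).strip() for p in paragraphs]
    return " ".join(cleaned)


# Hypothetical sample input, shaped like the text of two p tags
sample = [
    "\nAll children, except one, grow up.\n",
    "\n“Oh, why can’t you remain like this for ever!”\n",
]

tokens = paragraphs_to_string(sample).split()
print(tokens)
```

In the printed list, "can" and "t" appear as separate tokens. If contractions matter for the analysis, a narrower pattern that keeps apostrophes would be one option to consider.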

In [8]:
print("length: ", len(content),"\n","type:", type(content))

length:  245941 
 type: <class 'str'>


# Tokenization

In [9]:
words = content.split()

In [10]:
from collections import Counter
from nltk.corpus import stopwords

# Most Common Words

In [11]:
counter = Counter(words)

counter.most_common(10)

[('the', 2100),
 ('and', 1321),
 ('to', 1148),
 ('was', 897),
 ('a', 877),
 ('he', 858),
 ('of', 780),
 ('it', 641),
 ('in', 613),
 ('that', 586)]
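
Note that `Counter` treats differently cased tokens as different words, so the counts above keep "The" and "the" apart. A minimal sketch of case-folded counting follows, assuming that merging cases is acceptable for the question at hand; `sample_words` is a made-up sentence used only for illustration.

```python
from collections import Counter

# Hypothetical token list with the same word in three different casings
sample_words = "The boy and the pirates saw THE crocodile".split()

# Case-sensitive counting, as in the cell above
case_sensitive = Counter(sample_words)

# Lowercase each token first so all casings share one entry
case_folded = Counter(w.lower() for w in sample_words)

print(case_sensitive["the"], case_folded["the"])
```

Lowercasing also folds proper nouns, so whether to apply it depends on whether character names need to stay distinguishable from ordinary words.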

# Most Common Words Without Stop Words
1. Remove stop words (compared in lowercase)

In [12]:
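# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
no_stops_words = [t for t in words if t.lower() not in stop_words]
counter_no_stops = Counter(no_stops_words)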

counter_no_stops.most_common(10)

[('Peter', 398),
 ('Wendy', 354),
 ('said', 353),
 ('would', 210),
 ('one', 198),
 ('Hook', 147),
 ('could', 140),
 ('cried', 136),
 ('John', 133),
 ('time', 122)]

# Lead characters of the novel

In [13]:
# The character names that appear among the ten most common non-stop words
protagonists = ['Peter', 'Wendy', 'Hook', 'John']
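
As a quick check, the mention counts for these names can be looked up directly in the counter. The sketch below rebuilds a small counter from the figures printed above so that it runs on its own; in the notebook, the existing `counter_no_stops` would be used instead. The names `mentions` and `ranked` are illustrative.

```python
from collections import Counter

# Stand-in for the notebook's counter, filled with counts from the output above
counter_no_stops = Counter(
    {"Peter": 398, "Wendy": 354, "said": 353, "Hook": 147, "John": 133}
)
protagonists = ["Peter", "Wendy", "Hook", "John"]

# Look up each name; a Counter returns 0 for names it has never seen
mentions = {name: counter_no_stops[name] for name in protagonists}

# Sort from most to least mentioned
ranked = sorted(mentions.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```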