# Webscraping with `BeautifulSoup`

In this assignment, you are tasked with:
* selecting any text that interests you from Project Gutenberg (https://www.gutenberg.org/); keep in mind that the text should have large divisions such as chapters, books, or parts
* scraping the text and ONLY the text
* creating a `DataFrame` from your scraped data and saving it as a `CSV`

Do your best to strip out any text that is not directly part of the chosen work. It can sometimes be difficult to tell, but footnotes, bibliographies, and editor's notes do not count as part of the work. These can provide interesting details, but the purpose of this assignment is to get comfortable with `BeautifulSoup` and its versatility in filtering text. If you are having trouble, feel free to reach out at peter.nadel@tufts.edu.

Good luck!

In [4]:
# call a GET request using the requests library and turn the response into a string
import requests
import bs4

response = requests.get('https://www.gutenberg.org/files/222/222-h/222-h.htm')
type(response)



requests.models.Response
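
The request above assumes the download succeeded. As a minimal sketch (the helper name `fetch_html` and the 30-second timeout are assumptions for illustration, not part of the assignment), the same step can be wrapped so that a failed request raises an error instead of silently returning an error page:

```python
import requests


def fetch_html(url, timeout=30):
    """Download a page and return its HTML as a string.

    `fetch_html` and the default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # passing an error page on to the parser.
    response.raise_for_status()
    # .text is the response body decoded to a str.
    return response.text
```

In the notebook this could be called as `text = fetch_html('https://www.gutenberg.org/files/222/222-h/222-h.htm')`.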

In [18]:
# create a Beautiful Soup object from the response string

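from bs4 import BeautifulSoup

# response.text is the page's HTML decoded to a str
text = response.text

# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, 'html.parser')

soup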


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<title>
      The Project Gutenberg eBook of The Moon and Sixpence, by W. Somerset Maugham
    </title>
<style type="text/css">
/*<![CDATA[  XML blockout */
<!--
    p {  margin-top: .75em;
         text-align: justify;
         margin-bottom: .75em;
         }
    h1,h2,h3,h4,h5,h6 {
         text-align: center; /* all headings centered */
         clear: both;
         }
    hr { width: 33%;
	 margin-top: 2em;
	 margin-bottom: 2em;
         margin-left: auto;
         margin-right: auto;
         clear: both;
       }
     a[name] { position:absolute; }
     a:link {color:#0000ff; background-color:#FFFFFF;
                              text-decoration:none; }
    a:visited {color:#0000ff; background-color:#FFFFFF;
                              text-decor
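
The full page is long, so it can help to practice on a tiny document first. The sketch below uses a made-up miniature page (`sample_html` is an assumption standing in for the Gutenberg HTML) to show how the parsed object exposes the title, the first heading, and the paragraphs:

```python
from bs4 import BeautifulSoup

# A made-up miniature page standing in for the Gutenberg HTML.
sample_html = """
<html>
  <head><title> A Sample Book </title></head>
  <body>
    <h2>Chapter I</h2>
    <p>First paragraph.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) removes the whitespace around the title text.
title = sample_soup.title.get_text(strip=True)
# find() returns the first matching tag; find_all() returns every match.
first_heading = sample_soup.find("h2").get_text()
paragraph_count = len(sample_soup.find_all("p"))

print(title, first_heading, paragraph_count)
```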

In [19]:
# find the tags which mark important divisions in the text (this can be chapters, books, parts etc...)

# important tags: h2 marks chapter headings and p marks paragraphs
headings = soup.find_all('h2')
print(len(headings))
headings




61


[<h2>W. Somerset Maugham</h2>,
 <h2>Table of Contents</h2>,
 <h2>The Moon and Sixpence</h2>,
 <h2><a id="Chapter_I" name="Chapter_I"></a>Chapter I</h2>,
 <h2><a id="Chapter_II" name="Chapter_II"></a>Chapter II</h2>,
 <h2><a id="Chapter_III" name="Chapter_III"></a>Chapter III</h2>,
 <h2><a id="Chapter_IV" name="Chapter_IV"></a>Chapter IV</h2>,
 <h2><a id="Chapter_V" name="Chapter_V"></a>Chapter V</h2>,
 <h2><a id="Chapter_VI" name="Chapter_VI"></a>Chapter VI</h2>,
 <h2><a id="Chapter_VII" name="Chapter_VII"></a>Chapter VII</h2>,
 <h2><a id="Chapter_VIII" name="Chapter_VIII"></a>Chapter VIII</h2>,
 <h2><a id="Chapter_IX" name="Chapter_IX"></a>Chapter IX</h2>,
 <h2><a id="Chapter_X" name="Chapter_X"></a>Chapter X</h2>,
 <h2><a id="Chapter_XI" name="Chapter_XI"></a>Chapter XI</h2>,
 <h2><a id="Chapter_XII" name="Chapter_XII"></a>Chapter XII</h2>,
 <h2><a id="Chapter_XIII" name="Chapter_XIII"></a>Chapter XIII</h2>,
 <h2><a id="Chapter_XIV" name="Chapter_XIV"></a>Chapter XIV</h2>,
 <h2><a id
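
Not every `h2` is a chapter: the first few hold the author, the table of contents, and the title. A minimal sketch of filtering them out, using a few headings copied from the output above as sample input (`sample_html` and the variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# A few headings copied from the output above, used as sample input.
sample_html = """
<h2>W. Somerset Maugham</h2>
<h2>Table of Contents</h2>
<h2><a id="Chapter_I" name="Chapter_I"></a>Chapter I</h2>
<h2><a id="Chapter_II" name="Chapter_II"></a>Chapter II</h2>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the headings whose visible text starts with "Chapter".
chapter_headings = [
    h2 for h2 in sample_soup.find_all("h2")
    if h2.get_text(strip=True).startswith("Chapter")
]
chapter_titles = [h2.get_text(strip=True) for h2 in chapter_headings]

print(chapter_titles)
```

Other books may mark divisions differently (for example with `h3` tags or "BOOK" and "PART" headings), so the test inside the list comprehension is the part to adapt.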

In [72]:
# populate a dictionary by looping through the soup

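Maugham = {}

for h2 in soup.find_all('h2'):
    # skip the front-matter headings (author, table of contents, title)
    if h2.text.startswith('Chapter'):
        stringy = ''

        # walk the elements that follow this heading until the next heading
        for t in h2.next_siblings:
            if t.name == 'h2':
                break
            if t.name == 'p':
                # two spaces separate paragraphs; '\r\n' line breaks become spaces
                stringy += '  '
                stringy += t.text.replace('\r\n', ' ')

        Maugham[h2.text] = stringy

print(Maugham.keys())

print(Maugham['Chapter LV'])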


dict_keys(['Chapter I', 'Chapter II', 'Chapter III', 'Chapter IV', 'Chapter V', 'Chapter VI', 'Chapter VII', 'Chapter VIII', 'Chapter IX', 'Chapter X', 'Chapter XI', 'Chapter XII', 'Chapter XIII', 'Chapter XIV', 'Chapter XV', 'Chapter XVI', 'Chapter XVII', 'Chapter XVIII', 'Chapter XIX', 'Chapter XX', 'Chapter XXI', 'Chapter XXII', 'Chapter XXIII', 'Chapter XXIV', 'Chapter XXV', 'Chapter XXVI', 'Chapter XXVII', 'Chapter XXVIII', 'Chapter XXIX', 'Chapter XXX', 'Chapter XXXI', 'Chapter XXXII', 'Chapter XXXIII', 'Chapter XXXIV', 'Chapter XXXV', 'Chapter XXXVI', 'Chapter XXXVII', 'Chapter XXXVIII', 'Chapter XXXIX', 'Chapter XL', 'Chapter XLI', 'Chapter XLII', 'Chapter XLIII', 'Chapter XLIV', 'Chapter XLV', 'Chapter XLVI', 'Chapter XLVII', 'Chapter XLVIII', 'Chapter XLIX', 'Chapter L', 'Chapter LI', 'Chapter LII', 'Chapter LIII', 'Chapter LIV', 'Chapter LV', 'Chapter LVI', 'Chapter LVII', 'Chapter LVIII'])
  Mr. Coutras was an old Frenchman of great stature and exceeding bulk.  His body was
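
The sibling walk used above can be tried in isolation on a small document. This is a sketch under assumptions: `sample_html`, the `note` class, and the helper `paragraphs_after` are invented for illustration and do not come from the Gutenberg page. It shows that only `p` tags are collected, so anything else between two headings is skipped:

```python
from bs4 import BeautifulSoup

# Made-up sample: two chapters, with a non-paragraph element in the first.
sample_html = """
<body>
<h2>Chapter I</h2>
<p>One.</p>
<div class="note">An editor's note.</div>
<p>Two.</p>
<h2>Chapter II</h2>
<p>Three.</p>
</body>
"""


def paragraphs_after(heading):
    """Return the text of each <p> between `heading` and the next <h2>."""
    paragraphs = []
    for sibling in heading.next_siblings:
        # Whitespace between tags is a text node whose .name is None,
        # so it matches neither test below and is ignored.
        if sibling.name == "h2":
            break
        if sibling.name == "p":
            paragraphs.append(sibling.get_text(strip=True))
    return paragraphs


sample_soup = BeautifulSoup(sample_html, "html.parser")
chapters = {
    h2.get_text(strip=True): paragraphs_after(h2)
    for h2 in sample_soup.find_all("h2")
}

print(chapters)
```

Keeping the paragraphs in a list rather than one long string is a design choice of this sketch; joining them with `'  '.join(...)` gives a single string per chapter.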

In [79]:
# convert the dictionary into a pandas DataFrame


import pandas as pd

The_Moon_And_Sixpence = pd.DataFrame.from_dict(Maugham, orient='index')
The_Moon_And_Sixpence = The_Moon_And_Sixpence.reset_index()
The_Moon_And_Sixpence = The_Moon_And_Sixpence.rename(columns={'index':'chapter',0:'text'})
The_Moon_And_Sixpence





| | chapter | text |
|---|---|---|
| 0 | Chapter I | I confess that when first I made acquaintanc... |
| 1 | Chapter II | When so much has been written about Charles ... |
| 2 | Chapter III | But all this is by the way. I was very youn... |
| 3 | Chapter IV | No one was kinder to me at that time than Ro... |
| 4 | Chapter V | During the summer I met Mrs. Strickland not ... |
| 5 | Chapter VI | But when at last I met Charles Strickland, i... |
| 6 | Chapter VII | The season was drawing to its dusty end, and... |
| 7 | Chapter VIII | On reading over what I have written of the S... |
| 8 | Chapter IX | "This is a terrible thing," he said, the mom... |
| 9 | Chapter X | A day or two later Mrs. Strickland sent me r... |
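
The `from_dict`, `reset_index`, and `rename` steps can also be collapsed into one call. A minimal sketch, assuming a small made-up dictionary (`sample_chapters`) in place of the scraped one:

```python
import pandas as pd

# Made-up stand-in for the scraped {chapter title: chapter text} dictionary.
sample_chapters = {
    "Chapter I": "First chapter text.",
    "Chapter II": "Second chapter text.",
}

# Each (key, value) pair becomes one row; the column names are set directly.
df = pd.DataFrame(list(sample_chapters.items()), columns=["chapter", "text"])

print(df.shape)
```

Both approaches should produce the same two-column layout; this one just avoids the intermediate index.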


In [80]:
# save the DataFrame as a CSV



The_Moon_And_Sixpence.to_csv('Maugham.csv',index=False)
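
It is worth checking that the saved file reads back as expected. The sketch below writes to an in-memory buffer instead of a file (an assumption made so that it runs without touching the disk) and shows what `index=False` is for: without it, the row index is saved as an extra column that `read_csv` labels `Unnamed: 0`.

```python
import io

import pandas as pd

# Made-up sample rows; the first text contains a comma and quotation marks.
df = pd.DataFrame({
    "chapter": ["Chapter I", "Chapter II"],
    "text": ['He said, "Hello."', "Plain text."],
})

# Save without the index, then read the CSV back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_tripped = pd.read_csv(buffer)

# Save with the index (the default) to see the extra column appear.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(list(round_tripped.columns))
print(list(reloaded_with_index.columns))
```

Commas and quotation marks inside the text are quoted by `to_csv` and restored by `read_csv`, so chapter text does not need to be cleaned of them before saving.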

In [7]:
# (optional: plot the frequency of a given term in your new dataset)
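
One possible way to approach this optional step is sketched below. It is not a reference solution: the sample rows and the helper names `term_counts` and `plot_term_counts` are assumptions for illustration, and the plot relies on matplotlib being installed.

```python
import re

import pandas as pd

# Made-up sample rows standing in for the scraped DataFrame.
sample = pd.DataFrame({
    "chapter": ["Chapter I", "Chapter II", "Chapter III"],
    "text": [
        "Strickland painted. Strickland left.",
        "No one spoke of him.",
        "Mrs. Strickland wrote to me.",
    ],
})


def term_counts(df, term):
    """Count case-insensitive occurrences of `term` in each chapter."""
    # str.count takes a regular expression, so the term is escaped first.
    pattern = re.escape(term.lower())
    counts = df["text"].str.lower().str.count(pattern)
    return pd.Series(counts.to_numpy(), index=df["chapter"], name=term)


def plot_term_counts(df, term):
    """Draw a bar chart of the per-chapter counts (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    ax = term_counts(df, term).plot(kind="bar")
    ax.set_xlabel("chapter")
    ax.set_ylabel(f"occurrences of '{term}'")
    plt.tight_layout()
    return ax


print(term_counts(sample, "Strickland"))
```

With the real data this could be called as `plot_term_counts(The_Moon_And_Sixpence, 'Strickland')`. The count is a substring match, so possessives such as "Strickland's" are included.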


