## Harry Potter Scraper
The script below finds and stores the text of all of the Harry Potter books by scraping this search results page: https://allnovel.net/search.php?keyword=harry-potter. The scrape has two levels. First we collect the link to each book from the search results page. Then, for each book, we follow the link to every chapter and scrape the chapter text.

In [1]:
import requests
import bs4
from datetime import datetime as dt
base_url = 'https://allnovel.net'
t0 = dt.now()

In [2]:
home_url = 'https://allnovel.net/search.php?keyword=harry-potter'
homepage = requests.get(home_url)
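# Name the parser explicitly so the result does not depend on which parsers are installed.
home_soup = bs4.BeautifulSoup(homepage.content, 'html.parser')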
bookLinks = [base_url + div.find('a')['href'] for div in home_soup.find_all('div', class_='thumbnail')]
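
The list comprehension above can be tried offline. The following is a minimal sketch that runs the same selector against a hand-written HTML fragment. The markup and hrefs in `sample_html` are made up for illustration and only assume what the code above implies: each result sits in a `div` with class `thumbnail` that wraps an `a` tag. `extract_book_links` is a hypothetical helper, and `urljoin` is swapped in for string concatenation so that relative and absolute hrefs are both handled.

```python
from urllib.parse import urljoin

import bs4

base_url = 'https://allnovel.net'

# Hypothetical markup, not copied from the real site.
sample_html = """
<div class="thumbnail"><a href="/book-one.html"><img src="a.jpg"></a></div>
<div class="thumbnail"><a href="/book-two.html"><img src="b.jpg"></a></div>
<div class="other"><a href="/ignored.html">ignored</a></div>
"""

def extract_book_links(html, base):
    """Return absolute links found inside div.thumbnail elements."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    links = []
    for div in soup.find_all('div', class_='thumbnail'):
        a = div.find('a')
        # Skip thumbnails that have no link instead of raising a TypeError.
        if a is not None and a.get('href'):
            links.append(urljoin(base, a['href']))
    return links

sample_links = extract_book_links(sample_html, base_url)
print(sample_links)
```

Unlike the one-line comprehension, this version skips a `thumbnail` block that has no `a` tag rather than failing on it.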

### Create Backbone Functions
Let's write functions that scrape a book given the URL of its page. For each book we need to collect the links to all of its chapters and then scrape each chapter page for the actual text. We define the chapter-level function first, because the book-level function calls it. A third helper returns the book's title.

In [3]:
def getChapterTextFromChapterLink(chapterLink):
    """Download one chapter page and return its paragraphs joined into one string."""
    chapter_home = requests.get(chapterLink).content
    chapter_soup = bs4.BeautifulSoup(chapter_home, 'html.parser')
    # The chapter body is the element with class 'des_novel'.
    paragraphs = [p.get_text() for p in chapter_soup.find(class_='des_novel').find_all('p')]
    return ' '.join(paragraphs).strip('\t\n')
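
A note on the final `strip` call: `str.strip` treats its argument as a set of characters to remove from both ends, not as a literal prefix or suffix. Repeating `\t` in the argument therefore has no extra effect. A short check on a made-up string:

```python
# Made-up chapter text with leading and trailing tabs and newlines.
raw = '\t\n\t\tFirst paragraph. Second paragraph.\n\t'

long_form = raw.strip('\t\n\t\t\t\t\t\t\t\t')
short_form = raw.strip('\t\n')

# Both calls remove the same leading and trailing tabs and newlines.
print(long_form == short_form)
print(repr(short_form))
```

Spaces are not in the character set, so leading or trailing spaces would survive either call. A bare `strip()` would remove them too.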

In [4]:
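def getBookTextFromBookLink(bookLink):
    """Download a book page, follow every chapter link, and return the full text."""
    book_page = requests.get(bookLink)
    book_soup = bs4.BeautifulSoup(book_page.content, 'html.parser')
    # Chapter links are the anchors inside the element with id 'list_chapter'.
    chapterLinks = [base_url + a['href'] for a in book_soup.find(id='list_chapter').find_all('a')]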
    book_text = ''
    for chapterLink in chapterLinks:
        book_text += ' ' + getChapterTextFromChapterLink(chapterLink)
    
    return book_text

In [5]:
def getBookTitleFromBookLink(bookLink):
    """Download a book page and return the text of its first <h1> as the title."""
    book_page = requests.get(bookLink)
    book_soup = bs4.BeautifulSoup(book_page.content, 'html.parser')
    return book_soup.find('h1').get_text()
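
`getBookTextFromBookLink` and `getBookTitleFromBookLink` each download the same book page. One possible way to avoid the second request is to parse the title and the chapter links from a single response. The sketch below works on an HTML string so that it can be tried without network access. `parse_book_page` and the sample markup are hypothetical; the selectors (`h1` and `id='list_chapter'`) are taken from the functions above.

```python
from urllib.parse import urljoin

import bs4

def parse_book_page(html, base):
    """Return (title, chapter_links) parsed from one book page's HTML."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    h1 = soup.find('h1')
    title = h1.get_text(strip=True) if h1 is not None else ''
    chapter_list = soup.find(id='list_chapter')
    anchors = chapter_list.find_all('a') if chapter_list is not None else []
    links = [urljoin(base, a['href']) for a in anchors if a.get('href')]
    return title, links

# Hypothetical markup, not copied from the real site.
sample_book_html = """
<h1> Sample Book (Sample Series #1) </h1>
<div id="list_chapter">
  <a href="/sample-book/chapter-1.html">Chapter 1</a>
  <a href="/sample-book/chapter-2.html">Chapter 2</a>
</div>
"""

sample_title, sample_chapters = parse_book_page(sample_book_html, 'https://allnovel.net')
print(sample_title)
print(sample_chapters)
```

With a helper like this, the caller would pass `requests.get(bookLink).content` once and use both return values.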

### Scrape all of the books and save them as a CSV

In [6]:
books = dict()

for bookLink in bookLinks:
    title = getBookTitleFromBookLink(bookLink)
    print('Working on {}...'.format(title))
    bookText = getBookTextFromBookLink(bookLink)
    shortened_title = title[title.find('(')+1:title.find(')')] # the series label in parentheses, e.g. 'Harry Potter #5'; unused here, the dict is keyed by the full title
    
    books[title] = bookText
t1 = dt.now()

Working on Harry Potter and the Order of the Phoenix (Harry Potter #5)...
Working on Harry Potter and the Chamber of Secrets (Harry Potter #2)...
Working on Harry Potter and the Deathly Hallows (Harry Potter #7)...
Working on Harry Potter and the Half-Blood Prince (Harry Potter #6)...
Working on Harry Potter and the Goblet of Fire (Harry Potter #4)...
Working on Harry Potter and the Prisoner of Azkaban (Harry Potter #3)...
Working on Harry Potter and the Philosopher's Stone (Harry Potter #1)...


In [7]:
import pandas as pd
books_df = pd.DataFrame.from_dict(books, orient='index').reset_index()
books_df.rename(columns={'index': 'title', 0: 'text'}, inplace=True)
books_df['shortened_title'] = books_df['title'].apply(lambda x: x[x.find('(')+1:x.find(')')])

# index=False keeps the meaningless integer index out of the CSV.
books_df.to_csv('harry_potter.csv', index=False)
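
The books are scraped in the order the search page lists them, which is not series order (see the log output above). If series order matters downstream, the book number can be pulled out of the title with a regular expression. This is a minimal sketch using only the standard library. `book_number` is a hypothetical helper, and the titles are copied from the log output above.

```python
import re

titles = [
    'Harry Potter and the Order of the Phoenix (Harry Potter #5)',
    'Harry Potter and the Chamber of Secrets (Harry Potter #2)',
    'Harry Potter and the Deathly Hallows (Harry Potter #7)',
    'Harry Potter and the Half-Blood Prince (Harry Potter #6)',
    'Harry Potter and the Goblet of Fire (Harry Potter #4)',
    'Harry Potter and the Prisoner of Azkaban (Harry Potter #3)',
    "Harry Potter and the Philosopher's Stone (Harry Potter #1)",
]

def book_number(title):
    """Return the integer after '#' in the title, or infinity if there is none."""
    match = re.search(r'#(\d+)\)', title)
    # Titles without a number sort last instead of raising an error.
    return int(match.group(1)) if match else float('inf')

ordered = sorted(titles, key=book_number)
for title in ordered:
    print(book_number(title), title)
```

The same key function could be applied to the `title` column of `books_df` to add a numeric column and sort on it.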

In [8]:
print('Total time taken to run this script: {}'.format(t1-t0))

Total time taken to run this script: 0:02:12.444677
