# Scraping book summaries
In this practice project, I scrape book summaries from [James Clear's website](https://jamesclear.com/book-summaries).

This is a simple project. The page lists 55 books in total. First, I grab the title and short summary of each book from the list page.
Then I open the page of each book and grab its title, short summary, and long summary.

## Imports

In [147]:
import requests
from bs4 import BeautifulSoup as bs
import re
import csv
import threading
import pandas as pd

## Grab short summaries of all books

In [149]:
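req = requests.get('https://jamesclear.com/book-summaries')
# name the parser explicitly so BeautifulSoup does not have to guess one
soup = bs(req.text, 'html.parser')

# create a list of dictionaries: books [{'title': 'title1', 'summary': 'summary1'}, ...]
books = []
titles = [title.text for title in soup.select('.sale-book__title')]
# [29:] drops the 28-character label 'The Book in Three Sentences:' and the space after it
summaries = [p.text[29:] for p in soup.select('p') if 'The Book in Three Sentences:' in p.text]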

for title, summary in zip(titles, summaries):
    book = {}
    book['title'] = title
    book['summary'] = summary
    books.append(book)
    
print(books[0])

{'title': '10% Happier by Dan Harris', 'summary': 'Practicing meditation and mindfulness will make you at least 10 percent happier. Being mindful doesn’t change the problems in your life, but mindfulness does help you respond to your problems rather than react to them. Mindfulness helps you realize that striving for success is fine as long as you accept that the outcome is outside your control.'}
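
The slice `p.text[29:]` above only works while the label is exactly 28 characters followed by one space. A slightly more defensive sketch splits on the label itself. `PREFIX` and `strip_prefix` are hypothetical names introduced here for illustration, and the sample sentence is made up.

```python
PREFIX = 'The Book in Three Sentences:'

def strip_prefix(text, prefix=PREFIX):
    """Return the text after the label, or the text unchanged if the label is absent."""
    head, sep, tail = text.partition(prefix)
    # partition returns an empty separator when the label is not found
    return tail.strip() if sep else text.strip()

print(strip_prefix('The Book in Three Sentences: Seek out new ideas.'))
# prints: Seek out new ideas.
```

With this helper, the list comprehension could call `strip_prefix(p.text)` instead of slicing at a fixed offset.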


**Now convert this data into a dataframe**

In [150]:
data = pd.DataFrame({'Title':titles, 'Short Summary':summaries})
data.head()

|   | Title | Short Summary |
|---|---|---|
| 0 | 10% Happier by Dan Harris | Practicing meditation and mindfulness will mak... |
| 1 | The 10X Rule by Grant Cardone | The 10X Rule says that 1) you should set targe... |
| 2 | A Short Guide to a Happy Life by Anna Quindlen... | The only thing you have that nobody else has i... |
| 3 | A Technique for Producing Ideas by James Webb ... | An idea occurs when you develop a new combinat... |
| 4 | Adapt by Tim Harford | Seek out new ideas and try new things. When tr... |
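
The `books` list of dictionaries built earlier holds the same information as the dataframe, so it could also be saved using only the standard library. Below is a minimal sketch with `csv.DictWriter`. The two-item `sample_books` list uses placeholder summaries in place of the scraped text, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Placeholder data shaped like the scraped `books` list
sample_books = [
    {'title': '10% Happier by Dan Harris', 'summary': 'placeholder summary 1'},
    {'title': 'Adapt by Tim Harford', 'summary': 'placeholder summary 2'},
]

buffer = io.StringIO()
dict_writer = csv.DictWriter(buffer, fieldnames=['title', 'summary'])
dict_writer.writeheader()            # first row: title,summary
dict_writer.writerows(sample_books)  # one row per dictionary

lines = buffer.getvalue().splitlines()
print(lines[0])
# prints: title,summary
```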


## Grab long summaries of all books without threading

In [151]:
# Grab links of all book summary pages 
req = requests.get('https://jamesclear.com/book-summaries')
soup = bs(req.text, 'html.parser')
links = ['https://jamesclear.com'+a['href'] for a in soup.find_all('a', string='read the book summary')]
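
Concatenating the domain with `a['href']` assumes every href is a root-relative path. A sketch that also tolerates absolute hrefs can use `urllib.parse.urljoin`. `BASE_URL` and `absolute_link` are hypothetical names, and the `example-book` path is invented for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://jamesclear.com/book-summaries'

def absolute_link(href, base=BASE_URL):
    """Resolve an href found on the list page against the URL of that page."""
    return urljoin(base, href)

# A root-relative href gets the scheme and domain of the base URL
print(absolute_link('/book-summaries/example-book'))
# An href that is already absolute is returned unchanged
print(absolute_link('https://jamesclear.com/book-summaries/example-book'))
```

Both calls print `https://jamesclear.com/book-summaries/example-book`.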

In [157]:
# Create the Book Summaries csv file and a csv writer object
# to write book data into it. The with block closes the file
# automatically, even if a request or a selector fails midway.
with open('Book Summaries.csv', mode='w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(['Title', 'Short Summary', 'Full Summary'])

    # Iterate over all the links
    for link in links:
        # request the book page and turn it into a soup
        book_page = requests.get(link)
        book_page_soup = bs(book_page.text, 'html.parser')

        # grab title (book name)
        title = book_page_soup.select_one('h1').text

        # grab short summary (the element right after the first h2)
        short_summary = book_page_soup.select_one('h2').find_next_sibling().text

        # grab full summary
        full_summary = book_page_soup.select_one('.summary').get_text(separator=' ')

        # write grabbed info into the Book Summaries file
        writer.writerow([title, short_summary, full_summary])
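
The full summaries are long texts that can contain commas, quotes, and line breaks, which is why the rows go through the `csv` module rather than a plain string join. The sketch below writes one made-up row to an in-memory buffer (standing in for the file) and reads it back to check that the fields survive unchanged.

```python
import csv
import io

# Made-up row: the fields contain a comma, double quotes and a line break
row = ['Example Title', 'Short text, with a comma', 'Line one\nLine "two"']

buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter=',')
writer.writerow(['Title', 'Short Summary', 'Full Summary'])
writer.writerow(row)

# Read everything back; the quoted multi-line field comes back as one value
buffer.seek(0)
rows = list(csv.reader(buffer))
print(len(rows))        # header plus one data row
print(rows[1] == row)
```

The same check could be run on the real file by opening it with `newline=''` and passing the file object to `csv.reader`.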

## Grab long summaries of all books with threading

In [159]:
# lock shared by all threads so that only one of them writes to the csv file at a time
write_lock = threading.Lock()

# create a function to run concurrently in each thread
def get_book_info(link, writer):
    '''
    link: link of the book page
    writer: csv writer object needed to write data
    '''
    # request the book page and turn it into a soup
    book_page = requests.get(link)
    book_page_soup = bs(book_page.text, 'html.parser')

    # grab title (book name)
    title = book_page_soup.select_one('h1').text

    # grab short summary (the element right after the first h2)
    short_summary = book_page_soup.select_one('h2').find_next_sibling().text

    # grab full summary
    full_summary = book_page_soup.select_one('.summary').get_text(separator=' ')

    # write grabbed info into the Book Summaries file; the writer is shared
    # between threads, so the writes are serialized with the lock
    with write_lock:
        writer.writerow([title, short_summary, full_summary])


In [160]:
# Grab links of all book summary pages  
req = requests.get('https://jamesclear.com/book-summaries')
soup = bs(req.text, 'html.parser')
links = ['https://jamesclear.com'+a['href'] for a in soup.find_all('a', string='read the book summary')]

# Create a Book Summaries2 csv file and also a csv writer object 
# to write book data into the file
f = open('Book Summaries2.csv', mode='w', newline='', encoding='utf-8')
writer = csv.writer(f, delimiter=',')
writer.writerow(['Title', 'Short Summary', 'Full Summary'])

# create a list for threads
threads = []

# iterate over all links
for link in links:
    
    # create thread for each link and start it
    t = threading.Thread(target=get_book_info, args=[link, writer])
    t.start()
    threads.append(t)
    
# join all threads
for t in threads:
    t.join()
   
# close the file
f.close()
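
Starting one thread per link means the rows land in the file in whatever order the requests finish, so the row order can change between runs. One possible alternative, sketched here under assumptions, is a `ThreadPoolExecutor`: it caps the number of simultaneous requests, and `map` hands the results back in input order so that only the main thread needs to write. `fetch_book_info` is a hypothetical stand-in that builds a fake row instead of scraping, and `sample_links` are placeholder URLs.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_book_info(link):
    """Hypothetical stand-in for the scraping function: returns a row instead of writing it."""
    slug = link.rstrip('/').split('/')[-1]
    return [slug, 'short summary of ' + slug, 'full summary of ' + slug]

# Placeholder links, not real pages
sample_links = [
    'https://example.com/books/a',
    'https://example.com/books/b',
    'https://example.com/books/c',
]

# max_workers caps how many calls run at once; map yields results in input order
with ThreadPoolExecutor(max_workers=8) as pool:
    rows = list(pool.map(fetch_book_info, sample_links))

print([row[0] for row in rows])
# prints: ['a', 'b', 'c']
```

In the real notebook, the function body would request and parse the page as `get_book_info` does, and the collected `rows` could then be passed to `writer.writerows` in a single call.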