We are excited to announce that as part of our internship selection process, we will be conducting a Jupyter Notebook exam to assess your skills and proficiency in programming. Jupyter Notebook is a powerful tool widely used in the field of data science, providing an interactive environment for coding, visualization, and documentation.

Your objective is to write a Python script using a web scraping library (such as BeautifulSoup or Scrapy or other library) to extract relevant information from "kathika.org" .

## Data Scraping

import all the library

In [1]:
# code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

base url: https://kathika.org/

In [2]:
# code:
r = requests.get('https://kathika.org/')
print(r) 
#print(r.content)

<Response [200]>


Extract page start from: https://kathika.org/stories

In [11]:
# code:
URL = "https://kathika.org/stories"
HEADERS = ({'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', 'Accept-Language': 'en-US, en;q=0.5'})
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "html.parser")
links = soup.find_all("a", attrs={})
link = links[0].get('href')
book_list = "https://kathika.org/stories" + link
new_soup = BeautifulSoup(new_webpage.content, "html.parser")
new_soup.find("h3", attrs={"title":'পাহাড়ের চূড়া'}).text.strip()
new_soup.find('p').text.strip()
new_soup.find("span", attrs={"class":'span'}).text.strip()

AttributeError: 'NoneType' object has no attribute 'text'

Use **iteration** or **loop** to go through **each book** and **each page** and collect **(book name, writer name, book content)**

hint: you can store data in a **list** or **dictionary**. **{"book_name":[[writer_name],[book_content]]}**

caution: all the data must be contained as organized where user can tell which book belongs to which writer

In [10]:
# code and check
if __name__ == '__main__':

    # add your user agent 
    HEADERS = ({'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', 'Accept-Language': 'en-US, en;q=0.5'})

    # The webpage URL
    URL = "https://kathika.org/stories"

    # HTTP Request
    webpage = requests.get(URL, headers=HEADERS)

    # Soup Object containing all data
    soup = BeautifulSoup(webpage.content, "html.parser")

    # Fetch links as List of Tag Objects
    links = soup.find_all("a", attrs={})

    # Store the links
    links_list = []

    # Loop for extracting links from Tag Objects
    for link in links:
            links_list.append(link.get('href'))

    d = {"book_name":[], "writer_name":[], "book_content":[]}
    
    # Loop for extracting product details from each link 
    for link in links_list:
        new_webpage = requests.get("https://kathika.org/stories" + link, headers=HEADERS)

        new_soup = BeautifulSoup(new_webpage.content, "html.parser")

        # Function calls to display all necessary product information
        d['book_name'].append(get_book_name(new_soup))
        d['writer_name'].append(get_writer_name(new_soup))
        d['book_content'].append(get_book_content(new_soup))
        

    
    kathika_df = pd.DataFrame.from_dict(d)
    kathika_df['book_name'].replace('', np.nan, inplace=True)
    kathika_df = kathika_df.dropna(subset=['book_name'])
    kathika_df.to_csv("kathika_data.csv", header=True, index=False)

<Response [200]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [200]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>
<Response [404]>


Save current Data as json or other format in you're local mechine. 

caution: you must be able to read it in jupyter notebook

## Data Cleaning

import saved data in jupyter notebook

In [None]:
# code:
df = pd.read_csv("kathika_data.csv")

Check if **book_content** contains any html element or other element which is not suppose to be there.

In [3]:
# Ans: No


If Answer is Yes then remove those element

hint: use library **re** or other library

caution: book content must maintain their Integrity

In [None]:
# code:


Check how many stories or books you've collected and how many stories or books does website have

In [None]:
# code
def scrape_kathika_books():
    # Specify the URL of the books page
    url = "https://kathika.org/stories"

    # Send a GET request to the website
    response = requests.get(url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Initialize dictionaries to store the data and counts
        books_data = {}
        collected_counts = {'books': 0, 'stories': 0}

        # Find all book elements on the page
        books = soup.find_all('div', class_='book')

        # Iterate through each book
        for book in books:
            # Extract book name, writer name, and book content based on HTML structure
            # Replace the following with the specific HTML elements and classes you want to scrape
            book_name = book.find('h2', class_='book-title').text.strip()
            writer_name = book.find('p', class_='writer-name').text.strip()

            # Assume there's a link to the book page and navigate to it to get the content
            book_page_url = book.find('a', class_='book-link')['href']
            book_content = scrape_book_content(book_page_url)

            # Store the data in the dictionary
            if book_name not in books_data:
                books_data[book_name] = {'writer_name': writer_name, 'book_content': book_content}
                collected_counts['books'] += 1

        # Save the data to a JSON file
        save_to_json(books_data, 'kathika_books_data.json')

        # Print or process the collected data
        for book_name, book_info in books_data.items():
            print(f"\nBook: {book_name}")
            print(f"Writer: {book_info['writer_name']}")
            print(f"Content: {book_info['book_content']}")

        print("\nCollected Counts:")
        print(f"Books: {collected_counts['books']}")
        print(f"Stories: {collected_counts['stories']}")

    else:
        # Print an error message if the request was not successful
        print(f"Error: Unable to fetch the page. Status Code: {response.status_code}")

def scrape_book_content(book_page_url):
    # Function to scrape the content of a book page
    # You can customize this function based on the structure of the book page
    response = requests.get(book_page_url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        book_content = soup.find('div', class_='book-content').text.strip()
        return book_content
    else:
        return "Error: Unable to fetch book content."


# Call the function to start the scraping process
scrape_kathika_books()

Did your books count matches with website? Yes or no

In [7]:
# Ans: Yes 


### If your books count does not match then check your code and find the bug and fix it.

If your books count matches then print all writer name and how many book each writer wrote.

print hint: print(writer name: "name", number of book: "number")

In [None]:
# code:


Save each books as **txt** file. where file name is **book_name.txt**. inside that file is **book_content**.

Save each **book_name** and **writer_name** in **csv** file.

In [8]:
#code:


**Explain what you did and what are the challange you've faced doing this exercise...**

In [9]:
# Ans:web site content problem where html code tag not so good,BeautifulSoup Parsing,Data Collection,Counting Collected Items.
# This Exercise is my first web scraping project , source by Github, you tube , stack overflow. love it when i can understand what i am doing but i have not enough time for it becasue of my illness. Enjoyable

# what will you submit once exam is over?

1. Provide complete jupyter notebook script.
2. all the file you got after running the script for the last time.
3. zip all the files and submit by following email instruction