# Team LevelUp, Submission Notebook

## Overview

This notebook demonstrates a process to scrape a website for links and content, generate questions based on the collected content, and save the results in JSON format. The workflow consists of 3 main parts:

1. **Web Scraping**: Extracts links and content from a specified website.
2. **Content Analysis**: Merges content, generates questions, and fetches titles for the links.
3. **Data Validation**: Tests the output file against various parameters.

![Overview Figure](overview.png)

## Dependencies

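Before running the notebook, install the following third-party Python packages:
- `requests`
- `beautifulsoup4`
- `google-generativeai`

The notebook also imports `json`, `subprocess`, `logging`, and `urllib.parse`. These are part of the Python standard library and need no installation.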

In [1]:
import requests
from bs4 import BeautifulSoup 
import json
import google.generativeai as genai
import subprocess
from urllib.parse import urlparse
import logging



# Website Scraper Script

This script is designed to scrape a given website, collect links, and extract relevant content and titles from those links. The process includes error handling and logging for better traceability.

## Features

- **Logging Configuration**: The script uses `logging.basicConfig` to set up logging with a debug level and a specific format that includes timestamps, log levels, and messages.

- **Scrape Website Function**: The `scrape_website(url)` function:
  - Logs the start of the scraping process.
  - Fetches the main page and returns early if the request fails.
  - Extracts all links from the main page, skipping duplicates and links that contain `#`.
  - Fetches each remaining link and collects the text of its `h1`, `h2`, and `p` elements, stopping after 12 pages.
  - Logs request failures as warnings and any other exception as an error, then moves on to the next link.

- **Content and Links**: The function returns two parallel lists: the page content as `{"link", "text"}` entries and the relevant links as `{"link", "title"}` entries.

## Usage

To use the script, simply call the `scrape_website(url)` function with the desired URL. The function returns the scraped content and a list of relevant links.

```python
content, links = scrape_website("https://example.com")
```

In [2]:
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

def scrape_website(url):
    logging.info(f"Scraping website: {url}")
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        logging.error(f"Failed to fetch the main URL: {e}")
        return [], []  # Callers unpack two values: content and links

    main_soup = BeautifulSoup(response.text, 'html.parser')
    links = [link.get('href') for link in main_soup.find_all('a') if link.get('href')]
    
    logging.debug(f"Found {len(links)} links on the main page")

    main_content = []
    unique_links = set()
    count = 0
    relevant_links = []

    for link in links:
        # Construct full link if necessary
        full_link = link if link.startswith('http') else f"{url.rstrip('/')}/{link.lstrip('/')}"

        # Check if the link is unique and doesn't contain '#'
        if '#' not in full_link and full_link not in unique_links:
            unique_links.add(full_link)
            logging.debug(f"Processing link: {full_link}")

            try:
                link_response = requests.get(full_link, timeout=5)
                link_response.raise_for_status()  # Check for HTTP errors

                soup = BeautifulSoup(link_response.text, 'html.parser')

                # soup.title.string is None when the <title> tag is empty or has nested tags
                title = soup.title.string.strip() if soup.title and soup.title.string else "No Title Found"
                relevant_links.append({"link": full_link, "title": title})
                content = {"link": full_link}
                txt = ""

                # Iterate over all elements and add their text to the content list
                for element in soup.find_all(['h1', 'h2', 'p']):
                    txt += " " + element.get_text(strip=True)

                content["text"] = txt.strip() + '\n'  # Ensure each content ends with a newline
                main_content.append(content)
                
                count += 1
                if count == 12:  # Stop after processing 12 unique links
                    break

            except requests.RequestException as e:
                logging.warning(f"Failed to scrape {full_link}: {e}")
            except Exception as e:
                logging.error(f"Unexpected error when scraping {full_link}: {e}")
    
    return main_content, relevant_links
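
The log output further down shows both `https://trumio.ai` and `https://trumio.ai/` being fetched, because the two strings differ only by a trailing slash. A minimal sketch of a stricter link filter follows, using `urllib.parse.urljoin` from the standard library in place of string concatenation. The helper name `normalize_links` and its `limit` parameter are hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, hrefs, limit=12):
    """Resolve hrefs against base_url and drop fragments, non-HTTP links and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip empty hrefs and in-page anchors, as the notebook does
        if not href or '#' in href:
            continue
        full_link = urljoin(base_url, href)
        # Skip schemes such as mailto: or tel: that cannot be fetched with requests
        if urlparse(full_link).scheme not in ("http", "https"):
            continue
        # Treat "https://site" and "https://site/" as the same page
        key = full_link.rstrip('/')
        if key in seen:
            continue
        seen.add(key)
        result.append(full_link)
        if len(result) == limit:
            break
    return result


links = normalize_links(
    "https://example.com/",
    ["/", "about", "/about/", "mailto:hi@example.com", "#top", "https://example.com"],
)
print(links)
```

With this filter each page is requested once, so the 12-page budget is not spent on duplicates.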


# JSON String Parser

This function parses a JSON array that may be embedded in other text, such as a model reply. It keeps the substring between the first `[` and the last `]` and then attempts to parse it with `json.loads`.

## Features
- **Error Handling**: The function handles the following cases:
  - **`JSONDecodeError`**: Caught and printed if the JSON parsing fails.
  - **`ValueError`**: Raised, then caught and printed, if the trimmed string is empty.
  - **General Exception Handling**: Any other unexpected error is caught and printed.

- **Return Value**: If successful, the function returns the parsed JSON object. If an error occurs, it returns `None`.

## Usage

To use the function, pass the string containing the JSON data to `parse_json_string(json_string)`. The function will return the parsed JSON object or `None` if parsing fails.

```python
parsed_data = parse_json_string('Your string containing JSON data')
if parsed_data:
    print(parsed_data)
else:
    print("Failed to parse JSON data.")
```

In [3]:
def parse_json_string(json_string):
    try:
        # Trim the string to get the JSON part between the first '[' and the last ']'
        start_index = json_string.find('[')
        end_index = json_string.rfind(']') + 1
        trimmed_json_string = json_string[start_index:end_index].strip()

        # Check if the trimmed string is empty
        if not trimmed_json_string:
            raise ValueError("No valid JSON data found in the string.")

        # Parse the JSON string
        parsed_json = json.loads(trimmed_json_string)
        return parsed_json
    
    except json.JSONDecodeError as e:
        print(f"JSON decoding failed: {e}")
    except ValueError as e:
        print(f"Value error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

    return None
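
To see what the trimming does on concrete input, here is a self-contained sketch built around a condensed variant of the parser. The name `extract_json_array` is hypothetical, and the sample reply is invented for illustration rather than taken from a real model response.

```python
import json


def extract_json_array(text):
    """Return the JSON array embedded in text, or None if there is none."""
    start = text.find('[')
    end = text.rfind(']') + 1
    # find() returns -1 when there is no '[', so treat that as "nothing to parse"
    snippet = text[start:end].strip() if start != -1 else ""
    if not snippet:
        return None
    try:
        return json.loads(snippet)
    except json.JSONDecodeError:
        return None


reply = 'Sure! ["What is X?", "How does Y work?"] Hope this helps.'
print(extract_json_array(reply))          # the two questions as a Python list
print(extract_json_array("no json here"))  # None
```

Because only the span between the first `[` and the last `]` is kept, a reply that contains a JSON object without any array yields `None`.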


# Website Content Scraping and Question Generation Pipeline

The `process_website` function implements the end-to-end pipeline: it scrapes the site, asks a Gemini model to generate questions from each scraped page, and writes the result to a JSON file.

## Key Functionalities

1. **Web Scraping**:
   - The function initiates the process by invoking the `scrape_website(url)` method, which performs HTTP requests to fetch the HTML content of the specified URL.
   - It parses the HTML using `BeautifulSoup`, extracting relevant text and hyperlinks. The scraped data is serialized into a JSON structure and stored locally as `scraped_data.json`.

2. **AI-Driven Content Analysis**:
   - The function uses Google's Generative AI model (`gemini-1.5-flash`) to generate questions from the scraped text.
   - For each page, a templated query asks the model for 2 general questions, each strictly under 70 characters, returned as a JSON array.
   - The function collects the generated questions and the corresponding relevant links until at least 10 questions have been gathered.

3. **Data Storage**:
   - The output, consisting of the source URL, the generated questions, and the associated links, is assembled into a structured JSON document.
   - The JSON data is then written to `./outputs/output.json`.
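
The length limit is only stated in the prompt, and the sample output further down contains questions longer than 70 characters, so the model does not always follow it. One possible way to enforce the limit after parsing is sketched below. The function `select_questions` and its parameters are hypothetical and not part of the notebook.

```python
def select_questions(candidates, max_len=70, limit=10):
    """Keep unique, non-empty string questions shorter than max_len, up to limit items."""
    selected = []
    for question in candidates:
        # The parsed JSON array may contain non-string items
        if not isinstance(question, str):
            continue
        question = question.strip()
        if question and len(question) < max_len and question not in selected:
            selected.append(question)
        if len(selected) == limit:
            break
    return selected


long_question = "How does Trumio facilitate collaboration between clients and university teams?"
candidates = [
    "What problems does Trumio solve for businesses?",
    long_question,  # taken from the sample output, exceeds the limit
    " What problems does Trumio solve for businesses? ",  # duplicate after stripping
    42,
    "",
]
print(select_questions(candidates))
```

Filtering like this can leave fewer than 10 questions, so a caller would need to keep requesting pages until the target is reached.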


## Usage Example

Invoke the `process_website` function with a target URL:

```python
process_website('https://example.com')
```
The output file is stored in `./outputs/output.json`.
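
The notebook imports `urlparse` but the code cell below writes to a fixed file name. If one file per site is preferred, a name could be derived from the domain roughly as follows. This is a sketch under that assumption; `output_path_for` and the `outputs` default are hypothetical.

```python
from urllib.parse import urlparse


def output_path_for(url, out_dir="outputs"):
    """Build an output path such as outputs/<domain>.json from a URL."""
    # netloc is empty when the URL has no scheme, so fall back to a generic name
    domain = urlparse(url).netloc or "output"
    # A port separator is not a safe file name character on every platform
    domain = domain.replace(":", "_")
    return f"{out_dir}/{domain}.json"


print(output_path_for("https://trumio.ai/"))
print(output_path_for("http://localhost:8000/docs"))
```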


In [4]:
import os

def process_website(url):
    print("Scraping website...")
    scraped_data, relevant_links = scrape_website(url)
    if scraped_data and relevant_links:
        print("Website scraping completed...✅ ✅")
    else:
        print("Error encountered while scraping...❌❌")
        return
    print("Generating Questions... Please wait...🕒🕒🕒")
    # Keep a copy of the raw scraped content for inspection
    with open('scraped_data.json', 'w') as f:
        json.dump(scraped_data, f, indent=4)

    # Read the API key from the API_KEY environment variable instead of hard-coding it
    genai.configure(api_key=os.environ["API_KEY"])
    model = genai.GenerativeModel('gemini-1.5-flash')
    query_template = "from the following content generate 2 general questions, with length strictly less than 70 characters. (return a json array): "

    questions = []
    rl = []

    # scraped_data and relevant_links are parallel lists, so iterate over them together
    for content, link in zip(scraped_data, relevant_links):
        query = query_template + content["text"]
        response = model.generate_content(query)
        q = parse_json_string(response.text)
        if q:
            questions.extend(q)
            rl.append(link)
        # Stop once at least 10 questions have been collected
        if len(questions) >= 10:
            break

    outputdata = [
        {
            "url": url,
            "questions": questions,
            "relevant_links": rl
        }
    ]

    # Create the output directory if it does not exist yet
    os.makedirs('./outputs', exist_ok=True)
    with open('./outputs/output.json', 'w') as f:
        json.dump(outputdata, f, indent=4)

    print("Output Received✅✅")

    pretty_json = json.dumps(outputdata, indent=4, sort_keys=True)
    print(pretty_json)

    print("Output also saved in -> outputs/output.json")

# Website Processing Script

This script scrapes a website, generates relevant questions using AI, and saves the output in `outputs/output.json`. 

In [5]:
website_url = input("Enter URL : ") 
process_website(website_url)

Enter URL :  https://trumio.ai/


2024-08-26 12:24:19,483 - INFO - Scraping website: https://trumio.ai/
2024-08-26 12:24:19,487 - DEBUG - Starting new HTTPS connection (1): trumio.ai:443


Scraping website...


2024-08-26 12:24:20,270 - DEBUG - https://trumio.ai:443 "GET / HTTP/11" 200 None
2024-08-26 12:24:20,368 - DEBUG - Found 106 links on the main page
2024-08-26 12:24:20,369 - DEBUG - Processing link: https://trumio.ai
2024-08-26 12:24:20,370 - DEBUG - Starting new HTTPS connection (1): trumio.ai:443
2024-08-26 12:24:20,890 - DEBUG - https://trumio.ai:443 "GET / HTTP/11" 200 None
2024-08-26 12:24:21,231 - DEBUG - Processing link: https://trumio.ai/
2024-08-26 12:24:21,232 - DEBUG - Starting new HTTPS connection (1): trumio.ai:443
2024-08-26 12:24:21,435 - DEBUG - https://trumio.ai:443 "GET / HTTP/11" 200 None
2024-08-26 12:24:21,486 - DEBUG - Processing link: https://trumio.ai/clients/
2024-08-26 12:24:21,487 - DEBUG - Starting new HTTPS connection (1): trumio.ai:443
2024-08-26 12:24:23,054 - DEBUG - https://trumio.ai:443 "GET /clients/ HTTP/11" 200 None
2024-08-26 12:24:23,172 - DEBUG - Processing link: https://trumio.ai/experts/
2024-08-26 12:24:23,173 - DEBUG - Starting new HTTPS conn

Website scraping completed...✅ ✅
Generating Questions... Please wait...🕒🕒🕒
Output Received✅✅
[
    {
        "questions": [
            "What problems does Trumio solve for businesses?",
            "How does Trumio connect students and experts with clients?",
            "What is Trumio's AI-powered marketplace for?",
            "How does Trumio facilitate collaboration between clients and university teams?",
            "How does Trumio connect clients with university talent?",
            "What are the key benefits of using Trumio for project development?",
            "What services does Trumio offer to experts?",
            "How does Trumio facilitate collaboration between experts and students?",
            "What are the benefits of using Trumio for students?",
            "How does Trumio facilitate project collaboration and learning?"
        ],
        "relevant_links": [
            {
                "link": "https://trumio.ai",
                "title": "AI-Powered Marketpl

## Explanation

- **`process_website(url)`**: This function handles the end-to-end process of scraping a website, generating questions from the scraped content, and saving the results in JSON format.
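
The overview lists a data validation step, but its code is not shown in this notebook. Below is one possible sketch of such a check against the output structure produced above. The function `validate_output` and the specific rules it applies are assumptions, not the team's actual validation.

```python
def validate_output(data, max_len=70):
    """Return a list of human-readable problems; an empty list means the data passed."""
    if not isinstance(data, list) or not data:
        return ["top level must be a non-empty list"]
    problems = []
    for i, entry in enumerate(data):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected an object")
            continue
        # Every entry needs the three keys written by process_website
        for key in ("url", "questions", "relevant_links"):
            if key not in entry:
                problems.append(f"entry {i}: missing key '{key}'")
        # Questions must be strings under the length limit from the prompt
        for question in entry.get("questions", []):
            if not isinstance(question, str) or len(question) >= max_len:
                problems.append(f"entry {i}: invalid question {question!r}")
        # Links must carry both a URL and a title
        for link in entry.get("relevant_links", []):
            if not isinstance(link, dict) or not {"link", "title"} <= set(link):
                problems.append(f"entry {i}: invalid link {link!r}")
    return problems


sample = [
    {
        "url": "https://example.com",
        "questions": ["What is X?"],
        "relevant_links": [{"link": "https://example.com", "title": "Example"}],
    }
]
print(validate_output(sample))
```

In practice the data would be loaded from `outputs/output.json` with `json.load` before being passed to the function.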

## Start Instructions

### How to Start Locally

- Create your virtual environment.
- Run the following commands in the terminal:
```
python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
python que_generation/main.py
```

### How to Access Through Docker 

__To run the server code in Docker, follow these steps:__

- Open `que_generation/main.py`.
- Uncomment the Docker code written further down in that file and comment out the input line.
- Then run the following commands:
 
```
docker build -t your_image_name .

docker run -e API_KEY=your_api_key_here -e WEBSITE_URL="https://example.com" your_image_name
```
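
The Docker code in `que_generation/main.py` is not shown in this notebook. As a rough sketch, reading the two variables passed with `-e` could look like the following. The helper `read_settings` is hypothetical; only the variable names `API_KEY` and `WEBSITE_URL` come from the command above.

```python
import os


def read_settings(env=None):
    """Return (api_key, website_url) from the environment, failing early if one is missing."""
    env = os.environ if env is None else env
    api_key = env.get("API_KEY")
    website_url = env.get("WEBSITE_URL")
    missing = [
        name
        for name, value in (("API_KEY", api_key), ("WEBSITE_URL", website_url))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return api_key, website_url


# Example with an explicit mapping instead of the real process environment
api_key, website_url = read_settings(
    {"API_KEY": "your_api_key_here", "WEBSITE_URL": "https://example.com"}
)
print(website_url)
```

Failing early with a clear message avoids a container that starts, scrapes the site, and only then stops because the API key is absent.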




