<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_022_useragent_reuquests_PDF_summarzer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### 1. **What is a User-Agent?**

A **User-Agent** is a string that your browser or application (like Python's `requests` library) sends to a web server to identify itself. It typically includes information like the browser type, operating system, and version. Servers can use this information to tailor responses (like sending different content to mobile vs. desktop browsers) or to block certain kinds of clients (e.g., bots).

For example, when you visit a website with a browser like Chrome, your browser sends a User-Agent string like this:

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36
```

In the case of Python's `requests` library, it uses a default User-Agent like this:

```
python-requests/2.26.0
```

Some websites might block requests that come from the default Python User-Agent, assuming it’s from an automated script. To bypass that, you can "spoof" your User-Agent to make it appear like a regular browser.

**Example of setting a custom User-Agent in `requests`:**

```python
import requests

url = 'https://example.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}
response = requests.get(url, headers=headers)
```
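
To confirm which User-Agent a request would carry without contacting any server, you can build the request and inspect it before sending. This is a minimal sketch: `example.com` is a placeholder, and the browser string is the same illustrative value used above.

```python
import requests

url = "https://example.com"  # placeholder URL, never contacted here
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
)

# The default User-Agent that requests would send, of the form "python-requests/<version>"
default_ua = requests.utils.default_user_agent()
print("Default:", default_ua)

# Build the request without sending it, then inspect its headers
prepared = requests.Request("GET", url, headers={"User-Agent": browser_ua}).prepare()
print("Custom: ", prepared.headers["User-Agent"])
```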

### 2. **What are `requests` and `response` in Python?**

#### a) `requests` Library
The `requests` library is used to send HTTP requests in Python. It allows you to interact with websites, APIs, or other online services by sending various types of requests (such as `GET`, `POST`, and `PUT`) and receiving responses.

- **`requests.get()`**: This sends a GET request to the server to fetch the resource (like a web page or file).
- **`requests.post()`**: This sends a POST request, usually to submit data (like form data) to a server.

#### b) `response` Object
When you send a request (e.g., `requests.get()`), the server responds, and `requests` returns a **response object**. This response object contains various details about the server's response, such as:

- **`response.status_code`**: This is the HTTP status code that tells you whether the request was successful. Common status codes are:
  - `200`: Success (The request was successful).
  - `404`: Not Found (The requested resource couldn't be found on the server).
  - `500`: Internal Server Error (Something went wrong on the server).

- **`response.content`**: This is the raw content of the response (often binary, like when downloading a file).

- **`response.text`**: This is the content of the response in text format (usually HTML or JSON).

- **`response.json()`**: If the response is in JSON format, you can use this method to parse it into a Python dictionary or list.

### Example: Making a Request and Getting a Response
```python
import requests

# Send a GET request to the server
url = 'https://api.example.com/data'
response = requests.get(url)

# Access the status code
print(response.status_code)  # 200 means success

# Access the content of the response
print(response.text)  # Print the text (HTML, JSON, etc.) returned by the server

# If it's a JSON response, convert it to a dictionary
data = response.json()
print(data)
```

### 3. **What is `response.status_code == 200`?**

When you check `response.status_code == 200`, you are confirming whether the request was **successful**. The status code `200` is part of the HTTP standard and indicates that the request was successfully received, understood, and processed by the server.

- **200 OK**: This status code means the server successfully processed the request and sent back the expected content.
  
You often check for `200` to ensure that the request was successful before proceeding to work with the response content:

```python
if response.status_code == 200:
    print("Request was successful!")
else:
    print(f"Failed with status code: {response.status_code}")
```
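
`200` is the most common success code, but any code in the 2xx range signals success (for example `201 Created` or `204 No Content`). The sketch below uses the standard library's `http.HTTPStatus` to turn a numeric code into a readable label. `describe_status` is a hypothetical helper name chosen for illustration; it is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a label such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code not defined in the standard library's table
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational or non-standard"
    return f"{code} {phrase} ({kind})"

for code in (200, 204, 404, 500):
    print(describe_status(code))
```

With `requests` itself, `response.ok` is `True` for any status code below 400, and `response.raise_for_status()` raises `requests.exceptions.HTTPError` for 4xx and 5xx responses.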

### Example Use Case:
Let's say you're trying to download a PDF file from a website. You can send a `GET` request and check if the server returns a `200` status code before saving the file.

```python
import requests

url = 'https://example.com/somefile.pdf'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Save the file locally
    with open('downloaded_file.pdf', 'wb') as f:
        f.write(response.content)
    print("File downloaded successfully!")
else:
    print(f"Failed to download file. Status code: {response.status_code}")
```

### Summary:
- **User-Agent**: A string identifying your client (like a browser or script) to the server.
- **`requests`**: A Python library used to send HTTP requests to a web server.
- **`response`**: The server's response to your request, containing the content, status code, headers, etc.
- **`response.status_code == 200`**: A check to confirm the request was successful, as `200` is the standard status code for success.



Understanding the response object is essential for working with APIs! The `requests` library in Python makes it easy to interact with APIs, and the response object it returns is central to handling the data you get back from an API.

### What is the Response Object?

When you make a request using `requests.get()` or `requests.post()`, Python sends an HTTP request to the API’s endpoint. The response that comes back from the server contains several elements:

- **Status Code** (`response.status_code`): Indicates whether the request was successful (like 200 for OK or 404 for Not Found).
- **Headers** (`response.headers`): Contains metadata about the response, such as content type, date, and server information.
- **Content** (`response.content`): The raw response data in bytes, which is useful for non-text responses like images or PDFs.
- **Text** (`response.text`): The response data decoded as a string. Useful for reading HTML or plain text responses.
- **JSON** (`response.json()`): If the server returns JSON data (a structured format commonly used in APIs), this method will parse it into a Python dictionary or list.

### Loading the Response as JSON

APIs often return data in JSON format, so it’s common to load the response as JSON using `response.json()`. This only works if the response is actually JSON. Attempting to call `response.json()` on non-JSON data (like a PDF) will cause an error.

### Extracting Elements from the Response

Once you’ve verified that the response contains JSON data, you can access elements of the parsed JSON as you would with any dictionary or list in Python.

Here’s a basic example to illustrate:

```python
import requests

# Example of making an API request that returns JSON data
url = 'https://jsonplaceholder.typicode.com/todos/1'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Load response as JSON
    data = response.json()
    
    # Print the JSON data
    print("Full JSON data:", data)
    
    # Access individual items in the JSON (assuming it's a dictionary)
    print("User ID:", data['userId'])
    print("ID:", data['id'])
    print("Title:", data['title'])
    print("Completed:", data['completed'])
else:
    print("Failed to retrieve data")
```

### Explanation of the Code:

1. **Make the Request**: `response = requests.get(url)`
2. **Check the Status Code**: `if response.status_code == 200`
   - A status code of `200` means the request was successful.
3. **Load JSON Data**: `data = response.json()`
   - This converts the JSON data into a Python dictionary (or list).
4. **Extract Values**: `data['userId']`, `data['title']`, etc.
   - Access values in the dictionary using keys.

In this example, the JSON response is something like:
```json
{
  "userId": 1,
  "id": 1,
  "title": "delectus aut autem",
  "completed": false
}
```

### Key Points to Remember:

- **Use `response.json()` only if the response is JSON**. If it’s not, you’ll get a `JSONDecodeError`, which is a subclass of `ValueError`, so `except ValueError` also catches it.
- **Use `response.content`** for non-text responses (like files) that you want to save directly.
- **Check the `Content-Type` header** (`response.headers['Content-Type']`) to understand the type of data in the response. JSON responses typically have `Content-Type: application/json`.
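
The `Content-Type` value often carries extra parameters, such as `application/json; charset=utf-8`, so an exact string comparison can miss valid JSON responses. The following sketch shows one way to check it; `is_json_content_type` is a hypothetical helper, and treating `+json` suffixes as JSON is an assumption that suits most APIs.

```python
def is_json_content_type(content_type):
    """Return True for values like 'application/json; charset=utf-8'."""
    if not content_type:
        return False
    # Drop parameters after ';' and normalise case and whitespace
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")

print(is_json_content_type("application/json; charset=utf-8"))
print(is_json_content_type("application/pdf"))
```

In practice you would pass it `response.headers.get('Content-Type')` and call `response.json()` only when it returns `True`.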



In [27]:
# To pick out several headers at once:
# {name: response.headers.get(name) for name in ('Date', 'Content-Type', 'access-control-allow-methods', 'access-control-allow-origin')}

import requests

# Step 1: Download the file
url = 'https://core.ac.uk/download/pdf/35115719.pdf'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Print out different parts of the response object
    print("Status Code:", response.status_code)  # Status code of the request
    print("Headers:", response.headers)  # Headers with metadata about the response
    print("Content-Type:", response.headers.get('Content-Type'))  # Specific header field for content type
    print("Date:", response.headers.get('Date'))  # Specific header field for date
    print("Content (first 500 bytes):", response.content[:500])  # First 500 bytes of content for a glimpse

    # Try to parse as JSON to see if it’s JSON data (it won't be in this case)
    try:
        data = response.json()
        print("JSON data:", data)
    except ValueError:
        print("The response is not in JSON format.")
else:
    print("Failed to download the file")


Status Code: 200
Headers: {'Date': 'Thu, 07 Nov 2024 13:40:38 GMT', 'Content-Type': 'application/pdf', 'Content-Length': '561867', 'Connection': 'keep-alive', 'x-frame-options': 'SAMEORIGIN, SAMEORIGIN, SAMEORIGIN', 'access-control-allow-methods': 'POST, GET, OPTIONS', 'access-control-allow-origin': '*', 'access-control-allow-headers': 'Content-Type, Access-Control-Allow-Headers, Authorization, X-Frame-Options', 'Cache-Control': 'public', 'link': '<https://core.ac.uk/reader/35115719>; rel="canonical"', 'last-modified': 'Mon, 15 Mar 2021 19:24:30 GMT', 'etag': '"892cb-5bd9832e37780"', 'vary': 'User-Agent, Accept-Encoding', 'CF-Cache-Status': 'BYPASS', 'Set-Cookie': '_ga=GA1.3.c3a37971-bb78-45fe-abec-520004f1123e; expires=Sat, 07-Nov-2026 13:40:38 GMT; Max-Age=63072000; path=/; domain=core.ac.uk', 'Accept-Ranges': 'bytes', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=bupxntrASWPNs%2BmFOdUJzunEc4zmINNOp%2FON7Ee6qVmmHW1zL6dtD5kPMkGjrnDQJPSgWMD%2BKKuM

### Access Response Content

Let’s break down why we use different methods to access parts of the `response` object.

### 1. Accessing `response.headers` with `.get()`

`response.headers` is a dictionary-like object in which each header name is a key and the header's content is its value. Lookups are case-insensitive, so `'Date'` and `'date'` refer to the same header. Calling `.get()` with a header name such as `"Date"` retrieves the value safely:

```python
response.headers.get('Date')
```

This is useful because:
- **If the header exists**: `.get()` returns its value (like `'Thu, 07 Nov 2024 12:34:56 GMT'`).
- **If the header does not exist**: `.get()` returns `None` instead of raising an error, allowing your code to avoid breaking if the header is missing.

#### Example:
Using `.get()` ensures that if `"Date"` isn’t in `response.headers`, it won’t throw a `KeyError`. Instead, it will simply return `None`, which you can handle gracefully.
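
The behaviour can be tried without making a request, because `response.headers` is an instance of `requests.structures.CaseInsensitiveDict`. In this small sketch the header values are made-up stand-ins rather than a real server response.

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for response.headers, which uses this same class
headers = CaseInsensitiveDict({"Content-Type": "application/pdf"})

print(headers.get("content-type"))     # lookup ignores the case of the name
print(headers.get("Date"))             # missing header: returns None
print(headers.get("Date", "unknown"))  # missing header with a fallback value
```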

### 2. Accessing `response.content` with Indexing (`[:500]`)

`response.content` is a **byte string** containing the raw content of the response (such as a PDF, image, or HTML document). Unlike `response.headers`, it’s not a dictionary but a single, long string of binary data. Here, we use **slicing** to print just a small part of it:

```python
response.content[:500]
```

This takes the **first 500 bytes** of the content and shows them, giving you a preview without overwhelming the console with potentially large data.

#### Why Indexing?

Since `response.content` is a byte string, you can treat it like a sequence of bytes (similar to a list of characters in a regular string). Slicing is a common way to limit what we print or display, especially when the content is large (e.g., files, images).
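
One detail worth knowing: slicing a `bytes` object gives back `bytes`, while indexing a single position gives an integer. The sketch below uses a short hand-written byte string as a stand-in for `response.content`.

```python
# Stand-in for response.content: the opening bytes of a PDF-like payload
content = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"

preview = content[:8]  # slicing returns another bytes object
first = content[0]     # indexing one position returns an int (the byte value)

print(preview)
print(first)
print(type(preview).__name__, type(first).__name__)
```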

### Summary

- **`response.headers.get('Date')`**: Safe access for dictionary items; avoids errors if the key is missing.
- **`response.content[:500]`**: Slice the first 500 bytes of a byte string for a preview without printing everything.



## Read PDF from URL

1. **Download the PDF File**:
   ```python
   response = requests.get(url)
   ```
   This line sends an HTTP GET request to the specified URL, attempting to retrieve the file. Since this URL points to a PDF, `response.content` contains the raw binary data of the PDF file.

2. **Check the Response Status**:
   ```python
   if response.status_code == 200:
   ```
   We’re confirming that the request was successful (status code `200`). If it wasn’t, we print an error message instead of proceeding, ensuring we only save the file if it was downloaded correctly.

3. **Save the PDF to a Local File**:
   ```python
   with open('downloaded.pdf', 'wb') as f:
       f.write(response.content)
   ```
   This step saves the binary data to a file:
   - **`'downloaded.pdf'`**: Specifies the file’s name.
   - **`'wb'` (write binary)**: Opens the file in binary mode for writing, which is required for non-text files like PDFs.
   - **`f.write(response.content)`**: Writes the entire byte string of `response.content` to the file, storing the PDF locally.

### How This Differs from the Previous Code

- **No JSON Parsing Attempt**: This code doesn’t attempt to interpret the response as JSON because we know we’re handling a binary file.
- **Directly Saving the File**: Since the goal is to download the file, we immediately save `response.content` to disk, making the code more efficient and focused on the task of saving.
- **No Preview of Data**: We skip previewing the content because, with binary data, there’s no need to inspect it (it would appear as unreadable bytes).

### Why This Approach is Effective for File Downloads

When working with downloadable files (like PDFs, images, or zip files):
- **Binary Mode**: Ensures the file’s integrity by preserving its raw byte structure.
- **No JSON Parsing or String Manipulation**: Simplifies the code by treating the response as a file download rather than data for immediate inspection or manipulation.
  


In [1]:
!pip install pdfx



In [28]:
import requests
import pdfx

# Step 1: Download the PDF file
url = 'https://core.ac.uk/download/pdf/35115719.pdf'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Save the PDF to a local file
    with open('downloaded.pdf', 'wb') as f:
        f.write(response.content)
else:
    print("Failed to download the file")


### Process PDF

This code processes a saved PDF file using the `pdfx` library. Let’s break down what it does and suggest some improvements:

### Code Explanation

1. **Load the PDF File with `pdfx`**:
   ```python
   pdf = pdfx.PDFx('downloaded.pdf')
   ```
   This line creates a `PDFx` object by loading the saved PDF file (`'downloaded.pdf'`). The `pdfx` library is designed for extracting metadata, text, and references (e.g., URLs) from PDF documents.

2. **Extract Metadata**:
   ```python
   metadata = pdf.get_metadata()
   print(metadata)
   ```
   The `get_metadata()` method retrieves metadata from the PDF, which may include:
   - Title, author, and subject
   - Creation date, modification date
   - Producer (software that generated the PDF)
   - Other embedded information specific to the PDF

   Printing `metadata` shows this information in a dictionary format, making it accessible for further analysis or saving.
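
Date fields such as `CreationDate` usually come back as raw PDF date strings of the form `D:YYYYMMDDHHmmSS+HH'mm'` rather than as Python dates. The sketch below converts that common form into a `datetime`. `parse_pdf_date` is a hypothetical helper, and it deliberately ignores the shortened variants the PDF format also permits.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the full form, with an optional timezone part such as +02'00' or Z
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?"
)

def parse_pdf_date(value):
    """Return a datetime for a PDF date string, or None if it does not match."""
    match = _PDF_DATE.fullmatch(value)
    if match is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign, off_hours, off_minutes = match.group(7), match.group(8), match.group(9)
    if sign is None:
        tz = None  # no timezone given: return a naive datetime
    elif sign == "Z":
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(off_hours or 0), minutes=int(off_minutes or 0))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)

print(parse_pdf_date("D:20110502201921+02'00'"))
```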


### Summary Code

- **File Existence Check**: Ensures `'downloaded.pdf'` is present before processing.
- **Error Handling**: Uses a try-except block to catch any issues with loading or processing the PDF.
- **Optional Saving of Metadata**: Saves metadata to a JSON file if you want to keep a record.
- **Text Extraction Preview**: Provides a short preview of the PDF’s content for quick inspection.



In [29]:
import os
import json
import pdfx

# Check if the file exists
if os.path.exists('downloaded.pdf'):
    try:
        # Load the PDF file
        pdf = pdfx.PDFx('downloaded.pdf')

        # Extract and print metadata
        metadata = pdf.get_metadata()
        print("Metadata:", metadata)

        # Optionally save metadata to a JSON file
        with open('metadata.json', 'w') as f:
            json.dump(metadata, f, indent=4)
        print("Metadata saved to 'metadata.json'")

        # Extract and print a preview of the PDF text
        text = pdf.get_text()
        print("Extracted Text (Preview):", text[:500])  # First 500 characters

    except Exception as e:
        print("An error occurred while processing the PDF:", e)
else:
    print("File 'downloaded.pdf' does not exist.")


Metadata: {'CreationDate': "D:20110502201921+02'00'", 'Creator': 'Adobe InDesign CS5 (7.0.3)', 'ModDate': "D:20111128144517+01'00'", 'Producer': 'Adobe PDF Library 9.9', 'rdf': {}, 'xap': {'ModifyDate': '2011-11-28T14:45:17+01:00', 'CreateDate': '2011-05-02T20:19:21+02:00', 'MetadataDate': '2011-11-28T14:45:17+01:00', 'CreatorTool': 'Adobe InDesign CS5 (7.0.3)'}, 'dc': {'format': 'application/pdf'}, 'xapmm': {'DocumentID': 'uuid:c8b0c497-6ec1-4539-836e-754e73d73881', 'InstanceID': 'uuid:c937332f-d96b-49f0-9fe3-23c4c93d5bef'}, 'pdf': {'Producer': 'Adobe PDF Library 9.9'}, 'Pages': 5}
Metadata saved to 'metadata.json'
Extracted Text (Preview): CORE

Provided by Open Marine Archive

Metadata, citation and similar papers at core.ac.uk

Journal of Coastal Research 
Journal of Coastal Research

SI 64 
SI 64

pg - pg 
349 - 353

ICS2011 (Proceedings) 
ICS2011 (Proceedings)

Poland 
Poland

ISSN 0749-0208 
ISSN 0749-0208

 

Characterisation  of  mangrove  forest  types  in  view  of  conserva

## PDF from Google Drive
The steps for reading a file from a **URL** and from **Google Drive** differ because of how each service handles file hosting and access:

### 1. **Reading from a URL** (Standard Web Server)
When you download a file from a typical URL, you're accessing a file hosted on a web server that directly serves files in response to HTTP requests. Here's how it works:

- **Standard Web Server**: A web server like `example.com` or `core.ac.uk` serves files when requested through standard HTTP or HTTPS protocols.
- **Direct File Links**: If the link points directly to a file (e.g., a `.pdf` or `.csv` file), you can use Python's `requests` library to directly fetch the content of the file and save it or process it.

**Example**:

```python
import requests

# URL points directly to a file
url = 'https://example.com/somefile.pdf'

# Make the GET request to download the file
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    with open('file.pdf', 'wb') as f:
        f.write(response.content)
else:
    print(f"Failed to download the file: {response.status_code}")
```

In this case, you're making a direct request to the server for a file, and the server responds by sending the file contents. There are no extra steps or layers involved.

### 2. **Reading from Google Drive**
Google Drive works a bit differently. When you share a file through Google Drive, the URL typically does **not** point directly to the file but instead to a **web page** where the file is displayed (such as a view or download page).

Here's what happens with a Google Drive link:

- **Google Drive Link (view page)**: When you access a file through a standard Google Drive link, you're not taken directly to the file itself. Instead, Google presents a web page with an embedded viewer for the file (like the one you get when clicking "View" on a shared Google Drive link).
- **Direct Download Links**: For publicly shared files, a commonly used URL pattern lets you bypass the "View" page and download the file directly. This requires modifying the URL.

#### How to Convert Google Drive URLs:
When working with Google Drive links, the standard format is:

```
https://drive.google.com/file/d/{file_id}/view
```

To directly download the file, you need to convert this to:

```
https://drive.google.com/uc?export=download&id={file_id}
```

- The `{file_id}` is the unique identifier for your file, which you extract from the standard Google Drive link.
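
The conversion can be automated instead of copying the ID by hand. This is a minimal sketch, assuming the share link follows the `/file/d/{file_id}/` pattern shown above; `drive_download_url` is a hypothetical helper name.

```python
import re

def drive_download_url(share_url):
    """Build the direct-download URL from a Google Drive share link."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"No file ID found in: {share_url}")
    file_id = match.group(1)
    return f"https://drive.google.com/uc?export=download&id={file_id}"

share_url = "https://drive.google.com/file/d/1ILsIxKRBmqPxGW6Py3pRPLiBWhWBEU3g/view"
print(drive_download_url(share_url))
```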

**Example of Downloading a File from Google Drive**:

```python
import requests

# Original Google Drive URL (view page)
url = 'https://drive.google.com/file/d/1ILsIxKRBmqPxGW6Py3pRPLiBWhWBEU3g/view'

# Convert to direct download URL
file_id = '1ILsIxKRBmqPxGW6Py3pRPLiBWhWBEU3g'
download_url = f'https://drive.google.com/uc?export=download&id={file_id}'

# Download the file
response = requests.get(download_url)

if response.status_code == 200:
    with open('downloaded_file.pdf', 'wb') as f:
        f.write(response.content)
    print("File downloaded successfully!")
else:
    print(f"Failed to download file. Status code: {response.status_code}")
```
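
A `200` status only says that the server returned something; it does not guarantee that the body is the PDF. For large or restricted files, Google Drive may answer this URL with an HTML page (for example a confirmation or sign-in page) instead of the file. One inexpensive safeguard is to check for the `%PDF-` signature that PDF files begin with. `looks_like_pdf` is a hypothetical helper, and the byte strings below are stand-ins for `response.content`.

```python
def looks_like_pdf(data):
    """Return True if the bytes start with the PDF signature '%PDF-'."""
    return data.startswith(b"%PDF-")

pdf_bytes = b"%PDF-1.4\n1 0 obj\n"                     # stand-in for a real PDF body
html_bytes = b"<!DOCTYPE html><html><body>...</body>"  # stand-in for an HTML page

print(looks_like_pdf(pdf_bytes))
print(looks_like_pdf(html_bytes))
```

Calling `looks_like_pdf(response.content)` before writing the file avoids saving an HTML page under a `.pdf` name.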

### Why the Difference?

- **Standard URL**: When you download a file from a typical web server (like `example.com`), you're usually dealing with a direct link to the file. No extra steps are needed, and you can download the file directly with a `GET` request.
  
- **Google Drive**: Google Drive serves files differently. Google Drive links typically point to a view or preview page, not the file itself. You need to either modify the URL to create a direct download link or download the file manually from the interface.

### Summary of the Differences:
1. **File Hosting**:
   - **URL**: Direct access to the file.
   - **Google Drive**: Requires viewing through Google’s interface, or a modified link for direct download.

2. **URL Format**:
   - **URL**: Points directly to a downloadable file (e.g., `example.com/file.pdf`).
   - **Google Drive**: Needs a special download link (`uc?export=download`).

3. **Additional Step**:
   - **Google Drive**: Requires extracting the file ID and converting the URL for direct download.

### Use Case Example:

- **Direct URL**:
  - You can use a simple `requests.get()` to fetch the file.

- **Google Drive**:
  - You need to modify the URL to the direct download format or use an API (e.g., Google Drive API) for more complex operations.


### PDF from Google Drive

In [32]:
import requests
import pdfx

# Step 1: Download the file from Google Drive
url = 'https://drive.google.com/uc?export=download&id=1ILsIxKRBmqPxGW6Py3pRPLiBWhWBEU3g'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Print headers and other response info
    print("Status Code:", response.status_code)
    print("Headers:", response.headers)  # Print all headers
    print("Content-Type:", response.headers.get('Content-Type'))  # Specific header field
    print("Date:", response.headers.get('Date'))  # Specific header field

    # Step 2: Save the PDF to a local file
    with open('google_drive_downloaded.pdf', 'wb') as f:
        f.write(response.content)
    print("File downloaded and saved as 'google_drive_downloaded.pdf'")
else:
    print("Failed to download the file")

# Step 3: Process the saved PDF file
try:
    pdf = pdfx.PDFx('google_drive_downloaded.pdf')

    # Extract and print a slice of the text
    text = pdf.get_text()
    print("Extracted Text (Preview):", text[:500])  # Print the first 500 characters of the text

    # Extract and print metadata for reference
    metadata = pdf.get_metadata()
    print("Metadata:", metadata)

except Exception as e:
    print("An error occurred while processing the PDF:", e)


Status Code: 200
Headers: {'Content-Type': 'application/octet-stream', 'Content-Security-Policy': "sandbox, default-src 'none', frame-ancestors 'none'", 'X-Content-Security-Policy': 'sandbox', 'Cross-Origin-Opener-Policy': 'same-origin', 'Cross-Origin-Embedder-Policy': 'require-corp', 'Cross-Origin-Resource-Policy': 'same-site', 'X-Content-Type-Options': 'nosniff', 'Content-Disposition': 'attachment; filename="Art Collector personality types_ finish.pdf"', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'false', 'Access-Control-Allow-Headers': 'Accept, Accept-Language, Authorization, Cache-Control, Content-Disposition, Content-Encoding, Content-Language, Content-Length, Content-MD5, Content-Range, Content-Type, Date, developer-token, financial-institution-id, X-Goog-Sn-Metadata, X-Goog-Sn-PatientId, GData-Version, google-cloud-resource-prefix, linked-customer-id, login-customer-id, x-goog-request-params, Host, If-Match, If-Modified-Since, If-None-Match, If-Unmod
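
The `Content-Disposition` header in the output above carries the file's original name, which can be more useful than a hard-coded one. The sketch below pulls it out with the standard library's `email.message.Message`, which parses this style of header. `filename_from_content_disposition` is a hypothetical helper name.

```python
from email.message import Message

def filename_from_content_disposition(value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not value:
        return None
    message = Message()
    message["Content-Disposition"] = value
    return message.get_filename()

header = 'attachment; filename="Art Collector personality types_ finish.pdf"'
print(filename_from_content_disposition(header))
```

With a live response you would pass `response.headers.get('Content-Disposition')`.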

#### What is a Binary File?

When we say that a **PDF** is a **binary file**, we are referring to how the data in the file is encoded, not to the kind of content it contains (text, images, etc.). Every file is ultimately a sequence of bytes; a file is called binary when those bytes are not meant to be read directly as plain text. A binary file can hold a wide range of data, such as encoded text, images, metadata, and instructions for displaying the content in a particular layout. PDFs are considered **binary** because they contain much more than just plain text.

Here’s a breakdown:

### 1. **Binary vs. Text Files**
- **Text files**: These contain plain, human-readable text and are stored as a sequence of characters (letters, numbers, symbols). Examples are `.txt` files or simple `.csv` files. They can be opened and read with a standard text editor (like Notepad or a code editor), and the content makes sense without any special decoding.
  
- **Binary files**: These are stored as raw byte data and may contain encoded text, images, font data, metadata, formatting instructions, and even compressed or encrypted data. They often require special software to interpret and render their content correctly. Examples include images (`.jpg`, `.png`), executable files (`.exe`), and complex document formats like PDFs (`.pdf`).

### 2. **Why PDF is a Binary File**
A PDF file is not just plain text; it includes a combination of:
- **Text content**: The actual words in the document, often compressed or encoded.
- **Images**: PDFs often contain embedded images, which are binary data.
- **Fonts**: PDFs include font information so that the document can be displayed consistently across devices.
- **Formatting**: Instructions for how to layout the text, images, and other elements on the page.
- **Metadata**: Information about the document itself, like the author, title, creation date, etc.
- **Interactive elements**: Things like forms, buttons, and links can be embedded within a PDF.

All of this data is stored in binary format so that specialized PDF readers (such as Adobe Acrobat or your browser’s built-in PDF viewer) can interpret and render the content appropriately. If you were to open a PDF file in a plain text editor, you would mostly see unreadable characters, because it includes encoded and compressed data that only specialized software can interpret.

### 3. **Why It’s Important to Handle PDFs as Binary Files**
When you download or save a PDF (or any other binary file), it’s essential to use **binary mode** (`'wb'` for writing, `'rb'` for reading) to ensure that Python treats the file’s contents as raw bytes. In text mode, Python treats the contents as characters instead: writing a `bytes` object to a file opened with `'w'` raises a `TypeError`, and reading with `'r'` tries to decode the bytes as text, which can fail or alter the data (for example through newline translation).

#### Example:
- If you open a PDF in a text editor, it might look like this:
  ```
  %PDF-1.4
  %âãÏÓ
  1 0 obj
  <<
  /Type /Catalog
  /Pages 2 0 R
  >>
  endobj
  ...
  ```
  This is a mixture of human-readable parts (`/Type /Catalog`, etc.) and non-readable byte sequences (`%âãÏÓ`). The unreadable parts are binary data encoded in a way that is not meant for plain text interpretation.

### 4. **PDF Structure**
A PDF has a well-defined structure that includes:
- **Header**: Contains the PDF version number.
- **Body**: Contains objects such as text streams, images, and font descriptions.
- **Cross-reference table**: Points to the locations of the objects within the file.
- **Trailer**: Helps the PDF reader locate the cross-reference table.

This complex structure is what allows PDFs to include rich content like images, vector graphics, hyperlinks, and embedded fonts.
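
Some of these parts can be spotted directly in the raw bytes: the header starts with `%PDF-` followed by the version, and the end of the file normally contains `startxref` and the `%%EOF` marker. This is a rough sketch, not a validator; `inspect_pdf_bytes` is a hypothetical helper, and the sample bytes are hand-written rather than a valid PDF.

```python
import re

def inspect_pdf_bytes(data):
    """Report the header version and whether the usual trailer markers appear."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    tail = data[-1024:]  # trailer markers sit near the end of the file
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "has_startxref": b"startxref" in tail,
        "has_eof_marker": b"%%EOF" in tail,
    }

# Hand-written stand-in; real input would be response.content
# or the bytes read from a saved file opened with 'rb'.
sample = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\nstartxref\n0\n%%EOF\n"
print(inspect_pdf_bytes(sample))
```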

### Example: Why Binary Mode is Required
When you download and save a PDF using Python’s `requests` library, it is fetching the **binary content** of the file from the web server. If you opened the output file in **text mode**, `f.write(response.content)` would raise a `TypeError`, because text-mode files accept `str` rather than `bytes`, and decoding the bytes into text first risks altering them (e.g., through encoding errors or line-ending translation). Using `'wb'` ensures that Python writes the file exactly as it was downloaded, byte-for-byte, without modification.

```python
import requests

# Download the PDF file
url = 'https://example.com/somefile.pdf'
response = requests.get(url)

# Save it as a binary file (to prevent corruption)
with open('downloaded_file.pdf', 'wb') as f:
    f.write(response.content)
```
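
The difference between the two modes can be seen locally without downloading anything. In this sketch, `payload` is a short stand-in for `response.content`, and the file is written to a temporary directory.

```python
import os
import tempfile

payload = b"%PDF-1.4\n\xe2\xe3\xcf\xd3 raw bytes"  # stand-in for response.content
path = os.path.join(tempfile.mkdtemp(), "sample.pdf")

# Text mode rejects bytes outright
try:
    with open(path, "w") as f:
        f.write(payload)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc
print("Text mode error:", text_mode_error)

# Binary mode writes and reads back the identical bytes
with open(path, "wb") as f:
    f.write(payload)
with open(path, "rb") as f:
    round_trip = f.read()
print("Identical after round trip:", round_trip == payload)
```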

### Summary
- A **PDF is a binary file** because it contains not just text but also binary-encoded elements like images, fonts, and formatting information.
- When downloading or saving binary files (including PDFs), you need to use `'wb'` mode to handle the raw byte data correctly.
- PDFs require special software to render because they contain complex structures and embedded data that go beyond simple text.



## PDF Summarizer

### Install Libraries

In [37]:
# !pip install langchain
# !pip install openai
# !pip install tiktoken
# !pip install unstructured
# !pip install chromadb
# !pip install Cython
# !pip install pypdf
# !pip install gensim
# !pip install transformers

#### Dependency Conflicts

To reduce dependency conflicts, you can pin the libraries to specific versions at install time. The setup snippet below pins `transformers` and `protobuf` alongside the other libraries; treat the exact version numbers as a starting point and confirm that they resolve together in your environment.

```python
# Install necessary libraries with compatible versions
!pip install langchain==0.0.134
!pip install openai==0.10.2
!pip install tiktoken==0.2.0
!pip install unstructured==0.2.4
!pip install chromadb==0.3.23
!pip install Cython==0.29.32
!pip install pypdf==3.8.1
!pip install gensim==4.2.0
!pip install transformers==4.30.0  # Specify compatible version of transformers
!pip install protobuf==3.20.3       # Specify compatible version of protobuf
```

### Explanation

- **`transformers==4.30.0`**: Pinned here to pair with `protobuf==3.20.3`.
- **`protobuf==3.20.3`**: Pinned to the 3.20.x line to sidestep the compatibility errors that some libraries hit with newer 4.x releases.

### Alternative Solution: Using a `requirements.txt` File

If you’re working in an environment that supports `requirements.txt` (like a virtual environment or Docker container), you can create a `requirements.txt` file with specific versions of your libraries and install everything with one command. Here’s how you’d set up `requirements.txt`:

```
langchain==0.0.134
openai==0.10.2
tiktoken==0.2.0
unstructured==0.2.4
chromadb==0.3.23
Cython==0.29.32
pypdf==3.8.1
gensim==4.2.0
transformers==4.30.0
protobuf==3.20.3
```

Then, run:

```python
!pip install -r requirements.txt
```

By specifying compatible versions from the beginning, you reduce the risk of dependency conflicts when using the `transformers` library.
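
After installing, it can help to confirm which versions actually ended up in the environment. This is a small sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper, and the package names in the loop are just the ones discussed above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

for name in ("transformers", "protobuf", "pypdf"):
    print(f"{name}: {installed_version(name)}")
```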

In [2]:
# !pip install --upgrade protobuf
# !pip install --upgrade transformers


### Hugging Face Summarizer Model

In [3]:
# Import necessary libraries
from transformers import pipeline  # Hugging Face transformers
import PyPDF2  # For reading PDF files

# Verify the setup by loading the summarization pipeline
try:
    summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6')
    print("Summarization pipeline loaded successfully.")
except Exception as e:
    print("An error occurred while loading the summarization pipeline:", e)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Summarization pipeline loaded successfully.


### Model Configuration

The model configuration defines the architecture, settings, and behavior of the DistilBART model used for summarization. Here’s a brief summary of the main types of information it contains:

1. **Model Identity and Purpose**:
   - **Name and Path** (`_name_or_path`): Identifies the specific pretrained model.
   - **Model Type** (`model_type`): Specifies the underlying model architecture (e.g., "bart").
   - **Task Parameters** (`task_specific_params`): Default generation settings stored per task, such as the maximum summary length and repetition control used for summarization.

2. **Architecture Details**:
   - **Encoder and Decoder Layers** (`encoder_layers`, `decoder_layers`): Defines the number of layers in both the encoder and decoder parts of the model.
   - **Attention Heads** (`encoder_attention_heads`, `decoder_attention_heads`): Controls the number of attention heads, which help the model focus on different parts of the text.
   - **Feed-Forward Dimensions** (`encoder_ffn_dim`, `decoder_ffn_dim`): Defines the size of the intermediate layers in each encoder and decoder layer.

3. **Tokenization and Special Tokens**:
   - **Token IDs** (`bos_token_id`, `eos_token_id`, `pad_token_id`): Special token IDs that mark the beginning and end of a sequence and pad shorter sequences to a common length.
   - **Maximum Position Embeddings** (`max_position_embeddings`): Sets the maximum number of tokens in each input, limiting text length.

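4. **Regularization and Length Control**:
   - **Dropouts** (`dropout`, `attention_dropout`, etc.): Controls dropout rates, which randomly zero out activations during training to reduce overfitting and improve generalization.
   - **Length Penalty** (`length_penalty`): A generation setting rather than a regularizer. It is an exponent on sequence length used when scoring beam-search candidates, so higher values favor longer summaries.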

5. **Summarization-Specific Parameters**:
   - **Beams and Repetition Control** (`num_beams`, `no_repeat_ngram_size`): Configures the beam search and prevents repeated phrases.
   - **Min and Max Length** (`min_length`, `max_length`): Defines the minimum and maximum summary lengths.

In essence, these settings control how the model generates summaries by defining the model’s structure, managing text handling, and fine-tuning output quality. This combination allows for effective summarization tailored to common summary requirements.
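As a small worked example, a few of these fields can be grouped along the same lines. The dictionary below is copied by hand from the `BartConfig` output printed later in this notebook, and only includes fields visible there; in a live session, `summarizer.model.config.to_dict()` should provide the same keys. `describe_config` is a hypothetical helper, not part of `transformers`.

```python
# Values copied from the printed BartConfig for sshleifer/distilbart-cnn-12-6
config = {
    "_name_or_path": "sshleifer/distilbart-cnn-12-6",
    "d_model": 1024,
    "encoder_layers": 12,
    "decoder_layers": 6,
    "encoder_attention_heads": 16,
    "decoder_attention_heads": 16,
    "encoder_ffn_dim": 4096,
    "decoder_ffn_dim": 4096,
    "dropout": 0.1,
    "attention_dropout": 0.0,
    "bos_token_id": 0,
    "eos_token_id": 2,
}


def describe_config(cfg):
    """Group a few config fields into a readable summary."""
    return {
        "identity": cfg["_name_or_path"],
        "layers (encoder, decoder)": (cfg["encoder_layers"], cfg["decoder_layers"]),
        "attention heads (encoder, decoder)": (
            cfg["encoder_attention_heads"],
            cfg["decoder_attention_heads"],
        ),
        # Each head works on an equal slice of the hidden dimension
        "dimensions per head": cfg["d_model"] // cfg["encoder_attention_heads"],
        "feed-forward width": cfg["encoder_ffn_dim"],
        "dropout rates": (cfg["dropout"], cfg["attention_dropout"]),
        "special tokens (bos, eos)": (cfg["bos_token_id"], cfg["eos_token_id"]),
    }


for key, value in describe_config(config).items():
    print(f"{key}: {value}")
```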

### Neural Network with Multiple Layers

The DistilBART model you’re working with is a **neural network with multiple layers**, specifically an encoder-decoder architecture based on the Transformer model. Here’s a high-level breakdown:

1. **Encoder-Decoder Structure**:
   - Like the original Transformer and BART, DistilBART has both an **encoder** (to understand the input text) and a **decoder** (to generate the summarized text). Not every Transformer has both parts: some models are encoder-only or decoder-only.
   - The **encoder layers** process the input text, capturing its meaning in a series of internal representations.
   - The **decoder layers** then take those representations and generate a condensed, coherent summary.

2. **Layers and Attention**:
   - DistilBART has **stacked layers** in both the encoder and decoder, which are repeated processing units in the neural network. This checkpoint has 12 encoder layers and 6 decoder layers (the `12-6` in its name). Each layer refines the information it receives, allowing the model to capture complex relationships in the text.
   - Each layer includes **attention heads**, which help the model focus on relevant words and phrases within the input sequence, improving understanding of context and relationships.

3. **Parameter Configurations**:
   - The configuration (e.g., number of layers, attention heads, dropout rates) defines the neural network's depth, capacity, and regularization strategy. These settings shaped what the model could learn when it was trained on large text datasets for tasks like summarization.

In summary, DistilBART is a neural network specifically designed to process and generate natural language text through a series of structured, layered transformations. It’s a “lighter” (distilled) version of the original BART model but still retains strong performance, making it efficient for tasks like summarization while keeping computational demands lower.
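To make "depth and capacity" concrete, the sketch below estimates how many weights sit in the stacked layers, using the layer counts and dimensions from this checkpoint's configuration. It assumes the standard Transformer layer layout (attention projections and feed-forward layers with bias terms, plus layer norms) and leaves out the embeddings, so treat the result as a rough estimate rather than an exact parameter count. All function names are hypothetical.

```python
def attention_params(d_model):
    # Four projections (query, key, value, output), each with weights and a bias
    return 4 * (d_model * d_model + d_model)


def feed_forward_params(d_model, ffn_dim):
    # Two linear layers: d_model -> ffn_dim -> d_model, each with a bias
    return (d_model * ffn_dim + ffn_dim) + (ffn_dim * d_model + d_model)


def layer_norm_params(d_model):
    # One scale and one shift value per hidden dimension
    return 2 * d_model


def encoder_layer_params(d_model, ffn_dim):
    # Self-attention + feed-forward, each followed by a layer norm
    return (attention_params(d_model)
            + feed_forward_params(d_model, ffn_dim)
            + 2 * layer_norm_params(d_model))


def decoder_layer_params(d_model, ffn_dim):
    # Self-attention + cross-attention over the encoder output + feed-forward,
    # each followed by a layer norm
    return (2 * attention_params(d_model)
            + feed_forward_params(d_model, ffn_dim)
            + 3 * layer_norm_params(d_model))


# Values from the configuration of sshleifer/distilbart-cnn-12-6
d_model, ffn_dim = 1024, 4096
encoder_layers, decoder_layers = 12, 6

encoder_total = encoder_layers * encoder_layer_params(d_model, ffn_dim)
decoder_total = decoder_layers * decoder_layer_params(d_model, ffn_dim)
total_layer_params = encoder_total + decoder_total

print(f"Encoder layers: {encoder_total:,} parameters")
print(f"Decoder layers: {decoder_total:,} parameters")
print(f"All layers (embeddings excluded): {total_layer_params:,} parameters")
```

Each decoder layer is larger than an encoder layer because of the extra cross-attention block, yet the decoder stack is smaller overall because it has half as many layers. That is where the "distilled" savings in this checkpoint come from.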

In [5]:
# Print the summarizer object itself
print("Summarizer Object:\n", summarizer)

# Model configuration details
print("\nModel Configuration:\n", summarizer.model.config)

# Tokenizer details
print("\nTokenizer Details:\n", summarizer.tokenizer)

# Help on the summarizer function to see available parameters
print("\nSummarizer Parameters:\n")
help(summarizer)


Summarizer Object:
 <transformers.pipelines.text2text_generation.SummarizationPipeline object at 0x7d125fc656c0>

Model Configuration:
 BartConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "sshleifer/distilbart-cnn-12-6",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": true,
  "forced_bos_token_id": 0,
  "forced_eos

### Summarize PDF

1. **Combine Text Extraction and Chunking**:
   Instead of building `pdf_text` first and then chunking it, you can append each page's text directly into chunks as you go, which reduces memory usage for large PDFs. The cell below still builds the full `pdf_text` string first, which is fine for a small document.

2. **Chunk Size and Overlap**:
   Splitting text exactly at `chunk_size` (measured in characters) can cut sentences in half, which hurts summary quality. A small overlap between consecutive chunks helps retain context across the boundary.

3. **Batch Summarization**:
   Instead of calling `summarizer` once per chunk, you can pass the whole list of chunks in a single call. The Hugging Face `pipeline` accepts a list of inputs, which may be faster. The cell below keeps the simpler per-chunk loop.
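The three ideas above can be sketched as small helper functions. This is a minimal sketch: `chunk_text`, `iter_chunks`, and `summarize_chunks` are hypothetical names, `batch_size=4` is an arbitrary default, and `summarize_chunks` assumes the pipeline returns one `{"summary_text": ...}` dictionary per input when given a list of strings.

```python
def _step(chunk_size, overlap):
    """Distance between the starts of consecutive chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    return chunk_size - overlap


def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into character chunks; consecutive chunks share `overlap` characters."""
    step = _step(chunk_size, overlap)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def iter_chunks(pages, chunk_size=1000, overlap=100):
    """Yield the same chunks as chunk_text, but from an iterable of page texts,
    without first joining all pages into one string."""
    step = _step(chunk_size, overlap)
    buffer = ""
    for page_text in pages:
        buffer += page_text or ""  # tolerate pages with no extractable text
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[step:]  # keep the overlap for the next chunk
    while buffer:  # flush whatever is left after the last page
        yield buffer
        buffer = buffer[step:]


def summarize_chunks(summarizer, chunks, batch_size=4, **generate_kwargs):
    """Summarize all chunks in a single pipeline call and return the summary strings."""
    results = summarizer(list(chunks), batch_size=batch_size, **generate_kwargs)
    return [result["summary_text"] for result in results]


# Usage with the objects defined in this notebook (not run here):
# pages = (page.extract_text() for page in pdf_reader.pages)
# summaries = summarize_chunks(summarizer, iter_chunks(pages),
#                              max_length=150, min_length=50, do_sample=False)
```

Batching mainly pays off on a GPU; on CPU the per-chunk loop is often just as fast, so it is worth timing both before committing to one.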



In [7]:
# Initialize summarizer (already loaded above; repeated so this cell runs on its own)
summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6')

# Load PDF
pdf_path = '/content/Art Collector personality types_ finish.pdf'
chunks = []
chunk_size = 1000  # Chunk size in characters
overlap = 100      # Characters shared between consecutive chunks for better context

with open(pdf_path, 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_text = ""

    # Concatenate the text of every page (guard in case a page yields no text)
    for page in pdf_reader.pages:
        pdf_text += page.extract_text() or ""

# Create chunks with overlap: each chunk starts (chunk_size - overlap) characters after the previous one
for i in range(0, len(pdf_text), chunk_size - overlap):
    chunk = pdf_text[i:i + chunk_size]
    chunks.append(chunk)
# Generate summaries for each chunk
for i, chunk in enumerate(chunks):
    summary = summarizer(chunk, max_length=150, min_length=50, do_sample=False)
    print(f"Summary of Chunk {i + 1}:", summary[0]['summary_text'])

Summary of Chunk 1:  The Niche Enthusiast looks for very specific art forms, be it abstract, surrealism, or any niche genre . The Novice Collector is eager, curious, and often rely on others' opinions or visible  trends . For artists, aligning  with their niche preference is key to gain their attention .
Summary of Chunk 2:  The Investment Collector buys artworks they believe will appreciate value . The Speculative Collector takes risks on unknown or lesser-known artists hoping that their value will explode in the future . Artists who’ve received media attention or have a growing  reputation will appeal to this group .
Summary of Chunk 3:  The Emotional Buyer buys art that resonates with them on a deep, emotional level . The Trend Follower buys artworks that are trending or popular at the moment . For artists, storytelling and personal connection can be the key to attracting these collectors .
Summary of Chunk 4:  The Philanthropic/Patron Collector: They believe in supporting artists .

### Adjust Summary Length

Setting `max_length=50` together with a small `min_length` (the cell below uses `min_length=20`) produces shorter summaries. Both limits are counted in tokens of the model's tokenizer, which are subword units and so usually outnumber the words, whereas `chunk_size` is counted in characters. Keep in mind:

1. **Context Loss**: With shorter summaries, there’s a chance of losing important context, especially if the chunked text is lengthy. You might need to adjust the `chunk_size` to balance this, ensuring each chunk contains focused information.

2. **Model Constraints**: `max_length` is a hard cap on generated tokens, so a summary can be cut off mid-sentence when the model has more to say (see Chunk 3 in the output below). Conversely, `min_length` forces generation to continue until that many tokens are produced, even for a very short chunk. Adjust both based on the content complexity.

3. **Summarization Quality**: Shorter summaries are more concise but may miss finer details. Experimenting with values (like `min_length=20` and `max_length=70`) can help you find the balance that best captures key points in shorter text.


This setup generates shorter, more concise summaries for each chunk. Adjusting `max_length` and `min_length` controls the detail level and length of each summary.
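Because the limits are in tokens while chunks are cut by characters, one option is to derive the limits from each chunk's token count instead of fixing them. The sketch below is one possible heuristic, not a recommended setting: `length_bounds` and `token_count` are hypothetical helpers, and the `ratio` and `cap` defaults are arbitrary illustrative values.

```python
def token_count(tokenizer, text):
    """Number of tokens the tokenizer produces for `text`."""
    return len(tokenizer(text)["input_ids"])


def length_bounds(n_input_tokens, ratio=0.3, cap=50):
    """Pick (min_length, max_length) in tokens, scaled to the input length.

    max_length is `ratio` times the input length, limited to `cap`;
    min_length is half of max_length.
    """
    max_length = max(2, min(cap, int(n_input_tokens * ratio)))
    min_length = max(1, max_length // 2)
    return min_length, max_length


# Usage with the pipeline from this notebook (not run here):
# for i, chunk in enumerate(chunks):
#     n_tokens = token_count(summarizer.tokenizer, chunk)
#     min_len, max_len = length_bounds(n_tokens)
#     summary = summarizer(chunk, max_length=max_len, min_length=min_len, do_sample=False)
#     print(f"Summary of Chunk {i + 1}:", summary[0]['summary_text'])
```

Scaling the limits this way keeps a short final chunk from being asked for a summary longer than the chunk itself.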

In [13]:
# Initialize summarizer (already loaded above; repeated so this cell runs on its own)
summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6')

# Define desired summary length (in tokens)
max_length = 50
min_length = 20

# Load PDF
pdf_path = '/content/Art Collector personality types_ finish.pdf'
chunks = []
chunk_size = 500  # Smaller chunk size (in characters) for more compact context
overlap = 50      # Smaller overlap to match the smaller chunks

with open(pdf_path, 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_text = ""

    # Concatenate the text of every page (guard in case a page yields no text)
    for page in pdf_reader.pages:
        pdf_text += page.extract_text() or ""

# Create compact chunks with overlap
for i in range(0, len(pdf_text), chunk_size - overlap):
    chunk = pdf_text[i:i + chunk_size]
    chunks.append(chunk)

# Generate summaries for each chunk with shorter lengths
for i, chunk in enumerate(chunks):
    summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
    print(f"Summary of Chunk {i + 1}:", summary[0]['summary_text'])


Summary of Chunk 1:  Novice Collector: These are individuals just starting in art collection . Niche Enthusiast: This collector looks for very specific art forms, be it abstract, surrealism, or surrealism .
Summary of Chunk 2:  For artists, aligning with their niche preference is key to gain their attention . They’re well-informed about their  purposefully chosen niche and appreciate artists who excel within it .
Summary of Chunk 3:  The Investment Collector is driven by the potential return on artworks they believe will appreciate value . Artists who’ve received media attention or have a growing reputation will appeal to this group . The Speculative Collector buys artworks he believes
Summary of Chunk 4:  The speculative collector takes risks on unknown or lesser-known artists hoping that their value will explode in the future . Provide data on past art sales, auction results, and press coverage to highlight potential ROI .
Summary of Chunk 5:  The Trend Follower: Buy artworks that ar