# Analyzing the 'AI Blueprint of India' PDF

This notebook demonstrates how to work with the content of the `CaseStudy-AIBluePrintofIndiabyLeezaPathan.pdf` file. We will cover three main methods:

1.  **Displaying the PDF** for easy reference.
2.  **Extracting all text** for analysis.
3.  **Extracting tables** into pandas DataFrames.

## Setup: Installing Libraries

First, make sure the required libraries are installed. If any are missing, uncomment the relevant line below (remove the leading `# `) and run the cell:

In [None]:
# !pip install PyMuPDF
# !pip install tabula-py
# !pip install pandas
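
To see which of the three libraries are missing before installing anything, a small check like the one below may help. It is a minimal sketch: the helper name `missing_packages` and the `REQUIRED` mapping are illustrative, and it only tests whether each module can be located, not whether it works.

```python
from importlib.util import find_spec

# pip package name -> import name (the two differ for PyMuPDF and tabula-py)
REQUIRED = {"PyMuPDF": "fitz", "tabula-py": "tabula", "pandas": "pandas"}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]

print("Missing packages:", missing_packages() or "none")
```

Any name printed here corresponds to one of the `pip install` lines above.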

--- 
## Method 1: Display the PDF in the Notebook 📄

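This method embeds the PDF directly into the output cell, which is useful for quick reference. The path is relative to the notebook's location, and rendering depends on the front end's PDF support, so the embed may appear blank in some environments.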

In [None]:
from IPython.display import IFrame

# Display the PDF file. You can adjust the width and height as needed.
IFrame("CaseStudy-AIBluePrintofIndiabyLeezaPathan.pdf", width=900, height=600)

--- 
## Method 2: Extract All Text for Analysis

Here, we'll use the **PyMuPDF** library (imported as `fitz`) to read the entire text content of the PDF into a single Python string. This is usually the first step for text-based analysis or NLP tasks.

In [None]:
import fitz  # PyMuPDF

pdf_path = "CaseStudy-AIBluePrintofIndiabyLeezaPathan.pdf"
full_text = ""

try:
    # Open the PDF; the context manager closes it when the block ends
    with fitz.open(pdf_path) as doc:
        # Extract the text of each page and join the pages into one string
        full_text = "".join(page.get_text() for page in doc)

    # Print the first 500 characters of the extracted text
    print("Successfully extracted text. Printing the first 500 characters:\n")
    print(full_text[:500])
except Exception as e:
    print(f"An error occurred: {e}")
    print(f"Please ensure the file '{pdf_path}' is in the same directory as this notebook.")
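
Once the text is in a string, one simple next step is a word-frequency count. The sketch below uses only the standard library. The helper `top_words` and the sample sentence are illustrative stand-ins, not text taken from the PDF; in the notebook you would pass `full_text` instead.

```python
import re
from collections import Counter

def top_words(text, n=10, min_length=4):
    """Return the n most frequent words with at least min_length letters."""
    # Lowercase the text and keep runs of ASCII letters only
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if len(w) >= min_length).most_common(n)

# Made-up sample string standing in for `full_text`
sample = "AI policy in India: AI research, AI skills, and data policy."
print(top_words(sample, n=2, min_length=2))  # [('ai', 3), ('policy', 2)]
```

The `min_length` filter is a crude substitute for a stop-word list. A real analysis would likely use a proper tokenizer.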

--- 
## Method 3: Extract Tables into a DataFrame 📊

For structured data, the **`tabula-py`** library is a good fit. It detects tables in a PDF and returns them as pandas DataFrames, although detection quality depends on how the tables are laid out.

**Note:** `tabula-py` is a wrapper around tabula-java, so Java must be installed on your system.
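
To check for Java before calling `tabula`, something like the following sketch could be run first. The helper name `java_available` is illustrative, and the check assumes the `java` executable is reachable via `PATH`.

```python
import shutil
import subprocess

def java_available():
    """Return True if a `java` executable is on PATH and runs successfully."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        # `java -version` exits with code 0 when the runtime starts correctly
        result = subprocess.run([java, "-version"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

if java_available():
    print("Java found.")
else:
    print("Java not found; install a Java runtime before using tabula-py.")
```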

In [None]:
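import tabula
import pandas as pd  # tabula returns pandas DataFrames
from IPython.display import display  # explicit import so the cell also runs outside IPython's default namespace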

pdf_path = "CaseStudy-AIBluePrintofIndiabyLeezaPathan.pdf"

try:
    # Read all tables from all pages of the PDF
    # This returns a list of pandas DataFrames
    list_of_dataframes = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)

    print(f"Found {len(list_of_dataframes)} tables in the PDF.\n")

    # Find and display the supercomputer table (from page 7)
    # Identify it by a column whose name contains "Rmax"; tabula may keep
    # line breaks from the header, e.g. 'Rmax\r[TFlop/s]'
    supercomputer_table = None
    for df in list_of_dataframes:
        if any("Rmax" in str(col) for col in df.columns):
            supercomputer_table = df
            break

    if supercomputer_table is not None:
        print("Displaying the Supercomputer Table:")
        display(supercomputer_table)
    else:
        print("Could not automatically find the supercomputer table.")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure Java is installed and that the PDF file is not corrupted.")
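
Headers extracted by tabula can contain line breaks, as in `'Rmax\r[TFlop/s]'` above. The sketch below shows one way to tidy such labels and look up a column by keyword. The helpers `normalize_header` and `find_column` are illustrative, and apart from the `Rmax` label the example header list is hypothetical.

```python
import re

def normalize_header(name):
    """Collapse carriage returns, newlines, and repeated spaces into one space."""
    return re.sub(r"\s+", " ", str(name)).strip()

def find_column(columns, keyword):
    """Return the first original label containing keyword (case-insensitive), or None."""
    for col in columns:
        if keyword.lower() in normalize_header(col).lower():
            return col
    return None

# Hypothetical header row; only the Rmax label comes from the cell above
headers = ["Rank", "System", "Rmax\r[TFlop/s]"]
print([normalize_header(h) for h in headers])  # ['Rank', 'System', 'Rmax [TFlop/s]']
print(repr(find_column(headers, "rmax")))      # 'Rmax\r[TFlop/s]'
```

With a DataFrame, the same cleanup can be applied with `df.columns = [normalize_header(c) for c in df.columns]`.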