# PDF Table Extraction from MHA Website Using Camelot

This project extracts table data from Ministry of Home Affairs (MHA) PDFs using **Camelot**. These PDFs hold their data in tables, and **Camelot** suits them well because it returns each detected table as a pandas dataframe.

### **Camelot-Based Table Extraction**


## Project Overview

- **Download PDFs**: Download PDF files from the MHA website using `requests`. `BeautifulSoup` can collect the PDF links from the page; the Camelot notebook below lists its three target URLs explicitly.
- **Extract Tables**: Use **Camelot** to extract tables from the PDFs, especially from PDFs containing structured tabular data.
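
As a minimal sketch of the link-collection step, the snippet below pulls every `.pdf` link out of an HTML page and resolves it against the site root. The function name `extract_pdf_links` and the inline HTML fragment are illustrative assumptions, not part of the original notebook, and the built-in `html.parser` is used here so the example does not depend on `lxml`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_pdf_links(html, base_url):
    """Return absolute URLs for every <a href> that points to a .pdf file."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.lower().endswith(".pdf"):
            # urljoin handles both relative and already-absolute hrefs
            links.append(urljoin(base_url, href))
    return links


# Hypothetical HTML fragment standing in for the downloaded page
sample_html = """
<a class="ext" href="/sites/default/files/2024-07/LISTOFUNLAWFULASSOCIATIONS_11072024.pdf">List</a>
<a href="/en/about">About</a>
<a name="no-href">Anchor without href</a>
"""

pdf_links = extract_pdf_links(sample_html, "https://www.mha.gov.in")
print(pdf_links)
```

Using `urljoin` rather than string concatenation avoids malformed URLs when a page mixes relative and absolute links.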

## Features

- **Table Extraction**: Extracts tables from each PDF file and combines the data into a pandas dataframe for further analysis.
- **Text Extraction**: Additionally, extracts introductory text from the first page using **PDFplumber**.
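
One possible way to isolate the title from the first-page text is to keep the lines that come before the table header. The sketch below assumes the header row starts with `Sl.` or `S No.`, as it does in the three sample PDFs; the helper name `extract_title` and that pattern are assumptions and may need adjusting for other documents.

```python
import re

# Assumed pattern: the table header begins with "Sl." or "S No." in the sample PDFs
HEADER_RE = re.compile(r"^(Sl\.|S\.?\s*No\.)", re.IGNORECASE)


def extract_title(page_text):
    """Join the lines that precede the table header into a single title string."""
    title_lines = []
    for line in (page_text or "").splitlines():
        line = line.strip()
        if not line:
            continue
        if HEADER_RE.match(line):
            break  # the table starts here, so the title is complete
        title_lines.append(line)
    return " ".join(title_lines)


sample_text = (
    "LIST OF INDIVIDUAL TERRORISTS DESIGNATED UNDER SECTION 35 OF THE\n"
    "UNLAWFUL ACTIVITIES (PREVENTION) ACT, 1967, LISTED IN THE IVth SCHEDULE\n"
    "OF THE ACT\n"
    "Sl. Name of the Terrorist\n"
    "No.\n"
    "1. Maulana Masood Azhar"
)
title = extract_title(sample_text)
print(title)
```

Unlike splitting on the first period, this approach does not depend on where punctuation happens to fall in the heading.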

## Steps

1. **Download PDFs from MHA Website**:
   `BeautifulSoup` can extract the links to the PDF files on the MHA website, and `requests` downloads them. The Camelot notebook below downloads three explicitly listed URLs.
   
2. **Table Extraction Using Camelot**:
   **Camelot** reads the tables from every page of each PDF. The per-table dataframes are concatenated into one pandas dataframe, the first row is promoted to the column header, and the trailing period is stripped from the serial numbers.
   
3. **Data Combination**:
   Both text and table data are extracted and combined into a structured dictionary.
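
The table-cleaning step can be sketched with plain pandas, independent of Camelot. The example below assumes each raw table arrives the way Camelot's `table.df` does, with integer column labels and the header text in the first row; the function name `tables_to_records` and the two toy frames are illustrative assumptions.

```python
import pandas as pd


def tables_to_records(frames):
    """Combine raw table frames and promote the first row to column names."""
    if not frames:
        return []  # pd.concat would raise ValueError on an empty list
    combined = pd.concat(frames, ignore_index=True)

    # Build column names from the first row; the first column is always 'sr_no'
    header = [str(h).strip().replace("\n", " ").replace(" ", "_") for h in combined.iloc[0]]
    header[0] = "sr_no"

    body = combined.iloc[1:].copy()
    body.columns = header
    # Serial numbers look like "1." in the PDFs, so drop the trailing period
    body["sr_no"] = body["sr_no"].str.strip().str.rstrip(".")
    return body.to_dict(orient="records")


# Toy frames standing in for tables found on two consecutive pages
raw_page_1 = pd.DataFrame([["Sl. No.", "Name of Association"], ["1.", "ABC Organization"]])
raw_page_2 = pd.DataFrame([["2.", "XYZ Organization"]])

records = tables_to_records([raw_page_1, raw_page_2])
print(records)
```

Building the header list from every column keeps the rename valid for tables with more than two columns.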

## Requirements

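Install the required libraries from a notebook cell (the `!` prefix runs each line as a shell command; `apt-get` installs the Ghostscript binary itself):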

```python
!pip install ghostscript
!apt-get install -y ghostscript
!pip install camelot-py
!pip install pdfplumber
```

The script does the following:
- Downloads the PDFs.
- Extracts tables using **Camelot** and combines them into a pandas dataframe.
- Extracts introductory text using **PDFplumber** for additional context.

## Sample Output

```python
{
    "title": "List of Unlawful Associations...",
    "data": [
        {"sr_no": "1", "Name_of_Association": "ABC Organization"},
        {"sr_no": "2", "Name_of_Association": "XYZ Organization"},
        ...
    ]
}
```
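
The notebook imports `json` but does not show the result being persisted. As a hedged sketch, a dictionary of the shape shown above could be written to disk and read back as follows; the helper names `save_result` and `load_result` and the file name are assumptions for illustration.

```python
import json
import os
import tempfile


def save_result(result, path):
    """Write a result dictionary to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII characters such as curly quotes readable
        json.dump(result, fh, ensure_ascii=False, indent=2)


def load_result(path):
    """Read a result dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


result = {
    "title": "List of Unlawful Associations...",
    "data": [
        {"sr_no": "1", "Name_of_Association": "ABC Organization"},
        {"sr_no": "2", "Name_of_Association": "XYZ Organization"},
    ],
}

# A temporary directory keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "unlawful_associations.json")
    save_result(result, out_path)
    restored = load_result(out_path)

print(restored == result)
```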


In [None]:
# Install the necessary libraries

!pip install ghostscript
!apt-get install -y ghostscript
!pip install camelot-py
!pip install pdfplumber


In [None]:
import camelot
import json
import requests
from bs4 import BeautifulSoup
import os
import pdfplumber
import pandas as pd

# Function to download PDFs with headers
def download_pdf(url, save_dir):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }

    try:
        # A timeout prevents the request from hanging indefinitely if the server stalls
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        file_name = os.path.join(save_dir, url.split('/')[-1])
        with open(file_name, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded: {file_name}")
    except Exception as e:
        print(f"Failed to download {url}: {e}")

# PDF URLs to download
pdfs_to_download = [
    'https://www.mha.gov.in/sites/default/files/2024-07/LISTOFUNLAWFULASSOCIATIONS_11072024.pdf',
    'https://www.mha.gov.in/sites/default/files/2024-03/Listof57terrorists_07032024.pdf',
    'https://www.mha.gov.in/sites/default/files/2023-06/TERRORIST_ORGANIZATIONS_10032023.pdf'
]

# Directory to save the downloaded PDFs
save_dir = '/content/'
os.makedirs(save_dir, exist_ok=True)

# Download PDFs
for url in pdfs_to_download:
    download_pdf(url, save_dir)

# Function to process PDF files and convert to dictionary
def process_pdf(pdf_path):
    try:
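        with pdfplumber.open(pdf_path) as pdf:
            first_page = pdf.pages[0]  # Assuming the title is on the first page
            text = first_page.extract_text() or ''  # Guard against a page with no text layer
            # Note: splitting on the first '.' can pull table-header text into the title
            first_sentence = text.split('.')[0] + '.'

        tables = camelot.read_pdf(pdf_path, pages='all')
        if len(tables) == 0:
            # pd.concat raises ValueError on an empty list, so return early
            return {'title': 'No data', 'data': []}
        combined_df = pd.concat([table.df for table in tables], ignore_index=True)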

        if not combined_df.empty:
            new_header = combined_df.iloc[0]
            combined_df = combined_df[1:]
            combined_df.columns = new_header

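            if len(combined_df.columns) > 1:
                # Rename the first column to 'sr_no' and replace spaces in the remaining headers
                combined_df.columns = ['sr_no'] + [str(col).replace(" ", "_") for col in combined_df.columns[1:]]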

            combined_df['sr_no'] = combined_df['sr_no'].str.strip('.')

            data_dict = combined_df.to_dict(orient='records')

            result_dict = {
                'title': first_sentence,
                'data': data_dict
            }
        else:
            result_dict = {'title': 'No data', 'data': []}

    except Exception as e:
        print(f"Failed to process {pdf_path}: {e}")
        result_dict = {'title': 'Error', 'data': []}

    return result_dict

# Process each PDF file
Name_of_the_Terrorist = process_pdf('/content/Listof57terrorists_07032024.pdf')
Name_of_Terrorist_Organization = process_pdf('/content/TERRORIST_ORGANIZATIONS_10032023.pdf')
Name_of_Unlawful_Association = process_pdf('/content/LISTOFUNLAWFULASSOCIATIONS_11072024.pdf')

# Print the final dictionaries
print("Name_of_the_Terrorist:")
print(Name_of_the_Terrorist)
print('*'*100)

print("\nName_of_Terrorist_Organization:")
print(Name_of_Terrorist_Organization)
print('*'*100)

print("\nName_of_Unlawful_Association:")
print(Name_of_Unlawful_Association)


Downloaded: /content/LISTOFUNLAWFULASSOCIATIONS_11072024.pdf
Downloaded: /content/Listof57terrorists_07032024.pdf
Downloaded: /content/TERRORIST_ORGANIZATIONS_10032023.pdf
Name_of_the_Terrorist:
{'title': 'LIST OF INDIVIDUAL TERRORISTS DESIGNATED UNDER SECTION 35 OF THE\nUNLAWFUL ACTIVITIES (PREVENTION) ACT, 1967, LISTED IN THE IVth SCHEDULE\nOF THE ACT\nSl.', 'data': [{'sr_no': '1', 'Name_of_the_Terrorist': 'Maulana Masood Azhar @ Maulana Mohammad Masood Azhar Alvi @ Vali \nAdam Issa'}, {'sr_no': '2', 'Name_of_the_Terrorist': 'Hafiz Muhammad Saeed @ Hafiz Mohammad Sahib @ Hafiz Mohaddad \nSayid @ Hafiz Muhammad @ Hafiz Saeed @ Hafez Mohammad Saeed @ \nHafiz Mohammad Sayeed @ Mohammad Sayed @ Muhammad Saeed'}, {'sr_no': '3', 'Name_of_the_Terrorist': 'Zaki-ur-Rehman Lakhvi @ Abu Waheed Irshad Ahmad Arshad  @ Kaki Ur-\nRehman @ Zakir Rehman Lakhvi @ Zaki-Ur-Rehman Lakvi @ Zakir \nRehman'}, {'sr_no': '4', 'Name_of_the_Terrorist': 'Dawood Ibrahim Kaskar @ Dawood Hasan Shiekh Kaskar @ Dawoo

In [None]:
# verify data
df1=pd.DataFrame(Name_of_the_Terrorist)
df1['data'][0]

{'sr_no': '1',
 'Name_of_the_Terrorist': 'Maulana Masood Azhar @ Maulana Mohammad Masood Azhar Alvi @ Vali \nAdam Issa'}

In [None]:
df2=pd.DataFrame(Name_of_Terrorist_Organization)
df2['data'][0]

{'sr_no': '1', 'Name_of_Terrorist_Organization': 'Babbar Khalsa International'}

In [None]:
df3=pd.DataFrame(Name_of_Unlawful_Association)
df3['data'][0]

{'sr_no': '1',
 'Name_of_Unlawful_Association': 'Students Islamic Movement of India (SIMI)'}
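
The spot checks above look only at the first record of each list. A slightly broader check, sketched below, reports any serial numbers missing from the extracted records, which can reveal rows lost during table detection. The function name `find_serial_gaps` and the toy records are assumptions for illustration.

```python
def find_serial_gaps(records):
    """Return the serial numbers missing from 1..max, based on each record's 'sr_no'."""
    seen = {
        int(record["sr_no"])
        for record in records
        if str(record.get("sr_no", "")).isdigit()
    }
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]


# Toy records with serial number 3 deliberately left out
toy_records = [
    {"sr_no": "1", "Name_of_Association": "ABC Organization"},
    {"sr_no": "2", "Name_of_Association": "XYZ Organization"},
    {"sr_no": "4", "Name_of_Association": "DEF Organization"},
]

gaps = find_serial_gaps(toy_records)
print(gaps)
```

With the notebook's variables, the same check would be called as `find_serial_gaps(Name_of_the_Terrorist['data'])`.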



### **PDFplumber-Based Text Extraction**

# PDF Data Extraction from MHA Website Using PDFplumber

This project automates the process of downloading PDFs from the Ministry of Home Affairs (MHA) website and extracting text data using **PDFplumber**. The focus is on extracting simple text from the PDFs, particularly from the first page, which typically contains important information like the title.

## Project Overview

- **Download PDFs**: Automatically download PDF files from the MHA website using `requests` and `BeautifulSoup`.
- **Extract Text**: Use **PDFplumber** to extract textual content from the PDFs. This is useful for PDFs that contain simple, easy-to-read content.
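
The download code in this notebook derives the local file name with `url.split('/')[-1]`, which would keep a query string if a URL had one. A slightly more defensive sketch is shown below; the helper name `filename_from_url`, the fallback name, and the `?download=1` suffix are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.pdf"):
    """Return the last path segment of a URL, ignoring query strings and fragments."""
    path = urlparse(url).path
    # unquote turns percent-escapes such as %20 back into plain characters
    name = posixpath.basename(unquote(path))
    return name or default


name_with_query = filename_from_url(
    "https://www.mha.gov.in/sites/default/files/2024-03/Listof57terrorists_07032024.pdf?download=1"
)
name_without_path = filename_from_url("https://www.mha.gov.in/")
print(name_with_query, name_without_path)
```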

## Features

- **Text Extraction**: Extracts text from every page of each PDF file and stores it in a dictionary keyed by page (`Page_1`, `Page_2`, ...). The examples below print the first page.
  
## Steps

1. **Download PDFs from MHA Website**:
   The MHA website is scraped using `BeautifulSoup` to extract the necessary PDF links, followed by downloading the files using `requests`.
   
2. **Text Extraction Using PDFplumber**:
   **PDFplumber** extracts the text of each page of each PDF. The first page is where titles and introductory content are usually located.
   
3. **Data Organization**:
   The extracted text from each PDF is stored in a Python dictionary, allowing for easy access and further processing.
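
Because the page text is a flat string, turning it into records takes one more step. The sketch below assumes each entry starts with a number followed by a period, as in the sample PDFs, and treats any other line as a continuation of the previous entry. The function name `parse_numbered_entries` and the `name` key are assumptions; sub-items such as `(i)` are folded into their parent entry, and hyphenated line breaks are joined with a space.

```python
import re

# Assumed entry format: "<number>. <text>", as seen in the extracted page text
ENTRY_START = re.compile(r"^(\d+)\.\s+(.*)$")


def parse_numbered_entries(page_text):
    """Split page text into records, merging wrapped lines into the entry above them."""
    records = []
    for line in page_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ENTRY_START.match(line)
        if match:
            records.append({"sr_no": match.group(1), "name": match.group(2)})
        elif records:
            # Continuation of the previous entry; lines before the first entry are skipped
            records[-1]["name"] += " " + line
    return records


sample_page = (
    "S No. Name of Terrorist Organization\n"
    "1. Babbar Khalsa International\n"
    "2. Khalistan Commando Force\n"
    "3. Khalistan Zindabad Force\n"
    "4. International Sikh Youth Federation\n"
    "5. Lashkar-E-Taiba/Pasban-E-Ahle Hadis/The Resistance Front and all its\n"
    "manifestations and front organizations."
)
entries = parse_numbered_entries(sample_page)
print(len(entries), entries[-1])
```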

## Requirements

Install the required Python libraries:

```bash
pip install requests pdfplumber beautifulsoup4 lxml
```

The script does the following:
- Downloads PDFs from the MHA website.
- Extracts text from each page of each PDF using PDFplumber.
- Saves the text in a dictionary keyed by page number.

## Sample Output

```python
{
    "title": "List of Terrorists...",
    "data": {
        "Page_1": "This is the extracted text from the first page..."
    }
}
```
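
The extraction function in the code cell below returns a flat dictionary keyed by `Page_N`. To produce the `title`/`data` shape shown above, the pages could be wrapped as in this sketch, which takes the first line of the first page as the title. The helper name `wrap_pages` and that title rule are assumptions, not the notebook's own logic.

```python
def wrap_pages(pages):
    """Wrap a {'Page_N': text} dictionary in a {'title': ..., 'data': ...} structure."""
    first_page = pages.get("Page_1", "")
    # Use the first line of page 1 as the title; fall back when the page is empty
    title = first_page.splitlines()[0].strip() if first_page.strip() else "No data"
    return {"title": title, "data": pages}


# Toy page dictionary standing in for the output of the extraction function
toy_pages = {
    "Page_1": "List of Terrorists...\nThis is the extracted text from the first page...",
}
wrapped = wrap_pages(toy_pages)
print(wrapped["title"])
```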


In [None]:
import os
import requests
import pdfplumber
from bs4 import BeautifulSoup

# Define headers for web scraping
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

def download_pdfs(pdf_urls, save_dir):
    """Download PDFs from the provided URLs and save them to the specified directory."""
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(save_dir, exist_ok=True)

    for url in pdf_urls:
        try:
            # A timeout prevents the request from hanging indefinitely if the server stalls
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()  # Check for HTTP errors
            file_name = os.path.join(save_dir, url.split('/')[-1])
            with open(file_name, 'wb') as file:
                file.write(response.content)
            print(f"Downloaded: {file_name}")
        except requests.RequestException as e:
            print(f"Failed to download {url}: {e}")

def extract_pdf_content_to_dict(pdf_path):
    """Extract text content from a PDF and return it as a dictionary."""
    extracted_data = {}
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for i, page in enumerate(pdf.pages):
                text = page.extract_text()
                if text:
                    extracted_data[f'Page_{i+1}'] = text
    except Exception as e:
        print(f"Error extracting content from {pdf_path}: {e}")
    return extracted_data

def get_pdfs_content():
    """Main function to download and extract content from PDFs."""
    # Scrape the division page for PDF links (collected for reference; the download below uses an explicit list)
    webpage = requests.get('https://www.mha.gov.in/en/divisionofmha/counter-terrorism-and-counter-radicalization-division', headers=headers, timeout=30).text
    soup = BeautifulSoup(webpage, 'lxml')
    base_url = 'https://www.mha.gov.in'
    links = soup.find_all('a', class_='ext')
    # link.get avoids a KeyError for anchors that have no href attribute
    pdf_urls = [base_url + link['href'] for link in links if link.get('href', '').endswith('.pdf')]

    # List of PDFs to download
    pdfs_to_download = [
        'https://www.mha.gov.in/sites/default/files/2024-07/LISTOFUNLAWFULASSOCIATIONS_11072024.pdf',
        'https://www.mha.gov.in/sites/default/files/2024-03/Listof57terrorists_07032024.pdf',
        'https://www.mha.gov.in/sites/default/files/2023-06/TERRORIST_ORGANIZATIONS_10032023.pdf'
    ]

    # Directory to save the downloaded PDFs
    save_dir = '/content/'

    # Download specified PDFs
    download_pdfs(pdfs_to_download, save_dir)

    # File paths for extracted content
    pdf_files = {
        'Name_of_the_Terrorist': os.path.join(save_dir, 'Listof57terrorists_07032024.pdf'),
        'Name_of_Terrorist_Organization': os.path.join(save_dir, 'TERRORIST_ORGANIZATIONS_10032023.pdf'),
        'Name_of_Unlawful_Association': os.path.join(save_dir, 'LISTOFUNLAWFULASSOCIATIONS_11072024.pdf')
    }

    # Extract content from each PDF and return as separate dictionaries
    name_of_the_terrorist = extract_pdf_content_to_dict(pdf_files['Name_of_the_Terrorist'])
    name_of_terrorist_organization = extract_pdf_content_to_dict(pdf_files['Name_of_Terrorist_Organization'])
    name_of_unlawful_association = extract_pdf_content_to_dict(pdf_files['Name_of_Unlawful_Association'])

    return name_of_the_terrorist, name_of_terrorist_organization, name_of_unlawful_association

# Execute and get the extracted content
name_of_the_terrorist, name_of_terrorist_organization, name_of_unlawful_association = get_pdfs_content()

# Example to show the output
print("\nName_of_the_Terrorist - First Page Content:\n")
print(name_of_the_terrorist.get('Page_1'))
print('*' * 150)

print("\nName_of_Terrorist_Organization - First Page Content:\n")
print(name_of_terrorist_organization.get('Page_1'))
print('*' * 150)

print("\nName_of_Unlawful_Association - First Page Content:\n")
print(name_of_unlawful_association.get('Page_1'))
print('*' * 150)


Downloaded: /content/LISTOFUNLAWFULASSOCIATIONS_11072024.pdf
Downloaded: /content/Listof57terrorists_07032024.pdf
Downloaded: /content/TERRORIST_ORGANIZATIONS_10032023.pdf

Name_of_the_Terrorist - First Page Content:

LIST OF INDIVIDUAL TERRORISTS DESIGNATED UNDER SECTION 35 OF THE
UNLAWFUL ACTIVITIES (PREVENTION) ACT, 1967, LISTED IN THE IVth SCHEDULE
OF THE ACT
Sl. Name of the Terrorist
No.
1. Maulana Masood Azhar @ Maulana Mohammad Masood Azhar Alvi @ Vali
Adam Issa
2. Hafiz Muhammad Saeed @ Hafiz Mohammad Sahib @ Hafiz Mohaddad
Sayid @ Hafiz Muhammad @ Hafiz Saeed @ Hafez Mohammad Saeed @
Hafiz Mohammad Sayeed @ Mohammad Sayed @ Muhammad Saeed
3. Zaki-ur-Rehman Lakhvi @ Abu Waheed Irshad Ahmad Arshad @ Kaki Ur-
Rehman @ Zakir Rehman Lakhvi @ Zaki-Ur-Rehman Lakvi @ Zakir
Rehman
4. Dawood Ibrahim Kaskar @ Dawood Hasan Shiekh Kaskar @ Dawood Bhai
@ Dawood Sabri @Iqbal Seth @ Bada Patel @ Dawood Ebrahim @ Sheikh
Dawood Hassan @ Abdul Hamid Abdul Aziz @ Anis Ibrahim @ Aziz Dilip @
Daud 

In [None]:
# verify data

In [None]:
name_of_terrorist_organization.get('Page_1')

'LIST OF ORGANISATIONS DESIGNATED AS ‘TERRORIST ORGANIZATIONS’ UNDER\nSECTION 35 OF THE UNLAWFUL ACTIVITIES (PREVENTION) ACT, 1967, LISTED IN\nTHE 1St SCHEDULE OF THE ACT.\nS No. Name of Terrorist Organization\n1. Babbar Khalsa International\n2. Khalistan Commando Force\n3. Khalistan Zindabad Force\n4. International Sikh Youth Federation\n5. Lashkar-E-Taiba/Pasban-E-Ahle Hadis/The Resistance Front and all its\nmanifestations and front organizations.\n6. Jaish-E-Mohammed/Tahreik-E-Furqan/People’s Anti-Fascist-Front (PAFF) and\nall its manifestations and front organizations.\n7. Harkat-ul-Mujahideen/Harkat-ul-Ansar/Harkat-ul-Jehad-E-Islami or Ansar-Ul-\nUmmah\n8. Hizb-Ul-Mujahideen/Hizb-Ul-Mujahideen Pir Panjal Regiment\n9. Al-Umar-Mujahideen\n10. Jammu and Kashmir Islamic Front\n11. United Liberation Front of Assam (ULFA)\n12. National Democratic Front of Bodoland (NDFB) in Assam\n13. People Liberation Army (PLA)\n14. United National Liberation Front (UNLF)\n15. People’s Revolutionary P

In [None]:
name_of_unlawful_association.get('Page_1')

'LIST OF ASSOCIATIONS DECLARED AS ‘UNLAWFUL\nASSOCIATION’ UNDER SUB-SECTION 1 OF SECTION 3 OF\nUNLAWFUL ACTIVITIES (PREVENTION) ACT, 1967.\nSl. Name of Unlawful Association\nNo.\n1. Students Islamic Movement of India (SIMI)\n2. United Liberation Front of Assam (ULFA)\n3. National Democratic Front of Bodoland (NDFB)\n4. All Tripura Tiger Force (ATTF)\n5. Meitei Extremist Organizations, namely-\n(i) Peoples’ Liberation Army (PLA) and its political wing, the Revolutionary\nPeople’s Front (RPF)\n(ii) United National Liberation Front (UNLF) and its armed wing, the Manipur\nPeoples’ Army (MPA)\n(iii) Peoples’ Revolutionary Party of Kangleipak (PREPAK) and its Armed\nwing, the ‘Red Army’.\n(iv) Kangleipak Communist Party (KCP) and its armed wing, also called\nthe ‘Red Army’\n(v) Kanglei Yaol Kanba Lup (KYKL)\n(vi) Coordination Committee (CorCom) and\n(vii) Alliance for Socialist Unity Kangleipak (ASUK)\n6. National Liberation Front of Tripura (NLFT)\n7. Hynniewtrep National Liberation Council

In [None]:
name_of_the_terrorist.get('Page_1')

'LIST OF INDIVIDUAL TERRORISTS DESIGNATED UNDER SECTION 35 OF THE\nUNLAWFUL ACTIVITIES (PREVENTION) ACT, 1967, LISTED IN THE IVth SCHEDULE\nOF THE ACT\nSl. Name of the Terrorist\nNo.\n1. Maulana Masood Azhar @ Maulana Mohammad Masood Azhar Alvi @ Vali\nAdam Issa\n2. Hafiz Muhammad Saeed @ Hafiz Mohammad Sahib @ Hafiz Mohaddad\nSayid @ Hafiz Muhammad @ Hafiz Saeed @ Hafez Mohammad Saeed @\nHafiz Mohammad Sayeed @ Mohammad Sayed @ Muhammad Saeed\n3. Zaki-ur-Rehman Lakhvi @ Abu Waheed Irshad Ahmad Arshad @ Kaki Ur-\nRehman @ Zakir Rehman Lakhvi @ Zaki-Ur-Rehman Lakvi @ Zakir\nRehman\n4. Dawood Ibrahim Kaskar @ Dawood Hasan Shiekh Kaskar @ Dawood Bhai\n@ Dawood Sabri @Iqbal Seth @ Bada Patel @ Dawood Ebrahim @ Sheikh\nDawood Hassan @ Abdul Hamid Abdul Aziz @ Anis Ibrahim @ Aziz Dilip @\nDaud Hasan Shaikh Ibrahim Kaskar @ Daud Ibrahim Memon Kaskar @\nDawood Hasan Ibrahim Kaskar @ Dawood Ibrahim Memon @ Kaskar\nDawood Hasan @ Shaikh Mohd Ismail Abdul Rehman @ Dowood Hassan\nShaikh Ibrahim @ 