## Reading PDFs in Python

PDFs are a poor format for storing data, but that is where a lot of data is stored and shared. At the moment we are dealing with Rajya Sabha proceedings.


In [25]:
import os
import requests
import pandas as pd
from dataclasses import dataclass
from datetime import datetime
from PyPDF2 import PdfFileReader
from functools import singledispatch
from typing import Optional, Union
from urllib.parse import urlparse

In [2]:
# Getting the PDF document

sessions = pd.read_csv('../data/sessions.csv')
sessions.head()

Unnamed: 0.1,Unnamed: 0,session_id,date,url,size
0,0,116,1980-12-24 00:00:00,http://164.100.47.5/Official_Debate_Nhindi/Flo...,(5825 Kb)
1,1,116,1980-12-23 00:00:00,http://164.100.47.5/Official_Debate_Nhindi/Flo...,(3738 Kb)
2,2,116,1980-12-22 00:00:00,http://164.100.47.5/Official_Debate_Nhindi/Flo...,(10298 Kb)
3,3,116,1980-12-19 00:00:00,http://164.100.47.5/Official_Debate_Nhindi/Flo...,(6683 Kb)
4,4,116,1980-12-18 00:00:00,http://164.100.47.5/Official_Debate_Nhindi/Flo...,(6937 Kb)


In [118]:
sessions['date'] = pd.to_datetime(sessions['date'])

In [119]:
sample_session = sessions.sort_values(by='date', ascending=False).iloc[0] 
pdf_link = sample_session['url']
session = sample_session['session_id']
session_date = sample_session['date']

In [109]:
class NotPDFLinkError(Exception):
    
    def __str__(self):
        return "The Link Provided does not end with type `.pdf`"
    

class NotAnURLError(Exception):
    
    def __str__(self):
        return "The link provided is not a valid URL"

In [110]:
@dataclass(frozen=True)
class ValidURL:
    string: str


@dataclass()
class ValidPDFURL:
    url: ValidURL


def get_valid_pdf_url(url) -> ValidPDFURL:
    # Raises NotAnURLError or NotPDFLinkError when validation fails
    return get_pdf_link(get_valid_url(url))


@singledispatch
def get_pdf_link(url) -> ValidPDFURL:
    # Fallback for anything that is not already a ValidURL
    raise NotAnURLError()

@get_pdf_link.register
def _(url: ValidURL) -> ValidPDFURL:
    if not url.string.endswith('.pdf'):
        raise NotPDFLinkError()
    return ValidPDFURL(url)


@singledispatch
def get_valid_url(url) -> ValidURL:
    # Fallback for anything that is not a string
    raise TypeError('Not A String')


@get_valid_url.register
def _(url: str) -> ValidURL:
    result = urlparse(url)
    if not all([result.scheme, result.netloc, result.path]):
        raise NotAnURLError()
    return ValidURL(url)
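
The `.pdf` check above looks at the whole URL string, so a link with a query string or an upper-case extension would be rejected. A minimal sketch of an alternative that inspects only the parsed path is below. `path_ends_with_pdf` is a hypothetical helper of mine, not something the code above uses, and the `example.com` links are made up for illustration.

```python
from urllib.parse import urlparse


def path_ends_with_pdf(url: str) -> bool:
    """Return True when the path component of `url` ends in `.pdf`.

    Hypothetical helper: it ignores the query string and fragment,
    and compares the extension case-insensitively.
    """
    path = urlparse(url).path
    return path.lower().endswith('.pdf')


# Made-up URLs, only to show the three cases
print(path_ends_with_pdf('http://example.com/report.pdf?download=1'))
print(path_ends_with_pdf('http://example.com/report.PDF'))
print(path_ends_with_pdf('http://example.com/report.html'))
```

Whether this looser check is wanted depends on the links in `sessions.csv`; the ones shown so far all end in a plain `.pdf`.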


In [131]:
@dataclass(frozen=True)
class SessionPDF:
    url: ValidPDFURL
    session: int
    session_date: datetime
    
    @property
    def filename(self):
        return f'session_{self.session}_{self.session_date.strftime("%d_%m_%Y")}.pdf'
    
    @property
    def link(self):
        return self.url.url.string


def download_pdf(session_pdf: SessionPDF, output_dir: str) -> str:
    if not os.path.isdir(output_dir):
        raise NotADirectoryError(f'Output directory does not exist: {output_dir}')

    response = requests.get(session_pdf.link)
    response.raise_for_status()  # fail on HTTP errors instead of saving an error page
    output_path = os.path.join(output_dir, session_pdf.filename)
    with open(output_path, 'wb') as datafile:
        datafile.write(response.content)
    return output_path
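
A URL ending in `.pdf` does not guarantee that the server sends back a PDF. A rough sanity check, assuming we only care about the file header, is to look for the `%PDF-` marker that PDF files start with. `looks_like_pdf` is a hypothetical helper, and some real-world files carry a few stray bytes before the header, so treat this as a sketch rather than a full validator.

```python
PDF_MAGIC = b'%PDF-'


def looks_like_pdf(data: bytes) -> bool:
    """Return True when `data` starts with the PDF header marker.

    Hypothetical helper: it checks only the first five bytes and does
    not verify that the rest of the file is well formed.
    """
    return data.startswith(PDF_MAGIC)


print(looks_like_pdf(b'%PDF-1.4\n% rest of the file'))
print(looks_like_pdf(b'<html><body>Not Found</body></html>'))
```

The check could be run on the downloaded bytes before they are written to disk.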

In [137]:
valid_pdf_url = get_valid_pdf_url('http://164.100.47.5/Official_Debate_Nhindi/Floor/250/F13.12.2019.pdf')
session_pdf = SessionPDF(
    url=valid_pdf_url,
    session=session,
    session_date=session_date
)
pdf_filepath = download_pdf(session_pdf, '../data/pdfs/')

In [65]:
data = PdfFileReader(pdf_filepath)



In [67]:
data.documentInfo

{'/Author': 'win7',
 '/CreationDate': "D:20200317130702+05'30'",
 '/Creator': 'Adobe Acrobat 8.0 Combine Files',
 '/ModDate': "D:20200909154801+05'30'",
 '/Producer': 'Acrobat Distiller 8.1.0 (Windows)',
 '/Title': '13 Dec.indd'}
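
The `/CreationDate` and `/ModDate` values are PDF date strings, not datetimes. Below is a minimal sketch of turning one into a timezone-aware `datetime`, assuming the full `D:YYYYMMDDHHmmSS+HH'mm'` form seen in the output above. `parse_pdf_date` is a hypothetical helper; shorter variants of the format (a bare year, or a `Z` offset) are not handled and raise `ValueError`.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches only the full form, e.g. D:20200317130702+05'30'
PDF_DATE = re.compile(
    r"D:(?P<stamp>\d{14})(?P<sign>[+-])(?P<hours>\d{2})'(?P<minutes>\d{2})'?"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a full-form PDF date string into an aware datetime."""
    match = PDF_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f'Unrecognised PDF date: {value!r}')
    offset = timedelta(hours=int(match['hours']), minutes=int(match['minutes']))
    if match['sign'] == '-':
        offset = -offset
    naive = datetime.strptime(match['stamp'], '%Y%m%d%H%M%S')
    return naive.replace(tzinfo=timezone(offset))


created = parse_pdf_date("D:20200317130702+05'30'")
print(created.isoformat())  # 2020-03-17T13:07:02+05:30
```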

In [68]:
data.getNumPages()

436

In [76]:
print(data.getPage(0).extractText())

Vol. 250 
FridayNo. 20 13 December, 2019
                                                                  22 Agrahayana, 1941 (Saka) 
PARLIAMENTARY DEBATES
RAJYA SABHA
OFFICIAL REPORT
(FLOOR VERSION)CONTENTSReference by the Chair (page 1)Papers laid on the Table (pages 1-41)

Reports of the Department-related Parliamentary Standing Committee on  Defence Š 
Laid on the Table 
(pages 41-42)Report of the Committee on Welfare of Other Backward Classes Š 
Laid on the  Table 
(page 42)Statement of the Committee on the Welfare of Scheduled Castes and Scheduled 
 Tribes Š 
Laid on the Table 
 (page 42)Leave of absence (page 43)
Statements by MinistersŠ
 Status of implementation of recommendations contained in the Forty-third Report
  and the Fiftieth Report of the Department-related Parliamentary Standing 

  Committee on DefenceŠ
Laid on the Table 
(page 43)©RAJYA SABHA SECRETARIAT
NEW DELHI
PRICE : ` 100.00Ö´Öê¾Ö ŁÖ[P.T.O.



In [77]:
print(data.getPage(1).extractText())

Website : http://rajyasabha.nic.in
  http://parliamento
Þ ndia.nic.in
E-mail : rsedit-e@sansad.nic.in
 Status of implementation of observations/recommendations contained in the Fiftieth 
  Report and the Fifty-third Report of the Department-related Parliamentary 

  Standing Committee on Labour Š 
Laid on the Table 
(page 44) Status of implementation of recommendations contained in the Twenty-
Þ fth Report 
  and the Thirty-sixth Report of the Department-related Parliamentary Standing 

  Committee on Rural DevelopmentŠ 
Laid on the Table 
(page 44) Status of implementation of recommendations/observations contained in the One 
  Hundred and Forty-ninth Report of the Department-related Parliamentary 

  Standing Committee on Commerce Š 
Laid on the Table 
(page 45)Matters raised with Permission Š
 Need for inclusion of the Union Territory of Ladakh in the Sixth Schedule to the 
  Constitution (pages 45-46)
 Wide spread unrest in the North-Eastern States after the passage of the Citizens

In [78]:
print(data.getPage(2).extractText())

PUBLISHED UNDER RULE 260 OF RULES OF PROCEDURE AND CONDUCT OF BUSINESS IN THE COUNCIL OF STATES
 (RAJYA
 SABHA) AND PRINTED BY PRINTOGRAPH
, KAROL BAGH, NEW DELHI-110005



In [79]:
print(data.getPage(3).extractText())

RAJYA SABHA
Friday, the 13th December, 2019/22 Agrahayana, 1941 (Saka)
The House met at eleven of the clock,MR. CHAIRMAN in the Chair.
REFERENCE  BY  THE  CHAIREighteenth anniversary of terrorist attack on ParliamentMR. CHAIRMAN: Hon. Members, today, the 13th December, 2019, marks the
Eighteenth Anniversary of the dastardly terror attack on the Parliament House.
On this day, this House recalls the supreme sacrifice of our security personnel,
including two of the Parliament Security Service staff, along with five Delhi Policepersonnel, and a woman constable of the Central Reserve Police Force. One gardenerof the CPWD and a cameraman of the ANI were also martyred. By their selfless act,

these martyrs set an example of indomitable courage and outstanding devotion to duty.
I am sure the whole House will join me in paying homage to these martyrs andexpressing our profound gratitude to our brave security personnel, who are vigilant anddutiful in protecting this temple of democracy. On this 

### Task Complete!

Luckily this PDF contains text rather than scanned images. I am fairly sure that, as we run through more PDFs, some will be structured differently; we will deal with that when we get there.

The next step is extracting information from these pages. The first question I want to answer is:
1. What was talked about in the session?
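
One possible starting point is the contents listing on the first pages, where each entry ends with `(page N)` or `(pages N-M)`. The sketch below pulls those entries out with a regular expression. `parse_contents` is a hypothetical helper, and the sample string is a fragment copied from the page 0 output above. On the real extracted text the titles would still contain line breaks and stray characters that need cleaning.

```python
import re

# A title, followed by "(page N)" or "(pages N-M)"
ENTRY = re.compile(
    r'(?P<title>.+?)\s*\(pages?\s+(?P<start>\d+)(?:-(?P<end>\d+))?\)',
    re.DOTALL,
)


def parse_contents(text: str):
    """Return (title, first_page, last_page) tuples found in `text`."""
    entries = []
    for match in ENTRY.finditer(text):
        # Collapse runs of whitespace, including newlines, inside the title
        title = ' '.join(match['title'].split())
        start = int(match['start'])
        end = int(match['end'] or start)  # single-page entries have no end
        entries.append((title, start, end))
    return entries


sample = 'Reference by the Chair (page 1)Papers laid on the Table (pages 1-41)'
print(parse_contents(sample))
# [('Reference by the Chair', 1, 1), ('Papers laid on the Table', 1, 41)]
```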

In [None]:
@singledispatch
def fun(arg, verbose=False):
    if verbose:
        print("Let me just say,", end=" ")
    print(arg)

@fun.register
def _(arg: int, verbose=False):
    if verbose:
        print("Strength in numbers, eh?", end=" ")
    print(arg)

@fun.register
def _(arg: list, verbose=False):
    if verbose:
        print("Enumerate this:")
    for i, elem in enumerate(arg):
        print(i, elem)