### Reading the file and breaking it into paragraphs

In [1]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def pdf_to_text(path):
    """Convert a PDF file to text."""
    manager = PDFResourceManager()
    text_io = StringIO()
    laparams = LAParams()
    device = TextConverter(manager, text_io, laparams=laparams)
    interpreter = PDFPageInterpreter(manager, device)

    with open(path, "rb") as f:
        for page in PDFPage.get_pages(f, check_extractable=True):
            interpreter.process_page(page)

    text = text_io.getvalue()

    device.close()
    text_io.close()

    return text

def text_to_paragraphs(text):
    """Tokenize text into paragraphs."""
    paragraphs = text.split("\n\n")
    return paragraphs

# Convert a PDF file to text
text = pdf_to_text("/Users/himeshpunj/TheArmyAct1950 (2).pdf")

# Tokenize the text into paragraphs
paragraphs = text_to_paragraphs(text)

# Print the paragraphs
for i, paragraph in enumerate(paragraphs):
    print(f"Paragraph {i + 1}: {paragraph}")
    print("********************************************")


Paragraph 1: An Act to consolidate and amend the law relating to the government of the regular 
********************************************
Paragraph 2: The Army Act, 1950 
********************************************
Paragraph 3: ACT NO. 46 OF 1950 [ 20th May, 1950.]
********************************************
Paragraph 4: BE it enacted by Parliament as follows:-
********************************************
Paragraph 5: Army.
********************************************
Paragraph 6: CHAP
********************************************
Paragraph 7: PRELIMINARY.
********************************************
Paragraph 8: CHAPTER I
********************************************
Paragraph 9: PRELIMINARY
********************************************
Paragraph 10: 1. Short title and commencement.
(1) This Act may be called the Army Act, 1950 .
(2) It shall come into force on such date as the Central Government may, by notification in the Official 
Gazette, appoint in this behalf.
****************

1. First tried PyPDF2.
2. Its output did not split cleanly on the double-newline character.
3. During discussion, came up with the idea of converting the PDF to plain text first.
4. Then applied the double-newline splitting logic to that text.
5. Irregular spaces still appear in many places, but splitting on the newline character works well.
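
The irregular spacing noted in point 5 could be reduced before embedding. The following is a minimal sketch, assuming it is acceptable to collapse all whitespace inside a paragraph (including single newlines between sub-clauses) into one space. `clean_paragraphs` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def clean_paragraphs(text):
    """Split text on blank lines and normalise whitespace in each paragraph."""
    # A "blank" line may still contain stray spaces, so match them as well.
    raw_paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in raw_paragraphs:
        # Collapse runs of spaces, tabs and single newlines into one space.
        normalised = " ".join(paragraph.split())
        if normalised:  # drop paragraphs that are empty after cleaning
            cleaned.append(normalised)
    return cleaned

# Toy input imitating the extracted text (illustrative only).
sample = "1. Short title.  \n(1) This Act may be called   the Army Act.\n \nCHAPTER I\n\n\n"
print(clean_paragraphs(sample))
```

Because single newlines are merged, each numbered section becomes one line of text. If the sub-clause line breaks matter, only the runs of spaces should be collapsed instead.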

### Embedding the paragraphs into the model

In [2]:
from sentence_transformers import SentenceTransformer, util

# Load the pretrained sentence-embedding model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v1')
# Encode every paragraph; convert_to_tensor=True returns a PyTorch tensor
embeddings = model.encode(paragraphs, convert_to_tensor=True)
print(embeddings)

tensor([[ 0.0133,  0.0450,  0.0082,  ..., -0.0161, -0.1095,  0.0131],
        [-0.0073, -0.0091,  0.0164,  ..., -0.0344,  0.0071,  0.0020],
        [-0.0384, -0.0167,  0.0125,  ..., -0.0496, -0.0841,  0.0071],
        ...,
        [ 0.0037, -0.0702,  0.0075,  ..., -0.0037, -0.0893, -0.0027],
        [ 0.0092, -0.0063,  0.0047,  ..., -0.0437, -0.0671,  0.0169],
        [ 0.0102,  0.0577, -0.0205,  ...,  0.0716, -0.0557, -0.0086]])


1. Used the all-mpnet-base-v1 model, which performed better than the previously used all_datasets_v4_MiniLM-L6.
2. Used convert_to_tensor=True so that the embeddings are returned as a tensor for computing cosine similarity scores.
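
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a dependency-free illustration of that formula on toy vectors. It is not the library implementation, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (illustrative only; real embeddings are much longer).
query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query_vec, same_direction))  # close to 1: same direction
print(cosine_similarity(query_vec, orthogonal))      # 0: nothing in common
```

The score depends only on direction, not on vector length, which is why the scaled vector still scores close to 1.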

### Embedding queries and generating results.

In [3]:
import torch
embedder = SentenceTransformer('all-mpnet-base-v1')
corpus = paragraphs
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
queries = ['Who can appoint comissioned officer in the army?', 'Who cannot enroll in Indian Army?', 'who are subjected to this act?','What is punishment for escape from custody?','Explain regular army?']


# Find the closest 2 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(2, len(embeddings))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 2 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 2 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        print("*****************")





Query: Who can appoint comissioned officer in the army?

Top 2 most similar sentences in corpus:
10. Commission and appointment. The President may grant, to such person as he thinks fit, a 
commission as, an officer, or as a junior commissioned officer or appoint any person, as a warrant 
officer of the regular Army. (Score: 0.7472)
*****************
8. Officers exercising powers in certain cases.
(1) Whenever persons subject to this Act are serving under an officer commanding any military 
Organisation, not in this section specifically named and being in the opinion of the Central Government 
not less than a brigade, that Government may prescribe the officer by whom the powers, which under 
this Act may be exercised by officers commanding armies, army corps, divisions and brigades, shall, 
as regards such, persons, be exercised.
(2) The Central Government may confer such powers, either absolutely or subject to such restrictions, 
reservations, exceptions and conditions, as it may 

1. Each query is first embedded with the model. `torch.topk` then selects the paragraphs with the highest cosine similarity to the query, returning their scores and indices, and the top two are printed.
2. The query "who are subjected to this act?" is answered incorrectly: the top result is the arrest clause, although the answer can be found in paragraph 11.
3. The query "Explain regular army?" is misinterpreted: the corpus does not use those exact terms, although the relevant information is still there.
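
The selection step described in point 1 can be illustrated without any model. This is a minimal sketch using made-up scores and the standard library's `heapq.nlargest` in place of `torch.topk`. `top_k_paragraphs` is a hypothetical helper, not part of the notebook.

```python
import heapq

def top_k_paragraphs(scores, corpus, k=2):
    """Return the k (paragraph, score) pairs with the highest scores."""
    k = min(k, len(corpus))  # never ask for more results than paragraphs
    # Rank the (score, paragraph) pairs by score only.
    best = heapq.nlargest(k, zip(scores, corpus), key=lambda pair: pair[0])
    return [(paragraph, score) for score, paragraph in best]

toy_corpus = ["Short title", "Definitions", "Commission and appointment"]
toy_scores = [0.12, 0.31, 0.75]  # made-up similarity scores

for paragraph, score in top_k_paragraphs(toy_scores, toy_corpus, k=2):
    print(f"{paragraph} (Score: {score:.4f})")
```

The results come back sorted from highest to lowest score, so the first hit is always the model's best guess.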

In [4]:
queries = ['what is the time span of reconsideration after suspension?', 'who directs to pay allowances or part of pay during custody?', 'What are punishment prescribed for officers after court martial?','what is imprisonment term on false answers during enrolment?','what are privelges of reservists?']


# Find the closest 2 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(2, len(embeddings))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 2 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 2 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        print("*****************")





Query: what is the time span of reconsideration after suspension?

Top 2 most similar sentences in corpus:
187. Reconsideration of case after suspension.
(1) Where a sentence has been suspended, the case may at any time, and shall at intervals of not 
more than four months, be reconsidered by the authority or officer specified in section 182, or by any 
general or other officer not below the rank of field officer duly authorised by the authority or officer 
specified in section 182.
(2) Where on such reconsideration by the officer so authorised it appears to him that the conduct of 
the offender since his conviction has been such as to justify a remission of the sentence, he shall refer 
the matter to the authority or officer specified in section 182. (Score: 0.7492)
*****************
CHAP PARDONS, REMISSIONS AND SUSPENSIONS. CHAPTER XIV PARDONS, REMISSIONS 
AND SUSPENSIONS (Score: 0.5468)
*****************




Query: who directs to pay allowances or part of pay during custody?

To

1. To a human reader, the question about the privileges of reservists refers to someone who is reserve personnel and the privileges that come with that position.
2. The model, however, picks up the matching words directly from the corpus.

## Questions that need more options to be tested

In [5]:
queries = ['Expalin regular army?','what are privelges of reservists?']


# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(3, len(embeddings))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 3 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        print("*****************")





Query: Expalin regular army?

Top 3 most similar sentences in corpus:
Army. (Score: 0.6658)
*****************
78. Retention in the ranks of a person convicted on active service. When., on active service, any 
enrolled person has been sen- tenced by a court- martial to dismissal, or to transportation or im-
prisonment whether combined with dismissal or not, the prescribed officer may direct that such person 
may be retained to serve in the ranks, and such service shall be reckoned as part of his term of 
transportation or imprisonment, if any. (Score: 0.5080)
*****************
The Army Act, 1950  (Score: 0.4853)
*****************




Query: what are privelges of reservists?

Top 3 most similar sentences in corpus:
31. Privileges of reservists. Every person belonging to the Indian Reserve Forces shall, when called 
out for or engaged in or returning from, training or service, be entitled to all the privileges accorded by 
sections 28 and 29 to a person subject to this Act. (Score: 0.

## Extractive Questions

In [6]:
queries = ['Which person is considered as in active service?','Who is warrant officer?','Explain the different modes of enrolment.']


# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(3, len(embeddings))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 3 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        print("*****************")





Query: Which person is considered as in active service?

Top 3 most similar sentences in corpus:
9. Power to declare persons to be on active service. Notwithstanding anything contained in clause 
(i) of section 3, the Central Government may, by notification, declare that any person or class of 
persons subject to this Act shall, with reference to any
1. Subs. by Act 58 of 1974, s. 3 and Sch. II, for" clause (i) of section 2".
area in which they may be serving or with reference to any provision of this Act or of any other law for 
the time being in force, be deemed to be on active service within the meaning of this Act. CHAP 
COMMISSION, APPOINTMENT AND ENROLMENT. CHAPTER III COMMISSION, APPOINTMENT 
AND ENROLMENT (Score: 0.6015)
*****************
3. Definition. In this Act, unless the context otherwise requires,-
(i) " active service", as applied to a person subject to this Act, means the time during which such 
person-
(a) is attached to, or forms part of, a force which is engaged

In [7]:
queries = ['Explain attestation.','Which personnel has immunity from court martial?']


# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(3, len(embeddings))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 3 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        print("*****************")





Query: Explain attestation.

Top 3 most similar sentences in corpus:
17. Mode of attestation.
(1) When a person who is to be attested is reported fit for duty, or has completed the prescribed period 
of probation, an oath or affirmation shall be administered to him in the prescribed form by his 
commanding officer in front of his corps or such portion thereof or such members of his department as 
may be present, or by any other prescribed person.
(2) The form of oath or affirmation prescribed under this section shall contain a promise that the 
person to be attested will bear true allegiance to the Constitution of India as by law established, and 
that he will serve in the regular Army and go wherever he is ordered by land, sea or air, and that he 
will obey all commands of any officer set over him, even to the peril of his life.
(3) The fact of an enrolled person having taken the oath or affirmation directed by this section to be 
taken shall be entered on his enrolment paper, and

1. An interesting observation here is that the cosine score, used as a confidence measure, drops significantly as more options are returned.
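
One possible way to act on this observation is to discard low-scoring hits instead of always printing a fixed number of results. The sketch below assumes a cut-off of 0.5, which is an arbitrary illustrative value and has not been tuned on this corpus. `filter_by_score` is a hypothetical helper.

```python
def filter_by_score(results, min_score=0.5):
    """Keep only (paragraph, score) pairs whose score reaches min_score."""
    return [(paragraph, score) for paragraph, score in results if score >= min_score]

# Made-up scores, shaped like a top-3 result list.
toy_results = [("Paragraph A", 0.74), ("Paragraph B", 0.52), ("Paragraph C", 0.31)]
print(filter_by_score(toy_results))
```

With the toy scores above, the third result is dropped. A suitable threshold would need to be chosen by inspecting scores for queries with known answers.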

## Abstract Questions

In [8]:
queries = ['Which gender is abstained or refused for enrolment?','When fundamental rights are refused?']


# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(3, len(embeddings))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 3 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        print("*****************")





Query: Which gender is abstained or refused for enrolment?

Top 3 most similar sentences in corpus:
12. Ineligibility of females for enrolment or employment. No female shall be eligible for enrolment 
or employment in the regular Army, except in such corps, department, branch or other body forming 
part of, or attached to any portion of, the regular Army as the Central Government may, by notification 
in the Official Gazette, specify in this behalf: Provided that nothing contained in this section shall affect 
the provisions of any law for the time being in force providing for the raising and maintenance of any 
service auxiliary to the regular Army, or any branch thereof in which females are eligible for enrolment 
or employment. (Score: 0.5382)
*****************
14. Mode of enrolment. If, after complying with the provisions of section 13, the enrolling officer is 
satisfied that the person desirous of being enrolled fully understands the questions put to him and 
consents to the 

In [9]:
queries = ['List offences which are severly penalised.','Explain punishment for organizing a revolt?','Is alcohol allowed during duty?']


# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(3, len(embeddings))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 3 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        print("*****************")





Query: List offences which are severly penalised.

Top 3 most similar sentences in corpus:
69. Civil offences. Subject to the provisions of section 70, any person subject to this Act who at any 
place in or beyond India commits any civil
offence shall be deemed to be guilty of an offence against this Act and, if charged therewith under this 
section, shall be liable to be tried by a court- martial and, on conviction, be punishable as follows, that 
is to say,-
(a) if the offence is one which would be punishable under any law in force in India with death or with 
transportation, he shall be liable to suffer any punishment, other than whipping, assigned for the 
offence, by the aforesaid law and such less punishment as is in this Act mentioned; and
(b) in any other case, he shall be liable to suffer any punishment, other than whipping, assigned for the 
offence by the law in force in India, or imprisonment for a term which may extend to seven years, or 
such less punishment as is in 

### Synonyms for Extractive Questions (1 word)

In [10]:
queries = ['Which person is considered as in operation service?','Who is an authority officer?','Explain the different modes of Hiring.']


# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(3, len(embeddings))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 3 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        print("*****************")





Query: Which person is considered as in operation service?

Top 3 most similar sentences in corpus:
16. Persons to be attested. The following persons shall be attested, namely:-
(a) all persons enrolled as combatants;
(b) all persons selected to hold a non- commissioned or acting non- commissioned rank; and
(c) all other persons subject to this Act as may be prescribed by the Central Government. (Score: 0.5422)
*****************
9. Power to declare persons to be on active service. Notwithstanding anything contained in clause 
(i) of section 3, the Central Government may, by notification, declare that any person or class of 
persons subject to this Act shall, with reference to any
1. Subs. by Act 58 of 1974, s. 3 and Sch. II, for" clause (i) of section 2".
area in which they may be serving or with reference to any provision of this Act or of any other law for 
the time being in force, be deemed to be on active service within the meaning of this Act. CHAP 
COMMISSION, APPOINTMENT AN

In [11]:
queries = ['Explain documentation.','Which personnel has immunity from impeachment.?']


# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(3, len(embeddings))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 3 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        print("*****************")





Query: Explain documentation.

Top 3 most similar sentences in corpus:
(4) When a witness is required to produce any particular document or other thing in his possession or 
power, the summons shall describe it with reasonable precision. (Score: 0.3139)
*****************
16. Persons to be attested. The following persons shall be attested, namely:-
(a) all persons enrolled as combatants;
(b) all persons selected to hold a non- commissioned or acting non- commissioned rank; and
(c) all other persons subject to this Act as may be prescribed by the Central Government. (Score: 0.2742)
*****************
141. Enrolment paper.
(1) Any enrolment paper purporting to be signed by an enrolling officer shall, in proceedings under this 
Act, be evidence of the person enrolled having given the answers to questions which he is therein 
represented as having given.
(2) The enrolment of such person may be proved by the production of the original or a copy of his 
enrolment paper purporting to be c