# Keyword Extraction Assignment

**Problem Statement** - Write code to extract keywords (such as inheritance, encapsulation, multithreading) from a document.

**Document** - JavaBasics-notes.pdf

## Libraries Used

In [30]:
#!pip install pdfminer.six
#!pip install numpy
#!pip install pandas
#!pip install nltk
#!pip install scikit-learn

In [31]:
# Array operations
import numpy as np
# DataFrames
import pandas as pd

# English stop words and tokenization
# (one-time setup if the resources are missing: nltk.download('stopwords')
#  and nltk.download('punkt'), or 'punkt_tab' on recent NLTK releases)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# TF-IDF weighting
from sklearn.feature_extraction.text import TfidfVectorizer

# Regular expressions
import re

## Reading Document

In [32]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

from io import StringIO

def pdf_to_text(pdfname):
    """Return the text of every page in the PDF at `pdfname` as one string."""

    # PDFMiner boilerplate
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Extract text; the with-block closes the file even if a page fails
    with open(pdfname, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

    # Get text from StringIO
    text = sio.getvalue()

    # Cleanup
    device.close()
    sio.close()

    return text

In [33]:
extracted_text = pdf_to_text("JavaBasics-notes.pdf")

### Example

In [34]:
print(extracted_text[:1000])

Java Basics

Java Basics

Topics in this section include:

•  What makes Java programs portable, secure, and robust

•  The structure of Java applets and applications

•  How Java applications are executed

•  How applets are invoked and executed

•  The Java Language, Part I

•  Comments

•  Declarations

•  Expressions

•  Statements

•  Garbage collection

•  Java Semantics

Portability

Java programs are portable across operating systems and hardware environments.
Portability is to your advantage because:

•  You need only one version of your software to serve a broad market.

•  The Internet, in effect, becomes one giant, dynamic library.

•  You are no longer limited by your particular computer platform.

Three features make Java String programs portable:

1.  The language. The Java language is completely specified; all data-type sizes and

formats are defined as part of the language. By contrast, C/C++ leaves these
"details" up to the compiler implementor, and many C/C++ program

## Preprocessing Text

In [35]:
def preprocess_text(text):
    """Tokenize, strip non-letter characters, and lowercase the text."""
    tokens = word_tokenize(text)
    words = []
    for word in tokens:
        # Replace each run of non-letter characters with a single space
        word = re.sub(r'[^A-Za-z]+', ' ', word)
        words.append(word.lower())
    return " ".join(words)

In [36]:
preprocess_text(extracted_text[:1502])

'java basics java basics topics in this section include     what makes java programs portable   secure   and robust   the structure of java applets and applications   how java applications are executed   how applets are invoked and executed   the java language   part i   comments   declarations   expressions   statements   garbage collection   java semantics portability java programs are portable across operating systems and hardware environments   portability is to your advantage because     you need only one version of your software to serve a broad market     the internet   in effect   becomes one giant   dynamic library     you are no longer limited by your particular computer platform   three features make java string programs portable       the language   the java language is completely specified   all data type sizes and formats are defined as part of the language   by contrast   c c  leaves these   details   up to the compiler implementor   and many c c  programs therefore     
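
The runs of spaces in the output above come from punctuation tokens, each of which turns into a space before the tokens are joined. The following is a minimal sketch of a regex-only variant that avoids them, assuming NLTK tokenization is not needed for this step. `clean_text` is a hypothetical name, and its result may differ slightly from `preprocess_text` for contractions, which `word_tokenize` splits differently.

```python
import re

def clean_text(text):
    """Lowercase text, keep only ASCII letters, single spaces between words."""
    # Replace every run of non-letter characters with one space
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)
    # strip() drops the space left by punctuation at either end
    return letters_only.lower().strip()

sample = "•  The Java Language, Part I\n\n•  Comments"
print(clean_text(sample))
```

Since `TfidfVectorizer` tokenizes the preprocessed string again, extra whitespace should not change the weights; the difference is mainly readability when inspecting the intermediate text.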

In [37]:
# NLTK English stop words
stop_words = stopwords.words('english')

In [38]:
# Extra stop words, such as text from the header and footer of every page
stop_words.extend(["java", "jguru", "com", "all", "rights",
                   "reserved", "etc", "abc", "hello",
                   "world", "www", "blah"])

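The extra stop words above were picked by hand. One possible way to find header and footer candidates is to count lines that recur on most pages. This is only a sketch: it assumes the extracted text separates pages with a form feed (`\f`), which should be checked against the actual output of `pdf_to_text`. `repeated_lines` and `min_share` are hypothetical names, and the demo string is made-up sample text, not taken from the PDF.

```python
from collections import Counter

def repeated_lines(text, min_share=0.8):
    """Return lines that occur on at least `min_share` of the pages."""
    # Assumption: pages are delimited by form feeds in the extracted text
    pages = [p for p in text.split("\f") if p.strip()]
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page
        lines = {line.strip() for line in page.splitlines() if line.strip()}
        counts.update(lines)
    threshold = min_share * len(pages)
    return sorted(line for line, n in counts.items() if n >= threshold)

# Made-up two-page sample with a shared header and footer
demo = "Java Basics\nPortability\njGuru.com\fJava Basics\nSecurity\njGuru.com\f"
print(repeated_lines(demo))
```

The words in the returned lines can then be reviewed and added to `stop_words` where appropriate.
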
## Keyword Extraction

In [39]:
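# Optionally pass max_features=500 to cap the vocabulary size
vectorizer = TfidfVectorizer(stop_words=stop_words,
                             preprocessor=preprocess_text)
# Only one document is fitted, so row 0 holds the weight of every term
X = vectorizer.fit_transform([extracted_text])
# get_feature_names() was removed in scikit-learn 1.2
words = vectorizer.get_feature_names_out()
weights = X.toarray()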
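
Because the vectorizer is fitted on a single document, every term has a document frequency of 1. With scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, this gives an IDF of exactly 1 for every term, so the weights reduce to term counts scaled to unit L2 length. The ranking is therefore the same as ranking by raw frequency. The sketch below reproduces that arithmetic in plain Python under the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); `single_doc_tfidf` is a hypothetical helper, not part of the notebook.

```python
import math
from collections import Counter

def single_doc_tfidf(tokens):
    """TF-IDF scores for one document under scikit-learn's default settings."""
    counts = Counter(tokens)
    n_docs = 1
    scores = {}
    for term, tf in counts.items():
        df = 1  # every term occurs in the only document
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # evaluates to 1.0
        scores[term] = tf * idf
    # L2-normalise so that the squared scores sum to 1
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {term: v / norm for term, v in scores.items()}

scores = single_doc_tfidf("data new data int data new".split())
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 4))
```

The ordered weights printed further down fall in equal steps of about 0.0058, which fits this reading: each weight is an integer count divided by the same norm. Separating distinctive keywords from merely frequent words would need a corpus of several documents so that the IDF term can vary.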

In [40]:
ordered_weights = weights[0][(np.argsort(-weights[0]))]
ordered_weights[:200]

array([0.36045293, 0.30231536, 0.27324658, 0.24999155, 0.24999155,
       0.23836404, 0.22673652, 0.20348149, 0.16859895, 0.1627852 ,
       0.15115768, 0.14534392, 0.13371641, 0.13371641, 0.11046138,
       0.09302011, 0.08720635, 0.08720635, 0.08720635, 0.08720635,
       0.08720635, 0.0813926 , 0.0813926 , 0.0813926 , 0.07557884,
       0.06976508, 0.06976508, 0.06976508, 0.06395133, 0.06395133,
       0.05813757, 0.05813757, 0.05813757, 0.05813757, 0.05813757,
       0.05813757, 0.05813757, 0.05232381, 0.05232381, 0.05232381,
       0.05232381, 0.05232381, 0.05232381, 0.05232381, 0.05232381,
       0.05232381, 0.05232381, 0.05232381, 0.05232381, 0.05232381,
       0.04651006, 0.04651006, 0.04651006, 0.04651006, 0.04651006,
       0.04651006, 0.04651006, 0.04651006, 0.0406963 , 0.0406963 ,
       0.0406963 , 0.0406963 , 0.0406963 , 0.0406963 , 0.0406963 ,
       0.03488254, 0.03488254, 0.03488254, 0.03488254, 0.03488254,
       0.03488254, 0.03488254, 0.03488254, 0.03488254, 0.03488

In [41]:
weight_ordered_words = np.array(words)[np.argsort(-weights[0])]
weight_ordered_words[:200]

array(['data', 'new', 'basics', 'int', 'button', 'code', 'applet',
       'class', 'method', 'object', 'array', 'objects', 'string',
       'public', 'example', 'null', 'return', 'types', 'language', 'use',
       'memory', 'void', 'primitive', 'comments', 'system', 'program',
       'browser', 'may', 'allocate', 'garbage', 'pointer', 'following',
       'runtime', 'would', 'stack', 'applets', 'methods', 'collection',
       'init', 'value', 'make', 'byte', 'reference', 'operator',
       'boolean', 'expr', 'file', 'applications', 'called', 'two', 'type',
       'constant', 'style', 'programs', 'stat', 'variables', 'element',
       'main', 'pointers', 'parameters', 'ok', 'bits', 'one', 'args',
       'arrays', 'integer', 'statements', 'calloc', 'literal', 'width',
       'refer', 'strings', 'note', 'heap', 'semantics', 'sizeof', 'true',
       'equivalent', 'predefined', 'used', 'passed', 'source', 'elements',
       'portable', 'platform', 'id', 'import', 'threads', 'executed',
     

## DataFrame

In [48]:
d = {"words": weight_ordered_words, "weight": ordered_weights}

df = pd.DataFrame(data = d, columns = ["words", "weight"])

# Top 15 words
df.head(15)

| | words | weight |
|---|---|---|
| 0 | data | 0.360453 |
| 1 | new | 0.302315 |
| 2 | basics | 0.273247 |
| 3 | int | 0.249992 |
| 4 | button | 0.249992 |
| 5 | code | 0.238364 |
| 6 | applet | 0.226737 |
| 7 | class | 0.203481 |
| 8 | method | 0.168599 |
| 9 | object | 0.162785 |


### Convert DataFrame to CSV

In [43]:
df.to_csv("keywords.csv", encoding='utf-8', index=False)
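
For reference, the same two-column layout can be written and read back with the standard library alone. This is a sketch that uses an in-memory buffer in place of `keywords.csv` and only the first three rows of the table above; the variable names are illustrative.

```python
import csv
import io

rows = [("data", 0.360453), ("new", 0.302315), ("basics", 0.273247)]

# StringIO stands in for open("keywords.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["words", "weight"])  # same header as the DataFrame columns
writer.writerows(rows)

# Read the rows back and convert the weights to floats again
buffer.seek(0)
loaded = [(r["words"], float(r["weight"])) for r in csv.DictReader(buffer)]
print(loaded)
```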