# Reading PDF

In [2]:
! pip install PyPDF2

Collecting PyPDF2
  Downloading PyPDF2-2.11.1-py3-none-any.whl (220 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-2.11.1


In [4]:
# Imports
# PdfReader and the snake_case calls below replace PdfFileReader,
# numPages, getPage and extractText, which are deprecated in PyPDF2 2.x
from PyPDF2 import PdfReader

with open("sample.pdf", "rb") as pdf:
    # Create a PDF reader object
    pdf_reader = PdfReader(pdf)

    # Fetch the number of pages in the PDF
    num_pages = len(pdf_reader.pages)
    print(
        "Total no. of pages: {}".format(num_pages)
    )
    if num_pages > 0:
        # Get the page object for the 1st page.
        # Replace 0 with 1 to access the 2nd page,
        # and so on
        page = pdf_reader.pages[0]
        # Extract text from the page
        text = page.extract_text()
        print("Contents of the first page:\n")
        print(text)

Total no. of pages: 1
Contents of the first page:

2⌅Natural Language Processing in the Real-World
popular databases that are used widely for text data. Each database comes with ways
to query and perform operations on text. We’ll implement some of these operations.
Finally, we will introduce the concept of data maintenance and discuss some useful
tips and tricks to prevent the data from corruption.
This chapter includes the following sections.
•Sources of data
•Data extraction
•Data storage
1.2 SOURCES OF DATA
1.2.1 Generated by businesses
The most common source of text data is the data generated by the business’s opera-
tions and is dependent on what the business does. For example, in real estate, sources
of text data include property listing descriptions, agent comments, legal documents,
and customer interaction data. For some other industry verticals, the source of text
data can include social media posts, product descriptions, articles, web documents,
chat data, or a combination th
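
The extracted text keeps the PDF's line breaks, so a word hyphenated across two lines comes out split (for example `opera-` followed by `tions` above). A minimal sketch of one way to rejoin such words is shown here. The helper name `dehyphenate` is hypothetical, and the rule is a heuristic: it would also join a genuine compound word that happens to break at its hyphen.

```python
import re


def dehyphenate(text):
    """Rejoin words split as 'opera-' + newline + 'tions'.

    Only a hyphen directly followed by a line break is removed, so
    hyphens inside a line (e.g. 'first-party') are left untouched.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


# A fragment of the page text printed above
extracted = (
    "generated by the business's opera-\n"
    "tions and is dependent on what the business does."
)
cleaned = dehyphenate(extracted)
print(cleaned)
```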

# Reading JSON

In [5]:
# Imports
import json

# Writing a sample dict to json file
sample = {
    "Mathew": ["mathematics", "chemistry"],
    "Perry": ["biology", "arts"]
}
with open("sample.json", "w") as f:
    json.dump(sample, f)

# Reading the json file back
with open("sample.json", "r") as f:
    read_sample = json.load(f)

# Printing
print("Sample written to json file: {}".format(sample))
print("Sample read from json file: {}".format(read_sample))

Sample written to json file: {'Mathew': ['mathematics', 'chemistry'], 'Perry': ['biology', 'arts']}
Sample read from json file: {'Mathew': ['mathematics', 'chemistry'], 'Perry': ['biology', 'arts']}
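
When the JSON file is meant to be read by people as well, or holds non-ASCII text, it can help to set the encoding explicitly and pretty-print the output. The sketch below is one way to do that, not the only one; the file name `sample_pretty.json` and the temporary directory are assumptions made for illustration.

```python
import json
import os
import tempfile

sample = {
    "Mathew": ["mathematics", "chemistry"],
    "Perry": ["biology", "arts"],
}

# Write to a temporary directory so the example leaves no files behind
path = os.path.join(tempfile.mkdtemp(), "sample_pretty.json")

# indent=2 writes one item per line; ensure_ascii=False keeps
# non-ASCII characters readable instead of escaping them
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2, ensure_ascii=False)

# Read the file back with the same encoding
with open(path, "r", encoding="utf-8") as f:
    read_sample = json.load(f)

print(read_sample == sample)

# Malformed JSON raises json.JSONDecodeError, which can be caught
try:
    json.loads("{not valid json")
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True
```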


# Reading CSV

In [1]:
# Imports
import csv

# Creating csv reader object
# (newline="" is what the csv module recommends when opening files)
with open("sample_csv.csv", "r", newline="") as file:
    csvreader = csv.reader(file)
    # If your file has a header row, read it first
    header = next(csvreader)
    print(header)
    # Collect the remaining rows in a list
    rows = list(csvreader)
    print(rows)

['row', ' name', ' age']
[['1', ' Kim', ' 23'], ['2', ' Lowe', ' 45'], ['3', ' Beatrice', ' 55']]
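
The fields above keep the space that follows each comma (`' name'`, `' Kim'`). A minimal sketch of one way to avoid that is to pass `skipinitialspace=True`, here combined with `csv.DictReader` so that each row becomes a dictionary keyed by the header. The file contents are reproduced in an in-memory string, which is an assumption made so the example runs without `sample_csv.csv` on disk.

```python
import csv
import io

# Same contents as sample_csv.csv, held in memory for the example
raw = "row, name, age\n1, Kim, 23\n2, Lowe, 45\n3, Beatrice, 55\n"

with io.StringIO(raw) as file:
    # skipinitialspace=True drops whitespace right after each delimiter
    reader = csv.DictReader(file, skipinitialspace=True)
    records = [dict(row) for row in reader]
    fieldnames = reader.fieldnames

print(fieldnames)   # ['row', 'name', 'age']
print(records[0])   # {'row': '1', 'name': 'Kim', 'age': '23'}
```

All values are still strings; converting `age` to an integer is left to the caller.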


In [6]:
# Read the same file as plain lines of text;
# the with block closes the file automatically
with open("sample_csv.csv") as file:
    content = file.readlines()

# If the first row is the header
header = content[:1]

# Fetching the rows
rows = content[1:]

# Printing header and rows
print(header)
print(rows)

['row, name, age\n']
['1, Kim, 23\n', '2, Lowe, 45\n', '3, Beatrice, 55\n']
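
With `readlines()` each row is still one string that ends in a newline. A rough sketch of turning those lines into lists of fields follows; the helper name `split_line` is hypothetical. It assumes no field contains a comma or quotes, which is exactly the case the `csv` module handles and plain `split` does not.

```python
def split_line(line, sep=","):
    """Split one raw line into fields, trimming spaces and the newline."""
    return [field.strip() for field in line.strip().split(sep)]


# The lines printed above
content = ["row, name, age\n", "1, Kim, 23\n", "2, Lowe, 45\n", "3, Beatrice, 55\n"]

header = split_line(content[0])
rows = [split_line(line) for line in content[1:]]

print(header)  # ['row', 'name', 'age']
print(rows)    # [['1', 'Kim', '23'], ['2', 'Lowe', '45'], ['3', 'Beatrice', '55']]
```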


In [7]:
! pip install pandas



In [8]:
# Imports
import pandas as pd

# Load the data into a pandas DataFrame
data = pd.read_csv("sample_csv.csv")

# print column names
print(data.columns)

# print data
print(data)

Index(['row', ' name', ' age'], dtype='object')
   row       name   age
0    1        Kim    23
1    2       Lowe    45
2    3   Beatrice    55


# Reading from an HTML page (web scraping)

In [9]:
! pip install beautifulsoup4



In [10]:
# Imports
from bs4 import BeautifulSoup
from urllib.request import urlopen

# URL to questions
myurl = """https://stackoverflow.com/questions/19410018/how-to-count-the-number-of-words-in-a-sentence-ignoring-numbers-punctuation-an"""

html = urlopen(myurl).read()
soup = BeautifulSoup(html, "html.parser")
question = soup.find("div", {"class": "question"})

# Extract the question body. The class names used here match the
# page's markup when this was run and may change over time
ques_text = question.find(
    "div", {"class": "s-prose js-post-body"}
)
print("Question: \n", ques_text.get_text().strip())

# find() returns the first matching element,
# which is the first answer shown on the page
answer = soup.find("div", {"class": "answer"})

ans_text = answer.find(
    "div", {"class": "s-prose js-post-body"}
)
print("Top answer: \n", ans_text.get_text().strip())

Question: 
 How would I go about counting the words in a sentence? I'm using Python.
For example, I might have the string: 
string = "I     am having  a   very  nice  23!@$      day. "

That would be 7 words. I'm having trouble with the random amount of spaces after/before each word as well as when numbers or symbols are involved.
Top answer: 
 str.split() without any arguments splits on runs of whitespace characters:
>>> s = 'I am having a very nice day.'
>>> 
>>> len(s.split())
7

From the linked documentation:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
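
Because live pages change, it can be handy to check the extraction logic against a fixed HTML string without any network access. The sketch below does this with the standard library's `html.parser` rather than BeautifulSoup, so it is a swapped-in technique and not the approach used in the cell above. The class `DivTextExtractor`, the helper `first_div_text`, and the HTML snippet are all hypothetical and only mimic the structure of the page.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text of the first <div> carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth inside the matched div
        self.done = False   # True once the matched div has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the match
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def first_div_text(html, target_class):
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical snippet shaped like the question/answer markup
html = (
    '<div class="question"><div class="s-prose js-post-body">'
    "<p>How would I go about counting the words in a sentence?</p>"
    "</div></div>"
    '<div class="answer"><div class="s-prose js-post-body">'
    "<p>str.split() without any arguments splits on runs of whitespace.</p>"
    "</div></div>"
)

question_text = first_div_text(html, "question")
answer_text = first_div_text(html, "answer")
print(question_text)
print(answer_text)
```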


# Reading a Word document

In [12]:
! pip install python-docx 

Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184508 sha256=6831f2c608c0525b3e6c91266f4b4f8454ff8426933fb54710c73805293110fa
  Stored in directory: /Users/jyotikasingh/Library/Caches/pip/wheels/83/8b/7c/09ae60c42c7ba4ed2dddaf2b8b9186cb105255856d6ed3dba5
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.11


In [13]:
# Imports
from docx import Document

# Placeholder for text
doc_text = ""

# The with block closes the file automatically
with open("sample_word.docx", "rb") as doc:
    document = Document(doc)
    # Paragraph texts are appended with no separator between them
    for para in document.paragraphs:
        doc_text += para.text

# Print the final output
print(doc_text)

The most common source of text data is the data generated by the business’s operations and is dependent on what the business does. For example, in real estate, sources of text data include property listing descriptions, agent comments, legal documents, and customer interaction data. For some other industry verticals, the source of text data can include social media posts, product descriptions, articles, web documents, chat data, or a combination thereof. When there is an absence of owned (ﬁrst-party) data, organizations leverage data from different vendors and clients. Overall, from a business’s standpoint, the commonly seen text data is of the following types.Customer reviews/commentsUser comments are a very common source of text, especially from social media, e-commerce, and hospitality businesses that collect product reviews. For instance, Google and Yelp collect reviews across brands as well as small and large businesses.Social media/blog postsSocial media posts and blogs ﬁnd prese
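
In the output above, headings and paragraphs run together (for example `types.Customer reviews/commentsUser comments`) because the paragraph texts are concatenated with nothing in between. A minimal sketch of joining them with newlines is shown below. The helper `join_paragraphs` is hypothetical, and the sample list stands in for `para.text` values so the example runs without a `.docx` file.

```python
def join_paragraphs(paragraph_texts, sep="\n"):
    """Join paragraph texts with a separator, skipping empty paragraphs."""
    return sep.join(t.strip() for t in paragraph_texts if t.strip())


# Stand-in for [para.text for para in document.paragraphs]
paragraph_texts = [
    "Customer reviews/comments",
    "",
    "User comments are a very common source of text.",
    "Social media/blog posts",
]

doc_text = join_paragraphs(paragraph_texts)
print(doc_text)
```

With python-docx, the same helper could be called as `join_paragraphs(para.text for para in document.paragraphs)`.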