# Extracting the Data

 - Reading PDF file in Python
 - Reading word document
 - Reading JSON object
 - Reading HTML page and HTML parsing
 - Regular expressions
 - String handling
 - Web scraping

## Client Data
1. SQL databases
2. Hadoop clusters
3. Cloud storage
4. Flat files

## Free Sources
A huge amount of data is freely available on the
internet. We just need to define the problem clearly and then explore
the free data sources that fit it.

1. Free APIs like Twitter
2. Wikipedia
3. Government data (e.g. http://data.gov)
4. Census data (e.g. http://www.census.gov/data.html)
5. Health care claim data (e.g. https://www.healthdata.gov/)

### Web scraping
Web scraping extracts content from websites, blogs,
forums, and retail websites (for example, product reviews) using web
scraping packages in Python, with permission from the respective sources.

### 1-2. Collecting Data from PDFs

The simplest way to do this is by using the PyPDF2 library.

In [137]:
!pip install PyPDF2
import PyPDF2
from PyPDF2 import PdfReader



In [147]:
#Creating a pdf file object
pdf = open("Get_Started_With_Smallpdf.pdf","rb")


In [148]:
#creating pdf reader object
pdf_reader = PyPDF2.PdfReader(pdf)


In [140]:
#checking number of pages in a pdf file
len(pdf_reader.pages)

1

In [141]:
#creating a page object
page = pdf_reader.pages[0]


In [149]:
pdf_reader.pages[0]

{'/Annots': [IndirectObject(63, 0, 140653209483968),
  IndirectObject(60, 0, 140653209483968),
  IndirectObject(15, 0, 140653209483968),
  IndirectObject(12, 0, 140653209483968),
  IndirectObject(94, 0, 140653209483968)],
 '/ArtBox': [0, 0, 595.276, 841.89],
 '/BleedBox': [0, 0, 595.276, 841.89],
 '/Contents': [IndirectObject(21, 0, 140653209483968),
  IndirectObject(22, 0, 140653209483968),
  IndirectObject(23, 0, 140653209483968),
  IndirectObject(24, 0, 140653209483968),
  IndirectObject(25, 0, 140653209483968),
  IndirectObject(26, 0, 140653209483968),
  IndirectObject(27, 0, 140653209483968),
  IndirectObject(28, 0, 140653209483968)],
 '/CropBox': [0, 0, 595.276, 841.89],
 '/Group': {'/CS': ['/ICCBased', IndirectObject(29, 0, 140653209483968)],
  '/S': '/Transparency',
  '/Type': '/Group'},
 '/MediaBox': [0, 0, 595.276, 841.89],
 '/Parent': {'/Count': 1,
  '/Kids': [IndirectObject(10, 0, 140653209483968)],
  '/Type': '/Pages'},
 '/PieceInfo': {'/InDesign': {'/DocumentID': 'xmp.did

In [150]:
#finally extracting text from the page
print(page.extract_text())
#closing the pdf file
pdf.close()

Welcome to Smallpdf
Digital Documents—All In One Place
Access Files Anytime, Anywhere Enhance Documents in One Click 
Collaborate With Others With the new Smallpdf experience, you can 
freely upload, organize, and share digital 
documents. When you enable the ‘Storage’ 
option, we’ll also store all processed files here. 
You can access files stored on Smallpdf from 
your computer, phone, or tablet. We’ll also 
sync files from the Smallpdf Mobile App to our 
online portalWhen you right-click on a file, we’ll present 
you with an array of options to convert, 
compress, or modify it. 
Forget mundane administrative tasks. With 
Smallpdf, you can request e-signatures, send 
large files, or even enable the Smallpdf G Suite 
App for your entire organization. Ready to take document management to the next level? 



### 1-3. Collecting Data from Word Files

In [151]:
!pip install python-docx




In [152]:
from docx import Document

# Path to the Word document
document_path = '/Users/jothiramsanjeevi/Documents/IPythonnotebook/Natural Language Processing Bootcamp/Sample.docx'

# Open the Word document
doc = Document(document_path)


In [153]:
# Extract text from paragraphs
text = []
for paragraph in doc.paragraphs:
    text.append(paragraph.text)



In [155]:
text

['The majestic mountains stood tall, their peaks reaching towards the heavens. The lush green valleys sprawled beneath, adorned with colorful wildflowers that swayed gently in the breeze. Sunlight cascaded through the gaps between the trees, painting dappled patterns on the forest floor. Birds chirped melodiously, their songs echoing through the serene landscape. Nature\'s orchestra played harmoniously, creating a symphony of sights and sounds. It was a place where time seemed to stand still, where one could lose themselves in the beauty and tranquility of the natural world."']

In [156]:
# Join the extracted text into a single string
extracted_text = '\n'.join(text)

In [158]:
type(extracted_text)

str

In [157]:
# Print or use the extracted text as needed
print(extracted_text)

The majestic mountains stood tall, their peaks reaching towards the heavens. The lush green valleys sprawled beneath, adorned with colorful wildflowers that swayed gently in the breeze. Sunlight cascaded through the gaps between the trees, painting dappled patterns on the forest floor. Birds chirped melodiously, their songs echoing through the serene landscape. Nature's orchestra played harmoniously, creating a symphony of sights and sounds. It was a place where time seemed to stand still, where one could lose themselves in the beauty and tranquility of the natural world."


### 1-4. Collecting Data from JSON

In [160]:
import json

# Path to the JSON file
json_file_path = '/Users/jothiramsanjeevi/Documents/IPythonnotebook/Natural Language Processing Bootcamp/example_1.json'

# Open the JSON file
with open(json_file_path) as file:
    data = json.load(file)


In [161]:
# Extract text from specific fields or keys in the JSON data
text = data['fruit']

# Print or use the extracted text as needed
print(text)


Apple


In [None]:
import requests
import json

In [162]:
#json from "https://quotes.rest/qod.json"
#https://theysaidso.com/login
    
r = requests.get("https://quotes.rest/qod.json")
res = r.json()
print(json.dumps(res, indent = 4))

{
    "message": "Not authenticated"
}


In [163]:
import requests

# API endpoint
url = 'https://quotes.rest/qod.json'

# API key
api_key = 'haB3xY8M7T16YpIB5Qliotrqdz2Ef0FKFdWDM6QA'

# Set the headers with the API key
headers = {
    'X-TheySaidSo-Api-Secret': api_key
}

# Send GET request to the API with headers
response = requests.get(url, headers=headers)

# Extract and process the response data
data = response.json()
# Process the data as needed


In [165]:
data

{'success': {'total': 1},
 'contents': {'quotes': [{'id': 'wPkodOctkz8HYuyIo1e8FgeF',
    'quote': 'Winning is not everything, but the effort to win is.',
    'length': 52,
    'author': 'Zig Ziglar',
    'language': 'en',
    'tags': ['effort', 'inspire', 'winning', 'win'],
    'sfw': 'sfw',
    'permalink': 'https://theysaidso.com/quote/zig-ziglar-winning-is-not-everything-but-the-effort-to-win-is',
    'title': 'Inspiring Quote of the day',
    'category': 'inspire',
    'background': 'https://theysaidso.com/assets/images/qod/qod-inspire.jpg',
    'date': '2023-07-20'}]},
 'copyright': {'url': 'https://quotes.rest', 'year': '2023'}}

In [166]:
print(json.dumps(data, indent = 4))

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "id": "wPkodOctkz8HYuyIo1e8FgeF",
                "quote": "Winning is not everything, but the effort to win is.",
                "length": 52,
                "author": "Zig Ziglar",
                "language": "en",
                "tags": [
                    "effort",
                    "inspire",
                    "winning",
                    "win"
                ],
                "sfw": "sfw",
                "permalink": "https://theysaidso.com/quote/zig-ziglar-winning-is-not-everything-but-the-effort-to-win-is",
                "title": "Inspiring Quote of the day",
                "category": "inspire",
                "background": "https://theysaidso.com/assets/images/qod/qod-inspire.jpg",
                "date": "2023-07-20"
            }
        ]
    },
    "copyright": {
        "url": "https://quotes.rest",
        "year": "2023"
    }
}


In [167]:
#extract contents
q = data['contents']['quotes'][0]
q

{'id': 'wPkodOctkz8HYuyIo1e8FgeF',
 'quote': 'Winning is not everything, but the effort to win is.',
 'length': 52,
 'author': 'Zig Ziglar',
 'language': 'en',
 'tags': ['effort', 'inspire', 'winning', 'win'],
 'sfw': 'sfw',
 'permalink': 'https://theysaidso.com/quote/zig-ziglar-winning-is-not-everything-but-the-effort-to-win-is',
 'title': 'Inspiring Quote of the day',
 'category': 'inspire',
 'background': 'https://theysaidso.com/assets/images/qod/qod-inspire.jpg',
 'date': '2023-07-20'}

In [168]:
#extract only quote
print(q['quote'], '\n--', q['author'])

Winning is not everything, but the effort to win is. 
-- Zig Ziglar
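
The two responses above have different shapes: the unauthenticated call returns only a `message` key, while the authenticated call nests the quote under `contents` and `quotes`. A minimal sketch of a defensive accessor follows. The helper name `extract_quote` is hypothetical, and the sample payloads are trimmed copies of the outputs shown in this section.

```python
def extract_quote(payload):
    """Return (quote, author) from a quotes.rest-style payload, or None.

    Hypothetical helper: it relies only on the keys seen in the outputs above.
    """
    # .get() with defaults avoids a KeyError when 'contents' is missing,
    # as in the "Not authenticated" response.
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")


unauthenticated = {"message": "Not authenticated"}
authenticated = {
    "contents": {
        "quotes": [
            {
                "quote": "Winning is not everything, but the effort to win is.",
                "author": "Zig Ziglar",
            }
        ]
    }
}

print(extract_quote(unauthenticated))  # None
print(extract_quote(authenticated))
```

Returning `None` instead of raising lets the calling cell decide how to handle a failed request.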


### 1-5. Collecting Data from HTML


Parsing text refers to the process of analyzing and interpreting the structure and meaning of text data. In the context of programming, parsing is commonly used when working with structured data formats like HTML, XML, JSON, or specific file formats.
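
To make the idea concrete, here is a small sketch that parses an inline HTML string using only the standard library's `html.parser` module. This is a swapped-in illustration, not the BeautifulSoup approach used in the cells of this section; the class name `LinkTextParser` and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the visible text of every <a> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        # Start a new text buffer each time an <a> tag opens.
        if tag == "a":
            self._in_link = True
            self.link_texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Text nodes are only kept while we are inside an <a> element.
        if self._in_link:
            self.link_texts[-1] += data


sample_html = '<p>See <a href="/wiki/NLP">NLP</a> and <a href="/wiki/Parsing">parsing</a>.</p>'
link_parser = LinkTextParser()
link_parser.feed(sample_html)
print(link_parser.link_texts)  # ['NLP', 'parsing']
```

The parser walks the markup as a stream of start tags, text, and end tags, which is the structure that higher-level libraries expose through a more convenient tree interface.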

In [169]:
!pip install bs4
import urllib.request as urllib2
from bs4 import BeautifulSoup



In [170]:
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html_doc = response.read()

In [172]:
#Parsing
soup = BeautifulSoup(html_doc, 'html.parser')
# Formatting the parsed HTML file
strhtm = soup.prettify()
# Print the first 1000 characters
print(strhtm[:1000])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Natural language processing - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.m

Extracting tag value

In [176]:
print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)

<title>Natural language processing - Wikipedia</title>
Natural language processing - Wikipedia
Jump to content
Natural language processing


Extracting all instances of a particular tag

In [177]:
for x in soup.find_all('a'): print(x.string)

Jump to content
Main page
Contents
Current events
Random article
About Wikipedia
Contact us
Donate
Help
Learn to edit
Community portal
Recent changes
Upload file
None
None
Create account
Log in
None
None
learn more
Contributions
Talk
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
Afrikaans
العربية
Արեւմտահայերէն
Azərbaycanca
বাংলা
Bân-lâm-gú
Беларуская
Беларуская (тарашкевіца)
Български
Bosanski
Català
Čeština
Cymraeg
Dansk
Deutsch
Eesti
Ελληνικά
Español
Esperanto
Euskara
فارسی
Français
Galego
한국어
Հայերեն
हिन्दी
Hrvatski
Bahasa Indonesia
IsiZulu
Íslenska
Italiano
עברית
ಕನ್ನಡ
ქართული
Latviešu
Lietuvių
Македонски
मराठी
مصرى
Монгол
မြန်မာဘာသာ
日本語
ଓଡ଼ିଆ
Picard
Piemontèis
Polski
Português
Română
Runa Simi
Русский
Shqip
Simple English
کوردی
Српски / srpski
Srpskohrvatski / српскохрватски
Suomi
தமிழ்
తెలుగు
ไทย
Türkçe
Українська
Tiếng Việt
粵語
中文
Edit links
Article
Talk
Read
Edit
View history
Read
Edit
View history
What links here


Extracting all text of a particular tag

In [None]:
for x in soup.find_all('p'): print(x.text)

### 1-6. Parsing Text Using Regular Expressions

The re.match() and re.search() functions find patterns in text, and the
matches can then be processed according to the requirements of the application.
Let’s look at the differences between re.match() and re.search():
1. re.match(): This checks for a match only at the beginning of the
string. If the pattern is found at the beginning of the input string,
it returns a match object; otherwise, it returns None.
2. re.search(): This scans the whole string and returns a match object
for the first location where the pattern matches, or None if there is
no match. To get every occurrence, use re.findall() or re.finditer().
Now let’s look at a few examples using regular expressions.
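
As a quick illustration of the difference, the following sketch runs `re.match()`, `re.search()`, and `re.findall()` against the same made-up sentence. The sample text and variable names are assumptions chosen for the example.

```python
import re

text = "NLP is fun; learning NLP takes practice."

# re.match only succeeds when the pattern matches at position 0.
match_at_start = re.match(r"NLP", text)
match_not_at_start = re.match(r"learning", text)
print(match_at_start is not None)   # True
print(match_not_at_start)           # None

# re.search scans the whole string but returns only the first match.
first = re.search(r"NLP", text)
print(first.span())                 # (0, 3)

# re.findall returns every non-overlapping match as a list of strings.
all_matches = re.findall(r"NLP", text)
print(all_matches)                  # ['NLP', 'NLP']
```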

#### Tokenizing

In [179]:
# Import library
import re
# Split on runs of whitespace (raw string avoids an invalid-escape warning)
re.split(r'\s+','I like this book.')

['I', 'like', 'this', 'book.']

In [180]:
a = 'Hi How are you'.split()

In [181]:
a

['Hi', 'How', 'are', 'you']

#### Extracting email IDs

In [185]:
doc = "For more details please mail us at: xyz@abc.com,pqr@mno.com,learnwithme4998@gmail.com"

In [186]:
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', doc)

In [187]:
addresses

['xyz@abc.com', 'pqr@mno.com', 'learnwithme4998@gmail.com']

#### Replacing email IDs

In [188]:
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)',r'learnwithme4998@gmail.com', doc)
print(new_email_address)

For more details please mail us at: learnwithme4998@gmail.com,learnwithme4998@gmail.com,learnwithme4998@gmail.com


### 1-7. Handling Strings

The simplest way to handle strings is with Python's built-in string methods:
- s.find(t) index of first instance of string t inside s (-1 if not found)
- s.rfind(t) index of last instance of string t inside s (-1 if not found)
- s.index(t) like s.find(t) except it raises ValueError if not found
- s.rindex(t) like s.rfind(t) except it raises ValueError if not found
- s.join(text) combine the words of the text into a string using s as the glue
- s.split(t) split s into a list wherever a t is found (whitespace by default)
- s.splitlines() split s into a list of strings, one per line
- s.lower() a lowercased version of the string s
- s.upper() an uppercased version of the string s
- s.title() a titlecased version of the string s
- s.strip() a copy of s without leading or trailing whitespace
- s.replace(t, u) replace instances of t with u inside s
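
A short sketch exercising several of the methods listed above on one made-up string. The sample text and variable names are assumptions for illustration.

```python
s = "  Natural Language Processing\nwith Python  "

clean = s.strip()                 # remove leading/trailing whitespace
lines = clean.splitlines()        # one entry per line
words = clean.split()             # split on any whitespace

print(lines)                      # ['Natural Language Processing', 'with Python']
print(clean.find("Language"))     # 8
print(clean.find("Java"))         # -1

# index() behaves like find() but raises ValueError when nothing is found.
try:
    clean.index("Java")
    index_error_raised = False
except ValueError:
    index_error_raised = True
print(index_error_raised)         # True

joined = "-".join(words)          # the string "-" is the glue
print(joined)                     # Natural-Language-Processing-with-Python

print(lines[1].title())           # With Python
print(lines[1].replace("Python", "NLP"))  # with NLP
```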

In [189]:
String_v1 = "I am exploring NLP"
#To extract particular character or range of characters from string
print(String_v1[0])
#To extract exploring
print(String_v1[5:14])

I
exploring


Replace “exploring” with “Learning” in the above string

In [190]:
String_v2 = String_v1.replace("exploring", "Learning")
print(String_v2)

I am Learning NLP


#### Concatenating two strings

In [191]:
s1 = "nlp"
s2 = " machine learning"
s3 = s1+s2
print(s3)

nlp machine learning


#### Searching for a substring in a string

In [192]:
var="I am learning NLP"
f= "learn"
var.find(f)

5

### 1-8. Scraping Text from the Web

<span style="color:red">Caution: Before scraping any website, blog, or e-commerce site, make sure you read its terms and conditions to confirm that it permits data scraping.</span>
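
One programmatic check that can complement reading the terms is a site's `robots.txt` file. The sketch below uses the standard library's `urllib.robotparser` on a hypothetical `robots.txt` body, parsed offline. The rules, the bot name `my-nlp-bot`, and the `example.com` URLs are all assumptions for illustration, and a permissive `robots.txt` is not a substitute for the site's terms and conditions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed_public = robots.can_fetch("my-nlp-bot", "https://example.com/reviews/page1")
allowed_private = robots.can_fetch("my-nlp-bot", "https://example.com/private/data")
print(allowed_public)    # True
print(allowed_private)   # False
```

For a live site, `set_url()` followed by `read()` fetches the real file instead of parsing an inline string.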

The simplest way to do this is by using the Beautiful Soup or Scrapy library
from Python. Let’s use Beautiful Soup in this recipe.

#### Import the libraries

In [193]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import Series, DataFrame
from ipywidgets import FloatProgress
from time import sleep
from IPython.display import display
import re
import pickle

#### Identify the URL to extract the data from

In [194]:
url = 'http://www.imdb.com/chart/top?ref_=nv_mv_250_6'

In [195]:
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c,"lxml")

In [196]:
soup

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>

In [197]:
import requests
from bs4 import BeautifulSoup

# Send a GET request to the IMDB Top Rated Movies page
url = 'https://www.imdb.com/chart/top/'
response = requests.get(url)

# Create a BeautifulSoup object from the response content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the movie titles and ratings using appropriate CSS selectors
titles = soup.select('.lister-list tr .titleColumn a')
ratings = soup.select('.lister-list tr .imdbRating strong')



In [198]:
response

<Response [403]>

In [None]:
# Extract the text from the elements and print the movie details
for title, rating in zip(titles, ratings):
    movie_title = title.text.strip()
    movie_rating = rating.text.strip()
    print(f"Title: {movie_title}, Rating: {movie_rating}")

In [199]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd


In [200]:
# Downloading imdb top 250 movie's data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")


In [206]:
response

<Response [403]>

In [201]:
movies = soup.select('td.titleColumn')
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value')
           for b in soup.select('td.posterColumn span[name=ir]')]


In [202]:
# Create an empty list for storing
# movie information (named movie_list to avoid shadowing the built-in list)
movie_list = []

# Iterating over movies to extract
# each movie's details
for index in range(0, len(movies)):

    # Separating movie into: 'place',
    # 'title', 'year'
    movie_string = movies[index].get_text()
    movie = (' '.join(movie_string.split()).replace('.', ''))
    movie_title = movie[len(str(index))+1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    place = movie[:len(str(index))-(len(movie))]
    data = {"place": place,
            "movie_title": movie_title,
            "rating": ratings[index],
            "year": year,
            "star_cast": crew[index],
            }
    movie_list.append(data)


In [203]:
for movie in movie_list:
    print(movie['place'], '-', movie['movie_title'], '('+movie['year'] +
          ') -', 'Starring:', movie['star_cast'], movie['rating'])


In [204]:
# Saving the list as a DataFrame,
# then writing it to a .csv file
df = pd.DataFrame(movie_list)
df.to_csv('imdb_top_250_movies.csv',index=False)


In [205]:
df

In [207]:
import requests
from urllib.parse import urlencode  # needed to build the query string

API_KEY = '85748e0c-cf08-45fa-afdb-1f665eb2c13a'

def get_scrapeops_url(url):
    # Pass the API key and the target URL to the proxy as query parameters
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

r = requests.get(get_scrapeops_url('http://quotes.toscrape.com/page/1/'))
print(r.text)