# Natural Language Processing Recipes

## Extracting the Data

Here, we are going to cover various sources of text data and ways to extract it, which can act as information or insights for businesses.

Recipe 1. Text data collection using APIs

Recipe 2. Reading PDF file in Python

Recipe 3. Reading word document

Recipe 4. Reading JSON object

Recipe 5. Reading HTML page and HTML parsing

Recipe 6. Regular expressions

Recipe 7. String handling

Recipe 8. Web scraping

### Introduction

Before getting into details of the book, let’s see the different possible datasources available in general. We need to identify potential data sources for a business’s benefit.

**Client Data** For any problem statement, one of the sources is their own data that is already present. But it depends on the business where they store it. Data storage depends on the type of business, amount of data, and cost associated with different sources.
- SQL databases
- Hadoop clusters
- Cloud storage
- Flat files

**Free source** A huge amount of data is freely available over the internet. We just need to streamline the problem and start exploring multiple free data sources.
- Free APIs like Twitter
- Wikipedia
- Government data (e.g. http://data.gov)
- Census data (e.g. http://www.census.gov/data.html)
- Health care claim data (e.g. https://www.healthdata.gov/)

**Web scraping** Extracting the content/data from websites, blogs, forums, and retail websites for reviews with the permission from the respective sources using web scraping packages in Python. There are a lot of other sources like crime data, accident data, and economic data that can also be leveraged for analysis based on the problem statement.



#### Collecting the data from Twitter API

- You want to collect text data using Twitter APIs.<br>
--> Twitter has a gigantic amount of data with a lot of value in it. Social mediamarketers are making their living from it. There is an enormous amount of tweets every day, and every tweet has some story to tell. When all of this
data is collected and analyzed, it gives a tremendous amount of insights to a business about their company, product, service, etc.<br>
Let’s see how to pull the data in this recipe and then explore how to leverage it


In [8]:
# Fetching the data

# Import the libraries
import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler

# Credentials
consumer_key = "JrEyiSir2ehNBUncsR8NzMW6H"
consumer_secret = "SZqQaaTEBiUfp3Gnwd6lkgcpiwgzuyuUbbWVL453BjDKighEdG"
access_token = "2310353096-bR4mYstbYwXoZVIOCwRFYOjdLLqxGK4ZPt0cOB8"
access_token_secret = "yMXQ7Mse2B69WpYpppojjbcPtPvdkOW1OqRddxtN8wj76"

# caliing API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)


The query below will pull the top 10 tweets when the product ABC is searched. The API will pull English tweets since the language given is ‘en’ and it will exclude retweets.

In [9]:
# Provide the query you want to pull the data
query = "ABC"
Tweets = api.search(query, count=10, lang='en', exclude='retweets', tweet_mode="extended")
print(Tweets)

[Status(_api=<tweepy.api.API object at 0x7fad6d6e0ca0>, _json={'created_at': 'Wed Aug 26 09:18:25 +0000 2020', 'id': 1298550357698764800, 'id_str': '1298550357698764800', 'full_text': '@dialogue4sanity @Lwmdead @ABC And it is going to get worse', 'truncated': False, 'display_text_range': [31, 59], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'dialogue4sanity', 'name': 'Open Mind', 'id': 950257016856969216, 'id_str': '950257016856969216', 'indices': [0, 16]}, {'screen_name': 'Lwmdead', 'name': '禹会🇹🇼', 'id': 1262845802952585216, 'id_str': '1262845802952585216', 'indices': [17, 25]}, {'screen_name': 'ABC', 'name': 'ABC News', 'id': 28785486, 'id_str': '28785486', 'indices': [26, 30]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'in_reply_to_status_id': 1298543505200099328, 'in_reply_to_status_id_str': '1298543505200099328', 

#### Collecting data from PDF

Most of the time your data will be stored as PDF files. We need to extract text from these files and store it for further analysis.

--> Let’s follow the steps in this section to extract data from PDF files.

In [17]:
# Import the Libraries
import PyPDF2
from PyPDF2 import PdfFileReader

# Creating a pdf file object
pdf = open("/home/kunal/Desktop/Image_Stitching.pdf","rb") 

# Creating pdf reader objecct
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Checking the number of pages in pdf file
print(pdf_reader.numPages)


8


In [19]:
# Creating a page object
page = pdf_reader.getPage(0)

# Finally extracting text from the page
print(page.extractText())

# Closing the pdf file 
pdf.close()

International Journal of Computer Applications (0975 

 
8887)
 
Volume 9
9

 
No.
6
,
 
August
 
2014
 
1
 
Image Stitching b
ased on Feature Extraction 
Techniques: A Survey
 
 
Ebtsam
 
Adel
 
 
Information System Dept. 
Faculty of Computers and 
Information,
 
 
Mansoura University, Egypt
                                                         
 
Moh
ammed Elmogy
                   
 
  
Information Technology Dept. 
Faculty of Computers and 
Information,
 
 
Mansoura University, Egypt
                                             
            
 
 
Hazem Elbakry
 
Information System Dept
 
Faculty of 
Computers and 
Information,
 
Mansoura University, Egypt
 
 
 
ABSTRACT
 
Image stitching (Mosaicing) is considered as an active 
research area in computer vision and computer graphics. 
Image stitching is concerned with combining two or more 
images
 
of the same scene into one 
high resolution 
image 
which is called panoramic
 
image. Image
 
stitching
 
techniques
 
can be 
catego

#### Collecting Data from Word Files

Let us look at another small recipe by reading Word files in Python.

--> The simplest way to do this is by using the docx library.