# Natural Language Processing Recipes

## Extracting the Data

This chapter covers common sources of text data and ways to extract text from them, so that it can be turned into information or insights for a business.

Recipe 1. Text data collection using APIs

Recipe 2. Reading a PDF file in Python

Recipe 3. Reading a Word document

Recipe 4. Reading a JSON object

Recipe 5. Reading an HTML page and HTML parsing

Recipe 6. Regular expressions

Recipe 7. String handling

Recipe 8. Web scraping

### Introduction

Before getting into the details, let’s look at the data sources that are generally available. We need to identify the potential data sources that could benefit a business.

**Client Data** For any problem statement, one source is the client's own data, which already exists. Where it is stored depends on the type of business, the amount of data, and the cost associated with each storage option.
- SQL databases
- Hadoop clusters
- Cloud storage
- Flat files

**Free source** A huge amount of data is freely available over the internet. We just need to define the problem clearly and then explore the free data sources that fit it.
- Free APIs like Twitter
- Wikipedia
- Government data (e.g. http://data.gov)
- Census data (e.g. http://www.census.gov/data.html)
- Health care claim data (e.g. https://www.healthdata.gov/)

**Web scraping** Content can be extracted from websites, blogs, forums, and retail sites (for example, product reviews) using web scraping packages in Python, with permission from the respective sources. Many other sources, such as crime data, accident data, and economic data, can also be leveraged for analysis, depending on the problem statement.



#### Collecting data from the Twitter API

- You want to collect text data using Twitter APIs.<br>
--> Twitter holds a huge amount of valuable data, and social media marketers make their living from it. An enormous number of tweets are posted every day, and every tweet has a story to tell. When this data is collected and analyzed, it gives a business a great deal of insight into its company, products, services, and so on.<br>
Let’s see how to pull the data in this recipe and then explore how to leverage it.


In [3]:
# Fetching the data

# Import the libraries
import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler

# Credentials (placeholders: substitute the keys from your own Twitter developer account)
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Calling the API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
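
Rather than typing the keys into the notebook, one option is to read them from environment variables. The sketch below is only an illustration of that idea: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are hypothetical and not part of tweepy.

```python
import os

# Hypothetical variable names; any consistent naming scheme would work.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both count as missing.
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

The returned values can then be passed to `tweepy.OAuthHandler` and `auth.set_access_token` in place of the string literals.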


The query below pulls up to 10 tweets that match the search term ABC. The API returns only English tweets because the language is set to ‘en’, and the query asks for retweets to be excluded.

In [4]:
# Provide the search term to query
query = "ABC"
# Note: Tweepy 4.x renamed api.search to api.search_tweets
Tweets = api.search(query, count=10, lang='en', exclude='retweets', tweet_mode="extended")
print(Tweets)

[Status(_api=<tweepy.api.API object at 0x7fd7dd7ce220>, _json={'created_at': 'Sat Sep 12 06:53:38 +0000 2020', 'id': 1304674512903041024, 'id_str': '1304674512903041024', 'full_text': 'The British government however... https://t.co/rhBuEcZ3sZ', 'truncated': False, 'display_text_range': [0, 33], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/rhBuEcZ3sZ', 'expanded_url': 'https://twitter.com/abc/status/1304011512915296256', 'display_url': 'twitter.com/abc/status/130…', 'indices': [34, 57]}]}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 125759100, 'id_str': '125759100', 'name': 'Pete W Sutton', 'screen_name': 'Suttope', 'location': 'Brissle', 'descript
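
The raw output above is hard to read because each result carries the full JSON payload. As a minimal sketch, assuming only the fields visible in that output (`id_str`, `created_at`, `full_text`, and `user.screen_name`), the helper below keeps just the text and a little metadata. The name `tweets_to_records` is hypothetical, not a tweepy function.

```python
def tweets_to_records(statuses):
    """Flatten tweet payloads into small dicts holding the text and basic metadata."""
    records = []
    for status in statuses:
        # Accept either a tweepy Status (raw payload kept in `_json`)
        # or an already-decoded dict.
        data = getattr(status, "_json", status)
        records.append({
            "id": data.get("id_str"),
            "created_at": data.get("created_at"),
            # `full_text` appears with tweet_mode="extended"; fall back to `text`.
            "text": data.get("full_text", data.get("text", "")),
            "user": data.get("user", {}).get("screen_name"),
        })
    return records


# A shortened copy of the first result shown above.
sample = {
    "created_at": "Sat Sep 12 06:53:38 +0000 2020",
    "id_str": "1304674512903041024",
    "full_text": "The British government however... https://t.co/rhBuEcZ3sZ",
    "user": {"screen_name": "Suttope"},
}
print(tweets_to_records([sample]))
```

Since pandas is already imported in the notebook, `pd.DataFrame(tweets_to_records(Tweets))` would turn the result into a table.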

#### Collecting data from PDF

Text data is often stored in PDF files. We need to extract the text from these files and store it for further analysis.

--> Let’s follow the steps in this section to extract data from PDF files.

In [8]:
# Import the library
import PyPDF2

# Open the PDF file in binary read mode
pdf = open("/home/kunal/Github/NLP/Data/Neural_Model.pdf", "rb")  # path to the .pdf file

# Create a PDF reader object (PyPDF2 3.x renamed this class to PdfReader)
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Checking the number of pages in pdf file
print(pdf_reader.numPages)


8


In [9]:
# Creating a page object
page = pdf_reader.getPage(0)

# Finally extracting text from the page
print(page.extractText())

# Closing the pdf file 
pdf.close()

arXiv:1506.05869v3  [cs.CL]  22 Jul 2015ANeuralConversationalModel
OriolVinyals
VINYALS
@
GOOGLE
.
COM
Google

QuocV.Le
QVL
@
GOOGLE
.
COM
Google
Abstract
Conversationalmodelingisanimportanttaskin

naturallanguageunderstandingandmachinein-

telligence.Althoughpreviousapproachesex-

ist,theyareoftenrestrictedtospecicdomains

(e.g.,bookinganairlineticket)andrequirehand-

craftedrules.Inthispaper,wepresentasim-

pleapproachforthistaskwhichusestherecently

proposedsequencetosequenceframework.Our

modelconversesbypredictingthenextsentence

giventheprevioussentenceorsentencesina

conversation.Thestrengthofourmodelisthat

itcanbetrainedend-to-endandthusrequires

muchfewerhand-craftedrules.Wendthatthis

straightforwardmodelcangeneratesimplecon-

versationsgivenalargeconversationaltraining

dataset.Ourpreliminaryresultssuggestthat,de-

spiteoptimizingthewrongobjectivefunction,

themodelisabletoconversewell.Itisable

extractknowledgefrombothadomainspecic

dataset,andfromalarge,noisy,andgeneraldo
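
The output above covers only the first page, and words that were hyphenated at line ends stay split across lines. As a rough sketch, the two helpers below (hypothetical names) read every page with the same reader calls used in this recipe and then undo the hard line wraps. They cannot restore the spaces that the extraction dropped between words, and a genuinely hyphenated word that happens to break at a line end (such as "hand-crafted") loses its hyphen.

```python
import re


def extract_all_pages(pdf_reader):
    """Concatenate the text of every page, using the reader calls from this recipe."""
    pages = []
    for page_number in range(pdf_reader.numPages):
        pages.append(pdf_reader.getPage(page_number).extractText())
    return "\n".join(pages)


def rejoin_lines(text):
    """Merge words split by a hyphen at a line end, then collapse line breaks."""
    # "machinein-\n\ntelligence" becomes "machineintelligence"
    text = re.sub(r"-\s*\n\s*", "", text)
    # Every remaining line break becomes a single space
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()
```

Call `extract_all_pages(pdf_reader)` before `pdf.close()`, because the reader still needs the open file.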

#### Collecting data from Word files

Next, let’s look at a short recipe for reading Word files in Python.

--> The simplest way to do this is to use the docx library (installed as the python-docx package).
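
As a minimal sketch of that approach, the helpers below collect the text of each non-empty paragraph. It assumes python-docx is installed. `docx.Document` and its `paragraphs` attribute belong to that library, while the function names `paragraph_texts` and `read_word_file` are hypothetical.

```python
def paragraph_texts(document):
    """Return the text of each non-empty paragraph in a python-docx Document."""
    return [p.text for p in document.paragraphs if p.text.strip()]


def read_word_file(path):
    """Open a .docx file and return its paragraphs as a list of strings."""
    # Imported here so the module loads even when python-docx is not installed.
    import docx

    return paragraph_texts(docx.Document(path))
```

Joining the result with `"\n".join(...)` gives the document body as one string. Text inside tables is not included, because python-docx exposes tables separately from paragraphs.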