The first stage of any NLP project is to extract the required textual data. This text data is usually unstructured and stored in various locations and formats.

The purpose of this notebook is to show how to extract text data from the most common sources.

We will cover text extraction from the following sources:
1. Tweets
2. Word Documents
3. PDFs
4. Text from Images
5. CSV files
6. Excel files
7. Facebook Posts
8. RSS Feeds

# Extract Tweets
***
Tweets can be extracted and fed into an NLP model to gauge public opinion on a topic. We will use the `tweepy` library to search for tweets that contain the target keywords.

First, generate the required API keys and access tokens:

* Go to [this link](https://apps.twitter.com/) and click "Create New App".
* Choose "Create your Twitter Application" and fill in the details. Your keys and tokens are then listed on the "Keys and Access Tokens" tab.
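
Once the keys and tokens exist, it is safer to keep them out of the notebook itself. The sketch below reads them from environment variables instead of hard-coding them. The variable names and the `load_credentials` helper are assumptions made for illustration; they are not part of `tweepy` or of Twitter's setup.

```python
import os

# Hypothetical variable names; any naming scheme works as long as it
# matches what you export in your shell or notebook environment.
CREDENTIAL_VARS = (
    'TWITTER_CONSUMER_KEY',
    'TWITTER_CONSUMER_SECRET',
    'TWITTER_ACCESS_TOKEN',
    'TWITTER_ACCESS_TOKEN_SECRET',
)

def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {name: environ[name] for name in CREDENTIAL_VARS}
```

Calling `load_credentials()` with no arguments reads from the real environment; passing a plain dictionary makes the helper easy to try out without touching it.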

In [None]:
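import tweepy

# Placeholders: replace with the credentials from your Twitter app.
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

def get_tweets(api, keywords, count):
    # The search query is a single string, so join the keywords with OR.
    query = ' OR '.join(keywords)
    # tweepy 4.x names this method search_tweets; in 3.x it was api.search.
    return api.search_tweets(q=query, result_type='recent',
                             lang='en', count=count)

tweets = get_tweets(api, ['FinTechExplained', 'MachineLearning'], 5)

for tweet in tweets:
    print(tweet.text)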

# How to Read a Word Document
***
Use the `python-docx` library (imported as `docx`) to read and extract text from Word documents.

`pip install python-docx`

In [None]:
import docx

all_text = []
doc = docx.Document(filename)  # filename: path to a .docx file
for paragraph in doc.paragraphs:
    all_text.append(paragraph.text)

print('\n'.join(all_text))
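
Where installing an extra library is not an option, the same text can be pulled out with the standard library alone, because a `.docx` file is a ZIP archive whose main content lives in `word/document.xml`. The sketch below is an alternative technique, not a description of how the `docx` library works internally. The `docx_paragraphs` helper and the tiny in-memory sample document are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file.

    `source` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    paragraphs = []
    # iter() walks the whole tree, so paragraphs inside tables are included.
    for para in root.iter(W_NS + 'p'):
        # A paragraph's text is split across one or more <w:t> runs.
        text = ''.join(node.text or '' for node in para.iter(W_NS + 't'))
        paragraphs.append(text)
    return paragraphs

# Build a tiny in-memory archive that holds only the main document part,
# which is enough for this function (a real .docx carries more parts).
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', DOCUMENT_XML)
buffer.seek(0)

print('\n'.join(docx_paragraphs(buffer)))
```

For a real file, pass its path instead of the buffer, for example `docx_paragraphs('report.docx')`, where the file name is a placeholder.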

# How to Read a PDF Document
***
Use the `PyPDF2` library to extract text from PDFs.

In [None]:
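import PyPDF2
with open(file_name, 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)  # named PdfFileReader in older PyPDF2 releases
    print(reader.pages[0].extract_text())  # text of the first page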

# Read Text from an Image
***
Use the `pytesseract` library to process and read text from images.

In [None]:
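from PIL import Image  # provided by the Pillow package
from pytesseract import image_to_string

# image_path: path to the image file; Tesseract's code for English is 'eng'.
print(image_to_string(Image.open(image_path), lang='eng'))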

# Read Text from a CSV File
***
Use the `pandas` library to read text from a CSV file.

In [None]:
import pandas as pd
# sep and delimiter are aliases, so pass only one of them.
dataframe = pd.read_csv(file_path, sep=',')
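
Reading the file is usually only half the job: for NLP we want the text column as a list of documents. The following is a minimal sketch of that step, assuming a column named `review`. The column name and the sample rows are invented for illustration, and an in-memory string stands in for a file on disk so the cell runs anywhere.

```python
import io
import pandas as pd

# Made-up sample data standing in for a CSV file on disk.
csv_text = """id,review
1,Great product
2,
3,"Arrived late, but works"
"""

# read_csv accepts a path or any file-like object.
frame = pd.read_csv(io.StringIO(csv_text), sep=',')

# Keep only the text column, drop empty cells, and convert to a list.
documents = frame['review'].dropna().tolist()
print(documents)
```

The quoted third row shows that commas inside a quoted field are kept as part of the text rather than treated as separators.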

# Read Text from Excel Spreadsheet
***
Use the `pandas` library to read text from an Excel spreadsheet.

In [None]:
import pandas as pd
data_frame = pd.read_excel(file_path, sheet_name=sheet_name)  # sheet name or zero-based index

# Extract Posts from Facebook
***
The first step is to get the required token. Start by navigating to [this link](https://developers.facebook.com/tools/explorer).

Then go to "My Apps", add a new app, and create your app ID.

Then get your user access token from [here](https://developers.facebook.com/tools/explorer).

You can extend the expiration date of the user access token by visiting [this link](https://developers.facebook.com/tools/accesstoken).

Then install the required libraries:
`pip install facebook-sdk requests`

Now we can fetch five posts of a specific user:

In [None]:
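import facebook
import requests

# client_id, client_secret, user_id: values from your Facebook app (placeholders).
token_url = 'https://graph.facebook.com/oauth/access_token'
params = dict(client_id=client_id, client_secret=client_secret,
              grant_type='client_credentials')

token_response = requests.get(token_url, params=params)
# Recent Graph API versions return JSON; older ones returned 'access_token=...'.
graph = facebook.GraphAPI(token_response.json()['access_token'])
posts = graph.get_connections(graph.get_object(user_id)['id'],
                              'posts', limit=5)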
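
The `posts` result is a dictionary rather than a list of strings, so one more step is needed to get at the text. Here is a minimal sketch of that step, assuming the usual Graph API shape in which the posts sit under a `data` key and each post's text under `message`. The `post_messages` helper and the sample dictionary are made up for illustration.

```python
def post_messages(posts):
    """Return the text of each post in a Graph API `posts` response."""
    messages = []
    for post in posts.get('data', []):
        # Posts without text (for example, a shared photo) have no 'message'.
        message = post.get('message')
        if message:
            messages.append(message)
    return messages

# Stand-in for the dictionary returned by graph.get_connections(...).
sample_posts = {
    'data': [
        {'id': '1_1', 'created_time': '2020-01-01T10:00:00+0000',
         'message': 'First post'},
        {'id': '1_2', 'created_time': '2020-01-02T10:00:00+0000'},
        {'id': '1_3', 'created_time': '2020-01-03T10:00:00+0000',
         'message': 'Third post'},
    ],
    'paging': {},
}
print(post_messages(sample_posts))
```

With a live response, `post_messages(posts)` would be called on the `posts` variable from the cell above.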

# Extract from RSS Feed
***
Install the `feedparser` library to extract the RSS feeds:

`pip install feedparser`

Then we can use `feedparser` to parse the feed and list the fields (keys) available on each entry:

In [None]:
import feedparser
feed = feedparser.parse(rss_feed_url)
for entry in feed.entries:
    print(entry.keys())
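
Once the available keys are known, the text itself can be collected. The sketch below joins the `title` and `summary` of each entry, assuming those two keys hold the text of interest (not every feed provides both). Feed entries support dictionary-style `.get`, so plain dictionaries stand in for `feed.entries` here and the cell runs offline. The `entry_texts` helper and the sample entries are invented for illustration.

```python
def entry_texts(entries):
    """Return one 'title summary' string per feed entry that has text."""
    texts = []
    for entry in entries:
        title = entry.get('title', '').strip()
        summary = entry.get('summary', '').strip()
        # Join whichever parts are present; skip entries with no text at all.
        combined = ' '.join(part for part in (title, summary) if part)
        if combined:
            texts.append(combined)
    return texts

# Plain dictionaries stand in for feed.entries so the sketch runs offline.
sample_entries = [
    {'title': 'Rates rise', 'summary': 'The central bank lifted rates.',
     'link': 'https://example.com/1'},
    {'title': 'Markets today', 'link': 'https://example.com/2'},
    {'link': 'https://example.com/3'},
]
print(entry_texts(sample_entries))
```

In real feeds the summary often contains HTML markup, which would need to be stripped before the text goes into a model.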