# Import a URL and Convert It to Text for Keyword Analysis

## Table of Contents
1. Load Packages and Data
2. Read URL and Extract Text
3. Summary

## Load Packages and Data

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
url_link = input()

http://regenerativeagriculturepodcast.com/


## Read URL and Extract Text

In [4]:
# read the link and parse with beautiful soup
html = urlopen(url_link).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# split phrases separated by double spaces onto their own lines
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

Regenerative Agriculture Podcast
Toggle navigation
About
Contact
Episodes
All Episodes
Archives
2020
NovemberOctoberSeptemberAugustJulyJuneMayAprilMarchFebruaryJanuary2019
DecemberNovemberOctoberSeptemberAugustJulyJuneMayAprilMarchFebruary2018
DecemberNovemberOctoberAugustJulyJuneMayApril
Preview Mode Links will not work in preview mode
Regenerative Agriculture Podcast
Get Email Updates | Privacy Policy
Reversing Soil Degradation with Dwayne BeckNov 3, 2020Dr. Dwayne Beck is well known
for being one of the pioneers of no-till agriculture in central
South Dakota and across the High Plains. For more than three
decades, Dr. Beck has been creating comprehensive systems for both
irrigated and dryland crop production throughout the region,
educating growers on the power of crop...
Read MoreUpdating Soil Analysis to Consider Microbial Influence with Rick HaneyOct 6, 2020Rick Haney is a renowned
researcher at the U.S. Department of Agriculture and the creator of
the Haney Soil Analysis, an inn

## Summary

The string `text` can now be passed to the function created in `NLP - analyze pdf for keywords` to obtain keywords, key phrases, and a word cloud.

### Future Work

In addition to creating functions that read in a PDF or URL and return all the key information (for ease of calling), it would be useful to host this on a website.
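As a minimal sketch of the first idea, the extraction steps from the "Read URL and Extract Text" cell could be wrapped in reusable functions. The names `html_to_text` and `url_to_text` are hypothetical and not part of the original notebook; the logic simply mirrors the cell above.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document, one chunk per line."""
    soup = BeautifulSoup(html, features="html.parser")
    # drop script and style elements so their contents do not leak into the text
    for element in soup(["script", "style"]):
        element.extract()
    text = soup.get_text()
    # strip each line, split on double spaces, and drop empty chunks
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


def url_to_text(url):
    """Fetch a URL and return its visible text (hypothetical helper)."""
    with urlopen(url) as response:
        return html_to_text(response.read())
```

With these in place, `text = url_to_text(url_link)` would replace the body of the extraction cell. Keeping the parsing step separate from the download makes it possible to try the cleaning logic on a literal HTML string without network access.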

Also, the function below (created for a chatbot) could help refine the information in the string by removing stop words and punctuation/symbols and by converting the text to a uniform case. To use it, the prior work would have to be modified to analyze a list. Because this function separates the text into individual words, it would not help in identifying key phrases. Perhaps the function could be modified to remove the word tokenization step.
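One way the tokenization-free variant might look is sketched below. It returns a single cleaned string with the original word order intact, so downstream phrase detection still has adjacent words to work with. The function name `clean_phrase` and the small `STOP_WORDS` set are assumptions for illustration; NLTK's `stopwords.words('english')` could be substituted for the set.

```python
import re
import string

# Small illustrative stop-word set (an assumption, not the NLTK list)
STOP_WORDS = {"a", "an", "and", "as", "is", "of", "the", "to"}


def clean_phrase(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words, keeping word order."""
    text = text.lower()
    # replace punctuation with spaces so adjacent words do not fuse together
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # keep alphabetic words that are not stop words, in their original order
    words = [w for w in text.split() if w.isalpha() and w not in stop_words]
    return " ".join(words)


print(clean_phrase("The Power of Cover Crops, and Soil Health!"))
```

Spelling correction is left out of this sketch to keep it dependency-free; it could be applied word by word before the final join if needed.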

In [13]:
# requisite packages
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
from spellchecker import SpellChecker

# create functions for cleaning the text

# collapse runs of three or more repeated characters to two
# (e.g. 'teeeeest' -> 'teest') so the spell checker can recognize the word
def reduce_length(text):
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)

# define a specialized function for the CountVectorizer analyzer
# split, remove punctuation and other symbols, lowercase
# filter stop words
def text_cleaner(text):
    tokens = word_tokenize(text) # split
    tokens = [word.lower() for word in tokens] # lowercase
    # remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [word.translate(table) for word in tokens]
    # remove non-alphabetic tokens
    words = [word for word in stripped if word.isalpha()]
    # filter stop words (a set makes the membership test faster)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # correct spelling
    words_reduced = [reduce_length(word) for word in words]
    spell = SpellChecker()
    # newer pyspellchecker releases may return None when no correction is found,
    # so fall back to the original word in that case
    correct_words = [spell.correction(word) or word for word in words_reduced]
    return correct_words

# example of modified text_cleaner() at work
text_cleaner('I am teeeeeest!ing that the! function< is working# as *** expected and> removing stop words, as wellllll as, ?!punctuation.') 

['test',
 'ing',
 'function',
 'working',
 'expected',
 'removing',
 'stop',
 'words',
 'well',
 'punctuation']

### Watermark

In [8]:
# use watermark in a notebook with the following call
%load_ext watermark

# %watermark? #<-- watermark documentation

%watermark -a "H.GRYK" -d -t -v -p sys
%watermark -p urllib.request
%watermark -p bs4



The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
H.GRYK 2020-12-04 14:48:06 

CPython 3.7.7
IPython 7.18.1

sys 3.7.7 (default, May  6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
urllib.request unknown
bs4 4.8.0
