Here's our plan for parsing the text data:

1. Convert all text to lowercase for consistency.
1. Remove accented and other non-ASCII characters.
1. Remove special characters.
1. Stem or lemmatize the words.
1. Remove stopwords.
1. Store the clean text and the original text for use in future notebooks.


In [21]:
import unicodedata
import re
import json
import acquire

import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

[nltk_data] Downloading package wordnet to /Users/hector/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hector/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [None]:
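# Takes in a string and returns a cleaned version of it
def basic_clean(text):
    # Lowercase everything
    article = text.lower()
    # Normalize unicode, then drop anything that cannot be encoded as ASCII
    article = unicodedata.normalize('NFKD', article).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # Remove anything that is not a-z, 0-9, a single quote, or whitespace
    article = re.sub(r"[^a-z0-9'\s]", '', article)

    return article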

In [4]:
original = 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America.'
print(original)

The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America.


In [5]:
article = original.lower()
print(article)

the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoor’s #1 best job in america.


We'll normalize the unicode characters in three steps:

- unicodedata.normalize removes any inconsistencies in unicode character encoding.
- .encode converts the resulting string to the ASCII character set. We'll ignore any errors in conversion, meaning we'll drop anything that isn't an ASCII character.
- .decode turns the resulting bytes object back into a string.
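
The three steps above can be sketched on a couple of short strings to make the effect easier to see. This is a minimal illustration that uses only the standard library; the helper name `to_ascii` is hypothetical and is not part of the notebook.

```python
import unicodedata

def to_ascii(text):
    # Step 1: NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Step 2: encoding to ASCII with errors='ignore' drops the combining marks
    # and any other character that has no ASCII representation.
    ascii_bytes = decomposed.encode('ascii', 'ignore')
    # Step 3: decode the bytes object back into a str.
    return ascii_bytes.decode('utf-8', 'ignore')

# 'caf\u00e9' is "cafe" with an accented e; NFKD turns 4 characters into 5.
print(len('caf\u00e9'), len(unicodedata.normalize('NFKD', 'caf\u00e9')))  # 4 5
print(to_ascii('caf\u00e9'))            # cafe
# The curly apostrophe (U+2019) has no ASCII form, so it is dropped entirely.
print(to_ascii('glassdoor\u2019s'))     # glassdoors
```

Note that characters with no decomposition, such as the curly apostrophe, disappear rather than being replaced, which is why the article text ends up with "glassdoors".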


In [6]:
article = unicodedata.normalize('NFKD', article).encode('ascii', 'ignore').decode('utf-8', 'ignore')
print(article)

the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoors #1 best job in america.


In [7]:
#remove anything that is not a through z, a number, a single quote, or whitespace
article = re.sub(r"[^a-z0-9'\s]", '', article)
print(article)

the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america


2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [None]:
# Takes in a string and returns it with its tokens separated by spaces
def tokenize(text):
    tokenizer = nltk.tokenize.ToktokTokenizer()

    return tokenizer.tokenize(text, return_str=True)
    

In [8]:
tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize(original, return_str=True))

The rumors are true ! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator , with only 25 seats available ! This immersive program is one of a kind in San Antonio , and will help you land a job in Glassdoor ’ s #1 Best Job in America .


3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [None]:
# Takes in some text and returns it with every word stemmed
def stem(text):
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in text.split()]
    return ' '.join(stems)

In [9]:
# Create the nltk stemmer object, then use it
ps = nltk.porter.PorterStemmer()

ps.stem('call'), ps.stem('called'), ps.stem('calling')

('call', 'call', 'call')

Stemming applies a fixed set of rules, so the resulting stems are not always real words. A stem may be missing from the dictionary and may not carry the meaning of the original word, **as seen below:**

In [10]:
#applied to article
stems = [ps.stem(word) for word in article.split()]
article_stemmed = ' '.join(stems)
print(article_stemmed)

the rumor are true the time ha arriv codeup ha offici open applic to our new data scienc career acceler with onli 25 seat avail thi immers program is one of a kind in san antonio and will help you land a job in glassdoor 1 best job in america
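
Truncated forms such as "arriv" and "offici" in the output above come from rule-based suffix stripping. The sketch below is a deliberately tiny toy stemmer, not the Porter algorithm, and its suffix list and minimum-length rule are assumptions chosen only to show how fixed rules can produce stems that are not dictionary words.

```python
# Toy suffix rules, checked in order. This is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip the suffix when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["called", "calling", "arrived", "studies", "has"]:
    print(w, "->", toy_stem(w))
```

With these rules, "called" and "calling" both reduce to "call", while "arrived" becomes "arriv" and "studies" becomes "studi", neither of which is a real word.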


4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

Lemmatization is very similar to stemming, but the base form it produces is the root word, known as the lemma, rather than the root stem. The difference is that a lemma is always a valid word that is present in the dictionary, while a root stem may not be.

In [None]:
# Takes in some text and returns it with every word lemmatized
def lemmatize(text):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in text.split()]

    return ' '.join(lemmas)

In [13]:
wnl = nltk.stem.WordNetLemmatizer()

for word in 'study studies'.split():
    print('stem:', ps.stem(word), '-- lemma:', wnl.lemmatize(word))

stem: studi -- lemma: study
stem: studi -- lemma: study


In [14]:
lemmas = [wnl.lemmatize(word) for word in article.split()]
article_lemmatized = ' '.join(lemmas)

print(article_lemmatized)

the rumor are true the time ha arrived codeup ha officially opened application to our new data science career accelerator with only 25 seat available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america


5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

- This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [None]:
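# Takes in some text and returns it with the stopwords removed
def remove_stopwords(text, extra_words=None, exclude_words=None):
    # Add any additional stopwords, then drop the words we want to keep,
    # e.g. exclude_words=['no', 'not'] keeps negations in the text
    stopword_list = set(stopwords.words('english')) | set(extra_words or [])
    stopword_list -= set(exclude_words or [])

    words = text.split()
    filtered_words = [w for w in words if w not in stopword_list]

    print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
    print('---')

    return ' '.join(filtered_words)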

Words that have little or no significance, especially when constructing meaningful features from text, are known as stop words (or stopwords). They are usually the words with the highest counts when you compute simple term frequencies over a corpus. Typically, these are articles, conjunctions, prepositions, and so on. Some examples of stopwords are a, an, the, and the like.

Before removing stopwords, we want to segment text into linguistic units such as words or numbers. This process is called tokenization.
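
To see why tokenizing first matters, here is a minimal sketch that compares splitting on whitespace alone with splitting punctuation into separate tokens before filtering. The tiny `STOPWORDS` set and the regex-based tokenizer are assumptions used as stand-ins for NLTK's stopword list and ToktokTokenizer, so the sketch runs with the standard library only.

```python
import re

# A tiny, hand-picked stopword set for illustration only (not NLTK's list).
STOPWORDS = {"the", "a", "is", "of"}

def naive_filter(text):
    # Split on whitespace only, then drop stopwords.
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tokenized_filter(text):
    # Separate runs of letters/digits/quotes from individual punctuation marks,
    # then drop stopwords.
    tokens = re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

sample = "A job (the best one) is here."
print(naive_filter(sample))      # ['job', '(the', 'best', 'one)', 'here.']
print(tokenized_filter(sample))  # ['job', '(', 'best', 'one', ')', 'here', '.']
```

Without tokenization, "(the" stays in the result because it does not match the stopword "the" exactly.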

In [18]:
stopword_list = stopwords.words('english')

stopword_list.remove('no')
stopword_list.remove('not')

stopword_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [19]:
words = article.split()
filtered_words = [w for w in words if w not in stopword_list]

print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
print('---')

article_without_stopwords = ' '.join(filtered_words)

print(article_without_stopwords)

Removed 20 stopwords
---
rumors true time arrived codeup officially opened applications new data science career accelerator 25 seats available immersive program one kind san antonio help land job glassdoors 1 best job america


6. Use the data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [22]:
# get data 
df = acquire.get_blog_articles()
df

AttributeError: 'NoneType' object has no attribute 'get_text'