# Parsing Text

## Import Libraries

In [23]:
#Imports for processing and cleaning language
import unicodedata
import re
import json

import nltk
nltk.download('wordnet')
  
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

#Pandas
import pandas as pd

#See the acquire.py file 
from acquire import get_blog_df, get_articles, get_blog_articles

#Disable warnings
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/malachihale/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Acquire Data

We will use one blog post as an example to build our functions.

In [2]:
url = 'https://codeup.com/data-science/codeups-data-science-career-accelerator-is-here/'

In [3]:
blog_articles = get_blog_articles(url)

In [4]:
content = blog_articles['contents']
content

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student

## Create Function to Clean the Content

Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.
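
As a rough illustration of what each of the three steps does, here is a minimal sketch on a short made-up string. The helper name `show_cleaning_steps` and the sample text are assumptions for this example only; the calls mirror the ones used in the cells that follow.

```python
import re
import unicodedata


def show_cleaning_steps(text):
    """Return the text after each cleaning step (hypothetical helper for illustration)."""
    #Step 1: lowercase everything
    lowered = text.lower()

    #Step 2: NFKD splits accented letters into a base letter plus a combining mark
    #and maps the non-breaking space to a plain space; the ascii round trip
    #then drops anything that is still not ASCII (such as a curly apostrophe)
    normalized = (
        unicodedata.normalize('NFKD', lowered)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
    )

    #Step 3: remove everything except lowercase letters, digits, single quotes, and whitespace
    stripped = re.sub(r"[^a-z0-9'\s]", '', normalized)

    return lowered, normalized, stripped


#Made-up sample with a curly apostrophe, a non-breaking space, and an accent
sample = 'Glassdoor’s\xa0#1 Café!'
for step in show_cleaning_steps(sample):
    print(repr(step))
```

Note that the curly apostrophe is dropped entirely rather than converted to a straight one, which is why "glassdoor’s" turns into "glassdoors" in the notebook output.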

In [5]:
content = content.lower()
print(content)

the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoor’s #1 best job in america.
data science is a method of providing actionable intelligence from data. the data revolution has hit san antonio, resulting in an explosion in data scientist positions across companies like usaa, accenture, booz allen hamilton, and heb. we’ve even seen utsa invest $70 m for a cybersecurity center and school of data science. we built a program to specifically meet the growing demands of this industry.
our program will be 18 weeks long, full-time, hands-on, and project-based. our curriculum development and instruction is led by senior data scientist, maggie giust, who has worked at heb, capital group, and rackspace, along with input from dozens of practitioners and hiring partners. students will work with real

In [6]:
content = unicodedata.normalize('NFKD', content)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

print(content)

the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoors #1 best job in america.
data science is a method of providing actionable intelligence from data. the data revolution has hit san antonio, resulting in an explosion in data scientist positions across companies like usaa, accenture, booz allen hamilton, and heb. weve even seen utsa invest $70 m for a cybersecurity center and school of data science. we built a program to specifically meet the growing demands of this industry.
our program will be 18 weeks long, full-time, hands-on, and project-based. our curriculum development and instruction is led by senior data scientist, maggie giust, who has worked at heb, capital group, and rackspace, along with input from dozens of practitioners and hiring partners. students will work with real d

In [7]:
content = re.sub(r"[^a-z0-9'\s]", '', content)
print(content)

the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america
data science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry
our program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems a

In [8]:
def basic_clean(content):
    #Make lowercase
    content = content.lower()
    
    #Normalize unicode characters 
    content = unicodedata.normalize('NFKD', content)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')
    
    #Remove special characters
    content = re.sub(r"[^a-z0-9'\s]", '', content)
    
    return content

## Create a Function to Tokenize the Content

Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [9]:
# Create the tokenizer
tokenizer = nltk.tokenize.ToktokTokenizer()

# Use the tokenizer
content = tokenizer.tokenize(content, return_str = True)

print(content)

the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america
data science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry
our program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems a

In [10]:
def tokenize(content):
    # Create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()

    # Use the tokenizer
    content = tokenizer.tokenize(content, return_str = True)
    
    return content
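
To see what the tokenizer changes, here is a small sketch on a made-up sentence (the sample string is an assumption, chosen because it contains a straight apostrophe). With `return_str = True`, ToktokTokenizer returns one string with the tokens separated by spaces. It also pads a straight apostrophe with spaces, which is consistent with the `world ' s` text that appears in the news_df output later in this notebook.

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()

#Made-up sample containing a straight apostrophe
sample = "the world's two richest men"

#return_str = True gives back one space-separated string instead of a list
as_string = tokenizer.tokenize(sample, return_str = True)

#Without return_str the same tokens come back as a list
as_list = tokenizer.tokenize(sample)

print(as_string)
print(as_list)
```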

## Create a Function to Stem the Content

Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [11]:
#Create the nltk stemmer object
ps = nltk.porter.PorterStemmer()


stems = [ps.stem(word) for word in content.split()]
content_stemmed = ' '.join(stems)
print(content_stemmed)

the rumor are true the time ha arriv codeup ha offici open applic to our new data scienc career acceler with onli 25 seat avail thi immers program is one of a kind in san antonio and will help you land a job in glassdoor 1 best job in america data scienc is a method of provid action intellig from data the data revolut ha hit san antonio result in an explos in data scientist posit across compani like usaa accentur booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecur center and school of data scienc we built a program to specif meet the grow demand of thi industri our program will be 18 week long fulltim handson and projectbas our curriculum develop and instruct is led by senior data scientist maggi giust who ha work at heb capit group and rackspac along with input from dozen of practition and hire partner student will work with real data set realist problem and the entir data scienc pipelin from collect to deploy they will receiv profession develop train in resum

In [12]:
def stem(content):
    #Create the nltk stemmer object
    ps = nltk.porter.PorterStemmer()


    stems = [ps.stem(word) for word in content.split()]
    content_stemmed = ' '.join(stems)
    return content_stemmed

## Create a Function to Lemmatize the Content

Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [13]:
wnl = nltk.stem.WordNetLemmatizer()

lemmas = [wnl.lemmatize(word) for word in content.split()]
content_lemmatized = ' '.join(lemmas)

print(content_lemmatized)

the rumor are true the time ha arrived codeup ha officially opened application to our new data science career accelerator with only 25 seat available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america data science is a method of providing actionable intelligence from data the data revolution ha hit san antonio resulting in an explosion in data scientist position across company like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demand of this industry our program will be 18 week long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who ha worked at heb capital group and rackspace along with input from dozen of practitioner and hiring partner student will work with real data set realistic problem and the entire data

In [14]:
def lemmatize(content):
    wnl = nltk.stem.WordNetLemmatizer()

    lemmas = [wnl.lemmatize(word) for word in content.split()]
    content_lemmatized = ' '.join(lemmas)

    return content_lemmatized 

## Create a Function to Remove Stopwords

Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [15]:
stopword_list = stopwords.words('english')

stopword_list.remove('no')
stopword_list.remove('not')

stopword_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [16]:
words = content.split()
filtered_words = [w for w in words if w not in stopword_list]

print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
print('---')

content_without_stopwords = ' '.join(filtered_words)

print(content_without_stopwords)

Removed 122 stopwords
---
rumors true time arrived codeup officially opened applications new data science career accelerator 25 seats available immersive program one kind san antonio help land job glassdoors 1 best job america data science method providing actionable intelligence data data revolution hit san antonio resulting explosion data scientist positions across companies like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet growing demands industry program 18 weeks long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust worked heb capital group rackspace along input dozens practitioners hiring partners students work real data sets realistic problems entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition workforce focus applie
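
The exploration above can be wrapped into the remove_stopwords function that this section asks for. The following is one possible sketch, not the notebook's own implementation. The `base_stopwords` parameter is an assumption added here so the example can run without the NLTK stopwords corpus; when it is left out, the function falls back to `stopwords.words('english')` as in the cells above.

```python
def remove_stopwords(content, extra_words=None, exclude_words=None, base_stopwords=None):
    """Return content with stopwords removed (a sketch under the assumptions above)."""
    if base_stopwords is None:
        #Fall back to the NLTK English list; requires the stopwords corpus to be downloaded
        from nltk.corpus import stopwords
        base_stopwords = stopwords.words('english')

    #Build the final set: base list, plus extra words, minus excluded words
    stopword_set = set(base_stopwords)
    stopword_set |= set(extra_words or [])
    stopword_set -= set(exclude_words or [])

    #Keep only the words that are not in the stopword set
    filtered_words = [w for w in content.split() if w not in stopword_set]

    return ' '.join(filtered_words)


#Usage with a tiny made-up stopword list so no corpus download is needed
toy_stopwords = ['the', 'is', 'not', 'a', 'of']
print(remove_stopwords('the program is not a kind of rumor',
                       extra_words = ['kind'],
                       exclude_words = ['not'],
                       base_stopwords = toy_stopwords))
```

Passing `exclude_words = ['no', 'not']` would reproduce the effect of the two `stopword_list.remove` calls above without mutating the list.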

## Acquire News and Blog Article Data

Use the data from acquire.py to produce a dataframe of the news articles. Name the dataframe news_df.

In [225]:
categories = ["business",
"sports",
"technology",
"entertainment"]

In [226]:
news_df = get_articles(categories)

In [227]:
news_df.head()

| Unnamed: 0 | title | contents | category |
| --- | --- | --- | --- |
| 0 | Navi Mutual Fund offers SIPs starting at ₹500 | Navi Mutual Fund is offering investors the opt... | business |
| 1 | 'Man who takes 6 months parental leave is a lo... | Several Twitter users criticised US-based Pala... | business |
| 2 | Cognizant had to choose clients to serve: CEO ... | Cognizant CEO Brian Humphries has said the fir... | business |
| 3 | Delhi HC notice to RBI, SBI over banning UPI p... | The Delhi High Court on Thursday issued notice... | business |
| 4 | Elon Musk and Jeff Bezos are now worth nearly ... | The combined net worth of the world's two rich... | business |


Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [228]:
codeup_df = get_blog_df()

In [229]:
codeup_df

| Unnamed: 0 | title | contents |
| --- | --- | --- |
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


## Create a Function that Cleans, Stems, and Lemmatizes the DataFrames

For each dataframe, produce the following columns:

 - title to hold the title
 - original to hold the original article/post content
 - clean to hold the normalized and tokenized original with the stopwords removed.
 - stemmed to hold the stemmed version of the cleaned data.
 - lemmatized to hold the lemmatized version of the cleaned data.
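
One way to build these columns is to apply a text function to a whole column with `Series.apply`. The sketch below is a minimal outline under assumptions: `add_text_columns` is a hypothetical helper, it assumes the text already lives in a column named original, and the three lambda functions are toy stand-ins for the cleaning, stemming, and lemmatizing functions defined earlier in this notebook.

```python
import pandas as pd


def add_text_columns(df, clean_fn, stem_fn, lemmatize_fn):
    """Return a copy of df with clean, stemmed, and lemmatized columns (hypothetical helper)."""
    out = df.copy()
    #clean is derived from the original text
    out['clean'] = out['original'].apply(clean_fn)
    #stemmed and lemmatized are both derived from the clean text
    out['stemmed'] = out['clean'].apply(stem_fn)
    out['lemmatized'] = out['clean'].apply(lemmatize_fn)
    return out


#Toy stand-ins so the sketch runs on its own; they are NOT real NLP functions
toy_clean = lambda text: text.lower()
toy_stem = lambda text: ' '.join(word[:4] for word in text.split())
toy_lemmatize = lambda text: ' '.join(word.rstrip('s') for word in text.split())

example_df = pd.DataFrame({
    'title': ['Example post'],
    'original': ['The Rumors Are True'],
})

result = add_text_columns(example_df, toy_clean, toy_stem, toy_lemmatize)
print(result.columns.tolist())
```

In the notebook itself, the stand-ins would be replaced by a composition of basic_clean and tokenize (plus stopword removal if desired), stem, and lemmatize.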

In [230]:
news_df = news_df.rename(columns = {'contents': 'original'})

In [231]:
codeup_df = codeup_df.rename(columns = {'contents': 'original'})

In [232]:
news_df.head()

| Unnamed: 0 | title | original | category |
| --- | --- | --- | --- |
| 0 | Navi Mutual Fund offers SIPs starting at ₹500 | Navi Mutual Fund is offering investors the opt... | business |
| 1 | 'Man who takes 6 months parental leave is a lo... | Several Twitter users criticised US-based Pala... | business |
| 2 | Cognizant had to choose clients to serve: CEO ... | Cognizant CEO Brian Humphries has said the fir... | business |
| 3 | Delhi HC notice to RBI, SBI over banning UPI p... | The Delhi High Court on Thursday issued notice... | business |
| 4 | Elon Musk and Jeff Bezos are now worth nearly ... | The combined net worth of the world's two rich... | business |


In [196]:
codeup_df

| Unnamed: 0 | title | original | stemmed |
| --- | --- | --- | --- |
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... | the rumor are true! the time ha arrived. codeu... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... | by dimitri antoni and maggi giust data science... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... | by dimitri antoni a week ago, codeup launch ou... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... | sa tech job fair the third bi-annu san antonio... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... | competitor bootcamp are closing. is the model ... |


In [233]:
def clean_dataframe(df):
    #Normalize and tokenize each original document
    clean = df['original'].apply(lambda text: tokenize(basic_clean(text)))
    
    #Return a copy of the dataframe with the new clean column;
    #assign keeps the rows aligned even if the index is not 0..n-1
    return df.assign(clean = clean)

In [234]:
def stem_dataframe(df):
    #Stem each cleaned document
    stemmed = df['clean'].apply(stem)
    
    #Return a copy of the dataframe with the new stemmed column
    return df.assign(stemmed = stemmed)

In [235]:
def lemmatize_dataframe(df):
    #Lemmatize each cleaned document
    lemmatized = df['clean'].apply(lemmatize)
    
    #Return a copy of the dataframe with the new lemmatized column
    return df.assign(lemmatized = lemmatized)

In [236]:
codeup_df = clean_dataframe(codeup_df)
codeup_df = stem_dataframe(codeup_df)
codeup_df = lemmatize_dataframe(codeup_df)

In [237]:
codeup_df

| Unnamed: 0 | title | original | clean | stemmed | lemmatized |
| --- | --- | --- | --- | --- | --- |
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... | the rumors are true the time has arrived codeu... | the rumor are true the time ha arriv codeup ha... | the rumor are true the time ha arrived codeup ... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... | by dimitri antoniou and maggie giust\ndata sci... | by dimitri antoni and maggi giust data scienc ... | by dimitri antoniou and maggie giust data scie... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... | by dimitri antoniou\na week ago codeup launche... | by dimitri antoni a week ago codeup launch our... | by dimitri antoniou a week ago codeup launched... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... | sa tech job fair\nthe third biannual san anton... | sa tech job fair the third biannual san antoni... | sa tech job fair the third biannual san antoni... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... | competitor bootcamps are closing is the model ... | competitor bootcamp are close is the model in ... | competitor bootcamps are closing is the model ... |


In [238]:
news_df = clean_dataframe(news_df)
news_df = stem_dataframe(news_df)
news_df = lemmatize_dataframe(news_df)

In [240]:
news_df.head()

| Unnamed: 0 | title | original | category | clean | stemmed | lemmatized |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Navi Mutual Fund offers SIPs starting at ₹500 | Navi Mutual Fund is offering investors the opt... | business | navi mutual fund is offering investors the opt... | navi mutual fund is offer investor the option ... | navi mutual fund is offering investor the opti... |
| 1 | 'Man who takes 6 months parental leave is a lo... | Several Twitter users criticised US-based Pala... | business | several twitter users criticised usbased palan... | sever twitter user criticis usbas palantir tec... | several twitter user criticised usbased palant... |
| 2 | Cognizant had to choose clients to serve: CEO ... | Cognizant CEO Brian Humphries has said the fir... | business | cognizant ceo brian humphries has said the fir... | cogniz ceo brian humphri ha said the firm doe ... | cognizant ceo brian humphries ha said the firm... |
| 3 | Delhi HC notice to RBI, SBI over banning UPI p... | The Delhi High Court on Thursday issued notice... | business | the delhi high court on thursday issued notice... | the delhi high court on thursday issu notic to... | the delhi high court on thursday issued notice... |
| 4 | Elon Musk and Jeff Bezos are now worth nearly ... | The combined net worth of the world's two rich... | business | the combined net worth of the world ' s two ri... | the combin net worth of the world ' s two rich... | the combined net worth of the world ' s two ri... |
