# Prepare Exercises

__1) Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:__

* Lowercase everything
* Normalize unicode characters
* Remove anything that is not a letter, number, whitespace, or a single quote.

In [1]:
import numpy as np
import pandas as pd
import acquire
import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

In [2]:
#Acquire data for testing
blog_articles = acquire.get_blog_articles()

news_articles = acquire.get_news_articles()

In [3]:
def basic_clean(string):
    """
    This function will perform basic cleaning of a string. It will reduce all characters 
    to lower case, normalize unicode characters, and remove anything that is not a 
    letter, number, whitespace, or a single quote.
    """
    
    #Lower case everything
    string = string.lower()
    
    #Normalize unicode characters, 
    #encode into ascii byte strings and ignore unknown chars,
    #decode back into a UTF-8 string that we can work with
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('UTF-8')
    
    #Use regex to replace anything that is not a letter, number, whitespace, or a single quote
    string = re.sub(r"[^a-z0-9\s']", '', string)
    
    return string

In [4]:
original = news_articles.content[0]
original

'A WHO technical advisory group which met on Tuesday to consider Bharat Biotech\'s COVID-19 vaccine Covaxin for emergency use listing is likely to announce its decision soon. "If all is in place and all goes well and if the committee is satisfied, we would expect a recommendation within the next 24 hours or so," WHO spokesperson Margaret Harris told reporters. '

In [5]:
#For testing
cleaned = basic_clean(original)
cleaned

"a who technical advisory group which met on tuesday to consider bharat biotech's covid19 vaccine covaxin for emergency use listing is likely to announce its decision soon if all is in place and all goes well and if the committee is satisfied we would expect a recommendation within the next 24 hours or so who spokesperson margaret harris told reporters "

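To see what each step of the cleaning contributes, the sketch below runs it on three short strings. The helper name `ascii_fold` and the sample strings are illustrative and not part of the exercise. Characters that have no ASCII decomposition, such as the curly apostrophe in "Codeup’s", are dropped entirely rather than converted.

```python
import re
import unicodedata

def ascii_fold(text):
    """Hypothetical helper: decompose accented characters, then drop non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Sample strings chosen for illustration (escapes keep the source ASCII-only):
# an accented e, a curly apostrophe, and a hyphenated term
samples = ["fiance\u00e9", "Codeup\u2019s", "COVID-19"]

# Step 1 and 2: lowercase, then fold to ASCII
folded = [ascii_fold(s.lower()) for s in samples]

# Step 3: strip everything except letters, digits, whitespace, and single quotes
cleaned = [re.sub(r"[^a-z0-9\s']", '', s) for s in folded]

print(folded)   # ['fiancee', 'codeups', 'covid-19']
print(cleaned)  # ['fiancee', 'codeups', 'covid19']
```

The accent is folded to a plain `e`, the curly apostrophe disappears during the ASCII encode, and the hyphen is removed by the regular expression.
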
__2) Define a function named tokenize. It should take in a string and tokenize all the words in the string.__

In [6]:
def tokenize(string):
    """
    This function will tokenize all the words in the given string and return the 
    tokenized string.
    """
    
    #Create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    #Use the tokenizer
    string = tokenizer.tokenize(string, return_str = True)
    
    return string

In [7]:
#For testing
#untokenized
cleaned

"a who technical advisory group which met on tuesday to consider bharat biotech's covid19 vaccine covaxin for emergency use listing is likely to announce its decision soon if all is in place and all goes well and if the committee is satisfied we would expect a recommendation within the next 24 hours or so who spokesperson margaret harris told reporters "

In [8]:
#Tokenized
tokenized = tokenize(cleaned)
tokenized

"a who technical advisory group which met on tuesday to consider bharat biotech ' s covid19 vaccine covaxin for emergency use listing is likely to announce its decision soon if all is in place and all goes well and if the committee is satisfied we would expect a recommendation within the next 24 hours or so who spokesperson margaret harris told reporters"

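Because `return_str = True` gives back one space-delimited string, later steps can recover the individual tokens with `str.split()`. The short check below uses a fragment copied from the tokenized output; it only illustrates the string format and does not call the tokenizer itself.

```python
# Fragment copied from the tokenized output
fragment = "bharat biotech ' s covid19 vaccine"

# Splitting on whitespace recovers one list entry per token
tokens = fragment.split()

print(tokens)  # ['bharat', 'biotech', "'", 's', 'covid19', 'vaccine']
```

The possessive in "biotech's" has become two separate tokens, `'` and `s`, which is why a lone apostrophe can show up in text produced by later steps.
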
__3) Define a function named stem. It should accept some text and return the text after applying stemming to all the words.__

In [9]:
def stem(string):
    """
    This function will accept some text and return a stemmed version of the text.
    """
    
    #Create the Porter stemmer
    ps = nltk.stem.PorterStemmer()
    
    #Apply the stemmer to each word in the string to create a list of stemmed words
    stems = [ps.stem(word) for word in string.split()]
    
    #join our list of stemmed words into a string
    string_stemmed = ' '.join(stems)
    
    return string_stemmed

In [10]:
#For testing
#unstemmed
tokenized

"a who technical advisory group which met on tuesday to consider bharat biotech ' s covid19 vaccine covaxin for emergency use listing is likely to announce its decision soon if all is in place and all goes well and if the committee is satisfied we would expect a recommendation within the next 24 hours or so who spokesperson margaret harris told reporters"

In [11]:
#stemmed
stemmed = stem(tokenized)
stemmed

"a who technic advisori group which met on tuesday to consid bharat biotech ' s covid19 vaccin covaxin for emerg use list is like to announc it decis soon if all is in place and all goe well and if the committe is satisfi we would expect a recommend within the next 24 hour or so who spokesperson margaret harri told report"

__4) Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.__

In [12]:
# Need to download this the first time.
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/johnathonsmith/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
def lemmatize(string):
    """
    This function accepts some text and returns the lemmatized version of the string.
    """
    
    #Create the lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    #Use the lemmatizer on each word in the string to create a list of lemmatized words
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    #Join the lemmatized words into one string
    string_lemmatized = ' '.join(lemmas)
    
    return string_lemmatized

In [14]:
#For testing
#Unlemmatized
tokenized

"a who technical advisory group which met on tuesday to consider bharat biotech ' s covid19 vaccine covaxin for emergency use listing is likely to announce its decision soon if all is in place and all goes well and if the committee is satisfied we would expect a recommendation within the next 24 hours or so who spokesperson margaret harris told reporters"

In [15]:
#Lemmatized
lemmatized = lemmatize(tokenized)
lemmatized

"a who technical advisory group which met on tuesday to consider bharat biotech ' s covid19 vaccine covaxin for emergency use listing is likely to announce it decision soon if all is in place and all go well and if the committee is satisfied we would expect a recommendation within the next 24 hour or so who spokesperson margaret harris told reporter"

__5) Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.__

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [16]:
#Download the stopword corpus
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/johnathonsmith/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
def remove_stopwords(string, extra_words = None, exclude_words = None):
    """
    This function will accept a string and return a version of the text without any stopwords.
    It will also allow the user to add extra words to remove or exclude words from the removal list.
    """
    #Get the standard english stop word list from nltk as a set for fast lookups
    stop_words = set(stopwords.words('english'))
    
    #Add the extra words to be removed to the stop word set
    stop_words |= set(extra_words or [])
    
    #Remove the words to be excluded from the stop word set
    #(set difference does not fail if a word is not in the set)
    stop_words -= set(exclude_words or [])
    
    #Create a list of words to be checked by splitting the given string
    words = string.split()
    
    #Now filter out all of the stop words
    filtered_words = [word for word in words if word not in stop_words]
    
    #Join the list of filtered words into a string
    filtered_string = ' '.join(filtered_words)
    
    return filtered_string

In [18]:
#For testing
#String with stop words
lemmatized

"a who technical advisory group which met on tuesday to consider bharat biotech ' s covid19 vaccine covaxin for emergency use listing is likely to announce it decision soon if all is in place and all go well and if the committee is satisfied we would expect a recommendation within the next 24 hour or so who spokesperson margaret harris told reporter"

In [19]:
#Create a list of extra words and words to exclude
extra_words = ['group', 'met', 'tuesday']
exclude_words = ['the']

In [20]:
#String without stop words
filtered = remove_stopwords(lemmatized, extra_words, exclude_words)
filtered

"technical advisory consider bharat biotech ' covid19 vaccine covaxin emergency use listing likely announce decision soon place go well the committee satisfied would expect recommendation within the next 24 hour spokesperson margaret harris told reporter"

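One edge case to keep in mind when maintaining a stop word collection: `list.remove` raises `ValueError` if the word is not present, so an `exclude_words` entry that is not actually a stop word would fail with a list, whereas set difference simply ignores it. A minimal sketch, using a tiny hypothetical stop word list as a stand-in for NLTK's:

```python
# Hypothetical, tiny stand-in for the NLTK English stop word list
stop_list = ['a', 'the', 'is', 'to']

# 'covaxin' is not in the list, so list.remove raises ValueError
try:
    stop_list.remove('covaxin')
    removed_ok = True
except ValueError:
    removed_ok = False

# Set operations tolerate missing entries and make membership checks fast:
# add two extra words, then exclude 'the' and the absent 'covaxin'
stop_set = (set(stop_list) | {'group', 'met'}) - {'the', 'covaxin'}

words = "a group met to test the covaxin vaccine".split()
filtered = [word for word in words if word not in stop_set]

print(removed_ok)          # False
print(' '.join(filtered))  # test the covaxin vaccine
```
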
__6) Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.__

In [21]:
news_df = acquire.get_news_articles()
news_df

| Unnamed: 0 | title | author | date | category | content |
|---|---|---|---|---|---|
| 0 | India's Covaxin may get WHO approval in next 2... | Kiran Khatri | 26 Oct 2021 | business | A WHO technical advisory group which met on Tu... |
| 1 | I decided to support Doge as it felt like the ... | Pragya Swastik | 26 Oct 2021 | business | Tesla CEO and the world's richest person Elon ... |
| 2 | Which companies have $1 trillion or more marke... | Pragya Swastik | 26 Oct 2021 | business | Tesla has become the latest company to surpass... |
| 3 | Elon Musk tweets 'Wild $T1mes' after Tesla hit... | Pragya Swastik | 26 Oct 2021 | business | Tesla CEO and the world's richest person Elon ... |
| 4 | How many years did it take for various compan... | Pragya Swastik | 26 Oct 2021 | business | Tesla took 18 years to hit the $1-trillion m-c... |
| ... | ... | ... | ... | ... | ... |
| 95 | 'Rust' shooting an avoidable tragedy: Brandon ... | Kriti Kambiri | 26 Oct 2021 | entertainment | Late actor Brandon Lee's fianceé Eliza Hutton ... |
| 96 | I've only made 5 really good films in my caree... | Kriti Kambiri | 26 Oct 2021 | entertainment | Actress Kristen Stewart has said that she thin... |
| 97 | Jr NTR's fan injured in accident, actor helps ... | Kriti Kambiri | 26 Oct 2021 | entertainment | A fan of Telugu actor Jr NTR was injured in a ... |
| 98 | It seems unreal: Adarsh on working with Meryl,... | Udit Gupta | 26 Oct 2021 | entertainment | Adarsh Gourav has started shooting for 'Extrap... |


__7) Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.__

In [22]:
codeup_df = acquire.get_blog_articles()
codeup_df

| Unnamed: 0 | title | date | category | content |
|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | Sep 30, 2018 | Data Science | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | Oct 31, 2018 | Data Science | By Dimitri Antoniou and Maggie Giust Data Scie... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | Oct 17, 2018 | Data Science | By Dimitri Antoniou A week ago, Codeup launche... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | Aug 14, 2018 | Tips for Prospective Students | SA Tech Job Fair The third bi-annual San Anton... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Aug 14, 2018 | Codeup News | Competitor Bootcamps Are Closing. Is the Model... |


__8) For each dataframe, produce the following columns:__

* title to hold the title
* original to hold the original article/post content
* clean to hold the normalized and tokenized original with the stopwords removed.
* stemmed to hold the stemmed version of the cleaned data.
* lemmatized to hold the lemmatized version of the cleaned data.

In [23]:
news_df.rename(columns = {'content':'original'}, inplace = True)
codeup_df.rename(columns = {'content':'original'}, inplace = True)

In [24]:
#news_df first
news_df['clean'] = news_df['original']

#apply the basic_clean and tokenize functions
news_df['clean'] = news_df['clean'].apply(basic_clean)
news_df['clean'] = news_df['clean'].apply(tokenize)

#create the stemmed column
news_df['stemmed'] = news_df['clean']

#apply the stem function, then remove stopwords
news_df['stemmed'] = news_df['stemmed'].apply(stem).apply(remove_stopwords)

#create the lemmatized column
news_df['lemmatized'] = news_df['clean']

#apply the lemmatize function, then remove stopwords
news_df['lemmatized'] = news_df['lemmatized'].apply(lemmatize).apply(remove_stopwords)

#apply the remove_stopwords function to the 'clean' column
news_df['clean'] = news_df['clean'].apply(remove_stopwords)

news_df

| Unnamed: 0 | title | author | date | category | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|---|---|
| 0 | India's Covaxin may get WHO approval in next 2... | Kiran Khatri | 26 Oct 2021 | business | A WHO technical advisory group which met on Tu... | technical advisory group met tuesday consider ... | technic advisori group met tuesday consid bhar... | technical advisory group met tuesday consider ... |
| 1 | I decided to support Doge as it felt like the ... | Pragya Swastik | 26 Oct 2021 | business | Tesla CEO and the world's richest person Elon ... | tesla ceo world ' richest person elon musk sai... | tesla ceo world ' richest person elon musk sai... | tesla ceo world ' richest person elon musk sai... |
| 2 | Which companies have $1 trillion or more marke... | Pragya Swastik | 26 Oct 2021 | business | Tesla has become the latest company to surpass... | tesla become latest company surpass 1 trillion... | tesla ha becom latest compani surpass 1 trilli... | tesla ha become latest company surpass 1 trill... |
| 3 | Elon Musk tweets 'Wild $T1mes' after Tesla hit... | Pragya Swastik | 26 Oct 2021 | business | Tesla CEO and the world's richest person Elon ... | tesla ceo world ' richest person elon musk twe... | tesla ceo world ' richest person elon musk twe... | tesla ceo world ' richest person elon musk twe... |
| 4 | How many years did it take for various compan... | Pragya Swastik | 26 Oct 2021 | business | Tesla took 18 years to hit the $1-trillion m-c... | tesla took 18 years hit 1trillion mcap milesto... | tesla took 18 year hit 1trillion mcap mileston... | tesla took 18 year hit 1trillion mcap mileston... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 'Rust' shooting an avoidable tragedy: Brandon ... | Kriti Kambiri | 26 Oct 2021 | entertainment | Late actor Brandon Lee's fianceé Eliza Hutton ... | late actor brandon lee ' fiancee eliza hutton ... | late actor brandon lee ' fiance eliza hutton c... | late actor brandon lee ' fiancee eliza hutton ... |
| 96 | I've only made 5 really good films in my caree... | Kriti Kambiri | 26 Oct 2021 | entertainment | Actress Kristen Stewart has said that she thin... | actress kristen stewart said thinks done five ... | actress kristen stewart ha said think ha onli ... | actress kristen stewart ha said think ha done ... |
| 97 | Jr NTR's fan injured in accident, actor helps ... | Kriti Kambiri | 26 Oct 2021 | entertainment | A fan of Telugu actor Jr NTR was injured in a ... | fan telugu actor jr ntr injured road accident ... | fan telugu actor jr ntr wa injur road accid an... | fan telugu actor jr ntr wa injured road accide... |
| 98 | It seems unreal: Adarsh on working with Meryl,... | Udit Gupta | 26 Oct 2021 | entertainment | Adarsh Gourav has started shooting for 'Extrap... | adarsh gourav started shooting ' extrapolation... | adarsh gourav ha start shoot ' extrapol ' eigh... | adarsh gourav ha started shooting ' extrapolat... |
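
The stemmed and lemmatized columns contain fragments such as `ha` and `wa`. These appear because `stem` and `lemmatize` run before `remove_stopwords`: once `has` has been reduced to `ha`, it no longer matches the stop word list. The sketch below reproduces the effect with a deliberately crude stand-in stemmer (`toy_stem`, which only strips a trailing `s`) and a two-word hypothetical stop list. It is not the Porter algorithm.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one trailing 's'."""
    return word[:-1] if word.endswith('s') and len(word) > 2 else word

# Hypothetical stop list for illustration only
stop_words = {'has', 'was'}

def drop_stops(words):
    """Keep only the words that are not in the stop list."""
    return [w for w in words if w not in stop_words]

words = "tesla has hit 18 years".split()

# Order used in the cell above: stem first, then filter
stem_then_filter = drop_stops([toy_stem(w) for w in words])

# Alternative order: filter first, then stem
filter_then_stem = [toy_stem(w) for w in drop_stops(words)]

print(stem_then_filter)  # ['tesla', 'ha', 'hit', '18', 'year']
print(filter_then_stem)  # ['tesla', 'hit', '18', 'year']
```

Removing stop words first, or stemming the already filtered `clean` column as the exercise wording suggests, would avoid the leftover fragments.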


In [25]:
#Now do the codeup_df
codeup_df['clean'] = codeup_df['original']

#apply the basic_clean and tokenize functions
codeup_df['clean'] = codeup_df['clean'].apply(basic_clean)
codeup_df['clean'] = codeup_df['clean'].apply(tokenize)

#create the stemmed column
codeup_df['stemmed'] = codeup_df['clean']

#apply the stem function, then remove stopwords
codeup_df['stemmed'] = codeup_df['stemmed'].apply(stem).apply(remove_stopwords)

#create the lemmatized column
codeup_df['lemmatized'] = codeup_df['clean']

#apply the lemmatize function, then remove stopwords
codeup_df['lemmatized'] = codeup_df['lemmatized'].apply(lemmatize).apply(remove_stopwords)

#apply the remove_stopwords function to the 'clean' column
codeup_df['clean'] = codeup_df['clean'].apply(remove_stopwords)

codeup_df

| Unnamed: 0 | title | date | category | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | Sep 30, 2018 | Data Science | The rumors are true! The time has arrived. Cod... | rumors true time arrived codeup officially ope... | rumor true time ha arriv codeup ha offici open... | rumor true time ha arrived codeup ha officiall... |
| 1 | Data Science Myths | Oct 31, 2018 | Data Science | By Dimitri Antoniou and Maggie Giust Data Scie... | dimitri antoniou maggie giust data science big... | dimitri antoni maggi giust data scienc big dat... | dimitri antoniou maggie giust data science big... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | Oct 17, 2018 | Data Science | By Dimitri Antoniou A week ago, Codeup launche... | dimitri antoniou week ago codeup launched imme... | dimitri antoni week ago codeup launch immers d... | dimitri antoniou week ago codeup launched imme... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | Aug 14, 2018 | Tips for Prospective Students | SA Tech Job Fair The third bi-annual San Anton... | sa tech job fair third biannual san antonio te... | sa tech job fair third biannual san antonio te... | sa tech job fair third biannual san antonio te... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Aug 14, 2018 | Codeup News | Competitor Bootcamps Are Closing. Is the Model... | competitor bootcamps closing model danger prog... | competitor bootcamp close model danger program... | competitor bootcamps closing model danger prog... |
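
Both dataframes go through the same sequence of steps, so the repeated cells could be folded into one reusable function. A minimal sketch, assuming a hypothetical helper named `chain` and using built-in string methods as placeholder steps:

```python
from functools import reduce

def chain(*steps):
    """Hypothetical helper: build one function that applies each step in order."""
    def run(text):
        # Feed the output of each step into the next one
        return reduce(lambda value, step: step(value), steps, text)
    return run

# Placeholder steps; in the notebook these would be basic_clean, tokenize, etc.
normalize = chain(str.lower, str.strip)

print(normalize("  Data Science Myths  "))  # data science myths
```

With the notebook's own functions this could read, for example, `news_df['clean'] = news_df['original'].apply(chain(basic_clean, tokenize, remove_stopwords))`, and the same call would work unchanged for `codeup_df`.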


__9) Ask yourself:__

* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?

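At 493KB the corpus is small, so the extra processing time that lemmatization needs is negligible. I would use lemmatized text, since lemmas are generally easier to read than stems.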


* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?

This corpus is larger, but still small enough to lemmatize in a reasonable amount of time, so I would again use lemmatized text.


* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

With a corpus this large and compute billed by usage, I would definitely use stemmed text. Stemming applies suffix rules instead of dictionary lookups, so it is much faster.
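
To put the three corpus sizes in perspective, the arithmetic below compares them. It assumes decimal units (1 MB = 10^6 bytes) and a processing cost that grows roughly linearly with corpus size; both are simplifications.

```python
# Corpus sizes from the three questions, in bytes (decimal units assumed)
small = 493 * 10**3      # 493 KB
medium = 25 * 10**6      # 25 MB
large = 200 * 10**12     # 200 TB

# How many times larger each corpus is than the previous one
print(round(medium / small, 1))  # 50.7
print(large // medium)           # 8000000
```

Under those assumptions, any per-word cost difference between stemming and lemmatization is multiplied about eight million times when moving from the 25MB corpus to the 200TB one, which is why speed dominates the last answer.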