# Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

    - Lowercase everything
    - Normalize unicode characters
    - Remove anything that is not a letter, number, whitespace, or a single quote.

<br>

2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.
<br>

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.
<br>

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.
<br>

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

    - This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.
<br>

6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.
<br>

7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.
<br>

8. For each dataframe, produce the following columns:

    - title to hold the title
    - original to hold the original article/post content
    - clean to hold the normalized and tokenized original with the stopwords removed.
    - stemmed to hold the stemmed version of the cleaned data.
    - lemmatized to hold the lemmatized version of the cleaned data.
<br>

9. Ask yourself:

    - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

In [18]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import acquire

import warnings
warnings.filterwarnings('ignore')

<hr style="border:2px solid black"> </hr>

In [19]:
#specify the categories desired
categories = ["business", "sports", "technology", "entertainment"]

#use function from acquire.py
news_df = acquire.get_all_news_articles(categories)

In [3]:
#take a look
news_df.head()

| | title | content | category |
|---|---|---|---|
| 0 | Reliance Industries vaccinates 98% of workers,... | Reliance Industries has said in a statement th... | business |
| 1 | I will most likely not be on future earnings c... | Tesla CEO and the world's second-richest perso... | business |
| 2 | Musk criticises Apple's 'walled garden', cobal... | Tesla's billionaire CEO Elon Musk criticised A... | business |
| 3 | Speculation around our plans for crypto not tr... | Amazon on Monday denied speculations that it w... | business |
| 4 | Govt may lower import duty on EVs if Tesla man... | The government is open to consider reducing im... | business |


In [4]:
#use the first article (index 0) as the test string
test_string = news_df.content[0]
test_string

'Reliance Industries has said in a statement that over 98% of its workers have received at least one dose of COVID-19 vaccine so far. The billionaire Mukesh Ambani-led conglomerate had over 2.36 lakh employees, of March 31. Besides Reliance, Hindustan Unilever has also given at least one shot to 90% of employees, while Infosys inoculated 59% employees and TCS 70%.'

<hr style="border:2px solid black"> </hr>

### #1. Define a function named basic_clean

In [5]:
def basic_clean(string):
    '''
    This function takes in the original text as a string.
    The text is lowercased and unicode-normalized (NFKD),
    then encoded to ascii, ignoring any characters that cannot be encoded,
    and decoded back into a string.
    Finally, anything that is not a lowercase letter, number, whitespace,
    or single quote is removed.
    The cleaned text is returned.
    '''
    #lowercase
    string = string.lower()
    
    #normalize
    string = unicodedata.normalize('NFKD', string)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')
    
    #remove special characters by replacing them with an empty string
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    
    return string


In [6]:
#make sure function works
basic_clean(test_string)

'reliance industries has said in a statement that over 98 of its workers have received at least one dose of covid19 vaccine so far the billionaire mukesh ambaniled conglomerate had over 236 lakh employees of march 31 besides reliance hindustan unilever has also given at least one shot to 90 of employees while infosys inoculated 59 employees and tcs 70'
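
The normalize step in basic_clean can be hard to picture from the news article alone, because that text is already plain ASCII. The sketch below isolates that step on a made-up sample string; `to_ascii` is a hypothetical helper name used only for this illustration, and it mirrors the normalize/encode/decode chain above.

```python
import unicodedata

def to_ascii(text):
    # NFKD splits an accented character into a base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # encoding to ASCII with errors='ignore' drops the combining marks,
    # along with any other character that has no ASCII form
    return decomposed.encode('ascii', 'ignore').decode('utf-8', 'ignore')

print(to_ascii('café renée'))
# cafe renee

# a curly apostrophe (U+2019) has no ASCII decomposition, so it is dropped
print(to_ascii('codeup\u2019s'))
# codeups
```

Note that the curly apostrophe is removed rather than converted to a straight single quote, so titles such as the Codeup ones lose their apostrophes at this step.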

<hr style="border:1px solid black"> </hr>

### #2. Define a function named tokenize

In [7]:
def tokenize(string):
    '''
    This function takes in a string
    and returns its tokens joined back into a single space-separated string
    '''
    #create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()

    #use the tokenizer
    string = tokenizer.tokenize(string, return_str = True)

    return string

In [8]:
#make sure function works
tokenize(test_string)

'Reliance Industries has said in a statement that over 98 % of its workers have received at least one dose of COVID-19 vaccine so far. The billionaire Mukesh Ambani-led conglomerate had over 2.36 lakh employees , of March 31. Besides Reliance , Hindustan Unilever has also given at least one shot to 90 % of employees , while Infosys inoculated 59 % employees and TCS 70 % .'

<hr style="border:1px solid black"> </hr>

### #3. Define a function named stem

In [9]:
def stem(text):
    '''
    This function takes in text
    and returns the text with each word replaced by its stem
    '''
    #create porter stemmer
    ps = nltk.stem.PorterStemmer()
    
    #split the text on whitespace and stem each word
    stems = [ps.stem(word) for word in text.split()]
    
    #join the stems back into a single string
    text_stemmed = ' '.join(stems)

    return text_stemmed

In [10]:
#make sure function works, only root words (no past tense)
stem(test_string)

'relianc industri ha said in a statement that over 98% of it worker have receiv at least one dose of covid-19 vaccin so far. the billionair mukesh ambani-l conglomer had over 2.36 lakh employees, of march 31. besid reliance, hindustan unilev ha also given at least one shot to 90% of employees, while infosi inocul 59% employe and tc 70%.'

<hr style="border:1px solid black"> </hr>

### #4. Define a function named lemmatize.

In [11]:
def lemmatize(text):
    '''
    This function takes in text
    and returns the text with each word replaced by its lemma
    '''
    #create the lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    #split the text on whitespace and lemmatize each word
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    
    #join the lemmatized words back into a single string
    text_lemmatized = ' '.join(lemmas)

    return text_lemmatized

In [12]:
#make sure function works
lemmatize(test_string)

'Reliance Industries ha said in a statement that over 98% of it worker have received at least one dose of COVID-19 vaccine so far. The billionaire Mukesh Ambani-led conglomerate had over 2.36 lakh employees, of March 31. Besides Reliance, Hindustan Unilever ha also given at least one shot to 90% of employees, while Infosys inoculated 59% employee and TCS 70%.'

<hr style="border:1px solid black"> </hr>

### #5. Define a function named remove_stopwords

In [15]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in text, plus optional lists of extra words to treat
    as stopwords and words to exclude from the stopword list,
    and returns the text as a string with the stopwords removed
    '''
    #create stopword list
    stopword_list = stopwords.words('english')
    
    #remove excluded words from list
    stopword_list = set(stopword_list) - set(exclude_words or [])
    
    #add the extra words to the list
    stopword_list = stopword_list.union(set(extra_words or []))
    
    #split the string into different words
    words = string.split()
    
    #create a list of words that are not in the list
    filtered_words = [word for word in words if word not in stopword_list]
    
    #join the words that are not stopwords (filtered words) back into the string
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [16]:
remove_stopwords(test_string)

'Reliance Industries said statement 98% workers received least one dose COVID-19 vaccine far. The billionaire Mukesh Ambani-led conglomerate 2.36 lakh employees, March 31. Besides Reliance, Hindustan Unilever also given least one shot 90% employees, Infosys inoculated 59% employees TCS 70%.'
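
The call above only exercises the defaults. The sketch below shows how extra_words and exclude_words change the result, using the same set arithmetic as remove_stopwords. To keep it runnable without the NLTK corpus, it uses `DEMO_STOPWORDS`, a tiny hypothetical stand-in for `stopwords.words('english')`, and a separate function name, `remove_stopwords_demo`; both are assumptions made for illustration.

```python
# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose
DEMO_STOPWORDS = {'the', 'of', 'has', 'in', 'a', 'that', 'its'}

def remove_stopwords_demo(string, extra_words=None, exclude_words=None):
    # drop the excluded words from the stopword set, then add the extra ones
    stopword_set = (DEMO_STOPWORDS - set(exclude_words or [])) | set(extra_words or [])
    # keep only the words that are not stopwords
    return ' '.join(word for word in string.split() if word not in stopword_set)

text = 'the billionaire has said that the company has a plan'

print(remove_stopwords_demo(text))
# billionaire said company plan

print(remove_stopwords_demo(text, extra_words=['said']))
# billionaire company plan

print(remove_stopwords_demo(text, exclude_words=['has']))
# billionaire has said company has plan

# the comparison is case-sensitive, so a capitalised 'The' is kept
print(remove_stopwords_demo('The billionaire'))
# The billionaire
```

The last call matches what the output above shows: 'The billionaire' survives because the test string was not lowercased first. Running basic_clean before remove_stopwords avoids this.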

<hr style="border:1px solid black"> </hr>

### #6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

In [17]:
#take a look at news_df
news_df.head()

| | title | content | category |
|---|---|---|---|
| 0 | Reliance Industries vaccinates 98% of workers,... | Reliance Industries has said in a statement th... | business |
| 1 | I will most likely not be on future earnings c... | Tesla CEO and the world's second-richest perso... | business |
| 2 | Speculation around our plans for crypto not tr... | Amazon on Monday denied speculations that it w... | business |
| 3 | Musk criticises Apple's 'walled garden', cobal... | Tesla's billionaire CEO Elon Musk criticised A... | business |
| 4 | Factually incorrect: INOX on report of Amazon ... | INOX Leisure denied a report that claimed Amaz... | business |


In [18]:
#apply the functions just created (all except stem) to the content column to prep
news_df['content'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0     reliance industry ha said statement 98 worker ...
1     tesla ceo world ' secondrichest person elon mu...
2     amazon monday denied speculation wa looking ac...
3     tesla ' billionaire ceo elon musk criticised a...
4     inox leisure denied report claimed amazon indi...
                            ...                        
93    marathi actor umesh kamat ha issued statement ...
94    veteran actress savita bajaj unwell facing fin...
95    shefali shah speaking international emmy award...
96    taking instagram story singer aditya narayan d...
97    actor adil hussain feel bollywood rope actor n...
Name: content, Length: 98, dtype: object

In [17]:
#apply the functions just created (all except stem) to the title column to prep
news_df['title'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0     reliance industry vaccinates 98 worker hul ino...
1       likely future earnings call tesla ceo elon musk
2     musk criticises apple ' ' walled garden ' coba...
3     speculation around plan crypto true amazon bit...
4     govt may lower import duty ev tesla manufactur...
                            ...                        
93    got purely merit ' incredible kubbra sait int ...
94             love shah rukh khan would love work arya
95              ' part aditya participating bigg bos 15
96      ' delhi crime ' changed everything shefali shah
97    next 72 hour critical shoaib give update dad '...
Name: title, Length: 98, dtype: object
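
The prepared titles above still contain free-standing `'` tokens (for example `apple ' ' walled garden '`). This appears to happen because basic_clean keeps single quotes and the tokenizer then separates them from the neighbouring words. If those tokens are unwanted, one option is a small post-processing step. The sketch below is an optional addition, not part of the exercise, and `drop_stray_apostrophes` is a hypothetical helper name.

```python
def drop_stray_apostrophes(text):
    # keep a token only if something is left after stripping
    # apostrophes from both of its ends
    return ' '.join(token for token in text.split() if token.strip("'"))

print(drop_stray_apostrophes("musk criticises apple ' ' walled garden '"))
# musk criticises apple walled garden

# apostrophes inside a word are left alone
print(drop_stray_apostrophes("don't stop"))
# don't stop
```

It could be chained after remove_stopwords with another `.apply(...)` call.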

<hr style="border:1px solid black"> </hr>

### #7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [20]:
#bring in blogs using acquire.py
codeup_df = acquire.acquire_codeup_blog()

In [20]:
#take a look at the material
codeup_df.head()

| | title | published_date | blog_image | content |
|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | September 30, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | October 31, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | October 17, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | August 14, 2018 | | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | August 14, 2018 | | Competitor Bootcamps Are Closing. Is the Model... |


In [21]:
#apply the functions just created (all except stem) to the content column to prep
codeup_df['content'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0    rumor true time ha arrived codeup ha officiall...
1    dimitri antoniou maggie giust data science big...
2    dimitri antoniou week ago codeup launched imme...
3    sa tech job fair third biannual san antonio te...
4    competitor bootcamps closing model danger prog...
Name: content, dtype: object

<hr style="border:1px solid black"> </hr>

### #8. For each dataframe, make columns: title, original, clean, stemmed, lemmatized

In [23]:
################################ PREP ARTICLES ################################

#take dataframe, specify the column, extra and exclude words
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words, and
    returns a df with the article title, the original text, and the clean,
    stemmed, and lemmatized versions of the text, each with stopwords removed.
    '''
    #chain together clean, tokenize, remove stopwords
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    #chain clean, tokenize, stem, remove stopwords
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    #chain clean, tokenize, lemmatize, remove stopwords
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

In [24]:
prep_article_data(news_df, 'content', extra_words =[], exclude_words=[])

| | title | content | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Reliance Industries vaccinates 98% of workers,... | Reliance Industries has said in a statement th... | reliance industries said statement 98 workers ... | relianc industri ha said statement 98 worker r... | reliance industry ha said statement 98 worker ... |
| 1 | I will most likely not be on future earnings c... | Tesla CEO and the world's second-richest perso... | tesla ceo world ' secondrichest person elon mu... | tesla ceo world ' secondrichest person elon mu... | tesla ceo world ' secondrichest person elon mu... |
| 2 | Speculation around our plans for crypto not tr... | Amazon on Monday denied speculations that it w... | amazon monday denied speculations looking acce... | amazon monday deni specul wa look accept bitco... | amazon monday denied speculation wa looking ac... |
| 3 | Musk criticises Apple's 'walled garden', cobal... | Tesla's billionaire CEO Elon Musk criticised A... | tesla ' billionaire ceo elon musk criticised a... | tesla ' billionair ceo elon musk criticis appl... | tesla ' billionaire ceo elon musk criticised a... |
| 4 | Factually incorrect: INOX on report of Amazon ... | INOX Leisure denied a report that claimed Amaz... | inox leisure denied report claimed amazon indi... | inox leisur deni report claim amazon india dis... | inox leisure denied report claimed amazon indi... |
| ... | ... | ... | ... | ... | ... |
| 93 | Among all my co-stars, I am closest to Tiger S... | Tara Sutaria, who made her Bollywood debut wit... | tara sutaria made bollywood debut tiger shroff... | tara sutaria made bollywood debut tiger shroff... | tara sutaria made bollywood debut tiger shroff... |
| 94 | I love every chapter of my life: Arjun on his ... | In his latest Instagram post, Arjun Kapoor sha... | latest instagram post arjun kapoor shared ' ' ... | hi latest instagram post arjun kapoor share ' ... | latest instagram post arjun kapoor shared ' ' ... |
| 95 | I love Shah Rukh Khan, would love to work with... | Actor Arya, whose Tamil-language sports action... | actor arya whose tamillanguage sports action f... | actor arya whose tamillanguag sport action fil... | actor arya whose tamillanguage sport action fi... |
| 96 | It was really bad: South actress Parul on expe... | South Indian actress Parul Yadav spoke about t... | south indian actress parul yadav spoke time go... | south indian actress parul yadav spoke time go... | south indian actress parul yadav spoke time go... |
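
The exercise asks for a column named original, while prep_article_data returns the text column under its existing name (content in the output above). A minimal way to close that gap, assuming pandas is available, is to rename the column on the returned dataframe. `rename_to_original` is a hypothetical helper, and the one-row dataframe stands in for the real output.

```python
import pandas as pd

def rename_to_original(prepped_df, column):
    # rename returns a new dataframe, so the input is left unchanged
    return prepped_df.rename(columns={column: 'original'})

# small stand-in for the dataframe returned by prep_article_data
demo = pd.DataFrame({
    'title': ['Example title'],
    'content': ['Some Example Text'],
    'clean': ['example text'],
    'stemmed': ['exampl text'],
    'lemmatized': ['example text'],
})

result = rename_to_original(demo, 'content')
print(list(result.columns))
# ['title', 'original', 'clean', 'stemmed', 'lemmatized']
```

In the notebook this would be called as `rename_to_original(prep_article_data(news_df, 'content'), 'content')`, and likewise for codeup_df.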


<hr style="border:1px solid black"> </hr>

### #9. Ask yourself

If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?

In [None]:
lemmatized - with a dataset this small, the slower speed of lemmatization is not a problem

If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?

In [None]:
stemmed - with a larger dataset, the faster speed of stemming matters more
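
The answers above rest on the speed difference between the two approaches. A rough way to check that on your own data is to time both functions on a sample of the corpus. The sketch below is only a timing harness: `seconds_per_pass` is a hypothetical helper, and `fake_stem` and `fake_lemmatize` are placeholder functions so the block runs on its own. In the notebook you would pass the stem and lemmatize functions defined earlier instead.

```python
import time

def seconds_per_pass(transform, texts, repeats=3):
    # best-of-N wall-clock time for applying `transform` to every text
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            transform(text)
        best = min(best, time.perf_counter() - start)
    return best

# Placeholder transforms (assumptions for illustration only);
# they do not perform real stemming or lemmatization.
def fake_stem(text):
    return ' '.join(word[:5] for word in text.split())

def fake_lemmatize(text):
    return ' '.join(word.rstrip('s') for word in text.split())

sample = ['reliance industries has said in a statement'] * 1000

stem_time = seconds_per_pass(fake_stem, sample)
lemma_time = seconds_per_pass(fake_lemmatize, sample)
print(f'stem: {stem_time:.4f}s  lemmatize: {lemma_time:.4f}s')
```

Scaling the measured time per article up to the full corpus size gives a rough estimate of what each option would cost.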

If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?