# Prepare Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords. This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

### 6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

### 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

### 9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

In [1]:
# imports
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [2]:
# 1 basic_clean

def basic_clean(string):
    '''
    basic cleaning function: lowercase, normalize, and remove characters
    ''' 
    # lowercase everything
    string = string.lower()
    # normalize unicode characters
    string = unicodedata.normalize('NFKD', string)\
    .encode('ascii','ignore')\
    .decode('utf-8', 'ignore')
    # remove anything that is not a letter, number, whitespace or a single quote
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    # replace newlines with a space so words on adjacent lines stay separate
    string = re.sub(r"\n", ' ', string)
    
    return string
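
To see what each cleaning step contributes, here is a minimal standalone sketch of the same three operations applied to a string with accented characters. The function name `basic_clean_demo` and the sample sentence are made up for illustration and are not part of the exercise.

```python
import re
import unicodedata


def basic_clean_demo(string):
    # Hypothetical helper mirroring the three cleaning steps described above.
    # 1. Lowercase everything.
    string = string.lower()
    # 2. NFKD splits accented letters into a base letter plus a combining mark;
    #    encoding to ASCII with errors ignored then drops the combining marks.
    string = (unicodedata.normalize('NFKD', string)
              .encode('ascii', 'ignore')
              .decode('utf-8', 'ignore'))
    # 3. Drop anything that is not a letter, digit, whitespace, or single quote.
    return re.sub(r"[^a-z0-9'\s]", '', string)


sample = "Café Über-naïve: it's 100% déjà vu!"
cleaned = basic_clean_demo(sample)
print(cleaned)  # cafe ubernaive it's 100 deja vu
```

Note that the hyphen is removed rather than replaced with a space, so hyphenated words are joined together; the same thing happens to the URL in the test below.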

In [3]:
# making a line of text
text = "Our 25th Anniversary Ale is here! Learn about this year's blend and find out where it's available near you here: https://fal.cn/3jkQt"

In [4]:
# testing out my basic_clean function with the text
basic_clean(text)

"our 25th anniversary ale is here learn about this year's blend and find out where it's available near you here httpsfalcn3jkqt"

In [5]:
# tokenize function

def tokenize(string):
    '''
    function to take a string and tokenize all the words
    '''
    # make the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    # use the tokenizer
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

In [8]:
# testing out the tokenize function on the text
tokenized = tokenize(text)

In [12]:
# stem function

def stem(string):
    '''
    function to take some text and apply stemming to all the words
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string

In [14]:
# trying out the stem function on the text
stem(text)

"our 25th anniversari ale is here! learn about thi year' blend and find out where it' avail near you here: https://fal.cn/3jkqt"

In [21]:
# lemmatize function

def lemmatize(string):
    '''
    function to take some text and apply lemmatization to each word
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

In [23]:
# testing out the lemmatize function on the text
lemmatize(text)

"Our 25th Anniversary Ale is here! Learn about this year's blend and find out where it's available near you here: https://fal.cn/3jkQt"

In [26]:
# remove_stopwords function

def remove_stopwords(string, extra_words=[], exclude_words=[]):
    '''
    function to take some text and remove all the stopwords
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words)
    
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words))

    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords
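
The effect of the two optional parameters can be sketched without NLTK. The small stopword set and the name `stopword_filter_demo` below are assumptions for illustration only; the function above uses `stopwords.words('english')` instead.

```python
# Stand-in stopword set (an assumption for this sketch, not the NLTK list).
BASE_STOPWORDS = {'our', 'is', 'about', 'this', 'and', 'out', 'where', 'you', 'no'}


def stopword_filter_demo(string, extra_words=None, exclude_words=None):
    # Take exclude_words out of the stopword set, then add extra_words to it.
    stopword_set = (BASE_STOPWORDS - set(exclude_words or [])) | set(extra_words or [])
    # Keep only the words that are not in the final stopword set.
    return ' '.join(word for word in string.split() if word not in stopword_set)


text_demo = "our ale is here and there is no other ale"

default_result = stopword_filter_demo(text_demo)
extra_result = stopword_filter_demo(text_demo, extra_words=['ale'])
exclude_result = stopword_filter_demo(text_demo, exclude_words=['no'])
capitalized_result = stopword_filter_demo("Our ale")

print(default_result)      # ale here there other ale
print(extra_result)        # here there other
print(exclude_result)      # ale here there no other ale
print(capitalized_result)  # Our ale
```

The last call shows that the membership check is case-sensitive: a capitalized stopword is kept unless the text has been lowercased first.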

In [27]:
# testing out remove_stopwords function on the text
remove_stopwords(text)

"Our 25th Anniversary Ale here! Learn year's blend find available near here: https://fal.cn/3jkQt"

In [36]:
import acquire
from time import strftime
import warnings
warnings.filterwarnings("ignore")

In [31]:
# Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df

# define categories
categories = ["business", "sports", "technology", "entertainment"]

# use the get_all_news_articles function from the acquire.py file

news_df = acquire.get_all_news_articles(categories)

In [33]:
# looking at the news_df 
news_df

Unnamed: 0,title,content,category
0,Facebook changes its company name to 'Meta',Facebook on Thursday announced it's changing t...,business
1,'Man who takes 6 months parental leave is a lo...,Several Twitter users criticised US-based Pala...,business
2,"Delhi HC notice to RBI, SBI over banning UPI p...",The Delhi High Court on Thursday issued notice...,business
3,Who are the top 10 new entrants on Hurun India...,Ace investor Rakesh Jhunjhunwala is the top ne...,business
4,"Legacy companies eat Ola, Ather, Tork & SmartE...",Bajaj Auto's MD Rajiv Bajaj on Thursday took a...,business
...,...,...,...
95,Looking at February 2022: Ali Fazal on wedding...,Actor Ali Fazal has revealed that his wedding ...,entertainment
96,"Chose to do stunts myself, no body doubles: Ni...",Actress Nitu Chandra has revealed that she cho...,entertainment
97,Somewhere in my mind I knew I'll survive: Mahe...,"Mahesh Manjrekar, who was diagnosed with urina...",entertainment
98,"Witnessed exorcism as child, most frightening ...",Actor Emraan Hashmi has said that he witnessed...,entertainment


In [34]:
# use all the functions to see if they work on news_df's content column
news_df['content'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0     facebook thursday announced ' changing company...
1     several twitter user criticised usbased palant...
2     delhi high court thursday issued notice rbi sb...
3     ace investor rakesh jhunjhunwala top new entra...
4     bajaj auto ' md rajiv bajaj thursday took jibe...
                            ...                        
95    actor ali fazal ha revealed wedding plan actre...
96    actress nitu chandra ha revealed chose stunt w...
97    mahesh manjrekar wa diagnosed urinary bladder ...
98    actor emraan hashmi ha said witnessed exorcism...
99    actor chris evans ha said pinch every day voic...
Name: content, Length: 100, dtype: object
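
The `ha` and `wa` tokens in the output above appear to come from lemmatizing `has` and `was` before the stopwords are removed, so the lemmatized forms no longer match the stopword list. Here is a minimal sketch of that ordering effect, using a hypothetical lookup table in place of the WordNet lemmatizer and a tiny stand-in stopword set (both are assumptions for illustration):

```python
# Stand-ins (assumptions for this sketch): a tiny stopword set and a lookup
# table that imitates the 'has' -> 'ha' and 'was' -> 'wa' results seen above.
DEMO_STOPWORDS = {'has', 'was', 'the', 'a'}
FAKE_LEMMAS = {'has': 'ha', 'was': 'wa', 'plans': 'plan'}


def lemmatize_demo(string):
    # Replace each word with its stand-in lemma when one is listed.
    return ' '.join(FAKE_LEMMAS.get(word, word) for word in string.split())


def drop_stopwords_demo(string):
    # Keep only the words that are not in the stand-in stopword set.
    return ' '.join(word for word in string.split() if word not in DEMO_STOPWORDS)


sentence = "the actor has revealed wedding plans"

lemmatize_first = drop_stopwords_demo(lemmatize_demo(sentence))
stopwords_first = lemmatize_demo(drop_stopwords_demo(sentence))

print(lemmatize_first)  # actor ha revealed wedding plan
print(stopwords_first)  # actor revealed wedding plan
```

Passing the leftover tokens through `extra_words` is another way to handle this without reordering the steps.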

In [37]:
# Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df
today = strftime('%Y-%m-%d')
codeup_df = acquire.get_blog_articles()





In [39]:
codeup_df.head() # check_yo_head

Unnamed: 0,title,published,content
0,Codeup Launches First Podcast: Hire Tech,"Aug 25, 2021",Any podcast enthusiasts out there? We are plea...
1,Why Should I Become a System Administrator?,"Aug 23, 2021","With so many tech careers in demand, why choos..."
2,Announcing our Candidacy for Accreditation!,"Jun 30, 2021",Did you know that even though we’re an indepen...
3,Codeup Takes Over More of the Historic Vogue B...,"Jun 21, 2021",Codeup is moving into another floor of our His...
4,Inclusion at Codeup During Pride Month (and Al...,"Jun 4, 2021",Happy Pride Month! Pride Month is a dedicated ...


In [40]:
#  For each dataframe, produce the following columns: title, original, clean, stemmed, lemmatized

def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    Takes in a df and the string name of a text column, with optional lists
    for extra_words and exclude_words, and returns a df with the article
    title, the original text, and the clean, stemmed, and lemmatized text,
    each tokenized and with stopwords removed.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

In [41]:
# testing out the prep_article_data function on the news_df
prep_article_data(news_df, 'content', extra_words = ['ha'], exclude_words = ['no']).head()

Unnamed: 0,title,content,clean,stemmed,lemmatized
0,Facebook changes its company name to 'Meta',Facebook on Thursday announced it's changing t...,facebook thursday announced ' changing company...,facebook thursday announc ' chang compani ' na...,facebook thursday announced ' changing company...
1,'Man who takes 6 months parental leave is a lo...,Several Twitter users criticised US-based Pala...,several twitter users criticised usbased palan...,sever twitter user criticis usbas palantir tec...,several twitter user criticised usbased palant...
2,"Delhi HC notice to RBI, SBI over banning UPI p...",The Delhi High Court on Thursday issued notice...,delhi high court thursday issued notice rbi sb...,delhi high court thursday issu notic rbi sbi n...,delhi high court thursday issued notice rbi sb...
3,Who are the top 10 new entrants on Hurun India...,Ace investor Rakesh Jhunjhunwala is the top ne...,ace investor rakesh jhunjhunwala top new entra...,ace investor rakesh jhunjhunwala top new entra...,ace investor rakesh jhunjhunwala top new entra...
4,"Legacy companies eat Ola, Ather, Tork & SmartE...",Bajaj Auto's MD Rajiv Bajaj on Thursday took a...,bajaj auto ' md rajiv bajaj thursday took jibe...,bajaj auto ' md rajiv bajaj thursday took jibe...,bajaj auto ' md rajiv bajaj thursday took jibe...
