# Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

    - Lowercase everything
    - Normalize unicode characters
    - Remove anything that is not a letter, number, whitespace, or a single quote.

<br>

2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.
<br>

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.
<br>

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.
<br>

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

    - This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.
<br>

6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.
<br>

7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.
<br>

8. For each dataframe, produce the following columns:

    - title to hold the title
    - original to hold the original article/post content
    - clean to hold the normalized and tokenized original with the stopwords removed.
    - stemmed to hold the stemmed version of the cleaned data.
    - lemmatized to hold the lemmatized version of the cleaned data.
<br>

9. Ask yourself:

    - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

In [3]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

<hr style="border:2px solid black"> </hr>

### #1. Define a function named basic_clean

In [2]:
def basic_clean(string):
    '''
    This function takes in the original text.
    The text is lowercased and any characters that are not ascii are dropped.
    Additionally, special characters are all removed.
    A clean article is then returned
    '''
    #lowercase
    string = string.lower()
    
    #normalize
    string = unicodedata.normalize('NFKD', string)\
    .encode('ascii', 'ignore')\
    .decode('utf-8')
    
    #remove special characters and replaces it with blank
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    
    return string
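
A quick way to sanity-check the three steps is to run them on a small sample. The sketch below repeats the same lowercase, normalize, and regex steps using only the standard library; the sample sentence is made up for illustration.

```python
import re
import unicodedata

# Made-up sample with capitals, accented characters, and punctuation
sample = "Café Déjà-vu: it's 100% FUN!"

# Step 1: lowercase
lowered = sample.lower()

# Step 2: decompose accented characters (NFKD), then drop the non-ASCII combining marks
normalized = (unicodedata.normalize('NFKD', lowered)
              .encode('ascii', 'ignore')
              .decode('utf-8'))

# Step 3: remove everything except letters, digits, single quotes, and whitespace
cleaned = re.sub(r"[^a-z0-9'\s]", '', normalized)

print(cleaned)  # cafe dejavu it's 100 fun
```

Note that the hyphen is removed rather than replaced with a space, so hyphenated words are joined together.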


<hr style="border:1px solid black"> </hr>

### #2. Define a function named tokenize

In [None]:
def tokenize(string):
    '''
    This function takes in a string
    and returns the string as individual tokens put back into the string
    '''
    #create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()

    #use the tokenizer
    string = tokenizer.tokenize(string, return_str = True)

    return string

<hr style="border:1px solid black"> </hr>

### #3. Define a function named stem

In [None]:
def stem(text):
    '''
    This function takes in text
    and returns the stem word joined back into the text
    '''
    #create porter stemmer
    ps = nltk.porter.PorterStemmer()
    
    #use the stem
    stems = [ps.stem(word) for word in text.split()]
    
    #join stem word to string
    text_stemmed = ' '.join(stems)

    return text_stemmed
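
As a quick check of what stemming does, the sketch below stems a few made-up inflections of one verb. It assumes only that nltk is installed; the Porter stemmer is rule-based, so no corpus download should be needed.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Made-up sample: four inflections of the same verb
sample = "call calls called calling"

# Stem each word, then join the stems back into a single string
stemmed = ' '.join(ps.stem(word) for word in sample.split())

print(stemmed)  # call call call call
```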

<hr style="border:1px solid black"> </hr>

### #4. Define a function named lemmatize.

In [None]:
def lemmatize(text):
    '''
    This function takes in text
    and returns the lemmatized word joined back into the text
    '''
    #create the lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    #lemmatize each word
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    
    #join lemmatized words into article
    text_lemmatized = ' '.join(lemmas)

    return text_lemmatized

<hr style="border:1px solid black"> </hr>

### #5. Define a function named remove_stopwords

In [None]:
def remove_stopwords(text, extra_words=None, exclude_words=None):
    '''
    This function takes in text, plus optional lists of extra words and exclude words,
    and returns the text as a string with the stopwords removed
    '''
    #standard English language stopwords list from nltk
    stopword_list = set(stopwords.words('english'))
    
    #remove excluded words (set difference ignores words that are not stopwords)
    stopword_list = stopword_list - set(exclude_words or [])
    
    #add extra words
    stopword_list = stopword_list | set(extra_words or [])
    
    words = text.split()
    filtered_words = [word for word in words if word not in stopword_list]

    print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
    print('---')

    text_without_stopwords = ' '.join(filtered_words)

    print(text_without_stopwords)
    
    return text_without_stopwords
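
The interplay of extra_words and exclude_words is easier to see on a tiny example. The sketch below uses a made-up five-word stopword list (base_stopwords, a stand-in for the nltk English list) so that it runs without any downloads; filter_words is a hypothetical helper name, not part of nltk.

```python
def filter_words(text, base_stopwords, extra_words=None, exclude_words=None):
    # Start from a copy so the caller's collection is never modified
    stopword_set = set(base_stopwords)
    # Words to keep in the text even though they are normally stopwords
    stopword_set -= set(exclude_words or [])
    # Additional words to treat as stopwords
    stopword_set |= set(extra_words or [])
    return ' '.join(word for word in text.split() if word not in stopword_set)

# Made-up stand-in for the full English stopword list
base_stopwords = ['the', 'is', 'a', 'not', 'of']
text = "the movie is not a waste of time"

default = filter_words(text, base_stopwords)
custom = filter_words(text, base_stopwords, extra_words=['movie'], exclude_words=['not'])

print(default)  # movie waste time
print(custom)   # not waste time
```

Excluding a word like "not" from the stopword list can matter for tasks such as sentiment analysis, where dropping it flips the meaning of the sentence.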

<hr style="border:1px solid black"> </hr>

### #6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.
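
One possible approach, assuming the acquire step produced a list of dictionaries with title, content, and category keys. Those key names are an assumption, so adjust them to whatever your acquire functions actually return. The two records below are made-up placeholders standing in for real scraped articles.

```python
import pandas as pd

# Placeholder records standing in for the output of the acquire step
news_articles = [
    {'title': 'Sample headline one', 'content': 'First sample article body.', 'category': 'business'},
    {'title': 'Sample headline two', 'content': 'Second sample article body.', 'category': 'sports'},
]

# A list of dictionaries becomes one row per dictionary, one column per key
news_df = pd.DataFrame(news_articles)

print(news_df.shape)  # (2, 3)
```

The same pattern should work for the Codeup blog posts if they were acquired as a list of dictionaries too.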

<hr style="border:1px solid black"> </hr>

### #7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

<hr style="border:1px solid black"> </hr>

### #8. For each dataframe, make columns: title, original, clean, stemmed, lemmatized
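
A sketch of how the five columns might be built with apply. It assumes a dataframe with title and content columns (hypothetical names), and prep_columns is a hypothetical helper. The text-processing functions are passed in as arguments so the block runs on its own; the demo at the bottom uses trivial stand-ins rather than the real functions.

```python
import pandas as pd

def prep_columns(df, content_col, clean, stem, lemmatize):
    # Keep the title and rename the raw text column to 'original'
    out = df[['title', content_col]].rename(columns={content_col: 'original'})
    # clean: normalized and tokenized text with the stopwords removed
    out['clean'] = out['original'].apply(clean)
    # stemmed and lemmatized are both derived from the clean column
    out['stemmed'] = out['clean'].apply(stem)
    out['lemmatized'] = out['clean'].apply(lemmatize)
    return out

# Made-up one-row dataframe for the demo
sample_df = pd.DataFrame({'title': ['Sample title'], 'content': ['Hello World']})

# Stand-ins only. With the functions defined earlier, clean could be
# lambda s: remove_stopwords(tokenize(basic_clean(s))), with stem and lemmatize passed as-is.
prepped = prep_columns(sample_df, 'content',
                       clean=str.lower,
                       stem=lambda s: s,
                       lemmatize=lambda s: s)

print(list(prepped.columns))
```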

<hr style="border:1px solid black"> </hr>

### #9. Ask yourself

If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?

If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?

If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?