# Prepare Notebook
This notebook works through the NLP prepare exercises and will then be turned into a .py file.

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import numpy as np

import acquire as aq

#### In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace, or a single quote with an empty string.

In [2]:
def basic_clean(dirty_words):
    '''This function takes in words (single word, article, paragraph, etc) and then 
    lowercases all letters, normalizes the letters, and removes special characters'''
    all_lower_case_words = dirty_words.lower()
    normalized_words = unicodedata.normalize('NFKD', all_lower_case_words).encode('ascii', 'ignore').decode('utf-8')
    remove_special_characters = re.sub(r"[^a-z0-9'\s]", '', normalized_words)
    return remove_special_characters

In [3]:
basic_clean('This is @ test!')

'this is  test'
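
To see what each step of `basic_clean` contributes, the sketch below runs the same three operations one at a time on a made-up sample string. The sample text and the `strip_accents` helper name are illustrative assumptions, not part of the exercise.

```python
import re
import unicodedata

def strip_accents(text):
    # NFKD splits an accented character into a base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' drops the combining marks
    # (and any other character that has no ASCII form).
    return decomposed.encode('ascii', 'ignore').decode('utf-8')

sample = "Café déjà vu: 100% naïve!"  # hypothetical input

lowered = sample.lower()
ascii_only = strip_accents(lowered)
cleaned = re.sub(r"[^a-z0-9'\s]", '', ascii_only)

print(lowered)     # café déjà vu: 100% naïve!
print(ascii_only)  # cafe deja vu: 100% naive!
print(cleaned)     # cafe deja vu 100 naive
```

Lowercasing has to come before the regex, because the character class only keeps `a-z` and would otherwise strip every capital letter.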

### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [4]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str = True)
    
    return string
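
As a quick check of what the tokenizer does to already-cleaned text, the sketch below tokenizes a short made-up phrase both ways. The sample phrase is an assumption; the expectation that the apostrophe ends up as its own token is inferred from the `company ' name` output that appears later in this notebook.

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()

# Hypothetical sample, already lowercased as basic_clean would leave it.
sample = "company's name"

# return_str=True gives back one space-separated string instead of a list.
as_string = tokenizer.tokenize(sample, return_str=True)
as_list = tokenizer.tokenize(sample)

print(as_string)
print(as_list)
```

Returning a string keeps every prepare function string-in, string-out, which is what lets them be chained with `.apply` later on.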

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [5]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string
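
A small sketch of what Porter stemming does to individual words. The word list is chosen to match pairs that show up in the stemmed column later in this notebook (for example `looking` becoming `look` and `any` becoming `ani`); nothing else is assumed.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words taken from the article text shown later in the notebook.
words = ["looking", "science", "any", "why", "has"]

# Map each word to its stem so the pairs are easy to compare.
stems = {word: ps.stem(word) for word in words}

for word, stemmed_word in stems.items():
    print(f"{word} -> {stemmed_word}")
```

Stems such as `scienc` and `ani` are not dictionary words. The stemmer applies suffix rules rather than looking words up, which is why it is fast but produces less readable output than lemmatizing.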

### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [6]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

- This function should define two optional parameters, extra_words and exclude_words. 
- These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [7]:
def remove_stopwords(string, extra_words = [], exclude_words = []):
    '''
    This function takes in a string, optional extra_words and exclude_words parameters
    with default empty lists and returns a string.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words)
    
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words))

    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords
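
The set arithmetic in `remove_stopwords` can be sketched without NLTK. The example below uses a hypothetical five-word list in place of `stopwords.words('english')`, and the helper name `build_stopword_set` is likewise an assumption made for illustration.

```python
# Hypothetical mini stopword list standing in for stopwords.words('english').
base_stopwords = ['a', 'is', 'the', 'not', 'this']

def build_stopword_set(base, extra_words=(), exclude_words=()):
    # Drop the words we want to keep, then add the extra ones to filter out.
    return (set(base) - set(exclude_words)) | set(extra_words)

text = 'this is not the codeup blog'

default_set = build_stopword_set(base_stopwords)
custom_set = build_stopword_set(base_stopwords,
                                extra_words=['codeup'],
                                exclude_words=['not'])

default_result = ' '.join(w for w in text.split() if w not in default_set)
custom_result = ' '.join(w for w in text.split() if w not in custom_set)

print(default_result)  # codeup blog
print(custom_result)   # not blog
```

Here `exclude_words` keeps a word in the text (`not`), while `extra_words` removes a word that is too common in this corpus to be useful (`codeup`).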

### 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [8]:
news_df = aq.get_inshorts_articles()
news_df.head()

Unnamed: 0,title,published,author,content,category
0,"If you die in metaverse, you die in real life:...",2021-10-30T07:16:16.000Z,Kiran Khatri,After Facebook changed the company's name to M...,business
1,Elon Musk becomes world's 1st person to cross ...,2021-10-30T09:21:54.000Z,Kiran Khatri,Tesla CEO Elon Musk has become the world's fir...,business
2,"'Squid Game' crypto surges 1,00,000% in days, ...",2021-10-30T06:13:05.000Z,Kiran Khatri,"A cryptocurrency called 'Squid Game', inspired...",business
3,Mumbai Police shares meme on Facebook name cha...,2021-10-30T15:57:03.000Z,Arshiya Chopra,After Facebook changed its company name to 'Me...,business
4,"Reliance Jio declines AGR dues moratorium, bec...",2021-10-30T07:20:30.000Z,Kiran Khatri,Reliance Jio has reportedly said it won't opt ...,business


### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [9]:
codeup_df = aq.get_blog_articles()
codeup_df.head()

Unnamed: 0,title,published,content
0,Boris – Behind the Billboards,"Oct 3, 2021",
1,Is Codeup the Best Bootcamp in San Antonio…or ...,"Sep 16, 2021",Looking for the best data science bootcamp in ...
2,Codeup Launches First Podcast: Hire Tech,"Aug 25, 2021",Any podcast enthusiasts out there? We are plea...
3,Why Should I Become a System Administrator?,"Aug 23, 2021","With so many tech careers in demand, why choos..."
4,Announcing our Candidacy for Accreditation!,"Jun 30, 2021",Did you know that even though we’re an indepen...


### 8. For each dataframe, produce the following columns:
- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [10]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the name of a text column, with the
    option to pass lists for extra_words and exclude_words, and returns
    a df with the article title, the original text, and the clean,
    stemmed, and lemmatized versions of that text.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

In [14]:
prep_article_data(codeup_df, 'content').head()

Unnamed: 0,title,content,clean,stemmed,lemmatized
0,Boris – Behind the Billboards,,,,
1,Is Codeup the Best Bootcamp in San Antonio…or ...,Looking for the best data science bootcamp in ...,looking best data science bootcamp world best ...,look best data scienc bootcamp world best code...,looking best data science bootcamp world best ...
2,Codeup Launches First Podcast: Hire Tech,Any podcast enthusiasts out there? We are plea...,podcast enthusiasts pleased announce release c...,ani podcast enthusiast pleas announc releas co...,podcast enthusiast pleased announce release co...
3,Why Should I Become a System Administrator?,"With so many tech careers in demand, why choos...",many tech careers demand choose system adminis...,mani tech career demand whi choos system admin...,many tech career demand choose system administ...
4,Announcing our Candidacy for Accreditation!,Did you know that even though we’re an indepen...,know even though independent school multiple r...,know even though independ school multipl regul...,know even though independent school multiple r...


In [15]:
prep_article_data(news_df, 'content').head()

Unnamed: 0,title,content,clean,stemmed,lemmatized
0,"If you die in metaverse, you die in real life:...",After Facebook changed the company's name to M...,facebook changed company ' name meta tesla ceo...,facebook chang compani ' name meta tesla ceo e...,facebook changed company ' name meta tesla ceo...
1,Elon Musk becomes world's 1st person to cross ...,Tesla CEO Elon Musk has become the world's fir...,tesla ceo elon musk become world ' first perso...,tesla ceo elon musk ha becom world ' first per...,tesla ceo elon musk ha become world ' first pe...
2,"'Squid Game' crypto surges 1,00,000% in days, ...","A cryptocurrency called 'Squid Game', inspired...",cryptocurrency called ' squid game ' inspired ...,cryptocurr call ' squid game ' inspir south ko...,cryptocurrency called ' squid game ' inspired ...
3,Mumbai Police shares meme on Facebook name cha...,After Facebook changed its company name to 'Me...,facebook changed company name ' meta ' mumbai ...,facebook chang compani name ' meta ' mumbai po...,facebook changed company name ' meta ' mumbai ...
4,"Reliance Jio declines AGR dues moratorium, bec...",Reliance Jio has reportedly said it won't opt ...,reliance jio reportedly said ' opt government ...,relianc jio ha reportedli said ' opt govern ' ...,reliance jio ha reportedly said ' opt governme...
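
The stemmed column above still contains tokens such as `ani`, `whi`, and `ha`, and the lemmatized column contains `ha`. A likely reason is that `prep_article_data` stems or lemmatizes before it removes stopwords, so `any`, `why`, and `has` no longer match the stopword list by the time it is applied. The sketch below illustrates the ordering effect with a hypothetical lookup table standing in for the stemmer and a three-word stand-in stopword set; it is not the NLTK implementation.

```python
# Hypothetical stand-ins: a lookup table instead of a real stemmer,
# and a tiny stopword set instead of NLTK's English list.
toy_stems = {'any': 'ani', 'why': 'whi', 'has': 'ha', 'careers': 'career'}
toy_stopwords = {'any', 'why', 'has'}

def toy_stem(text):
    # Replace each word with its stem if the table has one.
    return ' '.join(toy_stems.get(word, word) for word in text.split())

def toy_remove_stopwords(text):
    # Keep only the words that are not in the stopword set.
    return ' '.join(word for word in text.split() if word not in toy_stopwords)

text = 'why has any tech careers'

stem_then_filter = toy_remove_stopwords(toy_stem(text))
filter_then_stem = toy_stem(toy_remove_stopwords(text))

print(stem_then_filter)  # whi ha ani tech career
print(filter_then_stem)  # tech career
```

If the second ordering is preferred, one option would be to build the stemmed and lemmatized columns from the clean column, for example `df['clean'].apply(stem)`, which also matches the exercise wording "the stemmed version of the cleaned data".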


### 9. Ask yourself:
- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

In [None]:
stemmed 