# Exercises
The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire as a
import prepare as p

import requests
import bs4

import warnings
warnings.filterwarnings("ignore")

## 1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [2]:
def basic_clean(sentence):
    """
    This function takes in a string and
    - lowercases it
    - normalizes unicode characters to ASCII
    - removes anything that is NOT a:
        - letter: a-z
        - number: 0-9
        - single quote: '
        - whitespace (spaces, tabs, newlines)
    returns the cleaned string
    """
    clean = sentence.lower()
    clean = unicodedata.normalize('NFKD', clean).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    clean = re.sub(r"[^a-z0-9'\s]", '', clean)
    
    return clean
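
The unicode step is the least obvious part of `basic_clean`, so here is a minimal sketch of it in isolation. The helper name `to_ascii` is only for illustration and is not part of the exercise. NFKD decomposition splits an accented letter into a base letter plus combining marks, and the ASCII encode with `'ignore'` then drops the marks.

```python
import unicodedata


def to_ascii(text):
    # NFKD splits an accented character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    # and any other character that has no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Erdős"))       # Erdos
print(to_ascii("Pólya"))       # Polya
print(to_ascii("smørrebrød"))  # smrrebrd
```

Note the last line: a character such as `ø` has no decomposition into a base letter, so it is dropped entirely rather than mapped to `o`. That is usually acceptable for English text, but worth checking if the corpus contains many such names.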

**Test function**

In [3]:
sentence = "Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed a lot to \
the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often \
incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

sentence

"Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

In [4]:
sentence = basic_clean(sentence)

In [5]:
sentence

"paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field erdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

## 2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [6]:
def tokenize(sentence):
    """
    This function takes in a string
    - tokenizes the entire string
    returns tokenized string
    """
    
    # Create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use the tokenizer
    return tokenizer.tokenize(sentence, return_str=True)

**Test function**

In [7]:
tokenize(sentence)

"paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

## 3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [8]:
def stem(sentence):
    """
    This function takes in a string
    - reduces each word to its stem
    returns the stemmed string
    """
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer() 
    
    # Apply the stemmer to each word in our string.
    stems = [ps.stem(word) for word in sentence.split()]
    
    # Join our lists of words into a string again
    sentence_stemmed = ' '.join(stems)
    
    return sentence_stemmed

**Test function**

In [9]:
stem(sentence)

"paul erdo and georg polya are influenti hungarian mathematician who contribut a lot to the field erdos' name contain the hungarian letter 'o' 'o' with doubl acut accent but is often incorrectli written as erdo or erdo either by mistak or out of typograph necess"

In [10]:
sentence_stemmed = stem(sentence)

## 4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [11]:
def lemmatize(sentence):
    """
    This function takes in a string
    - reduces each word to its lemma (its dictionary form)
    returns the lemmatized string
    """
    # Download if not done so already.
    nltk.download('wordnet')
    
    # Create the Lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Apply the lemmatizer on each word in the string.
    lemmas = [wnl.lemmatize(word) for word in sentence.split()]
    
    # Join our list of words into a string again; assign to a variable to save changes.
    sentence_lemmatized = ' '.join(lemmas)
    
    return sentence_lemmatized

In [12]:
lemmatize(sentence)

[nltk_data] Downloading package wordnet to /Users/agomez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


"paul erdos and george polya are influential hungarian mathematician who contributed a lot to the field erdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written a erdos or erdos either by mistake or out of typographical necessity"

In [13]:
sentence_lemmatized = lemmatize(sentence)

[nltk_data] Downloading package wordnet to /Users/agomez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [14]:
def remove_stopwords(sentence, extra_words=None, exclude_words=None):
    """
    Takes in a string and two optional lists:
    - extra_words: additional words to treat as stopwords
    - exclude_words: standard stopwords that should NOT be removed
    Lemmatizes the string, then returns it with the stopwords removed.

    If you receive the error `Resource stopwords not found. Please use the NLTK Downloader to obtain the resource.`
    run this in a terminal:
    python -c "import nltk; nltk.download('stopwords')"
    """

    # Treat a missing argument as an empty list so the None defaults work.
    extra_words = extra_words or []
    exclude_words = exclude_words or []

    # Start from nltk's standard English stopword list and add the extra words.
    stopword_list = stopwords.words('english') + list(extra_words)

    # Remove from the stopword list any words the caller wants to keep.
    stopword_list = [word for word in stopword_list if word not in exclude_words]

    # Lemmatize the string and split it into words.
    words = lemmatize(sentence).split()

    # Keep only the words that are not stopwords.
    filtered_words = [word for word in words if word not in stopword_list]

    # Join the remaining words back into a single string.
    article_without_stopwords = ' '.join(filtered_words)

    return article_without_stopwords
    

In [15]:
remove_stopwords(sentence, extra_words=["o", "'"], exclude_words=["no"])

[nltk_data] Downloading package wordnet to /Users/agomez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


"paul erdos george polya influential hungarian mathematician contributed lot field erdos's name contains hungarian letter 'o' 'o' double acute accent often incorrectly written erdos erdos either mistake typographical necessity"

In [16]:
exclude_words=["no"]
extra_words=["o", "'"]


stopword_list = stopwords.words('english') + extra_words
stopword_list = [word for word in stopword_list if word not in exclude_words]

In [17]:
list1 = [1, 2, 3, 4]
list2 = [2, 3]

list1 + list2

[1, 2, 3, 4, 2, 3]

In [18]:
l3 = [x for x in list1 if x not in list2]
l3

[1, 4]
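
The two list experiments above are the add and remove steps of the stopword list. As a rough sketch of the same idea with sets, which handle duplicates and make membership checks faster on long lists, the following runs without nltk. `filter_stopwords` and `SAMPLE_STOPWORDS` are hypothetical names used only here; the short word list stands in for `stopwords.words('english')`.

```python
def filter_stopwords(text, stopword_list, extra_words=None, exclude_words=None):
    # None means "no changes" for either optional argument.
    stop_set = set(stopword_list) | set(extra_words or [])
    stop_set -= set(exclude_words or [])
    return " ".join(word for word in text.split() if word not in stop_set)


# Tiny hypothetical stand-in for stopwords.words('english').
SAMPLE_STOPWORDS = ["a", "and", "is", "no", "the", "to"]

text = "the answer is no and the tokenizer left a stray o"

print(filter_stopwords(text, SAMPLE_STOPWORDS))
# answer tokenizer left stray o

print(filter_stopwords(text, SAMPLE_STOPWORDS, extra_words=["o"], exclude_words=["no"]))
# answer no tokenizer left stray
```

Because the text is split on whitespace, an extra word only matches a whole token. That is why `'o'` (with its quotes attached) survives in the notebook output even though `"o"` was passed in `extra_words`.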

## 6. Use your data from the `acquire` module to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [2]:
categories = ["business", "sports", "technology", "entertainment"]
news_df = a.get_all_news_articles(categories)

In [3]:
news_df.head(2)

Unnamed: 0,title,content,category
0,"I'm done with Zoom meetings, about to cancel a...","JPMorgan CEO Jamie Dimon said that he's ""done ...",business
1,Ethereum's 27-yr-old Co-founder now world's yo...,Ethereum blockchain's 27-year-old Co-founder V...,business


## 7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [4]:
codeup_df = a.convert_to_df()

In [5]:
codeup_df.head(2)

Unnamed: 0,title,date posted,category,content
0,Codeup’s Data Science Career Accelerator is Here!,"Posted on September 30, 2018",In Data Science,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"Posted on October 31, 2018",In Data Science,By Dimitri Antoniou and Maggie Giust\nData Sci...


## 8. For each dataframe, produce the following columns:

- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.
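
One way to read this list is that `stemmed` and `lemmatized` are both derived from `clean`, not from each other. The sketch below shows that dependency on a plain dictionary so it runs without pandas or nltk. `add_prepared_fields` and the `toy_*` functions are hypothetical stand-ins (the toy stemmer just strips a trailing "s" and is not real stemming); with a dataframe, each step would be a `.apply(...)` on the relevant column using the real functions from `prepare.py`.

```python
def add_prepared_fields(record, clean_fn, stem_fn, lemmatize_fn, text_key="content"):
    """Return a new dict with title, original, clean, stemmed and lemmatized."""
    original = record[text_key]
    clean = clean_fn(original)
    return {
        "title": record["title"],
        "original": original,
        "clean": clean,
        "stemmed": stem_fn(clean),          # derived from clean
        "lemmatized": lemmatize_fn(clean),  # also derived from clean
    }


def toy_clean(text):
    # Stand-in for basic_clean + tokenize + remove_stopwords.
    return " ".join(text.lower().split())


def toy_stem(text):
    # Stand-in only: strips a trailing "s"; NOT the Porter stemmer.
    return " ".join(w[:-1] if w.endswith("s") else w for w in text.split())


article = {"title": "Data Science Myths", "content": "Myths  About DATA"}
result = add_prepared_fields(article, toy_clean, toy_stem, toy_stem)
print(result)
```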

### `news_df`

In [6]:
news_df = p.remove_columns(news_df, cols_to_remove=["category"])
news_df.head(2)

Unnamed: 0,title,content
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f..."
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...


In [7]:
news_df = news_df.rename(columns={"content": "original"})

In [13]:
p.basic_clean(p.tokenize(news_df.original[0]))

"speaking about india ' s second covid19 wave  former rbi governor raghuram rajan said   i think what went wrong was simply  that   we underestimated the virus and its ability to adapt  after the first wave   there was a sense that we had endured the worst  and we had come through  and it was time to open up  and that complacency hurt us   he added "

In [21]:
news_df['clean'] = news_df.original.apply(p.basic_clean).apply(p.tokenize)
news_df.head(2)

Unnamed: 0,title,original,clean
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",speaking about india ' s second covid19 wave f...
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,indian commercial pilots association icpa on t...


In [23]:
news_df['stemmed'] = news_df.clean.apply(p.stem)
news_df.head(2)

Unnamed: 0,title,original,clean,stemmed
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",speaking about india ' s second covid19 wave f...,speak about india ' s second covid19 wave form...
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,indian commercial pilots association icpa on t...,indian commerci pilot associ icpa on tuesday s...


In [24]:
news_df['lemmatized'] = news_df.clean.apply(p.lemmatize)
news_df.head(2)

[nltk_data] Downloading package wordnet to /Users/agomez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",speaking about india ' s second covid19 wave f...,speak about india ' s second covid19 wave form...,speaking about india ' s second covid19 wave f...
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,indian commercial pilots association icpa on t...,indian commerci pilot associ icpa on tuesday s...,indian commercial pilot association icpa on tu...



### `codeup_df`

In [26]:
codeup_df.head(2)

Unnamed: 0,title,date posted,category,content
0,Codeup’s Data Science Career Accelerator is Here!,"Posted on September 30, 2018",In Data Science,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"Posted on October 31, 2018",In Data Science,By Dimitri Antoniou and Maggie Giust\nData Sci...


In [28]:
codeup_df = p.remove_columns(codeup_df, cols_to_remove=["category", "date posted"])
codeup_df.head(2)

Unnamed: 0,title,content
0,Codeup’s Data Science Career Accelerator is Here!,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,By Dimitri Antoniou and Maggie Giust\nData Sci...


In [29]:
codeup_df = codeup_df.rename(columns={"content": "original"})
codeup_df.head(2)

Unnamed: 0,title,original
0,Codeup’s Data Science Career Accelerator is Here!,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,By Dimitri Antoniou and Maggie Giust\nData Sci...


In [30]:
codeup_df['clean'] = codeup_df.original.apply(p.basic_clean).apply(p.tokenize)
codeup_df.head(2)

Unnamed: 0,title,original,clean
0,Codeup’s Data Science Career Accelerator is Here!,The rumors are true! The time has arrived. Cod...,the rumors are true the time has arrived codeu...
1,Data Science Myths,By Dimitri Antoniou and Maggie Giust\nData Sci...,by dimitri antoniou and maggie giust\ndata sci...


In [31]:
codeup_df['stemmed'] = codeup_df.clean.apply(p.stem)
codeup_df.head(2)

Unnamed: 0,title,original,clean,stemmed
0,Codeup’s Data Science Career Accelerator is Here!,The rumors are true! The time has arrived. Cod...,the rumors are true the time has arrived codeu...,the rumor are true the time ha arriv codeup ha...
1,Data Science Myths,By Dimitri Antoniou and Maggie Giust\nData Sci...,by dimitri antoniou and maggie giust\ndata sci...,by dimitri antoni and maggi giust data scienc ...


In [32]:
codeup_df['lemmatized'] = codeup_df.clean.apply(p.lemmatize)
codeup_df.head(2)

[nltk_data] Downloading package wordnet to /Users/agomez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Codeup’s Data Science Career Accelerator is Here!,The rumors are true! The time has arrived. Cod...,the rumors are true the time has arrived codeu...,the rumor are true the time ha arriv codeup ha...,the rumor are true the time ha arrived codeup ...
1,Data Science Myths,By Dimitri Antoniou and Maggie Giust\nData Sci...,by dimitri antoniou and maggie giust\ndata sci...,by dimitri antoni and maggi giust data scienc ...,by dimitri antoniou and maggie giust data scie...


In [6]:
news_df = p.full_df(news_df, "category")
news_df.head(2)

[nltk_data] Downloading package wordnet to /Users/agomez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,"I'm done with Zoom meetings, about to cancel a...","JPMorgan CEO Jamie Dimon said that he's ""done ...",jpmorgan ceo jamie dimon said that he ' s done...,jpmorgan ceo jami dimon said that he ' s done ...,jpmorgan ceo jamie dimon said that he ' s done...
1,Ethereum's 27-yr-old Co-founder now world's yo...,Ethereum blockchain's 27-year-old Co-founder V...,ethereum blockchain ' s 27yearold cofounder vi...,ethereum blockchain ' s 27yearold cofound vita...,ethereum blockchain ' s 27yearold cofounder vi...


In [9]:
cols = ["category", "date posted"]
codeup_df = p.full_df(codeup_df, cols)
codeup_df.head(2)

KeyError: "[('category', 'date posted')] not found in axis"
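
The error message suggests the two column names reached the drop step as one nested item rather than as two separate labels, while the single-string call for `news_df` worked. The body of `p.full_df` is not shown here, so the following is only a sketch of one possible fix: normalize the argument before using it. `as_list` is a hypothetical helper, not something defined in `prepare.py`.

```python
def as_list(cols):
    """Return cols as a flat list of column names (hypothetical helper)."""
    if cols is None:
        return []
    if isinstance(cols, str):
        # A single column name becomes a one-item list.
        return [cols]
    # Lists, tuples and other iterables are copied into a flat list.
    return list(cols)


print(as_list("category"))                   # ['category']
print(as_list(["category", "date posted"]))  # ['category', 'date posted']
```

Inside a function like `full_df`, a call such as `df.drop(columns=as_list(cols))` would then receive separate labels whether the caller passes one name or a list of names.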

## 9. Ask yourself:

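- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text? **Either.** At this size both finish quickly, so the slower lemmatizer costs very little.
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text? **Stem.** Stemming applies suffix rules, while lemmatization looks each word up in WordNet, and that difference in speed starts to matter at this size.
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text? **Stem.** When compute is billed, the cheaper operation wins, at the cost of stems that are not always real words (for example, `influenti`).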