# Curriculum:

## A couple of introductory notes:

- Our goal with parsing text is to reduce the variability of the text itself; letters aren't letters to a computer

- Etymology doesn't matter with parsing.  We're simply trying to get the statistics of the word

### Some Terms:

- 'parsing' means breaking down text into smaller parts.  Literally, parsing from 'blindly' to 'blind'

- 'nltk' is the Natural Language Toolkit for Python

- '.encode' something to ASCII means converting a string into ASCII bytes; non-ASCII characters have to be dropped or replaced (or an error is raised)

- '.decode' comes after '.encode', and transforms the encoded ASCII bytes back into a string for Python

- 'Tokenization' - similar to the term 'parsing' - means breaking text down into discrete words / punctuation / etc.

- 'Stemming' - chopping words up into their base forms - 'calls', 'called', 'calling' all have the root stem 'call'

- 'Lemma' - is the root word (a real dictionary word), as opposed to the root stem that 'stemming' produces

- 'Lemmatization' - is like 'stemming' but WAY more refined and makes the output easier to read grammatically

- 'Stopword' - a word w/ little to no significance; words common in ANY document, like 'the', 'like', 'a', 'to', etc.
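
To make '.encode' and '.decode' from the terms above concrete, here is a minimal sketch using only built-in string methods. The sample word is an assumption picked for illustration; it is not from the lesson.

```python
# .encode turns a str into bytes; .decode turns bytes back into a str.
word = "na\u00efve"  # 'naive' with a diaeresis on the i, written as an escape

# With errors='ignore', characters that ASCII cannot represent are dropped.
encoded = word.encode("ascii", "ignore")
decoded = encoded.decode("ascii")

print(type(encoded).__name__, encoded)  # bytes b'nave'
print(type(decoded).__name__, decoded)  # str nave

# Without errors='ignore', a non-ASCII character raises an error instead.
try:
    word.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(strict_failed)  # True
```

Note that the accented letter vanishes completely here. Keeping the base letter needs a normalization step before encoding.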

### What follows is the Workflow to adhere to when parsing text:

1.) convert all text to lowercase;

2.) Remove any accented and non-ASCII characters; 

3.) Remove all special characters; 

4.) Stem (if dataset is LARGE) or Lemmatize (if it can fit on my laptop) all words;

5.) Remove all stopwords; and 

6.) Store the clean text and the original text for use in future notebooks

In [1]:
import unicodedata
import re # for replacing non-alphanumeric characters, etc
import json

from requests import get
from bs4 import BeautifulSoup
import os

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import numpy as np

import acquire

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

### Running through a couple of things from 'acquire' so we can work with raw data and visualize how things get dwindled down.

In [3]:
# print(response.text) - Shows all the text.  ALL the text

# print(response.text[:400]) - prints the first 400 characters

# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

# soup.prettify() - to show me all the text in HTML

# soup.find_all("a") - to find all the anchor tags

# soup.find("h1") - finds the first h1 - main header - tag (soup.find_all("h1") returns all of them)

# soup.get_text() - Shows me all the text from within a matching piece of soup.  In other words, the text between the tags

# soup.select("body a") - uses a CSS selector to return all the 'a' anchor tags inside the body


## From here, shite went terribly South.  My 'acquire' function did not match the function that the curriculum was assuming, so from this point on, I am picking up *Ryan*'s lecture from the beginning:

In [4]:
# this is just for the lesson example
import unicodedata

import re
import json

# natural language toolkit
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from nltk.stem import PorterStemmer

import pandas as pd

In [5]:
original = """Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed a lot to the field. 
Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written
as Erdos or Erdös either by mistake or out of typographical necessity"""
original

"Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed a lot to the field. \nErdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written\nas Erdos or Erdös either by mistake or out of typographical necessity"

In [6]:
# lowercase and remove accented characters and any non-ASCII characters
# Encode to ASCII bytes, ignoring (dropping) anything ASCII can't represent
# Decode those bytes back so we have a normal Python string
string = original.lower()
string = unicodedata.normalize('NFKD', string)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')    
string

"paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field. \nerdos's name contains the hungarian letter 'o' ('o' with double acute accent), but is often incorrectly written\nas erdos or erdos either by mistake or out of typographical necessity"

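Why the normalize step has to come before the encode: a small sketch, assuming nothing beyond the standard library, comparing the result with and without 'NFKD'. The variable names are made up for this example.

```python
import unicodedata

name = "erd\u0151s"  # 'erdos' with the Hungarian double-acute o, as an escape

# Without normalization the accented letter is ONE non-ASCII code point,
# so errors='ignore' throws the whole letter away.
without_nfkd = name.encode("ascii", "ignore").decode("ascii")

# NFKD splits the accented letter into a plain 'o' plus a combining accent.
# Only the combining accent is non-ASCII, so only the accent gets dropped.
decomposed = unicodedata.normalize("NFKD", name)
with_nfkd = decomposed.encode("ascii", "ignore").decode("ascii")

print(without_nfkd)                # erds
print(with_nfkd)                   # erdos
print(len(name), len(decomposed))  # 5 6
```
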
In [7]:
# Remove any special characters and replace with an empty string
string = re.sub(r"[^a-z0-9'\s]", '', string)
string

"paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field \nerdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written\nas erdos or erdos either by mistake or out of typographical necessity"

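A quick look at how that regex reads, as a sketch on made-up sample strings. It also shows why lowercasing has to happen before this step when the pattern only lists a-z.

```python
import re

# Inside [...], a leading ^ negates the class: match any character that is NOT
# a lowercase letter, a digit, a single quote, or whitespace.
pattern = r"[^a-z0-9'\s]"

sample = "erdos's name (o with accent), 100% ascii!"
cleaned = re.sub(pattern, "", sample)
print(cleaned)  # erdos's name o with accent 100 ascii

# Uppercase letters are not in the class, so they get removed as well.
not_lowercased = re.sub(pattern, "", "Paul Erdos")
print(not_lowercased)  # aul rdos
```
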
### Now that we have lowercased everything and done all the ASCII encoding / decoding, time to Tokenize

Hint for remembering:

"Yo!  That text is all little, and we've put the ASCII back into a string!"

"Betta tokenize!"

In [8]:
# Split the cleaned string into tokens; return_str=True gives them back joined by spaces instead of as a list
tokenizer = nltk.tokenize.ToktokTokenizer()

string = tokenizer.tokenize(string, return_str=True)
string

"paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field \nerdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written\nas erdos or erdos either by mistake or out of typographical necessity"

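To see roughly what the tokenizer did to the apostrophes, here is a regex-only approximation. This is NOT how ToktokTokenizer is implemented; 'rough_tokenize' is a hypothetical helper that only mimics the output above for text that has already been lowercased and cleaned.

```python
import re

def rough_tokenize(text):
    # Runs of letters/digits become one token; each single quote is its own token.
    return re.findall(r"[a-z0-9]+|'", text)

tokens = rough_tokenize("erdos's name contains the letter 'o'")
print(tokens)
# ['erdos', "'", 's', 'name', 'contains', 'the', 'letter', "'", 'o', "'"]

# Joining with spaces gives the same shape as return_str=True.
joined = " ".join(tokens)
print(joined)  # erdos ' s name contains the letter ' o '
```
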
### Now's the time to either Stem or Lemmatize.  We stem if there's a ton of data because it's brute force but takes way less computer energy to do.  We Lemmatize data that can fit onto our laptops: it's way more refined than stemming (lemmas are dictionary words, whereas stems often are not), but it is more computationally intensive

In [9]:
# Stemming is a super basic way to get only the "stem" of a word
ps = nltk.porter.PorterStemmer()
ps.stem('call'), ps.stem('called'), ps.stem('calling')

('call', 'call', 'call')

In [10]:
stems = [ps.stem(word) for word in string.split()]
stems

['paul',
 'erdo',
 'and',
 'georg',
 'polya',
 'are',
 'influenti',
 'hungarian',
 'mathematician',
 'who',
 'contribut',
 'a',
 'lot',
 'to',
 'the',
 'field',
 'erdo',
 "'",
 's',
 'name',
 'contain',
 'the',
 'hungarian',
 'letter',
 "'",
 'o',
 "'",
 "'",
 'o',
 "'",
 'with',
 'doubl',
 'acut',
 'accent',
 'but',
 'is',
 'often',
 'incorrectli',
 'written',
 'as',
 'erdo',
 'or',
 'erdo',
 'either',
 'by',
 'mistak',
 'or',
 'out',
 'of',
 'typograph',
 'necess']

In [11]:
wnl = nltk.stem.WordNetLemmatizer()

In [12]:
lemmas = [wnl.lemmatize(word) for word in string.split()]

In [13]:
lemmas

['paul',
 'erdos',
 'and',
 'george',
 'polya',
 'are',
 'influential',
 'hungarian',
 'mathematician',
 'who',
 'contributed',
 'a',
 'lot',
 'to',
 'the',
 'field',
 'erdos',
 "'",
 's',
 'name',
 'contains',
 'the',
 'hungarian',
 'letter',
 "'",
 'o',
 "'",
 "'",
 'o',
 "'",
 'with',
 'double',
 'acute',
 'accent',
 'but',
 'is',
 'often',
 'incorrectly',
 'written',
 'a',
 'erdos',
 'or',
 'erdos',
 'either',
 'by',
 'mistake',
 'or',
 'out',
 'of',
 'typographical',
 'necessity']
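
The difference between the two outputs above comes down to rules versus lookup. The toy sketch below is an assumption-heavy illustration of that idea: 'naive_stem' is a three-rule suffix chopper (nowhere near the Porter algorithm) and 'LEMMAS' is a hypothetical four-word dictionary standing in for WordNet.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, far smaller than Porter's rule set

def naive_stem(word):
    # Rule-based: chop the first matching suffix, keeping at least 3 characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for WordNet.
LEMMAS = {"calls": "call", "called": "call", "calling": "call", "mice": "mouse"}

def naive_lemmatize(word):
    # Lookup-based: known words map to a real word, unknown words pass through.
    return LEMMAS.get(word, word)

for w in ["calls", "called", "calling", "mice", "necessity"]:
    print(w, "->", naive_stem(w), "|", naive_lemmatize(w))
```

The stemmer never needs a dictionary, which is why it is cheap, but it cannot handle an irregular form like 'mice'. The lookup handles it, at the cost of carrying the dictionary around.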

In [14]:
stopword_list = stopwords.words('english')
print(len(stopword_list))
stopword_list[0:21]

179


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself']

In [15]:
clean_stems = [w for w in stems if w not in stopword_list]
clean_stems

['paul',
 'erdo',
 'georg',
 'polya',
 'influenti',
 'hungarian',
 'mathematician',
 'contribut',
 'lot',
 'field',
 'erdo',
 "'",
 'name',
 'contain',
 'hungarian',
 'letter',
 "'",
 "'",
 "'",
 "'",
 'doubl',
 'acut',
 'accent',
 'often',
 'incorrectli',
 'written',
 'erdo',
 'erdo',
 'either',
 'mistak',
 'typograph',
 'necess']

In [16]:
clean_lemmas = [w for w in lemmas if w not in stopword_list]
clean_lemmas

['paul',
 'erdos',
 'george',
 'polya',
 'influential',
 'hungarian',
 'mathematician',
 'contributed',
 'lot',
 'field',
 'erdos',
 "'",
 'name',
 'contains',
 'hungarian',
 'letter',
 "'",
 "'",
 "'",
 "'",
 'double',
 'acute',
 'accent',
 'often',
 'incorrectly',
 'written',
 'erdos',
 'erdos',
 'either',
 'mistake',
 'typographical',
 'necessity']

## SO REMEMBER THE NLP PROCESS IN ORDER:

- lowercase() all the things

- normalize the unicode and strip accented / non-ASCII characters (normalize, encode, decode)

- remove special characters with re.sub and replace with empty string

- tokenize your string

- lemmatize (or stem if your corpus is gigantic)

- remove your stopwords

- keep the original text

- write the original text and the transformed text to disk for later
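
The last bullet (writing both versions to disk) is not shown anywhere in the lesson, so here is one possible sketch using the standard 'json' module. The record, the keys, and the file name 'prepared_articles.json' are all assumptions for illustration.

```python
import json
import os
import tempfile

# Hypothetical record; in practice the values come from the cleaning steps above.
articles = [
    {
        "original": "Paul Erd\u0151s and George P\u00f3lya",
        "clean": "paul erdos george polya",
    },
]

# A temporary directory keeps this sketch from touching real files.
path = os.path.join(tempfile.mkdtemp(), "prepared_articles.json")

# ensure_ascii=False keeps the accented characters of the original readable on disk.
with open(path, "w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False, indent=2)

# Read it back to confirm nothing was lost on the round trip.
with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded == articles)  # True
```
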

# Exercises

- The end result of this exercise should be a file named prepare.py that defines the requested functions

- In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the codeup blog articles and the news articles that were previously acquired.


### 1.) Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- It should lowercase everything

- It should normalize the unicode characters

- It should replace anything that is NOT a letter, number, whitespace, or single-quote

In [49]:
string = ""

def basic_clean(string):
    """
    Function takes a string, lowercases it, normalizes the unicode, and removes anything that is not a letter, number, whitespace, or single-quote
    """
    
    # to lowercase all the text:
    string = string.lower()
    string = unicodedata.normalize('NFKD', string)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore') 
    
    # to remove anything that is not a letter, number, whitespace, or single-quote:
    string = re.sub(r"[^a-zA-Z0-9'\s]", '', string)

    return string

**Testing out the (^^) function:**

In [50]:
basic_clean("kjads;fioeinf8vh84h8348** 0-_e9u34j)")

'kjadsfioeinf8vh84h8348 0e9u34j'

#### Checks out

### 2.) Define a function named 'tokenize' that takes in a string and tokenizes all the words in it

In [51]:
def tokenize(string):
    """
    Function to take in a string and break it down into discrete parts
    """
    
    # to lowercase all the text:
    string = string.lower()
    string = unicodedata.normalize('NFKD', string)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore') 
    
    # to remove anything that is not a letter, number, whitespace, or single-quote:
    string = re.sub(r"[^a-zA-Z0-9'\s]", '', string)

    # applying the tokenizer object
    tokenizer = nltk.tokenize.ToktokTokenizer()
    string = tokenizer.tokenize(string, return_str=True)
    
    return string


**Testing out the (^^) function:**

In [52]:
tokenize("kalIEOINEknd;lapoivanwonpvo i hpwoihfpioY)*P$Y028ty0")

'kalieoinekndlapoivanwonpvo i hpwoihfpioypy028ty0'

#### Checks out

### 3.) Define a function named 'stem' that accepts a string and returns it after stemming all the words.

In [76]:
def stem(string):
    """
    Function that takes in text and stems all the words down into their base forms
    """
    
    # to lowercase all the text:
    string = string.lower()
    string = unicodedata.normalize('NFKD', string)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore') 
    
    # to remove anything that is not a letter, number, whitespace, or single-quote:
    string = re.sub(r"[^a-zA-Z0-9'\s]", '', string)
    
    # applying the tokenizer object
    tokenizer = nltk.tokenize.ToktokTokenizer()
    string = tokenizer.tokenize(string, return_str=True)
    
    # stemming everything
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in string.split()]
    string_stemmed = ' '.join(stems)

    return string_stemmed


**Testing out the (^^) function:**

In [80]:
stem("There once was a man from Nantucket")

'there onc wa a man from nantucket'

#### Checks out.  'Onc' and 'wa' are the stemmed 'words'

### 4.) Define a function called 'lemmatize' that accepts some text and returns it after the lemmatization of each word

In [83]:
def lemmatize(text):
    """
    Function that takes in text and lemmatizes each word (reduces it to its dictionary root word)
    """
    
    # to lowercase all the text:
    text = text.lower()
    text = unicodedata.normalize('NFKD', text)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore') 
    
    # to remove anything that is not a letter, number, whitespace, or single-quote:
    text = re.sub(r"[^a-zA-Z0-9'\s]", '', text)
    
    # applying the tokenizer object
    tokenizer = nltk.tokenize.ToktokTokenizer()
    text = tokenizer.tokenize(text, return_str=True)
    
    # lemmatizing everything
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    text_lemmatized = ' '.join(lemmas)

    return text_lemmatized

**Testing out the (^^) function:**

In [86]:
lemmatize("Firecracker Chicken or Shrimp. Battered and deep fried sauteed in honey with yellow onions and jalapenos.")

'firecracker chicken or shrimp battered and deep fried sauteed in honey with yellow onion and jalapeno'

#### Checks out.  The period after 'Shrimp' being dropped and the 's' being removed from the ends of 'onions' and 'jalapenos' indicate it works.

### 5.) Define a function called 'remove_stopwords' that accepts some text and returns it after removing all the stopwords.

In [89]:
# take a look at the first 20 words in the stopwords dictionary.  This is in English.  Tried in Spanish.  Same thing.

stopword_list = stopwords.words('english')
print(stopword_list[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [96]:
def remove_stopwords(text, extra_words="", exclude_words=""):
    """
    Function that takes in text and returns it with all the stopwords removed.
    extra_words: words to ADD to the stopword list (assumed space-separated string)
    exclude_words: words to KEEP even if they are stopwords (assumed space-separated string)
    """
    
    # to lowercase all the text:
    text = text.lower()
    text = unicodedata.normalize('NFKD', text)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore') 
    
    # to remove anything that is not a letter, number, whitespace, or single-quote:
    text = re.sub(r"[^a-zA-Z0-9'\s]", '', text)
    
    # applying the tokenizer object
    tokenizer = nltk.tokenize.ToktokTokenizer()
    text = tokenizer.tokenize(text, return_str=True)
    
    # building the stopword set: add the extra words, drop the excluded ones
    stopword_set = set(stopwords.words('english'))
    stopword_set = stopword_set | set(extra_words.lower().split())
    stopword_set = stopword_set - set(exclude_words.lower().split())

    # removing stopwords (the text is NOT lemmatized here; run lemmatize() first if that's wanted)
    words = text.split()
    filtered_words = [w for w in words if w not in stopword_set]
    print("removed {} stopwords".format(len(words) - len(filtered_words)))
    print("***************")
    text_without_stopwords = ' '.join(filtered_words)
    
    return text_without_stopwords

**Test out the (^^) function:**

In [97]:
remove_stopwords("Firecracker Chicken or Shrimp. Battered and deep fried sauteed in honey with yellow onions and jalapenos")

removed 5 stopwords
***************


'firecracker chicken shrimp battered deep fried sauteed honey yellow onions jalapenos'

In [98]:
remove_stopwords("Alright, stop what you're doing because I'm about to ruin", "doing", "alright")

removed 9 stopwords
***************


"alright stop ' ' ruin"

#### Checks out.
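
One thing the second test shows: the tokenizer splits "you're" and "i'm" at the apostrophe, the word pieces get removed as stopwords, and the lone apostrophes are left behind. A possible follow-up step, sketched here with a hypothetical helper name, is to drop any token that has no letter or digit in it.

```python
def drop_punctuation_tokens(text):
    # Keep a token only if it contains at least one letter or digit.
    tokens = text.split()
    kept = [t for t in tokens if any(ch.isalnum() for ch in t)]
    return " ".join(kept)

result = drop_punctuation_tokens("alright stop ' ' ruin")
print(result)  # alright stop ruin
```
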

### 6.) Define a function called 'prep_article' that takes in the dictionary representing an article and returns a dictionary that looks like this:

{

    "title" : "the original title",
    
    "original" : original `#` the entire text in its original format
    
    "stemmed" : article_stemmed `#` all the text, stemmed
    
    "lemmatized" : article_lemmatized `#` all the text, lemmatized
    
    "clean" : article_without_stopwords `#` all the text lemmatized and all the stopwords removed
}
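
A minimal sketch of what 'prep_article' could look like, not a finished answer to the exercise. Two assumptions: the article body lives under a 'content' key in the incoming dictionary, and the stem / lemmatize / remove_stopwords functions are passed in as arguments so the sketch runs on its own. The stand-in functions at the bottom are placeholders; swap in the real ones from above.

```python
def prep_article(article, stem, lemmatize, remove_stopwords, text_key="content"):
    # text_key is an assumption about where the article body lives in the input dict.
    original = article[text_key]
    lemmatized = lemmatize(original)
    return {
        "title": article["title"],
        "original": original,
        "stemmed": stem(original),
        "lemmatized": lemmatized,
        # 'clean' is the lemmatized text with the stopwords removed
        "clean": remove_stopwords(lemmatized),
    }

# Placeholder stand-ins so this runs without NLTK; they only lowercase and drop 'the'.
demo = prep_article(
    {"title": "Demo", "content": "The Cats"},
    stem=str.lower,
    lemmatize=str.lower,
    remove_stopwords=lambda s: " ".join(w for w in s.split() if w != "the"),
)
print(demo)
```
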

### 7.) Define a function named 'prepare_article_data' that takes in the list of article dictionaries, applies the 'prep_article' function to each one, and returns the transformed data.

In [None]:
def prep_article_data(df):
    '''
    This function takes in the news articles df and
    returns the df with original columns plus cleaned
    stemmed and lemmatized content without stopwords.
    '''
    # Work on a copy so the original df is left untouched
    df = df.copy()
    
    # stem() and lemmatize() each take ONE string and already do their own
    # basic cleaning and tokenizing, so apply them to each article's content
    stemmed = df['content'].apply(stem)
    lemmatized = df['content'].apply(lemmatize)
    
    # Remove stopwords from stemmed article content
    df['clean_stemmed'] = stemmed.apply(remove_stopwords)
    
    # Remove stopwords from lemmatized article content
    df['clean_lemmatized'] = lemmatized.apply(remove_stopwords)
    
    return df[['topic', 'title', 'author', 'content', 'clean_stemmed', 'clean_lemmatized']]