# Parsing Text     
Parsing text means taking acquired data and breaking it down into smaller components: analyzing a sentence by its parts and then describing the syntactic roles those parts play.

Planning and Sequence: 

1. Convert text to lowercase for normalization and consistency.
2. Normalize accented characters and remove any remaining non-ASCII (American Standard Code for Information Interchange) characters.
3. Remove special characters. (see the note below)
4. Stem or lemmatize the words. (see the note below)
5. Remove stopwords. (see the note below)
6. Store the clean text and the original text for use in future notebooks.

+ "Special characters include all printable characters that are neither letters nor numbers. These include punctuation or technical, mathematical characters. ASCII also includes the space (a non-visible but printable character), and therefore, does not belong to the control characters category, as one might suspect." - [source](https://www.ionos.com/digitalguide/server/know-how/ascii-codes-overview-of-all-characters-on-the-ascii-table/)

+ "Stemming and Lemmatization both generate the foundation sort of the inflected words and therefore the only difference is that stem may not be an actual word whereas, lemma is an actual language word. Stemming follows an algorithm with steps to perform on the words which makes it faster." - [source](https://towardsdatascience.com/stemming-vs-lemmatization-2daddabcb221)

+ Stopwords are words that carry little meaning on their own, for instance ‘and’, ‘a’, ‘it's’, ‘they’: articles, prepositions, pronouns, conjunctions, and the like. Although they are useful in natural language, they contribute little to computational inference, at least for the time being.
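
To make the note on special characters concrete, here is a small sketch that lists the printable ASCII characters that are neither letters nor numbers. It treats whitespace separately (an assumption, since the cleaning steps keep whitespace), and `strip_special` is a hypothetical helper name used only for illustration.

```python
import string

# string.printable is digits + letters + punctuation + whitespace.
# Keep only the characters that are neither alphanumeric nor whitespace.
special_characters = [
    ch for ch in string.printable
    if not ch.isalnum() and not ch.isspace()
]


def strip_special(text: str) -> str:
    """Drop ASCII special characters, leaving letters, digits, and whitespace."""
    return "".join(ch for ch in text if ch not in special_characters)


print("".join(special_characters))
print(strip_special("a+b = c!"))
```

The resulting list matches `string.punctuation` from the standard library.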

```python
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire
```

***
## Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

- This function should define two optional parameters, extra_words and exclude_words. 
    - These parameters should define any additional stop words to include, and any words that we don't want to remove.

6. Use your data from the acquire step to produce a dataframe of the news articles. Name the dataframe news_df.

7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

***

#### 1. Define a function named basic_clean.      
It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.
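
One possible sketch of `basic_clean`, using only the standard library. It assumes that "normalize" means decomposing accented characters with NFKD and dropping whatever cannot be represented in ASCII, and that unwanted characters are removed rather than replaced with a placeholder.

```python
import re
import unicodedata


def basic_clean(text: str) -> str:
    """Lowercase, normalize unicode to ASCII, and strip unwanted characters."""
    text = text.lower()
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8")
    )
    # Keep only lowercase letters, digits, single quotes, and whitespace.
    return re.sub(r"[^a-z0-9'\s]", "", text)


print(basic_clean("Café's MENU: $5!"))  # cafe's menu 5
```

Note that curly apostrophes (’) are non-ASCII, so this version drops them along with other non-ASCII characters.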

*** 
#### 2. Define a function named tokenize.      
It should take in a string and tokenize all the words in the string.
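
A minimal sketch of `tokenize`, assuming the `ToktokTokenizer` imported at the top of the notebook is the intended tokenizer. Returning a single space-separated string (via `return_str=True`) is a design choice made here so the result can be passed straight to the later functions; returning a list of tokens would work as well.

```python
from nltk.tokenize.toktok import ToktokTokenizer


def tokenize(text: str) -> str:
    """Tokenize the text and return the tokens joined by single spaces."""
    tokenizer = ToktokTokenizer()
    # return_str=True gives back one string instead of a list of tokens.
    return tokenizer.tokenize(text, return_str=True)


print(tokenize("the quick brown fox"))
```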

***

#### 3. Define a function named stem.      
It should accept some text and return the text after applying stemming to all the words.
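
A possible sketch of `stem`. The exercise does not name a stemmer, so the use of NLTK's `PorterStemmer` here is an assumption; splitting on whitespace assumes the text has already been cleaned and tokenized.

```python
from nltk.stem.porter import PorterStemmer


def stem(text: str) -> str:
    """Apply Porter stemming to every whitespace-separated word."""
    stemmer = PorterStemmer()
    stems = [stemmer.stem(word) for word in text.split()]
    return " ".join(stems)


print(stem("calling called calls"))  # call call call
```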

***
#### 4. Define a function named lemmatize.       
It should accept some text and return the text after applying lemmatization to each word.
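
A hedged sketch of `lemmatize`. By default it assumes NLTK's `WordNetLemmatizer`, which requires the WordNet corpus to have been fetched once with `nltk.download("wordnet")`. The optional `lemma_fn` parameter is not part of the exercise; it is added here only so the function can be tried out with a different lemmatizer or a toy lookup.

```python
from typing import Callable, Optional


def lemmatize(text: str, lemma_fn: Optional[Callable[[str], str]] = None) -> str:
    """Apply a lemmatizer to every whitespace-separated word."""
    if lemma_fn is None:
        # Assumption: nltk is installed and the WordNet corpus is available.
        from nltk.stem import WordNetLemmatizer

        lemma_fn = WordNetLemmatizer().lemmatize
    lemmas = [lemma_fn(word) for word in text.split()]
    return " ".join(lemmas)


# Toy lookup standing in for a real lemmatizer (illustration only).
toy_lemmas = {"mice": "mouse", "geese": "goose"}
print(lemmatize("mice and geese", lambda w: toy_lemmas.get(w, w)))  # mouse and goose
```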

*** 
#### 5. Define a function named remove_stopwords.      
It should accept some text and return the text after removing all the stopwords.

- This function should define two optional parameters, extra_words and exclude_words. 
    - These parameters should define any additional stop words to include, and any words that we don't want to remove.
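
One way `remove_stopwords` might look. The default list is assumed to be NLTK's English stopwords, which need `nltk.download("stopwords")` to have been run once. The `stopword_list` parameter is an addition beyond the exercise, included so the base list can be swapped out.

```python
from typing import Iterable, Optional


def remove_stopwords(
    text: str,
    extra_words: Optional[Iterable[str]] = None,
    exclude_words: Optional[Iterable[str]] = None,
    stopword_list: Optional[Iterable[str]] = None,
) -> str:
    """Remove stopwords from whitespace-separated text."""
    if stopword_list is None:
        # Assumption: nltk is installed and the stopwords corpus is available.
        from nltk.corpus import stopwords

        stopword_list = stopwords.words("english")
    # Add the extra stopwords, then take out the words we want to keep.
    stops = (set(stopword_list) | set(extra_words or [])) - set(exclude_words or [])
    return " ".join(word for word in text.split() if word not in stops)


base = ["the", "and"]  # tiny illustrative list, not NLTK's
print(remove_stopwords("the cat and the hat", stopword_list=base))  # cat hat
```

Using sets keeps the membership check fast and makes the include/exclude logic a single expression.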

***
#### 6. Use your data from the acquire step to produce a dataframe of the news articles.      
Name the dataframe news_df.
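
A sketch of building `news_df`, assuming the acquire step yields a list of dictionaries, one per article. The sample records and the field names `title`, `content`, and `category` are hypothetical placeholders; the real acquire output may differ.

```python
import pandas as pd

# Hypothetical stand-in for whatever the acquire step returns.
articles = [
    {"title": "First headline", "content": "Body of the first article.", "category": "business"},
    {"title": "Second headline", "content": "Body of the second article.", "category": "sports"},
]

# A list of dicts becomes one row per dict, one column per key.
news_df = pd.DataFrame(articles)
print(news_df.shape)  # (2, 3)
```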


***

#### 7. Make another dataframe for the Codeup blog posts.      
Name the dataframe codeup_df.
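
If the blog posts were cached to disk during acquisition, the dataframe could be loaded from that file instead. The sketch below assumes a JSON file holding a list of post dictionaries; `load_posts` is a hypothetical helper, not part of the `acquire` module.

```python
import json

import pandas as pd


def load_posts(path: str) -> pd.DataFrame:
    """Read a JSON file containing a list of post dicts into a dataframe."""
    with open(path, encoding="utf-8") as f:
        posts = json.load(f)
    return pd.DataFrame(posts)
```

With a cache file in place, `codeup_df = load_posts(path_to_cache)` would produce the dataframe, where `path_to_cache` is wherever the posts were saved.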

***
#### 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.
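
The columns above can be sketched as a single helper. To keep the block self-contained, the cleaning, stemming, and lemmatizing steps are passed in as functions. The name `prepare_article_df` and the default `content_col="content"` are assumptions, not names from the exercise.

```python
from typing import Callable

import pandas as pd


def prepare_article_df(
    df: pd.DataFrame,
    clean_fn: Callable[[str], str],
    stem_fn: Callable[[str], str],
    lemma_fn: Callable[[str], str],
    content_col: str = "content",
) -> pd.DataFrame:
    """Build title/original/clean/stemmed/lemmatized columns from raw articles."""
    out = pd.DataFrame({"title": df["title"], "original": df[content_col]})
    # clean_fn is expected to normalize, tokenize, and remove stopwords.
    out["clean"] = out["original"].apply(clean_fn)
    # Both derived columns start from the cleaned text, not the original.
    out["stemmed"] = out["clean"].apply(stem_fn)
    out["lemmatized"] = out["clean"].apply(lemma_fn)
    return out
```

In practice `clean_fn` would chain `basic_clean`, `tokenize`, and `remove_stopwords`, for example `lambda s: remove_stopwords(tokenize(basic_clean(s)))`.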

***
#### 9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
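
Rather than guessing, one way to reason about these questions is to time each approach on a small sample and extrapolate. The sketch below assumes processing time grows roughly linearly with corpus size, which is only an approximation; `estimate_seconds` is a hypothetical helper.

```python
import time
from typing import Callable


def estimate_seconds(fn: Callable[[str], str], sample: str, corpus_bytes: int) -> float:
    """Time fn on a sample and scale linearly to a corpus of corpus_bytes."""
    sample_bytes = len(sample.encode("utf-8"))
    if sample_bytes == 0:
        raise ValueError("sample must not be empty")
    start = time.perf_counter()
    fn(sample)
    elapsed = time.perf_counter() - start
    # Linear extrapolation: seconds per byte times total bytes.
    return elapsed * corpus_bytes / sample_bytes


sample_text = "the quick brown fox jumps over the lazy dog " * 1000
print(estimate_seconds(str.upper, sample_text, 25_000_000))
```

Passing a stemming function and a lemmatizing function in turn gives two estimates to compare for each corpus size.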