- environment setup

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire

# Lesson 

In [2]:
original = acquire.get_codeup_blog('https://codeup.com/codeups-data-science-career-accelerator-is-here')
print(original)

{'title': 'Codeup’s Data Science Career Accelerator is Here!', 'published_date': 'September 30, 2018', 'blog_image': 'https://codeup.com/wp-content/uploads/2018/10/Data-Science-7.png', 'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum develo





In [3]:
article = original['content']

article

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student

In [4]:
# first, convert all the characters in the text to lowercase
## so that the same word is counted consistently regardless of case

article = article.lower()
print(article)

the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoor’s #1 best job in america.
data science is a method of providing actionable intelligence from data. the data revolution has hit san antonio, resulting in an explosion in data scientist positions across companies like usaa, accenture, booz allen hamilton, and heb. we’ve even seen utsa invest $70 m for a cybersecurity center and school of data science. we built a program to specifically meet the growing demands of this industry.
our program will be 18 weeks long, full-time, hands-on, and project-based. our curriculum development and instruction is led by senior data scientist, maggie giust, who has worked at heb, capital group, and rackspace, along with input from dozens of practitioners and hiring partners. students will work with real

## Removing accented characters

In [5]:
# Then we must remove any accented or inconsistent characters 

article = unicodedata.normalize('NFKD', article)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

print(article)

the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoors #1 best job in america.
data science is a method of providing actionable intelligence from data. the data revolution has hit san antonio, resulting in an explosion in data scientist positions across companies like usaa, accenture, booz allen hamilton, and heb. weve even seen utsa invest $70 m for a cybersecurity center and school of data science. we built a program to specifically meet the growing demands of this industry.
our program will be 18 weeks long, full-time, hands-on, and project-based. our curriculum development and instruction is led by senior data scientist, maggie giust, who has worked at heb, capital group, and rackspace, along with input from dozens of practitioners and hiring partners. students will work with real d
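A minimal sketch of why this step changes `glassdoor’s` to `glassdoors` in the output above. The helper name `strip_to_ascii` and the sample strings are assumptions made for illustration; only the `unicodedata.normalize` / `encode` / `decode` chain comes from the cell above.

```python
import unicodedata

def strip_to_ascii(text):
    """Decompose characters (NFKD), then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# 'é' decomposes into 'e' plus a combining accent, so the base letter survives.
accented = strip_to_ascii("café")

# The curly apostrophe (U+2019) has no ASCII decomposition, so it is dropped.
curly = strip_to_ascii("glassdoor\u2019s")

# The non-breaking space (U+00A0) has a compatibility mapping to a plain space.
nbsp = strip_to_ascii("in\xa0glassdoor")

print(accented, curly, nbsp, sep=" | ")
```

Note that characters with no ASCII equivalent are silently discarded, so it can be worth replacing curly quotes with straight ones first if apostrophes matter downstream.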

## Removing Special Characters

In [6]:
# remove anything that is not a through z, a number, a single quote, or whitespace
article = re.sub(r"[^a-z0-9'\s]", '', article)
print(article)

the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america
data science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry
our program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems a
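The output above shows hyphenated words being fused (`fulltime`, `handson`, `projectbased`) because the hyphen is deleted rather than replaced. A small sketch of that effect, plus one possible variation (an assumption, not part of the lesson) that turns hyphens into spaces before applying the same pattern:

```python
import re

text = "full-time, hands-on, and project-based"

# Same pattern as the cell above: punctuation is deleted outright,
# so hyphenated words are fused together.
fused = re.sub(r"[^a-z0-9'\s]", "", text)

# Hypothetical variation: turn hyphens into spaces first so the parts stay separate.
split_first = re.sub(r"[^a-z0-9'\s]", "", text.replace("-", " "))

print(fused)
print(split_first)
```

Which behavior is preferable depends on whether `fulltime` or `full` and `time` should be counted as the unit of analysis.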

## Tokenization

In [7]:
# then tokenize to transform each individual word into a 'token' that can be analyzed individually 

tokenizer = nltk.tokenize.ToktokTokenizer()
## create tokenizer

print(tokenizer.tokenize(article, return_str=True))
# run tokenizer on article
## return_str will put the tokens back in a string
## otherwise it will return a list of each individual word

the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america
data science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry
our program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems a
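To illustrate the difference between returning a list of tokens and returning a string, here is a rough regex-based stand-in. It is not how `ToktokTokenizer` works internally; `simple_tokenize` is a hypothetical helper that only handles text already cleaned down to lowercase letters, digits, single quotes, and whitespace.

```python
import re

def simple_tokenize(text, return_str=False):
    """Split cleaned text into word tokens and standalone single quotes."""
    # A token is either a run of letters/digits or a lone single quote.
    tokens = re.findall(r"[a-z0-9]+|'", text)
    # Mirror the return_str idea: join tokens with spaces, or hand back the list.
    return " ".join(tokens) if return_str else tokens

token_list = simple_tokenize("erdos's name")
token_str = simple_tokenize("erdos's name", return_str=True)

print(token_list)
print(token_str)
```

The list form is convenient for counting or filtering tokens, while the string form is convenient for passing the text on to the next cleaning step.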

## Stemming and Lemmatization

### Stemming

- stemming strips word endings to approximate the root of each word in the article
- stems are not always real words, so the result is less grammatically correct than lemmatization
- much faster than lemmatization though
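A toy illustration of the suffix-stripping idea behind stemming. This is a deliberately naive sketch, not the Porter algorithm; `toy_stem` and its three rules are assumptions made up for this example. It does show why rule-based stemming is fast and why it can produce non-words.

```python
def toy_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least two characters so short words are not emptied out.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

words = ["call", "called", "calling", "has", "this"]
stems = [toy_stem(w) for w in words]
print(stems)
```

The blind rules collapse `called` and `calling` to `call`, but they also cut `has` down to `ha`, since no dictionary is consulted.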

In [8]:
# Create the nltk stemmer object, then use it
ps = nltk.porter.PorterStemmer()

# test stemmer
ps.stem('call'), ps.stem('called'), ps.stem('calling')

('call', 'call', 'call')

In [9]:
# now we run the stemmer on the article

stems = [ps.stem(word) for word in article.split()]
article_stemmed = ' '.join(stems)
print(article_stemmed)

the rumor are true the time ha arriv codeup ha offici open applic to our new data scienc career acceler with onli 25 seat avail thi immers program is one of a kind in san antonio and will help you land a job in glassdoor 1 best job in america data scienc is a method of provid action intellig from data the data revolut ha hit san antonio result in an explos in data scientist posit across compani like usaa accentur booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecur center and school of data scienc we built a program to specif meet the grow demand of thi industri our program will be 18 week long fulltim handson and projectbas our curriculum develop and instruct is led by senior data scientist maggi giust who ha work at heb capit group and rackspac along with input from dozen of practition and hire partner student will work with real data set realist problem and the entir data scienc pipelin from collect to deploy they will receiv profession develop train in resum

In [10]:
pd.Series(stems).value_counts().head(10)
# check to see which words occur the most frequently, now that all the words have been normalized

data      13
and       13
to         9
in         8
a          8
our        7
scienc     7
the        6
of         6
learn      6
dtype: int64

### Lemmatization

- similar to stemming, but much more robust
- words tend to be more grammatically correct
- use over stemming if possible
- significantly slower than stemming

In [11]:
wnl = nltk.stem.WordNetLemmatizer()
# create lemmatization object

In [12]:
lemmas = [wnl.lemmatize(word) for word in article.split()]
article_lemmatized = ' '.join(lemmas)
# run object on article

print(article_lemmatized)

the rumor are true the time ha arrived codeup ha officially opened application to our new data science career accelerator with only 25 seat available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america data science is a method of providing actionable intelligence from data the data revolution ha hit san antonio resulting in an explosion in data scientist position across company like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demand of this industry our program will be 18 week long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who ha worked at heb capital group and rackspace along with input from dozen of practitioner and hiring partner student will work with real data set realistic problem and the entire data

In [13]:
pd.Series(lemmas).value_counts()[:10]
# see which words occur most frequently now that the words have been normalized

and        13
data       13
to          9
in          8
a           8
our         7
science     7
with        6
of          6
will        6
dtype: int64

## Removing Stopwords

- stopwords tend to be the most frequent words in a text
- typically insignificant words like 'the', 'and', 'a', or 'or'
- removing them makes it easier to find the most prominent meaningful words
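A small sketch of how stopwords crowd the top of a frequency count. The sentence and the five-word `toy_stopwords` set are made up for illustration; NLTK's English list is much longer.

```python
from collections import Counter

text = "the data and the science of data and the art of data"

# Tiny hand-picked stop word set for illustration only.
toy_stopwords = {"the", "and", "of", "a", "or"}

words = text.split()
kept = [w for w in words if w not in toy_stopwords]

# Compare the two most frequent words before and after filtering.
before = Counter(words).most_common(2)
after = Counter(kept).most_common(2)

print(before)
print(after)
print("Removed {} stopwords".format(len(words) - len(kept)))
```

Before filtering, `the` ties with `data` for first place; after filtering, only content words remain in the ranking.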

In [14]:
stopword_list = stopwords.words('english')
# create stopwords list



In [15]:
stopword_list.remove('no')
stopword_list.remove('not')
# .remove('') can be used to take out any word you do not want to be in the stopword list
# .append('') can be used to add any word you would like to be considered a stopword 

stopword_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [16]:
words = article.split()
filtered_words = [w for w in words if w not in stopword_list]
# remove any unneeded stopwords 

print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
print('---')

article_without_stopwords = ' '.join(filtered_words)

print(article_without_stopwords)

Removed 122 stopwords
---
rumors true time arrived codeup officially opened applications new data science career accelerator 25 seats available immersive program one kind san antonio help land job glassdoors 1 best job america data science method providing actionable intelligence data data revolution hit san antonio resulting explosion data scientist positions across companies like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet growing demands industry program 18 weeks long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust worked heb capital group rackspace along input dozens practitioners hiring partners students work real data sets realistic problems entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition workforce focus applie

# Exercises

In [17]:
original = "Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed a lot to \
the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often \
incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

original

"Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

## 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove (replace with an empty string) anything that is not a letter, number, whitespace or a single quote.

In [18]:
## lowercase article 

article = original.lower()

article

"paul erdős and george pólya are influential hungarian mathematicians who contributed a lot to the field. erdős's name contains the hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as erdos or erdös either by mistake or out of typographical necessity"

In [19]:
## normalize unicode

article = unicodedata.normalize('NFKD', article)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

print(article)

paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field. erdos's name contains the hungarian letter 'o' ('o' with double acute accent), but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity


In [20]:
## replace special characters

# remove anything that is not a through z, a number, a single quote, or whitespace
article = re.sub(r"[^a-z0-9'\s]", '', article)
print(article)

paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field erdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity


In [21]:
## make functions

def basic_clean(original):
    '''
    This function will take in a string of text and begin basic cleaning
    - will make all characters lowercase
    - will normalize unicode
    - will remove special characters
    - will return edited text
    '''
    
    article = original.lower()
    # make all characters lowercase
    
    article = unicodedata.normalize('NFKD', article)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')
    # normalize unicode
    
    article = re.sub(r"[^a-z0-9'\s]", '', article)
    # remove any special characters
    
    return article

In [22]:
article2 = basic_clean(original)

In [23]:
article2

"paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field erdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

## 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [24]:
tokenizer = nltk.tokenize.ToktokTokenizer()
# make tokenizer object

In [25]:
print(tokenizer.tokenize(article, return_str=True))
# run tokenizer, return as str

paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity


In [26]:
# make function

def tokenize(article):
    '''
    This function will take in a string and run a ToktokTokenizer object
    - will return a string of the token words
    - expected to be used after basic_clean function
    '''
    
    tokenizer = nltk.tokenize.ToktokTokenizer()
    # make tokenizer object
    
    token_article = tokenizer.tokenize(article, return_str=True)
    # run tokenizer, return as str
    
    return token_article

In [27]:
article2 = tokenize(article2)

article2

"paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

## 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [28]:
# Create the nltk stemmer object, then use it
ps = nltk.porter.PorterStemmer()

In [29]:
stems = [ps.stem(word) for word in article.split()]
# run stemmer object on article to create stems

article_stemmed = ' '.join(stems)
# create stemmed article by joining stems

article_stemmed

"paul erdo and georg polya are influenti hungarian mathematician who contribut a lot to the field erdos' name contain the hungarian letter 'o' 'o' with doubl acut accent but is often incorrectli written as erdo or erdo either by mistak or out of typograph necess"

In [30]:
# make functions

def stem(article):
    '''
    This function will take in a string and run a PorterStemmer object
    - will return a string of all the stems of the words in the article
    '''
    
    # Create the nltk stemmer object, then use it
    ps = nltk.porter.PorterStemmer()
    
    stems = [ps.stem(word) for word in article.split()]
    # run stemmer object on article to create stems
    
    article_stemmed = ' '.join(stems)
    # create stemmed article by joining stems
    
    return article_stemmed

In [31]:
article2_stemmed = stem(article2)

article2_stemmed

"paul erdo and georg polya are influenti hungarian mathematician who contribut a lot to the field erdo ' s name contain the hungarian letter ' o ' ' o ' with doubl acut accent but is often incorrectli written as erdo or erdo either by mistak or out of typograph necess"

## 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [32]:
wnl = nltk.stem.WordNetLemmatizer()
# create lemmatizer object

In [33]:
lemmas = [wnl.lemmatize(word) for word in article.split()]
## run lemmatizer object on article

In [34]:
article_lemmatized = ' '.join(lemmas)
# create lemmatized article by joining lemmatized words together

article_lemmatized

"paul erdos and george polya are influential hungarian mathematician who contributed a lot to the field erdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written a erdos or erdos either by mistake or out of typographical necessity"

In [35]:
# make functions

def lemmatize(article):
    '''
    This function will take in a string and run a WordNetLemmatizer object
    - will return a string of lemmatized words from the article
    '''
    
    wnl = nltk.stem.WordNetLemmatizer()
    # create lemmatizer object
    
    lemmas = [wnl.lemmatize(word) for word in article.split()]
    ## run lemmatizer object on article
    
    article_lemmatized = ' '.join(lemmas)
    # create lemmatized article by joining lemmatized words together
    
    return article_lemmatized

In [36]:
article2_lemmatized = lemmatize(article2)

article2_lemmatized

"paul erdos and george polya are influential hungarian mathematician who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written a erdos or erdos either by mistake or out of typographical necessity"

## 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

- This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.
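One possible way to make the two parameters optional, sketched with a stand-in stop word list so it runs without NLTK data. `filter_stopwords`, `toy_base`, and the sample sentence are hypothetical names for this illustration; in the notebook the base list would come from `stopwords.words('english')`.

```python
def filter_stopwords(text, base_stopwords, extra_words=None, exclude_words=None):
    """Drop stop words from text, with optional additions and exclusions."""
    # None defaults avoid sharing one mutable list between calls.
    stop_set = set(base_stopwords)
    stop_set |= set(extra_words or [])
    # Set difference ignores words that are not present, unlike list.remove().
    stop_set -= set(exclude_words or [])
    return " ".join(w for w in text.split() if w not in stop_set)

toy_base = ["the", "a", "and", "to"]  # stand-in for stopwords.words('english')
sentence = "paul and george contributed a lot to the field"

default_result = filter_stopwords(sentence, toy_base)
custom_result = filter_stopwords(
    sentence, toy_base, extra_words=["lot"], exclude_words=["the"]
)

print(default_result)
print(custom_result)
```

Using a set also makes each membership check constant time, which can matter on long articles.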

In [37]:
stopword_list = stopwords.words('english')
# create stop words list

In [54]:
remove_stopword = []
# create list of words to remove from stop words


In [55]:
append_stopword = []
# create list of words to add to stop words

In [40]:
words = article.split()
# split article into individual words

In [41]:
filtered_words = [w for w in words if w not in stopword_list]
# keep only the words that are not in the stop words list

In [42]:
article_without_stopwords = ' '.join(filtered_words)
# recreate article out of remaining words

article_without_stopwords

"paul erdos george polya influential hungarian mathematicians contributed lot field erdos's name contains hungarian letter 'o' 'o' double acute accent often incorrectly written erdos erdos either mistake typographical necessity"

In [43]:
# make functions

def remove_stopwords(article, append_stopword=None, remove_stopword=None):
    '''
    This function will take in a string in the form of an article and remove standard English stop words
    - append_stopword: optional list of words to add to the stop words list (extra_words in the prompt)
    - remove_stopword: optional list of words to remove from the stop words list (exclude_words in the prompt)
    - will return a string of remaining words once all desired stop words have been removed 
    '''
    
    stopword_list = stopwords.words('english')
    # create standard English stop words list
    
    for word in (append_stopword or []):
        stopword_list.append(word)
    # add any extra stop words
    
    for word in (remove_stopword or []):
        if word in stopword_list:
            stopword_list.remove(word)
    # remove any unwanted stop words, skipping words that are not in the list
    
    words = article.split()
    # split article into individual words
    
    filtered_words = [w for w in words if w not in stopword_list]
    # keep only the words that are not in the stop words list
    
    article_without_stopwords = ' '.join(filtered_words)
    # recreate article out of remaining words
    
    return article_without_stopwords

In [44]:
remove_stopword = ['o', 'the']
# create list of words to remove from stop words

In [45]:
append_stopword = ['polya', 'lot']
# create list of words to add to stop words

In [46]:
article2_without_stopwords = remove_stopwords(article2, append_stopword, remove_stopword)

article2_without_stopwords

"paul erdos george influential hungarian mathematicians contributed the field erdos ' name contains the hungarian letter ' o ' ' o ' double acute accent often incorrectly written erdos erdos either mistake typographical necessity"

## 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [47]:
categories = ['business', 'sports', 'technology', 'entertainment']
# the comma after 'sports' matters: without it Python joins 'sports' 'technology' into one string

news_df = acquire.get_all_news_articles(categories)





In [48]:
news_df.head()

Unnamed: 0,title,content,category
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",business
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,business
2,World's biggest jeweller says it will no longe...,"Pandora, the world's biggest jeweller, has sai...",business
3,South Korea's richest woman gets fortune worth...,South Korea’s richest woman Hong Ra-hee added ...,business
4,Samsung pledges ₹37 crore to India to fight CO...,Samsung has pledged $5 million (around ₹37 cro...,business


## 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [49]:
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
'https://codeup.com/data-science-myths/',
'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

codeup_df = acquire.get_blog_articles(urls)





In [50]:
codeup_df

Unnamed: 0,title,published_date,blog_image,content
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch..."
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...


## 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

#### - news_df

In [51]:
news_df.head()

Unnamed: 0,title,content,category
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",business
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,business
2,World's biggest jeweller says it will no longe...,"Pandora, the world's biggest jeweller, has sai...",business
3,South Korea's richest woman gets fortune worth...,South Korea’s richest woman Hong Ra-hee added ...,business
4,Samsung pledges ₹37 crore to India to fight CO...,Samsung has pledged $5 million (around ₹37 cro...,business


In [52]:
## rename columns
news_df = news_df.rename(columns = {'content' : 'original'})

news_df.head()

Unnamed: 0,title,original,category
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",business
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,business
2,World's biggest jeweller says it will no longe...,"Pandora, the world's biggest jeweller, has sai...",business
3,South Korea's richest woman gets fortune worth...,South Korea’s richest woman Hong Ra-hee added ...,business
4,Samsung pledges ₹37 crore to India to fight CO...,Samsung has pledged $5 million (around ₹37 cro...,business


In [56]:
append_stopword = []
# no extra stop words to add

remove_stopword = []
# no stop words to exclude

In [57]:
## make clean columns

news_df['clean'] = pd.Series([basic_clean(string) for string in news_df.original])

news_df['clean'] = pd.Series([tokenize(string) for string in news_df.clean])

news_df['clean'] = pd.Series([remove_stopwords(string, append_stopword, remove_stopword) for string in news_df.clean])

news_df.head()

Unnamed: 0,title,original,category,clean
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",business,speaking india ' second covid19 wave former rb...
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,business,indian commercial pilots association icpa tues...
2,World's biggest jeweller says it will no longe...,"Pandora, the world's biggest jeweller, has sai...",business,pandora world ' biggest jeweller said ' stop u...
3,South Korea's richest woman gets fortune worth...,South Korea’s richest woman Hong Ra-hee added ...,business,south koreas richest woman hong rahee added an...
4,Samsung pledges ₹37 crore to India to fight CO...,Samsung has pledged $5 million (around ₹37 cro...,business,samsung pledged 5 million around 37 crore help...


In [59]:
## make stemmed column 

news_df['stemmed'] = pd.Series([stem(string) for string in news_df.clean])

news_df.head()

Unnamed: 0,title,original,category,clean,stemmed
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",business,speaking india ' second covid19 wave former rb...,speak india ' second covid19 wave former rbi g...
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,business,indian commercial pilots association icpa tues...,indian commerci pilot associ icpa tuesday said...
2,World's biggest jeweller says it will no longe...,"Pandora, the world's biggest jeweller, has sai...",business,pandora world ' biggest jeweller said ' stop u...,pandora world ' biggest jewel said ' stop use ...
3,South Korea's richest woman gets fortune worth...,South Korea’s richest woman Hong Ra-hee added ...,business,south koreas richest woman hong rahee added an...,south korea richest woman hong rahe ad anoth 7...
4,Samsung pledges ₹37 crore to India to fight CO...,Samsung has pledged $5 million (around ₹37 cro...,business,samsung pledged 5 million around 37 crore help...,samsung pledg 5 million around 37 crore help i...


In [60]:
## make lemmatized columns

news_df['lemmatized'] = pd.Series([lemmatize(string) for string in news_df.clean])

news_df.head()

Unnamed: 0,title,original,category,clean,stemmed,lemmatized
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",business,speaking india ' second covid19 wave former rb...,speak india ' second covid19 wave former rbi g...,speaking india ' second covid19 wave former rb...
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,business,indian commercial pilots association icpa tues...,indian commerci pilot associ icpa tuesday said...,indian commercial pilot association icpa tuesd...
2,World's biggest jeweller says it will no longe...,"Pandora, the world's biggest jeweller, has sai...",business,pandora world ' biggest jeweller said ' stop u...,pandora world ' biggest jewel said ' stop use ...,pandora world ' biggest jeweller said ' stop u...
3,South Korea's richest woman gets fortune worth...,South Korea’s richest woman Hong Ra-hee added ...,business,south koreas richest woman hong rahee added an...,south korea richest woman hong rahe ad anoth 7...,south korea richest woman hong rahee added ano...
4,Samsung pledges ₹37 crore to India to fight CO...,Samsung has pledged $5 million (around ₹37 cro...,business,samsung pledged 5 million around 37 crore help...,samsung pledg 5 million around 37 crore help i...,samsung pledged 5 million around 37 crore help...


#### - codeup_df

In [61]:
codeup_df

Unnamed: 0,title,published_date,blog_image,content
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch..."
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...


In [62]:
## rename column

codeup_df = codeup_df.rename(columns = {'content' : 'original'})

codeup_df

Unnamed: 0,title,published_date,blog_image,original
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch..."
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...


In [63]:
## make clean column: normalize, then tokenize, then remove stopwords

codeup_df['clean'] = codeup_df.original.apply(basic_clean)

codeup_df['clean'] = codeup_df.clean.apply(tokenize)

## note: append_stopword is passed for both extra arguments; if remove_stopwords
## takes separate "add" and "exclude" lists, the third argument probably should differ
codeup_df['clean'] = codeup_df.clean.apply(lambda string: remove_stopwords(string, append_stopword, append_stopword))

codeup_df.head()

Unnamed: 0,title,published_date,blog_image,original,clean
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...,rumors true time arrived codeup officially ope...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...,dimitri antoniou maggie giust data science big...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch...",dimitri antoniou week ago codeup launched imme...
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...,sa tech job fair third biannual san antonio te...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...,competitor bootcamps closing model danger prog...


In [66]:
## make stemmed column

codeup_df['stemmed'] = codeup_df.clean.apply(stem)

codeup_df

Unnamed: 0,title,published_date,blog_image,original,clean,stemmed
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...,rumors true time arrived codeup officially ope...,rumor true time arriv codeup offici open appli...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...,dimitri antoniou maggie giust data science big...,dimitri antoni maggi giust data scienc big dat...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch...",dimitri antoniou week ago codeup launched imme...,dimitri antoni week ago codeup launch immers d...
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...,sa tech job fair third biannual san antonio te...,sa tech job fair third biannual san antonio te...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...,competitor bootcamps closing model danger prog...,competitor bootcamp close model danger program...


In [67]:
## make lemmatized column

codeup_df['lemmatize'] = codeup_df.clean.apply(lemmatize)

codeup_df

Unnamed: 0,title,published_date,blog_image,original,clean,stemmed,lemmatize
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...,rumors true time arrived codeup officially ope...,rumor true time arriv codeup offici open appli...,rumor true time arrived codeup officially open...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...,dimitri antoniou maggie giust data science big...,dimitri antoni maggi giust data scienc big dat...,dimitri antoniou maggie giust data science big...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch...",dimitri antoniou week ago codeup launched imme...,dimitri antoni week ago codeup launch immers d...,dimitri antoniou week ago codeup launched imme...
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...,sa tech job fair third biannual san antonio te...,sa tech job fair third biannual san antonio te...,sa tech job fair third biannual san antonio te...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...,competitor bootcamps closing model danger prog...,competitor bootcamp close model danger program...,competitor bootcamps closing model danger prog...
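
To see where the two normalizations actually disagree, the stemmed and lemmatized strings can be compared token by token. This is only a sketch: `differing_tokens` is a hypothetical helper, not part of the notebook, and it assumes both strings came from the same `clean` text so their tokens line up one to one. The sample strings are the visible beginnings of row 0 above.

```python
def differing_tokens(stemmed, lemmatized):
    """Return (stemmed, lemmatized) token pairs that differ, position by position.

    Assumes both strings were produced from the same cleaned text,
    so they contain the same number of whitespace-separated tokens.
    """
    pairs = zip(stemmed.split(), lemmatized.split())
    return [(s, l) for s, l in pairs if s != l]


# Visible start of row 0 in the 'stemmed' and 'lemmatize' columns
stemmed_sample = "rumor true time arriv codeup offici open"
lemmatized_sample = "rumor true time arrived codeup officially open"

diffs = differing_tokens(stemmed_sample, lemmatized_sample)
print(diffs)  # [('arriv', 'arrived'), ('offici', 'officially')]
```

The stemmer truncates words to stems that are not always real words (`arriv`, `offici`), while the lemmatizer keeps dictionary forms.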


## 9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?

With a corpus that small, I would lemmatize the text because lemmatization is more thorough than stemming and the extra processing time would be negligible.

- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?

This corpus is somewhat larger, but I would still lemmatize it to be thorough.

- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

This corpus is far too large to lemmatize at a reasonable cost, so I would stem it instead.
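
One rough way to back these answers with numbers is to time a normalizer on a small sample and scale the result to the full corpus size. The sketch below is an assumption-laden estimate, not a benchmark: `corpus_size_bytes` and `estimate_seconds` are hypothetical helpers, the scaling is assumed to be linear in bytes, and `str.lower` stands in for the notebook's `stem` or `lemmatize` so the block runs on its own.

```python
import time


def corpus_size_bytes(texts):
    """Total size of the texts in bytes when encoded as UTF-8."""
    return sum(len(text.encode("utf-8")) for text in texts)


def estimate_seconds(normalize, sample_texts, total_bytes):
    """Time `normalize` on a sample and scale linearly to `total_bytes`."""
    sample_bytes = corpus_size_bytes(sample_texts)
    if sample_bytes == 0:
        raise ValueError("sample must contain some text")
    start = time.perf_counter()
    for text in sample_texts:
        normalize(text)
    elapsed = time.perf_counter() - start
    return elapsed * (total_bytes / sample_bytes)


# Visible starts of two rows from the 'clean' column
sample = [
    "rumors true time arrived codeup officially",
    "sa tech job fair third biannual san antonio",
]

# str.lower is only a placeholder; pass stem or lemmatize in the notebook
estimate = estimate_seconds(str.lower, sample, 25 * 1024**2)  # 25MB corpus
print(f"estimated seconds for 25MB: {estimate:.4f}")
```

Running it once with `stem` and once with `lemmatize` gives two estimates to compare at each corpus size in the questions above.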