# Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [1]:
import pandas as pd

import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/nickolaspedrimiranda/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/nickolaspedrimiranda/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

* Lowercase everything
* Normalize unicode characters
* Replace anything that is not a letter, number, whitespace, or a single quote with an empty string.

In [2]:
def basic_clean(filthy_data):
    filthy_data = filthy_data.lower()
    filthy_data = unicodedata.normalize('NFKD', filthy_data).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    clean_data = re.sub(r"[^a-z0-9'\s]", "", filthy_data)
    return clean_data
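
To see what each step of `basic_clean` contributes, the three transformations can be applied one at a time to a sample string. This is a minimal sketch using only the standard library; the sample sentence is made up for illustration and does not come from either dataset.

```python
import re
import unicodedata

# Hypothetical sample text (not from the news or Codeup data)
sample = "Café Déjà Vu: It's 100% GREAT!"

# Step 1: lowercase everything
lowered = sample.lower()

# Step 2: decompose accented characters (NFKD), then drop the non-ASCII marks
ascii_only = (
    unicodedata.normalize('NFKD', lowered)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
)

# Step 3: remove anything that is not a letter, digit, single quote, or whitespace
cleaned = re.sub(r"[^a-z0-9'\s]", "", ascii_only)

print(lowered)     # café déjà vu: it's 100% great!
print(ascii_only)  # cafe deja vu: it's 100% great!
print(cleaned)     # cafe deja vu it's 100 great
```

Note that the accented letters survive as their plain ASCII base letters rather than being dropped, which is the reason for normalizing before encoding.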

## 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [3]:
def tokenize(data):
    tokenizer = ToktokTokenizer()
    data = tokenizer.tokenize(data, return_str=True)
    return data

## 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [4]:
def stem(data):
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in data.split()]
    stemmed_data = ' '.join(stems)
    return stemmed_data

## 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.



In [5]:
def lemmatize(data):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in data.split()]
    lemmatized_data = ' '.join(lemmas)
    return lemmatized_data

## 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.



In [6]:
# This function takes in a string plus optional extra_words and exclude_words parameters (both default to empty lists) and returns a string
def remove_stopwords(string, extra_words = [], exclude_words = []):
    stopword_list = stopwords.words('english')
    # use set casting to remove any excluded stopwords
    stopword_set = set(stopword_list) - set(exclude_words)
    # add in extra words to stopwords set using a union
    stopword_set = stopword_set.union(set(extra_words))
    # split the document by spaces
    words = string.split()
    # every word in our document that is not a stopword
    filtered_words = [word for word in words if word not in stopword_set]
    # join it back together with spaces
    string_without_stopwords = ' '.join(filtered_words)
    return string_without_stopwords
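
The set arithmetic behind `extra_words` and `exclude_words` can be checked in isolation. The sketch below assumes a tiny hand-written stopword set in place of `stopwords.words('english')` so that it runs without NLTK; `base_stopwords` and `filter_words` are hypothetical names used only for this illustration.

```python
# Hypothetical mini stopword set standing in for stopwords.words('english')
base_stopwords = {"the", "is", "a", "of", "no", "not"}

def filter_words(text, extra_words=(), exclude_words=()):
    # Remove the excluded words from the stopword set, then add the extra ones
    stopword_set = (base_stopwords - set(exclude_words)) | set(extra_words)
    # Keep every word that is not a stopword
    return " ".join(word for word in text.split() if word not in stopword_set)

text = "the plan is not a success of codeup"

print(filter_words(text))                           # plan success codeup
print(filter_words(text, exclude_words=["not"]))    # plan not success codeup
print(filter_words(text, extra_words=["codeup"]))   # plan success
```

Keeping a word such as "not" through `exclude_words` can matter when negation changes the meaning of the text.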

In [7]:
#def remove_stopwords(data, exclude=[], include=[]):
#    stopwords_list = stopwords.words('english')
#    words = [word for word in data.split() if word not in stopwords_list]
#    new_data = ' '.join(words)
#    return new_data

## 6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

In [8]:
news = pd.read_csv('news.csv')
news = news.rename(columns={'content':'original'})
news = news.drop(columns='category')

In [9]:
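def cleanse(df, col='', stem=False, lem=True):
    # Add a 'clean' column by applying basic_clean to df[col].
    # The stem and lem flags are accepted but not used in this version.
    df['clean'] = df[col].apply(basic_clean)
    return df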

In [10]:
cleanse(news, 'original')

Unnamed: 0,title,original,clean
0,Govt probing accounts of Adani Group-run Mumba...,The Ministry of Corporate Affairs has opened a...,the ministry of corporate affairs has opened a...
1,IndiGo Co-founder Gangwal to buy SpiceJet stak...,IndiGo Co-founder Rakesh Gangwal is at advance...,indigo cofounder rakesh gangwal is at advanced...
2,"Who is KP Ramasamy, a farmer's son who entered...",KP Ramasamy has made his debut on Forbes India...,kp ramasamy has made his debut on forbes india...
3,"Zomato, McDonald's fined ₹1 lakh for deliverin...",Food delivery platform Zomato and fast food ch...,food delivery platform zomato and fast food ch...
4,India scraps plan to impose restrictions on la...,The Indian government has reversed its decisio...,the indian government has reversed its decisio...
5,"Trader wins ₹8,225 compensation from Zerodha o...","Trader Vijay Gupta won ₹8,225 compensation fro...",trader vijay gupta won 8225 compensation from ...
6,I'm surprised by how much difference being Ind...,World Bank President Ajay Banga said he's surp...,world bank president ajay banga said he's surp...
7,"Debt isn't bad, it should be there for develop...",World Bank President Ajay Banga said debt isn'...,world bank president ajay banga said debt isn'...
8,Stopped buying from ultra-luxury brands like H...,Zerodha Co-founder Nikhil Kamath said he no lo...,zerodha cofounder nikhil kamath said he no lon...
9,"Very concerned, don't know all facts: MoS IT o...",Minister of State for IT Rajeev Chandrasekhar ...,minister of state for it rajeev chandrasekhar ...


In [11]:
news.head()

Unnamed: 0,title,original,clean
0,Govt probing accounts of Adani Group-run Mumba...,The Ministry of Corporate Affairs has opened a...,the ministry of corporate affairs has opened a...
1,IndiGo Co-founder Gangwal to buy SpiceJet stak...,IndiGo Co-founder Rakesh Gangwal is at advance...,indigo cofounder rakesh gangwal is at advanced...
2,"Who is KP Ramasamy, a farmer's son who entered...",KP Ramasamy has made his debut on Forbes India...,kp ramasamy has made his debut on forbes india...
3,"Zomato, McDonald's fined ₹1 lakh for deliverin...",Food delivery platform Zomato and fast food ch...,food delivery platform zomato and fast food ch...
4,India scraps plan to impose restrictions on la...,The Indian government has reversed its decisio...,the indian government has reversed its decisio...


## 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [12]:
import prepare_nlp


In [13]:
codeup = pd.read_csv('codeup_blogs.csv')
codeup = codeup.rename(columns={'content':'original'})
codeup = codeup.dropna()
codeup.head(3)

Unnamed: 0,title,original
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...
1,Women in tech: Panelist Spotlight – Magdalena ...,Codeup is hosting a Women in Tech Panel in hon...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Codeup is hosting a Women in Tech Panel in hon...


In [29]:
prepare_nlp.cleanse(codeup, 'original')

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...,may traditionally known asian american pacific...,may tradit known asian american pacif island a...,may traditionally known asian american pacific...
1,Women in tech: Panelist Spotlight – Magdalena ...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
3,Women in Tech: Panelist Spotlight – Sarah Mellor,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
4,Women in Tech: Panelist Spotlight – Madeleine ...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
...,...,...,...,...,...
265,Why Isn’t the San Antonio Tech Scene Growing F...,The simple answer is computer programming tale...,simple answer computer programming talent peop...,simpl answer comput program talent peopl take ...,simple answer computer programming talent peop...
266,Why People Can’t Learn Programming on Their Own,"While developing Codeup, we interviewed dozens...",developing codeup interviewed dozens people at...,develop codeup interview dozen peopl attempt l...,developing codeup interviewed dozen people att...
267,What is Our Noble Cause?,In his TEDx San Antonio presentation this fall...,tedx san antonio presentation fall nick longo ...,tedx san antonio present fall nick longo cofou...,tedx san antonio presentation fall nick longo ...
268,Scholarships for Women: Why We’re Doing It,A hot topic that is trending is the special tr...,hot topic trending special treatment ladies te...,hot topic trend special treatment ladi tech in...,hot topic trending special treatment lady tech...


In [15]:
codeup.head()

Unnamed: 0,title,original,clean
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...,may is traditionally known as asian american a...
1,Women in tech: Panelist Spotlight – Magdalena ...,Codeup is hosting a Women in Tech Panel in hon...,codeup is hosting a women in tech panel in hon...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Codeup is hosting a Women in Tech Panel in hon...,codeup is hosting a women in tech panel in hon...
3,Women in Tech: Panelist Spotlight – Sarah Mellor,Codeup is hosting a Women in Tech Panel in hon...,codeup is hosting a women in tech panel in hon...
4,Women in Tech: Panelist Spotlight – Madeleine ...,Codeup is hosting a Women in Tech Panel in hon...,codeup is hosting a women in tech panel in hon...


## 8. For each dataframe, produce the following columns:

* 'title' to hold the title
* 'original' to hold the original article/post content
* 'clean' to hold the normalized and tokenized original with the stopwords removed.
* 'stemmed' to hold the stemmed version of the cleaned data.
* 'lemmatized' to hold the lemmatized version of the cleaned data.

In [16]:
def clean_column(series):
    # Normalize, tokenize, and remove stopwords from each document in the series
    new_series = []
    for text in series:
        data = basic_clean(text)
        data = tokenize(data)
        data = remove_stopwords(data)
        new_series.append(data)
    return new_series

In [17]:
codeup['clean'] = clean_column(codeup.original)
codeup.head()

Unnamed: 0,title,original,clean
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...,may traditionally known asian american pacific...
1,Women in tech: Panelist Spotlight – Magdalena ...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...
3,Women in Tech: Panelist Spotlight – Sarah Mellor,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...
4,Women in Tech: Panelist Spotlight – Madeleine ...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...


In [18]:
news['clean'] = clean_column(news.original)
news.head()

Unnamed: 0,title,original,clean
0,Govt probing accounts of Adani Group-run Mumba...,The Ministry of Corporate Affairs has opened a...,ministry corporate affairs opened investigatio...
1,IndiGo Co-founder Gangwal to buy SpiceJet stak...,IndiGo Co-founder Rakesh Gangwal is at advance...,indigo cofounder rakesh gangwal advanced stage...
2,"Who is KP Ramasamy, a farmer's son who entered...",KP Ramasamy has made his debut on Forbes India...,kp ramasamy made debut forbes india ' 100 rich...
3,"Zomato, McDonald's fined ₹1 lakh for deliverin...",Food delivery platform Zomato and fast food ch...,food delivery platform zomato fast food chain ...
4,India scraps plan to impose restrictions on la...,The Indian government has reversed its decisio...,indian government reversed decision imposing r...


In [19]:
def stem_column(series):
    stemmed_data = []
    for col in series:
        data = stem(col)
        stemmed_data.append(data)
    return stemmed_data

In [20]:
codeup['stemmed'] = stem_column(codeup.clean)
codeup.head()

Unnamed: 0,title,original,clean,stemmed
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...,may traditionally known asian american pacific...,may tradit known asian american pacif island a...
1,Women in tech: Panelist Spotlight – Magdalena ...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...
3,Women in Tech: Panelist Spotlight – Sarah Mellor,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...
4,Women in Tech: Panelist Spotlight – Madeleine ...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...


In [21]:
news['stemmed'] = stem_column(news.clean)
news.head()

Unnamed: 0,title,original,clean,stemmed
0,Govt probing accounts of Adani Group-run Mumba...,The Ministry of Corporate Affairs has opened a...,ministry corporate affairs opened investigatio...,ministri corpor affair open investig account a...
1,IndiGo Co-founder Gangwal to buy SpiceJet stak...,IndiGo Co-founder Rakesh Gangwal is at advance...,indigo cofounder rakesh gangwal advanced stage...,indigo cofound rakesh gangwal advanc stage tal...
2,"Who is KP Ramasamy, a farmer's son who entered...",KP Ramasamy has made his debut on Forbes India...,kp ramasamy made debut forbes india ' 100 rich...,kp ramasami made debut forb india ' 100 riches...
3,"Zomato, McDonald's fined ₹1 lakh for deliverin...",Food delivery platform Zomato and fast food ch...,food delivery platform zomato fast food chain ...,food deliveri platform zomato fast food chain ...
4,India scraps plan to impose restrictions on la...,The Indian government has reversed its decisio...,indian government reversed decision imposing r...,indian govern revers decis impos restrict lapt...


In [22]:
def lem_column(series):
    lem_data = []

    for col in series:
        data = lemmatize(col)
        lem_data.append(data)
    return lem_data

In [23]:
codeup['lemmatized'] = lem_column(codeup.clean)
codeup.head()

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...,may traditionally known asian american pacific...,may tradit known asian american pacif island a...,may traditionally known asian american pacific...
1,Women in tech: Panelist Spotlight – Magdalena ...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
3,Women in Tech: Panelist Spotlight – Sarah Mellor,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
4,Women in Tech: Panelist Spotlight – Madeleine ...,Codeup is hosting a Women in Tech Panel in hon...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...


In [24]:
news['lemmatized'] = lem_column(news.clean)
news.head()

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Govt probing accounts of Adani Group-run Mumba...,The Ministry of Corporate Affairs has opened a...,ministry corporate affairs opened investigatio...,ministri corpor affair open investig account a...,ministry corporate affair opened investigation...
1,IndiGo Co-founder Gangwal to buy SpiceJet stak...,IndiGo Co-founder Rakesh Gangwal is at advance...,indigo cofounder rakesh gangwal advanced stage...,indigo cofound rakesh gangwal advanc stage tal...,indigo cofounder rakesh gangwal advanced stage...
2,"Who is KP Ramasamy, a farmer's son who entered...",KP Ramasamy has made his debut on Forbes India...,kp ramasamy made debut forbes india ' 100 rich...,kp ramasami made debut forb india ' 100 riches...,kp ramasamy made debut forbes india ' 100 rich...
3,"Zomato, McDonald's fined ₹1 lakh for deliverin...",Food delivery platform Zomato and fast food ch...,food delivery platform zomato fast food chain ...,food deliveri platform zomato fast food chain ...,food delivery platform zomato fast food chain ...
4,India scraps plan to impose restrictions on la...,The Indian government has reversed its decisio...,indian government reversed decision imposing r...,indian govern revers decis impos restrict lapt...,indian government reversed decision imposing r...


In [25]:
news.clean.iloc[0]

'ministry corporate affairs opened investigation accounts adani grouprun mumbai navi mumbai airports significant part information sought government pertains period fiscal 2018 2022 prior acquisition adani enterprises said company stated units respond communications applicable legal provisions'

In [26]:
news.lemmatized.iloc[0]

'ministry corporate affair opened investigation account adani grouprun mumbai navi mumbai airport significant part information sought government pertains period fiscal 2018 2022 prior acquisition adani enterprise said company stated unit respond communication applicable legal provision'
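
The two outputs above are easier to compare word by word. The following sketch copies both strings from the cells above and lists only the tokens that lemmatization changed; it uses plain Python and makes no assumptions beyond the two strings lining up token for token.

```python
clean = 'ministry corporate affairs opened investigation accounts adani grouprun mumbai navi mumbai airports significant part information sought government pertains period fiscal 2018 2022 prior acquisition adani enterprises said company stated units respond communications applicable legal provisions'
lemmatized = 'ministry corporate affair opened investigation account adani grouprun mumbai navi mumbai airport significant part information sought government pertains period fiscal 2018 2022 prior acquisition adani enterprise said company stated unit respond communication applicable legal provision'

# Pair up the tokens and keep only the pairs that differ
changed = [
    (before, after)
    for before, after in zip(clean.split(), lemmatized.split())
    if before != after
]

for before, after in changed:
    print(f"{before} -> {after}")
# affairs -> affair
# accounts -> account
# airports -> airport
# enterprises -> enterprise
# units -> unit
# communications -> communication
# provisions -> provision
```

Every change is a plural noun reduced to its singular form. Verb forms such as "opened", "pertains", and "stated" are left alone, which is consistent with `WordNetLemmatizer.lemmatize` treating each word as a noun unless a different `pos` argument is passed.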

## 9. Ask yourself:

* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

In [27]:
# For the 200TB of text I would stem, because lemmatizing it would be far too computationally demanding, especially on paid computational resources.
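
One way to ground the answers to these questions is to time each function on a sample and extrapolate to the three corpus sizes. The sketch below is a rough timing harness, not a benchmark: `seconds_per_megabyte` and `projected_hours` are hypothetical helper names, the workload is a placeholder (`str.split`) so the block runs without NLTK, and the linear extrapolation assumes processing time scales with corpus size.

```python
import time

def seconds_per_megabyte(func, sample_text, repeats=3):
    """Average wall-clock seconds func needs per megabyte of sample_text."""
    size_mb = len(sample_text.encode("utf-8")) / 1_000_000
    start = time.perf_counter()
    for _ in range(repeats):
        func(sample_text)
    per_run = (time.perf_counter() - start) / repeats
    return per_run / size_mb

def projected_hours(sec_per_mb, corpus_mb):
    """Extrapolate a measured rate to a corpus of corpus_mb megabytes."""
    return sec_per_mb * corpus_mb / 3600

# Corpus sizes from the questions, in decimal megabytes
corpus_sizes_mb = {"493KB": 0.493, "25MB": 25, "200TB": 200_000_000}

# Placeholder workload: swap in stem or lemmatize and a real sample of your text
sample = "codeup hosting women tech panel honor " * 1000
rate = seconds_per_megabyte(str.split, sample)

for label, size_mb in corpus_sizes_mb.items():
    print(label, f"{projected_hours(rate, size_mb):.6f} hours")
```

Running this once with `stem` and once with `lemmatize` in place of `str.split` would give a measured ratio between the two, which matters most for the 200TB case where compute is billed.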