# Data Preparation Exercises

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import acquire
import prepare

import warnings
warnings.filterwarnings('ignore')

## In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [2]:
df = acquire.get_blog_articles()

In [3]:
df.head()

| Unnamed: 0 | title | published | content |
|---|---|---|---|
| 0 | Learn to Code: Python Workshop on 4/23 | Mar 31, 2022 | According to LinkedIn, the “#1 Most Promising ... |
| 1 | Coming Soon: Cloud Administration | Mar 17, 2022 | We’re launching a new program out of San Anton... |
| 2 | 5 Books Every Woman In Tech Should Read | Mar 8, 2022 | On this International Women’s Day 2022 we want... |
| 3 | Codeup Start Dates for March 2022 | Jan 26, 2022 | As we approach the end of January we wanted to... |
| 4 | VET TEC Funding Now Available For Dallas Veterans | Jan 7, 2022 | We are so happy to announce that VET TEC benef... |


## 1) Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [4]:
def basic_clean(instr):
    '''
    Clean a string by lowercasing it, normalizing unicode characters, and removing unwanted characters
    '''
    # Lowercase
    instr = instr.lower()
    # Normalize unicode, dropping any characters that have no ASCII equivalent
    instr = unicodedata.normalize('NFKD', instr).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # Remove anything that is not a letter, number, whitespace, or single quote
    # (raw string so that \s is passed to the regex engine unchanged)
    instr = re.sub(r"[^a-z0-9'\s]", '', instr)
    # Return the cleaned string
    return instr

In [5]:
test = df.content[0]
test

'According to LinkedIn, the “#1 Most Promising Job” is data science! But we here at Codeup understand changing careers can be a daunting idea. That’s where our free Learn to Code workshops come in!\xa0\nOn Saturday 4/23 we will be teaching a free Learn to Code workshop on the programming language Python which is one of the major building blocks of Data Science!\nWhat is data science? What is Python? \nIf you’re curious, join for free to learn the basics of Python from our very own instructors and get an introduction to the field of Data Science. This is all done from the comfort of home.\nSave your seat quickly – our Python workshops are always in high demand! \nWhat you need:\n1. Laptop (does not matter what kind). You need to be able to access WiFi and run an internet browser.\n2. To RSVP!\nYou can register for the event below!'

In [6]:
test_cleaned = basic_clean(test)
test_cleaned

'according to linkedin the 1 most promising job is data science but we here at codeup understand changing careers can be a daunting idea thats where our free learn to code workshops come in \non saturday 423 we will be teaching a free learn to code workshop on the programming language python which is one of the major building blocks of data science\nwhat is data science what is python \nif youre curious join for free to learn the basics of python from our very own instructors and get an introduction to the field of data science this is all done from the comfort of home\nsave your seat quickly  our python workshops are always in high demand \nwhat you need\n1 laptop does not matter what kind you need to be able to access wifi and run an internet browser\n2 to rsvp\nyou can register for the event below'
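
One side effect is visible in the cleaned output above: the curly apostrophe (’, U+2019) has no ASCII equivalent, so the encode step drops it and "That’s" becomes "thats" rather than "that's". Below is a minimal sketch of one way to keep contractions intact, assuming it is acceptable to map curly single quotes to straight ones before normalizing. The function name `basic_clean_keep_apostrophes` is hypothetical and not part of the exercise.

```python
import re
import unicodedata


def basic_clean_keep_apostrophes(text):
    """Hypothetical variant of basic_clean that preserves apostrophes."""
    text = text.lower()
    # Map curly single quotes (U+2018, U+2019) to a straight quote first,
    # so the ASCII encoding step below does not silently drop them.
    text = text.replace('\u2018', "'").replace('\u2019', "'")
    # Same normalization as basic_clean
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    # Keep only letters, digits, whitespace, and single quotes
    return re.sub(r"[^a-z0-9'\s]", '', text)


sample = 'That\u2019s where our free Learn to Code workshops come in!'
print(basic_clean_keep_apostrophes(sample))
```

Whether this is desirable depends on the later steps: NLTK's English stopword list contains contractions written with straight apostrophes, so keeping them may change which words get removed.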

## 2) Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [7]:
def tokenize(instr):
    '''
    Tokenize the target string, breaking up words and punctuation into discrete units
    '''
    
    tokenizer = ToktokTokenizer()
    
    instr = tokenizer.tokenize(instr, return_str = True)
    
    return instr

In [8]:
test_tokenized = tokenize(test_cleaned)
test_tokenized

'according to linkedin the 1 most promising job is data science but we here at codeup understand changing careers can be a daunting idea thats where our free learn to code workshops come in \non saturday 423 we will be teaching a free learn to code workshop on the programming language python which is one of the major building blocks of data science\nwhat is data science what is python \nif youre curious join for free to learn the basics of python from our very own instructors and get an introduction to the field of data science this is all done from the comfort of home\nsave your seat quickly our python workshops are always in high demand \nwhat you need\n1 laptop does not matter what kind you need to be able to access wifi and run an internet browser\n2 to rsvp\nyou can register for the event below'

## 3) Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [9]:
def stem(instr):
    '''
    Apply Porter stemming to each word in the string and return the stemmed text
    '''
    
    ps = nltk.porter.PorterStemmer()
    
    stems = [ps.stem(word) for word in instr.split()]
    
    instr = ' '.join(stems)
    
    return instr

In [10]:
test_stem = stem(test_tokenized)
test_stem

'accord to linkedin the 1 most promis job is data scienc but we here at codeup understand chang career can be a daunt idea that where our free learn to code workshop come in on saturday 423 we will be teach a free learn to code workshop on the program languag python which is one of the major build block of data scienc what is data scienc what is python if your curiou join for free to learn the basic of python from our veri own instructor and get an introduct to the field of data scienc thi is all done from the comfort of home save your seat quickli our python workshop are alway in high demand what you need 1 laptop doe not matter what kind you need to be abl to access wifi and run an internet browser 2 to rsvp you can regist for the event below'

In [11]:
pd.Series(test_stem.split()).value_counts().head(10)

to        8
the       7
of        5
is        5
scienc    4
python    4
what      4
data      4
free      3
you       3
dtype: int64

## 4) Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [12]:
def lemmatize(instr):
    '''
    Apply WordNet lemmatization to each word in the string and return the lemmatized text
    '''
    
    wnl = nltk.stem.WordNetLemmatizer()
    
    lemmas = [wnl.lemmatize(word) for word in instr.split()]
    
    instr = ' '.join(lemmas)
    
    return instr

In [13]:
test_lemma = lemmatize(test_tokenized)
test_lemma

'according to linkedin the 1 most promising job is data science but we here at codeup understand changing career can be a daunting idea thats where our free learn to code workshop come in on saturday 423 we will be teaching a free learn to code workshop on the programming language python which is one of the major building block of data science what is data science what is python if youre curious join for free to learn the basic of python from our very own instructor and get an introduction to the field of data science this is all done from the comfort of home save your seat quickly our python workshop are always in high demand what you need 1 laptop doe not matter what kind you need to be able to access wifi and run an internet browser 2 to rsvp you can register for the event below'

In [14]:
pd.Series(test_lemma.split()).value_counts().head(10)

to         8
the        7
of         5
is         5
what       4
python     4
data       4
science    4
be         3
you        3
dtype: int64

## 5) Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

- This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [15]:
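def remove_stopwords(instr, extra_words=None, exclude_words=None):
    '''
    Remove English stopwords from a string.
    extra_words: additional words to treat as stopwords.
    exclude_words: stopwords that should be kept in the text.
    '''
    # None defaults avoid sharing one mutable list between calls
    stopword_list = stopwords.words('english')
    
    # Drop the stopwords we want to keep (words not in the list are ignored)
    if exclude_words:
        stopword_list = [word for word in stopword_list if word not in exclude_words]
    
    # Add any extra words that should also be removed
    if extra_words:
        stopword_list.extend(extra_words)
    
    words = instr.split()
    
    filtered_words = [w for w in words if w not in stopword_list]
    
    words_removed = ' '.join(filtered_words)
    
    return words_removed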

In [16]:
extra_words = []
exclude_words = ['to']

In [17]:
pd.Series(remove_stopwords(test_lemma, exclude_words= exclude_words).split()).value_counts().head()

to          8
python      4
data        4
science     4
workshop    3
dtype: int64
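
The interplay of `extra_words` and `exclude_words` can also be written with set operations, which makes the order of precedence explicit (an excluded word is kept even if it appears in both lists). This is a minimal sketch, not the notebook's implementation: `remove_stopwords_sketch` is a hypothetical name, and `demo_stopwords` is a tiny made-up list standing in for NLTK's English stopwords so the example runs without any corpus download.

```python
def remove_stopwords_sketch(text, stopword_list, extra_words=None, exclude_words=None):
    """Remove stopwords using set arithmetic: (base | extra) - exclude."""
    stops = (set(stopword_list) | set(extra_words or [])) - set(exclude_words or [])
    return ' '.join(word for word in text.split() if word not in stops)


# Made-up stand-in for stopwords.words('english')
demo_stopwords = ['to', 'the', 'is', 'of']

print(remove_stopwords_sketch(
    'python is one of the major tools to learn',
    demo_stopwords,
    exclude_words=['to'],
))
```

Set membership checks are also faster than list membership checks, which starts to matter when the same function is applied to many documents.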

## 6) Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [18]:
news_df = acquire.get_news_articles()

## 7) Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [19]:
codeup_df = acquire.get_blog_articles()

## 8) For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [24]:
news_df = news_df.rename(columns={'content':'original'})
news_df['clean'] = prepare.remove_stopwords(prepare.tokenize(prepare.basic_clean(str(news_df.original))))
news_df['stemmed'] = prepare.stem(str(news_df.clean))
news_df['lemmatized'] = prepare.lemmatize(str(news_df.clean))
news_df.head()

| Unnamed: 0 | category | title | original | author | published | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|---|---|
| 0 | business | Rupee closes at all-time low of 77.50 against ... | The Indian rupee weakened further on Monday to... | Pragya Swastik | 2022-05-09T15:27:43.000Z | 0 indian rupee weakened monday 1 microsoft sai... | 0 0 indian rupe weaken monday 1 microsoft sai.... | 0 0 indian rupee weakened monday 1 microsoft s... |
| 1 | business | Microsoft to help cover US employees' travel c... | Microsoft has said that it will cover travel c... | Ridham Gambhir | 2022-05-10T03:42:26.000Z | 0 indian rupee weakened monday 1 microsoft sai... | 0 0 indian rupe weaken monday 1 microsoft sai.... | 0 0 indian rupee weakened monday 1 microsoft s... |
| 2 | business | When are you coming to deliver 1st Tesla? Payt... | Paytm CEO Vijay Shekhar Sharma took to Twitter... | Ridham Gambhir | 2022-05-10T05:08:13.000Z | 0 indian rupee weakened monday 1 microsoft sai... | 0 0 indian rupe weaken monday 1 microsoft sai.... | 0 0 indian rupee weakened monday 1 microsoft s... |
| 3 | business | Layout of 'world's first Bitcoin City' in El S... | El Salvador's President Nayib Bukele has share... | Hiral Goyal | 2022-05-10T13:24:11.000Z | 0 indian rupee weakened monday 1 microsoft sai... | 0 0 indian rupe weaken monday 1 microsoft sai.... | 0 0 indian rupee weakened monday 1 microsoft s... |
| 4 | business | After Musk's Taj Mahal tweet, his mother says ... | After Elon Musk tweeted he visited Taj Mahal i... | Apaar Sharma | 2022-05-10T04:18:35.000Z | 0 indian rupee weakened monday 1 microsoft sai... | 0 0 indian rupe weaken monday 1 microsoft sai.... | 0 0 indian rupee weakened monday 1 microsoft s... |
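
In the output above, every row of clean, stemmed, and lemmatized holds the same text. That happens because `str(news_df.original)` turns the whole Series, index labels included, into a single string, and that one string is then assigned to every row. A minimal row-wise sketch using `Series.apply` follows. Here `clean_text` is a hypothetical stand-in for the prepare functions and `demo_df` is made-up data, so the example runs without NLTK.

```python
import pandas as pd


def clean_text(text):
    """Hypothetical stand-in for the prepare.* cleaning functions."""
    return ' '.join(text.lower().split())


# Made-up data in place of the acquired articles
demo_df = pd.DataFrame({'original': ['The Rupee  FELL', 'Microsoft to Cover Travel']})

# Whole-Series conversion: one string built from every row, copied to all rows
demo_df['clean_broadcast'] = clean_text(str(demo_df.original))

# Row-wise: each article is cleaned on its own
demo_df['clean'] = demo_df['original'].apply(clean_text)

print(demo_df[['original', 'clean']])
```

With the notebook's own functions, the same pattern would read `news_df.original.apply(prepare.basic_clean).apply(prepare.tokenize).apply(prepare.remove_stopwords)`, followed by `news_df.clean.apply(prepare.stem)` and `news_df.clean.apply(prepare.lemmatize)`. The same applies to codeup_df.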


In [26]:
codeup_df = codeup_df.rename(columns={'content':'original'})
codeup_df['clean'] = prepare.remove_stopwords(prepare.tokenize(prepare.basic_clean(str(codeup_df.original))))
codeup_df['stemmed'] = prepare.stem(str(codeup_df.clean))
codeup_df['lemmatized'] = prepare.lemmatize(str(codeup_df.clean))
codeup_df.head()

| Unnamed: 0 | title | published | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|
| 0 | Learn to Code: Python Workshop on 4/23 | Mar 31, 2022 | According to LinkedIn, the “#1 Most Promising ... | 0 according linkedin 1 promising 1 launching n... | 0 0 accord linkedin 1 promis 1 launch n... 1 0... | 0 0 according linkedin 1 promising 1 launching... |
| 1 | Coming Soon: Cloud Administration | Mar 17, 2022 | We’re launching a new program out of San Anton... | 0 according linkedin 1 promising 1 launching n... | 0 0 accord linkedin 1 promis 1 launch n... 1 0... | 0 0 according linkedin 1 promising 1 launching... |
| 2 | 5 Books Every Woman In Tech Should Read | Mar 8, 2022 | On this International Women’s Day 2022 we want... | 0 according linkedin 1 promising 1 launching n... | 0 0 accord linkedin 1 promis 1 launch n... 1 0... | 0 0 according linkedin 1 promising 1 launching... |
| 3 | Codeup Start Dates for March 2022 | Jan 26, 2022 | As we approach the end of January we wanted to... | 0 according linkedin 1 promising 1 launching n... | 0 0 accord linkedin 1 promis 1 launch n... 1 0... | 0 0 according linkedin 1 promising 1 launching... |
| 4 | VET TEC Funding Now Available For Dallas Veterans | Jan 7, 2022 | We are so happy to announce that VET TEC benef... | 0 according linkedin 1 promising 1 launching n... | 0 0 accord linkedin 1 promis 1 launch n... 1 0... | 0 0 according linkedin 1 promising 1 launching... |


## 9) Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

- 493 KB: I would probably use lemmatized text. The data set is small enough that the extra computation is affordable, and lemmatization is more precise because it maps each word to a real dictionary form instead of just cutting off the ending as stemming does.
- 25 MB: I would still probably use lemmatized text, for the same reason.
- 200 TB: I would probably switch to stemming, since efficiency matters most when computation is billed by the megabyte.

In [28]:
codeup_df.memory_usage().sum()

848

In [29]:
news_df.memory_usage().sum()

6528
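
These totals are likely much smaller than the text itself. By default, `memory_usage()` counts only the references stored in object columns, not the Python strings they point to. Passing `deep=True` asks pandas to measure the strings as well, which gives a figure closer to the corpus size discussed in question 9. A minimal sketch, assuming object-dtype text columns; `demo_df` is made-up data:

```python
import pandas as pd

# Made-up data: one short and one long text, stored as Python objects
demo_df = pd.DataFrame({
    'original': pd.Series(['a short post', 'a much longer post ' * 50], dtype=object),
})

# Shallow count: index plus one reference per row
shallow = demo_df.memory_usage().sum()

# Deep count: also includes the size of each string object
deep = demo_df.memory_usage(deep=True).sum()

print(shallow, deep)
```

On the notebook's dataframes, the equivalent calls would be `codeup_df.memory_usage(deep=True).sum()` and `news_df.memory_usage(deep=True).sum()`.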