The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [1]:
import pandas as pd
import numpy as np

import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import acquire

In [2]:
articles = acquire.get_news_articles()

In [3]:
articles

[{'title': '8, 7, 6.6, 5.8, 5 & 4.5 is the state of economy: Chidambaram on GDP growth',
  'body': 'Former Finance Minister P Chidambaram on Thursday said, "Nothing sums up the state of the economy better than the following series of numbers: 8, 7, 6.6, 5.8, 5 and 4.5." He added, "Those are the quarterly growth rates of GDP in the last six quarters." "We\'ll be lucky if growth rate touches 5% by the year\'s end," he further said.',
  'catergory': 'business'},
 {'title': 'We are the same animal, we are both a little crazy: Masayoshi Son on Jack Ma',
  'body': 'SoftBank CEO Masayoshi Son has said the decision to invest $20 million in Alibaba in 2000 was driven by gut feeling. "I could smell him...We\'re the same animal. We are both a little crazy," Son said of Jack Ma, Alibaba\'s Co-Founder. Ma said Son insisted on investing $50 million but that he declined saying it was too large a sum.',
  'catergory': 'business'},
 {'title': 'Sundar Pichai rejected Google shares worth millions in 2018

Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [4]:
example = "eSAM!lkMMM$%"

In [5]:
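def basic_clean(string):
    # Lowercase so the character filter below only needs a-z
    string = string.lower()
    # Decompose accented characters, then drop whatever is not ASCII
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # Remove anything that is not a letter, digit, single quote, or whitespace
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    # Collapse runs of line breaks into a single space
    string = re.sub(r'[\r\n]+', ' ', string)
    string = string.strip()
    return string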

In [6]:
basic_clean("HOW?dddé")

'howddde'
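To see what each step of the cleaning contributes, the intermediate values can be printed one at a time. This is a minimal sketch that repeats the same steps as basic_clean outside the function; the sample sentence and the variable names are made up for illustration.

```python
import re
import unicodedata

# Hypothetical sample text with an accent, a line break, and punctuation
text = "Café\nNo. 5 isn't OPEN!"

# Step 1: lowercase everything
lowered = text.lower()

# Step 2: NFKD splits "é" into "e" plus a combining accent; encoding to ASCII
# with errors="ignore" then drops the accent and any other non-ASCII character
normalized = unicodedata.normalize("NFKD", lowered).encode("ascii", "ignore").decode("ascii")

# Step 3: keep only letters, digits, single quotes, and whitespace
stripped = re.sub(r"[^a-z0-9'\s]", "", normalized)

# Step 4: turn line breaks into spaces and trim the ends
cleaned = re.sub(r"[\r\n]+", " ", stripped).strip()

for label, value in [
    ("lowered", lowered),
    ("normalized", normalized),
    ("stripped", stripped),
    ("cleaned", cleaned),
]:
    print(f"{label:>10}: {value!r}")
```

Note that the single quote in "isn't" survives, which is why possessives such as "google's" are still present when the text reaches the stemmer and lemmatizer.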

Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [7]:
def tokenize(string):
    tokenizer = ToktokTokenizer()
    return tokenizer.tokenize(string, return_str=True)

In [8]:
tokenize("Hello, WOrld!")

'Hello , WOrld !'

In [9]:
tokenize("email@hi.com")

'email@hi.com'

In [10]:
tokenize("2018-19")

'2018-19'

Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [11]:
def stem(string):
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in string.split()]
    string_of_stems = " ".join(stems)
    return string_of_stems

In [12]:
stem("running into a house")

'run into a hous'

Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [13]:
def lemmatize(string):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    string_of_lemmas = " ".join(lemmas)
    return string_of_lemmas

In [14]:
lemmatize("running into a house")

'running into a house'

Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [15]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    # Tokenize first so punctuation is split away from the words
    words = tokenize(string).split()

    # Start from NLTK's English list, drop the words we want to keep,
    # then add any extra stop words
    stopword_list = set(stopwords.words('english')) - set(exclude_words or [])
    stopword_list = stopword_list.union(set(extra_words or []))

    # Matching is case-sensitive and the NLTK list is lowercase
    filtered_words = [w for w in words if w not in stopword_list]
    final_string = ' '.join(filtered_words)
    return final_string

In [16]:
stopwords_list = ["a", "an", "the", "bob", "jane"]

In [17]:
words_to_exclude = ["an", "bob"]

In [18]:
set(stopwords_list) - set(words_to_exclude)

{'a', 'jane', 'the'}

In [19]:
remove_stopwords("The quick brown fox jumped over the lazy dog")

'The quick brown fox jumped lazy dog'
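The call above leaves "The" in place because the comparison against the stopword list is case-sensitive and the input was not lowercased first. The sketch below illustrates that, along with the effect of extra_words and exclude_words. To stay self-contained it uses a tiny hand-written stopword set and a hypothetical function name, remove_stopwords_demo, as stand-ins for NLTK's list and the function defined above.

```python
# Tiny stand-in for stopwords.words('english'); NOT the real NLTK list
DEMO_STOPWORDS = {"the", "over", "a", "an"}


def remove_stopwords_demo(string, extra_words=None, exclude_words=None):
    """Same set logic as remove_stopwords, using the stand-in list above."""
    # Drop the words we want to keep, then add any extra stop words
    stopword_set = (DEMO_STOPWORDS - set(exclude_words or [])) | set(extra_words or [])
    return " ".join(w for w in string.split() if w not in stopword_set)


sentence = "The quick brown fox jumped over the lazy dog"

# Capitalized "The" does not match the lowercase entry "the"
as_is = remove_stopwords_demo(sentence)

# Lowercasing first (as basic_clean does) removes both occurrences
lowered = remove_stopwords_demo(sentence.lower())

# Treat "fox" as an additional stop word, but keep "over"
customized = remove_stopwords_demo(
    sentence.lower(), extra_words=["fox"], exclude_words=["over"]
)

print(as_is)
print(lowered)
print(customized)
```

In prep_article below, the text passes through basic_clean before remove_stopwords, so the case mismatch does not arise there.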

In [20]:
def prep_article(df):
    # Clean the body once and reuse it for every derived column
    cleaned = df.body.apply(basic_clean)
    df["original"] = df.body
    df["stemmed"] = cleaned.apply(stem)
    df["lemmatized"] = cleaned.apply(lemmatize)
    df["clean"] = cleaned.apply(remove_stopwords)
    # Note: this modifies the DataFrame that was passed in
    df.drop(columns=["body"], inplace=True)
    return df

In [21]:
df = pd.DataFrame(articles)

In [22]:
prep_article(df)

Unnamed: 0,title,catergory,original,stemmed,lemmatized,clean
0,"8, 7, 6.6, 5.8, 5 & 4.5 is the state of econom...",business,Former Finance Minister P Chidambaram on Thurs...,former financ minist p chidambaram on thursday...,former finance minister p chidambaram on thurs...,former finance minister p chidambaram thursday...
1,"We are the same animal, we are both a little c...",business,SoftBank CEO Masayoshi Son has said the decisi...,softbank ceo masayoshi son ha said the decis t...,softbank ceo masayoshi son ha said the decisio...,softbank ceo masayoshi son said decision inves...
2,Sundar Pichai rejected Google shares worth mil...,business,Google's 47-year-old India-born CEO Sundar Pic...,google' 47yearold indiaborn ceo sundar pichai ...,google's 47yearold indiaborn ceo sundar pichai...,google ' 47yearold indiaborn ceo sundar pichai...
3,Gut feeling drove me to invest $20M in Alibaba...,business,"SoftBank Founder and CEO Masayoshi Son, in a d...",softbank founder and ceo masayoshi son in a di...,softbank founder and ceo masayoshi son in a di...,softbank founder ceo masayoshi son discussion ...
4,Maharashtra govt suggests merger of PMC Bank w...,business,In a bid to provide relief to depositors of sc...,in a bid to provid relief to depositor of scam...,in a bid to provide relief to depositor of sca...,bid provide relief depositors scamhit punjab m...
...,...,...,...,...,...,...
95,It motivates me: Ananya Panday on comparisons ...,entertainment,On being asked if comparisons with Sara Ali Kh...,on be ask if comparison with sara ali khan dis...,on being asked if comparison with sara ali kha...,asked comparisons sara ali khan disappoints an...
96,Farhan to produce biopic on mathematician Vash...,entertainment,"Farhan Akhtar's production house, Excel Entert...",farhan akhtar' product hous excel entertain wi...,farhan akhtar's production house excel enterta...,farhan akhtar ' production house excel enterta...
97,"I was bullied in school, that kind of scarred ...",entertainment,Singer Armaan Malik has revealed that he went ...,singer armaan malik ha reveal that he went thr...,singer armaan malik ha revealed that he went t...,singer armaan malik revealed went lot bullying...
98,I'm no one to tell her anything: Gera apologis...,entertainment,Comedian Gaurav Gera has apologised to singer ...,comedian gaurav gera ha apologis to singer neh...,comedian gaurav gera ha apologised to singer n...,comedian gaurav gera apologised singer neha ka...


Define a function named prepare_article_data that takes in the list of article dictionaries, applies the prep_article function to each one, and returns the transformed data.
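The cells in this notebook apply prep_article to a whole DataFrame rather than to individual dictionaries. One possible plain-Python reading of the prompt is sketched below. The helper prep_article_dict, the transforms parameter, and demo_transforms are hypothetical names introduced for illustration, and the demo transform is a stand-in so the block runs without NLTK.

```python
def prep_article_dict(article, transforms):
    """Return a copy of one article dict with 'body' replaced by 'original'
    plus one derived key per entry in transforms."""
    prepared = {key: value for key, value in article.items() if key != "body"}
    prepared["original"] = article["body"]
    for name, fn in transforms.items():
        prepared[name] = fn(article["body"])
    return prepared


def prepare_article_data(articles, transforms):
    """Apply prep_article_dict to every article; the input list is not modified."""
    return [prep_article_dict(article, transforms) for article in articles]


# Stand-in transform so the sketch is self-contained. In this notebook the
# mapping could instead look like:
#   {"stemmed": lambda s: stem(basic_clean(s)),
#    "lemmatized": lambda s: lemmatize(basic_clean(s)),
#    "clean": lambda s: remove_stopwords(basic_clean(s))}
demo_transforms = {"clean": str.lower}

sample_articles = [{"title": "T1", "body": "Hello World", "category": "demo"}]
prepared_articles = prepare_article_data(sample_articles, demo_transforms)
print(prepared_articles)
```

Returning new dictionaries instead of editing the inputs keeps the acquired data reusable, unlike prep_article, which drops the body column from the DataFrame it is given.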

In [23]:
codeup_df = acquire.get_blog_articles()
codeup_df = pd.DataFrame(codeup_df)

In [24]:
codeup_df

Unnamed: 0,title,body
0,Codeup’s Data Science Career Accelerator is Here!,\nThe rumors are true! The time has arrived. C...
1,Data Science Myths,\nBy Dimitri Antoniou and Maggie Giust\nData S...
2,Data Science VS Data Analytics: What’s The Dif...,"\nBy Dimitri Antoniou\nA week ago, Codeup laun..."
3,10 Tips to Crush It at the SA Tech Job Fair,\n10 Tips to Crush It at the SA Tech Job Fair\...
4,Competitor Bootcamps Are Closing. Is the Model...,\nCompetitor Bootcamps Are Closing. Is the Mod...


In [25]:
prep_article(codeup_df)

Unnamed: 0,title,original,stemmed,lemmatized,clean
0,Codeup’s Data Science Career Accelerator is Here!,\nThe rumors are true! The time has arrived. C...,the rumor are true the time ha arriv codeup ha...,the rumor are true the time ha arrived codeup ...,rumors true time arrived codeup officially ope...
1,Data Science Myths,\nBy Dimitri Antoniou and Maggie Giust\nData S...,by dimitri antoni and maggi giust data scienc ...,by dimitri antoniou and maggie giust data scie...,dimitri antoniou maggie giust data science big...
2,Data Science VS Data Analytics: What’s The Dif...,"\nBy Dimitri Antoniou\nA week ago, Codeup laun...",by dimitri antoni a week ago codeup launch our...,by dimitri antoniou a week ago codeup launched...,dimitri antoniou week ago codeup launched imme...
3,10 Tips to Crush It at the SA Tech Job Fair,\n10 Tips to Crush It at the SA Tech Job Fair\...,10 tip to crush it at the sa tech job fair sa ...,10 tip to crush it at the sa tech job fair sa ...,10 tips crush sa tech job fair sa tech job fai...
4,Competitor Bootcamps Are Closing. Is the Model...,\nCompetitor Bootcamps Are Closing. Is the Mod...,competitor bootcamp are close is the model in ...,competitor bootcamps are closing is the model ...,competitor bootcamps closing model danger prog...
