### Preparing & Parsing Text Data:
After acquiring the data (for example, by web scraping), we need to clean it and break it down into smaller units before we can analyze it.

In [32]:
#import nltk (Natural Language Toolkit) to help with parsing, then download the data it needs:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jeneyring/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jeneyring/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/jeneyring/nltk_data...


True

Steps for parsing the data:
<br>
    1) Convert the text to lowercase for consistency.<br>
    2) Remove accented and other non-ASCII characters.<br>
    3) Remove special characters.<br>
    4) Stem or lemmatize the words (stemming chops affixes off to get a root form, e.g. "calling" -> "call"; lemmatizing maps a word to its dictionary form).<br>
    5) Remove stopwords (if, and, the, etc.).<br>
    6) Store the clean text and the original text for use in future notebooks.

#NOTE: if your corpus is large enough, it's OK to clean up and strip down the text aggressively until only plain natural language remains.
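
The steps listed above can be sketched end to end with only the standard library. This is a minimal illustration, not the notebook's implementation: `prepare_sketch` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and step 4 (stemming/lemmatizing) is left out.

```python
import json
import re
import unicodedata

# Hypothetical, deliberately tiny stopword set used only for this sketch;
# the notebook itself relies on NLTK's English stopword list.
TOY_STOPWORDS = {"the", "a", "an", "and", "if", "of", "to", "is"}


def prepare_sketch(text):
    """Apply steps 1, 2, 3 and 5 to a string and return the cleaned text."""
    # step 1: lowercase
    text = text.lower()
    # step 2: split accents off their letters, then drop anything non-ASCII
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # step 3: remove special characters
    text = re.sub(r"[^a-z0-9'\s]", "", text)
    # step 5: remove stopwords
    words = [word for word in text.split() if word not in TOY_STOPWORDS]
    return " ".join(words)


original_text = "The Café is open, and the coffee is GOOD!"

# step 6: keep the original and the clean text side by side
record = {"original": original_text, "clean": prepare_sketch(original_text)}

print(json.dumps(record, ensure_ascii=False))
```

Keeping both versions in one record means later notebooks can still go back to the untouched text if the cleaning turns out to be too aggressive.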

In [36]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from time import strftime

import pandas as pd

import acquire


In [3]:
original = acquire.get_news_articles()

### Exercise 1)
Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

In [4]:
#look at the df to consider what needs to be applied:
#i.e. I want to keep my titles upper case but my content can be lowercase
original.head()

Unnamed: 0,title,content,author,date,source,category
0,"You all call me fugitive, which court has ever...",Lalit Modi on Sunday took to Instagram to spea...,Ridham Gambhir,17 Jul,,business
1,World's biggest NFT marketplace OpenSea fires ...,The world's first and biggest NFT marketplace ...,Ridham Gambhir,16 Jul,,business
2,"BCCI had ₹40 cr in bank when I joined & ₹47,68...","In an Instagram post, Lalit Modi asserted that...",Ridham Gambhir,17 Jul,,business
3,A fighter to the core: Mahindra praises PV Sin...,Businessman Anand Mahindra took to Twitter to ...,Ridham Gambhir,17 Jul,,business
4,Twitter's sudden speed for trial after 2 month...,Tesla CEO Elon Musk has opposed Twitter's requ...,Ridham Gambhir,16 Jul,,business


In [5]:
#applying lowercase to one string in dataframe:
string = original.content[0]
string

'Lalit Modi on Sunday took to Instagram to speak about various issues after he revealed his relationship with Sushmita Sen. He wrote, "[Though you all] call me a "fugitive"...tell me which court has ever convicted me...None…Everyone knows how difficult it is to do business in India." Speaking about IPL, he said, “Everyone...knows that I did it all alone"'

In [6]:
#lowering all capitalized letters:
string = string.lower()

In [7]:
#normalizing the content of string:
string=unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8', 'ignore')

In [8]:
#removing anything not a letter, number, whitespace or single quote:
string = re.sub(r"[^a-z0-9'\s]", '' ,string)

In [9]:
#checking output:
string

'lalit modi on sunday took to instagram to speak about various issues after he revealed his relationship with sushmita sen he wrote though you all call me a fugitivetell me which court has ever convicted menoneeveryone knows how difficult it is to do business in india speaking about ipl he said everyoneknows that i did it all alone'

In [10]:
#putting it all together in a function:

def basic_clean(string):
    """Lowercase a string, normalize its unicode characters to ASCII and remove special characters."""
    string = string.lower()
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    return string

Another way to do this:
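
One possible variation, offered as a sketch rather than the notebook's original alternative: in the cleaned output above, `"fugitive"...tell` collapsed into `fugitivetell` because the punctuation between the words was deleted outright. Replacing unwanted characters with a space, then collapsing the leftover whitespace, keeps such words apart. The name `basic_clean_spaced` is hypothetical.

```python
import re
import unicodedata


def basic_clean_spaced(string):
    """Hypothetical variant of basic_clean: swap unwanted characters for a space."""
    string = string.lower()
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('ascii')
    # replace (rather than delete) anything that is not a letter, number, whitespace or single quote
    string = re.sub(r"[^a-z0-9'\s]", ' ', string)
    # collapse the runs of whitespace left behind by the replacement
    return re.sub(r"\s+", ' ', string).strip()


print(basic_clean_spaced('call me a "fugitive"...tell me which court'))
```

The trade-off is that this version also collapses newlines and tabs into single spaces, which may or may not be wanted depending on what comes next in the pipeline.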

### Exercise 2)
Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [11]:
#using ToktokTokenizer to break the text down into units (tokens):
tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize(string, return_str=True))


lalit modi on sunday took to instagram to speak about various issues after he revealed his relationship with sushmita sen he wrote though you all call me a fugitivetell me which court has ever convicted menoneeveryone knows how difficult it is to do business in india speaking about ipl he said everyoneknows that i did it all alone


In [12]:
def tokenize(string):
    """Take in a string, split it into tokens with ToktokTokenizer and return
    the tokenized string."""
    tokenizer = nltk.tokenize.ToktokTokenizer()

    string = tokenizer.tokenize(string, return_str=True)
    return string

In [13]:
#testing out functions:
original2 = acquire.get_news_articles()
string2 = original2.content[1]
string2

'The world\'s first and biggest NFT marketplace OpenSea has fired 20% of its employees citing "cryptocurrency winter" and "broad macroeconomic instability". CEO Devin Finzer shared the news on Twitter and said the affected employees will be provided with a generous severance and healthcare coverage into 2023. He added that the company will also help in the placement of these employees. '

In [18]:
#testing basic_clean on the second article:
cleaned = basic_clean(string2)
cleaned

"the world's first and biggest nft marketplace opensea has fired 20 of its employees citing cryptocurrency winter and broad macroeconomic instability ceo devin finzer shared the news on twitter and said the affected employees will be provided with a generous severance and healthcare coverage into 2023 he added that the company will also help in the placement of these employees "

In [19]:
#testing tokenize on the cleaned string:
tokenize(cleaned)

"the world ' s first and biggest nft marketplace opensea has fired 20 of its employees citing cryptocurrency winter and broad macroeconomic instability ceo devin finzer shared the news on twitter and said the affected employees will be provided with a generous severance and healthcare coverage into 2023 he added that the company will also help in the placement of these employees"
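
The output above shows `world's` turning into `world ' s`: the tokenizer treats the apostrophe as its own token. The sketch below reproduces that behavior on this input with a plain regular expression. It is a hypothetical stand-in to illustrate the idea, not ToktokTokenizer's actual algorithm, and `simple_tokenize` is an assumed name.

```python
import re


def simple_tokenize(string):
    """Hypothetical stand-in tokenizer: words and punctuation become separate tokens."""
    # \w+ grabs runs of letters/digits/underscores;
    # [^\w\s] grabs each punctuation mark on its own
    tokens = re.findall(r"\w+|[^\w\s]", string)
    return ' '.join(tokens)


print(simple_tokenize("the world's first and biggest nft marketplace"))
```

If contractions and possessives should stay intact, tokenizing before removing punctuation (or keeping the apostrophe inside the word pattern) is one option to consider.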

### Exercise 3)
Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [22]:
#defining the stem tools: stemming pulls the root out of any word with affixes (prefixes/suffixes)
# create the nltk stemmer object, then use it:
ps = nltk.porter.PorterStemmer()

ps.stem('call'), ps.stem('calling'), ps.stem('called')

('call', 'call', 'call')

In [26]:
stems = [ps.stem(word) for word in cleaned.split()]
article_stemmed = ' '.join(stems)
print(article_stemmed)

the world' first and biggest nft marketplac opensea ha fire 20 of it employe cite cryptocurr winter and broad macroeconom instabl ceo devin finzer share the news on twitter and said the affect employe will be provid with a gener sever and healthcar coverag into 2023 he ad that the compani will also help in the placement of these employe


In [37]:
def stem(string):
    """This function takes in a string and returns the stemmed version of string"""
    ps = nltk.porter.PorterStemmer()
    # stem the words of the string passed in, not the global `cleaned` variable
    stems = [ps.stem(word) for word in string.split()]
    string = ' '.join(stems)

    return string
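
To make the idea of stemming concrete, here is a deliberately naive suffix stripper. It is a hypothetical toy, far cruder than the Porter algorithm used in this notebook, and the names `naive_stem` and `naive_stem_text` are assumptions. It does show the same split-stem-join pattern, applied only to the string that is passed in.

```python
def naive_stem(word):
    """Hypothetical toy stemmer: strip one common suffix. Not the Porter algorithm."""
    for suffix in ('ing', 'ed', 's'):
        # only strip when a reasonably long root is left behind
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def naive_stem_text(string):
    """Stem every whitespace-separated word in the string passed in."""
    return ' '.join(naive_stem(word) for word in string.split())


print(naive_stem_text('call calling called calls'))
```

Rule-based chopping like this is fast but blunt, which is why stemmed output often contains non-words such as `marketplac` or `employe`.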

### Exercise 4)
Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [34]:
#testing out lemmatizer:
wnl = nltk.stem.WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

for word in sentence.split():
    print('stem:', ps.stem(word), '-- lemma:', wnl.lemmatize(word))


stem: he -- lemma: He
stem: wa -- lemma: wa
stem: run -- lemma: running
stem: and -- lemma: and
stem: eat -- lemma: eating
stem: at -- lemma: at
stem: same -- lemma: same
stem: time. -- lemma: time.
stem: he -- lemma: He
stem: ha -- lemma: ha
stem: bad -- lemma: bad
stem: habit -- lemma: habit
stem: of -- lemma: of
stem: swim -- lemma: swimming
stem: after -- lemma: after
stem: play -- lemma: playing
stem: long -- lemma: long
stem: hour -- lemma: hour
stem: in -- lemma: in
stem: the -- lemma: the
stem: sun. -- lemma: Sun.


In [35]:
#running both on the cleaned data to compare stemmed vs. lemmatized versions:
wnl = nltk.stem.WordNetLemmatizer()
for word in cleaned.split():
    print('stem:', ps.stem(word), '-- lemma:', wnl.lemmatize(word))

stem: the -- lemma: the
stem: world' -- lemma: world's
stem: first -- lemma: first
stem: and -- lemma: and
stem: biggest -- lemma: biggest
stem: nft -- lemma: nft
stem: marketplac -- lemma: marketplace
stem: opensea -- lemma: opensea
stem: ha -- lemma: ha
stem: fire -- lemma: fired
stem: 20 -- lemma: 20
stem: of -- lemma: of
stem: it -- lemma: it
stem: employe -- lemma: employee
stem: cite -- lemma: citing
stem: cryptocurr -- lemma: cryptocurrency
stem: winter -- lemma: winter
stem: and -- lemma: and
stem: broad -- lemma: broad
stem: macroeconom -- lemma: macroeconomic
stem: instabl -- lemma: instability
stem: ceo -- lemma: ceo
stem: devin -- lemma: devin
stem: finzer -- lemma: finzer
stem: share -- lemma: shared
stem: the -- lemma: the
stem: news -- lemma: news
stem: on -- lemma: on
stem: twitter -- lemma: twitter
stem: and -- lemma: and
stem: said -- lemma: said
stem: the -- lemma: the
stem: affect -- lemma: affected
stem: employe -- lemma: employee
stem: will -- lemma: will
stem: be

In [41]:
def lemmatize(string):
    """This function takes in a string and returns a lemmatized version of the string."""
    # create our lemmatizer object
    wnl = nltk.stem.WordNetLemmatizer()
    # use a list comprehension to lemmatize each word
    # string.split() => output a list of every token inside of the document
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    # glue the lemmas back together by the strings we split on
    string = ' '.join(lemmas)
    #return the altered document
    return string
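
In the comparison output above, words such as `running` and `fired` come back unchanged from the lemmatizer. `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun, so verbs are only reduced when they are looked up as verbs. The toy below illustrates that dependence on part of speech with a hypothetical three-entry lookup table (`TOY_LEMMAS` and `toy_lemmatize` are assumed names); a real lemmatizer consults a full dictionary instead.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
# A real lemmatizer consults a full dictionary such as WordNet instead.
TOY_LEMMAS = {
    ('running', 'v'): 'run',
    ('fired', 'v'): 'fire',
    ('employees', 'n'): 'employee',
}


def toy_lemmatize(word, pos='n'):
    """Look the word up under the given part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize('running'))           # looked up as a noun, so unchanged
print(toy_lemmatize('running', pos='v'))  # looked up as a verb
```

With NLTK, getting verb lemmas for a whole document would therefore require tagging each word's part of speech first, which the `lemmatize` function in this notebook does not attempt.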

### Exercise 5)
Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

 <b>Note about sets vs. lists: a set is an unordered collection of unique items, so it cannot hold duplicates.</b>
- sets behave much like the keys of a dictionary.
- we can store stopwords in a set so that each word appears only once (with a list, adding words without checking what is already there could produce duplicates).

In [38]:
#two lists with the same items in a different order are equal as sets:
list1 = [1,2,3,4]
list2 = [2,1,3,4]

print(set(list1)== set(list2))

True


In [39]:
#list vs set:
mylist = ['a','b','c']

myset = set(mylist)

print(mylist, myset)

['a', 'b', 'c'] {'b', 'c', 'a'}
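
Sets also support difference (`-`) and union (`|`), which is what makes them convenient for customizing a stopword list. The sketch below uses a hypothetical miniature word list in place of NLTK's stopwords; the variable names are assumptions for illustration.

```python
# Hypothetical miniature stopword list; note the duplicate 'the'
base_words = ['the', 'and', 'if', 'the', 'no']

stopword_set = set(base_words)   # duplicates collapse automatically
exclude_words = ['no']           # words we want to keep in the text
extra_words = ['said']           # additional words we want to drop

# remove the excluded words, then add the extra ones
stopword_set = (stopword_set - set(exclude_words)) | set(extra_words)

print(sorted(stopword_set))
print('said' in stopword_set, 'no' in stopword_set)
```

Membership checks (`word in stopword_set`) are also faster on a set than on a list, which matters when every word of a large corpus is tested.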


In [40]:
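def remove_stopwords(string, extra_words = [], exclude_words = []):
    """Return the string with English stopwords removed."""
    # start from NLTK's English stopwords, drop exclude_words, add extra_words
    stopword_list = (set(stopwords.words('english')) - set(exclude_words)) | set(extra_words)
    # keep only the words that are not stopwords
    return ' '.join(word for word in string.split() if word not in stopword_list)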