# Text Cleaning 

## BRZOZOWSKI MAREK

## FIND PHRASES OF YOUR CHOICE USING REGULAR EXPRESSIONS

For this task I will extract noun phrases and single verbs from Wikipedia data.

First, we need to set up our environment and load our data.

In [3]:
# Importing the packages needed to complete the task described above.
import wikipediaapi
import nltk

from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.chunk import RegexpParser


Second, a quick preview shows the type of data we are working with.

In [4]:
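# Loading the data and previewing it: checking existence, title, URL, and the page summary.
# Note: recent wikipedia-api releases expect a user_agent string as the first argument;
# the call below matches the older signature in use when this notebook was run.
wiki_wiki = wikipediaapi.Wikipedia('en')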
page_py = wiki_wiki.page('George W. Bush')
print("Page - Exists: %s" % page_py.exists())
print("Page - Title: %s" % page_py.title)
print(page_py.fullurl,'\n')
print("Page - Summary: \n%s" % page_py.summary)

Page - Exists: True
Page - Title: George W. Bush
https://en.wikipedia.org/wiki/George_W._Bush 

Page - Summary: 
George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009. A member of the Republican Party, he had previously served as the 46th governor of Texas from 1995 to 2000. Born into the Bush family, his father, George H. W. Bush, served as the 41st president of the United States from 1989 to 1993.
Bush is the eldest son of Barbara and George H. W. Bush. As such he is the second son of a former United States president to himself become the American president, with the first being John Quincy Adams, the son of John Adams. He flew warplanes in the Texas and Alabama Air National Guard. After graduating from Yale College in 1968 and Harvard Business School in 1975, he worked in the oil industry. Bush married Laura Welch in 1977 and unsuccessfully ran for the U.S. House of Representatives sho

In [5]:
# Converting the Wikipedia page into usable raw text
raw_text = page_py.text
raw_text[:300]

'George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009. A member of the Republican Party, he had previously served as the 46th governor of Texas from 1995 to 2000. Born into the Bush family, his father,'

In [6]:
# Word tokenizer: splits the text into word-level units called tokens, using spaces and punctuation as boundaries.
word_list = word_tokenize(raw_text)

# Preview and quick examinations
print('Word Length', len(word_list),'\n')
print(word_list[0:100])

Word Length 17125 

['George', 'Walker', 'Bush', '(', 'born', 'July', '6', ',', '1946', ')', 'is', 'an', 'American', 'politician', 'and', 'businessman', 'who', 'served', 'as', 'the', '43rd', 'president', 'of', 'the', 'United', 'States', 'from', '2001', 'to', '2009', '.', 'A', 'member', 'of', 'the', 'Republican', 'Party', ',', 'he', 'had', 'previously', 'served', 'as', 'the', '46th', 'governor', 'of', 'Texas', 'from', '1995', 'to', '2000', '.', 'Born', 'into', 'the', 'Bush', 'family', ',', 'his', 'father', ',', 'George', 'H.', 'W.', 'Bush', ',', 'served', 'as', 'the', '41st', 'president', 'of', 'the', 'United', 'States', 'from', '1989', 'to', '1993', '.', 'Bush', 'is', 'the', 'eldest', 'son', 'of', 'Barbara', 'and', 'George', 'H.', 'W.', 'Bush', '.', 'As', 'such', 'he', 'is', 'the', 'second']


Third, we filter the data to remove unnecessary stop words.

In [7]:
# Removal of stop words. Stop words are words that we do not want to process. They are usually part of the language structure, such as: the, is, are, etc.
stop_words = set(stopwords.words('english'))

# Filtering out stop words (the comparison is case-sensitive, so capitalised tokens such as 'A' are kept)
filtered_word = [w for w in word_list if w not in stop_words]
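
The NLTK stop-word list is all lowercase, so a case-sensitive membership test keeps capitalised tokens such as `'A'` and `'As'`, both of which appear in the chunked output of this notebook. The sketch below contrasts the two behaviours on a tiny hand-written stop-word set. `SAMPLE_STOP_WORDS` and `filter_tokens` are hypothetical names introduced for illustration only, and the set is not NLTK's actual list.

```python
# Hypothetical miniature stop-word set (assumption); the notebook itself
# uses stopwords.words('english') from NLTK.
SAMPLE_STOP_WORDS = {"a", "as", "is", "the", "of", "he", "such"}


def filter_tokens(tokens, stop_words, ignore_case=False):
    """Return the tokens that are not in stop_words.

    With ignore_case=True each token is lowercased before the lookup,
    so 'A' and 'As' are treated like 'a' and 'as'.
    """
    if ignore_case:
        return [t for t in tokens if t.lower() not in stop_words]
    return [t for t in tokens if t not in stop_words]


tokens = ["A", "member", "of", "the", "Republican", "Party", ".",
          "As", "such", "he", "is", "the", "second", "son"]

case_sensitive = filter_tokens(tokens, SAMPLE_STOP_WORDS)
case_insensitive = filter_tokens(tokens, SAMPLE_STOP_WORDS, ignore_case=True)

print(case_sensitive)    # keeps 'A' and 'As'
print(case_insensitive)  # drops them
```

Whether to lowercase before filtering is a design choice: doing so removes more function words, but it would also change the tagged and chunked output shown in the rest of this notebook.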

Fourth, with the text tokenized into a list of separate words, we can tag each remaining token with its part of speech.

In [8]:
# Part-of-speech tagging: labelling each token with its part of speech.
tagged = nltk.pos_tag(filtered_word)
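
`pos_tag` returns a list of `(word, tag)` tuples using Penn Treebank tags. As a reading aid, the sketch below maps the tags that show up in this notebook's output to short descriptions. `PENN_TAGS` and `describe` are hypothetical helpers, not part of NLTK, and the glossary covers only a small subset of the tagset.

```python
# Short glossary of the Penn Treebank tags that appear in this notebook's output.
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "JJ": "adjective",
    "JJS": "adjective, superlative",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
    "VBG": "verb, gerund or present participle",
    "CD": "cardinal number",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "RB": "adverb",
}


def describe(tagged_tokens):
    """Pair each (word, tag) tuple with a readable description of its tag."""
    return [(word, tag, PENN_TAGS.get(tag, "unlisted tag"))
            for word, tag in tagged_tokens]


# (word, tag) pairs copied by hand from the chunked output in this notebook.
sample = [("George", "NNP"), ("served", "VBD"), ("43rd", "CD"), ("States", "NNPS")]

for word, tag, meaning in describe(sample):
    print(f"{word:<8}{tag:<6}{meaning}")
```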

Fifth, with a part of speech assigned to each word, marking it as a member of a class of words, we can group the tagged tokens into chunks.

In [9]:
# Chunking: grouping tagged words into meaningful chunks.
# A raw string keeps the backslashes in \w from being read as string escapes.
chunker = RegexpParser(r"""
    NOUN PHRASES: {(<J\w+>|<V\w+>|<NN\w?>)+.*<NN\w?>}   # Extracts noun phrases (one or more adjectives, verbs, or nouns, followed by a closing noun tagged NN, NNS, or NNP)
    VERBS: {<V.*>}                                      # Extracts verbs (any single verb tag)
    """)
chunky = chunker.parse(tagged)
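
To see which tags each piece of the grammar accepts, the sketch below tests the patterns from inside the angle brackets against individual tags using Python's `re` module. It only approximates how the chunker applies a tag pattern to a single tag and does not reproduce the chunker itself. `TAG_PATTERNS` and `matching_patterns` are hypothetical names used for illustration.

```python
import re

# Patterns copied from inside the angle brackets of the chunk grammar.
TAG_PATTERNS = {
    "adjective": r"J\w+",
    "verb": r"V\w+",
    "noun": r"NN\w?",
}


def matching_patterns(tag):
    """Return the names of the patterns that match the whole tag."""
    return [name for name, pattern in TAG_PATTERNS.items()
            if re.fullmatch(pattern, tag)]


for tag in ["JJ", "JJS", "VBD", "NN", "NNS", "NNP", "NNPS", "CD"]:
    print(tag, matching_patterns(tag))
```

Because `\w?` allows at most one character after `NN`, the tag `NNPS` matches none of the patterns, which is consistent with `('States', 'NNPS')` being left outside the `president United` chunk in the output.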

In [10]:
# Printing the first 100 top-level nodes of the chunked tree
for i in range(0,100,1):
    print(chunky[i])

(NOUN PHRASES George/NNP Walker/NNP Bush/NNP)
('(', '(')
(NOUN PHRASES born/VBN July/NNP)
('6', 'CD')
(',', ',')
('1946', 'CD')
(')', ')')
(NOUN PHRASES American/JJ politician/NN businessman/NN)
(VERBS served/VBD)
('43rd', 'CD')
(NOUN PHRASES president/NN United/NNP)
('States', 'NNPS')
('2001', 'CD')
('2009', 'CD')
('.', '.')
('A', 'DT')
(NOUN PHRASES member/NN Republican/NNP Party/NNP)
(',', ',')
('previously', 'RB')
(VERBS served/VBD)
('46th', 'CD')
(NOUN PHRASES governor/NN Texas/NNP)
('1995', 'CD')
('2000', 'CD')
('.', '.')
(NOUN PHRASES Born/NNP Bush/NNP family/NN)
(',', ',')
('father', 'RB')
(',', ',')
(NOUN PHRASES George/NNP H./NNP W./NNP Bush/NNP)
(',', ',')
(VERBS served/VBD)
('41st', 'CD')
(NOUN PHRASES president/NN United/NNP)
('States', 'NNPS')
('1989', 'CD')
('1993', 'CD')
('.', '.')
(NOUN PHRASES
  Bush/NNP
  eldest/JJS
  son/NN
  Barbara/NNP
  George/NNP
  H./NNP
  W./NNP
  Bush/NNP)
('.', '.')
('As', 'IN')
(NOUN PHRASES second/JJ son/NN former/JJ United/NNP)
('States',

Sixth, having applied the chunking grammar, we can separate out the phrases we want.

In [11]:
# Extracting the noun-phrase subtrees from the chunked data
nounPhrases = list(chunky.subtrees(filter=lambda t: t.label() == 'NOUN PHRASES'))

# Number of Noun Phrases our chunking parameters have produced.
print("Number of Noun Phrases Parsed ", len(nounPhrases))

Number of Noun Phrases Parsed  1617
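
Each entry in `nounPhrases` is a subtree whose leaves are `(word, tag)` pairs. A possible next step, sketched below, is to join the words of each phrase into plain text and count repeats. The sample data is copied by hand from the output in this notebook so that the block runs without NLTK; in the notebook, calling `.leaves()` on each subtree should give pairs of the same shape. `phrase_text` is a hypothetical helper.

```python
from collections import Counter

# (word, tag) leaves for a few noun phrases, copied by hand from the output.
sample_phrases = [
    [("president", "NN"), ("United", "NNP")],
    [("governor", "NN"), ("Texas", "NNP")],
    [("president", "NN"), ("United", "NNP")],
    [("George", "NNP"), ("H.", "NNP"), ("W.", "NNP"), ("Bush", "NNP")],
]


def phrase_text(leaves):
    """Join the words of one phrase into a single string, dropping the tags."""
    return " ".join(word for word, _tag in leaves)


# Count how often each phrase text occurs.
phrase_counts = Counter(phrase_text(leaves) for leaves in sample_phrases)

for text, count in phrase_counts.most_common():
    print(count, text)
```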


In [12]:
# Printing the first 100 noun phrases found
for i in range(0,100,1):
    print(nounPhrases[i])

(NOUN PHRASES George/NNP Walker/NNP Bush/NNP)
(NOUN PHRASES born/VBN July/NNP)
(NOUN PHRASES American/JJ politician/NN businessman/NN)
(NOUN PHRASES president/NN United/NNP)
(NOUN PHRASES member/NN Republican/NNP Party/NNP)
(NOUN PHRASES governor/NN Texas/NNP)
(NOUN PHRASES Born/NNP Bush/NNP family/NN)
(NOUN PHRASES George/NNP H./NNP W./NNP Bush/NNP)
(NOUN PHRASES president/NN United/NNP)
(NOUN PHRASES
  Bush/NNP
  eldest/JJS
  son/NN
  Barbara/NNP
  George/NNP
  H./NNP
  W./NNP
  Bush/NNP)
(NOUN PHRASES second/JJ son/NN former/JJ United/NNP)
(NOUN PHRASES president/NN become/VBN American/NNP president/NN)
(NOUN PHRASES first/JJ John/NNP Quincy/NNP Adams/NNP)
(NOUN PHRASES son/NN John/NNP Adams/NNP)
(NOUN PHRASES
  flew/VBD
  warplanes/NNS
  Texas/NNP
  Alabama/NNP
  Air/NNP
  National/NNP
  Guard/NNP)
(NOUN PHRASES graduating/VBG Yale/NNP College/NNP)
(NOUN PHRASES Harvard/NNP Business/NNP School/NNP)
(NOUN PHRASES worked/VBD oil/NN industry/NN)
(NOUN PHRASES Bush/NNP married/VBD Laur