### Let's talk about preprocessing text

#### Reminder: The goal is to transform text into numeric information that can be used for classification or prediction.

#### Recall:
* Vectorizers create feature spaces which have column representations for unique tokens and row representations for documents in a corpus (or collection)
* There are several ways to affect your feature space by pre-processing the text.
* Some options are available as part of the vectorizer method that creates the feature space, others need to be done before calling the vectorizer.

### Things that need to be done PRIOR to creating the feature space.
* Anything that will SUBSTITUTE one token for another:  dictionary replacement
* Anything that affects the options for the vectorizer: custom stopword lists
* You may choose to do stemming and lemmatization first
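
As an illustration of dictionary replacement, here is a minimal sketch using only the standard library. The replacement dictionary, the `substitute` helper, and the sample sentence are hypothetical and stand in for whatever substitutions matter in your corpus.

```python
import re

# Hypothetical replacement dictionary: each key is rewritten to its value
# before the text ever reaches the vectorizer.
replacements = {
    "hitch hiker's guide": "hhgttg",
    "earthman": "human",
}

# One alternation pattern, longest keys first so multi-word phrases win,
# wrapped in word boundaries so partial words are left alone.
alternatives = "|".join(
    re.escape(key) for key in sorted(replacements, key=len, reverse=True)
)
pattern = re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)


def substitute(doc):
    """Replace every dictionary key found in doc, ignoring case."""
    return pattern.sub(lambda m: replacements[m.group(0).lower()], doc)


sample = "The Hitch Hiker's Guide was never seen by any Earthman."
replaced = substitute(sample)
print(replaced)
```

Running the substitution on every string in the corpus first means the vectorizer only ever sees the replacement tokens.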

### Things that are options as part of the vector space creation:
* Removal of stopwords 
* Change to lower case
* You can embed a stemmer in your vectorizer call
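
One way to embed a stemmer in the vectorizer call is to wrap the vectorizer's own analyzer, sketched below. The two short documents and the names `base_analyzer`, `stemmed_analyzer`, and `stem_cv` are made up for illustration. This is one possible approach, not the only one.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

ps = PorterStemmer()

# Reuse the default preprocessing (lowercasing, tokenizing, stop word removal)
# and stem whatever tokens survive it.
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()


def stemmed_analyzer(doc):
    """Tokenize with the default analyzer, then stem each token."""
    return [ps.stem(token) for token in base_analyzer(doc)]


# Hypothetical two-document corpus
docs = [
    "never published on Earth",
    "the great publishing houses of Ursa Minor",
]

stem_cv = CountVectorizer(analyzer=stemmed_analyzer)
stem_dm = stem_cv.fit_transform(docs)
print(stem_dm.shape)
print(sorted(stem_cv.vocabulary_))
```

Because a callable `analyzer` replaces the whole preprocessing pipeline, the stop word removal has to live inside `base_analyzer` rather than on `stem_cv`.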


In [1]:
text = ["Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun.",
"Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.",
"This planet has - or rather had - a problem, which was this: most of the people on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, which is odd because on the whole it wasn't the small green pieces of paper that were unhappy.",
"And so the problem remained; lots of the people were mean, and most of them were miserable, even the ones with digital watches.",
"Many were increasingly of the opinion that they'd all made a big mistake in coming down from the trees in the first place. And some said that even the trees had been a bad move, and that no one should ever have left the oceans.",
"And then, one Thursday, nearly two thousand years after one man had been nailed to a tree for saying how great it would be to be nice to people for a change, one girl sitting on her own in a small cafe in Rickmansworth suddenly realized what it was that had been going wrong all this time, and she finally knew how the world could be made a good and happy place. This time it was right, it would work, and no one would have to get nailed to anything.",
"Sadly, however, before she could get to a phone to tell anyone about it, a terribly stupid catastrophe occurred, and the idea was lost forever.",
"This is not her story.",
"But it is the story of that terrible stupid catastrophe and some of its consequences.",
"It is also the story of a book, a book called The Hitch Hiker's Guide to the Galaxy - not an Earth book, never published on Earth, and until the terrible catastrophe occurred, never seen or heard of by any Earthman.",
"Nevertheless, a wholly remarkable book.",
"In fact it was probably the most remarkable book ever to come out of the great publishing houses of Ursa Minor - of which no Earthman had ever heard either.",
"Not only is it a wholly remarkable book, it is also a highly successful one - more popular than the Celestial Home Care Omnibus, better selling than Fifty More Things to do in Zero Gravity, and more controversial than Oolon Colluphid's trilogy of philosophical blockbusters Where God Went Wrong, Some More of God's Greatest Mistakes and Who is this God Person Anyway?",
"In many of the more relaxed civilizations on the Outer Eastern Rim of the Galaxy, the Hitch Hiker's Guide has already supplanted the great Encyclopedia Galactica as the standard repository of all knowledge and wisdom, for though it has many omissions and contains much that is apocryphal, or at least wildly inaccurate, it scores over the older, more pedestrian work in two important respects.",
"First, it is slightly cheaper; and secondly it has the words Don't Panic inscribed in large friendly letters on its cover.",
"But the story of this terrible, stupid Thursday, the story of its extraordinary consequences, and the story of how these consequences are inextricably intertwined with this remarkable book begins very simply.",
"It begins with a house. "]
print(type(text), len(text))
print(type(text[9]))
print(text[9:10])

<class 'list'> 17
<class 'str'>
["It is also the story of a book, a book called The Hitch Hiker's Guide to the Galaxy - not an Earth book, never published on Earth, and until the terrible catastrophe occurred, never seen or heard of by any Earthman."]


### Creating a list of tokens is one way of processing text; you can also leave it as strings.
### The difference is that lists have individual items already defined to process individually; strings need to be broken into those elements.
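
A small sketch of that difference, using one sentence from the corpus and plain `str.split()` as a stand-in tokenizer (an assumption; `word_tokenize` splits punctuation differently):

```python
sentence = "Nevertheless, a wholly remarkable book."

# A string is a sequence of characters, so indexing and len() work per character.
print(len(sentence), sentence[0])

# Splitting it first gives a list whose items are the tokens themselves.
tokens = sentence.split()
print(len(tokens), tokens[0])

# Word-level processing is then a simple loop over the list.
upper_tokens = [token.upper() for token in tokens]
print(upper_tokens)
```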

In [2]:
from nltk import word_tokenize
tokens = word_tokenize(text[9]) # creates a list of tokens from the string at index 9 of the list "text"
print(type(tokens), len(tokens))
print(tokens[0:10])



<class 'list'> 47
['It', 'is', 'also', 'the', 'story', 'of', 'a', 'book', ',', 'a']


In [3]:
# preliminary investigation of feature space - you can (should?) see what your data look like before 
# deciding on what preprocessing you want to do
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

prelim = CountVectorizer(binary=False, lowercase = False)
text_dm = prelim.fit_transform(text)
print(text_dm.shape)
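# note: get_feature_names() was removed in scikit-learn 1.2; on newer versions call get_feature_names_out() instead
print(prelim.get_feature_names())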


(17, 287)
['And', 'Anyway', 'But', 'Care', 'Celestial', 'Colluphid', 'Don', 'Earth', 'Earthman', 'Eastern', 'Encyclopedia', 'Far', 'Fifty', 'First', 'Galactica', 'Galaxy', 'God', 'Gravity', 'Greatest', 'Guide', 'Hiker', 'Hitch', 'Home', 'In', 'It', 'Many', 'Minor', 'Mistakes', 'More', 'Nevertheless', 'Not', 'Omnibus', 'Oolon', 'Orbiting', 'Outer', 'Panic', 'Person', 'Rickmansworth', 'Rim', 'Sadly', 'Some', 'The', 'Things', 'This', 'Thursday', 'Ursa', 'Went', 'Where', 'Who', 'Wrong', 'Zero', 'about', 'after', 'all', 'already', 'also', 'amazingly', 'an', 'and', 'any', 'anyone', 'anything', 'ape', 'apocryphal', 'are', 'arm', 'as', 'at', 'backwaters', 'bad', 'be', 'because', 'been', 'before', 'begins', 'better', 'big', 'blockbusters', 'blue', 'book', 'but', 'by', 'cafe', 'called', 'catastrophe', 'change', 'cheaper', 'civilizations', 'come', 'coming', 'concerned', 'consequences', 'contains', 'controversial', 'could', 'cover', 'descended', 'digital', 'distance', 'do', 'down', 'either', 'end'
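
Beyond listing the feature names, it can help to see how often each token occurs. The sketch below sums the columns of a document-term matrix. The three-sentence corpus is a small excerpt, and the names `token_totals` and `ranked` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Small excerpt of the corpus, used here only to keep the example short
docs = [
    "This is not her story.",
    "But it is the story of that terrible stupid catastrophe.",
    "It is also the story of a book.",
]

cv = CountVectorizer(lowercase=False)
dm = cv.fit_transform(docs)

# Column sums of the sparse matrix give the total count of each token.
totals = np.asarray(dm.sum(axis=0)).ravel()
token_totals = {token: int(totals[idx]) for token, idx in cv.vocabulary_.items()}

# Most frequent tokens first
ranked = sorted(token_totals.items(), key=lambda pair: pair[1], reverse=True)
for token, count in ranked[:5]:
    print(token, count)
```

With `lowercase=False`, "It" and "it" are counted as separate features, which is the kind of thing this preliminary look is meant to surface.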

In [4]:
# Let's say for our purposes, it's important to count published and publishing as the same feature.
# So now we do some preprocessing

# preprocessing - stemmers
# Porter stemmer

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

# our corpus is a list of strings, so we need to stem each word in each string
# if this were a dataframe containing rows with strings, we'd need to stem each word in each string in each row
# stemmed text = [[stem(word) for each word in the split string] for each string in the list]
ps_text = [[ps.stem(word) for word in sentence.split()] for sentence in text]
print(type(ps_text), len(ps_text))
print(ps_text[9])



<class 'list'> 17
['It', 'is', 'also', 'the', 'stori', 'of', 'a', 'book,', 'a', 'book', 'call', 'the', 'hitch', "hiker'", 'guid', 'to', 'the', 'galaxi', '-', 'not', 'an', 'earth', 'book,', 'never', 'publish', 'on', 'earth,', 'and', 'until', 'the', 'terribl', 'catastroph', 'occurred,', 'never', 'seen', 'or', 'heard', 'of', 'by', 'ani', 'earthman.']


In [5]:
# we can process the list of lists or join the pieces back into a string
ps_text = [" ".join([ps.stem(word) for word in sentence.split()]) for sentence in text]
print(type(ps_text), len(ps_text))
print(ps_text[9])

<class 'list'> 17
It is also the stori of a book, a book call the hitch hiker' guid to the galaxi - not an earth book, never publish on earth, and until the terribl catastroph occurred, never seen or heard of by ani earthman.


In [6]:
# did this change our feature space the way we had hoped?
prelim = CountVectorizer(binary=False, lowercase = False)
ptext_dm = prelim.fit_transform(ps_text)
print(ptext_dm.shape)
print(prelim.get_feature_names())


(17, 276)
['In', 'It', 'about', 'after', 'all', 'alreadi', 'also', 'amazingli', 'an', 'and', 'ani', 'anyon', 'anything', 'anyway', 'ape', 'apocryphal', 'are', 'arm', 'as', 'at', 'backwat', 'bad', 'be', 'becaus', 'been', 'befor', 'begin', 'better', 'big', 'blockbust', 'blue', 'book', 'but', 'by', 'cafe', 'call', 'care', 'catastroph', 'celesti', 'change', 'cheaper', 'civil', 'colluphid', 'come', 'concern', 'consequ', 'consequences', 'contain', 'controversi', 'could', 'cover', 'descend', 'digit', 'distanc', 'do', 'don', 'down', 'earth', 'earthman', 'eastern', 'either', 'encyclopedia', 'end', 'even', 'ever', 'extraordinari', 'fact', 'far', 'fifti', 'final', 'first', 'for', 'forever', 'form', 'friendli', 'from', 'galactica', 'galaxi', 'galaxy', 'get', 'girl', 'go', 'god', 'good', 'gravity', 'great', 'greatest', 'green', 'guid', 'ha', 'had', 'happi', 'have', 'heard', 'her', 'highli', 'hiker', 'hitch', 'home', 'hous', 'house', 'how', 'however', 'idea', 'import', 'in', 'inaccurate', 'increasin
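
The feature list above still contains pairs such as 'consequ'/'consequences', 'galaxi'/'galaxy', and 'hous'/'house'. A likely cause is that `sentence.split()` leaves punctuation attached to words, so the stemmer sees 'occurred,' rather than 'occurred'. The sketch below compares `split()` with a simple regex tokenizer. The `\w+` pattern is an assumption and is cruder than `word_tokenize`.

```python
import re
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
sentence = "until the terrible catastrophe occurred, never seen by any Earthman."

# split() keeps the trailing comma and period on the words they follow
split_stems = [ps.stem(word) for word in sentence.split()]

# \w+ pulls out runs of word characters only, so punctuation is dropped
word_stems = [ps.stem(word) for word in re.findall(r"\w+", sentence)]

print(split_stems)
print(word_stems)
```

Stripping punctuation before stemming should merge those duplicate features, at the cost of also splitting contractions and possessives at the apostrophe.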

In [7]:
# now let's actually create the feature space of interest
CV = CountVectorizer(binary=False, stop_words = "english")
cvtext_dm = CV.fit_transform(ps_text)
print(cvtext_dm.shape)
print(CV.get_feature_names())


(17, 195)
['alreadi', 'amazingli', 'ani', 'anyon', 'ape', 'apocryphal', 'arm', 'backwat', 'bad', 'becaus', 'befor', 'begin', 'better', 'big', 'blockbust', 'blue', 'book', 'cafe', 'care', 'catastroph', 'celesti', 'change', 'cheaper', 'civil', 'colluphid', 'come', 'concern', 'consequ', 'consequences', 'contain', 'controversi', 'cover', 'descend', 'digit', 'distanc', 'don', 'earth', 'earthman', 'eastern', 'encyclopedia', 'end', 'extraordinari', 'fact', 'far', 'fifti', 'final', 'forever', 'form', 'friendli', 'galactica', 'galaxi', 'galaxy', 'girl', 'god', 'good', 'gravity', 'great', 'greatest', 'green', 'guid', 'ha', 'happi', 'heard', 'highli', 'hiker', 'hitch', 'home', 'hous', 'house', 'idea', 'import', 'inaccurate', 'increasingli', 'inextric', 'inscrib', 'insignific', 'intertwin', 'knew', 'knowledg', 'larg', 'left', 'letter', 'lie', 'life', 'littl', 'lost', 'lot', 'man', 'mani', 'mean', 'mile', 'million', 'minor', 'miserable', 'mistak', 'movement', 'nail', 'nearli', 'neat', 'nice', 'nine
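
Notice that stems such as 'alreadi', 'ani', 'becaus', and 'befor' survive in the list above: the built-in stop list contains 'already', 'any', 'because', and 'before', but not their stemmed forms. A custom stop word list, passed through the vectorizer's options, can cover both. The sketch below uses two made-up stemmed documents. The name `stemmed_stop_words` is hypothetical.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

ps = PorterStemmer()

# Custom list: the built-in English stop words plus their Porter stems
stemmed_stop_words = sorted(
    set(ENGLISH_STOP_WORDS) | {ps.stem(word) for word in ENGLISH_STOP_WORDS}
)

# Hypothetical documents that have already been stemmed
stemmed_docs = [
    "never seen or heard of by ani earthman",
    "befor she could get to a phone",
]

plain_cv = CountVectorizer(stop_words="english").fit(stemmed_docs)
custom_cv = CountVectorizer(stop_words=stemmed_stop_words).fit(stemmed_docs)

print(sorted(plain_cv.vocabulary_))
print(sorted(custom_cv.vocabulary_))
```

Because the list has to exist before the vectorizer is created, this is one of the steps that belongs before the feature space is built.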

In [8]:
# because we were looking at verb tenses, we could have lemmatized instead
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

# our corpus is a list of strings, so we need to lemmatize each word in each string
# if this were a dataframe containing rows with strings, we'd need to process each word in each string in each row
# lemmatized text = [" ".join([lemmatize(word) for each word in the split string]) for each string in the list]
lem_text = [" ".join([wnl.lemmatize(word) for word in sentence.split()]) for sentence in text]
print(type(lem_text), len(lem_text))
print(lem_text[9])



<class 'list'> 17
It is also the story of a book, a book called The Hitch Hiker's Guide to the Galaxy - not an Earth book, never published on Earth, and until the terrible catastrophe occurred, never seen or heard of by any Earthman.


In [9]:
# did this change our feature space the way we had hoped?
prelim = CountVectorizer(binary=False, lowercase = False)
ltext_dm = prelim.fit_transform(lem_text)
print(ltext_dm.shape)
print(prelim.get_feature_names())


(17, 284)
['And', 'Anyway', 'But', 'Care', 'Celestial', 'Colluphid', 'Don', 'Earth', 'Earthman', 'Eastern', 'Encyclopedia', 'Far', 'Fifty', 'First', 'Galactica', 'Galaxy', 'God', 'Gravity', 'Greatest', 'Guide', 'Hiker', 'Hitch', 'Home', 'In', 'It', 'Many', 'Minor', 'Mistakes', 'More', 'Nevertheless', 'Not', 'Omnibus', 'Oolon', 'Orbiting', 'Outer', 'Panic', 'Person', 'Rickmansworth', 'Rim', 'Sadly', 'Some', 'The', 'Things', 'This', 'Thursday', 'Ursa', 'Went', 'Where', 'Who', 'Wrong', 'Zero', 'about', 'after', 'all', 'already', 'also', 'amazingly', 'an', 'and', 'any', 'anyone', 'anything', 'ape', 'apocryphal', 'are', 'arm', 'at', 'backwater', 'bad', 'be', 'because', 'been', 'before', 'begin', 'better', 'big', 'blockbuster', 'blue', 'book', 'but', 'by', 'cafe', 'called', 'catastrophe', 'change', 'cheaper', 'civilization', 'come', 'coming', 'concerned', 'consequence', 'consequences', 'contains', 'controversial', 'could', 'cover', 'descended', 'digital', 'distance', 'do', 'down', 'either', 

In [10]:
# no, because by default the lemmatizer treats every word as a noun
# if we only care about that one verb, set pos to "v" so every word is treated as a verb

lem_text = [" ".join([wnl.lemmatize(word, pos = "v") for word in sentence.split()]) for sentence in text]
print(type(lem_text), len(lem_text))
print(lem_text[9])


<class 'list'> 17
It be also the story of a book, a book call The Hitch Hiker's Guide to the Galaxy - not an Earth book, never publish on Earth, and until the terrible catastrophe occurred, never see or hear of by any Earthman.


In [11]:
# did this change our feature space the way we had hoped?
prelim = CountVectorizer(binary=False, lowercase = False)
ltext_dm = prelim.fit_transform(lem_text)
print (ltext_dm.shape)
print(prelim.get_feature_names())


(17, 276)
['And', 'Anyway', 'But', 'Care', 'Celestial', 'Colluphid', 'Don', 'Earth', 'Earthman', 'Eastern', 'Encyclopedia', 'Far', 'Fifty', 'First', 'Galactica', 'Galaxy', 'God', 'Gravity', 'Greatest', 'Guide', 'Hiker', 'Hitch', 'Home', 'In', 'It', 'Many', 'Minor', 'Mistakes', 'More', 'Nevertheless', 'Not', 'Omnibus', 'Oolon', 'Orbiting', 'Outer', 'Panic', 'Person', 'Rickmansworth', 'Rim', 'Sadly', 'Some', 'The', 'Things', 'This', 'Thursday', 'Ursa', 'Went', 'Where', 'Who', 'Wrong', 'Zero', 'about', 'after', 'all', 'already', 'also', 'amazingly', 'an', 'and', 'any', 'anyone', 'anything', 'ape', 'apocryphal', 'arm', 'as', 'at', 'backwaters', 'bad', 'be', 'because', 'before', 'begin', 'better', 'big', 'blockbusters', 'blue', 'book', 'but', 'by', 'cafe', 'call', 'catastrophe', 'change', 'cheaper', 'civilizations', 'come', 'concern', 'consequences', 'contain', 'controversial', 'could', 'cover', 'descended', 'digital', 'distance', 'do', 'down', 'either', 'end', 'even', 'ever', 'extraordinar

In [12]:
# now let's actually create the feature space of interest
CV = CountVectorizer(binary=False, stop_words = "english")
cvtext_dm = CV.fit_transform(lem_text)
print(cvtext_dm.shape)
print(CV.get_feature_names())


(17, 180)
['amazingly', 'ape', 'apocryphal', 'arm', 'backwaters', 'bad', 'begin', 'better', 'big', 'blockbusters', 'blue', 'book', 'cafe', 'care', 'catastrophe', 'celestial', 'change', 'cheaper', 'civilizations', 'colluphid', 'come', 'concern', 'consequences', 'contain', 'controversial', 'cover', 'descended', 'digital', 'distance', 'don', 'earth', 'earthman', 'eastern', 'encyclopedia', 'end', 'extraordinary', 'fact', 'far', 'finally', 'forever', 'form', 'friendly', 'galactica', 'galaxy', 'girl', 'god', 'good', 'gravity', 'great', 'greatest', 'green', 'guide', 'happy', 'hear', 'highly', 'hiker', 'hitch', 'home', 'house', 'idea', 'important', 'inaccurate', 'increasingly', 'inextricably', 'inscribe', 'insignificant', 'intertwine', 'know', 'knowledge', 'large', 'largely', 'leave', 'letter', 'lie', 'life', 'little', 'lose', 'lot', 'make', 'man', 'mean', 'miles', 'million', 'minor', 'miserable', 'mistake', 'mistakes', 'movements', 'nail', 'nearly', 'neat', 'nice', 'ninety', 'occurred', 'ocea

In [13]:
#how to use POS tagging to improve lemmatization
import nltk

onestring = word_tokenize(text[9])
print(type(onestring), onestring[0], type(onestring[0]))
taggedlist = nltk.pos_tag(onestring)
print(type(taggedlist), taggedlist[0], type(taggedlist[0]))
print(taggedlist[0][0])
print(taggedlist[0][1])

# in the Penn Treebank tags returned by pos_tag, nouns begin with N, verbs with V, adverbs with R, and adjectives with J

<class 'list'> It <class 'str'>
<class 'list'> ('It', 'PRP') <class 'tuple'>
It
PRP


In [14]:
# in the Penn Treebank tags returned by pos_tag, nouns begin with N, verbs with V, adverbs with R, and adjectives with J
for item in taggedlist:
    print(item, item[1][0])


('It', 'PRP') P
('is', 'VBZ') V
('also', 'RB') R
('the', 'DT') D
('story', 'NN') N
('of', 'IN') I
('a', 'DT') D
('book', 'NN') N
(',', ',') ,
('a', 'DT') D
('book', 'NN') N
('called', 'VBN') V
('The', 'DT') D
('Hitch', 'NNP') N
('Hiker', 'NNP') N
("'s", 'POS') P
('Guide', 'NNP') N
('to', 'TO') T
('the', 'DT') D
('Galaxy', 'NNP') N
('-', ':') :
('not', 'RB') R
('an', 'DT') D
('Earth', 'NN') N
('book', 'NN') N
(',', ',') ,
('never', 'RB') R
('published', 'VBN') V
('on', 'IN') I
('Earth', 'NNP') N
(',', ',') ,
('and', 'CC') C
('until', 'IN') I
('the', 'DT') D
('terrible', 'JJ') J
('catastrophe', 'NN') N
('occurred', 'VBD') V
(',', ',') ,
('never', 'RB') R
('seen', 'VBN') V
('or', 'CC') C
('heard', 'VBN') V
('of', 'IN') I
('by', 'IN') I
('any', 'DT') D
('Earthman', 'NNP') N
('.', '.') .


In [15]:
# way better code for this can be found here: http://www.ling.helsinki.fi/~gwilcock/Tartu-2011/P2-nltk-2.xhtml

newlem = []
wordtype = set(['R','V','N'])
for item in taggedlist:
    if item[1][0] in wordtype:
        postag = item[1][0].lower()
    elif item[1][0] == 'J':
        postag = 'a'
    else:
        postag = "n"

    lemmed = wnl.lemmatize(item[0], pos = postag)
    newlem.append(lemmed)
    
print(" ".join(newlem))


It be also the story of a book , a book call The Hitch Hiker 's Guide to the Galaxy - not an Earth book , never publish on Earth , and until the terrible catastrophe occur , never see or hear of by any Earthman .
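
The tag-to-POS logic in the cell above can also be written as a small helper, sketched here in plain Python. The function name `penn_to_wordnet` and the sample tagged pairs are hypothetical. One deliberate difference from the cell above: the helper only treats tags starting with 'RB' as adverbs, so a particle tag such as 'RP' falls through to the noun default instead of being passed as an adverb.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS (but not the particle tag RP)
    return "n"      # nouns and everything else


# Hand-copied (word, tag) pairs in the shape that nltk.pos_tag returns
tagged = [
    ("called", "VBN"),
    ("terrible", "JJ"),
    ("also", "RB"),
    ("down", "RP"),
    ("story", "NN"),
    ("It", "PRP"),
]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Each letter can then be passed as the `pos` argument to `wnl.lemmatize`, as in the loop above.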


In [16]:
# put it all together
newtext = []
wordtype = set(['R','V','N'])

for string in text:
    newlem = []

    taggedlist = nltk.pos_tag(word_tokenize(string))
    for item in taggedlist:
        if item[1][0] in wordtype:
            postag = item[1][0].lower()
        elif item[1][0] == 'J':
            postag = 'a'
        else:
            postag = "n"

        lemmed = wnl.lemmatize(item[0], pos = postag)
        newlem.append(lemmed)
    
    newstring = " ".join(newlem)
    newtext.append(newstring)
    
print(newtext[9])       


It be also the story of a book , a book call The Hitch Hiker 's Guide to the Galaxy - not an Earth book , never publish on Earth , and until the terrible catastrophe occur , never see or hear of by any Earthman .


In [17]:
# now let's actually create the feature space of interest
CV = CountVectorizer(binary=False, stop_words = "english")
cvtext_dm = CV.fit_transform(newtext)
print(cvtext_dm.shape)
print(CV.get_feature_names())


(17, 176)
['amazingly', 'ape', 'apocryphal', 'arm', 'backwater', 'bad', 'begin', 'big', 'blockbuster', 'blue', 'book', 'cafe', 'care', 'catastrophe', 'celestial', 'change', 'cheap', 'civilization', 'colluphid', 'come', 'concern', 'consequence', 'contains', 'controversial', 'cover', 'descended', 'digital', 'distance', 'earth', 'earthman', 'eastern', 'encyclopedia', 'end', 'extraordinary', 'fact', 'far', 'finally', 'forever', 'form', 'friendly', 'galactica', 'galaxy', 'girl', 'god', 'good', 'gravity', 'great', 'greatest', 'green', 'guide', 'happy', 'hear', 'highly', 'hiker', 'hitch', 'home', 'house', 'idea', 'important', 'inaccurate', 'increasingly', 'inextricably', 'inscribe', 'insignificant', 'intertwine', 'know', 'knowledge', 'large', 'largely', 'leave', 'letter', 'lie', 'life', 'little', 'lose', 'lot', 'make', 'man', 'mean', 'mile', 'million', 'minor', 'miserable', 'mistake', 'mistakes', 'movement', 'nail', 'nearly', 'neat', 'nice', 'ninety', 'occur', 'ocean', 'odd', 'old', 'omission

In [18]:
# WHY IS MISTAKE/MISTAKES still there?!?
# Investigate!

print(text[4])
print("~~~~~~~~~~~~~~")
print(text[12])

Many were increasingly of the opinion that they'd all made a big mistake in coming down from the trees in the first place. And some said that even the trees had been a bad move, and that no one should ever have left the oceans.
~~~~~~~~~~~~~~
Not only is it a wholly remarkable book, it is also a highly successful one - more popular than the Celestial Home Care Omnibus, better selling than Fifty More Things to do in Zero Gravity, and more controversial than Oolon Colluphid's trilogy of philosophical blockbusters Where God Went Wrong, Some More of God's Greatest Mistakes and Who is this God Person Anyway?


In [19]:
nltk.pos_tag(word_tokenize(text[4]))

[('Many', 'JJ'),
 ('were', 'VBD'),
 ('increasingly', 'RB'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('opinion', 'NN'),
 ('that', 'IN'),
 ('they', 'PRP'),
 ("'d", 'MD'),
 ('all', 'DT'),
 ('made', 'VBD'),
 ('a', 'DT'),
 ('big', 'JJ'),
 ('mistake', 'NN'),
 ('in', 'IN'),
 ('coming', 'VBG'),
 ('down', 'RP'),
 ('from', 'IN'),
 ('the', 'DT'),
 ('trees', 'NNS'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('first', 'JJ'),
 ('place', 'NN'),
 ('.', '.'),
 ('And', 'CC'),
 ('some', 'DT'),
 ('said', 'VBD'),
 ('that', 'IN'),
 ('even', 'RB'),
 ('the', 'DT'),
 ('trees', 'NNS'),
 ('had', 'VBD'),
 ('been', 'VBN'),
 ('a', 'DT'),
 ('bad', 'JJ'),
 ('move', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('that', 'IN'),
 ('no', 'DT'),
 ('one', 'NN'),
 ('should', 'MD'),
 ('ever', 'RB'),
 ('have', 'VB'),
 ('left', 'VBN'),
 ('the', 'DT'),
 ('oceans', 'NNS'),
 ('.', '.')]

In [20]:
nltk.pos_tag(word_tokenize(text[12]))

[('Not', 'RB'),
 ('only', 'RB'),
 ('is', 'VBZ'),
 ('it', 'PRP'),
 ('a', 'DT'),
 ('wholly', 'RB'),
 ('remarkable', 'JJ'),
 ('book', 'NN'),
 (',', ','),
 ('it', 'PRP'),
 ('is', 'VBZ'),
 ('also', 'RB'),
 ('a', 'DT'),
 ('highly', 'RB'),
 ('successful', 'JJ'),
 ('one', 'CD'),
 ('-', ':'),
 ('more', 'JJR'),
 ('popular', 'JJ'),
 ('than', 'IN'),
 ('the', 'DT'),
 ('Celestial', 'NNP'),
 ('Home', 'NNP'),
 ('Care', 'NNP'),
 ('Omnibus', 'NNP'),
 (',', ','),
 ('better', 'RBR'),
 ('selling', 'VBG'),
 ('than', 'IN'),
 ('Fifty', 'NNP'),
 ('More', 'NNP'),
 ('Things', 'NNS'),
 ('to', 'TO'),
 ('do', 'VB'),
 ('in', 'IN'),
 ('Zero', 'NNP'),
 ('Gravity', 'NNP'),
 (',', ','),
 ('and', 'CC'),
 ('more', 'RBR'),
 ('controversial', 'JJ'),
 ('than', 'IN'),
 ('Oolon', 'NNP'),
 ('Colluphid', 'NNP'),
 ("'s", 'POS'),
 ('trilogy', 'NN'),
 ('of', 'IN'),
 ('philosophical', 'JJ'),
 ('blockbusters', 'NNS'),
 ('Where', 'WRB'),
 ('God', 'NNP'),
 ('Went', 'NNP'),
 ('Wrong', 'NNP'),
 (',', ','),
 ('Some', 'DT'),
 ('More', 'JJR
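
The two sentences differ in case: 'mistake' is lowercase, while 'Mistakes' is capitalized as part of a title. A plausible reading is that the capitalized word passes through the lemmatizer unchanged and is only lowercased later by `CountVectorizer`, which leaves two separate features. The sketch below shows just that second half with scikit-learn. The two strings are hand-written stand-ins for lemmatizer output, not real output.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lemmatized strings: the capitalized plural was left untouched
lemmatized_docs = [
    "they 'd all make a big mistake",
    "Some More of God 's Greatest Mistakes",
]

# CountVectorizer lowercases by default, but only at vectorizing time
late_cv = CountVectorizer(stop_words="english").fit(lemmatized_docs)
print(sorted(late_cv.vocabulary_))

# Lowercasing the same strings changes nothing at this stage: it is too late
early_docs = [doc.lower() for doc in lemmatized_docs]
early_cv = CountVectorizer(stop_words="english").fit(early_docs)
print(sorted(early_cv.vocabulary_))
```

Both vocabularies contain 'mistake' and 'mistakes', which suggests the lowercasing has to happen before tagging and lemmatizing rather than inside the vectorizer.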

In [21]:
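# now REALLY put it all together: lowercase each string before tagging,
# so capitalized words such as "Mistakes" are tagged and lemmatized like "mistake"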
newtext = []
wordtype = set(['R','V','N'])

for string in text:
    newlem = []

    taggedlist = nltk.pos_tag(word_tokenize(string.lower()))
    for item in taggedlist:
        if item[1][0] in wordtype:
            postag = item[1][0].lower()
        elif item[1][0] == 'J':
            postag = 'a'
        else:
            postag = "n"

        lemmed = wnl.lemmatize(item[0], pos = postag)
        newlem.append(lemmed)
    
    newstring = " ".join(newlem)
    newtext.append(newstring)
    
print(newtext[9])     


it be also the story of a book , a book call the hitch hiker 's guide to the galaxy - not an earth book , never publish on earth , and until the terrible catastrophe occur , never see or hear of by any earthman .


In [22]:
# now let's actually create the feature space of interest
CV = CountVectorizer(binary=False, stop_words = "english")
cvtext_dm = CV.fit_transform(newtext)
print(cvtext_dm.shape)
print(CV.get_feature_names())


(17, 173)
['amazingly', 'ape', 'apocryphal', 'arm', 'backwater', 'bad', 'begin', 'big', 'blockbuster', 'blue', 'book', 'cafe', 'care', 'catastrophe', 'celestial', 'change', 'cheap', 'civilization', 'colluphid', 'come', 'concern', 'consequence', 'contains', 'controversial', 'cover', 'descended', 'digital', 'distance', 'earth', 'earthman', 'eastern', 'encyclopedia', 'end', 'extraordinary', 'fact', 'far', 'finally', 'forever', 'form', 'friendly', 'galactica', 'galaxy', 'girl', 'god', 'good', 'gravity', 'great', 'green', 'guide', 'happy', 'hear', 'highly', 'hiker', 'hitch', 'home', 'house', 'idea', 'important', 'inaccurate', 'increasingly', 'inextricably', 'inscribe', 'insignificant', 'intertwine', 'know', 'knowledge', 'large', 'largely', 'leave', 'letter', 'lie', 'life', 'little', 'lose', 'lot', 'make', 'man', 'mean', 'mile', 'million', 'minor', 'miserable', 'mistake', 'movement', 'nail', 'nearly', 'neat', 'nice', 'ninety', 'occur', 'ocean', 'odd', 'old', 'omission', 'omnibus', 'oolon', '

#### Notice that the lemmatizer doesn't work well with adverbs: 'amazingly', 'finally', and 'highly' are still features above. A lot has been written on this and on how WordNet treats adverbs. It involves "pertainyms" and other NLP details that we don't need to delve into. If adverb lemmatization is important for your question, let me know and I'll share some ugly code that might help.


#### If you are interested in more detail on POS, see this: http://www.inf.ed.ac.uk/teaching/courses/inf2a/slides/2009_inf2a_L12_slides.pdf

