## Session 3 - Changing the feature space: stopwords, stemming, POS, n-grams.

### More notes on tweets versus news:
* News outlets have APIs as well.  Here's a list: https://newsapi.org/
* Not all Twitter feeds are real people: https://krebsonsecurity.com/2017/08/twitter-bots-use-likes-rts-for-intimidation/

### Full disclosure: much of what we did last week using scikit-learn can be done in other ways.

Another very heavily used package:  NLTK (http://www.nltk.org/)

We will use bits of it. If you haven't installed it yet, please do so now.


In [27]:
# import module(s) into namespace
import pandas as pd
import numpy as np
import requests
pd.set_option('display.max_colwidth', 15000) #important for getting all the text


### What's the optimal feature space?

There's no such thing!  There is no absolute optimum: it depends on the context of your problem/application and on the task at hand.  You will continue to hear "It depends"!

## Let's see what else we need to consider when creating our vector representation

In case you want more examples of interesting text:
http://avalon.law.yale.edu/

In [28]:
from bs4 import BeautifulSoup
import requests

page = requests.get('http://avalon.law.yale.edu/20th_century/mlk01.asp')

soup = BeautifulSoup(page.text, 'lxml')
#soup = BeautifulSoup(page.text, 'html5lib') # one of these should work.... 


thing = soup.p #extract the first <p> element based on its html tag and store it in an object called "thing"
print(type(thing))
all = thing.find_next_siblings('p')
print(type(all))


<class 'bs4.element.Tag'>
<class 'bs4.element.ResultSet'>


In [29]:
print(len(all))
print(all[20])

28
<p>This will be the day when all of God's children will be able to sing with a new meaning, "My country, 'tis of thee, sweet land of liberty, of thee I sing. Land where my fathers died, land of the pilgrim's pride, from every mountainside, let freedom ring." </p>


In [30]:
# ResultSet looks like a list but doesn't have the same properties 
# Let's change it to something easier


speechtext = []
for row in all: #row is merely the name of the loop variable - it doesn't mean anything.
    text = ''.join(row.findAll(text=True)) #picks out all the text pieces and joins them together (no separator is added)
    data = [str(text.strip())] #gets rid of leading and trailing whitespace
    speechtext = speechtext + data #add new item to list

In [31]:
print(type(speechtext))
print(len(speechtext))
print(type(speechtext[20]), speechtext[20])

<class 'list'>
28
<class 'str'> This will be the day when all of God's children will be able to sing with a new meaning, "My country, 'tis of thee, sweet land of liberty, of thee I sing. Land where my fathers died, land of the pilgrim's pride, from every mountainside, let freedom ring."


### Now let's look at a few ways to change our feature space, starting with Stopwords

In [32]:
#There are many different sets of stopwords
# Let's see what's in NLTK

from nltk.corpus import stopwords #import the package

nltk_stopwords = stopwords.words("english") #pull out the words within the default nltk stopwords list

print(type(nltk_stopwords))
print(len(nltk_stopwords))
print(nltk_stopwords)

<class 'list'>
179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'sam

In [33]:
from sklearn.feature_extraction import text #import package

skl_stopwords = text.ENGLISH_STOP_WORDS #pull out words in sklearn stopwords list.  Note the different syntax
print(type(skl_stopwords))
print(len(skl_stopwords))
print(skl_stopwords)

<class 'frozenset'>
318
frozenset({'amount', 'one', 'among', 'been', 'do', 'himself', 'couldnt', 'per', 'eight', 'someone', 'interest', 'not', 'wherein', 'off', 'its', 'while', 'an', 'across', 'give', 'hasnt', 'i', 'latterly', 'are', 'afterwards', 'thereafter', 'when', 'whenever', 'itself', 'ever', 'nor', 'am', 'down', 'put', 'about', 'become', 'her', 'my', 'nevertheless', 'several', 'due', 'themselves', 'to', 'where', 'beforehand', 'perhaps', 'four', 'becoming', 'thence', 'onto', 'co', 'bill', 'anyway', 'has', 'go', 'now', 'who', 'could', 'found', 'serious', 'system', 'yet', 'only', 'as', 'everyone', 'either', 'it', 'but', 'back', 'third', 'made', 'towards', 'although', 'each', 'bottom', 'that', 'also', 'three', 'full', 'show', 'something', 'thereupon', 'in', 'beyond', 'nothing', 'moreover', 'below', 'herself', 'for', 'mine', 'over', 'whereas', 'besides', 'last', 'ourselves', 'six', 'get', 'might', 'then', 'via', 'can', 'hereby', 'still', 'others', 'is', 'together', 'too', 'very', 'ca

### Slight aside: data manipulation - lists, sets, and frozensets

* **Lists** are just that: lists of comma-separated values (items) between square brackets. Items in a list do *not* need to be the same type.
* **Sets** are similar to lists but they cannot have multiple occurrences of the same item, they are unordered, and they can only contain immutable objects (stuff that doesn't change, like strings or numbers). A set is mutable but the contents are not. 
* **Frozensets** are like sets except that they cannot be changed, i.e. they are immutable.

http://www.python-course.eu/python3_sets_frozensets.php - good reference for all the set math we're going to walk through
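A quick sketch of the three container types side by side. The word collections here are made-up stand-ins, not the real stopword lists, and the variable names are only for illustration.

```python
# Made-up word collections for illustration only (not the real stopword lists)
words_list = ["the", "a", "the", "of"]   # list: keeps order and duplicates
words_set = set(words_list)              # set: duplicates collapse to one entry
words_frozen = frozenset(words_list)     # frozenset: same contents, but immutable

words_set.add("and")                     # allowed, because sets are mutable

try:
    words_frozen.add("and")              # frozensets have no add() method
    frozen_error = None
except AttributeError as err:
    frozen_error = type(err).__name__

# Because a frozenset is hashable, it can itself be an element of a set
set_of_sets = {words_frozen}

print(len(words_list), len(words_set), len(words_frozen))
print(frozen_error)
```

This is why sklearn's stopword collection (a frozenset) has to be converted to a list or set before you can add or remove words.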

In [34]:
wordlist = ['thing','blah','thing',12]
print(wordlist)
wordset = set(wordlist)
print(wordset)

['thing', 'blah', 'thing', 12]
{'thing', 'blah', 12}


In [35]:
wordlist2 = ['thing','blah','thing',12, ['a','b']]
print(wordlist2)
wordset2 = set(wordlist2) #yes, throws an error because you included a list (mutable, so unhashable) as an element
print(wordset2) #never reached

['thing', 'blah', 'thing', 12, ['a', 'b']]


TypeError: unhashable type: 'list'

#### So what?  Well, we might want to be able to manipulate collections of text objects in ways that aren't easy to do in a data frame.

For example, what happens if you want to change the stopword list?

In [36]:
print(len(nltk_stopwords)) #remember nltk_stopwords is a list
nltk_stopwords.remove('before') #we can remove items
nltk_stopwords.remove('after')
print(len(nltk_stopwords))


179
177


In [37]:
nltk_stopwords.remove('blue') #but only if they are there.  Lists don't handle these errors well

ValueError: list.remove(x): x not in list
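If you want to stay with lists, there are a couple of common ways to guard against this error. The sketch below uses a short hypothetical list in place of `nltk_stopwords`; the third option previews the set-based approach.

```python
# Hypothetical short stopword list, standing in for nltk_stopwords
stop_list = ["before", "after", "the"]

# Option 1: check membership before removing
if "blue" in stop_list:
    stop_list.remove("blue")

# Option 2: attempt the removal and ignore the error if the word is absent
try:
    stop_list.remove("blue")
except ValueError:
    pass  # "blue" was never in the list, so there is nothing to do

# Option 3: on a set, discard() removes the item if present and is silent otherwise
stop_set = set(stop_list)
stop_set.discard("blue")
stop_set.discard("before")

print(stop_list)
print(sorted(stop_set))
```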

In [38]:
# sets have methods that are more forgiving than lists
nltk_stopwords = stopwords.words("english") #reload the full list, since we removed items above
nltk_stopwords_set = set(nltk_stopwords) #convert the list into a set

print(len(nltk_stopwords_set))
keepwords = set(['before', 'after','blue'])
my_stopwords = nltk_stopwords_set.difference(keepwords) #keeps the items of the first set that are not in keepwords; 'blue' is absent and no error is raised
print(len(my_stopwords))





179
177


In [39]:
# you can use "subtraction" to do the same thing

print(len(nltk_stopwords_set))
keepwords = set(['before', 'after','blue'])
my_stopwords = nltk_stopwords_set - keepwords
print(len(my_stopwords))

179
177


In [40]:
#We can use sets to do intersections and unions
#Create two sets to work with

set_a = set(nltk_stopwords) # create a set object from a list
print(type(set_a), len(set_a))


set_b = set(skl_stopwords) # create a set object from a frozenset
print(type(set_b), len(set_b))

<class 'set'> 179
<class 'set'> 318


In [41]:
#Find the intersection of our two sets and show the results
set_c = set_a.intersection(set_b) #set_a is already a set, so no conversion is needed
print(type(set_c), len(set_c))
print(set_c)

<class 'set'> 119
{'been', 'do', 'himself', 'not', 'off', 'its', 'while', 'i', 'an', 'when', 'are', 'itself', 'nor', 'which', 'am', 'down', 'about', 'her', 'my', 'to', 'themselves', 'where', 'has', 'now', 'who', 'only', 'as', 'it', 'but', 'each', 'that', 'in', 'below', 'herself', 'for', 'over', 'ourselves', 'then', 'can', 'is', 'too', 'very', 'most', 'into', 'up', 'he', 'there', 'here', 'hers', 'further', 'have', 'such', 'were', 'his', 'once', 'was', 'during', 'before', 'same', 'both', 'any', 'no', 'them', 'she', 'ours', 'or', 'yourself', 'some', 'at', 'how', 'they', 'between', 'me', 'yourselves', 'more', 'why', 'should', 'again', 'had', 'a', 're', 'out', 'the', 'yours', 'with', 'their', 'of', 'through', 'from', 'will', 'all', 'we', 'whom', 'him', 'few', 'against', 'you', 'until', 'own', 'myself', 'being', 'if', 'so', 'those', 'other', 'under', 'these', 'your', 'what', 'our', 'after', 'than', 'be', 'on', 'above', 'this', 'by', 'because', 'and'}


In [42]:
#maybe you want to see which words are coming from each set
#you can start with your original set and then subtract the terms common in both sets
only_nltk = set_a - set_c
print(len(only_nltk), only_nltk)

60 {'aren', 'won', 'm', 'having', "wouldn't", 'shouldn', "mustn't", "hasn't", "shouldn't", "haven't", "that'll", "shan't", "it's", 'don', 'just', "hadn't", "weren't", 'y', 't', 'wasn', "aren't", 'isn', 'didn', 'd', "you'd", 'doesn', 'needn', 'mustn', "wasn't", 's', 'did', 'shan', "didn't", 'weren', 'o', 'll', 'ain', "she's", 'wouldn', "mightn't", "you'll", "won't", "couldn't", 'couldn', "don't", "needn't", 'haven', "you've", 'theirs', "should've", 've', "doesn't", 'mightn', 'hadn', 'ma', 'hasn', 'doing', "you're", 'does', "isn't"}


In [43]:
only_skl = set_b - set_c
print(len(only_skl), only_skl)

199 {'amount', 'one', 'among', 'per', 'someone', 'couldnt', 'eight', 'interest', 'wherein', 'across', 'give', 'thereafter', 'hasnt', 'latterly', 'whenever', 'afterwards', 'ever', 'put', 'several', 'become', 'nevertheless', 'due', 'beforehand', 'perhaps', 'four', 'becoming', 'thence', 'onto', 'co', 'bill', 'anyway', 'go', 'could', 'found', 'serious', 'system', 'yet', 'everyone', 'either', 'third', 'back', 'towards', 'made', 'although', 'bottom', 'three', 'also', 'show', 'something', 'thereupon', 'beyond', 'nothing', 'moreover', 'mine', 'besides', 'last', 'six', 'get', 'might', 'via', 'hereby', 'still', 'others', 'within', 'together', 'wherever', 'cannot', 'less', 'else', 'became', 'therefore', 'however', 'first', 'sometimes', 'beside', 'almost', 'inc', 'along', 'con', 'thereby', 'hereafter', 'amoungst', 'ie', 'whoever', 'thru', 'former', 'least', 'already', 'eleven', 'since', 'whose', 'anyone', 'mostly', 'none', 'except', 'must', 'otherwise', 'much', 'even', 'another', 'indeed', 'herein

In [44]:
#combining sets
print(len(my_stopwords))
another_set = set(['tis','thee'])
my_stopwords.update(another_set) #adds another_set to existing my_stopwords object
print(len(my_stopwords))

#sets do not support "+"; the union operator "|" (or "|=" for an in-place update) is the closest equivalent

177
179
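For completeness, here is a small sketch of the operator forms of these set operations. The sets are made up for illustration and simply stand in for `my_stopwords` and `another_set`.

```python
# Made-up sets standing in for my_stopwords and another_set
base = {"the", "of", "and"}
extra = {"tis", "thee", "the"}

combined = base | extra          # union: returns a new set, base is unchanged
in_place = set(base)             # copy so the original stays intact
in_place |= extra                # same effect as in_place.update(extra)

only_one_side = base ^ extra     # symmetric difference: items in exactly one of the two sets

print(sorted(combined))
print(sorted(only_one_side))
```

Note that `update()` and `|=` change the set in place, while `|` and `union()` return a new set and leave the originals alone.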


## Alright, enough set math...
It's important to understand how you can manipulate some of the objects you'll be working with.
As always, the defaults represent a good starting point, but rarely give you the results you're looking for.

### Let's go back to our vectorizers and see how different stopword lists affect the shape of our feature space

In [45]:
#first, let's use the default english stopwords list
from sklearn.feature_extraction.text import TfidfVectorizer

tf_none = TfidfVectorizer(binary=False, stop_words = "english") #define the method
none_dm = tf_none.fit_transform(speechtext) #apply the method

print(none_dm.shape)


(28, 412)


In [46]:
print(tf_none.get_feature_names()) #in scikit-learn 1.0+ use get_feature_names_out(); get_feature_names() was removed in 1.2

['able', 'ago', 'ahead', 'alabama', 'alleghenies', 'allow', 'almighty', 'america', 'american', 'appalling', 'architects', 'areas', 'asking', 'autumn', 'awakening', 'bad', 'bank', 'bankrupt', 'basic', 'battered', 'beacon', 'beautiful', 'beginning', 'believe', 'believes', 'bitterness', 'black', 'blow', 'bodies', 'bound', 'boys', 'bright', 'brotherhood', 'brothers', 'brutality', 'business', 'california', 'came', 'capital', 'captivity', 'cash', 'catholics', 'cells', 'chains', 'changed', 'character', 'check', 'children', 'cities', 'citizens', 'citizenship', 'city', 'civil', 'color', 'colorado', 'come', 'community', 'concerned', 'condition', 'conduct', 'constitution', 'content', 'continue', 'cooling', 'corners', 'country', 'created', 'creative', 'creed', 'crippled', 'crooked', 'cup', 'curvaceous', 'dark', 'day', 'daybreak', 'declaration', 'decree', 'deeds', 'deeply', 'defaulted', 'degenerate', 'demand', 'desert', 'desolate', 'despair', 'destiny', 'determination', 'devotees', 'died', 'difficu

In [47]:
#next, let's use our custom stopwords list.
tf_none = TfidfVectorizer(binary=False, stop_words = my_stopwords) #define the method
none_dm = tf_none.fit_transform(speechtext) #apply the method
print(none_dm.shape)

#remember, my_stopwords started as nltk stopwords and then we removed before & after and added tis & thee

(28, 443)


In [48]:
print(tf_none.get_feature_names())


['able', 'ago', 'ahead', 'alabama', 'alleghenies', 'allow', 'almighty', 'alone', 'also', 'america', 'american', 'appalling', 'architects', 'areas', 'asking', 'autumn', 'awakening', 'back', 'bad', 'bank', 'bankrupt', 'basic', 'battered', 'beacon', 'beautiful', 'become', 'beginning', 'believe', 'believes', 'bitterness', 'black', 'blow', 'bodies', 'bound', 'boys', 'bright', 'brotherhood', 'brothers', 'brutality', 'business', 'california', 'came', 'cannot', 'capital', 'captivity', 'cash', 'catholics', 'cells', 'chains', 'changed', 'character', 'check', 'children', 'cities', 'citizens', 'citizenship', 'city', 'civil', 'color', 'colorado', 'come', 'community', 'concerned', 'condition', 'conduct', 'constitution', 'content', 'continue', 'cooling', 'corners', 'country', 'created', 'creative', 'creed', 'crippled', 'crooked', 'cup', 'curvaceous', 'dark', 'day', 'daybreak', 'declaration', 'decree', 'deeds', 'deeply', 'defaulted', 'degenerate', 'demand', 'desert', 'desolate', 'despair', 'destiny', 

In [49]:
# you can define stopword lists directly (or as a set)
# maybe you like the stopwords in R
r_stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 
               'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 
               'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 
               'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 
               'having', 'do', 'does', 'did', 'doing', 'would', 'should', 'could', 'ought', "i'm", "you're", "he's",
               "she's", "it's", "we're", "they're", "i've", "you've", "we've", "they've", "i'd", "you'd", "he'd", 
               "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", "isn't", "aren't", 
               "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", 
               "shan't", "shouldn't", "can't", 'cannot', "couldn't", "mustn't", "let's", "that's", "who's", "what's", 
               "here's", "there's", "when's", "where's", "why's", "how's", 'a', 'an', 'the', 'and', 'but', 'if', 'or', 
               'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 
               'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 
               'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 
               'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 
               'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very']
print(len(r_stopwords))

174


In [50]:
# or maybe you want to add to an existing stopword list
my_stopwords = list(skl_stopwords) + ["rt"] #lowercase, because the vectorizers lowercase the text by default before stopwords are matched
print(type(my_stopwords), len(my_stopwords))

<class 'list'> 319


### Let's look at a few more examples of how stopwords affect our feature space

In [51]:
#Create a baseline feature space
from sklearn.feature_extraction.text import TfidfVectorizer

tf_none = TfidfVectorizer(binary=False, min_df = .01, max_df = .95) #define the method
none_dm = tf_none.fit_transform(speechtext) #apply the method

print(none_dm.shape)

(28, 512)


In [52]:
#Remove default sklearn stopwords 
tf_skl = TfidfVectorizer(binary=False, stop_words='english', min_df = .01, max_df = .95) 
skl_dm = tf_skl.fit_transform(speechtext)

print(skl_dm.shape)
skl_features = tf_skl.get_feature_names()

(28, 412)


In [53]:
#remove the default nltk stopwords
#remember, nltk_stopwords is an object we created
tf_nltk = TfidfVectorizer(binary=False, stop_words=nltk_stopwords, min_df = .01, max_df = .95) 
nltk_dm = tf_nltk.fit_transform(speechtext)

print(nltk_dm.shape)


(28, 445)


In [54]:
#remove our custom stopwords
#remember, we switched my_stopwords to be based on sklearn's list when we added the retweet marker
tf_my = TfidfVectorizer(binary=False, stop_words=my_stopwords, min_df = .01, max_df = .95) 
my_dm = tf_my.fit_transform(speechtext)

print(my_dm.shape)


(28, 412)


In [55]:
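#Spot checking 412 features is no fun, but set math can tell us the features are the same
my_set = set(tf_my.get_feature_names())   #features from our custom stopword list
skl_set = set(tf_skl.get_feature_names()) #features from the default sklearn stopword list

#both sets have 412 items, so an empty difference means they are identical
difference = my_set - skl_set
difference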

set()
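To make the mechanics concrete, here is a library-free sketch of what stopword removal does to the vocabulary. It assumes the same token pattern scikit-learn's vectorizers use by default (two or more word characters, after lowercasing); the two documents, the tiny stopword set, and the `vocabulary` helper are all made up for illustration.

```python
import re

# Two short made-up documents in the spirit of the speech
docs = [
    "Let freedom ring from every mountainside",
    "This will be the day when all of God's children will be able to sing",
]

# Two or more word characters, matching the vectorizers' default token pattern
token_pattern = re.compile(r"\b\w\w+\b")

def vocabulary(documents, stop_words=frozenset()):
    """Collect the sorted set of lowercase tokens that are not stop words."""
    vocab = set()
    for doc in documents:
        tokens = token_pattern.findall(doc.lower())
        vocab.update(t for t in tokens if t not in stop_words)
    return sorted(vocab)

# Hypothetical mini stopword list
toy_stopwords = {"this", "will", "be", "the", "when", "all", "of", "to", "from", "every"}

full_vocab = vocabulary(docs)
reduced_vocab = vocabulary(docs, toy_stopwords)

print(len(full_vocab), len(reduced_vocab))
print(sorted(set(full_vocab) - set(reduced_vocab)))  # the features that were dropped
```

Each vocabulary entry becomes one column of the document-term matrix, so a longer stopword list means fewer columns. Stopwords that never occur in the text change nothing, which is why an added term can leave the shape untouched.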

# ______
### Time for some hands on (10-15 minutes)
Break up into pairs and practice modifying stopwords lists.

1) Choose one of the default stop words lists and inspect its contents

2) Compare it against the stopwords you see in the speech text

3) Based on the comparison, select 3 terms to add & 3 terms to remove to customize your stopwords list

4) Insert your custom list into the vectorizer above so we can compare it against our baseline

# ______

In [74]:
skl_stopwords = text.ENGLISH_STOP_WORDS #pull out words in sklearn stopwords list.  Note the different syntax
print(type(skl_stopwords))
print(len(skl_stopwords))
print(skl_stopwords)

print()
print(all[20])

print('skl_stopwords =' ,len(text.ENGLISH_STOP_WORDS))
my_stopwords = list(text.ENGLISH_STOP_WORDS) + ["tis", "thee", "let"]
print('add 3 words, stop list count is now' , len(my_stopwords))

my_stopwords.remove("bill")
print('removed 1, stop list count is now' , len(my_stopwords))
my_stopwords.remove("ltd")
print('removed 1, stop list count is now' , len(my_stopwords))
my_stopwords.remove("ie")
print('removed 1, stop list count is now' , len(my_stopwords))

<class 'frozenset'>
318
frozenset({'amount', 'one', 'among', 'been', 'do', 'himself', 'couldnt', 'per', 'eight', 'someone', 'interest', 'not', 'wherein', 'off', 'its', 'while', 'an', 'across', 'give', 'hasnt', 'i', 'latterly', 'are', 'afterwards', 'thereafter', 'when', 'whenever', 'itself', 'ever', 'nor', 'am', 'down', 'put', 'about', 'become', 'her', 'my', 'nevertheless', 'several', 'due', 'themselves', 'to', 'where', 'beforehand', 'perhaps', 'four', 'becoming', 'thence', 'onto', 'co', 'bill', 'anyway', 'has', 'go', 'now', 'who', 'could', 'found', 'serious', 'system', 'yet', 'only', 'as', 'everyone', 'either', 'it', 'but', 'back', 'third', 'made', 'towards', 'although', 'each', 'bottom', 'that', 'also', 'three', 'full', 'show', 'something', 'thereupon', 'in', 'beyond', 'nothing', 'moreover', 'below', 'herself', 'for', 'mine', 'over', 'whereas', 'besides', 'last', 'ourselves', 'six', 'get', 'might', 'then', 'via', 'can', 'hereby', 'still', 'others', 'is', 'together', 'too', 'very', 'ca

## Shifting gears... Let's talk about another way to manipulate your feature space
### Stemming

Lemmatization (and stemming) can sometimes be helpful. Reducing words to their root form, or counting only the headword, collapses related words into a single feature, so frequency-based vector space models end up with fewer dimensions and larger element-wise values.

See http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization

Online: http://text-processing.com/demo/stem/
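The "fewer dimensions, larger values" effect can be seen in a tiny sketch. The token list and the hand-made root mapping below are hypothetical; a real stemmer or lemmatizer would produce the mapping for you.

```python
from collections import Counter

# Made-up tokens and a hand-made mapping to root forms (illustration only)
tokens = ["dream", "dreams", "dreamed", "ring", "rings", "ring"]
toy_roots = {"dreams": "dream", "dreamed": "dream", "rings": "ring"}

raw_counts = Counter(tokens)                                 # one feature per surface form
root_counts = Counter(toy_roots.get(t, t) for t in tokens)   # one feature per root

print(len(raw_counts), dict(raw_counts))
print(len(root_counts), dict(root_counts))
```

Five features shrink to two, and the counts that were spread across several surface forms pile up on the shared root.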

In [76]:
testlist = ["maximum","presumably", "multiply","provision", "churches","owed", "ear", "saying", "crying", "string", "meant", "cement", "is", "are", "aardwolves"]

print(testlist)

['maximum', 'presumably', 'multiply', 'provision', 'churches', 'owed', 'ear', 'saying', 'crying', 'string', 'meant', 'cement', 'is', 'are', 'aardwolves']


### Porter Stemmer

* "Gentle Stemmer"
* Algorithm dates from 1980
* Still the default “go-to” stemmer
* Excellent trade-off between speed, readability, and accuracy
* Stems using a set of rules, or transformations, applied in a succession of steps
* About 60 rules in 6 steps
* No recursion

#### Porter Stemmer Steps
* Step 1: Gets rid of plurals and -ed or -ing suffixes
* Step 2: Turns terminal y to i when there is another vowel in the stem
* Step 3: Maps double suffixes to single ones: -ization, -ational, etc.
* Step 4: Deals with suffixes, -full, -ness etc.
* Step 5: Takes off -ant, -ence, etc.
* Step 6: Removes a final -e 


Original article by Mr. Porter: http://tartarus.org/martin/PorterStemmer/def.txt
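To get a feel for "rules applied in a succession of steps", here is a deliberately simplified sketch. It is not the Porter algorithm: it implements only a handful of rules loosely modeled on the first two steps, skips the conditions the real algorithm uses to decide when a rule may fire, and the function name `toy_stem` is made up for this example.

```python
def toy_stem(word):
    """Simplified suffix stripper: a few rules applied in successive steps.

    This is NOT the Porter algorithm, only an illustration of step-wise rules.
    """
    word = word.lower()
    if len(word) <= 2:          # leave very short words alone
        return word

    # Step A: plurals
    if word.endswith("sses"):
        word = word[:-2]        # caresses -> caress
    elif word.endswith("ies"):
        word = word[:-2]        # ponies -> poni
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]        # cats -> cat

    # Step B: strip -ing or -ed, but only if the remaining stem contains a vowel
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if any(ch in "aeiou" for ch in stem):
                word = stem
            break

    # Step C: terminal y -> i when there is another vowel in the stem
    if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
        word = word[:-1] + "i"

    return word

for w in ["cats", "caresses", "ponies", "string", "multiply", "is"]:
    print(w, "->", toy_stem(w))
```

Note how "string" survives untouched: removing -ing would leave "str", which has no vowel, so the rule does not fire. Conditions like this are what keep a rule-based stemmer from chopping too much.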

In [77]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer() #define method 

# There are a few implementations of Porter's original work.  
#More information on them may be found @ http://www.nltk.org/api/nltk.stem.html#nltk.stem.porter.PorterStemmer

print(testlist)
[ps.stem(word) for word in testlist] 
#this is a list comprehension: for each word in the list, apply the stemmer and collect the results in a new list

['maximum', 'presumably', 'multiply', 'provision', 'churches', 'owed', 'ear', 'saying', 'crying', 'string', 'meant', 'cement', 'is', 'are', 'aardwolves']


['maximum',
 'presum',
 'multipli',
 'provis',
 'church',
 'owe',
 'ear',
 'say',
 'cri',
 'string',
 'meant',
 'cement',
 'is',
 'are',
 'aardwolv']

In [78]:
#### Porter Mishaps

print(ps.stem("severing"), ps.stem("several"))

print(ps.stem("university"), ps.stem("universe"))

print(ps.stem("iron"), ps.stem("ironic"))

print(ps.stem("animal"), ps.stem("animated"))

sever sever
univers univers
iron iron
anim anim


In [79]:
#lancaster stemmer - much more aggressive

from nltk.stem.lancaster import LancasterStemmer
ls = LancasterStemmer()
[ls.stem(word) for word in testlist]

['maxim',
 'presum',
 'multiply',
 'provid',
 'church',
 'ow',
 'ear',
 'say',
 'cry',
 'string',
 'meant',
 'cem',
 'is',
 'ar',
 'aardwolv']

In [80]:
#### Are Porter Mishaps also Lancaster mishaps?

print(ls.stem("severing"), ls.stem("several"))

print(ls.stem("university"), ls.stem("universe"))

print(ls.stem("iron"), ls.stem("ironic"))

print(ls.stem("animal"), ls.stem("animated"))

sev sev
univers univers
iron iron
anim anim


#### Next gen stemmers:  Porter2 or Snowball
Snowball is actually a language for creating stemmers. Its English stemmer (Porter2) is based on the Porter logic, and the NLTK method can handle multiple languages.

http://snowballstem.org/

In [81]:

from nltk.stem import SnowballStemmer
ss = SnowballStemmer("english")
[ss.stem(word) for word in testlist]

['maximum',
 'presum',
 'multipli',
 'provis',
 'church',
 'owe',
 'ear',
 'say',
 'cri',
 'string',
 'meant',
 'cement',
 'is',
 'are',
 'aardwolv']

In [82]:
#### Fix mishaps?

print(ss.stem("severing"), ss.stem("several"))

print(ss.stem("university"), ss.stem("universe"))

print(ss.stem("iron"), ss.stem("ironic"))

print(ss.stem("animal"), ss.stem("animated"))

sever sever
univers univers
iron iron
anim anim


#### Stemming Shortcomings
* Stemmers are rudimentary
* No word sense disambiguation (“bats” vs “batting”)
* No POS disambiguation (“Batting” could be noun or verb, but “hitting” could only be verb)
* Cannot handle irregular conjugation/inflection (“to be”, etc.)


#### Lemmatization
Lemmas differ from stems in that a lemma is a canonical form (basic form of a word used as a dictionary entry) of the word, while a stem may not be a real word. So lemmatizers try to reduce things to their lemma which is a word. 
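Conceptually, a lemmatizer is closer to a dictionary lookup than to suffix chopping. The sketch below uses a tiny hand-made lookup table keyed by word and part of speech; the table, the `toy_lemmatize` function, and its default of noun are assumptions made for illustration, not how any particular library stores its data.

```python
# Hypothetical hand-made lookup table keyed by (word, part of speech).
# A real lemmatizer relies on a full dictionary; these few entries are only for illustration.
toy_lemmas = {
    ("cacti", "n"): "cactus",
    ("children", "n"): "child",
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("is", "v"): "be",
    ("are", "v"): "be",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if we know it, otherwise the word unchanged."""
    return toy_lemmas.get((word.lower(), pos), word)

print(toy_lemmatize("cacti"))
print(toy_lemmatize("running"))            # treated as a noun, so left alone
print(toy_lemmatize("running", pos="v"))   # treated as a verb
```

Two things stand out: irregular forms ("cacti", "ran", "are") can be handled because nothing is being chopped, and the answer depends on the part of speech you supply.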

In [83]:
# lemmatizing - tries to make sense of things instead of chopping them down
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(testlist)
[wnl.lemmatize(word) for word in testlist]

['maximum', 'presumably', 'multiply', 'provision', 'churches', 'owed', 'ear', 'saying', 'crying', 'string', 'meant', 'cement', 'is', 'are', 'aardwolves']


['maximum',
 'presumably',
 'multiply',
 'provision',
 'church',
 'owed',
 'ear',
 'saying',
 'cry',
 'string',
 'meant',
 'cement',
 'is',
 'are',
 'aardwolf']

In [84]:
#another great example
#stemming can chop but not convert

sent = "cats running ran cactus cactuses cacti community communities"

print(" ".join([ps.stem(i) for i in sent.split()]))
print(" ".join([ls.stem(i) for i in sent.split()]))
print(" ".join([ss.stem(i) for i in sent.split()]))


cat run ran cactu cactus cacti commun commun
cat run ran cact cactus cact commun commun
cat run ran cactus cactus cacti communiti communiti


In [85]:
# lemmatizing tries a little harder
print(sent)
print(" ".join([wnl.lemmatize(i) for i in sent.split()]))

cats running ran cactus cactuses cacti community communities
cat running ran cactus cactus cactus community community


In [86]:
# why did it miss run/running?  It can handle parts of speech and the default is noun

print(sent)
print(" ".join([wnl.lemmatize(i, pos = "v") for i in sent.split()]))



cats running ran cactus cactuses cacti community communities
cat run run cactus cactuses cacti community communities


In [87]:
# what happens to MLK's speech if we stem it?

#porter - note the capitals
p_speech = [[ps.stem(word) for word in sentence.split(" ")] for sentence in speechtext]
p_speech = [" ".join(sentence) for sentence in p_speech]

print(p_speech[20])

thi will be the day when all of god' children will be abl to sing with a new meaning, "mi country, 'ti of thee, sweet land of liberty, of thee I sing. land where my father died, land of the pilgrim' pride, from everi mountainside, let freedom ring."


In [88]:
# lancaster - note the lack of capitals
l_speech = [[ls.stem(word) for word in sentence.split()] for sentence in speechtext]
l_speech = [" ".join(sentence) for sentence in l_speech]

print(l_speech[20])

thi wil be the day when al of god's childr wil be abl to sing with a new meaning, "my country, 'tis of thee, sweet land of liberty, of the i sing. land wher my fath died, land of the pilgrim's pride, from every mountainside, let freedom ring."


In [89]:
#lemmatization - Capitals are there

lem_speech = [[wnl.lemmatize(word) for word in sentence.split()] for sentence in speechtext]
lem_speech = [" ".join(sentence) for sentence in lem_speech]

print(lem_speech[20])

This will be the day when all of God's child will be able to sing with a new meaning, "My country, 'tis of thee, sweet land of liberty, of thee I sing. Land where my father died, land of the pilgrim's pride, from every mountainside, let freedom ring."


#### Stemming and lemmatization affect both the size and contents of your feature space

In [90]:
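#effect of lancaster stemmer on feature space (same vectorizer settings, including the sklearn stop words)
#note: the text was stemmed first, so stemmed stop words such as 'thi' and 'wil' no longer match the stop word list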
#tf_skl = TfidfVectorizer(binary=False, stop_words='english', min_df = .01, max_df = .95)
skl_ls_dm = tf_skl.fit_transform(l_speech) # applied to lancaster stemmed speech

print(skl_dm.shape)
print(skl_ls_dm.shape)
ls_features = tf_skl.get_feature_names()
print(len(ls_features))

(28, 412)
(28, 434)
434


In [91]:
#effect of lemmatization on feature space (stop words already accounted for)
#tf_skl = TfidfVectorizer(binary=False, stop_words='english', min_df = .01, max_df = .95) 
skl_lem_dm = tf_skl.fit_transform(lem_speech)

print(skl_dm.shape)
print(skl_lem_dm.shape)

lem_features = tf_skl.get_feature_names()
print(len(lem_features))

(28, 412)
(28, 409)
409


A little more set math

In [92]:
ls_minus_lem = set(ls_features) - set(lem_features)
print(len(ls_minus_lem))

204


## Let's see what we can do with our news stories

In [94]:
filename = "C:\\Users\\Paul\\Desktop\\Rockhurst\\BIA 6304-Text Mining\\Week 2\\nytimes2013.csv"
newsdf = pd.read_csv(filename, index_col = 0) 

print(newsdf.shape)
newsdf.head(1)

(3848, 5)


Unnamed: 0,date,description,headline,url,text
0,2013-01-01,"Ending a climactic showdown in the final hours of the 112th Congress, the House sent to President Obama legislation to avert big income tax increases on most Americans.",Divided House Passes Tax Deal in End to Latest Fiscal Standoff,http://www.nytimes.com/2013/01/02/us/politics/house-takes-on-fiscal-cliff.html,"WASHINGTON — Ending a climactic fiscal showdown in the final hours of the 112th Congress, the House late Tuesday passed and sent to legislation to avert big income tax increases on most Americans and prevent large cuts in spending for the Pentagon and other government programs. The measure, brought to the House floor less than 24 hours after its passage in the Senate, was approved 257 to 167, with 85 Republicans joining 172 Democrats in voting to allow income taxes to rise for the first time in two decades, in this case for the highest-earning Americans. Voting no were 151 Republicans and 16 Democrats. The bill was expected to be signed quickly by Mr. Obama, who won re-election on a promise to increase taxes on the wealthy. Mr. Obama strode into the White House briefing room shortly after the vote, less to hail the end of the fiscal crisis than to lay out a marker for the next one. “The one thing that I think, hopefully, the new year will focus on,” he said, “is seeing if we can put a package like this together with a little bit less drama, a little less brinkmanship, and not scare the heck out of folks quite as much.” In approving the measure after days of legislative intrigue, Congress concluded its final and most pitched fight over fiscal policy, the culmination of two years of battles over taxes, the federal debt, spending and what to do to slow the growth in popular social programs like . The decision by Republican leaders to allow the vote came despite widespread scorn among House Republicans for the bill, passed overwhelmingly by the Senate in the early hours of New Year’s Day. 
They were unhappy that it did not include significant spending cuts in health and other social programs, which they say are essential to any long-term solution to the nation’s debt. Democrats, while hardly placated by the compromise, celebrated Mr. Obama’s nominal victory in his final showdown with House Republicans in the 112th Congress, who began their term emboldened by scores of new, conservative members whose reach to the right ultimately tipped them over. “The American people are the real winners tonight,” Representative Bill Pascrell Jr., Democrat of New Jersey, said on the House floor, “not anyone who navigates these halls.” Not a single leader among House Republicans came to the floor to speak in favor of the bill, though Speaker , who rarely takes part in roll calls, voted in favor. Representative Eric Cantor of Virginia, the majority leader, and Representative Kevin McCarthy of California, the No. 3 Republican, voted no. Representative Paul D. Ryan, the budget chairman who was the Republican vice-presidential candidate, supported the bill. Despite the party divisions, many Republicans in their remarks characterized the measure, which allows taxes to go up on household income over $400,000 for individuals and $450,000 for couples but makes permanent tax cuts for income below that level, as a victory of sorts, even as so many of them declined to vote for it. “After more than a decade of criticizing these tax cuts,” said Representative Dave Camp of Michigan, “Democrats are finally joining Republicans in making them permanent. Republicans and the American people are getting something really important, permanent tax relief.” The dynamic with the House was a near replay of a fight at the end of 2011 over a break extension. In that showdown, Senate Democrats and Republicans passed legislation, and while House Republicans fulminated, they were eventually forced to swallow it. 
On Tuesday, as they got a detailed look at the Senate’s fiscal legislation, House Republicans ranging from Midwest pragmatists to -blessed conservatives voiced serious reservations about the measure, emerging from a lunchtime New Year’s Day meeting with their leaders, eyes flashing and faces grim, insisting they would not accept a bill without substantial savings from cuts. The unrest reached to the highest levels as Mr. Cantor told members in a closed-door meeting in the basement of the Capitol that he could not support the legislation in its current form. Mr. Boehner, who faces a re-election vote on his post on Thursday when the 113th Congress convenes, had grave concerns as well, but he had pledged to allow the House to consider any legislation that cleared the Senate. And he was not eager to have such a major piece of legislation pass with mainly opposition votes, and the outcome could be seen as undermining his authority. Adding to the pressure on the House, the fiscal agreement was reached by Senator Mitch McConnell of Kentucky, the Senate Republican leader, and had deep Republican support in the Senate, isolating the House Republicans in their opposition. Some of the Senate Republicans who backed the bill are staunch conservatives, like Senators Patrick J. Toomey of Pennsylvania and Tom Coburn of Oklahoma, with deep credibility among House Republicans."


In [95]:
#remember my favorite vectorizer
from sklearn.feature_extraction.text import CountVectorizer

cv8 = CountVectorizer(binary=False, min_df = .1, stop_words = "english") #define the transformation
cv8_news = cv8.fit_transform(newsdf['text']) #apply the transformation
print(cv8_news.shape)

names = cv8.get_feature_names_out()   # array of feature names (get_feature_names() was removed in scikit-learn 1.2)
count = np.sum(cv8_news.toarray(), axis = 0) # add up feature counts 
count2 = count.tolist()  # convert numpy array to list
count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list
count_df.sort_values(['count'], ascending = False)[0:20]  #arrange by count instead

(3848, 426)


Unnamed: 0,count
said,21374
mr,13575
new,7997
like,6614
people,5547
year,5461
years,4853
time,4757
just,4127
city,3514


# ____
## Break up into pairs for 10-15 minutes (time permitting)
1) Stem or lemmatize the text (not the headlines) from our news articles

2) Create and apply the following vectorizer inserting a stopwords list of your choice
- CountVectorizer(binary=False, min_df = .1, stop_words = "your choice")

3) Print a sorted list of your top 20 features

4) Report back the decisions you made and whether you're happy with your top feature list
# ____

In [99]:
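# NOTE: this passes each whole article to wnl.lemmatize as if it were one word.
# No article matches a WordNet entry, so the text comes back unchanged and the
# counts below are identical to the unlemmatized run in In [95].
# To actually lemmatize, split each article into tokens and lemmatize each token.
news_df_lem = [wnl.lemmatize(word) for word in newsdf['text']]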

news_cv = CountVectorizer(binary=False, min_df = .1, stop_words = "english") 
news_cv_fit = news_cv.fit_transform(news_df_lem) 
print(news_cv_fit.shape)

names = news_cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
count = np.sum(news_cv_fit.toarray(), axis = 0)
count2 = count.tolist() 
count_df = pd.DataFrame(count2, index = names, columns = ['count'])
count_df.sort_values(['count'], ascending = False)[0:20]  


(3848, 426)


Unnamed: 0,count
said,21374
mr,13575
new,7997
like,6614
people,5547
year,5461
years,4853
time,4757
just,4127
city,3514
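The shape and counts here match the unlemmatized run exactly, because the list comprehension iterates over whole articles rather than words. Here is a minimal sketch of applying a lemmatizer token by token. `lemmatize_text` and the toy lemma table are hypothetical stand-ins so the example runs without the WordNet data.

```python
def lemmatize_text(text, lemmatize_word):
    """Apply lemmatize_word to every whitespace-separated token and rejoin."""
    return " ".join(lemmatize_word(token) for token in text.split())


# Hypothetical stand-in for a real lemmatizer, used only to keep the sketch self-contained
toy_lemmas = {"cats": "cat", "communities": "community", "years": "year"}


def toy_lemmatize(word):
    # Return the known lemma, or the word unchanged if it is not in the table
    return toy_lemmas.get(word, word)


docs = ["cats and communities", "two years"]
lemmatized = [lemmatize_text(doc, toy_lemmatize) for doc in docs]
print(lemmatized)
```

In the notebook, assuming `wnl` is the `WordNetLemmatizer` instance created earlier, the equivalent call would be `newsdf['text'].apply(lambda x: lemmatize_text(x, wnl.lemmatize))`. Keep in mind that `wnl.lemmatize` treats every word as a noun unless you pass a `pos` argument, and that splitting on whitespace leaves punctuation attached to tokens.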


In [100]:
# Apply the porter stemmer
newsdf['pstem'] = newsdf["text"].apply(lambda x: [ps.stem(y) for y in x.split()])
newsdf['pstem']= [" ".join(token) for token in newsdf['pstem']]
newsdf.head(2)

Unnamed: 0,date,description,headline,url,text,pstem
0,2013-01-01,"Ending a climactic showdown in the final hours of the 112th Congress, the House sent to President Obama legislation to avert big income tax increases on most Americans.",Divided House Passes Tax Deal in End to Latest Fiscal Standoff,http://www.nytimes.com/2013/01/02/us/politics/house-takes-on-fiscal-cliff.html,"WASHINGTON — Ending a climactic fiscal showdown in the final hours of the 112th Congress, the House late Tuesday passed and sent to legislation to avert big income tax increases on most Americans and prevent large cuts in spending for the Pentagon and other government programs. The measure, brought to the House floor less than 24 hours after its passage in the Senate, was approved 257 to 167, with 85 Republicans joining 172 Democrats in voting to allow income taxes to rise for the first time in two decades, in this case for the highest-earning Americans. Voting no were 151 Republicans and 16 Democrats. The bill was expected to be signed quickly by Mr. Obama, who won re-election on a promise to increase taxes on the wealthy. Mr. Obama strode into the White House briefing room shortly after the vote, less to hail the end of the fiscal crisis than to lay out a marker for the next one. “The one thing that I think, hopefully, the new year will focus on,” he said, “is seeing if we can put a package like this together with a little bit less drama, a little less brinkmanship, and not scare the heck out of folks quite as much.” In approving the measure after days of legislative intrigue, Congress concluded its final and most pitched fight over fiscal policy, the culmination of two years of battles over taxes, the federal debt, spending and what to do to slow the growth in popular social programs like . The decision by Republican leaders to allow the vote came despite widespread scorn among House Republicans for the bill, passed overwhelmingly by the Senate in the early hours of New Year’s Day. 
They were unhappy that it did not include significant spending cuts in health and other social programs, which they say are essential to any long-term solution to the nation’s debt. Democrats, while hardly placated by the compromise, celebrated Mr. Obama’s nominal victory in his final showdown with House Republicans in the 112th Congress, who began their term emboldened by scores of new, conservative members whose reach to the right ultimately tipped them over. “The American people are the real winners tonight,” Representative Bill Pascrell Jr., Democrat of New Jersey, said on the House floor, “not anyone who navigates these halls.” Not a single leader among House Republicans came to the floor to speak in favor of the bill, though Speaker , who rarely takes part in roll calls, voted in favor. Representative Eric Cantor of Virginia, the majority leader, and Representative Kevin McCarthy of California, the No. 3 Republican, voted no. Representative Paul D. Ryan, the budget chairman who was the Republican vice-presidential candidate, supported the bill. Despite the party divisions, many Republicans in their remarks characterized the measure, which allows taxes to go up on household income over $400,000 for individuals and $450,000 for couples but makes permanent tax cuts for income below that level, as a victory of sorts, even as so many of them declined to vote for it. “After more than a decade of criticizing these tax cuts,” said Representative Dave Camp of Michigan, “Democrats are finally joining Republicans in making them permanent. Republicans and the American people are getting something really important, permanent tax relief.” The dynamic with the House was a near replay of a fight at the end of 2011 over a break extension. In that showdown, Senate Democrats and Republicans passed legislation, and while House Republicans fulminated, they were eventually forced to swallow it. 
On Tuesday, as they got a detailed look at the Senate’s fiscal legislation, House Republicans ranging from Midwest pragmatists to -blessed conservatives voiced serious reservations about the measure, emerging from a lunchtime New Year’s Day meeting with their leaders, eyes flashing and faces grim, insisting they would not accept a bill without substantial savings from cuts. The unrest reached to the highest levels as Mr. Cantor told members in a closed-door meeting in the basement of the Capitol that he could not support the legislation in its current form. Mr. Boehner, who faces a re-election vote on his post on Thursday when the 113th Congress convenes, had grave concerns as well, but he had pledged to allow the House to consider any legislation that cleared the Senate. And he was not eager to have such a major piece of legislation pass with mainly opposition votes, and the outcome could be seen as undermining his authority. Adding to the pressure on the House, the fiscal agreement was reached by Senator Mitch McConnell of Kentucky, the Senate Republican leader, and had deep Republican support in the Senate, isolating the House Republicans in their opposition. Some of the Senate Republicans who backed the bill are staunch conservatives, like Senators Patrick J. Toomey of Pennsylvania and Tom Coburn of Oklahoma, with deep credibility among House Republicans.","washington — end a climact fiscal showdown in the final hour of the 112th congress, the hous late tuesday pass and sent to legisl to avert big incom tax increas on most american and prevent larg cut in spend for the pentagon and other govern programs. the measure, brought to the hous floor less than 24 hour after it passag in the senate, wa approv 257 to 167, with 85 republican join 172 democrat in vote to allow incom tax to rise for the first time in two decades, in thi case for the highest-earn americans. vote no were 151 republican and 16 democrats. the bill wa expect to be sign quickli by mr. 
obama, who won re-elect on a promis to increas tax on the wealthy. mr. obama strode into the white hous brief room shortli after the vote, less to hail the end of the fiscal crisi than to lay out a marker for the next one. “the one thing that I think, hopefully, the new year will focu on,” he said, “i see if we can put a packag like thi togeth with a littl bit less drama, a littl less brinkmanship, and not scare the heck out of folk quit as much.” In approv the measur after day of legisl intrigue, congress conclud it final and most pitch fight over fiscal policy, the culmin of two year of battl over taxes, the feder debt, spend and what to do to slow the growth in popular social program like . the decis by republican leader to allow the vote came despit widespread scorn among hous republican for the bill, pass overwhelmingli by the senat in the earli hour of new year’ day. they were unhappi that it did not includ signific spend cut in health and other social programs, which they say are essenti to ani long-term solut to the nation’ debt. democrats, while hardli placat by the compromise, celebr mr. obama’ nomin victori in hi final showdown with hous republican in the 112th congress, who began their term embolden by score of new, conserv member whose reach to the right ultim tip them over. “the american peopl are the real winner tonight,” repres bill pascrel jr., democrat of new jersey, said on the hous floor, “not anyon who navig these halls.” not a singl leader among hous republican came to the floor to speak in favor of the bill, though speaker , who rare take part in roll calls, vote in favor. repres eric cantor of virginia, the major leader, and repres kevin mccarthi of california, the no. 3 republican, vote no. repres paul D. ryan, the budget chairman who wa the republican vice-presidenti candidate, support the bill. 
despit the parti divisions, mani republican in their remark character the measure, which allow tax to go up on household incom over $400,000 for individu and $450,000 for coupl but make perman tax cut for incom below that level, as a victori of sorts, even as so mani of them declin to vote for it. “after more than a decad of critic these tax cuts,” said repres dave camp of michigan, “democrat are final join republican in make them permanent. republican and the american peopl are get someth realli important, perman tax relief.” the dynam with the hous wa a near replay of a fight at the end of 2011 over a break extension. In that showdown, senat democrat and republican pass legislation, and while hous republican fulminated, they were eventu forc to swallow it. On tuesday, as they got a detail look at the senate’ fiscal legislation, hous republican rang from midwest pragmatist to -bless conserv voic seriou reserv about the measure, emerg from a lunchtim new year’ day meet with their leaders, eye flash and face grim, insist they would not accept a bill without substanti save from cuts. the unrest reach to the highest level as mr. cantor told member in a closed-door meet in the basement of the capitol that he could not support the legisl in it current form. mr. boehner, who face a re-elect vote on hi post on thursday when the 113th congress convenes, had grave concern as well, but he had pledg to allow the hous to consid ani legisl that clear the senate. and he wa not eager to have such a major piec of legisl pass with mainli opposit votes, and the outcom could be seen as undermin hi authority. ad to the pressur on the house, the fiscal agreement wa reach by senat mitch mcconnel of kentucky, the senat republican leader, and had deep republican support in the senate, isol the hous republican in their opposition. some of the senat republican who back the bill are staunch conservatives, like senat patrick J. 
toomey of pennsylvania and tom coburn of oklahoma, with deep credibl among hous republicans."
1,2013-01-01,A report on nearly three million people found that those whose body mass index ranked them as overweight had less risk of dying than people of normal weight.,Study Suggests Lower Mortality Risk for People Deemed to Be Overweight,http://www.nytimes.com/2013/01/02/health/study-suggests-lower-death-risk-for-the-overweight.html,"A century ago, Elsie Scheel was the perfect woman. So said a 1912 in The New York Times about how Miss Scheel, 24, was chosen by the “medical examiner of the 400 ‘co-eds’ ” at Cornell University as a woman “whose very presence bespeaks perfect health.” Miss Scheel, however, was hardly model-thin. At 5-foot-7 and 171 pounds, she would, by today’s medical standards, be clearly overweight. (Her body mass index was 27; 25 to 29.9 is overweight.) But a suggests that Miss Scheel may have been onto something. The report on nearly three million people found that those whose B.M.I. ranked them as overweight had less risk of dying than people of normal weight. And while obese people had a greater mortality risk over all, those at the lowest level (B.M.I. of 30 to 34.9) were not more likely to die than normal-weight people. The report, although not the first to suggest this relationship between B.M.I. and mortality, is by far the largest and most carefully done, analyzing nearly 100 studies, experts said. But don’t scrap those New Year’s weight-loss resolutions and start gorging on fried Belgian waffles or triple cheeseburgers. Experts not involved in the research said it suggested that overweight people need not panic unless they have other indicators of poor health and that depending on where fat is in the body, it might be protective or even nutritional for older or sicker people. But over all, piling on pounds and becoming more than slightly obese remains dangerous. “We wouldn’t want people to think, ‘Well, I can take a pass and gain more weight,’ ” said Dr. George Blackburn, associate director of Harvard Medical School’s division. 
Rather, he and others said, the report, in The Journal of the American Medical Association, suggests that B.M.I., a ratio of height to weight, should not be the only indicator of healthy weight. “Body mass index is an imperfect measure of the risk of mortality,” and factors like , and blood sugar must be considered, said Dr. Samuel Klein, director of the Center for Human Nutrition at Washington University School of Medicine in St. Louis. Dr. Steven Heymsfield, executive director of the Pennington Biomedical Research Center in Louisiana, who wrote an editorial accompanying the study, said that for overweight people, if indicators like cholesterol “are in the abnormal range, then that weight is affecting you,” but that if indicators are normal, there’s no reason to “go on a crash diet.” Experts also said the data suggested that the definition of “normal” B.M.I., 18.5 to 24.9, should be revised, excluding its lowest weights, which might be too thin. The study did show that the two highest obesity categories (B.M.I. of 35 and up) are at high risk. “Once you have higher obesity, the fat’s in the fire,” Dr. Blackburn said. But experts also suggested that concepts of fat be refined. “Fat per se is not as bad as we thought,” said Dr. Kamyar Kalantar-Zadeh, professor of medicine and public health at the University of California, Irvine. “What is bad is a type of fat that is inside your belly,” he said. “Non-belly fat, underneath your skin in your thigh and your butt area — these are not necessarily bad.” He added that, to a point, extra fat is accompanied by extra muscle, which can be healthy. Still, it is possible that overweight or somewhat obese people are less likely to die because they, or their doctors, have identified other conditions associated with weight gain, like high cholesterol or . “You’re more likely to be in your doctor’s office and more likely to be treated,” said Dr. 
Robert Eckel, a past president of the American Heart Association and a professor at University of Colorado. Some experts said fat could be protective in some cases, although that is unproven and debated. The study did find that people 65 and over had no greater mortality risk even at high obesity. “There’s something about extra body fat when you’re older that is providing some reserve,” Dr. Eckel said. And studies on specific illnesses, like heart and kidney disease, have found an “obesity paradox,” that heavier patients are less likely to die. Still, death is not everything. Even if “being overweight doesn’t increase your risk of dying,” Dr. Klein said, it “does increase your risk of having diabetes” or other conditions. Ultimately, said the study’s lead author, Katherine Flegal, a senior scientist at the Centers for Disease Control and Prevention, “the best weight might depend on the situation you’re in.” , in whose “physical makeup there is not a single defect,” the Times article said. This woman who “has never been ill and doesn’t know what fear is” loved sports and didn’t consume candy, coffee or tea. But she also ate only three meals every two days, and loved beefsteak. Maybe such seeming contradictions made sense against the societal inconsistencies of that time. After all, her post-college plans involved tilling her father’s farm, but “if she were a man, she would study mechanical engineering.”","A centuri ago, elsi scheel wa the perfect woman. So said a 1912 in the new york time about how miss scheel, 24, wa chosen by the “medic examin of the 400 ‘co-eds’ ” at cornel univers as a woman “whose veri presenc bespeak perfect health.” miss scheel, however, wa hardli model-thin. At 5-foot-7 and 171 pounds, she would, by today’ medic standards, be clearli overweight. (her bodi mass index wa 27; 25 to 29.9 is overweight.) but a suggest that miss scheel may have been onto something. the report on nearli three million peopl found that those whose b.m.i. 
rank them as overweight had less risk of die than peopl of normal weight. and while obes peopl had a greater mortal risk over all, those at the lowest level (b.m.i. of 30 to 34.9) were not more like to die than normal-weight people. the report, although not the first to suggest thi relationship between b.m.i. and mortality, is by far the largest and most care done, analyz nearli 100 studies, expert said. but don’t scrap those new year’ weight-loss resolut and start gorg on fri belgian waffl or tripl cheeseburgers. expert not involv in the research said it suggest that overweight peopl need not panic unless they have other indic of poor health and that depend on where fat is in the body, it might be protect or even nutrit for older or sicker people. but over all, pile on pound and becom more than slightli obes remain dangerous. “we wouldn’t want peopl to think, ‘well, I can take a pass and gain more weight,’ ” said dr. georg blackburn, associ director of harvard medic school’ division. rather, he and other said, the report, in the journal of the american medic association, suggest that b.m.i., a ratio of height to weight, should not be the onli indic of healthi weight. “bodi mass index is an imperfect measur of the risk of mortality,” and factor like , and blood sugar must be considered, said dr. samuel klein, director of the center for human nutrit at washington univers school of medicin in st. louis. dr. steven heymsfield, execut director of the pennington biomed research center in louisiana, who wrote an editori accompani the study, said that for overweight people, if indic like cholesterol “are in the abnorm range, then that weight is affect you,” but that if indic are normal, there’ no reason to “go on a crash diet.” expert also said the data suggest that the definit of “normal” b.m.i., 18.5 to 24.9, should be revised, exclud it lowest weights, which might be too thin. the studi did show that the two highest obes categori (b.m.i. of 35 and up) are at high risk. 
“onc you have higher obesity, the fat’ in the fire,” dr. blackburn said. but expert also suggest that concept of fat be refined. “fat per se is not as bad as we thought,” said dr. kamyar kalantar-zadeh, professor of medicin and public health at the univers of california, irvine. “what is bad is a type of fat that is insid your belly,” he said. “non-belli fat, underneath your skin in your thigh and your butt area — these are not necessarili bad.” He ad that, to a point, extra fat is accompani by extra muscle, which can be healthy. still, it is possibl that overweight or somewhat obes peopl are less like to die becaus they, or their doctors, have identifi other condit associ with weight gain, like high cholesterol or . “you’r more like to be in your doctor’ offic and more like to be treated,” said dr. robert eckel, a past presid of the american heart associ and a professor at univers of colorado. some expert said fat could be protect in some cases, although that is unproven and debated. the studi did find that peopl 65 and over had no greater mortal risk even at high obesity. “there’ someth about extra bodi fat when you’r older that is provid some reserve,” dr. eckel said. and studi on specif illnesses, like heart and kidney disease, have found an “obes paradox,” that heavier patient are less like to die. still, death is not everything. even if “be overweight doesn’t increas your risk of dying,” dr. klein said, it “doe increas your risk of have diabetes” or other conditions. ultimately, said the study’ lead author, katherin flegal, a senior scientist at the center for diseas control and prevention, “the best weight might depend on the situat you’r in.” , in whose “physic makeup there is not a singl defect,” the time articl said. thi woman who “ha never been ill and doesn’t know what fear is” love sport and didn’t consum candy, coffe or tea. but she also ate onli three meal everi two days, and love beefsteak. 
mayb such seem contradict made sens against the societ inconsist of that time. after all, her post-colleg plan involv till her father’ farm, but “if she were a man, she would studi mechan engineering.”"


In [101]:
from sklearn.feature_extraction.text import CountVectorizer

cv8 = CountVectorizer(binary=False, min_df = .1, stop_words = "english") #define the transformation
cv8_news = cv8.fit_transform(newsdf['pstem']) #apply the transformation
print(cv8_news.shape)

names = cv8.get_feature_names_out()   # array of feature names (get_feature_names() was removed in scikit-learn 1.2)
count = np.sum(cv8_news.toarray(), axis = 0) # add up feature counts 
count2 = count.tolist()  # convert numpy array to list
count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list
count_df.sort_values(['count'], ascending = False)[0:20]  #arrange by count instead

(3848, 542)


Unnamed: 0,count
wa,22797
said,21374
hi,15170
mr,13575
ha,12566
thi,9283
year,8520
new,8149
like,7618
time,5579


#### Wait "wa"? Where'd that come from?  Any guesses?
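One way to check a guess: the Porter stemmer strips the trailing "s" from short function words, and the stemmed forms no longer match the entries in the English stopword list, so they slip through the filter. The sketch below shows the mismatch and one possible workaround (stemming the stopword list as well). The names `stems` and `stemmed_stops` are made up for illustration.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ps = PorterStemmer()

# Stem a few of the words that surfaced at the top of the stemmed counts
stems = {word: ps.stem(word) for word in ["was", "his", "has", "this"]}
for word, stem in stems.items():
    print(word, "->", stem,
          "| original in stop list:", word in ENGLISH_STOP_WORDS,
          "| stem in stop list:", stem in ENGLISH_STOP_WORDS)

# Possible workaround (an assumption, not the notebook's approach):
# run the stop list through the same stemmer so both sides match
stemmed_stops = sorted({ps.stem(word) for word in ENGLISH_STOP_WORDS})
```

`stemmed_stops` could then be passed as `stop_words=stemmed_stops` when vectorizing the `pstem` column. The other option is to remove stopwords before stemming.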

### POS tagging

In [102]:
# NLTK also has tokenizers 
from nltk import word_tokenize
text = word_tokenize(speechtext[20])
print(type(text))
print(text)


<class 'list'>
['This', 'will', 'be', 'the', 'day', 'when', 'all', 'of', 'God', "'s", 'children', 'will', 'be', 'able', 'to', 'sing', 'with', 'a', 'new', 'meaning', ',', '``', 'My', 'country', ',', "'t", 'is', 'of', 'thee', ',', 'sweet', 'land', 'of', 'liberty', ',', 'of', 'thee', 'I', 'sing', '.', 'Land', 'where', 'my', 'fathers', 'died', ',', 'land', 'of', 'the', 'pilgrim', "'s", 'pride', ',', 'from', 'every', 'mountainside', ',', 'let', 'freedom', 'ring', '.', "''"]


In [103]:
# and can try to figure out parts of speech
import nltk
nltk.pos_tag(text)

[('This', 'DT'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('the', 'DT'),
 ('day', 'NN'),
 ('when', 'WRB'),
 ('all', 'DT'),
 ('of', 'IN'),
 ('God', 'NNP'),
 ("'s", 'POS'),
 ('children', 'NNS'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('able', 'JJ'),
 ('to', 'TO'),
 ('sing', 'VBG'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('meaning', 'NN'),
 (',', ','),
 ('``', '``'),
 ('My', 'PRP$'),
 ('country', 'NN'),
 (',', ','),
 ("'t", "''"),
 ('is', 'VBZ'),
 ('of', 'IN'),
 ('thee', 'NN'),
 (',', ','),
 ('sweet', 'JJ'),
 ('land', 'NN'),
 ('of', 'IN'),
 ('liberty', 'NN'),
 (',', ','),
 ('of', 'IN'),
 ('thee', 'NN'),
 ('I', 'PRP'),
 ('sing', 'VBG'),
 ('.', '.'),
 ('Land', 'NNP'),
 ('where', 'WRB'),
 ('my', 'PRP$'),
 ('fathers', 'NNS'),
 ('died', 'VBN'),
 (',', ','),
 ('land', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('pilgrim', 'NN'),
 ("'s", 'POS'),
 ('pride', 'NN'),
 (',', ','),
 ('from', 'IN'),
 ('every', 'DT'),
 ('mountainside', 'NN'),
 (',', ','),
 ('let', 'VB'),
 ('freedom', 'NN'),
 ('ring', 'VBG'),
 ('

In [104]:
# but what do those  abbreviations mean?
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [105]:
sent = "cats running ran cactus cactuses cacti community communities"
text = word_tokenize(sent)

nltk.pos_tag(text)

[('cats', 'NNS'),
 ('running', 'VBG'),
 ('ran', 'NN'),
 ('cactus', 'NN'),
 ('cactuses', 'VBZ'),
 ('cacti', 'VBP'),
 ('community', 'NN'),
 ('communities', 'NNS')]

This is getting deeper into NLP. We won't cover that in this class, but if you are interested, read about grammars in the NLTK book.  You'll need to know POS tags to construct grammars.
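POS tags can also feed back into the feature space, for example by keeping only tokens with certain tags before vectorizing. The sketch below is one hedged illustration: `keep_by_tag` is a hypothetical helper, and the tagged pairs are copied from the start of the `nltk.pos_tag` output shown earlier so the example runs without the tagger models.

```python
def keep_by_tag(tagged_tokens, prefixes):
    """Keep tokens whose Penn Treebank tag starts with any of the given prefixes."""
    return [token for token, tag in tagged_tokens if tag.startswith(tuple(prefixes))]


# First few (token, tag) pairs from the pos_tag output above
tagged = [('This', 'DT'), ('will', 'MD'), ('be', 'VB'), ('the', 'DT'),
          ('day', 'NN'), ('when', 'WRB'), ('all', 'DT'), ('of', 'IN'),
          ('God', 'NNP'), ("'s", 'POS'), ('children', 'NNS')]

# "NN" matches NN, NNS, NNP and NNPS, i.e. all noun tags
nouns = keep_by_tag(tagged, ["NN"])
print(nouns)
```

The filter is only as good as the tags. In the single-word example above, "ran" was tagged `NN` and "cacti" `VBP`, so tag full sentences rather than isolated words where you can.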

### Main takeaways:
* Removing stopwords generally decreases the size of the feature space, but be careful which words you treat as "unhelpful."
* Stemming and lemmatization try to consolidate features by reducing words to their root, but be careful that you aren't conflating multiple concepts.
* Which stopwords to use and whether to stem are further choices the data scientist has to make (on top of vectorizer options, parameter settings, etc.).
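As an illustration of the stopword choice, `CountVectorizer` also accepts your own list in place of `"english"`. This minimal sketch extends the built-in list with two words that dominated the news counts ("said" and "mr"). Whether those words are actually unhelpful depends on your task, and the two example sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Built-in English list plus two corpus-specific additions (a judgment call, not a rule)
custom_stops = list(ENGLISH_STOP_WORDS.union({"said", "mr"}))

cv_custom = CountVectorizer(stop_words=custom_stops)

# Hypothetical toy documents
docs = ["Mr. Obama said the vote was close",
        "The senator said taxes will rise"]
cv_custom.fit(docs)

vocab = sorted(cv_custom.vocabulary_)   # features that survived the stopword filter
print(vocab)
```

The same `custom_stops` list can be dropped into the exercise's vectorizer, e.g. `CountVectorizer(binary=False, min_df=.1, stop_words=custom_stops)`.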