# CountVectorizer and Its Stop Word Arguments

#### Import libraries

In [1]:
import pandas as pd

#### Create a list of documents

In [2]:
doc = ["One Cent, Two Cents, Old Cent, New Cent: All About Money"]

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
cv = CountVectorizer()

In [5]:
cvw = cv.fit_transform(doc)

#### Map the text to a sparse count matrix

In [6]:
# .toarray() returns the dense counts; the .A shorthand is gone from recent SciPy releases
cvw.toarray()

array([[1, 1, 3, 1, 1, 1, 1, 1, 1]], dtype=int64)

In [10]:
# In scikit-learn 1.2 and later this method is replaced by cv.get_feature_names_out()
cv.get_feature_names()

['about', 'all', 'cent', 'cents', 'money', 'new', 'old', 'one', 'two']

In [14]:
dfcv = pd.DataFrame(cvw.toarray(), columns=cv.get_feature_names())

In [15]:
dfcv

| | about | all | cent | cents | money | new | old | one | two |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |
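
The counts above can be reproduced by hand. The following is a minimal pure-Python sketch, not scikit-learn's implementation: the helper name `count_tokens` is made up for illustration, and the regular expression is assumed to mirror `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumption: this pattern mirrors CountVectorizer's default token_pattern,
# which keeps tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(text):
    """Lowercase the text, split it into tokens, and count each token."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


doc = "One Cent, Two Cents, Old Cent, New Cent: All About Money"
counts = count_tokens(doc)

# Columns are ordered alphabetically, like the feature names above.
features = sorted(counts)
row = [counts[f] for f in features]

print(features)
print(row)
```

The sorted token list plays the role of the feature names, and `row` is the single row of the count matrix for this one document.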


## Stop Words (From List)

In [16]:
cat_in_the_hat_docs = [
    "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library)",
    "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
    "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
    "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
    "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)",
]

In [53]:
df = pd.DataFrame(cat_in_the_hat_docs, columns = ['Article'])

In [54]:
df

| | Article |
|---|---|
| 0 | One Cent, Two Cents, Old Cent, New Cent: All A... |
| 1 | Inside Your Outside: All About the Human Body ... |
| 2 | Oh, The Things You Can Do That Are Good for Yo... |
| 3 | On Beyond Bugs: All About Insects (Cat in the ... |
| 4 | There's No Place Like Space: All About Our Sol... |


#### Before adding stop words

In [39]:
df['Article'][1]

"Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)"

#### Pass a stop word list through the `stop_words` argument

In [37]:
cv2 = CountVectorizer(stop_words=["all","in","the","is","and"])

In [70]:
cv2w = cv2.fit_transform(df['Article'])

In [45]:
cv2w.toarray()

array([[1, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1,
        0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 2, 0],
       [1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]],
      dtype=int64)

#### View the counts as a DataFrame

Check in the count matrix whether the stop words have been removed from the vocabulary built from the document collection (the DataFrame).

In [61]:
cv2.stop_words

['all', 'in', 'the', 'is', 'and']

In [56]:
df2 = pd.DataFrame(cv2w.toarray(), columns=cv2.get_feature_names())
df2

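| | about | are | beyond | body | bugs | can | cat | cent | cents | do | ... | solar | space | staying | system | that | there | things | two | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 0 |
| 3 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |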


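Every stop word in the list is gone from the columns (`is` and `and` never occurred in the titles to begin with). Looking up one of the removed words, such as `in`, raises a `KeyError`: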

In [59]:
df2['in']

KeyError: 'in'

The words that remain after applying the stop words:

In [50]:
cv2.vocabulary_

{'one': 26,
 'cent': 7,
 'two': 37,
 'cents': 8,
 'old': 24,
 'new': 21,
 'about': 0,
 'money': 20,
 'cat': 6,
 'hat': 12,
 'learning': 17,
 'library': 18,
 'inside': 16,
 'your': 39,
 'outside': 28,
 'human': 14,
 'body': 3,
 'oh': 23,
 'things': 36,
 'you': 38,
 'can': 5,
 'do': 9,
 'that': 34,
 'are': 1,
 'good': 11,
 'for': 10,
 'staying': 32,
 'healthy': 13,
 'on': 25,
 'beyond': 2,
 'bugs': 4,
 'insects': 15,
 'there': 35,
 'no': 22,
 'place': 29,
 'like': 19,
 'space': 31,
 'our': 27,
 'solar': 30,
 'system': 33}
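
The effect of a stop word list can also be sketched without scikit-learn. The snippet below is only an approximation of what `CountVectorizer` does: the helper name `build_vocabulary` is hypothetical, the token pattern is assumed to match the library default, and only two of the five titles are used to keep it short.

```python
import re

# Assumption: same pattern as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(docs, stop_words=()):
    """Return the sorted unique tokens in docs, excluding the stop words."""
    stop = set(stop_words)
    tokens = set()
    for text in docs:
        for tok in TOKEN_PATTERN.findall(text.lower()):
            if tok not in stop:
                tokens.add(tok)
    return sorted(tokens)


docs = [
    "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
    "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
]
stop_words = ["all", "in", "the", "is", "and"]

vocab = build_vocabulary(docs, stop_words)

# Stop words that actually occurred in the documents and were filtered out.
removed = sorted(set(build_vocabulary(docs)) - set(vocab))

print(vocab)
print(removed)
```

Stop words are matched against whole tokens, so `inside` and `insects` stay in the vocabulary even though `in` is removed.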

## Stop Words (From min_df argument)

The goal of `min_df` is to ignore words that occur too rarely to be considered meaningful. For example, your text may contain names of people that appear in only one or two documents. In some applications, such words qualify as noise and can be eliminated from further analysis.

Instead of using a minimum term frequency (total occurrences of a word) to eliminate words, `min_df` looks at how many documents contain a term, better known as its document frequency. The `min_df` value can be an absolute count (e.g. 1, 2, 3, 4) or a proportion of documents (e.g. 0.25, meaning ignore words that appear in fewer than 25% of the documents).
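
To make the difference between term frequency and document frequency concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the helper names (`tokenize`, `document_frequency`, `apply_min_df`) are made up for illustration, the regular expression is assumed to match `CountVectorizer`'s default token pattern, and the three titles are shortened versions of the ones used in this notebook.

```python
import re
from collections import Counter

# Assumption: same pattern as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        df.update(set(tokenize(text)))  # each term counts once per document
    return df


def apply_min_df(docs, min_df):
    """Keep terms whose document frequency is at least min_df.

    An int is treated as an absolute count, a float as a proportion.
    """
    df = document_frequency(docs)
    threshold = min_df if isinstance(min_df, int) else min_df * len(docs)
    return sorted(term for term, n in df.items() if n >= threshold)


docs = [
    "One Cent, Two Cents, Old Cent, New Cent: All About Money",
    "Inside Your Outside: All About the Human Body",
    "On Beyond Bugs: All About Insects",
]

term_freq = Counter(tok for text in docs for tok in tokenize(text))
doc_freq = document_frequency(docs)

print(term_freq["cent"], doc_freq["cent"])  # occurrences vs. documents
print(apply_min_df(docs, 2))                # absolute threshold
print(apply_min_df(docs, 0.5))              # proportional threshold
```

Here `cent` occurs three times but in only one document, so a document-frequency threshold of 2 removes it even though its term frequency is high.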

In [73]:
cv3 = CountVectorizer(min_df=2)

Ignore terms that appear in fewer than 2 documents.

In [74]:
cv3w = cv3.fit_transform(df['Article'])

In [76]:
cv3.stop_words_

{'are',
 'beyond',
 'body',
 'bugs',
 'can',
 'cent',
 'cents',
 'do',
 'for',
 'good',
 'healthy',
 'human',
 'insects',
 'inside',
 'like',
 'money',
 'new',
 'no',
 'oh',
 'old',
 'on',
 'one',
 'our',
 'outside',
 'place',
 'solar',
 'space',
 'staying',
 'system',
 'that',
 'there',
 'things',
 'two',
 'you',
 'your'}

In [78]:
cv3.vocabulary_

{'all': 1,
 'about': 0,
 'cat': 2,
 'in': 4,
 'the': 7,
 'hat': 3,
 'learning': 5,
 'library': 6}

## Stop Words (From max_df argument)

Just as we ignored words that were too rare with `min_df`, we can ignore words that are too common with `max_df`. `max_df` looks at how many documents contain a term; if that number exceeds the `max_df` threshold, the term is eliminated from consideration. The `max_df` value can be an absolute count (e.g. 1, 2, 3, 4) or a proportion of documents (e.g. 0.85, meaning ignore words that appear in more than 85% of the documents because they are too common).
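
The threshold logic can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the library's code: `max_df_threshold` and `split_by_max_df` are hypothetical helpers, and `doc_freq` is a small hand-written table of document frequencies for a five-document corpus.

```python
def max_df_threshold(max_df, n_docs):
    """Convert max_df to a document count: ints are absolute, floats are proportions."""
    return max_df if isinstance(max_df, int) else max_df * n_docs


def split_by_max_df(doc_freq, max_df, n_docs):
    """Split terms into (kept, dropped) according to the max_df threshold."""
    limit = max_df_threshold(max_df, n_docs)
    kept = sorted(term for term, n in doc_freq.items() if n <= limit)
    dropped = sorted(term for term, n in doc_freq.items() if n > limit)
    return kept, dropped


# Illustrative document frequencies for a corpus of five documents.
doc_freq = {"about": 5, "cat": 5, "the": 5, "cent": 1, "you": 1}
n_docs = 5

# 0.90 * 5 = 4.5, so only terms present in all five documents are dropped.
kept, dropped = split_by_max_df(doc_freq, 0.90, n_docs)

# The float 1.0 means "100% of documents": nothing can exceed it.
kept_float, dropped_float = split_by_max_df(doc_freq, 1.0, n_docs)

# The int 1 means "at most one document": a very different setting.
kept_int, dropped_int = split_by_max_df(doc_freq, 1, n_docs)

print(kept, dropped)
print(kept_float, dropped_float)
print(kept_int, dropped_int)
```

The last two calls show why the type of the argument matters: `1.0` is read as a proportion and keeps every term, while `1` is read as a count and drops every term found in more than one document.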

In [92]:
# ignore terms that appear in more than 90% of the documents
cv4 = CountVectorizer(max_df=0.90)
cv4w = cv4.fit_transform(df['Article'])

In [93]:
cv4.stop_words_

{'about', 'all', 'cat', 'hat', 'in', 'learning', 'library', 'the'}

In [94]:
cv4.vocabulary_

{'one': 21,
 'cent': 5,
 'two': 32,
 'cents': 6,
 'old': 19,
 'new': 16,
 'money': 15,
 'inside': 13,
 'your': 34,
 'outside': 23,
 'human': 11,
 'body': 2,
 'oh': 18,
 'things': 31,
 'you': 33,
 'can': 4,
 'do': 7,
 'that': 29,
 'are': 0,
 'good': 9,
 'for': 8,
 'staying': 27,
 'healthy': 10,
 'on': 20,
 'beyond': 1,
 'bugs': 3,
 'insects': 12,
 'there': 30,
 'no': 17,
 'place': 24,
 'like': 14,
 'space': 26,
 'our': 22,
 'solar': 25,
 'system': 28}