# TEXT PREPROCESSING - BAG OF WORDS

The **Bag of Words** model is a way of representing text: a feature extraction step that turns documents into a form that is easy to process. It represents text by the number of occurrences of each word within a corpus, using two ingredients: a vocabulary of known words and a measure of the presence of those words.

The **measure of words** can be binary (whether the word occurs or not) or a count (how many times it occurs).

* The order and structure of the words is discarded; only their presence or frequency is kept.
* The intuition is that documents are similar if they have similar content.
* Further, from the content alone we can learn something about the meaning of the document.
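
The ideas above can be sketched in plain Python, without any library. This is only an illustrative sketch: the toy documents and the helper name `bag_of_words` are assumptions made for the example and are not part of the notebook's data.

```python
from collections import Counter

def bag_of_words(documents, binary=False):
    """Return (vocabulary, vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    # Vocabulary: every distinct word in the corpus, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if binary:
            # Binary measure: 1 if the word occurs at all, else 0.
            vectors.append([1 if word in counts else 0 for word in vocabulary])
        else:
            # Count measure: number of occurrences of each vocabulary word.
            vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical toy corpus: the first two documents differ only in word order.
docs = ["george likes elaine", "elaine likes george", "george george kramer"]
vocab, count_vectors = bag_of_words(docs)
_, binary_vectors = bag_of_words(docs, binary=True)

print(vocab)
print(count_vectors)
print(binary_vectors)
```

The first two documents get identical vectors even though their word order differs, which is exactly the information a bag of words throws away.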

In [20]:
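# NLTK is used to preprocess the text (its tokenizer, stopwords and wordnet data must be downloaded once with nltk.download)
# words are reduced to a base form with WordNetLemmatizer (lemmatization, not stemming);
# a BOW model does not need much semantic detail in the first place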
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

paragraph = """Jerry's main friends are George Costanza, Cosmo Kramer and his ex-girlfriend Elaine Benes.\
Jerry (though not without exceptions) typically represents the voice of reason amidst George, Elaine, and \
Kramer's antics, and can be seen as the focal point of the foursome's relationship. Jerry is somewhat of \
an eternal optimist, as he rarely runs into major personal problems. Jerry is the only main character on \
the show to maintain the same career throughout the series. Considering his job as a comedian, he is the most \
observational character, usually sarcastically commenting on his friends' quirky habits, almost essentially \
the New York Jew-type character. He seems to have a new girlfriend every week, but the relationships usually \
end for fairly superficial reasons. He is also an almost obsessive compulsive neat freak; he once threw out\
belt because it had touched a urinal, and once commented on finding out his toilet brush had been placed in \
the toilet that, I can replace that."""

In [21]:
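# split the paragraph into sentences
# note: a line continuation with no space before the backslash fuses words (e.g. 'Benes.Jerry'), so the first two sentences are not split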
sentences = nltk.sent_tokenize(paragraph)
print(sentences)

["Jerry's main friends are George Costanza, Cosmo Kramer and his ex-girlfriend Elaine Benes.Jerry (though not without exceptions) typically represents the voice of reason amidst George, Elaine, and Kramer's antics, and can be seen as the focal point of the foursome's relationship.", 'Jerry is somewhat of an eternal optimist, as he rarely runs into major personal problems.', 'Jerry is the only main character on the show to maintain the same career throughout the series.', "Considering his job as a comedian, he is the most observational character, usually sarcastically commenting on his friends' quirky habits, almost essentially the New York Jew-type character.", 'He seems to have a new girlfriend every week, but the relationships usually end for fairly superficial reasons.', 'He is also an almost obsessive compulsive neat freak; he once threw outbelt because it had touched a urinal, and once commented on finding out his toilet brush had been placed in the toilet that, I can replace that

In [22]:
# initialise the lemmatizer, then remove stop words and lemmatize each remaining word
lemmatizer = WordNetLemmatizer()
# build the stop-word set once instead of on every loop iteration
stop_words = set(stopwords.words('english'))
lemmatized_sentences = []

for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print("Words before lemmatization : ", words)

    # keep non-stop words (the check is case-sensitive) and reduce them to their lemma
    lemmas = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    lemmatized_sentence = ' '.join(lemmas)
    lemmatized_sentences.append(lemmatized_sentence)

    print("Words after lemmatization : ", lemmas)

print("Sentences after lemmatization : ", lemmatized_sentences)

Words before lemmatization :  ['Jerry', "'s", 'main', 'friends', 'are', 'George', 'Costanza', ',', 'Cosmo', 'Kramer', 'and', 'his', 'ex-girlfriend', 'Elaine', 'Benes.Jerry', '(', 'though', 'not', 'without', 'exceptions', ')', 'typically', 'represents', 'the', 'voice', 'of', 'reason', 'amidst', 'George', ',', 'Elaine', ',', 'and', 'Kramer', "'s", 'antics', ',', 'and', 'can', 'be', 'seen', 'as', 'the', 'focal', 'point', 'of', 'the', 'foursome', "'s", 'relationship', '.']
Words after lemmatization :  ['Jerry', "'s", 'main', 'friend', 'George', 'Costanza', ',', 'Cosmo', 'Kramer', 'ex-girlfriend', 'Elaine', 'Benes.Jerry', '(', 'though', 'without', 'exception', ')', 'typically', 'represents', 'voice', 'reason', 'amidst', 'George', ',', 'Elaine', ',', 'Kramer', "'s", 'antic', ',', 'seen', 'focal', 'point', 'foursome', "'s", 'relationship', '.']
Words before lemmatization :  ['Jerry', 'is', 'somewhat', 'of', 'an', 'eternal', 'optimist', ',', 'as', 'he', 'rarely', 'runs', 'into', 'major', 'perso
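
The token lists above still contain punctuation such as ',' and '.', and the membership test in the loop is case-sensitive, so a capitalized stop word like 'He' is not removed by a lowercase stop-word list. A hedged sketch of a stricter filter follows. `STOP_WORDS` is a small hypothetical stand-in for `stopwords.words('english')` and `filter_tokens` is an illustrative helper; neither comes from the notebook.

```python
# Hypothetical stand-in for the NLTK English stop-word list (lowercase entries).
STOP_WORDS = {"he", "is", "the", "a", "an", "and", "of", "his", "as"}

def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words case-insensitively and drop tokens with no letters or digits."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # 'He' and 'he' are both treated as stop words
        if not any(ch.isalnum() for ch in token):
            continue  # pure punctuation such as ',' or '.'
        kept.append(token)
    return kept

tokens = ["He", "is", "the", "most", "observational", "character", ",", "usually", "."]
print(filter_tokens(tokens))
```

With its default settings, `CountVectorizer` already lowercases text and ignores punctuation and single-character tokens, so in this notebook the main practical difference would be the capitalized stop words.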

In [23]:
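from sklearn.feature_extraction.text import CountVectorizer

# learn the vocabulary and build the document-term count matrix (one row per sentence)
count_vectorizer = CountVectorizer()
count_vector = count_vectorizer.fit_transform(lemmatized_sentences)

# print the counts for the first sentence only (row 0)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
for count, word in zip(count_vector.toarray().tolist()[0], count_vectorizer.get_feature_names_out()):
    print("word : {} count : {} ".format(word, count))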


word : almost count : 0 
word : also count : 0 
word : amidst count : 1 
word : antic count : 1 
word : benes count : 1 
word : brush count : 0 
word : career count : 0 
word : character count : 0 
word : comedian count : 0 
word : commented count : 0 
word : commenting count : 0 
word : compulsive count : 0 
word : considering count : 0 
word : cosmo count : 1 
word : costanza count : 1 
word : elaine count : 2 
word : end count : 0 
word : essentially count : 0 
word : eternal count : 0 
word : every count : 0 
word : ex count : 1 
word : exception count : 1 
word : fairly count : 0 
word : finding count : 0 
word : focal count : 1 
word : foursome count : 1 
word : freak count : 0 
word : friend count : 1 
word : george count : 2 
word : girlfriend count : 1 
word : habit count : 0 
word : he count : 0 
word : jerry count : 2 
word : jew count : 0 
word : job count : 0 
word : kramer count : 2 
word : main count : 1 
word : maintain count : 0 
word : major count : 0 
word : neat cou

**Note** :

* The **measure of words** can be made smarter by giving more weight to words that are more important (for example rarer words, which are often nouns).
* TF-IDF, discussed later, is one such weighting scheme.
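
As a rough preview of that idea, here is a minimal sketch of one common textbook variant of the weighting, term frequency multiplied by `log(N / df)`. The toy documents and the helper name `tf_idf` are assumptions for illustration. Libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: "likes" occurs in every document, "kramer" in only one.
docs = ["jerry likes george", "george likes elaine", "kramer likes jerry"]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A word that appears in every document gets weight zero under this variant, while a word that appears in a single document gets the largest weight.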