# Codebasics - Bag of N grams tutorial

Let's first understand how to generate n-grams using CountVectorizer.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

v.fit(["Faisal Vai is one the best employee in NSL."])
v.vocabulary_


{'faisal': 2,
 'vai': 8,
 'is': 4,
 'one': 6,
 'the': 7,
 'best': 0,
 'employee': 1,
 'in': 3,
 'nsl': 5}

In [18]:
# unigrams and bigrams: ngram_range=(1, 2) means n runs from 1 to 2
v = CountVectorizer(ngram_range=(1, 2))
v.fit(["Faisal Vai is one the best employee in NSL."])
v.vocabulary_

{'faisal': 4,
 'vai': 15,
 'is': 8,
 'one': 11,
 'the': 13,
 'best': 0,
 'employee': 2,
 'in': 6,
 'nsl': 10,
 'faisal vai': 5,
 'vai is': 16,
 'is one': 9,
 'one the': 12,
 'the best': 14,
 'best employee': 1,
 'employee in': 3,
 'in nsl': 7}
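
To make the vocabulary above less of a black box, here is a minimal pure-Python sketch of roughly what happens for word n-grams. It is not scikit-learn's actual implementation; the function name `ngram_vocabulary` is hypothetical, and it assumes the default settings (lowercasing, tokens of two or more word characters, indices assigned in alphabetical order).

```python
import re

def ngram_vocabulary(text, ngram_range=(1, 2)):
    """Return a {ngram: index} mapping in the spirit of CountVectorizer.vocabulary_."""
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token_pattern.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    ngrams = set()
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[i:i + n]))
    # Indices follow the sorted (alphabetical) order of the n-grams.
    return {gram: idx for idx, gram in enumerate(sorted(ngrams))}

vocab = ngram_vocabulary("Faisal Vai is one the best employee in NSL.")
print(vocab["faisal vai"], len(vocab))
```

Passing `ngram_range=(2, 2)` would keep only bigrams, and `(1, 3)` would add trigrams on top of unigrams and bigrams.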

In [19]:
v

In [20]:
v.get_stop_words

<bound method _VectorizerMixin.get_stop_words of CountVectorizer(ngram_range=(1, 2))>

In [21]:
v.build_tokenizer

<bound method _VectorizerMixin.build_tokenizer of CountVectorizer(ngram_range=(1, 2))>
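
The two cells above only display the bound methods because they are referenced without parentheses. A small sketch of what calling them returns, assuming default settings; the sample sentence and variable names are only for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer(ngram_range=(1, 2))

# No stop word list was configured, so this is None.
stop_words = v.get_stop_words()

# build_tokenizer() returns a function that only splits text into tokens;
# it does not lowercase or build n-grams.
tokenize = v.build_tokenizer()
tokens = tokenize("NSL is tall")

# build_analyzer() returns the full chain: preprocessing (lowercasing),
# tokenization, and n-gram generation.
analyze = v.build_analyzer()
ngrams = analyze("NSL is tall")

print(stop_words)
print(tokens)
print(ngrams)
```

The analyzer is a handy way to check which n-grams a given setting produces before fitting on a whole corpus.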

In [23]:
v.binary

False

In [24]:
v.encoding

'utf-8'


We will now take a simple collection of text documents, preprocess them (remove stop words, lemmatize, etc.), and then generate a bag of 1-grams and 2-grams from them.

In [11]:
corpus = [
    "NSL ate pizza",
    "NSL is tall",
    "Meo is eating fish"
]

In [12]:
type(corpus)

list

In [13]:
import spacy
# Load the English language model and create an nlp object from it
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # Remove stop words and punctuation, then lemmatize the remaining tokens
    doc = nlp(text)
    filtered_token = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_token.append(token.lemma_)
    return " ".join(filtered_token)


In [16]:
text='NSL is on the growing company in Bangladesh'
preprocess(text)

'NSL grow company Bangladesh'

In [17]:
corpus_processed = [
    preprocess(text) for text in corpus
]
corpus_processed

['NSL eat pizza', 'NSL tall', 'Meo eat fish']

In [25]:
v = CountVectorizer(ngram_range=(1,2))
v.fit(corpus_processed)
v.vocabulary_

{'nsl': 6,
 'eat': 0,
 'pizza': 9,
 'nsl eat': 7,
 'eat pizza': 2,
 'tall': 10,
 'nsl tall': 8,
 'meo': 4,
 'fish': 3,
 'meo eat': 5,
 'eat fish': 1}

In [27]:
# Now generate the bag of n-grams vector for a sample document
v.transform(["Badhon eat pizza"]).toarray()



array([[1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]])

# News Category Classification Problem


Now that we know the basics of the bag of n-grams vectorizer 😎, it is time to work on a real problem: news category classification. We will use bag of n-grams features to train a machine learning model that can assign any news article to one of the following categories:

- BUSINESS
- SPORTS
- CRIME
- SCIENCE
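
As a rough preview of the idea, here is a minimal sketch that chains a bag of n-grams vectorizer with a classifier. Everything specific in it is an assumption made for illustration: the four training sentences are invented toy data (not the real news dataset), and `MultinomialNB` is just one reasonable choice of model, not necessarily the one used later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, invented for illustration only.
train_texts = [
    "stocks rally as company profits rise",
    "team wins the championship final match",
    "police arrest suspect after bank robbery",
    "scientists discover new planet with telescope",
]
train_labels = ["BUSINESS", "SPORTS", "CRIME", "SCIENCE"]

# The pipeline turns raw text into unigram + bigram counts,
# then feeds those counts to the classifier.
clf = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("classifier", MultinomialNB()),  # assumed model choice
])
clf.fit(train_texts, train_labels)

prediction = clf.predict(["police arrest robbery suspect"])[0]
print(prediction)
```

With a real dataset, the same pipeline would be fitted on a training split and evaluated on held-out articles.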