## Count Vector, TFIDF Representations of Text

Working with text generally involves converting it into a format that a model can understand, which usually means numbers. In this notebook, you will take a closer look at two of the most basic and ubiquitously used formats:

 - Count Vector
 - TFIDF

You will also build a machine learning model on a real-world dataset of **BBC News** articles and perform text classification using both of the above formats.

#### Table of Contents
1. About the Dataset
2. Preprocessing Text
3. Working with Count Vector
4. Using TFIDF to improve Count Vector
5. Conclusion
6. Challenge

### 1. About the Dataset

The dataset that you are going to use is a collection of news articles from BBC across 5 major categories, namely:
 
 - Business
 - Entertainment
 - Politics
 - Sport
 - Tech

There are a total of 2225 articles in the dataset, which is a mix of all of the above categories. Let's load the dataset using pandas and have a quick look at some of the articles. 

**Note:** 
 - You can get the dataset [here](https://github.com/kunalj101/random-/raw/master/Text%20Feature%20Engineering/data/bbc_news_mixed.csv)
 - Press Ctrl+S and save the file as "bbc_news_mixed.csv"

In [153]:
import pandas as pd

# Load the dataset
bbc_news = pd.read_csv('bbc_news_mixed.csv')
bbc_news.head()

Unnamed: 0,text,label
0,Cairn shares slump on oil setback\n\nShares in...,business
1,Egypt to sell off state-owned bank\n\nThe Egyp...,business
2,Cairn shares up on new oil find\n\nShares in C...,business
3,Low-cost airlines hit Eurotunnel\n\nChannel Tu...,business
4,"Parmalat to return to stockmarket\n\nParmalat,...",business


In [155]:
# print first 2 articles
for art in bbc_news.text[:2]:
    print(art)

Cairn shares slump on oil setback


The company said tests had shown no significant finds in one of its Indian oil fields, but was upbeat about the potential of other areas. It also said the Indian government had told it to pay a production tax, for which Cairn argues it is not liable. Cairn's shares have jumped by almost 400% this year. Investors had piled into Cairn after the company announced significant oil finds in India this year. Chief executive Bill Gammell said on Friday he was "disappointed" with exploration in the so-called N-C extension area in Rajasthan. Investors had held high hopes of major oil finds in this area. But Cairn said estimates had been revised in what was a "significant downgrade of the initial expectation".

Cairn also said that the government believed the company was liable to pay taxes under its production-sharing contract. The company said the rate would be about 900 rupees ($20.40; £10.50) per tonne, or seven barrels, of oil. A spokesman for the firm sai

Now that you have an idea of what your data looks like, let's see the count of each category in the dataset!

In [156]:
# category-wise count
bbc_news.label.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: label, dtype: int64

### 2. Preprocessing Text

You will have noticed that the labels are in text format. To build a model on this dataset, you first need to map the labels to numbers such as 0, 1, 2, 3. This process is called Label Encoding. You can easily label encode your text data using sklearn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Let's have a look at how to do that!

In [67]:
from sklearn.preprocessing import LabelEncoder

# initialize LabelEncoder
lencod = LabelEncoder()
# fit_transform() converts the text to numbers
bbc_news.label = lencod.fit_transform(bbc_news.label)
# label-wise count
bbc_news.label.value_counts()

3    511
0    510
2    417
4    401
1    386
Name: label, dtype: int64

**Note:** In the output of the above code, the text labels have been replaced by numbers. `LabelEncoder` assigns the integers in alphabetical order of the labels, so the mapping is:
 - 0 is Business
 - 1 is Entertainment
 - 2 is Politics
 - 3 is Sport
 - 4 is Tech
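
Rather than reading the mapping off the counts, you can recover it from the fitted encoder itself. The snippet below is a minimal sketch on a handful of made-up labels (the `labels` list is illustrative and stands in for `bbc_news.label`); in the notebook you would inspect `lencod.classes_` in the same way.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for bbc_news.label (illustrative only)
labels = ["sport", "business", "tech", "politics", "entertainment", "sport"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order;
# the position of a label in classes_ is the integer assigned to it
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts the integers back to the text labels
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())
```

`inverse_transform` is handy later on, when you want to turn a model's numeric predictions back into readable category names.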
 
### 3. Working with Count Vector

Sklearn provides an easy way to create count vectors from a piece of text. You can use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to do that. Let's see how simple it is!

In [171]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vector
cvec = CountVectorizer(stop_words='english')
# create Bag of Words
bow = cvec.fit_transform(bbc_news.text)
# shape of Bag of Words
print('shape of BOW:', bow.shape)
# number of words in the vocabulary
print('No. of words in vocabulary:', len(cvec.vocabulary_))

shape of BOW: (2225, 29126)
No. of words in vocabulary: 29126
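
To make the shape above less abstract, here is a small sketch of the same idea on two made-up sentences (the `docs` list and the `toy_` names are illustrative, not part of the dataset). Stop-word removal is left out here so that every word stays visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny "documents" (illustrative only)
docs = [
    "oil shares slump",
    "oil shares rise on new oil find",
]

toy_cvec = CountVectorizer()
toy_bow = toy_cvec.fit_transform(docs)

# vocabulary_ maps each word to its column index;
# sorting by that index gives the column order of the matrix
vocab = sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get)
print(vocab)

# one row per document, one column per word, each cell is a word count
print(toy_bow.toarray())
```

Each row is a document and each column is a word, so the second row has a 2 in the `oil` column. The BBC matrix has the same layout, just with 2225 rows and 29126 columns.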


Let's have a closer look at the Bag of Words that you have just generated.

In [189]:
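# create a dataframe from the BOW
# (pd.SparseDataFrame and get_feature_names() are no longer available in current pandas / scikit-learn releases)
bow_df = pd.DataFrame.sparse.from_spmatrix(bow, columns=cvec.get_feature_names_out(), index=bbc_news.index)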

# sample some data points
bow_df.iloc[:20, 5000:5050]

Unnamed: 0,callaghan,callam,called,calleri,callers,calling,callow,calls,calm,calmed,...,camp,campaiging,campaign,campaigned,campaigner,campaigners,campaigning,campaigns,campbell,camped
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


If you explore the above dataframe, you will find the Bag of Words representation of the text: each row is an article and each column is a word from the vocabulary. Notice that the word "called" appears exactly once in the first article, hence the 1 in that column of row 0. Now that your BOW is created, let's see how well it classifies the articles in an ML model.

You'll be using the [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) model because it works well with the sparse word-count features of text.

In [181]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# creates a ML model based on parameters
def create_model(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = MultinomialNB()
    model = model.fit(X_train, y_train)
    return model, X_test, y_test

In [182]:
# create BOW based classification model
model_b, X_test_b, y_test_b = create_model(bow, bbc_news.label)

Now that the model is created and trained, have a look at the classification accuracy:

In [183]:
from sklearn.metrics import accuracy_score

# check accuracy 
accuracy_score(y_test_b, model_b.predict(X_test_b))

0.9730337078651685

That's a pretty good accuracy. Now let's see how you can improve it even further!

### 4. Using TFIDF to improve Count Vector

Just like Count Vector, TFIDF can also be very easily implemented in Python using Sklearn. Here's how you will create a TFIDF representation of your text.

In [184]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize TFIDF
vec = TfidfVectorizer(max_features=4000, stop_words='english')
# create TFIDF
tfidf = vec.fit_transform(bbc_news.text)
# shape of TFIDF
tfidf.shape

(2225, 4000)

**Note:** This time the matrix has fewer columns (4000, compared to the 29126 of the Bag of Words). This is because of the parameter `max_features`, which tells Sklearn to keep only the 4000 most frequent terms across the whole dataset when building the TFIDF representation. Have a look at what it looks like!
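
As a quick aside, the sketch below shows how `max_features` picks its terms on a made-up corpus (the `docs` list and `small_vec` name are illustrative). The cap is applied by total term count across all documents, not by TFIDF score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only): "oil" occurs 5 times, "shares" twice,
# "tax" and "government" once each
docs = [
    "oil oil oil shares",
    "oil shares tax",
    "oil government",
]

# keep only the 2 terms with the highest total count across the corpus
small_vec = TfidfVectorizer(max_features=2)
small_vec.fit(docs)

kept = sorted(small_vec.vocabulary_)
print(kept)
```

Only `oil` and `shares` survive; the rarer words are dropped before any TFIDF weighting happens. With that in mind, back to the BBC articles.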

In [190]:
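# create a dataframe from the TFIDF
# (pd.SparseDataFrame and get_feature_names() are no longer available in current pandas / scikit-learn releases)
tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf, columns=vec.get_feature_names_out(), index=bbc_news.index)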

# sample some data points
tfidf_df.iloc[:20, 1000:1050]

Unnamed: 0,death,debate,debt,debts,debut,dec,decade,decades,december,decent,...,dem,demand,demanded,demands,democracy,democrat,democratic,democrats,dems,denied
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.043469,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.077338,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.105987,0.129173,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.099194,0.181343,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.085918,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032933,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


If you explore the above dataframe, you will find the TFIDF representation of the text. Each row represents a document and each column a word; the value in a cell is the TFIDF score of that word in that document. Now that your TFIDF is created, let's see how well it classifies the articles in an ML model.
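
If you are curious where those scores come from, the sketch below recomputes them by hand on a made-up corpus and checks the result against `TfidfVectorizer`. It assumes the vectorizer's default settings: raw counts as term frequency, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The `docs` list and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only)
docs = ["oil shares slump", "oil shares rise", "film awards night"]

# term frequency: raw word counts per document
counts = CountVectorizer().fit_transform(docs).toarray()

# document frequency: in how many documents each word occurs
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# smoothed inverse document frequency (default smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# weight the counts by idf, then scale each row to unit L2 length
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# compare with the vectorizer's own output
auto = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, auto))
```

Words that occur in fewer documents get a larger `idf` and therefore a larger share of the row's weight. Back on the BBC data, the next cell trains the classifier on the TFIDF matrix.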

In [186]:
# create TFIDF based classification model
model_t, X_test_t, y_test_t = create_model(tfidf, bbc_news.label)

In [187]:
from sklearn.metrics import accuracy_score

# check accuracy
accuracy_score(y_test_t, model_t.predict(X_test_t))

0.9775280898876404

### 5. Conclusion

 - Notice that using the TFIDF representation, you were able to build a slightly better model (about 97.8% accuracy versus 97.3%) using just 4000 words, as opposed to the 29,126 words of the BOW.
 - This gives the intuition that the remaining 25,000+ words weren't adding much useful information to the model. On top of that, TFIDF down-weights words that are common across many documents.
 - You can learn more about word vectors, TFIDF and similar text embeddings in [this comprehensive article](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/).
 - Finally, note that you might get even better accuracy by preprocessing the text, for example with normalization, spelling correction and more.
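
The comparison above changes two things at once: the weighting scheme and the vocabulary size. One way to separate the two effects is to give both vectorizers the same `max_features`. The sketch below is one possible setup, not the method used in this notebook: `compare_vectorizers` is a hypothetical helper, the toy corpus is made up, and unlike the cells above it fits each vectorizer on the training split only (via a pipeline).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, max_features=4000):
    """Hypothetical helper: score count and TFIDF features with the same vocabulary cap."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42
    )
    candidates = [
        ("count", CountVectorizer(max_features=max_features, stop_words="english")),
        ("tfidf", TfidfVectorizer(max_features=max_features, stop_words="english")),
    ]
    scores = {}
    for name, vectorizer in candidates:
        # the pipeline fits the vectorizer on the training texts only
        pipe = make_pipeline(vectorizer, MultinomialNB())
        pipe.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipe.predict(X_test))
    return scores


# Toy corpus (illustrative only): two trivially separable classes
texts = ["oil shares tax profit"] * 10 + ["match goal team win"] * 10
labels = [0] * 10 + [1] * 10
scores = compare_vectorizers(texts, labels)
print(scores)
```

On the real data you would call it as `compare_vectorizers(bbc_news.text, bbc_news.label)` and compare the two scores directly.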

### 6. Challenge

If you look at the TFIDF dataframe, words like `demand`, `demands` and `demanded` are counted separately. This is because the dataset isn't normalized yet. I encourage you to go ahead and try to do that using concepts learnt in the previous classes.

In [None]:
# Your code here