
**Natural Language Processing (NLP) is a powerful tool used in finance to analyze and understand text data. Count Vectorizer is a technique in NLP that helps convert text data into a numerical format for machine learning algorithms.**

# What is a Count Vectorizer?

**Count Vectorizer converts text into a numerical format that machine learning algorithms can use. It is a bag-of-words representation: it builds a vocabulary from the corpus and represents each document by how many times each vocabulary word occurs in it.**
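
**To make the counting idea concrete, here is a minimal sketch in plain Python, without scikit-learn. The helper name `bag_of_words` and the two toy sentences are illustrative assumptions, and the whitespace tokenization is a simplification of what a real vectorizer does.**

```python
from collections import Counter

def bag_of_words(docs):
    """Hypothetical helper: return (vocabulary, count rows) for a list of strings."""
    # Lowercase and split on whitespace; a simplification of real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all words seen in the corpus.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # Each document becomes one row of counts, one column per vocabulary word.
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(["the app is slow", "the app is the best app"])
print(vocabulary)
print(rows)
```

**Each row has one column per vocabulary word, so every document ends up with the same length regardless of how many words it contains. Word order is lost, which is why this is called a "bag" of words.**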

**First, install the necessary packages if you haven't already done so:**

In [1]:
!pip install pandas numpy nltk scikit-learn



**Import the required libraries:**

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize

**Load the sample finance-related text data in a pandas DataFrame:**

In [3]:
data = {
    'text': [
        'The app is very useful for tracking my investments.',
        'The user interface is intuitive and easy to use.',
        'I had a great experience with the app.',
        'The app is a bit slow at times.'
    ]
}

df = pd.DataFrame(data)

**Tokenize the text:**

In [4]:
import nltk
nltk.download('punkt')  # newer NLTK releases may require 'punkt_tab' instead
df['tokens'] = df['text'].apply(word_tokenize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


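**Define the CountVectorizer and fit it on the text. CountVectorizer expects raw strings and applies its own tokenization, so the token lists are joined back into strings first:**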

In [5]:
vectorizer = CountVectorizer()
# Convert list of tokens back to strings
df['text'] = df['tokens'].apply(lambda x: ' '.join(x))
vectorizer.fit(df['text'])

**Transform the tokenized text into a numerical representation using the CountVectorizer:**

In [6]:
count_matrix = vectorizer.transform(df['text'])  # returns a SciPy sparse matrix of counts

**Convert the sparse matrix to a dense NumPy array:**

In [7]:
count_array = count_matrix.toarray()
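
**The fit and transform steps can also be combined. As a sketch, assuming scikit-learn is installed, `fit_transform` learns the vocabulary and returns the sparse count matrix in one call. The variable names are illustrative, and the two sample sentences are taken from the data above.**

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'I had a great experience with the app.',
    'The app is a bit slow at times.',
]

vec = CountVectorizer()
# fit_transform learns the vocabulary and builds the count matrix in one step.
sparse_counts = vec.fit_transform(texts)

# The result is a SciPy sparse matrix: only non-zero counts are stored.
print(type(sparse_counts).__name__, sparse_counts.shape)

# toarray() materializes the dense NumPy array; fine for small corpora,
# but it can use a lot of memory when the vocabulary is large.
dense_counts = sparse_counts.toarray()
print(dense_counts)
```

**For a real corpus it is usually better to keep the sparse matrix and pass it directly to a model, since most scikit-learn estimators accept sparse input.**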

**Print the CountVectorizer vocabulary and the resulting count array:**

In [8]:
print('Vocabulary:', vectorizer.get_feature_names_out()) # Use get_feature_names_out() for scikit-learn versions 1.0 and above
print('Count array:')
print(count_array)

Vocabulary: ['and' 'app' 'at' 'bit' 'easy' 'experience' 'for' 'great' 'had'
 'interface' 'intuitive' 'investments' 'is' 'my' 'slow' 'the' 'times' 'to'
 'tracking' 'use' 'useful' 'user' 'very' 'with']
Count array:
[[0 1 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0 1 0]
 [1 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 1 0 1 0 0]
 [0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1]
 [0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0]]
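
**Note that the vocabulary has no entries for 'a' or 'I', even though both occur in the reviews. By default, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters (its default `token_pattern` is `(?u)\b\w\w+\b`), so single-character words and punctuation are dropped. The sketch below reproduces that default with Python's `re` module; it is an approximation for illustration, not scikit-learn's actual code path.**

```python
import re

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

sentence = 'I had a great experience with the app.'
# Lowercase first, as CountVectorizer does with lowercase=True.
tokens = TOKEN_PATTERN.findall(sentence.lower())
print(tokens)
```

**To keep single-character tokens, a custom pattern such as `token_pattern=r"(?u)\b\w+\b"` can be passed to `CountVectorizer`.**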

