# What is TF-IDF (Term Frequency-Inverse Document Frequency)? 

**TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document within a collection (corpus) of documents. A term scores highly when it appears often in one document but in few of the others. In the finance domain, it can be used to compare the relative importance of financial terms across financial reports or other documents.**
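
As a rough sketch of the idea, one common textbook formulation is shown here. It is an assumption for illustration, since the description above does not commit to a specific formula:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
$$

Here $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents that contain $t$. Libraries differ in the details (smoothing, normalization), so exact scores depend on the implementation.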

# A simple Python snippet using scikit-learn to perform TF-IDF analysis on a collection of documents

**First, you'll need to install the required libraries if they are not already installed:**

In [1]:
# The PyPI package is named scikit-learn; the old 'sklearn' name is deprecated and fails to install
!pip install numpy pandas scikit-learn

**Now, let's assume you have a list of documents in a variable named `documents`. In this example, we'll work with a simple list of three documents:**

In [2]:
documents = [
    "Document 1: This is the first document.",
    "Document 2: This is the second document.",
    "Document 3: This is the third document."
]

**Next, compute the TF-IDF scores with scikit-learn's `TfidfVectorizer`, which handles preprocessing (lowercasing) and tokenization for you.**

*The code below prints the TF-IDF score of every term in every document.
In this example, the three documents share most of their terms (`document`, `is`, `the`, `this`), so those terms receive identical scores in all three, while `first`, `second`, and `third` each score non-zero only in the document that contains them. The code and its output look like this:*

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer with its default settings
tfidf = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then transform the documents
# into a sparse document-term matrix of TF-IDF scores
tfidf_matrix = tfidf.fit_transform(documents)

# Terms (vocabulary), in the column order of the matrix
terms = tfidf.get_feature_names_out()

# Convert the sparse matrix to a dense NumPy array for easy indexing
tfidf_array = tfidf_matrix.toarray()

# Print the terms and their corresponding TF-IDF scores for each document
for i, doc in enumerate(documents):
    print(f"\nTF-IDF scores for {doc}")
    for j, term in enumerate(terms):
        print(f"{term}: {tfidf_array[i, j]:.4f}")


TF-IDF scores for Document 1: This is the first document.
document: 0.6367
first: 0.5390
is: 0.3184
second: 0.0000
the: 0.3184
third: 0.0000
this: 0.3184

TF-IDF scores for Document 2: This is the second document.
document: 0.6367
first: 0.0000
is: 0.3184
second: 0.5390
the: 0.3184
third: 0.0000
this: 0.3184

TF-IDF scores for Document 3: This is the third document.
document: 0.6367
first: 0.0000
is: 0.3184
second: 0.0000
the: 0.3184
third: 0.5390
this: 0.3184
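
To see where these numbers come from, here is a minimal sketch that recomputes the scores by hand using only the standard library. It assumes scikit-learn's documented defaults for `TfidfVectorizer`: lowercasing, tokens of two or more word characters (so the digits `1`, `2`, `3` are dropped), smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper names `tokenize` and `tfidf_by_hand` are hypothetical and not part of any library.

```python
import math
import re
from collections import Counter

documents = [
    "Document 1: This is the first document.",
    "Document 2: This is the second document.",
    "Document 3: This is the third document.",
]

def tokenize(text):
    # Assumed to mirror TfidfVectorizer's defaults:
    # lowercase, then keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_by_hand(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times IDF
        raw = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize so each document vector has unit length
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

scores = tfidf_by_hand(documents)
for term, value in sorted(scores[0].items()):
    print(f"{term}: {value:.4f}")
```

For the first document this prints the same non-zero values as above: `document` scores highest (0.6367) because the `Document 1:` label makes it occur twice, while `first` (0.5390) outranks `is`, `the`, and `this` (0.3184) because it appears in only one of the three documents.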

