

### TF-IDF (Term Frequency – Inverse Document Frequency)

The **TF-IDF** score of a term `t` in a document `d` drawn from a corpus `D` is the product of two factors:

```
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
```

#### Term Frequency (TF)

```
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
```

#### Inverse Document Frequency (IDF)

```
IDF(t, D) = log(N / (1 + df(t)))
```

Where:
- `N` = Total number of documents in the corpus
- `df(t)` = Number of documents containing the term `t`
- `log` = Natural logarithm (base *e*) or base-10, depending on implementation
- The `1 +` in the denominator is a smoothing term that avoids division by zero for terms that appear in no document. Libraries differ in the exact variant, so scores from scikit-learn's `TfidfVectorizer` (used below) will not match this formula digit for digit.
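To make the formulas above concrete, here is a minimal sketch that evaluates them by hand with the natural logarithm. The three-document toy corpus and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for illustration; they are not part of the notebook.

```python
import math

# Hypothetical toy corpus: three documents, already tokenized.
docs = [
    ["modi", "chief", "minister", "gujarat"],
    ["rss", "assigned", "bjp", "party"],
    ["modi", "born", "vadnagar", "gujarat"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / (1 + df(t))), using the natural logarithm."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + df))

def tf_idf(term, doc, docs):
    """Product of the two factors defined above."""
    return tf(term, doc) * idf(term, docs)

print(tf("chief", docs[0]))             # 1 of 4 tokens
print(idf("chief", docs))               # log(3 / 2): the term is in one document
print(tf_idf("chief", docs[0], docs))
print(idf("modi", docs))                # log(3 / 3): the term is in two documents
```

With this variant, a term that occurs in all but one document gets an IDF of zero, and a term that occurs in every document gets a negative IDF. That is one reason libraries often use a differently smoothed formula.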

---



In [1]:
paragraph = """Narendra Damodardas Modi[a] (born 17 September 1950) is an Indian politician who has served as the prime minister of India since 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindutva paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.

Modi was born and raised in Vadnagar, Bombay State (present-day Gujarat), where he completed his secondary education. He was introduced to the RSS at the age of eight. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming general secretary in 1998.[b] In 2001, Modi was appointed chief minister of Gujarat and elected to the legislative assembly soon after. His administration is considered complicit in the 2002 Gujarat riots,[c] and has been criticised for its management of the crisis. According to official records, a little over 1,000 people were killed, three-quarters of whom were Muslim; independent sources estimated 2,000 deaths, mostly Muslim.[4] A Special Investigation Team appointed by the Supreme Court of India in 2012 found no evidence to initiate prosecution proceedings against him.[d] While his policies as chief minister were credited for encouraging economic growth, his administration was criticised for failing to significantly improve health, poverty and education indices in the state.[e]"""

In [4]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

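# Split the paragraph into sentences; this needs NLTK's Punkt tokenizer data,
# fetched once beforehand with nltk.download.
sentences = nltk.sent_tokenize(paragraph)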
print(sentences)

['Narendra Damodardas Modi[a] (born 17 September 1950) is an Indian politician who has served as the prime minister of India since 2014.', 'Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi.', 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindutva paramilitary volunteer organisation.', 'He is the longest-serving prime minister outside the Indian National Congress.', 'Modi was born and raised in Vadnagar, Bombay State (present-day Gujarat), where he completed his secondary education.', 'He was introduced to the RSS at the age of eight.', 'Modi became a full-time worker for the RSS in Gujarat in 1971.', 'The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming general secretary in 1998.', '[b] In 2001, Modi was appointed chief minister of Gujarat and elected to the legislative assembly soon after.', 'His administration is considered co

In [5]:
# import re
# corpus = []
# for i in range(len(sentences)):
#     review = re.sub('[^a-zA-Z]', ' ', sentences[i])
#     review = review.lower()
#     corpus.append(review)

# corpus

In [6]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [10]:
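import re

# Build the stop-word set once instead of reloading the list for every word.
stop_words = set(stopwords.words('english'))

corpus = []
for sentence in sentences:
    # Keep letters only, lowercase, and split on whitespace.
    words = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
    # Drop stop words and lemmatize what remains.
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))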

In [11]:
corpus

['narendra damodardas modi born september indian politician served prime minister india since',
 'modi chief minister gujarat member parliament mp varanasi',
 'member bharatiya janata party bjp rashtriya swayamsevak sangh rss right wing hindutva paramilitary volunteer organisation',
 'longest serving prime minister outside indian national congress',
 'modi born raised vadnagar bombay state present day gujarat completed secondary education',
 'introduced rss age eight',
 'modi became full time worker rss gujarat',
 'rss assigned bjp rose party hierarchy becoming general secretary',
 'b modi appointed chief minister gujarat elected legislative assembly soon',
 'administration considered complicit gujarat riot c criticised management crisis',
 'according official record little people killed three quarter muslim independent source estimated death mostly muslim',
 'special investigation team appointed supreme court india found evidence initiate prosecution proceeding',
 'policy chief minist

In [12]:
# TF-IDF: fit on the cleaned sentences; X is a sparse (n_sentences, n_terms) matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

cv = TfidfVectorizer()
X = cv.fit_transform(corpus)
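
The weights `TfidfVectorizer` produces with its default settings differ from the textbook formula: it uses raw term counts as TF, a smoothed IDF of `ln((1 + N) / (1 + df(t))) + 1`, and then scales each row to unit L2 length. The pure-Python sketch below imitates that weighting under simplifying assumptions: it splits on whitespace, whereas the vectorizer's default tokenizer also drops single-character tokens. The function name `sklearn_style_tfidf` and the two-sentence toy corpus are hypothetical.

```python
import math
from collections import Counter

def sklearn_style_tfidf(corpus):
    """Approximate TfidfVectorizer defaults: smoothed IDF plus L2 row normalization."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # df(t): number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + N) / (1 + df(t))) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)  # raw counts serve as TF
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical toy corpus; "modi" and "gujarat" occur in both sentences.
toy_corpus = ["modi chief minister gujarat", "modi became worker rss gujarat"]
rows = sklearn_style_tfidf(toy_corpus)
print(rows[0])
```

Terms shared by both sentences receive a lower weight than terms unique to one sentence, and every row has unit length, which is why the values in `X` fall between 0 and 1.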

In [13]:
X[0].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.27641537, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.31936592, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.27641537, 0.27641537,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.20299092, 0.20299092, 0.        , 0.        , 0.        ,
        0.31936592, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.31936592, 0.        , 0.        , 0.27