# TF-IDF
TF-IDF stands for “Term Frequency - Inverse Document Frequency”. It is a technique for quantifying the words in a set of documents: each word is given a weight that signifies its importance in the document and in the corpus. The method is widely used in Information Retrieval and Text Mining.

If I give you a sentence such as “This building is so tall”, it is easy for us to understand because we know the semantics of the words and of the sentence. But how will a computer understand it? A computer can only process data in numerical form. For this reason we vectorize the text, turning it into numbers the computer can work with.
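
To make "vectorize" concrete, here is a minimal sketch that turns two sentences into plain word-count vectors. The second sentence and all variable names are made up for illustration, and raw counts are only the simplest possible scheme; the TF-IDF weighting described in this notebook refines it.

```python
from collections import Counter

# Two example sentences; the second one is a made-up companion to the first
example_sentences = ["This building is so tall", "This street is so wide"]

# Lowercase and split on whitespace (a deliberately naive tokenizer)
tokenized = [s.lower().split() for s in example_sentences]

# The vocabulary is every distinct word, sorted so the column order is stable
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One vector per sentence: how often each vocabulary word occurs in it
count_vectors = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vector in count_vectors:
    print(vector)
```

Each sentence becomes a list of numbers with one position per vocabulary word, which is the form a model can consume.
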
*  TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)
*  TF-IDF(t, d) = tf(t, d) * log(N/(df + 1))

Terminology
 * t: term (word)
 * d: document (set of words)
 * N: number of documents in the corpus
 * df: document frequency, the number of documents that contain t
 * corpus: the full set of documents
### Term Frequency
This measures how often a word appears in a document. The raw count depends heavily on the length of the document and on how common the word is: a very common word such as “was” can appear many times in a single document. Consider two documents, one with 100 words and one with 10,000 words. The common word “was” will very likely appear more often in the 10,000-word document, but that does not make the longer document more important than the shorter one. For this reason we normalize the count by dividing it by the total number of words in the document.
* tf(t,d) = count of t in d / number of words in d
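
The normalized term frequency can be sketched in a few lines of Python. The function name `term_frequency` and the sample sentence are assumptions made for illustration; they are not part of the notebook's pipeline.

```python
def term_frequency(term, document_tokens):
    """Count of `term` in the document divided by the document length."""
    if not document_tokens:
        return 0.0  # an empty document contains no terms
    return document_tokens.count(term) / len(document_tokens)

# Hypothetical 11-word document, already lowercased and split into tokens
sample_doc = "this building is so tall and this street is so wide".split()

print(term_frequency("this", sample_doc))  # "this" occurs 2 times in 11 words
print(term_frequency("tall", sample_doc))  # "tall" occurs 1 time in 11 words
```

Because the count is divided by the document length, a 10,000-word document does not get a higher score just for being longer.
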

### Inverse Document Frequency
IDF is the inverse of the document frequency and measures how informative a term t is. The document frequency df is the number of documents that contain t. IDF is very low for the most frequent words, such as stop words: a stop word like “is” appears in almost every document, so N/df is close to 1 and its logarithm is close to 0. The +1 in the denominator avoids division by zero for a term that appears in no document. This finally gives us what we want, a relative weighting.
* idf(t) = log(N/(df + 1))
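
As a rough sketch, the IDF formula and the final TF-IDF product can be written directly from the definitions above. The natural logarithm is an assumption (the formula does not state a base), and the function names and the four-document toy corpus are hypothetical.

```python
import math

def inverse_document_frequency(term, corpus_tokens):
    """idf(t) = log(N / (df + 1)), with df = number of documents containing t."""
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(n_docs / (df + 1))  # natural log, an assumption

def tf_idf(term, doc, corpus_tokens):
    """tf(t, d) * idf(t), using the length-normalized term frequency."""
    tf = doc.count(term) / len(doc)
    return tf * inverse_document_frequency(term, corpus_tokens)

# Hypothetical corpus of four tokenized documents
toy_docs = [
    "the building is tall".split(),
    "the street is wide".split(),
    "the tower is tall and old".split(),
    "a river runs nearby".split(),
]

print(inverse_document_frequency("the", toy_docs))    # in 3 of 4 documents
print(inverse_document_frequency("river", toy_docs))  # in 1 of 4 documents
print(tf_idf("river", toy_docs[3], toy_docs))
```

The common word "the" gets a weight of zero here, while the rare word "river" gets a positive one. With this exact formula, a term that appears in every document gets a slightly negative value, because N / (N + 1) is below 1.
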

In [1]:
## import libraries
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# The code below needs NLTK data that is downloaded once per environment
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

In [None]:
Paragraph = """India’s fight against the Corona global pandemic is moving ahead with great strength and steadfastness. It is only because of your restraint, penance and sacrifice that, India has so far been able to avert the harm caused by corona to a large extent. You have endured immense suffering to save your country, save your India.

    I am well aware of the problems you have faced -some for food, some for movement from place to place, and others for staying away from homes and families. However, for the sake of your country, you are fulfilling your duties like a disciplined soldier. This is the power of ‘We, the People of India’ that our constitution talks about.

    This display of our collective strength, by us, the people of India, is a true tribute to Baba Saheb Doctor Bhim Rao Ambedkar, on his birth anniversary. Baba Saheb’s life inspires us to combat each challenge with determination and hard work. I bow before Baba Saheb on behalf of all of us.

    Friends, this is also the time of various festivals across various parts of our country. Along with festivals like Baisakhi, Pohela Boishakh, Puthandu, and Vishu, the new year has commenced in many states.In the time of lockdown, the manner in which people are abiding by the rules, and celebrating festivalswith restraint while staying within their homes, is truly praiseworthy. On the occasion of new year, I wish and pray for your good health.

    Friends, you are well aware of the status of the Corona pandemic all over the world today. You have been a partner as well as witness to the manner in which India has tried to stop the infection, compared to other countries. Long before we had even a single case of Corona, India had started screening travelers coming in from Corona affected countries at airports. Much before the number of Corona patients reached 100, India had made 14-day isolation mandatory for all those coming in from abroad. Malls, clubs and gyms were shut down in many places. When we had only 550 Corona cases, then itself India had taken the big step of a 21-day complete lockdown. India did not wait for the problem to aggravate. Rather, we attempted to nip the problem in the bud itself, by taking quick decisions as soon as it arose.

    Friends, in such a crisis it is not right to compare our situation with any other country. However, it is also true that if we look at Corona-related figures in the world’s big, powerful countries, India today is in a very well-managed position. A month, month and a half ago, several countries had been at par with India in terms of Corona infection. But today, Corona cases in those countries are 25 to 30 times than that of India. Thousands of people have tragically died in those countries. Had India not adopted a holistic and integrated approach, taking quick and decisive action; the situation in India today would have been completely different.

    It is clearly evident from the experience of the past few days, that we have chosen the correct path. Our country has greatly benefited from Social Distancing and Lockdown. From an economic only point of view, it undoubtedly looks costly right now; but measured against the lives of Indian citizens, there is no comparison itself. The path that India has taken within our limited resources has become a topic of discussion in the entire world today.

    The State Governments of the country have also acted with great responsibility in this, managing the situation round the clock. But friends, the way the Corona pandemic is spreading amidst all these efforts, has made health experts & governments around the world even more alert. I have been in continuous touch with the States on how the fight against Corona should progress in India. Everyone has suggested that the lockdown should be continued. Many States have in fact already decided and declared to continue the lockdown.

    Friends, keeping all the suggestions in mind, it has been decided that the lockdown in India will have to be extended till 3rd May. That means until 3rd May, each and every one of us, will have to remain in the lockdown. During this time, we must continue maintaining discipline in the way we have been doing till now.

    It is my request and prayer to all fellow citizens, that we must not let Coronavirus spread to new areas at any cost. A single new patient at even the smallest local level, should be a matter of concern for us. The tragic death of even a single patient from coronavirus, should increase our concern even further.

    Therefore, we have to be very vigilant about hot-spots. We will have to keep a close and strict watch on the places which run the risk of becoming hot-spots. The creation of new hot-spots will further challenge our hard work and penance. Hence, let us extend the strictness and austerity in the fight against Corona for the upcoming one week.

    Until 20th April, every town, every police station, every district, every state will be evaluated on how much the lockdown is being followed.The extent to which the region has protected itself from Coronavirus will be noted.

    Areas that will succeed in this litmus test, which will not be in the hot-spot category, and will have less likelihood to turn into a hot-spot; maybe allowed to open up select necessary activities from 20th April. However, keep in mind, this permission will be conditional, and the rules for going out will be very strict. Permission will be withdrawn immediately if lockdown rules are broken, and spread of Coronavirus risked. Hence, we must make sure we ourselves don’t become careless, not allow anyone else do so. A detailed guideline will be issued by the Government tomorrow in this regard.

    Friends, provision of this limited exemption in these identified areas after 20th April has been done keeping in mind the livelihood of our poor brothers and sisters.Those who earn daily, make ends meet with daily income, they are my family. One of my top-most priorities is to reduce the difficulties in their lives.The government has made every possible effort to help them through Pradhan Mantri Gareeb Kalyan Yojna. Their interests have also been taken care of while making the new guidelines.

    These days, the harvesting of the Rabi crop is also in progress.The Central and State governments are working together to minimize the problems of the farmers.

    Friends, the country has ample reserves of medicines, food-ration and other essential goods; and supply chain constraints are continuously being removed.We are making rapid progress in ramping up health infrastructure as well.From having only one testing lab for Coronavirus in January, we now have more than 220 functional testing labs. Global experience shows that 1,500-1,600 beds are required for every 10,000 patients. In India, we have arranged more than 1 Lakh beds today. Not only this, there are more than 600 hospitals which are dedicated for Covid treatment. As we speak, these facilities are being increased even more rapidly.

    Friends, while India has limited resources today, I have a special request for India’s young scientists – to come forward and take a lead in creating a vaccine for Coronavirus; for the welfare of the world, for the welfare of the human race.

    Friends, if we continue to be patient and follow rules, we will be able to defeat even a pandemic like Corona.With this faith and trust, I seek your support for 7 things in the end."""

In [None]:
wordnet = WordNetLemmatizer()
# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))
# The speech text above is stored in `Paragraph` (capital P)
sentences = nltk.sent_tokenize(Paragraph)
corpus = []
for sentence in sentences:
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower().split()
    # Lemmatize every word that is not a stop word
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

In [None]:
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
# Each row of X is one sentence, each column one vocabulary term
X = cv.fit_transform(corpus).toarray()