## 1. Extracting Important Keywords by Using Scikit-Learn

In [1]:
import pandas as pd
df = pd.read_csv("../input/2020-07-16_Tier-2-5-sponsor-guidance_Jul-2020_v1.0_data.csv", index_col=0)

In [2]:
import re
def clean_doc(doc):
    """
    Cleans the document from unneecessary chars/words, etc.
    
    TODO : maybe first-level / second-level cleaning
    """
    doc = re.sub(r"[\t\n]+", "", doc)                                                   # find & replace \t and \n with empty string
    doc = re.sub(r"[^\x00-\x7F]+", " ", doc)                                            # remove non-ascii chars
    doc = re.sub(r" +", " ", doc)                                                       # remove dublicate spaces
    doc = doc.strip()                                                                   # strip leading/trailing spaces
    doc = re.sub(r"(Page)\s\d{1,2,3}\s\w+\s\d{1,3}\s(Tiers 2 and 5: guidance for sponsors - version 07/20)", "", doc)  # TODO
    doc = re.sub(r"(Annex)\s(\w)", r"\1_\2", doc)                                       # find & replace Annex 9 -> Annex_9 
    doc = re.sub(r"(Appendix)\s(\w)", r"\1_\2", doc)                                    # find & replace Apeendix 9 -> Appendix_9 
    doc = re.sub(r"(Table)\s(\w)", r"\1_\2", doc)                                       # find & replace Table 9 -> Table_9
    doc = re.sub(r"(Tier|Tiers)\s(\d)\s(and|or|and/or)\s(\d)", r"\1_\2_\3_\4", doc)     # find & replace Tier 2 and 5 -> Tier_2_and_5
    doc = re.sub(r"(Tier)\s(\d)", r"\1_\2", doc)                                        # find & replace Tier 4 -> Tier_4
    doc = re.sub(r"(\d{1,2})\s(January|February|March)\s(\d{4})", r"\1_\2_\3", doc)     # combine dates -> 1_June_2020
    doc = re.sub(r"(\d{1,2})\s(April|May|June)\s(\d{4})", r"\1_\2_\3", doc)             # combine dates 
    doc = re.sub(r"(\d{1,2})\s(July|August|September)\s(\d{4})", r"\1_\2_\3", doc)      # combine dates 
    doc = re.sub(r"(\d{1,2})\s(October|November|December)\s(\d{4})", r"\1_\2_\3", doc)  # combine dates 
    return doc

In [3]:
df.raw_text = df.raw_text.apply(lambda x:clean_doc(x))
docs = df.raw_text.to_list()
print ("Number of pages : " , len(docs))

Number of pages :  209


In [4]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# dummy list of keywords to be EXCLUDED
list_of_keywords = ['from', 'subject', 're', 'edu', 'use'] 
stop_words.extend(list_of_keywords)

## 2. How to Use Tfidftransformer & Tfidfvectorizer?

Term Frequency- Inverse Document Frequency (IDF) takes the sparse matrix from CountVectorizer to generate the IDF when you invoke fit. **An extremely important point to note here is that the IDF should be based on a large corpora and should be representative of texts you would be using to extract keywords.**

https://github.com/kavgan/nlp-in-practice/blob/master/tf-idf/Keyword%20Extraction%20with%20TF-IDF%20and%20SKlearn.ipynb

Scikit-learn’s **Tfidftransformer** and **Tfidfvectorizer** aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The differences between the two modules can be quite confusing and it’s hard to know when to use which. This article shows you how to correctly use each module, the differences between the two and some guidelines on what to use when.

https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.Xz_UPegzaXI

So now you may be wondering, why you should use more steps than necessary if you can get everything done in two steps. Well, there are cases where you want to use Tfidftransformer over Tfidfvectorizer and it is sometimes not that obvious. Here is a general guideline:

* If you need the term frequency (term count) vectors for different tasks, use **Tfidftransformer**.
* If you need to compute tf-idf scores on documents within your “training” dataset, use **Tfidfvectorizer**
* If you need to compute tf-idf scores on documents outside your “training” dataset, use either one, **both** will work.

## 3.Tfidftransformer Usage
### 3.1. fit_transform WHOLE document, then extract keywords in PAGE / SECTION(?)

In [5]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# instantiate CountVectorizer() 
# create a CountVectorizer to count the number of words (term frequency)
cv = CountVectorizer(max_df=0.85, stop_words=stop_words)
 
# this steps generates word counts for the words in your docs (WHOLE DOCUMENTS)
word_count_vector = cv.fit_transform(docs)

# compute the IDF values by calling tfidf_transformer.fit(word_count_vector) on the word counts
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True) 
tfidf_transformer.fit(word_count_vector)

# Once you have the IDF values, you can now compute the tf-idf scores for any document or set of documents.
"""
fot the moment PAGE, may be SECTION / SUBSECTION / ...
"""
# get the document (page) that we want to extract keywords from
#doc = "".join(docs[200])
doc = docs[15]

print(type(doc))

# generate tf-idf for the given document
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))

# you only needs to do this once
feature_names = cv.get_feature_names()

print("word_count_vector (document size)  : ", word_count_vector.shape[0])
print("word_count_vector (distinct words) : ", word_count_vector.shape[1])
print("10 words from our vocabulary       : ", list(cv.vocabulary_.keys())[:10])
print("get feature names                  : ", list(cv.get_feature_names())[500:510])
print("vectors found in this page         : ", tf_idf_vector.shape)

# get the vector in this page, since doc is a single page, idx=0
first_document_vector = tf_idf_vector[0]
 
# print the scores 
dff = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, 
             columns=["tfidf"]).sort_values(by=["tfidf"],ascending=False).head(10)

# result = dff.to_dict()
dff

<class 'str'>
word_count_vector (document size)  :  209
word_count_vector (distinct words) :  3445
10 words from our vocabulary       :  ['tier_2_and_5', 'addendum', 'published', '19_july_2019', 'updated', '03', '19', 'replaced', '17_july_2019', 'applies']
get feature names                  :  ['annex_7', 'annex_8', 'annex_9', 'annex_provides', 'annexes', 'announced', 'announcements', 'annual', 'annum', 'another']
vectors found in this page         :  (1, 3445)


Unnamed: 0,tfidf
system,0.239034
come,0.219803
eea,0.196637
sponsor,0.196481
trust,0.169944
licence,0.151065
based,0.147372
work,0.142097
education,0.141099
immigration,0.134588


### 3.2. fit_transform PAGE document, then extract keywords in PAGE / SECTION(?)

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# get page
doc = "".join(docs[15])

# sentence tokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
abbreviation = ['f', 'fr', 'k', 'u.k.', 'gov.uk.', 'fig']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.train(doc)
doc = tokenizer.tokenize(doc)

# instantiate CountVectorizer() 
# create a CountVectorizer to count the number of words (term frequency)
cv = CountVectorizer(max_df=0.85, stop_words=stop_words)

# this steps generates word counts for the words in your docs (WHOLE DOCUMENTS)
word_count_vector = cv.fit_transform(doc)

# compute the IDF values by calling tfidf_transformer.fit(word_count_vector) on the word counts
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True) 
tfidf_transformer.fit(word_count_vector)

# Once you have the IDF values, you can now compute the tf-idf scores for any document or set of documents.
"""
fot the moment PAGE, may be SECTION / SUBSECTION / ...
"""
# get the document (page) that we want to extract keywords from

# generate tf-idf for the given document
tf_idf_vector = tfidf_transformer.transform(cv.transform(doc))

# you only needs to do this once
feature_names = cv.get_feature_names()

print("word_count_vector (document size)  : ", word_count_vector.shape[0])
print("word_count_vector (distinct words) : ", word_count_vector.shape[1])
print("10 words from our vocabulary       : ", list(cv.vocabulary_.keys())[:10])
print("get feature names                  : ", list(cv.get_feature_names())[500:510])
print("vectors found in the first page    : ", tf_idf_vector.shape)

# get the vector in this page, since doc is a single page, idx=0
first_document_vector = tf_idf_vector[0]
 
# print the scores 
dff = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, 
             columns=["tfidf"]).sort_values(by=["tfidf"],ascending=False).head(10)

# result = dff.to_dict()
dff

word_count_vector (document size)  :  16
word_count_vector (distinct words) :  129
10 words from our vocabulary       :  ['applying', 'licence', 'sponsorship', 'tiers_2_and_5', 'points', 'based', 'system', 'primary', 'immigration', 'routes']
get feature names                  :  []
vectors found in the first page    :  (16, 129)


Unnamed: 0,tfidf
applying,0.651225
sponsorship,0.582716
licence,0.486157
07,0.0
provider,0.0
points,0.0
poses,0.0
primary,0.0
principles,0.0
providers,0.0


## 4. Tfidfvectorizer Usage
### 4.1. fit_transform WHOLE document, then extract keywords in PAGE

In [7]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer 
 
# settings that you use for count vectorizer will go here 
tfidf_vectorizer = TfidfVectorizer(use_idf=True, max_df=0.85, stop_words=stop_words) 
 
# just send in all your docs here 
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)

# get the first vector out (for the first document) 
first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[15] 
 
# place tf-idf values in a pandas data frame 
dff = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), 
             columns=["tfidf"]).sort_values(by=["tfidf"],ascending=False).head(10)

# result = dff.to_dict()
dff

Unnamed: 0,tfidf
system,0.239034
come,0.219803
eea,0.196637
sponsor,0.196481
trust,0.169944
licence,0.151065
based,0.147372
work,0.142097
education,0.141099
immigration,0.134588


### 4.2. fit_transform PAGE document, then extract keywords in PAGE

In [8]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer 
 
# get page
doc = "".join(docs[15])

# sentence tokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
abbreviation = ['f', 'fr', 'k', 'u.k.', 'gov.uk.', 'fig']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.train(doc)
doc = tokenizer.tokenize(doc)

# settings that you use for count vectorizer will go here 
tfidf_vectorizer = TfidfVectorizer(use_idf=True, max_df=0.85, stop_words=stop_words) 
 
# just send in all your docs here 
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(doc)

# get the first vector out (for the first document) 
first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[0]
 
# place tf-idf values in a pandas data frame 
dff = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), 
             columns=["tfidf"]).sort_values(by=["tfidf"],ascending=False).head(10)

# result = dff.to_dict()
dff

Unnamed: 0,tfidf
applying,0.651225
sponsorship,0.582716
licence,0.486157
07,0.0
provider,0.0
points,0.0
poses,0.0
primary,0.0
principles,0.0
providers,0.0
