# TFIDF
TF-IDF (Term Frequency-Inverse Document Frequency) is the baseline among statistics-based methods for keyword extraction. TF, the term frequency, is the ratio between the number of occurrences of a term and the total number of words in the document [1]. IDF, the inverse document frequency, is the logarithm of the ratio between the total number of documents in the corpus and the number of documents that contain the term [1]. The TF-IDF score is the product of these two values [2]. The formula for TF-IDF is:

$$ w(t_k) = tf_k \cdot \log \frac{N}{df_k} $$

where $tf_k$ is the TF of the term $t_k$, $\log \frac{N}{df_k}$ is the IDF of $t_k$ (with $N$ the total number of documents in the corpus and $df_k$ the number of documents that contain $t_k$), and $w(t_k)$ is the TF-IDF weight [2]. The idea behind TF-IDF is that a term that occurs frequently in a document but appears in few documents of the corpus is more likely to be representative of that particular document [2].
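The formula above can be illustrated with a minimal pure-Python sketch. The toy corpus, the function name `tf_idf`, and the use of the natural logarithm are assumptions made for illustration only; they are not taken from the cited papers.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
corpus = [
    ["machine", "learning", "is", "fun"],
    ["machine", "translation", "is", "hard"],
    ["the", "park", "is", "big"],
]

def tf_idf(term, doc, docs):
    """Return tf * log(N / df) for `term` in `doc`, given the corpus `docs`."""
    tf = Counter(doc)[term] / len(doc)        # occurrences / words in the document
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    if df == 0:
        return 0.0                            # term never seen in the corpus
    return tf * math.log(len(docs) / df)      # natural logarithm assumed

# Weights of every term of the first document.
weights = {term: tf_idf(term, corpus[0], corpus) for term in corpus[0]}
for term, weight in weights.items():
    print(f"{term:10s}{weight:.4f}")
```

In this sketch, "is" appears in every document, so its IDF is log(1) = 0 and its weight vanishes, while "learning", which appears in a single document, receives a higher weight than "machine", which appears in two.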

[1] Liu, Xingbing, et al. "Keywords extraction method for technological demands of small and medium-sized enterprises based on LDA." 2019 Chinese Automation Congress (CAC). IEEE, 2019.

[2] Chen, Kewen, et al. "Turning from TF-IDF to TF-IGM for term weighting in text classification." Expert Systems with Applications 66 (2016): 245-260.


# TF-IDF
In this assessment, TfidfVectorizer from the scikit-learn (sklearn) library is used. It converts a collection of raw documents to a matrix of TF-IDF features and is equivalent to applying CountVectorizer followed by TfidfTransformer: CountVectorizer converts a collection of text documents to a matrix of token counts, and TfidfTransformer applies the TF-IDF transformation to that matrix of counts. Note that "smooth_idf" is True by default in TfidfVectorizer, so one is added to the document frequencies, as if an extra document containing every term had been seen, which prevents divisions by zero. Furthermore, stop words such as "and", "the", and "him" are considered uninformative for describing the content of a document, so they are removed to prevent them from being interpreted as a signal for prediction.

In [None]:
# Install the scikit-learn and pandas libraries.
!pip install scikit-learn
!pip install pandas



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# The train and test data can be any lists of strings. For this experiment, arbitrary sentences, including text from Wikipedia, are chosen.
# Two different test datasets are used to show how the TF-IDF values and the terms change with the test dataset.
# The first test dataset is a sentence taken from the training dataset.

train = [
    "I enjoy reading about Machine Learning and it is my PhD subject.",
    "I would enjoy a walk in the park.",
    "I was reading in the library.",
    "You can not leave the library.",
    "Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. ",
    "Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks."

]
test = [
        "I enjoy reading about Machine Learning and it is my PhD subject."

]

In [None]:
# The training data is defined again; it differs from the cell above only in the capitalization of "machine" in the first sentence.
# The second test dataset is not part of the training dataset. With the TF-IDF class below, new text that differs from the training data still receives TF-IDF values for the words that appear in both datasets.
# The results are shown below.

train= [
    "I enjoy reading about machine Learning and it is my PhD subject.",
    "I would enjoy a walk in the park.",
    "I was reading in the library.",
    "You can not leave the library.",
    "Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. ",
    "Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks."

]
test_1 = [
    "It is a big park. After walking some time in the park, I will go to the library to study Machine Learning. I have to finish my reading before leaving. "
]

In [None]:
class tfidfku():
  """
  Computes TF-IDF values for a test dataset using the vocabulary learned from a training dataset.

  The class takes two inputs, train and test: the training dataset and the test dataset.
  It uses scikit-learn's TfidfVectorizer with English stop-word removal to find the TF-IDF
  values of the terms in the test dataset with respect to the vocabulary of the training set.
  Finally, it creates a pandas DataFrame that lists the corresponding TF-IDF value for each term.

  Methods:
  process()
  dfresults()
  showresults()
  """
  def __init__(self, train, test):
    """
    Parameters:
    train: training dataset, a list of sentences (strings).
    test: test dataset, a list of sentences (strings).

    Attributes:
    tfidfvectorizer: scikit-learn's TfidfVectorizer with English stop-word removal.
    """
    self.train = train
    self.test = test
    self.tfidfvectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)
    self.process()
    self.dfresults()

  def process(self):
    """
    process: the vectorizer is fitted to the training dataset with .fit, which learns the vocabulary and the IDF values.
    Then, with respect to the learned vocabulary, .transform produces the document-term matrix (tfidf_term_vectors) of the test dataset.
    """
    self.tfidfvectorizer.fit(self.train)
    self.tfidf_term_vectors = self.tfidfvectorizer.transform(self.test)

  def dfresults(self):
    """
    dfresults: creates a pandas DataFrame with the TF-IDF value of each term for the first test document.
    """
    self.df = pd.DataFrame(self.tfidf_term_vectors[0].T.toarray(), index=self.tfidfvectorizer.get_feature_names_out(), columns=["TF-IDF"])
    self.df = self.df.sort_values('TF-IDF', ascending=False)

  def showresults(self):
    """
    showresults: returns the DataFrame created by dfresults.
    """
    return self.df

  


In [None]:
cl = tfidfku(train,test)

In [None]:
cl.showresults()

| Term | TF-IDF |
|---|---|
| subject | 0.48205 |
| phd | 0.48205 |
| enjoy | 0.395288 |
| reading | 0.395288 |
| machine | 0.333729 |
| learning | 0.333729 |
| algorithms | 0.0 |
| recognition | 0.0 |
| model | 0.0 |
| needed | 0.0 |
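The non-zero values in the result above can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults: raw counts as TF, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The document frequencies are counted manually from the six training sentences after English stop-word removal; the variable names are illustrative.

```python
import math

n_docs = 6  # number of documents in the training set

# Document frequencies of the terms of the test sentence, counted by hand.
doc_freq = {"enjoy": 2, "reading": 2, "machine": 3, "learning": 3, "phd": 1, "subject": 1}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Every term occurs once in the test sentence, so tf = 1 and the raw weight equals the IDF.
norm = math.sqrt(sum(w * w for w in idf.values()))  # L2 norm of the document vector
scores = {term: w / norm for term, w in idf.items()}

for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s}{score:.6f}")
```

Under these assumptions, the rarest terms ("phd", "subject") receive the highest score and the most widespread ones ("machine", "learning") the lowest, which matches the ordering in the result above.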


In [None]:
cl2 = tfidfku(train,test_1)

In [None]:
cl2.showresults()

| Term | TF-IDF |
|---|---|
| park | 0.796602 |
| library | 0.326612 |
| reading | 0.326612 |
| machine | 0.275749 |
| learning | 0.275749 |
| algorithms | 0.0 |
| methods | 0.0 |
| model | 0.0 |
| needed | 0.0 |
| order | 0.0 |


# Unit Test

Unit tests are used to check whether errors appear during the experiment. Errors are more likely to show up in edge cases, that is, inputs at the boundaries of what the program is expected to handle. It is therefore essential to test these cases specifically. The unit tests below aim to cover them.

In [None]:
# The word tokenizer and the English stop-word list are downloaded for use in the unit tests.
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize,sent_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import unittest

class tfidftest(unittest.TestCase):

    """
    This is the class created for Unit Testing for the TFIDF class (tfidfku) shown above.
    """

    def setUp(self):
        """
        In the setUp part, the tfidfku class is created with "train" as the train set and "test" as the test set.
        The tfidfku class also builds its pandas DataFrame on construction.
        Attributes:
        tokenizedtrain, tokenizedtest: word-tokenized forms of the train and test datasets, respectively.
        """
        self.cltfidf = tfidfku(train, test)
        # Split every document into sentences, then every sentence into word tokens.
        self.tokenizedtrain = [word_tokenize(s) for doc in train for s in sent_tokenize(doc)]
        self.tokenizedtest = [word_tokenize(s) for doc in test for s in sent_tokenize(doc)]

    def test_trainstr(self):
        """
        The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the train set.
        """
        for line in self.cltfidf.train:
            self.assertIsInstance(line, str)

    def test_teststr(self):
        """
        The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the test set.
        """
        for line in self.cltfidf.test:
            self.assertIsInstance(line, str)

    def test_stopwordsnotexist(self):
        """
        As the TF-IDF vectorizer uses English stop word removal, the absence of these words is checked in the created TF-IDF DataFrame.
        """
        for word in stopwords.words("english"):
            self.assertNotIn(word, self.cltfidf.df.index)

    def test_similar(self):
        """
        In the test and train datasets, "Machine Learning" is used as a bigram. "Machine" and "Learning" words are not used in any of the datasets alone.
        So, it is expected that they have equal TF-IDF values.
        """
        self.assertAlmostEqual(self.cltfidf.df.loc["machine", "TF-IDF"], self.cltfidf.df.loc["learning", "TF-IDF"])

    def test_equalzero(self):
        """
        A non used word in the test set is expected to have a 0 as the TF-IDF score.
        """
        self.assertEqual(self.cltfidf.df.loc["recognition", "TF-IDF"], 0)

    def test_forexistinginboth(self):
        """
        For every word existing in both the train and test dataset, they should have a TF-IDF value bigger than 0. This condition is checked.
        """
        english_stopwords = set(stopwords.words("english"))

        def content_words(tokenized):
            # Keep alphanumeric tokens that are not NLTK stop words.
            return {w for sent in tokenized for w in sent
                    if w.isalnum() and w.lower() not in english_stopwords}

        # Words are matched as tokenized (case-sensitive); the DataFrame index is lowercase.
        shared = content_words(self.tokenizedtest) & content_words(self.tokenizedtrain)
        for word in shared:
            self.assertGreater(self.cltfidf.df.loc[word.lower(), "TF-IDF"], 0)

In [None]:
unittest.main(argv=[''], defaultTest='tfidftest', verbosity=2, exit=False)

test_equalzero (__main__.tfidftest)
A non used word in the test set is expected to have a 0 as the TF-IDF score. ... ok
test_forexistinginboth (__main__.tfidftest)
For every word existing in both the train and test dataset, they should have a TF-IDF value bigger than 0. This condition is checked. ... ok
test_similar (__main__.tfidftest)
In the test and train datasets, "Machine Learning" is used as a bigram. "Machine" and "Learning" words are not used in any of the datasets alone. ... ok
test_stopwordsnotexist (__main__.tfidftest)
As the TF-IDF vectorizer uses English stop word removal, the absence of these words is checked in the created TF-IDF DataFrame. ... ok
test_teststr (__main__.tfidftest)
The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the test set. ... ok
test_trainstr (__main__.tfidftest)
The type of the variables inside of the lists that are used as the inputs for the tfidfku class sh

<unittest.main.TestProgram at 0x7fdf3282c850>

In [None]:
import unittest

class tfidftest_2(unittest.TestCase):

    """
    This is the class created for Unit Testing for the TFIDF class (tfidfku) shown above, with "train" as the train set and "test_1" as the test set.
    """

    def setUp(self):
        """
        In the setUp part, the tfidfku class is created with "train" as the train set and "test_1" as the test set.
        Attributes:
        tokenizedtrain, tokenizedtest: word-tokenized forms of the train and test datasets, respectively.
        """
        self.cltfidf = tfidfku(train, test_1)
        # Split every document into sentences, then every sentence into word tokens.
        self.tokenizedtrain = [word_tokenize(s) for doc in train for s in sent_tokenize(doc)]
        self.tokenizedtest = [word_tokenize(s) for doc in test_1 for s in sent_tokenize(doc)]

    def test_trainstr(self):
        """
        The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the train set.
        """
        for line in self.cltfidf.train:
            self.assertIsInstance(line, str)

    def test_teststr(self):
        """
        The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the test set.
        """
        for line in self.cltfidf.test:
            self.assertIsInstance(line, str)

    def test_stopwordsnotexist(self):
        """
        As the TF-IDF vectorizer uses English stop word removal, the absence of these words is checked in the created TF-IDF DataFrame.
        """
        for word in stopwords.words("english"):
            self.assertNotIn(word, self.cltfidf.df.index)

    def test_similar(self):
        """
        In the test and train datasets, "Machine Learning" is used as a bigram. "Machine" and "Learning" words are not used in any of the datasets alone.
        So, it is expected that they have equal TF-IDF values.
        """
        self.assertAlmostEqual(self.cltfidf.df.loc["machine", "TF-IDF"], self.cltfidf.df.loc["learning", "TF-IDF"])

    def test_equalzero(self):
        """
        A non used word in the test set is expected to have a 0 as the TF-IDF score.
        """
        self.assertEqual(self.cltfidf.df.loc["recognition", "TF-IDF"], 0)

    def test_forexistinginboth(self):
        """
        For every word existing in both the train and test dataset, they should have a TF-IDF value bigger than 0. This condition is checked.
        """
        english_stopwords = set(stopwords.words("english"))

        def content_words(tokenized):
            # Keep alphanumeric tokens that are not NLTK stop words.
            return {w for sent in tokenized for w in sent
                    if w.isalnum() and w.lower() not in english_stopwords}

        # Words are matched as tokenized (case-sensitive); the DataFrame index is lowercase.
        shared = content_words(self.tokenizedtest) & content_words(self.tokenizedtrain)
        for word in shared:
            self.assertGreater(self.cltfidf.df.loc[word.lower(), "TF-IDF"], 0)

In [None]:
unittest.main(argv=[''], defaultTest='tfidftest_2', verbosity=2, exit=False)

test_equalzero (__main__.tfidftest_2)
A non used word in the test set is expected to have a 0 as the TF-IDF score. ... ok
test_forexistinginboth (__main__.tfidftest_2)
For every word existing in both the train and test dataset, they should have a TF-IDF value bigger than 0. This condition is checked. ... ok
test_similar (__main__.tfidftest_2)
In the test and train datasets, "Machine Learning" is used as a bigram. "Machine" and "Learning" words are not used in any of the datasets alone. ... ok
test_stopwordsnotexist (__main__.tfidftest_2)
As the TF-IDF vectorizer uses English stop word removal, the absence of these words is checked in the created TF-IDF DataFrame. ... ok
test_teststr (__main__.tfidftest_2)
The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the test set. ... ok
test_trainstr (__main__.tfidftest_2)
The type of the variables inside of the lists that are used as the inputs for the tfid

<unittest.main.TestProgram at 0x7fdf327b4850>

# Further Work

Lemmatization is one of the most common NLP techniques for text preprocessing. It reduces a given word to its base form (lemma), which lets the TF-IDF algorithm produce more meaningful output. For example, the terms "gone" and "go" are treated as the same word, which has a significant influence on the TF-IDF scores. The spaCy library's lemmatizer is used for this purpose to convert the tokens to their base forms.

# Lemmatization

In [None]:
import spacy
# Load spaCy's English pipeline (the en_core_web_sm model must be installed beforehand).
nlp = spacy.load("en_core_web_sm")

# Class with lemmatization

In [None]:
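class tfidfwlemma():
  """
  TF-IDF class that lemmatizes the text with spaCy before the TF-IDF values are computed.
  It relies on the spaCy pipeline "nlp" loaded above.
  """
  def __init__(self, train, test):
    """
    Parameters:
    train: training dataset, a list of sentences (strings).
    test: test dataset, a list of sentences (strings).
    """
    self.train = train
    self.test = test
    self.process()
    self.dfresults()

  def lemmatize(self, text):
    """lemmatize: tokenizes the text with spaCy and returns the list of lemmas."""
    doc = nlp(text)
    # The text is split into tokens, and punctuation is ignored.
    tokens = [token for token in doc if not token.is_punct]
    # The tokens are converted into their lemmas by spaCy; pronouns keep their original form.
    lemmas = [token.lemma_ if token.pos_ != 'PRON' else token.orth_ for token in tokens]
    return lemmas

  def process(self):
    """process: fits the vectorizer on the training set and transforms the test set."""
    # lemmatize is passed as the tokenizer, so the default token_pattern is not used.
    self.tfidfvectorizer = TfidfVectorizer(stop_words='english', tokenizer=self.lemmatize)
    self.tfidfvectorizer.fit(self.train)
    self.tfidf_term_vectors = self.tfidfvectorizer.transform(self.test)

  def dfresults(self):
    """dfresults: creates a pandas DataFrame with the TF-IDF value of each lemma for the first test document."""
    self.df = pd.DataFrame(self.tfidf_term_vectors[0].T.toarray(), index=self.tfidfvectorizer.get_feature_names_out(), columns=["TF-IDF"])
    self.df = self.df.sort_values('TF-IDF', ascending=False)

  def showresults(self):
    """showresults: returns the DataFrame created by dfresults."""
    return self.df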

  


In [None]:
clw = tfidfwlemma(train,test)

  "The parameter 'token_pattern' will not be used"
  % sorted(inconsistent)
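The first warning fragment above appears to come from scikit-learn noticing that the default `token_pattern` is ignored once a custom tokenizer is supplied. A possible way to silence it, sketched here with a hypothetical whitespace tokenizer in place of the spaCy lemmatizer, is to pass `token_pattern=None` explicitly.

```python
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer

def whitespace_tokenizer(text):
    """Hypothetical stand-in for a custom tokenizer such as a lemmatizer."""
    return text.split()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # token_pattern=None states explicitly that only the custom tokenizer is used.
    vectorizer = TfidfVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)
    vectorizer.fit(["machine learning algorithms", "reading in the library"])

token_pattern_warnings = [w for w in caught if "token_pattern" in str(w.message)]
print(len(token_pattern_warnings))
```

The second fragment (`sorted(inconsistent)`) seems to belong to a different warning about stop words that do not match the tokenizer's output, which this change does not address.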


In [None]:
clw.showresults()

| Term | TF-IDF |
|---|---|
| subject | 0.48205 |
| phd | 0.48205 |
| enjoy | 0.395288 |
| read | 0.395288 |
| machine | 0.333729 |
| learning | 0.333729 |
| recognition | 0.0 |
| model | 0.0 |
| need | 0.0 |
| order | 0.0 |


In [None]:
clw_2 = tfidfwlemma(train,test_1)

  "The parameter 'token_pattern' will not be used"
  % sorted(inconsistent)


In [None]:
clw_2.showresults()

| Term | TF-IDF |
|---|---|
| park | 0.724 |
| walk | 0.362 |
| leave | 0.362 |
| library | 0.296845 |
| machine | 0.250617 |
| learning | 0.250617 |
| algorithm | 0.0 |
| program | 0.0 |
| model | 0.0 |
| need | 0.0 |


In [None]:
# Collect the lemma of every token in the training set to inspect spaCy's output.
train_lemmas = []
for doc in train:
  for sentence in sent_tokenize(doc):
    for token in nlp(sentence):
      train_lemmas.append(token.lemma_)


In [None]:
train_lemmas

['I',
 'enjoy',
 'read',
 'about',
 'machine',
 'Learning',
 'and',
 'it',
 'be',
 'my',
 'phd',
 'subject',
 '.',
 'I',
 'would',
 'enjoy',
 'a',
 'walk',
 'in',
 'the',
 'park',
 '.',
 'I',
 'be',
 'read',
 'in',
 'the',
 'library',
 '.',
 'you',
 'can',
 'not',
 'leave',
 'the',
 'library',
 '.',
 'machine',
 'learning',
 '(',
 'ML',
 ')',
 'be',
 'a',
 'field',
 'of',
 'inquiry',
 'devote',
 'to',
 'understanding',
 'and',
 'building',
 'method',
 'that',
 "'",
 'learn',
 "'",
 ',',
 'that',
 'is',
 ',',
 'method',
 'that',
 'leverage',
 'datum',
 'to',
 'improve',
 'performance',
 'on',
 'some',
 'set',
 'of',
 'task',
 '.',
 'it',
 'be',
 'see',
 'as',
 'a',
 'part',
 'of',
 'artificial',
 'intelligence',
 '.',
 'machine',
 'learning',
 'algorithm',
 'build',
 'a',
 'model',
 'base',
 'on',
 'sample',
 'datum',
 ',',
 'know',
 'as',
 'training',
 'datum',
 ',',
 'in',
 'order',
 'to',
 'make',
 'prediction',
 'or',
 'decision',
 'without',
 'be',
 'explicitly',
 'program',
 'to',

# Unit Test

In [None]:
import unittest

class tfidftestwlemma(unittest.TestCase):

    """
    This is the class created for Unit Testing for the TFIDF class with lemmatization (tfidfwlemma) shown above.
    """

    def setUp(self):
        """
        In the setUp part, the tfidfwlemma class is created with "train" as the train set and "test" as the test set.
        The tfidfwlemma class also builds its pandas DataFrame on construction.
        Attributes:
        lemmatrain, lemmatest: lemmatized forms of the words in the train and test datasets, respectively.
        """
        self.cltfidf = tfidfwlemma(train, test)
        # Split every document into sentences and collect the lemma of every spaCy token.
        self.lemmatrain = [tok.lemma_ for doc in train for s in sent_tokenize(doc) for tok in nlp(s)]
        self.lemmatest = [tok.lemma_ for doc in test for s in sent_tokenize(doc) for tok in nlp(s)]

    def test_trainstr(self):
        """
        The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the train set.
        """
        for line in self.cltfidf.train:
            self.assertIsInstance(line, str)

    def test_teststr(self):
        """
        The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the test set.
        """
        for line in self.cltfidf.test:
            self.assertIsInstance(line, str)

    def test_stopwordsnotexist(self):
        """
        As the TF-IDF vectorizer uses English stop word removal, the absence of these words is checked in the created TF-IDF DataFrame.
        """
        for word in stopwords.words("english"):
            self.assertNotIn(word, self.cltfidf.df.index)

    def test_similar(self):
        """
        In the test and train datasets, "Machine Learning" is used as a bigram. "Machine" and "Learning" words are not used in any of the datasets alone.
        So, it is expected that they have equal TF-IDF values.
        """
        self.assertAlmostEqual(self.cltfidf.df.loc["machine", "TF-IDF"], self.cltfidf.df.loc["learning", "TF-IDF"])

    def test_equalzero(self):
        """
        A non used word in the test set is expected to have a 0 as the TF-IDF score.
        """
        self.assertEqual(self.cltfidf.df.loc["recognition", "TF-IDF"], 0)

    def test_forexistinginboth(self):
        """
        For the words' lemmatization forms that exist in both the train and test datasets, they should have a TF-IDF value bigger than 0. This condition is checked.
        """
        english_stopwords = set(stopwords.words("english"))
        train_lemmas = set(self.lemmatrain)
        for lemma in self.lemmatest:
            # Only alphanumeric, non-stop-word lemmas seen in the training set are checked.
            if lemma.lower() not in english_stopwords and lemma.isalnum() and lemma in train_lemmas:
                self.assertGreater(self.cltfidf.df.loc[lemma.lower(), "TF-IDF"], 0)

In [None]:
unittest.main(argv=[''], defaultTest='tfidftestwlemma', verbosity=2, exit=False)

test_equalzero (__main__.tfidftestwlemma)
  "The parameter 'token_pattern' will not be used"
  % sorted(inconsistent)
ok
test_forexistinginboth (__main__.tfidftestwlemma)
For the words' lemmatization forms that exist in both the train and test datasets, they should have a TF-IDF value bigger than 0. This condition is checked. ... ok
test_similar (__main__.tfidftestwlemma)
In the test and train datasets, "Machine Learning" is used as a bigram. "Machine" and "Learning" words are not used in any of the datasets alone. ... ok
test_stopwordsnotexist (__main__.tfidftestwlemma)
As the TF-IDF vectorizer uses English stop word removal, the absence of these words is checked in the created TF-IDF DataFrame. ... ok
test_teststr (__main__.tfidftestwlemma)
The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the test set. ... ok
test_trainstr (__main__.tfidftestwlemma)
The type of the variables inside of the list

<unittest.main.TestProgram at 0x7fdf32b53090>

In [None]:
import unittest

class tfidftestwlemma_1(unittest.TestCase):

    """
    This is the class created for Unit Testing for the TFIDF class with lemmatization (tfidfwlemma) shown above.
    """

    def setUp(self):
        """
        In the setUp part, the tfidfwlemma class is created with "train" as the train set and "test_1" as the test set.
        The tfidfwlemma class also builds its pandas DataFrame on construction.
        Attributes:
        lemmatrain, lemmatest: lemmatized forms of the words in the train and test datasets, respectively.
        """
        self.cltfidf = tfidfwlemma(train, test_1)
        # Split every document into sentences and collect the lemma of every spaCy token.
        self.lemmatrain = [tok.lemma_ for doc in train for s in sent_tokenize(doc) for tok in nlp(s)]
        self.lemmatest = [tok.lemma_ for doc in test_1 for s in sent_tokenize(doc) for tok in nlp(s)]

    def test_trainstr(self):
        """
        The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the train set.
        """
        for line in self.cltfidf.train:
            self.assertIsInstance(line, str)

    def test_teststr(self):
        """
        The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the test set.
        """
        for line in self.cltfidf.test:
            self.assertIsInstance(line, str)

    def test_stopwordsnotexist(self):
        """
        As the TF-IDF vectorizer uses English stop word removal, the absence of these words is checked in the created TF-IDF DataFrame.
        """
        for word in stopwords.words("english"):
            self.assertNotIn(word, self.cltfidf.df.index)

    def test_similar(self):
        """
        In the test and train datasets, "Machine Learning" is used as a bigram. "Machine" and "Learning" words are not used in any of the datasets alone.
        So, it is expected that they have equal TF-IDF values.
        """
        self.assertAlmostEqual(self.cltfidf.df.loc["machine", "TF-IDF"], self.cltfidf.df.loc["learning", "TF-IDF"])

    def test_equalzero(self):
        """
        A non used word in the test set is expected to have a 0 as the TF-IDF score.
        """
        self.assertEqual(self.cltfidf.df.loc["recognition", "TF-IDF"], 0)

    def test_forexistinginboth(self):
        """
        For the words' lemmatization forms that exist in both the train and test datasets, they should have a TF-IDF value bigger than 0. This condition is checked.
        """
        english_stopwords = set(stopwords.words("english"))
        train_lemmas = set(self.lemmatrain)
        for lemma in self.lemmatest:
            # Only alphanumeric, non-stop-word lemmas seen in the training set are checked.
            if lemma.lower() not in english_stopwords and lemma.isalnum() and lemma in train_lemmas:
                self.assertGreater(self.cltfidf.df.loc[lemma.lower(), "TF-IDF"], 0)

In [None]:
unittest.main(argv=[''], defaultTest='tfidftestwlemma_1', verbosity=2, exit=False)

test_equalzero (__main__.tfidftestwlemma_1)
  "The parameter 'token_pattern' will not be used"
  % sorted(inconsistent)
ok
test_forexistinginboth (__main__.tfidftestwlemma_1)
For the words' lemmatization forms that exist in both the train and test datasets, they should have a TF-IDF value bigger than 0. This condition is checked. ... ok
test_similar (__main__.tfidftestwlemma_1)
In the test and train datasets, "Machine Learning" is used as a bigram. "Machine" and "Learning" words are not used in any of the datasets alone. ... ok
test_stopwordsnotexist (__main__.tfidftestwlemma_1)
As the TF-IDF vectorizer uses English stop word removal, the absence of these words is checked in the created TF-IDF DataFrame. ... ok
test_teststr (__main__.tfidftestwlemma_1)
The type of the variables inside of the lists that are used as the inputs for the tfidfku class should be str. This condition is checked for the test set. ... ok
test_trainstr (__main__.tfidftestwlemma_1)
The type of the variables inside

<unittest.main.TestProgram at 0x7fdf2e406710>

# Conclusion
The unit tests show that both TF-IDF classes, with and without lemmatization, work as intended. The edge cases are handled, as the unit tests raise no errors. The results change whenever new text that differs from the training data arrives, but the classes handle this and report the TF-IDF values of the known words in the DataFrame. Moreover, the results of the classes with and without lemmatization differ. For example, the test_1 dataset contains the word "walking", while the training dataset contains the word "walk". When lemmatization is applied, the TF-IDF class takes "walk" as a keyword and lists it in the DataFrame with a value higher than 0. If the user does not need separate TF-IDF values for the individual inflected forms of a word, the TF-IDF class with lemmatization as the tokenizer could be the better choice. However, because spaCy is used as the natural-language processor for English, the class with lemmatization takes longer to run than the class without it. This can become a problem for bigger datasets, in which case the TF-IDF class without lemmatization could be the preferred option.