    Lingua::JA::WebIDF - WebIDF calculator

      use Lingua::JA::WebIDF;

      my $webidf = Lingua::JA::WebIDF->new(%config);

      print $webidf->idf("東京"); # low
      print $webidf->idf("スリジャヤワルダナプラコッテ"); # high

    Lingua::JA::WebIDF calculates WebIDF weight.

    The WebIDF (Inverse Document Frequency) weight represents the rarity
    of a word on the Web. The WebIDF weight of a rare word is high.
    Conversely, the WebIDF weight of a common word is low.

    IDF is based on the intuition that a query term which occurs in many
    documents is not a good discriminator and should be given less weight
    than one which occurs in few documents.

  new( %config || \%config )
    Creates a new Lingua::JA::WebIDF instance.

    The following configuration is used if you don't set %config.

      KEY                 DEFAULT VALUE
      -----------         ---------------
      idf_type            1
      api                 'YahooPremium'
      appid               undef
      driver              'TokyoCabinet'
      df_file             './df.tch'
      fetch_df            0
      expires_in          365
      documents           250_0000_0000
      Furl_HTTP           undef
      verbose             1

    idf_type => 1 || 2 || 3
        Type 1 is the most commonly cited form of IDF.

                           N
          idf(t_i) = log -----  (1)
                          n_i

          N  : the number of documents
          n_i: the number of documents which contain term t_i
          t_i: term

        Type 2 is a simple version of the RSJ weight.

                           N - n_i + 0.5
          idf(t_i) = log ----------------  (2)
                            n_i + 0.5

        Type 3 is a modification of (2).

                           N + 0.5
          idf(t_i) = log -----------  (3)
                          n_i + 0.5

    api => 'Yahoo' || 'YahooPremium'
        Uses the specified Web API to fetch the WebDF (Document Frequency).

    driver => 'Storable' || 'TokyoCabinet'
        Fetches and saves WebDF with the specified driver.

    df_file => $path
        Saves WebDF to the specified path.

        In order to reduce access to Web API, please download a big df file
        from <>.

        I recommend using a separate file for each Web API you specify,
        because the WebDF may differ from one API to another.

    fetch_df => 0
        Never fetches the WebDF from the Web if 0 is specified.

        If the WebDF you want has already been saved, it is used.
        Otherwise, undef is returned.

    expires_in => $days
        If 365 is specified, a WebDF expires 365 days after it is fetched.

    Furl_HTTP => \%option
        Sets the options of Furl::HTTP->new.

        If you want to use a proxy server, you have to use this option.

    verbose => 1 || 0
        If 1 is specified, shows verbose error messages.
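
    The three idf_type formulas can be compared with a few lines of plain
    Perl. This is only a sketch: idf_weight is a hypothetical helper, not
    part of this module, and the use of the natural logarithm (Perl's
    built-in log) is an assumption, since the formulas do not state a base.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::JA::WebIDF.
# Evaluates formulas (1)-(3); the natural logarithm is an assumption.
# $N is the number of documents, $n the document frequency of the term.
sub idf_weight {
    my ( $type, $N, $n ) = @_;
    return log( $N / $n )                          if $type == 1;
    return log( ( $N - $n + 0.5 ) / ( $n + 0.5 ) ) if $type == 2;
    return log( ( $N + 0.5 ) / ( $n + 0.5 ) )      if $type == 3;
    die "unknown idf_type: $type";
}

my $N = 250_0000_0000;    # the default 'documents' value (25 billion)

# Compare a rare term with a common one under each formula.
for my $n ( 1_000, 10_0000_0000 ) {
    printf "n_i=%d: (1) %.3f  (2) %.3f  (3) %.3f\n",
        $n, map { idf_weight( $_, $N, $n ) } 1 .. 3;
}
```

    Note that formula (1) is undefined when n_i is 0, whereas the 0.5
    terms in (2) and (3) keep the denominator positive.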

  idf( $word )
    Calculates the WebIDF weight of $word via the df($word) method.

  df( $word )
    Fetches the WebDF of $word.

    If the WebDF of $word has not been saved yet or has expired, fetches it
    by using the Web API you specified and saves it.

    If the WebDF of $word has expired and fetch_df is 0, the expired WebDF
    is used.
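
    The interplay between saved values, expiry, and fetch_df can be
    sketched in plain Perl. Everything here is hypothetical and is not the
    module's actual implementation: lookup_df, the in-memory hash used as
    a cache, the 'now' and 'fetch' options, and the choice to treat a
    value exactly expires_in days old as still fresh are all assumptions
    made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sketch of the lookup rules; not the module's real code.
# $cache maps a word to { df => $count, time => $epoch_seconds }.
# $opt{fetch} is a code reference standing in for a Web API call.
sub lookup_df {
    my ( $cache, $word, %opt ) = @_;
    my $entry   = $cache->{$word};
    my $max_age = $opt{expires_in} * 24 * 60 * 60;    # days to seconds
    my $fresh   = $entry && ( $opt{now} - $entry->{time} <= $max_age );

    # A saved, unexpired value is always used as-is.
    return $entry->{df} if $fresh;

    if ( $opt{fetch_df} ) {
        # Missing or expired: fetch a new value and save it.
        my $df = $opt{fetch}->($word);
        $cache->{$word} = { df => $df, time => $opt{now} };
        return $df;
    }

    # fetch_df is 0: reuse an expired value if one exists, else undef.
    return $entry ? $entry->{df} : undef;
}

my $now   = time;
my %cache = ( old => { df => 42, time => $now - 400 * 86_400 } );
my %opt   = ( expires_in => 365, now => $now, fetch => sub { 7 } );

# Expired entry with fetching disabled: the stale value is reused.
print lookup_df( \%cache, 'old', %opt, fetch_df => 0 ), "\n";

# Unknown word with fetching disabled: nothing to return.
print defined( lookup_df( \%cache, 'new', %opt, fetch_df => 0 ) )
    ? "found\n"
    : "undef\n";

# Expired entry with fetching enabled: refetched and saved.
print lookup_df( \%cache, 'old', %opt, fetch_df => 1 ), "\n";
```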

  db_open( $mode )
    Opens the database file which is located in $path.

    If you use TokyoCabinet, you have to open the database file via this
    method before calling the idf, df, db_close, or purge method.

    $mode is 'read' or 'write'.

  db_close
    Closes the database file which is located in $path.

    This method is called automatically when the object is destroyed, so you
    might not need to use this method explicitly.

  purge( $expires_in )
    Purges old data in df_file.

    If 365 is specified, data older than 365 days is purged.

    pawa <>



    Yahoo API: <>

    Tokyo Cabinet: <>

    S. Robertson, Understanding inverse document frequency: on theoretical
    arguments for IDF. Journal of Documentation 60, 503-520, 2004.

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.
