WebIDF calculator
Latest commit ee210bf Oct 7, 2012 @pawa- updated to Ver.0.43


    Lingua::JA::WebIDF - WebIDF calculator

      use Lingua::JA::WebIDF;

      my $webidf = Lingua::JA::WebIDF->new(%config);

      print $webidf->idf("東京"); # low
      print $webidf->idf("スリジャヤワルダナプラコッテ"); # high

    Lingua::JA::WebIDF calculates WebIDF weight.

    WebIDF (Inverse Document Frequency) weight represents the rarity of a
    word on the Web. The WebIDF weight of a rare word is high. Conversely,
    the WebIDF weight of a common word is low.

    IDF is based on the intuition that a query term which occurs in many
    documents is not a good discriminator and should be given less weight
    than one which occurs in few documents.

  new( %config || \%config )
    Creates a new Lingua::JA::WebIDF instance.

    The following configuration is used if you don't set %config.

      KEY                 DEFAULT VALUE
      -----------         ---------------
      idf_type            1
      api                 'YahooPremium'
      appid               undef
      driver              'TokyoCabinet'
      df_file             './df.tch'
      fetch_df            0
      expires_in          365
      documents           250_0000_0000
      Furl_HTTP           undef
      verbose             1

    idf_type => 1 || 2 || 3
        Type 1 is the most commonly cited form of IDF.

                           N
          idf(t_i) = log -----  (1)
                          n_i

          N  : the number of documents
          n_i: the number of documents which contain term t_i
          t_i: term

        Type 2 is a simplified version of the RSJ (Robertson/Sparck Jones)
        weight.

                           N - n_i + 0.5
          idf(t_i) = log ----------------  (2)
                            n_i + 0.5

        Type 3 is a modification of (2).

                           N + 0.5
          idf(t_i) = log -----------  (3)
                          n_i + 0.5

    api => 'Yahoo' || 'YahooPremium'
        Uses the specified Web API when fetching WebDF (Document Frequency).

    driver => 'Storable' || 'TokyoCabinet'
        Fetches and saves WebDF with the specified driver.

    df_file => $path
        Saves WebDF to the specified path.

        To reduce access to the Web API, please download a big df file
        from <>.

        I recommend that you use a different file for each type of Web
        API you specify, because WebDF may differ between APIs.

    fetch_df => 0
        Never fetches WebDF from the Web if 0 is specified.

        If the WebDF you want has already been saved, it is used.
        Otherwise, undef is returned.

    expires_in => $days
        If 365 is specified, a WebDF expires 365 days after it is fetched.

    Furl_HTTP => \%option
        Sets the options of Furl::HTTP->new.

        If you want to use a proxy server, you have to use this option.

    verbose => 1 || 0
        If 1 is specified, shows verbose error messages.
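
    As a rough illustration of the three idf_type formulas above, the
    following standalone sketch computes each weight in plain Perl. The
    helper name idf_sketch and the sample document frequencies are made up
    for this example, and it assumes the natural logarithm; it is not the
    module's own implementation.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of Lingua::JA::WebIDF.
# $type: 1, 2 or 3 (mirrors idf_type)
# $N   : the number of documents
# $n   : the number of documents which contain the term
# Assumes the natural logarithm (Perl's built-in log).
sub idf_sketch {
    my ($type, $N, $n) = @_;

    return log( $N / $n )                      if $type == 1;
    return log( ($N - $n + 0.5) / ($n + 0.5) ) if $type == 2;
    return log( ($N + 0.5) / ($n + 0.5) )      if $type == 3;

    die "unknown idf_type: $type\n";
}

my $N = 250_0000_0000;   # default value of 'documents'

# Made-up document frequencies: a common term and a rare term.
for my $n (1_000_000_000, 1_000) {
    printf "n_i = %-13d type1 = %.3f  type2 = %.3f  type3 = %.3f\n",
        $n, map { idf_sketch($_, $N, $n) } 1 .. 3;
}
```

    Formula (2) becomes negative once a term occurs in more than half of
    the documents, whereas (3) stays non-negative for any n_i up to N.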

  idf( $word )
    Calculates the WebIDF weight of $word via the df($word) method.

  df( $word )
    Fetches the WebDF of $word.

    If the WebDF of $word has not been saved yet or has expired, fetches it
    by using the Web API you specified and saves it.

    If the WebDF of $word has expired and fetch_df is 0, the expired WebDF
    is used.
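
    The lookup rules above, combined with the fetch_df and expires_in
    options, can be summarized in a small decision function. This is only a
    sketch: df_action, the record fields and the returned labels are
    hypothetical names chosen for this example, not the module's internals.

```perl
use strict;
use warnings;

# Hypothetical sketch of the lookup rules; names are illustrative and are
# not taken from Lingua::JA::WebIDF.
# $entry     : undef, or { df => $count, fetched_at => $epoch_seconds }
# $fetch_df  : 0 = never access the Web API, 1 = fetch when needed
# $expires_in: lifetime of a saved WebDF in days
# Returns 'saved' (use the stored value), 'fetch' (call the Web API and
# save the result) or 'undef' (nothing available).
sub df_action {
    my ($entry, $fetch_df, $expires_in, $now) = @_;

    my $expired = defined $entry
        && ($now - $entry->{fetched_at}) > $expires_in * 24 * 60 * 60;

    return 'saved' if defined $entry && !$expired;
    return 'fetch' if $fetch_df;
    return defined $entry ? 'saved' : 'undef';   # an expired value is still used
}

my $now   = time;
my $fresh = { df => 123, fetched_at => $now -  10 * 86400 };
my $stale = { df => 123, fetched_at => $now - 400 * 86400 };

print df_action($fresh, 1, 365, $now), "\n";   # saved
print df_action($stale, 1, 365, $now), "\n";   # fetch
print df_action($stale, 0, 365, $now), "\n";   # saved
print df_action(undef,  0, 365, $now), "\n";   # undef
```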

  db_open( $mode )
    Opens the database file located at $path (the df_file setting).

    If you use TokyoCabinet, you have to open the database file via this
    method before the idf, df, db_close, or purge method is called.

    $mode is 'read' or 'write'.

  db_close
    Closes the database file located at $path (the df_file setting).

    This method is called automatically when the object is destroyed, so you
    might not need to use this method explicitly.

  purge( $expires_in )
    Purges old data in df_file.

    If 365 is specified, data older than 365 days are purged.
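
    To make the purge rule concrete, here is a minimal sketch that applies
    it to an in-memory hash standing in for df_file. The record layout
    (df, fetched_at), the function name purge_sketch and the sample terms
    are assumptions made for this example; the module's real storage format
    may differ.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for df_file: term => { df, fetched_at }.
# The record layout is an assumption made for this sketch only.
sub purge_sketch {
    my ($store, $expires_in, $now) = @_;
    my $limit = $expires_in * 24 * 60 * 60;   # days to seconds

    # Collect the terms whose data are older than the limit, then drop them.
    my @old = grep { $now - $store->{$_}{fetched_at} > $limit } keys %$store;
    delete @{$store}{@old};

    return scalar @old;   # number of purged entries
}

my $now   = time;
my %store = (
    tokyo => { df => 1_000_000, fetched_at => $now - 400 * 86400 },
    kyoto => { df =>   500_000, fetched_at => $now -  30 * 86400 },
);

my $purged = purge_sketch(\%store, 365, $now);
print "purged $purged, remaining: @{[ sort keys %store ]}\n";
# prints "purged 1, remaining: kyoto"
```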

    pawa <>



    Yahoo API: <>

    Tokyo Cabinet: <>

    S. Robertson, Understanding inverse document frequency: on theoretical
    arguments for IDF. Journal of Documentation 60, 503-520, 2004.

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.
