
Functions


HCI main functions

To build

  • Link - Selection of the web corpus boundaries by including, excluding and defining/categorizing [web entities](Web entities);
  • Contents - Qualification of the [web entities](Web entities) according to a user-specific code book.

To harvest

  • Link - harvest the topology and trace implicit links;
  • Contents - crawling, harvesting some or all of the contents of the corpus' web entities.

To index

  • Links - store, index, pre-compute and provide access to the topology of the corpus;
  • Contents - store, parse, index and provide access to the contents of the corpus.

To explore

  • Links - Exploratory Data Analysis tools over the graph of the corpus;
  • Contents - a focused search engine on the corpus.

To maintain

  • Links - scheduling periodic crawling of the corpus to detect the evolution of its topology;
  • Contents - scheduling periodic harvesting to detect and harvest the diff in the contents.

To preserve

  • Contents - assuring the reliability of the data through time.

To share/publish

  • Links - an export utility to drive specific network analysis in existing tools like Gephi (GEXF format; see the sketch below);
  • Contents - an export utility or API to share the constituted web archive to form a web of web archives.
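
A minimal sketch of such a GEXF export, assuming the corpus graph is held in networkx; the node and edge attributes below are illustrative placeholders, not the tool's actual schema. Gephi can open the resulting file directly.

```python
# Hedged sketch: write a small corpus graph to GEXF for Gephi.
# Node/edge attributes are illustrative, not Hyphen's real schema.
import networkx as nx

g = nx.DiGraph()
g.add_node("we:1", label="example.org", status="included")
g.add_node("we:2", label="example.net", status="undefined")
g.add_edge("we:1", "we:2", weight=3)  # aggregated hyperlinks between two web entities

nx.write_gexf(g, "corpus.gexf")  # GEXF file ready to be opened in Gephi
```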

Hyphen user functions

Export/import

  • Export/import (replace) corpus parameters
  • Export/import (merge) corpus codebook
  • Export/import (replace or merge?) corpus web-entities
    • Limit export/import to included/excluded/undefined web-entities
    • Choose the export format (GDF, GEXF, CSV...)

Multi-user management

  • Create user accounts for a specific corpus
  • Choose who has the right to feed a corpus
  • Open a multi-user corpus
  • Open a mono-user corpus
  • Complete corpus user management (read, write, etc.) (see corpus versioning system)

Manage corpora/projects

  • Create and name new corpus
    • Give the corpus a series of metadata (description, keywords, author...) 
  • Choose to feed an existing corpus (if allowed)
  • Duplicate and rename a corpus
  • Set the parameters for a corpus (modifying the default configuration)
    • Define the web entity delimitation policy (URL → web entity)
    • Define blacklists/exceptions/heuristics (but be aware of the methodological problems and define their granularity)

Management of the codebook

Functions that may be merged with qualifying the web entities.

  • Create and name a new codebook
    • Give the codebook a series of metadata (description, keywords, author...)
  • Backup codebook
  • Export codebook
  • Delete codebook (reset?)
  • Edit codebook
    • Create and delete 'qualification terms'
    • Move qualification terms
      • Batch-tag web entities (from one tag to another)
  • Group and ungroup qualification terms, deciding whether they are (see the sketch after this list):
    • tag groups (not exclusive inside the group)
    • partitions (exclusive inside the group)
    • and whether the tag is public or private (workflow-oriented or "system tags")
  • Import tags from existing taxonomies (semantic web, experts, archiving/library standards)
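
As a sketch of the tag group / partition distinction above (the class and field names are hypothetical, not Hyphen's actual data model): a partition keeps at most one of its terms per entity, while a tag group does not.

```python
# Hypothetical codebook group: exclusive = partition, non-exclusive = tag group.
class QualificationGroup:
    def __init__(self, name, terms, exclusive=False, public=True):
        self.name = name            # e.g. "Actor type"
        self.terms = set(terms)     # qualification terms belonging to the group
        self.exclusive = exclusive  # True -> partition, False -> tag group
        self.public = public        # private groups hold workflow or "system" tags

    def assign(self, entity_tags, term):
        """Add `term` to an entity's tag set, enforcing exclusivity for partitions."""
        if term not in self.terms:
            raise ValueError(f"{term!r} is not a term of group {self.name!r}")
        if self.exclusive:
            entity_tags -= self.terms  # a partition allows at most one of its terms
        entity_tags.add(term)

actor_type = QualificationGroup("Actor type", {"NGO", "media", "government"}, exclusive=True)
topics = QualificationGroup("Topics", {"climate", "energy"})

tags = set()
actor_type.assign(tags, "NGO")
topics.assign(tags, "climate")
actor_type.assign(tags, "media")  # replaces "NGO" because the group is a partition
print(sorted(tags))               # ['climate', 'media']
```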

Incorporate web-entities

  • Add entities manually (these entities are flagged as 'manually added')
    • Simple add
    • Multiple add
  • Track navigation (these entities are flagged as 'added by navigation')
    • Start/pause tracking
  • Semi-automatic crawl (these entities are flagged as 'added by crawl')
    • Start/pause/stop the crawl 
    • Define the settings of the crawl
      • Choose entry points
        • Manually (scraping hyperlinks from any copied-pasted text)
        • Based on the whole corpus or a partition of it (by qualification terms)
      • Choose depth (in stemmed URLs)
      • Choose distance
      • Choose time interval in seconds / randomize (to avoid blacklisting; see the sketch after this list)
      • Filtering prospection (when expanding the corpus by crawl, e.g. by indegree (co-link) or other criteria)
      • Monitoring a crawl (depth balance, live graph...)
      • Focused crawling?
        • Define keywords and/or stopwords (these entities are flagged with the found keyword)
  • Scraping (these entities are flagged as 'added by scraping')
    • Add entities from a keyword search (Google, Bing, Twitter, etc. → modules)
    • Add entities from a link: or inanchor: search
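
A minimal sketch of the "time interval / randomize" setting mentioned above: a randomized politeness delay between requests. The entry points and delay bounds are placeholder values, not project defaults.

```python
# Hedged sketch: fetch entry points with a randomized delay to avoid blacklisting.
import random
import time
import urllib.request

entry_points = ["https://example.org/", "https://example.net/"]
min_delay, max_delay = 2.0, 5.0  # seconds between requests (illustrative values)

for url in entry_points:
    with urllib.request.urlopen(url) as response:
        html = response.read()
    print(url, len(html), "bytes")
    time.sleep(random.uniform(min_delay, max_delay))  # randomized politeness delay
```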

Qualify web-entities


By default, entities are pending; a "deleted" status is also possible.

  • Define the web entity status (included/excluded/undefined)
  • Assign a name to the web entity (the default is based on the level of granularity? Page Title?)
  • Manually assign qualification-terms to web-entities during navigation
    • Hide/show qualification groups
    • Suggest tags or qualifications
  • Define a qualification for a set of web-entities (manual, navigation or crawl)
    • Set the status (undefined/excluded/included)
    • Set the qualification-term
  • Automatically fill meta-information on each web-entity (see the sketch after this list)
    • URL (depending on the level of granularity)
    • Default title if not manually changed
    • Incoming links
    • Outgoing links
      • Nature of the link (Flash, JavaScript, etc.)
    • Source (manually added, added by navigation, added by crawl, added by scraping)
    • Who added the web entity to the corpus
    • Date stamp (e.g. date of addition)
    • All the graph indicators
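
One possible shape for this automatically filled meta-information, as a sketch only; the field names follow the bullets above but are an assumption, not the actual storage schema.

```python
# Hypothetical record of a web entity's auto-filled meta-information.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class WebEntityMeta:
    url: str                         # depends on the chosen level of granularity
    title: str                       # default title unless manually changed
    status: str = "pending"          # pending / included / excluded / undefined / deleted
    incoming_links: int = 0
    outgoing_links: int = 0
    link_natures: dict = field(default_factory=dict)       # e.g. {"html": 12, "flash": 1}
    source: str = "manual"           # manual / navigation / crawl / scraping
    added_by: str = ""               # who added the web entity to the corpus
    added_on: datetime = field(default_factory=datetime.now)  # date stamp
    graph_indicators: dict = field(default_factory=dict)   # degree, betweenness, ...

entity = WebEntityMeta(url="example.org", title="Example", source="crawl", added_by="analyst")
print(entity.status, entity.added_on.isoformat())
```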

View the corpus

  • Corpus overview (showing the main information on entities) / full spreadsheet view (showing all information on entities)
    • Sort entities
    • Filter entities
    • Search entities (simple or advanced)
  • Real-time graph of the corpus
    • Color nodes according to status or classification
    • Rank nodes according to incoming/outgoing links or graph indicators
    • Choose node label (URL, title, status)
    • Choose layout algorithm
  • Slideshow of the corpus or part of it (open subsequently in a tab or in multiple tabs)

Edit the corpus entities

  • Create/delete web-entities
  • Split web entities
  • Edit the status of a set of web-entities
  • Merge web-entities (Management of aliases)
  • Create links among web-entities (difficult)
  • Edit the qualifications of a set of web-entities
  • Edit the auto-information of a set of web-entities
  • Attach extra information to entities (e.g. geographic coordinates)

TO DELETE ?

  • Define web entities (choose granularity and specificity) based on a reverse URL pattern.
  • Limit the corpus (define boundaries and throw away the rest of the discovered URLs)
  • Manually (accept/reject)
  • The reverse URL scheme should also be used to define blacklists

Questions

  • search / time
  • difference in set-up / explore?

Granularity, stems and web entities

  • E.g. google.com/images/pageA.html
  • Blocks that constitute URLs are called stems
  • E.g. {com, com.google, com.google.images}
  • Web entities are typically the nodes in the graph of your corpus
  • Graph of stemmed URLs (Reverse URLs, arranged from least to most specific part):
  • d:com.d:google.p:images.p:pageA.html
  • Advantages:
    • easy to redefine a web entity without having to recalculate everything
    • forces social scientists to think about websites
  • Granularity: how far to go in the URL?
  • Web entities are relevant only for visualization and possibly selection (e.g. co-link)
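
A minimal sketch of the stemming described above, reproducing the google.com/images/pageA.html example; the d:/p: prefixes mirror the notation used on this page, and the function name is illustrative.

```python
# Hedged sketch: split a URL into reverse stems, least to most specific.
from urllib.parse import urlparse

def reverse_url_stems(url):
    parsed = urlparse(url if "://" in url else "http://" + url)
    host_stems = ["d:" + part for part in reversed(parsed.hostname.split("."))]
    path_stems = ["p:" + part for part in parsed.path.split("/") if part]
    return host_stems + path_stems

print(".".join(reverse_url_stems("google.com/images/pageA.html")))
# d:com.d:google.p:images.p:pageA.html
```

A web entity can then be defined as a prefix of this stem list, e.g. the whole site (d:com.d:google) or only its images section (d:com.d:google.p:images), without recrawling anything.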

User - UI - CVS

for corpora (local or online?) - ... - ARC - WWW

Sharing corpora

  • Integrating different corpora in a single repository
  • Presentation / mapping
  • Harvesting via the OAI-PMH protocol (see the sketch below)
  • Search via the SRU protocol
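
A minimal sketch of an OAI-PMH harvesting request, assuming a repository exposing the standard ListRecords verb; the endpoint URL is a placeholder, not a real repository.

```python
# Hedged sketch: issue an OAI-PMH ListRecords request and read the XML response.
import urllib.parse
import urllib.request

endpoint = "https://example.org/oai"  # placeholder OAI-PMH endpoint
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

with urllib.request.urlopen(endpoint + "?" + urllib.parse.urlencode(params)) as resp:
    xml = resp.read().decode("utf-8")

print(xml[:200])  # XML listing of records, followed by a resumptionToken for paging
```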