Functions and Interfaces group

Benjamin Ooghe-Tabanou edited this page Dec 21, 2012 · 1 revision

Members of this group

Tommaso, Andrews, Fabien, Alexis, Paolo, Daniele, Donato, Julien

Subjects

  • Navicrawler next version interfaces
  • Cartographic exploration: content + topology

User(s)

  • Social sciences researchers
      who want to use the web as a source of data for their research
  • Librarians
      who want to build a collection of websites, make it available to the library's users (students, teachers, researchers), make it harvestable by national or international repositories such as CERIMES (http://www.signets-universites.fr/) via the OAI-PMH protocol, and make it searchable from the library portal via the SRU protocol.
  • Issue experts (or aspiring to be)
      who want to confirm or discover the topology of the web discussion about their issues

Functions

  • Set up of the corpus
    • Manage corpora/projects
      • Give the corpus a name
      • Create a new corpus or choose to feed an existing corpus (if allowed)
      • Choose who has the right to feed a corpus
    • Define keywords and stopwords
    • Define the entry points
      • Scrape hyperlinks from any copy-pasted text (URL harvester from DMI tools)
      • Import existing (and previously exported) PEACE corpora
      • (Connect to other existing archives) (not in the prototype)
    • Define the granularity of the corpus [domains,]
      • Define heuristics (BUT be aware of the methodological problems and define their granularity)
    • Choose the default classification action (undefined/excluded/included)
  • Management of the codebook
    • Attribute a title to the codebook
    • Import/export the codebook with notice
    • Create 'qualification terms'
    • Group qualification terms, deciding whether each group is a
      • tag group (not exclusive inside the group)
      • partitions (exclusive inside the group)
  • Corpus
    • Compact view
      • URL (ordered by number of incoming links from the included entities)
      • filter the entities by
        • status
    • Full view (spread-sheet for viewing and editing the corpus)
      • URL (depending on the level of granularity)
      • title (by default granularity)
      • status (included/excluded/undefined)
      • groups of tags or partitions
      • incoming links
      • outgoing links
      • who added the web entity to the corpus
      • all the graph indicators calculated by the server
  • Assisted navigation
    • Define the web entity status (included/excluded/undefined)
    • Assign qualification terms to the web entity
    • Assign a name to the web entity (the default is based on the level of granularity)
  • Visual
    • graph visualization and navigation
  • Questions
    • search / time
    • difference in set-up / explore?
  • Granularity, stems and Web entities
    • E.g. google.com/images/pageA.html
    • Blocks that constitute URLs are called stems
      • E.g. {com, com.google, com.google.images}
    • Web entities are typically the nodes in the graph of your corpus
    • Graph of stemmed URLs (Reverse URLs, arranged from least to most specific part):
      • d:com.d:google.p:images.p:pageA.html
    • Advantages: 1) easy to redefine Web entity without having to recalculate everything 2) forces social scientists to think about websites
    • Granularity: how far to go in URL?
    • Web entities are relevant only for visualization and possibly selection (e.g. co-link)
  • User - UI - CVS for corpora (local or online?) - ... - ARC - WWW
  • Assisted crawling
    • starting point(s): URLs
    • One iteration:
      • Snowball Crawl
      • Define Web entities (choose granularity and specificity) based on the reverse URL pattern.
      • Limit the corpus (define boundaries and throw away rest of discovered URLs)
        • manually (accept/reject)
        • reverse URL scheme should also be used to define blacklist
        • co-link
        • etc.
    • Repeat from "One iteration"
    • Qualify
    • Analyse
  • Sharing corpora
    • Integrating different corpora in a single repository
    • Presentation / mapping
    • Harvesting by OAI-PMH protocol
    • Search by SRU protocol
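
The "Scrape hyperlinks from any copy-pasted text" entry point could look roughly like the sketch below. It is not the DMI URL harvester itself: the function name `harvest_urls`, the regular expression, and the trailing-punctuation rule are assumptions made for illustration, and only http/https links are recognised.

```python
import re

# Assumption: a deliberately simple pattern; it stops at whitespace,
# angle brackets and quotes, which covers most pasted prose and HTML.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def harvest_urls(text):
    """Return the distinct URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.rstrip(".,;:!?)]}")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    pasted = "See http://example.org/a, and (https://example.com/b)."
    for url in harvest_urls(pasted):
        print(url)
```

The resulting list can serve as the starting points of a crawl; deduplicating at this stage keeps the same entry point from being queued twice.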
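
The reverse URL notation from "Granularity, stems and Web entities" can be sketched as follows. The helper names (`reverse_url_tokens`, `stems`, `web_entity_for`) are hypothetical, and representing a Web entity as a declared token prefix is an assumption about how granularity might be stored, not a description of the prototype.

```python
from urllib.parse import urlsplit


def reverse_url_tokens(url):
    """Split a URL into typed tokens, from least to most specific part."""
    # urlsplit only recognises the host when the URL has a "//" marker.
    if "//" not in url:
        url = "//" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    domain_tokens = ["d:" + label for label in reversed(host.split(".")) if label]
    path_tokens = ["p:" + segment for segment in parts.path.split("/") if segment]
    return domain_tokens + path_tokens


def stems(tokens):
    """Return every prefix of the token list, shortest first."""
    return [tuple(tokens[:i]) for i in range(1, len(tokens) + 1)]


def web_entity_for(tokens, entity_prefixes):
    """Return the longest declared prefix matching the tokens, or None."""
    best = None
    for prefix in entity_prefixes:
        prefix = tuple(prefix)
        if tuple(tokens[:len(prefix)]) == prefix:
            if best is None or len(prefix) > len(best):
                best = prefix
    return best


if __name__ == "__main__":
    tokens = reverse_url_tokens("google.com/images/pageA.html")
    print(".".join(tokens))
    # Changing granularity means editing this list, not re-crawling.
    entities = [("d:com", "d:google"), ("d:com", "d:google", "p:images")]
    print(web_entity_for(tokens, entities))
```

Because pages are stored as token lists, redefining a Web entity only changes the list of declared prefixes; the stored URLs stay untouched, which is the first advantage noted above.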
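
One iteration of the "Assisted crawling" loop, reduced to an in-memory link graph, might look like the sketch below. The `links` dictionary stands in for a real crawler, and reading "co-link" as "linked from at least two included entities" is an assumption; the threshold `min_sources` is hypothetical and would need to be chosen per corpus.

```python
from collections import Counter


def snowball_step(included, links):
    """Count, for each newly discovered target, how many included sources link to it."""
    counts = Counter()
    for source in included:
        # set() so that repeated links from one source count only once.
        for target in set(links.get(source, ())):
            if target not in included:
                counts[target] += 1
    return counts


def colink_filter(counts, min_sources=2):
    """Keep only targets linked from at least min_sources included entities."""
    return {target for target, n in counts.items() if n >= min_sources}


def rank_candidates(counts):
    """Order candidates by incoming links from included entities, highest first."""
    return sorted(counts, key=lambda target: (-counts[target], target))


if __name__ == "__main__":
    links = {"a": ["c", "d"], "b": ["c", "a"], "c": ["e"]}
    included = {"a", "b"}
    counts = snowball_step(included, links)
    print(rank_candidates(counts))
    # Accept the co-linked candidates, then repeat with the larger set.
    included |= colink_filter(counts)
    print(sorted(included))
```

The ranking mirrors the compact view, where URLs are ordered by the number of incoming links from included entities; manual accept/reject or a blacklist could replace `colink_filter` in the same position of the loop.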