Functions and Interfaces group

Table of Contents Members of this group Subjects User(s) Functions

Members of this group

Tommaso, Andrews, Fabien, Alexis, Paolo, Daniele, Donato, Julien

Subjects

Navicrawler next version interfaces
Cartographic exploration: content + topology

User(s)

Social sciences researchers

      who want to use the web as a source of data for their research

Librarians

      who want to build a collection of websites, make it available to the library's users (students, teachers, researchers), make it harvestable by national or international repositories such as CERIMES (http://www.signets-universites.fr/) via the OAI-PMH protocol, and make it searchable from the library portal via the SRU protocol.

Issue experts (or aspiring to be)

      who want to confirm or discover the topology of the web discussion about their issues

Functions

Set up of the corpus
- Manage corpora/projects
  - Give the corpus a name
  - Create a new corpus or choose to feed an existing corpus (if allowed)
  - Choose who has the right to feed a corpus
- Define keywords and stopwords
- Define the entry points
  - Scrape hyperlinks from any copied-pasted text (url harvester from DMI tools)
  - Import existing (and previously exported) PEACE corpora
  - (Connect to other existing archives) (not in the prototype)
- Define the granularity of the corpus [domains,]
  - Define heuristics (BUT be aware of the methodological problems and define their granularity)
- Choose the default classification action (undefined/excluded/included)
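As an illustration of the "scrape hyperlinks from any copied-pasted text" entry point, here is a minimal sketch. It is not the DMI url harvester; the regular expression and the function name `harvest_urls` are assumptions made for this example, and the pattern is deliberately simple (http/https only).

```python
import re

# Assumed, simplified pattern: an http(s) URL runs until whitespace,
# angle brackets, parentheses or quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>()\"']+")


def harvest_urls(text):
    """Return the distinct URLs found in text, in order of first appearance."""
    found = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in found:
            found.append(url)
    return found


pasted = (
    "See http://www.signets-universites.fr/ and (https://example.org/a), "
    "also http://www.signets-universites.fr/."
)
entry_points = harvest_urls(pasted)
```

The resulting list could then be offered to the user as candidate entry points to accept or reject.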
Management of the codebook
- Attribute a title to the codebook
- Import/export the codebook with notice
- Create 'qualification terms'
- Group qualification terms, deciding whether they are
  - tag groups (not exclusive inside the group)
  - partitions (exclusive inside the group)
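The difference between tag groups and partitions can be sketched as a small data structure. The class names `TermGroup` and `Codebook`, and the `validate` method, are hypothetical and only illustrate the exclusivity rule described above.

```python
from dataclasses import dataclass, field


@dataclass
class TermGroup:
    name: str
    terms: set
    exclusive: bool  # True = partition, False = tag group


@dataclass
class Codebook:
    title: str
    groups: dict = field(default_factory=dict)

    def add_group(self, group):
        self.groups[group.name] = group

    def validate(self, group_name, assigned_terms):
        """Check the terms assigned to one web entity for one group."""
        group = self.groups[group_name]
        assigned = set(assigned_terms)
        unknown = assigned - group.terms
        if unknown:
            raise ValueError(f"unknown terms for {group_name}: {sorted(unknown)}")
        if group.exclusive and len(assigned) > 1:
            # A partition allows at most one term per web entity.
            raise ValueError(f"{group_name} is a partition: only one term allowed")
        return assigned


codebook = Codebook("Example codebook")
codebook.add_group(TermGroup("actor type", {"NGO", "media"}, exclusive=True))
codebook.add_group(TermGroup("topics", {"climate", "energy"}, exclusive=False))
```

With this sketch, assigning both `climate` and `energy` from the `topics` tag group is valid, while assigning both `NGO` and `media` from the `actor type` partition raises an error.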
Corpus
- Compact view
  - url (ordered by number of incoming links from the included entities)
  - filter the entities by
    - status
- Full view (spread-sheet for viewing and editing the corpus)
  - url (depending on the level of granularity)
  - title (by default granularity)
  - status (included/excluded/undefined)
  - groups of tags or partitions
  - incoming links
  - outgoing links
  - who added the web entity to the corpus
  - all the graph indicators calculated by the server
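A minimal sketch of the compact view ordering, assuming the corpus is available as a mapping from URL to status and a list of (source, target) links. The function name `compact_view` and the data layout are assumptions for illustration only.

```python
def compact_view(entities, links, status_filter=None):
    """List (url, in_links) rows, ordered by incoming links from included entities.

    entities: dict mapping url -> status ("included", "excluded", "undefined")
    links: iterable of (source_url, target_url) pairs
    status_filter: if given, keep only entities with that status
    """
    included = {url for url, status in entities.items() if status == "included"}
    in_links = {url: 0 for url in entities}
    for source, target in set(links):  # count each distinct link once
        if source in included and target in in_links and source != target:
            in_links[target] += 1
    rows = [
        (url, in_links[url])
        for url, status in entities.items()
        if status_filter is None or status == status_filter
    ]
    # Most-cited first; ties broken alphabetically for a stable display.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


entities = {
    "a.org": "included",
    "b.org": "included",
    "c.org": "undefined",
    "d.org": "excluded",
}
links = [
    ("a.org", "c.org"),
    ("b.org", "c.org"),
    ("d.org", "b.org"),  # ignored: d.org is not included
    ("a.org", "b.org"),
    ("c.org", "a.org"),  # ignored: c.org is not included
]
rows = compact_view(entities, links)
```

Counting only links that originate from included entities means an undefined entity such as `c.org` rises to the top when the accepted part of the corpus cites it often, which is one way to surface candidates for inclusion.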
Assisted navigation
- Define the web entity status (included/excluded/undefined)
- Assign qualification terms to the web entity
- Assign a name to the web entity (the default is based on the level of granularity)
Visual
- graph visualization and navigation
Questions
- search / time
- difference in set-up / explore?
Granularity, stems and Web entities
- E.g. google.com/images/pageA.html
- Blocks that constitute URLs are called stems
  - E.g. {com, com.google, com.google.images}
- Web entities are typically the nodes in the graph of your corpus
- Graph of stemmed URLs (Reverse URLs, arranged from least to most specific part):
  - d:com.d:google.p:images.p:pageA.html
- Advantages: 1) it is easy to redefine a Web entity without having to recalculate everything; 2) it forces social scientists to think about what a website is
- Granularity: how far to go in URL?
- Web entities are relevant only for visualization and possibly selection (e.g. co-link)
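The reverse URL notation above can be sketched in a few lines. The function names are hypothetical; the `d:` and `p:` prefixes follow the example given above, and everything else (ignoring scheme, port, query string and fragment) is a simplifying assumption.

```python
from urllib.parse import urlsplit


def url_to_stems(url):
    """Split a URL into stems, from least to most specific."""
    if "://" not in url:
        url = "http://" + url  # urlsplit only recognises the host after a scheme
    parts = urlsplit(url)
    host = parts.hostname or ""
    domain_stems = ["d:" + label for label in reversed(host.split(".")) if label]
    path_stems = ["p:" + segment for segment in parts.path.split("/") if segment]
    return domain_stems + path_stems


def reverse_url(url):
    """Join the stems into the dotted reverse URL form."""
    return ".".join(url_to_stems(url))


def web_entity(url, depth):
    """Truncate the reverse URL to `depth` stems: the chosen granularity."""
    return ".".join(url_to_stems(url)[:depth])


page = "google.com/images/pageA.html"
full = reverse_url(page)          # d:com.d:google.p:images.p:pageA.html
site_level = web_entity(page, 2)  # d:com.d:google
section_level = web_entity(page, 3)
```

Changing the granularity only changes the `depth` at which stored reverse URLs are cut, so Web entities can be redefined without reprocessing the crawled pages. Note that a path stem such as `pageA.html` itself contains a dot, so an implementation would probably keep the stem list rather than re-split the joined string.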
User - UI - CVS for corpora (local or online?) - ... - ARC - WWW
Assisted crawling
- starting point(s): URLs
- One iteration:
  - Snowball Crawl
  - Define Web entities (choose granularity and specificity) based on the reverse URL pattern.
  - Limit the corpus (define boundaries and throw away the rest of the discovered URLs)
    - manually (accept/reject)
    - the reverse URL scheme should also be used to define a blacklist
    - co-link
    - etc.
- GOTO One iteration
- Qualify
- Analyse
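The iteration loop above can be sketched without any network access by using an in-memory link graph as a stand-in for the crawler. All names here (`crawl_iteration`, `assisted_crawl`, `link_graph`, `accept`) are hypothetical; the manual accept/reject step is modelled as a callback, and co-link filtering is not implemented.

```python
def under_prefix(reverse_url, prefix):
    """True if reverse_url equals prefix or extends it by whole stems.

    Simplification: assumes stems start with "d:" or "p:" as in the
    reverse URL notation, and that path segments do not contain ".d:" or ".p:".
    """
    return (
        reverse_url == prefix
        or reverse_url.startswith(prefix + ".d:")
        or reverse_url.startswith(prefix + ".p:")
    )


def crawl_iteration(corpus, link_graph, accept, blacklist=()):
    """One iteration: snowball one step out, then limit the corpus."""
    discovered = set()
    for url in corpus:
        discovered.update(link_graph.get(url, []))  # stand-in for fetching a page
    discovered -= corpus
    kept = {
        url
        for url in discovered
        if not any(under_prefix(url, prefix) for prefix in blacklist)
        and accept(url)  # manual accept/reject, modelled as a callback
    }
    return corpus | kept  # rejected URLs are thrown away


def assisted_crawl(seeds, link_graph, accept, blacklist=(), max_iterations=10):
    """Repeat iterations until nothing new is accepted (the GOTO step)."""
    corpus = set(seeds)
    for _ in range(max_iterations):
        expanded = crawl_iteration(corpus, link_graph, accept, blacklist)
        if expanded == corpus:
            break
        corpus = expanded
    return corpus


link_graph = {
    "d:org.d:a": ["d:org.d:b", "d:com.d:spam.p:ad"],
    "d:org.d:b": ["d:org.d:c", "d:org.d:a"],
    "d:org.d:c": ["d:net.d:d"],
}
corpus = assisted_crawl(
    seeds={"d:org.d:a"},
    link_graph=link_graph,
    accept=lambda url: not url.startswith("d:net"),
    blacklist=["d:com.d:spam"],
)
```

Expressing the blacklist as reverse URL prefixes lets a single entry reject a whole domain or section, which is the reason the notes suggest reusing the reverse URL scheme for it.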
Sharing corpora
- Integrating different corpora in a single repository
- Presentation / mapping
- Harvesting by OAI-PMH protocol
- Search by SRU protocol
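To make the two sharing protocols concrete, the sketch below only builds request URLs; it does not contact any server. The base URLs are placeholders, and the helper names are assumptions. The parameters used (`verb`, `metadataPrefix`, `set` for OAI-PMH; `operation`, `version`, `query`, `maximumRecords` for SRU) are the standard request parameters of those protocols.

```python
from urllib.parse import urlencode


def oai_pmh_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request for harvesting a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict harvesting to one set
    return base_url + "?" + urlencode(params)


def sru_search_url(base_url, cql_query, maximum_records=10, version="1.2"):
    """Build an SRU searchRetrieve request carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoints, not real services.
harvest_url = oai_pmh_list_records_url("https://repository.example.org/oai")
search_url = sru_search_url("https://repository.example.org/sru", 'dc.title = "climate"')
```

In this reading, each shared corpus could be exposed as an OAI-PMH set so that an external repository harvests it selectively, while the library portal sends SRU queries against the same records.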
