
Functions


HCI main functions

To build

  • Link - Selection of the web corpus boundaries by including, excluding and defining/categorizing [web entities](Web entities);
  • Contents - Qualification of the [web entities](Web entities) according to a user-specific code book.

To harvest

  • Link - harvest the topology and trace implicit links;
  • Contents - crawling, harvesting some or all of the contents of the corpus' web entities.

To index

  • Links - store, index, pre-compute and provide access to the topology of the corpus;
  • Contents - store, parse, index and provide access to the contents of the corpus.

To explore

  • Links - Exploratory Data Analysis tools over the graph of the corpus;
  • Contents - a focused search engine on the corpus.

To maintain

  • Links - scheduling periodic crawling of the corpus to detect the evolution of its topology;
  • Contents - scheduling periodic harvesting to detect and harvest the diff in the contents.

To preserve

  • Contents - assuring the reliability of the data through time.

To share/publish

  • Links - an export utility to drive specific network analysis in existing tools like Gephi (GEXF format; see the sketch below);
  • Contents - an export utility or API to share the constituted web archive to form a web of web archives.
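
A minimal sketch of such a GEXF export, assuming the corpus graph is held in networkx; the node and edge attributes below are illustrative placeholders, not the tool's actual schema. Gephi can open the resulting file directly.

```python
# Hedged sketch: write a small corpus graph to GEXF for Gephi.
# Node/edge attributes are illustrative, not Hyphen's real schema.
import networkx as nx

g = nx.DiGraph()
g.add_node("we:1", label="example.org", status="included")
g.add_node("we:2", label="example.net", status="undefined")
g.add_edge("we:1", "we:2", weight=3)  # aggregated hyperlinks between two web entities

nx.write_gexf(g, "corpus.gexf")  # GEXF file ready to be opened in Gephi
```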

Hyphen user functions

Export/import

  • Export/import (replace) corpus parameters
  • Export/import (merge) corpus codebook
  • Export/import (replace or merge?) corpus web-entities
    • Limit export/import to included/excluded/undefined web-entities
    • Choose the export format (GDF, GEXF, CSV...)

Multi-user management

  • Create user accounts for a specific corpus
  • Choose who has the right to feed a corpus
  • Open a multi-user corpus
  • Open a mono-user corpus
  • Complete corpus user management (read, write, etc.) (see corpus versioning system)

Manage corpora/projects

  • Create and name new corpus
    • Give the corpus a series of metadata (description, keywords, author...) 
  • Choose to feed an existing corpus (if allowed)
  • Duplicate and rename a corpus
  • Set the parameters for a corpus (modifying the default configuration)
    • Define the web entity delimitation policy (URL → web entity)
    • Define blacklists/exceptions/heuristics (but be aware of the methodological problems and define their granularity)

Management of the codebook

Functions that may be merged with qualifying the web entities.

  • Create and name a new codebook
    • Give the codebook a series of metadata (description, keywords, author...)
  • Backup codebook
  • Export codebook
  • Delete codebook (reset?)
  • Edit codebook
    • Create and delete 'qualification terms'
    • Move qualification terms
      • Batch-tag web entities (from one tag to another)
  • Group and ungroup qualification terms, deciding whether they are (see the sketch after this list):
    • tag groups (not exclusive inside the group)
    • partitions (exclusive inside the group)
    • and whether the tag is public or private (workflow-oriented or "system tags")
  • Import tags from existing taxonomies (semantic web, experts, archiving/library standards)
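
As a sketch of the tag group / partition distinction above (the class and field names are hypothetical, not Hyphen's actual data model): a partition keeps at most one of its terms per entity, while a tag group does not.

```python
# Hypothetical codebook group: exclusive = partition, non-exclusive = tag group.
class QualificationGroup:
    def __init__(self, name, terms, exclusive=False, public=True):
        self.name = name            # e.g. "Actor type"
        self.terms = set(terms)     # qualification terms belonging to the group
        self.exclusive = exclusive  # True -> partition, False -> tag group
        self.public = public        # private groups hold workflow or "system" tags

    def assign(self, entity_tags, term):
        """Add `term` to an entity's tag set, enforcing exclusivity for partitions."""
        if term not in self.terms:
            raise ValueError(f"{term!r} is not a term of group {self.name!r}")
        if self.exclusive:
            entity_tags -= self.terms  # a partition allows at most one of its terms
        entity_tags.add(term)

actor_type = QualificationGroup("Actor type", {"NGO", "media", "government"}, exclusive=True)
topics = QualificationGroup("Topics", {"climate", "energy"})

tags = set()
actor_type.assign(tags, "NGO")
topics.assign(tags, "climate")
actor_type.assign(tags, "media")  # replaces "NGO" because the group is a partition
print(sorted(tags))               # ['climate', 'media']
```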

Incorporate web-entities

  • Add entities manually (these entities are flagged as 'manually added')
    • Simple add
    • Multiple add
  • Track navigation (these entities are flagged as 'added by navigation')
    • Start/pause tracking
  • Semi-automatic crawl (these entities are flagged as 'added by crawl')
    • Start/pause/stop the crawl 
    • Define the settings of the crawl
      • Choose entry points
        • Manually (scraping hyperlinks from any copied-pasted text)
        • Based on the whole corpus or a partition of it (by qualification terms)
      • Choose depth (in stemmed URLs)
      • Choose distance
      • Choose time interval in seconds / randomize (to avoid blacklisting; see the sketch after this list)
      • Filtering prospection (when expanding the corpus by crawl, e.g. by indegree (co-link) or other criteria)
      • Monitoring a crawl (depth balance, live graph...)
      • Focused crawling?
        • Define keywords and/or stopwords (these entities are flagged with the found keyword)
  • Scraping (these entities are flagged as 'added by scraping')
    • Add entities from a keyword search (Google, Bing, Twitter, etc. → modules)
    • Add entities from a link: or inanchor: search
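
A minimal sketch of the "time interval / randomize" setting mentioned above: a randomized politeness delay between requests. The entry points and delay bounds are placeholder values, not project defaults.

```python
# Hedged sketch: fetch entry points with a randomized delay to avoid blacklisting.
import random
import time
import urllib.request

entry_points = ["https://example.org/", "https://example.net/"]
min_delay, max_delay = 2.0, 5.0  # seconds between requests (illustrative values)

for url in entry_points:
    with urllib.request.urlopen(url) as response:
        html = response.read()
    print(url, len(html), "bytes")
    time.sleep(random.uniform(min_delay, max_delay))  # randomized politeness delay
```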

Qualify web-entities


By default, entities are pending; a "deleted" status is also possible.

  • Define the web entity status (included/excluded/undefined)
  • Assign a name to the web entity (the default is based on the level of granularity? Page Title?)
  • Manually assign qualification-terms to web-entities during navigation
    • Hide/show qualification groups
    • Suggest tags or qualifications
  • Define a qualification for a set of web-entities (manual, navigation or crawl)
    • Set the status (undefined/excluded/included)
    • Set the qualification-term
  • Automatically fill meta-information on each web-entity (see the sketch after this list)
    • URL (depending on the level of granularity)
    • Default title if not manually changed
    • Incoming links
    • Outgoing links
      • Nature of the link (Flash, JavaScript, etc.)
    • Source (manually added, added by navigation, added by crawl, added by scraping)
    • Who added the web entity to the corpus
    • Date stamp (e.g. date of addition)
    • All the graph indicators
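
One possible shape for this automatically filled meta-information, as a sketch only; the field names follow the bullets above but are an assumption, not the actual storage schema.

```python
# Hypothetical record of a web entity's auto-filled meta-information.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class WebEntityMeta:
    url: str                         # depends on the chosen level of granularity
    title: str                       # default title unless manually changed
    status: str = "pending"          # pending / included / excluded / undefined / deleted
    incoming_links: int = 0
    outgoing_links: int = 0
    link_natures: dict = field(default_factory=dict)       # e.g. {"html": 12, "flash": 1}
    source: str = "manual"           # manual / navigation / crawl / scraping
    added_by: str = ""               # who added the web entity to the corpus
    added_on: datetime = field(default_factory=datetime.now)  # date stamp
    graph_indicators: dict = field(default_factory=dict)   # degree, betweenness, ...

entity = WebEntityMeta(url="example.org", title="Example", source="crawl", added_by="analyst")
print(entity.status, entity.added_on.isoformat())
```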

View the corpus

  • Corpus overview (showing the main information on entities) / full spreadsheet view (showing all information on entities)
    • Sort entities
    • Filter entities
    • Search entities (simple or advanced)
  • Real-time graph of the corpus
    • Color nodes according to status or classification
    • Rank nodes according to incoming/outgoing links or graph indicators
    • Choose node label (URL, title, status)
    • Choose layout algorithm
  • Slideshow of the corpus or part of it (open subsequently in a tab or in multiple tabs)

Edit the corpus entities

  • Create/delete web-entities
  • Split web entities
  • Edit the status of a set of web-entities
  • Merge web-entities (Management of aliases)
  • Create links among web-entities (difficult)
  • Edit the qualifications of a set of web-entities
  • Edit the auto-information of a set of web-entities
  • Attach extra information to entities (e.g. geographic coordinates)

TO DELETE ?

  • Define web entities (choose granularity and specificity) based on a reverse URL pattern.
  • Limit the corpus (define boundaries and throw away the rest of the discovered URLs)
  • Manually (accept/reject)
  • The reverse URL scheme should also be used to define blacklists

Questions

  • search / time
  • difference in set-up / explore?

Granularity, stems and web entities

  • E.g. google.com/images/pageA.html
  • Blocks that constitute URLs are called stems
  • E.g. {com, com.google, com.google.images}
  • Web entities are typically the nodes in the graph of your corpus
  • Graph of stemmed URLs (Reverse URLs, arranged from least to most specific part):
  • d:com.d:google.p:images.p:pageA.html
  • Advantages:
    • easy to redefine a web entity without having to recalculate everything
    • forces social scientists to think about websites
  • Granularity: how far to go in the URL?
  • Web entities are relevant only for visualization and possibly selection (e.g. co-link)
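
A minimal sketch of the stemming described above, reproducing the google.com/images/pageA.html example; the d:/p: prefixes mirror the notation used on this page, and the function name is illustrative.

```python
# Hedged sketch: split a URL into reverse stems, least to most specific.
from urllib.parse import urlparse

def reverse_url_stems(url):
    parsed = urlparse(url if "://" in url else "http://" + url)
    host_stems = ["d:" + part for part in reversed(parsed.hostname.split("."))]
    path_stems = ["p:" + part for part in parsed.path.split("/") if part]
    return host_stems + path_stems

print(".".join(reverse_url_stems("google.com/images/pageA.html")))
# d:com.d:google.p:images.p:pageA.html
```

A web entity can then be defined as a prefix of this stem list, e.g. the whole site (d:com.d:google) or only its images section (d:com.d:google.p:images), without recrawling anything.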

User - UI - CVS

for corpora (local or online?) - ... - ARC - WWW

Sharing corpora

  • Integrating different corpora in a single repository
  • Presentation / mapping
  • Harvesting via the OAI-PMH protocol (see the sketch below)
  • Search via the SRU protocol
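
A minimal sketch of an OAI-PMH harvesting request, assuming a repository exposing the standard ListRecords verb; the endpoint URL is a placeholder, not a real repository.

```python
# Hedged sketch: issue an OAI-PMH ListRecords request and read the XML response.
import urllib.parse
import urllib.request

endpoint = "https://example.org/oai"  # placeholder OAI-PMH endpoint
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

with urllib.request.urlopen(endpoint + "?" + urllib.parse.urlencode(params)) as resp:
    xml = resp.read().decode("utf-8")

print(xml[:200])  # XML listing of records, followed by a resumptionToken for paging
```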