Functions
- Links - selection of the web corpus boundaries by including, excluding and defining/categorizing [web entities](Web entities);
- Contents - qualification of the [web entities](Web entities) following a user-specific codebook.
- Links - harvest the topology and trace implicit links;
- Contents - crawl and harvest some or all of the contents of the corpus' web entities.
- Links - store, index, pre-compute and provide access to the topology of the corpus;
- Contents - store, parse, index and provide access to the contents of the corpus.
- Links - exploratory data analysis tools over the graph of the corpus;
- Contents - a focused search engine on the corpus.
- Links - schedule periodic crawling of the corpus to detect the evolution of its topology;
- Contents - schedule periodic harvesting to detect and harvest the diff in the contents.
- Contents - ensure the reliability of the data through time.
- Links - an export utility to drive specific network analysis in existing tools like Gephi (GEXF format; see the sketch after this list);
- Contents - an export utility or API to share the constituted web archive and form a web of web archives.
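A minimal sketch of what the Links export to GEXF could look like, using Python and networkx; the corpus and field shapes here are assumptions for illustration, not the tool's actual API:

```python
# Minimal sketch of a Links export to GEXF for Gephi (assumed data shapes;
# not the tool's actual API).
import networkx as nx

def export_corpus_gexf(web_entities, links, path="corpus.gexf"):
    """web_entities: (id, name, status) tuples; links: (source, target) pairs."""
    g = nx.DiGraph()
    for entity_id, name, status in web_entities:
        g.add_node(entity_id, label=name, status=status)
    g.add_edges_from(links)
    nx.write_gexf(g, path)  # GEXF file ready to open in Gephi

export_corpus_gexf(
    [(1, "example.com", "included"), (2, "other.org", "undefined")],
    [(1, 2)],
)
```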
- Export/import (replace) corpus parameters (see the sketch after this list)
- Export/import (merge) corpus codebook
- Export/import (replace or merge?) corpus web entities
- Limit export/import to included/excluded/undefined web entities
- Choose the export format (GDF, GEXF, CSV...)
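As an illustration of exporting and re-importing corpus parameters, a hypothetical sketch; every field name here is an assumption:

```python
# Hypothetical shape of exported corpus parameters; all field names are
# assumptions for illustration.
import json

corpus_parameters = {
    "name": "my-corpus",
    "default_crawl_depth": 1,
    "web_entity_granularity": "domain",  # the URL > web entity policy
    "blacklist": ["example-tracker.com"],
}

# Export to a file...
with open("corpus_parameters.json", "w") as f:
    json.dump(corpus_parameters, f, indent=2)

# ...and import (replace): load the file and overwrite current parameters.
with open("corpus_parameters.json") as f:
    corpus_parameters = json.load(f)
```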
- Create user accounts for a specific corpus
- Choose who has the right to feed a corpus
- Open a multi-user corpus
- Open a mono-user corpus
- Complete corpus user management (read, write, etc.) (see the corpus versioning system)
- Create and name new corpus
- Give the corpus a series of metadata (description, keywords, author...)
- Choose to feed an existing corpus (if allowed)
- Duplicate and rename a corpus
- Set the parameters for a corpus (modifying the default configuration)
- Define the web entity circumscription policy (URL > web entity)
- Define blacklists/exceptions/heuristics (BUT be aware of the methodological problems and define their granularity)
A function that may be merged with the qualification of web entities:
- Create and name a new codebook
- Give the codebook a series of metadata (description, keywords, author...)
- Backup codebook
- Export codebook
- Delete codebook (reset?)
- Edit codebook
- Create and delete 'qualification terms'
- Move qualification term
- Batch-tag web entities (from one tag to another)
- Group and ungroup qualification terms, deciding whether they are:
- tag groups (not exclusive inside the group)
- partitions (exclusive inside the group; see the sketch after this list)
- Decide whether a tag is public or private (workflow-oriented or "system tags")
- Import tags from existing taxonomies (semantic web, experts, archiving library standards)
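A sketch of how a codebook could distinguish tag groups from partitions, assuming a plain dictionary structure (names are illustrative):

```python
# Sketch of a codebook distinguishing tag groups (non-exclusive) from
# partitions (exclusive). Structure and names are assumptions.
codebook = {
    "groups": {
        "topics": {"type": "tags", "terms": ["health", "policy", "media"]},
        "actor_type": {"type": "partition", "terms": ["institution", "blog", "press"]},
    }
}

def qualify(entity_tags, group_name, term):
    """Assign a term to an entity, enforcing exclusivity inside a partition."""
    group = codebook["groups"][group_name]
    if term not in group["terms"]:
        raise ValueError(f"{term!r} is not a term of group {group_name!r}")
    if group["type"] == "partition":
        entity_tags[group_name] = {term}            # exclusive: replace
    else:
        entity_tags.setdefault(group_name, set()).add(term)  # tags accumulate

tags = {}
qualify(tags, "topics", "health")
qualify(tags, "topics", "policy")     # both topics kept (tag group)
qualify(tags, "actor_type", "blog")
qualify(tags, "actor_type", "press")  # replaces "blog" (partition)
```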
- Add entities manually (these entities are flagged as 'manually added')
- Simple add
- Multiple add
- Track navigation (these entities are flagged as 'added by navigation')
- Start/pause tracking
- Semi-automatic crawl (these entities are flagged as 'added by crawl')
- Start/pause/stop the crawl
- Define the settings of the crawl
- Choose entry points
- Manually (scraping hyperlinks from any copied-pasted text)
- Based on the whole corpus or a partition of it (by qualification terms)
- Choose depth (in stemmed URLs)
- Choose distance
- Choose the time interval in seconds and/or randomize it (to avoid blacklisting; see the sketch after this list)
- Filter prospection (when expanding the corpus by crawl, e.g. by in-degree (co-links) or other criteria)
- Monitoring a crawl (depth balance, live graph...)
- Focused crawling?
- Define keywords and/or stopwords (these entities are flagged with the found keyword)
- Choose entry points
- Scraping (these entities are flagged as 'added by scraping')
- Add entities from a keyword search (Google, Bing, Twitter, etc. -> modules)
- Add entities from a `link:` or `inanchor:` search
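For the crawl timing setting above (interval in seconds, randomized to avoid blacklisting), a minimal sketch; the URL list and delay values are placeholders:

```python
# Sketch of a polite fetch loop with a randomized delay between requests,
# to avoid being blacklisted. URLs and delay values are placeholders.
import random
import time
import urllib.request

def polite_fetch(urls, base_delay=2.0, jitter=3.0):
    """Fetch each URL, sleeping base_delay + random jitter seconds in between."""
    for url in urls:
        with urllib.request.urlopen(url) as response:
            yield url, response.read()
        time.sleep(base_delay + random.uniform(0, jitter))

for url, body in polite_fetch(["http://example.com"]):
    print(url, len(body))
```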
By default, entities are pending; a "deleted" status is also possible.
- Define the web entity status (included/excluded/undefined)
- Assign a name to the web entity (the default is based on the level of granularity? Page Title?)
- Manually assign qualification-terms to web-entities during navigation
- Hide/show qualification groups
- Suggest tags or qualifications
- Define a qualification for a set of web-entities (manual, navigation or crawl)
- Set the status (undefined/excluded/included)
- Set the qualification-term
- Automatically fill meta-information on each web entity (see the sketch after this list)
- URL (depending on the level of granularity)
- Default title if not manually changed
- Incoming links
- Outgoing links
- Nature of the link (Flash, JavaScript, etc.)
- Source (manually added, added by navigation, added by crawl, added by scraping)
- Who added the web entity to the corpus
- Date stamp (e.g. date of adding)
- All the graph indicators
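A sketch of what this auto-filled meta-information could look like as a record; field names and defaults are assumptions:

```python
# Hypothetical record for the meta-information auto-filled on a web entity.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class WebEntity:
    url: str                        # depends on the level of granularity
    title: str                      # default title unless manually changed
    status: str = "pending"         # pending/included/excluded/undefined/deleted
    incoming_links: int = 0
    outgoing_links: int = 0
    source: str = "manually added"  # or: added by navigation / crawl / scraping
    added_by: str = ""              # who added the entity to the corpus
    date_added: date = field(default_factory=date.today)
    indicators: dict = field(default_factory=dict)  # graph indicators

entity = WebEntity(url="http://example.com", title="Example")
```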
- Corpus overview (showing main information on entities) / full-spread-sheet view (showing all information on entities)
- Sort entities
- Filter entities
- Search entities (simple or advanced)
- Real-time graph of the corpus
- Color nodes according to status or classification
- Rank nodes according to incoming/outgoing links or graph indicators (see the sketch after this list)
- Choose node label (URL, title, status)
- Choose layout algorithm
- Slideshow of the corpus or part of it (open subsequently in a tab or in multiple tabs)
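For coloring by status and ranking by incoming links, a sketch using networkx; the status-to-color mapping is an illustrative assumption:

```python
# Sketch of coloring nodes by status and sizing them by incoming links.
import networkx as nx

g = nx.DiGraph([(1, 2), (3, 2), (2, 4)])
nx.set_node_attributes(
    g, {1: "included", 2: "included", 3: "undefined", 4: "excluded"}, "status")

colors = {"included": "green", "excluded": "red", "undefined": "grey"}
for node, data in g.nodes(data=True):
    data["color"] = colors[data["status"]]  # color according to status
    data["size"] = 1 + g.in_degree(node)    # rank according to incoming links
```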
- Create/delete web-entities
- Split web entities
- Edit the status of a set of web-entities
- Merge web-entities (management of aliases; see the sketch after this list)
- Create links among web-entities (difficult)
- Edit the qualifications of a set of web-entities
- Edit the auto-filled information of a set of web-entities
- Attach extra information to entities (e.g. geographic coordinates)
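A sketch of merging two web entities while keeping the absorbed entity's URL prefixes as aliases; the entity structure is an assumption, not the tool's data model:

```python
# Sketch of merging two web entities; the absorbed entity's prefixes become
# aliases of the target so its old URLs still resolve.

def merge_entities(target, absorbed, entities):
    """Merge `absorbed` into `target` and drop it from the corpus."""
    target["prefixes"] += absorbed["prefixes"]  # alias management
    target["aliases"].append(absorbed["name"])
    entities.remove(absorbed)

entities = [
    {"name": "Example", "prefixes": ["com.example"], "aliases": []},
    {"name": "Example blog", "prefixes": ["com.example.blog"], "aliases": []},
]
merge_entities(entities[0], entities[1], entities)
print(entities[0]["prefixes"])  # ['com.example', 'com.example.blog']
```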
TO DELETE ?
- Define Web entities (choose granularity and specificity) based on reverse URL patterns.
- Limit the corpus (define boundaries and throw away rest of discovered URLs)
- manually (accept/reject)
- The reverse URL scheme should also be used to define blacklists
- search / time
- difference in set-up / explore?
- E.g. google.com/images/pageA.html
- Blocks that constitute URLs are called stems
- E.g. {com, com.google, com.google.images}
- Web entities are typically nodes in the graph of your corpus
- Graph of stemmed URLs (Reverse URLs, arranged from least to most specific part):
- d:com.d:google.p:images.p:pageA.html
- Advantages:
- easy to redefine Web entity without having to recalculate everything
- forces social scientists to think about *websites*
- Granularity: how far to go in the URL? (see the sketch below)
- Web entities are relevant only for visualization and possibly selection (e.g. co-link)
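A sketch of the stemming described above, turning a URL into reverse-URL stems from least to most specific (the d:/p: prefix notation is omitted for simplicity):

```python
# Sketch of reverse-URL stemming: the blocks that constitute a URL become
# stems, arranged from least to most specific part.

def reverse_url_stems(url):
    """'google.com/images/pageA.html' -> ['com', 'com.google',
    'com.google.images', 'com.google.images.pageA.html']"""
    host, _, path = url.partition("/")
    parts = list(reversed(host.split(".")))     # com, google
    parts += [p for p in path.split("/") if p]  # images, pageA.html
    return [".".join(parts[: i + 1]) for i in range(len(parts))]

print(reverse_url_stems("google.com/images/pageA.html"))
# ['com', 'com.google', 'com.google.images', 'com.google.images.pageA.html']
```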
For corpora (local or online?):
- ...
- ARC
- WWW
- Integrating different corpora in a single repository
- Presentation / mapping
- Harvesting via the OAI-PMH protocol (see the sketch below)
- Search via the SRU protocol
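A minimal sketch of an OAI-PMH harvest request; the endpoint URL is a placeholder, and resumption-token and error handling are omitted:

```python
# Minimal OAI-PMH ListRecords request; the endpoint is a placeholder.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def oai_list_records(endpoint, metadata_prefix="oai_dc"):
    """Fetch one page of records from an OAI-PMH repository as an XML tree."""
    query = urllib.parse.urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    with urllib.request.urlopen(f"{endpoint}?{query}") as response:
        return ET.fromstring(response.read())

# records = oai_list_records("http://archive.example.org/oai")
```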