
Commit

use custom exc
interrogator committed Aug 24, 2019
1 parent 9a826c9 commit a0f1e94
Showing 5 changed files with 61 additions and 37 deletions.
14 changes: 12 additions & 2 deletions buzzword/parts/explore.py
@@ -6,6 +6,7 @@
 import pandas as pd
 
 from buzz.dashview import _df_to_figure
+from buzz.exceptions import DataTypeError
 from buzzword.parts.helpers import (
     _get_specs_and_corpus,
     _translate_relative,
@@ -251,10 +252,19 @@ def _new_search(
         # after which, either we return previous, or return none:
         df = df.iloc[:0, :0]
     else:
+        search = _cast_query(search_string, col)
         method = "just" if not skip else "skip"
-        df = getattr(getattr(corpus, method), col)(search_string.strip())
+        try:
+            df = getattr(getattr(corpus, method), col)(search)
+        # todo: tell the user the problem?
+        except DataTypeError:
+            df = df.iloc[:0, :0]
+            msg = "Query type problem. Query is {}, col {} is {}."
+            if isinstance(search, list):
+                search = search[0]
+            msg = msg.format(type(search), col, corpus[col].dtype)
     # if there are no results
-    if not len(df):
+    if not len(df) and not msg:
         found_results = False
         msg = "No results found, sorry."
 
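The pattern this commit adopts ("use custom exc") is that the search backend raises a dedicated exception when a query's type cannot be compared with a column's dtype, and the caller catches it to show a friendly message instead of crashing. A minimal, self-contained sketch of that pattern follows; the `search_column` helper and its logic are illustrative stand-ins, not buzz's actual implementation.

```python
import pandas as pd


class DataTypeError(TypeError):
    """Raised when a query's type cannot be matched against a column's dtype."""


def search_column(df: pd.DataFrame, col: str, query) -> pd.DataFrame:
    """Return rows of df where col matches query, or raise DataTypeError."""
    if pd.api.types.is_numeric_dtype(df[col]):
        if isinstance(query, str):
            # a string/regex query makes no sense against a numeric column
            raise DataTypeError(f"Column {col!r} is numeric, got str: {query!r}")
        return df[df[col] == query]
    return df[df[col].astype(str).str.contains(str(query), regex=True)]
```

A caller can then wrap the search in `try ... except DataTypeError` exactly as the diff above does, building an error message from `type(search)` and the column's dtype.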
18 changes: 18 additions & 0 deletions buzzword/parts/helpers.py
@@ -201,3 +201,21 @@ def _get_initial_table(slug):
         return INITIAL_TABLES[slug]
     corpus = _get_corpus(slug)
     return corpus.table(show="p", subcorpora="file")
+
+def _cast_query(query, col):
+    """
+    Allow different query types (e.g. numerical, list, str)
+    """
+    query = query.strip()
+    if col in {'t', 'd'}:
+        return query
+    if query.startswith("[") and query.endswith(']'):
+        query = query[1:-1]
+        query = query.split(',')
+        return [i.strip() for i in query]
+    if query.isdigit():
+        return int(query)
+    try:
+        return float(query)
+    except Exception:
+        return query
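To make the intended behavior of `_cast_query` concrete, here is what it should return for a few representative inputs (expected values inferred from the code above; `'w'` and `'g'` are just example column names):

```python
_cast_query("^word$", "d")      # -> "^word$" (the 't'/'d' columns pass through as str)
_cast_query("2", "g")           # -> 2 (int, so numeric columns can be matched)
_cast_query("0.5", "g")         # -> 0.5 (float)
_cast_query("[cat, dog]", "w")  # -> ["cat", "dog"] (bracketed input becomes a list)
_cast_query("w(o|i)rd", "w")    # -> "w(o|i)rd" (anything else stays a str/regex)
```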
62 changes: 29 additions & 33 deletions docs/guide.md
@@ -1,44 +1,42 @@
 # buzzword
-> *buzzword* is a web app for corpus linguistics, built with [buzz](https://github.com/interrogator/buzz) as a backend. *buzzword* allows you to parse, search and visualise language datasets in your browser. The tool is soon to be hosted by UZH, but can also be run locally (see [here](run.md) for instructions).
+> *buzzword* is a web app for corpus linguistics, built with [buzz](https://github.com/interrogator/buzz) as a backend. *buzzword* allows you to parse, search and visualise corpora in your browser. The tool is soon to be hosted by UZH, but can also be run locally (see below for instructions).
+*buzzword* consists of two main pages: a *Start* page, where you select or upload a corpus, or get help; and an *Explore* page, which provides an interface for working with a parsed/annotated corpus. The *Explore* page is divided into tabs for different tasks, such as searching, building frequency tables, visualising and concordancing. Below are instructions for each page and tab.
 
-*buzzword* consists of two main pages: a *Start* page, where you select or upload a corpus, or get help; and an *Explore* page, which provides an interface for working with a parsed corpus. The *Explore* page is divided into tabs for different linguistic tasks. Below are instructions for each page and tab.
 
-## Start page
+# Start page
 
 The main page of the tool allows you to select a pre-loaded corpus, or upload your own. If you want to browse a pre-loaded corpus, simply click its name in the table.
 
 If you want to upload your own data, you can upload plain text files (which will be parsed), plain text with metadata (which will also be parsed), or CONLL-U formatted files, which will be loaded directly into the tool without any processing.
 
-For information about how to create metadata-rich corpora, see [here](/building). Once you've done that, you can simply upload the one or more created files via the drag-and-drop interface. Then, simply give the corpus a name, select the language, and hit *Upload and parse*. After parsing, a link to the corpus will become available.
+For information about how to create metadata-rich corpora, see [here](building.md). Once you've done that, you can simply upload one or more created files via the drag-and-drop interface. Then, simply give the corpus a name, select the language, and hit *Upload and parse*. Parsing shouldn't take long, even for large and metadata-rich corpora. After parsing, a link to the corpus will become available in the table.
 
-## Explore page
+# Explore page
 
-Once you have either uploaded or selected a pre-loaded corpus, you will be taken to the *Explore* page. This page provides an interface for searching and visualising corpus data.
+Once you have either uploaded or selected a pre-loaded corpus, you will be taken to the *Explore* page. This page provides an interface for searching and visualising a dataset.
 
-### Dataset view
+## Dataset view
 
-In this view, you can see the loaded corpus, row by row, with its token and metadata features. Like the other tables in *buzzword*, the table is interactive: you can filter, sort and remove rows and columns as you wish. To filter, just enter some text into one of the blank cells above the first table row. More advanced kinds of filtering strings are described [here](https://dash.plot.ly/datatable/filtering).
+In this view, you can see the loaded corpus, row by row, with its token and metadata features. Like the other tables in *buzzword*, this one is interactive: you can filter, sort and remove rows and columns as you wish. To filter, just enter some text into one of the blank cells above the first table row. More advanced kinds of filtering strings are described [here](https://dash.plot.ly/datatable/filtering). Note that filters are just temporary views of the corpus, and are not applied permanently to your data; use the *Search* functionality to create subsets of data that can be searched further, concordanced or plotted.
 
-#### Searching
+### Searching
 
-The dataset tab is also the place where you search the corpus. In *buzzword*, because you can search within search results, the best way to find what you're looking for is usually to "drill-down" into the data by searching multiple times, rather than writing one big, complicated query.
+The *Dataset* tab is the place where you search your dataset. In *buzzword*, because you can search within search results, the best way to find what you're looking for is usually to "drill-down" into the data by searching multiple times, rather than writing one big, complicated query.
 
-**Feature to search**: in the leftmost dropdown field, you select the feature you want to search (word form, lemma form, POS, etc). Each of these options targets a token or metadata feature, except *Dependencies*, which is used to search the dependency grammar with which a corpus has been parsed. To start out, select something simple, like 'Word', so that your search string will be compared to the word as it was writtin in the original, unparsed text.
+**Feature to search**: in the leftmost dropdown field, you select the feature you want to search (word form, lemma form, POS, etc). Each of these options targets a token or metadata feature, except *Dependencies*, which is used to search the dependency grammar with which a corpus has been parsed. To start out, select something simple, like *Word*, so that your search string will be compared to the word as it was written in the original, unparsed text.
 
-**Query entry**: in the text entry field, you neeed to provide a case-sensitive regular expression that you want to match. The only exception to this is if you are searching *Dependencies*, in which case you will need to use [the depgrep query language](depgrep.md). If you're new to regular expressions, and just want to find words that exactly match a string, enter `^word$`. The caret (`^`) denotes 'start of string', and the dollar sign (`$`) denotes the end of string. Without the dollar sign, the query would match not only *word*, but *wordy*, *wording*, *word-salad*, and so on.
+**Query entry**: in the text entry field, you provide a case-sensitive regular expression that you want to match. The only exception to this is if you are searching *Dependencies*, in which case you will need to use [the depgrep query language](depgrep.md). If you're new to regular expressions, and just want to find words that exactly match a string, enter `^word$`. The caret (`^`) denotes 'start of string', and the dollar sign (`$`) denotes the end of string. Without the dollar sign, the query would match not only *word*, but *wordy*, *wording*, *word-salad*, and so on. Once you get used to using regular expressions, you'll find them very powerful, so learning to write them is a valuable skill! If you want your search to be case-insensitive, the best way to do this is to begin your query with `(?i)`, which denotes case insensitivity in the language of regular expressions.
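The anchoring and `(?i)` behavior described above is easy to check with Python's `re` module, outside of buzzword (a quick illustration, not part of the tool itself):

```python
import re

words = ["word", "wordy", "wording", "word-salad", "Word"]

print([w for w in words if re.search(r"word", w)])        # unanchored: first four match
print([w for w in words if re.search(r"^word$", w)])      # anchored: only 'word'
print([w for w in words if re.search(r"(?i)^word$", w)])  # case-insensitive: 'word', 'Word'
```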

 **Inverting searches**: finally, you can toggle result inversion using the toggle switch (i.e. return rows *not matching* the search criteria). This is especially useful if you want to remove particular unwanted results. For example, if you want to match any pronoun but *me*, rather than writing a regular expression to match every other pronoun (*I*, *you*, *he*, etc.), simply search for any token whose part of speech is *PRP*, and then run another search for any word matching *me*, inverting the result.
 
 Once your query is formulated, hit *Search*. Search time depends mostly on how many items are returned, though very complex *depgrep* queries can also take a few seconds.
 
-#### Search results
+### Working with search results
 
-When the search has completed, the table in the *Dataset* tab will be reduced to just those rows that match your query. At the top of the tool, you'll see that your search has entered into the "Search from" space, translated into something resembling natural English. Beside the search is a bracketed expression telling you how many results you have. This component has the format:
+When the search has completed, the table in the *Dataset* tab will be reduced to just those rows that match your query. At the top of the tool, you'll see that your search has entered into the *Search from* space, translated into something resembling natural English. Beside the search is a bracketed expression telling you how many results you have. This component has the format:
 
 `(absolute-frequency/percentage-of-parent-search/percentage-of-corpus)`
 
-Whatever is selected in the *Search from* dropdown is the dataset to be searched. Therefore, if you search again with the current search selected, you will "drill down", removing more rows as you go. So, if you are interested in nouns ending in *ing* that do not fill the role of *nsubj*, you can search three times:
+Whatever is selected in the *Search from* dropdown is now the dataset to be searched. Therefore, if you search again with the current search selected, you will "drill down", removing more rows as you go. So, if you are interested in nouns ending in *ing* that do not fill the role of *nsubj*, you can search three times:
 
 1. Wordclass matching `NOUN`
 2. Word matching `ing$`
@@ -52,51 +50,49 @@ X"NOUN" = w/ing$/ != F"nsubj"
 
 If you want to learn to use the *depgrep* language, you can view the [*depgrep* page](depgrep.md) for a guide to query construction.
 
-At any time, if you want to delete your search history, you can use the 'Clear history' button to forget all previous searches.
-
-### Frequencies view
+At any time, if you want to delete your search history, you can use the 'Clear history' button to forget all previous searches, returning the tool to its original state.
 
-In contrast to other tools, *buzzword* separates the processes of searching datasets and displaying their results. This gives you the flexibility to display the same search result in different ways, without the need for performing duplicated searches.
+## Frequencies view
 
-So, once you are happy with a search result generated in the *Dataset* tab, you can move to the *Frequencies* tab to transform your results into a frequency table.
+In contrast to other tools, *buzzword* separates the processes of searching datasets and displaying their results. This gives you the flexibility to display the same search result in different ways, without the need for performing duplicated searches. So, once you are happy with a search result generated in the *Dataset* tab, you can move to the *Frequencies* tab to transform your results into a frequency table.
 
-#### Tabling options
+### Tabling options
 
-To generate a frequency table, start by searching the correct search from the *Search from* dropdown menu at the top of the tool. These are the results you will be calculating from. You can select either the entire corpus, or a search result generated in the *Dataset* tab.
+To generate a frequency table, start by selecting the correct dataset from the *Search from* dropdown menu at the top of the tool. These are the results you will be calculating from. You can select either the entire corpus, or a search result generated in the *Dataset* tab.
 
-Next, you need to choose which feature or features you want to use as the column(s) of the table. Multiple selections are possible; for example, you can select `Word` and then `Wordclass` to get results in the form: `happy/ADJ`.
+Next, you need to choose which feature or features you want to use as the column(s) of the table. Multiple selections are possible; for example, you can select `Word` and then `Wordclass` to get results in the form: `happy/adj`.
 
-After selecting columns, select which feature should serve as the index of the table. You don't have to use `Filename` here: you can model things like *lemma form by speaker* or *Functions by POS*. Multiple selections are not possible here.
+After selecting columns, select which feature should serve as the index of the table. You don't have to use `Filename` here: you can model things like *lemma by speaker* or *Function by POS*. Multiple selections are not possible here (though they are in the *buzz* backend, if you would prefer to use that).
 
 Next, choose how you would like to sort your data. By default, the most frequent items appear first. But, you can sort by infrequent (i.e. reverse total), alphabetically, and so on. Particularly special here are *increase* and *decrease*. These options use linear regression to figure out which items become comparatively more common from the first to last items on the x-axis. They are particularly useful when your choice for table index is sequential, such as chapters of a book, scenes in a film, or dates.
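The *increase*/*decrease* options can be pictured as fitting a least-squares line to each item's frequencies across the ordered index and sorting by the slope. A rough sketch of the idea (not buzz's actual code; `sort_by_trend` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def sort_by_trend(table: pd.DataFrame, decreasing: bool = False) -> pd.DataFrame:
    """Order columns by the slope of a straight-line fit over the index."""
    x = np.arange(len(table))
    # np.polyfit with deg=1 returns (slope, intercept) for each column
    slopes = {col: np.polyfit(x, table[col].to_numpy(), 1)[0] for col in table}
    order = sorted(slopes, key=slopes.get, reverse=not decreasing)
    return table[order]
```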

 Finally, choose what kind of calculation you would like. The default, *Absolute frequency*, simply counts occurrences. You can also choose to model relative frequencies, using either this search or the entire corpus as the denominator. What this means is that if you have searched for all nouns, you can calculate the relative frequency of each noun relative to all nouns (*Relative of result*), or each noun relative to all tokens in the corpus (*Relative of corpus*).
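In pandas terms, the two denominators differ roughly like this (a schematic with toy numbers, not buzz's implementation):

```python
import pandas as pd

# Counts of two matched nouns across two files
nouns = pd.DataFrame({"time": [10, 5], "way": [2, 3]}, index=["f1", "f2"])
tokens_per_file = pd.Series({"f1": 200, "f2": 100})  # all tokens per file

# 'Relative of result': each noun as a % of all matched nouns in that file
rel_result = nouns.div(nouns.sum(axis=1), axis=0) * 100

# 'Relative of corpus': each noun as a % of all tokens in that file
rel_corpus = nouns.div(tokens_per_file, axis=0) * 100
```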

-Alternatively, you can also choose to calculate keyness, rather than absolute/relative frequency. Both [log-likelihood](http://ucrel.lancs.ac.uk/llwizard.html) and percentage-difference measures are supported. These measures are particularly useful as means to find out which tokens are unexpectedly frequent or infreuent in the corpus.
+Alternatively, you can also choose to calculate keyness, rather than absolute/relative frequency. Both [log-likelihood](http://ucrel.lancs.ac.uk/llwizard.html) and percentage-difference measures are supported. These measures are particularly useful as means to find out which tokens are unexpectedly frequent or infrequent in the corpus.
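The linked log-likelihood measure reduces to comparing observed frequencies with the frequencies expected given the two corpora's sizes. A self-contained sketch following the UCREL formula (not buzz's own code):

```python
from math import log

def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """G2 keyness of a word: its frequency in a target vs a reference corpus."""
    total_freq = freq_target + freq_ref
    expected_t = size_target * total_freq / (size_target + size_ref)
    expected_r = size_ref * total_freq / (size_target + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_target, expected_t), (freq_ref, expected_r)):
        if observed:  # a zero observed count contributes nothing
            g2 += observed * log(observed / expected)
    return 2 * g2

# 40 hits per 10k tokens vs 10 hits per 10k tokens:
print(round(log_likelihood(40, 10, 10_000, 10_000), 2))  # 19.27
```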

-#### Generating tables: a note on performance
+### Generating tables: a note on performance
 
 Once you've chosen your parameters, hit *Generate table* to start the tabling process. Like searching, the calculations themselves are fairly fast. What might take some time, however, is displaying a result with many rows/columns. So, it is important to bear in mind when tabling that some choices can lead to unmanageably large results, which could potentially cause memory-related errors: if you choose to display *Word*, *Filename*, *Sent #* and *Token #* as columns, for example, you will be generating a column for every single token in the corpus. Before generating a table, it's therefore wise to quickly consider the expected size of the output.
 
-#### Editing the *Frequencies* table
+### Editing the *Frequencies* table
 
 Whenever you edit the frequencies table (i.e. deleting rows or columns), the changes will be reflected in the underlying data. So, charts generated from this table will not contain the deleted rows and columns. Note, however, that deleting rows and columns will not lead to recalculation of relative frequencies. If you want to recalculate relative frequencies, you should instead run a new sub-search from the *Dataset* view, skipping the rows and/or columns you don't want.
 
-#### Downloading the table
+### Downloading the table
 
-You can click the *Download* link to download the entire frequency table in CSV format. You can then load this file into software like Excel in order to explore it further,
+You can click the *Download* link near the bottom of the page to download the entire frequency table in CSV format. You can then load this file into software like Excel in order to explore it further.
 
 ### Chart view
 
-Once you have a table representing the results you're interested in, you can go to the Chart tab to visualise the result. Visualisation is easy: just select the plot type, the number of items to display, and whether or not you want to transpose the rows and columns.
+Once you have a table representing the results you're interested in, you can go to the Chart tab to visualise the result. Visualisation is easy: just select the plot type, the number of items to display, and whether or not you want to transpose (i.e. invert) the rows and columns.
 
 All figures are interactive, so go ahead and play with the various tools that appear above the charts. Using these tools, you can also choose to save the figure to your hard drive for access outside of *buzzword*.
 
 Note that there are multiple chart spaces. Just click '*Chart #1*' to fold this space in and out of view. Each space is completely independent from the others. You can therefore visualise completely different things in each.
 
 ### Concordance view
 
-Concordances, or *Keyword-in-context* (KWIC) views, involve aligning each search result in a column, displaying their co-text on either side. In *buzzword*, concordances are another way of viewing a search result. Therefore, rather than directly running a concordance query like you do in most other tools, you first run a search in the *Dataset* view, and then use the *Concordance* tab to display this result as a concordance.
+Concordances, or *Keywords-in-context* (KWIC), involve aligning each search result in a column, displaying their co-text on either side. In *buzzword*, concordances are another way of viewing a search result. Therefore, rather than directly running a concordance query like you do in most other tools, you first run a search in the *Dataset* view, and then use the *Concordance* tab to display this result as a concordance.
 
 To display a concordance, first, select the search result you want to visualise from the *Search from* dropdown menu at the top of the page. Then, select how you would like to format your matches. Like when making frequency tables, you can format matches using multiple token features. One popular combination is *word/POS*, but there are plenty more combinations that might be useful, too!
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
-buzz>=3.0.6
+buzz>=3.0.8
 flask==1.1.1
 dash==1.1.1
 dash-core-components==1.1.1
2 changes: 1 addition & 1 deletion setup.py
@@ -35,7 +35,7 @@ def read(fname):
     license="MIT",
     keywords=[],
     install_requires=[
-        "buzz>=3.0.6",
+        "buzz>=3.0.8",
         "python-dotenv==0.10.3",
         "flask==1.1.1",
         "dash==1.1.1",
