GitHub - JavaScriptorian/textelixir

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
textelixir		textelixir
LICENSE		LICENSE
README		README
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg

Repository files navigation

# TextElixir

TextElixir is a Python module that facilitates the data collection and analysis of textual corpora. While some programs like AntConc and WordCruncher are faster in speed, TextElixir provides the flexibility to gather data on limitless queries without needing to navigate through a graphical interface. 


## Installation

Install textelixir with pip

```bash
pip install textelixir
```


    
## Importing TextElixir

```bash
from textelixir import TextElixir
```

## Tagging a Corpus

```http
  elixir = TextElixir('my_corpus.txt', lang='en', tagger_option='spacy:efficient:pos')
```

| Parameter | Type     | Description                |
| :-------- | :------- | :------------------------- |
| `filename` | `string` or `glob` | **Required**. Accepts a path to a filename, which can be a TXT file, a TSV file, or a glob of multiple TXT files. |
| `lang` | `string` | **Optional**. Accepts a language code. Defaults to `en` for English. For available languages, see [SpaCy](https://spacy.io/models) or [Stanza](https://stanfordnlp.github.io/stanza/available_models.html) for available tagging models. |
| `tagger_option` | `string` | **Required**. Accepts one of these for options for tagging:<br>`spacy:efficient:pos`: Uses a fast SpaCy tagger and uses tags like VERB, NOUN<br>`spacy:accurate:xpos`: Uses a slow SpaCy tagger and uses tags like VBZ, NN1<br>`stanza:pos`: Uses a Stanza tagger and uses tags like VERB, NOUN<br>`stanza:xpos`: Uses a Stanza tagger and uses tags like VBZ, NN1|

TextElixir will tag a TXT file, a glob of TXT files, or a TSV file. By running this line of code, it'll create a subfolder within the directory you provide and add all of the POS tags and lemmas to each word. By saving the tagged corpus into that subfolder, this line only needs to be run once.

If you provide a TSV file, make sure that each column has headers and that the column with the text you want tagged has the column header of `text`. By doing this, you can use the other columns for search filtering and reference grouping.
 
If you provide a glob of TXT files, the filename will still be associated to each word within the corpus, which helps with basic search filtering if the text filename has a semantic meaning.
 
TextElixir has a couple options for different taggers to use. At the moment, you can have TextElixir tag your text with SpaCy or Stanza taggers. The modules for spacy and stanza are included in the installation of TextElixir, but the language models for each must be downloaded prior to tagging a text.



## Loading the Tagged Corpus

```http
  elixir = TextElixir('my_corpus', verbose=False, punct_pos=['PUNCT', 'SYM', 'SPACE'])
```

| Parameter | Type     | Description                |
| :-------- | :------- | :------------------------- |
| `filename` | `string` | **Required**. Accepts the name of the subdirectory created when tagging a corpus. |
| `verbose` | `boolean` | **Optional**. If False, it turns off any messages printed out to the terminal. Defaults to True. |
| `punct_pos` | `list` | **Optional**. Accepts a list of strings of the parts of speech that you want to consider `subwords`. These words will be skipped when doing searches and when calculating collocates, ngrams, etc. To search for punctuation, you'll need to set this argument to an empty list.|

When loading a tagged corpus, the `lang` and `tagger_option` arguments are no longer needed.

# Search
Performing a search is one of the main methods that you can use on a **TextElixir**. The search method returns a **SearchResults** class, which contains the frequency of the search query and a list of indices for where your search query occurs in the corpus. The SearchResults class contains several other methods for calculating collocates, frequency distribution charts, sentences, and concordance lines.
## Searching for a Word

```bash
results = elixir.search('engage')
```
## Searching for a Lemma

```bash
results = elixir.search('ENGAGE')
```


## Searching for a Part of Speech

```bash
results = elixir.search('/VERB/')
```


## Searching for a Word with its Part of Speech

```bash
results = elixir.search('leaves_NOUN')
```


## Searching for a Lemma with its Part of Speech

```bash
results = elixir.search('ADVOCATE_NOUN')
```
## Searching for a Phrase

```bash
results = elixir.search('/ADJ/ cat')
```
## Searching with Wildcards

```bash
# Find inform, informs, information, etc.
results = elixir.search('inform*') 
# Find bat, cat, etc.
results = elixir.search('?at') 
```
## Searching with Regular Expressions

```bash
# Find 4-digit numbers
results = elixir.search(r'\d{4}', regex=True)
```
## Search for Words Separated by Distance

```bash
# Finds the word 'supporter' 1-5 words away from the lemma 'cat'
results = elixir.search('supporter ~5~ CAT')
```
# Filter the Corpus Prior to Search
Filters can be applied to a corpus prior to searching. This allows you to get data on specific sections of your corpus rather than the entire corpus.

## Positive Filter
```
# Searches for the word 'advocate' as long as it's within the category of 'Philosophy'
results = elixir.search('advocate', text_filter={'category': 'Philosophy'})
```
## Negative Filter
```
# Searches for the word 'advocate' as long as it's NOT within the category of 'Philosophy'
results = elixir.search('advocate', text_filter={'category': '!Philosophy'})
```
# Calculate KWIC/Concordance Lines

```
kwic = results.kwic_lines(before=8, after=8, group_by='lower')
```


## Export KWIC Lines to HTML
This will generate a webpage with an interface to sort and filter KWIC lines. You can also switch the display of columns to show pos, lemma, and lower text.

```
results.export_as_html('my_kwic_lines.html')
```


## Export KWIC Lines to TXT

This will generate a text file with just the KWIC lines. There is a tab character before and after the search hit, making it easy to paste into Google Sheets.

```
results.export_as_txt('my_kwic_lines.txt')
```
# Calculate Sentences

Get the full sentence for more context of your search query.

```
sentences = results.sentences()
```