This repository has been archived by the owner. It is now read-only.
Various tools for working with text and corpora

#Text-Tools Overview

Stopped using GitHub 2018-04-19 - not worth the hassle for the number of people using the service. Just email me for code and help. Below is a description of what USED to be here.

Various tools for working with text and corpora.

  • Some projects have been described in articles. PDFs available here
  • My normal website here

One option for learning vocabulary is for students to self-select words they do not know from a frequency list. However, monitoring student progress on this can be difficult.

This tool:

  • downloads vocab that students have submitted via Google Drive Forms and cleans it
  • downloads quicktypes students have submitted
  • generates a Learner Analytics-style dashboard letting students see visually if they are having problems (based on the Purdue Signals project - i.e. red traffic light = falling well behind, green = OK)
  • generates personalised vocabulary tests (parallel sentence cloze + definition matching, with a number of options for answer format)
  • exports the above as web pages for sharing back to students using a website or Dropbox, etc.
  • detects cheating (or accidental repeated submission)
  • produces class-by-class overviews for the teacher
  • detects issues in the level of words chosen by adding modules from (i.e. are students working on the first 1-2000 words, or are they choosing random useless words?)

The idea is to catch issues when they happen - issues that weekly checks or end-of-term checks by a teacher would miss due to the quantity of data (>280,000 tokens/semester, currently).
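As a rough illustration of the traffic-light and repeated-submission checks described above, here is a minimal sketch. The function names, the 50% and 80% thresholds, and the idea of comparing submitted words against an expected count are all assumptions made for this example, not the tool's actual logic.

```python
from collections import Counter


def traffic_light(submitted, expected, amber_ratio=0.8, red_ratio=0.5):
    """Return 'red', 'amber' or 'green' for one student's progress.

    The ratios are hypothetical thresholds chosen for illustration.
    """
    if expected <= 0:
        return "green"  # nothing is due yet
    ratio = float(submitted) / expected
    if ratio < red_ratio:
        return "red"  # falling well behind
    if ratio < amber_ratio:
        return "amber"
    return "green"


def repeated_submissions(words):
    """Return words submitted more than once, ignoring case and spacing."""
    counts = Counter(w.strip().lower() for w in words)
    return sorted(w for w, n in counts.items() if n > 1)


print(traffic_light(10, 40))
print(repeated_submissions(["Cat", "dog", "cat "]))
```

Repeated words could be either cheating or an accidental double submission, so a sketch like this can only flag them for the teacher to check.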

Work in progress to

  • add more visualisation
  • add more quiz types, and more options for source data
  • maybe add adaptive learning suggestions (i.e. "If you learned these words this week, here are some words at the next level you might learn")
  • integrate the vocab and QW data.

Article describing this tool here for the quickwrite chunks. It was a poster at BAAL 2015 (Aston Univ., September 3-5). Article in progress for the Vocab sections, hopefully attached to a EuroSLA presentation.

Similar in function to AntWordProfiler or the LexTutor VocabProfile, but aimed more at editing. Specifically, editing short texts down to a specific vocab level (e.g. when no words outside a list are allowed in an entrance exam or reader).

The tool will:

  • give basic information on the vocab in the text (tokens, types, families) by level (BNC-COCA, BNC, AWL+GSL)
  • highlight the vocabulary level of each word by colour
  • list the words which are from each list
  • list words MISSING from each list (e.g. if seeding a graded reader)
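The profiling steps listed above can be sketched roughly as follows. This is an illustration only: the function names, the tiny word lists and the regex tokenizer are assumptions, and it counts tokens and types but not word families, which would need family-grouped lists.

```python
import re
from collections import Counter


def profile(text, level_lists):
    """Count tokens per level. level_lists is an ordered list of (name, set)."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    by_level = {name: Counter() for name, _ in level_lists}
    by_level["off-list"] = Counter()
    for tok in tokens:
        for name, words in level_lists:
            if tok in words:
                by_level[name][tok] += 1
                break  # first matching list wins
        else:
            by_level["off-list"][tok] += 1
    return by_level


def summarise(by_level):
    """Return {level: (token_count, type_count)}."""
    return {name: (sum(c.values()), len(c)) for name, c in by_level.items()}


def missing(by_level, level_lists):
    """Return list words that never appear in the text, per level."""
    return {name: sorted(words - set(by_level[name])) for name, words in level_lists}


# Hypothetical two-level lists, just for the demonstration.
lists = [("1k", {"the", "cat", "sat"}), ("2k", {"mat", "rug"})]
result = profile("The cat sat on the mat.", lists)
print(summarise(result))
print(missing(result, lists))
```

Colour highlighting would then be a matter of mapping each token's level name to a colour when rendering the text.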

Development stopped. Requires Python versions 2.7 and above. Constantly breaking one thing when I fix another, so let me know if you spot problems.

Article describing tool here

Write a quiz using Google Forms, and this tool will download the data, mark the quiz, and give Item Facility, Item Discrimination and distractor efficiency analysis. Tests for criterion referenced on the way. Article in press for JSCE.

Initial attempt to assist in identification of formulaic sentence starters in a corpus of learner essays, as the existing tools tried were token-focused and sentence-blind. Pulls, combines, and sorts 3, 4 and 5-gram sequences from the beginning of each sentence in a folder of text files. For example, the top five three-token (including punctuation) sentence starters used in 90 Japanese undergraduate university essays were found to be: "For example , / In addition , / However , I / Of course , / When I was". Very rough draft - distribution is not accounted for, so it is easily misled by repetition or fixed essay topics. Currently requires NLTK for sentence tokenization. No automatic cleaning is used, so without hand cleaning many of the top results are not really formulaic phrases. With tweaks, it could be used to measure uptake of taught items or identify lacks.
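A minimal sketch of the sentence-starter idea, under stated assumptions: the original uses NLTK for sentence tokenization, whereas this example swaps in a naive regex split so that it runs without dependencies, and the function name `sentence_starters` is invented for illustration.

```python
import re
from collections import Counter


def sentence_starters(text, n=3):
    """Count the first n tokens (punctuation included) of each sentence."""
    # Naive split after ., ! or ? followed by whitespace (not NLTK quality).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    counts = Counter()
    for sentence in sentences:
        # Words (with optional apostrophe part) or single punctuation marks.
        tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)
        if len(tokens) >= n:
            counts[" ".join(tokens[:n])] += 1
    return counts


essay = "For example, I like tea. For example, he likes coffee. However, I do not."
for starter, freq in sentence_starters(essay).most_common():
    print(freq, starter)
```

Calling it with `n` set to 3, 4 and 5 in turn and merging the counters would approximate the combined list described above. Like the original, it does no cleaning and ignores distribution across files.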

Article describing this tool here - used to compare the Uppsala Student English corpus with a small group of competition essays. and

Python script for pulling usable word lists (single tokens, bigrams, trigrams, 4-grams and 5-grams) from Google Books nGram datasets. The English section is about 300 billion tokens (20.6 billion "the"s alone), but the files are not in a format that can be opened by normal software or processed by standard corpus software. The files are 1GB each (up to 800 of them), with words dating back to 1520, split across files by capitalisation. This tool generates a much more easily analysed summary list (e.g. "words before 1994 occurring 10000+ times"). Not really for teaching use - OCR errors abound and teachers should probably stick to COCA/BNC unless they have a desperate need for verbs from the 17th century. The data format changed in 2012, hence the new script. The new format is much easier to process, and has POS. Project has stopped development, but email me if it's broken.
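The kind of summary described above ("words before 1994 occurring 10000+ times") could be produced along these lines. This sketch assumes the 2012-style layout of one tab-separated record per line (ngram, year, match count, volume count); the function name and defaults are illustrative, and it is not the original script.

```python
from collections import Counter


def summarise_ngrams(lines, before_year=1994, min_count=10000):
    """Sum match counts per ngram for years before a cutoff, then filter."""
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip anything that is not in the assumed layout
        ngram, year, match_count, _volume_count = parts
        if int(year) < before_year:
            totals[ngram] += int(match_count)
    return {ngram: total for ngram, total in totals.items() if total >= min_count}


# Made-up records, for demonstration only.
sample = [
    "the\t1990\t6000\t10\n",
    "the\t1993\t5000\t9\n",
    "the\t1995\t9000\t9\n",
    "thee\t1650\t40\t3\n",
]
print(summarise_ngrams(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a 1GB file is read line by line rather than loaded whole.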

Article describing this tool here

Simple alternative to Zotero, Mendeley, and other reference management software: if you have article PDFs and note TXT files with the same name, the program will traverse folders and find all hashtags in the TXT files (e.g. #memory or #socialpsych), creating an HTML file with a tag cloud and an index with links to the notes/PDFs. Checks every file one time for every tag, so it is best to break up projects once they get above 30 tags or so. Created for myself because I was scared of vendor lock-in (i.e. getting notes and files INTO Zotero is easy, but getting them OUT again should development stop is difficult). Use the related tools in the "Single Use" folder to clean filenames, produce blank TXT note files from PDF filenames, or produce a reference list in TXT format.
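A rough sketch of the hashtag indexing idea follows. Note that it differs from the description above in one respect: it reads each TXT file once and collects every tag in a single pass, rather than checking every file once per tag. The function names, the tag pattern and the plain HTML list output are assumptions for illustration.

```python
import os
import re
from collections import defaultdict

# A tag is '#' plus a letter and further word characters, not preceded by a word character.
TAG = re.compile(r"(?<!\w)#([A-Za-z][\w-]*)")


def index_tags(root):
    """Walk root and map each lowercased hashtag to the TXT files containing it."""
    index = defaultdict(set)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(".txt"):
                continue
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for tag in TAG.findall(handle.read()):
                    index[tag.lower()].add(path)
    return {tag: sorted(paths) for tag, paths in index.items()}


def to_html(index):
    """Render the index as a simple HTML list with links to the note files."""
    rows = []
    for tag in sorted(index):
        links = ", ".join('<a href="{0}">{0}</a>'.format(p) for p in index[tag])
        rows.append("<li>#{0} ({1}): {2}</li>".format(tag, len(index[tag]), links))
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"
```

The count in brackets is the number of files per tag, which is the figure a tag cloud would use to size each tag.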

No article describing tool.

#SyllablesNaiveBayes

In answer to a request to take a vocab list as input and output the same list with syllable counts. Looks up two dictionaries (CMU, Moby), and falls back on a Naive Bayes classifier for words not in either dictionary. The classifier classifies best with one feature, so is useless. However, most words in normal lists are covered by the dictionaries anyway.
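The lookup-then-fallback pattern can be sketched as below. Two things here are assumptions rather than the original code: the tiny `PRONUNCIATIONS` table stands in for a real CMU-style dictionary (where vowel phonemes end in a stress digit), and the fallback is a simple vowel-group heuristic swapped in for the Naive Bayes classifier.

```python
import re

# Illustrative CMU-style entries: each vowel phoneme carries a stress digit.
PRONUNCIATIONS = {
    "vocabulary": ["V", "OW0", "K", "AE1", "B", "Y", "AH0", "L", "EH2", "R", "IY0"],
    "list": ["L", "IH1", "S", "T"],
}


def count_from_dictionary(word):
    """Return the syllable count from the dictionary, or None if not listed."""
    phones = PRONUNCIATIONS.get(word.lower())
    if phones is None:
        return None
    return sum(1 for phone in phones if phone[-1].isdigit())


def count_by_vowel_groups(word):
    """Fallback heuristic: count runs of vowel letters, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def syllables(word):
    """Dictionary lookup first, heuristic only for unlisted words."""
    count = count_from_dictionary(word)
    return count if count is not None else count_by_vowel_groups(word)


for word in ["vocabulary", "list", "banana"]:
    print(word, syllables(word))
```

The heuristic miscounts words with silent vowels, which is less of a problem if, as noted above, most words in normal lists are found in the dictionaries.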

No article describing tool.

Implementation of markdown and a templating system. Takes a text file and turns it into a website. Main addition to markdown is quizzes, added to help flip the classroom slightly (i.e. get students to learn at home so there is more practice time in class). Using the syntax "questionQQQcorrectanswernumberQQQanswer1QQQanswer2QQQanswer3QQQanswer4", creates self-marking JavaScript quizzes inside the web page. Also, tables. Internal templates option, so site and quizzes work on Dropbox without the need for an actual website.
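To make the QQQ quiz syntax concrete, here is a minimal parsing sketch. The function name is hypothetical, and treating the correct answer number as 1-based is an assumption suggested by the "answer1 ... answer4" naming.

```python
def parse_quiz_line(line, sep="QQQ"):
    """Split one quiz line into its question, answers and correct answer number."""
    parts = [part.strip() for part in line.split(sep)]
    if len(parts) < 4:
        raise ValueError("expected a question, an answer number and at least two answers")
    question, correct, answers = parts[0], int(parts[1]), parts[2:]
    # Assumption: the answer number counts from 1, matching answer1..answer4.
    if not 1 <= correct <= len(answers):
        raise ValueError("correct answer number is out of range")
    return {"question": question, "answers": answers, "correct": correct}


quiz = parse_quiz_line("What is 2+2?QQQ2QQQ3QQQ4QQQ5QQQ6")
print(quiz["question"], "->", quiz["answers"][quiz["correct"] - 1])
```

A generator for the web page would then turn each parsed dictionary into the HTML and JavaScript for one self-marking question.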

No article describing tool, but my website uses it: examples of pages and quizzes available here

Imports a vertical word list and outputs the same list with a count of syllables in each word. Created in response to a request on the JALT Vocab SIG Facebook page. No idea what it is needed for, but I'll be using it to free my Readability tool from NLTK requirements.

No article describing tool.

Impress is an alternative to PowerPoint, but it can be fiddly. This takes a text file - each line is a slide. The line can be the name of an image file, or it can be Markdown text.
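The "one line per slide" rule might be sketched like this. The list of image extensions, the function name, and the choice to skip blank lines are all assumptions for illustration; how the original decides what counts as an image is not stated.

```python
import os

# Hypothetical set of extensions treated as image slides.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def read_slides(lines):
    """Classify each non-blank line as an image slide or a Markdown slide."""
    slides = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption: blank lines do not create slides
        extension = os.path.splitext(text)[1].lower()
        kind = "image" if extension in IMAGE_EXTENSIONS else "markdown"
        slides.append((kind, text))
    return slides


for kind, text in read_slides(["# Title\n", "photo.JPG\n", "\n", "Some *bold* point\n"]):
    print(kind, text)
```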

No article describing tool, but look at Prentice, M. (2015, December 20). Easy test analysis with Google Forms for an example.

#General Tools

These tools are imported into other tools:

  • GoogleTools.py - collection of modules for downloading files from Drive, parsing share codes, handling timestamps from Google Drive Forms, etc.
  • - handles mailing things using a (throwaway) gmail address
  • - currently broken, pulls text from PDFs
  • - everything else. Unicode, punctuation, bigram collection.

#Dead projects

  • - The old collocation and readability tools from TextGrader, moved out because they require NLTK and were making it hard to package the app for distribution. Ceased development because until recently I had no syllable counter.
  • - tried to make my own calendar app (displays in GeekTool, emails a reminder in the morning). Works OK but too fiddly. Dead.

#Single use tools

Just some scraps I've needed in the past to pre- and post-process data.

  • convertwordnettodictionary - pulls definitions and POS tags only from WordNet for use in Vocab test generation
  • - cleans a folder of files from bad characters in their filenames (e.g. for Dropbox syncing)
  • - counts the lines in a file, without counting blank lines. For when you can't open a file because it is too big
  • - tool to match parallel TED subtitles. Failed - code OK, but subtitles matched by time not meaning (i.e. verb in Japanese is at end, so appears at different time code).
  • NLTK_Overlap - Two scripts for basically running DIFF on two texts - one takes files, the other takes cut-and-paste text
  • - Turns a folder of Paul Nation's word lists for RANGE (organised by tab with numbers) into lines, with headword at beginning, so easier to process for families. Used to prepare lists for
  • - takes a folder of PDFs and produces one text file for each, ready for notes, with optional default tags. For TextFileHashTagger
  • TextFileHashTagTXTFILEConverter - given a text file with one reference per line, produces one text file for each, ready for notes, with optional default tags. For TextFileHashTagger
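As an example of how small these scraps are, the non-blank line counter in the list above could look something like this sketch (the function name is hypothetical). It reads the file one line at a time in binary mode, so file size and text encoding do not matter.

```python
def count_nonblank_lines(path):
    """Count lines containing something other than whitespace."""
    count = 0
    with open(path, "rb") as handle:
        for line in handle:  # streams the file instead of loading it whole
            if line.strip():
                count += 1
    return count
```

Usage is just `count_nonblank_lines("some_big_file.txt")`, where the filename is a placeholder.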