code-samples repo and project notes

draft code to communicate ideas.

Please read the proposal doc for audience and intent: https://docs.google.com/document/d/1aLM0y56zCDUqUd6NRyiL0Nl9HPbKognBvK0aczcq3_4/edit?usp=sharing (this is a copy that can be commented on)

Would recommend playing with Lexos if you haven't.

Notes & ToDo List

Goal: single-document analysis/vis, but also comparative multi-document analysis in as many cases as possible.

"Analysis" to offer

Counting Things

  • characters (maybe - for punctuation in particular)
  • words / lemmas
  • sentences
  • POS
  • bigrams -- needed as part of the tokenization step
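
A minimal sketch of the counting step, assuming NLTK plus its tokenizer and tagger data are already downloaded; the file path is just a placeholder. It produces word, sentence, POS, and bigram counts from the same token stream.

```python
# Minimal counting sketch. Assumes NLTK with its 'punkt' and
# 'averaged_perceptron_tagger' data downloaded; the path is a placeholder.
from collections import Counter
import nltk

text = open("data/emma.txt", encoding="utf-8").read()

sentences = nltk.sent_tokenize(text)
words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]

word_counts = Counter(words)                                  # word frequencies
pos_counts = Counter(tag for _, tag in nltk.pos_tag(words))   # POS frequencies
bigram_counts = Counter(nltk.bigrams(words))                  # bigrams over the same tokens

print(len(sentences), "sentences,", len(words), "words")
print(word_counts.most_common(10))
print(pos_counts.most_common(5))
print(bigram_counts.most_common(5))
```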

StopWord Handling!

  • We should always display the stop words used in the UI -- because they are a choice and carry consequences.
  • Start simple with an inline variable, then a file, then an in-browser tool to update them?
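
For the "inline variable" starting point, something like this sketch: the stop list is an ordinary Python variable that the report/UI can always echo back, so the choice stays visible. The word list itself is only an illustrative handful.

```python
from collections import Counter

# Stop words as a plain inline variable -- easy to echo back to the user
# so the choice stays visible. (Illustrative starter list only.)
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def filtered_counts(words, stop_words=STOP_WORDS):
    """Count words with the stop list applied; also return the list that was used."""
    counts = Counter(w for w in words if w not in stop_words)
    return counts, sorted(stop_words)

counts, used = filtered_counts("the cat sat on the mat and the cat slept".split())
print("Stop words applied:", used)
print(counts.most_common())
```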

Vis using the analysis/counts above

  • structure of page/doc? -- is this interesting? might be a good newbie view.

    • (does this make sense in multi-document cases?)
    • option to highlight items "inline"
    • Examples:
      • Ben Fry's Darwin thing
      • Art things on my pinterest
      • My example in html_js/book_shape.html, which doesn't handle leading spaces on lines yet (e.g., tabs, indents, poetry, etc.)
  • count totals, count as percentage of whole

    • bars
    • word clouds and variants
    • tf-idf for multi-document cases (see the Python sketch after this list)
      • see examples like html_js/shiffman_tfidf.html, wordclouds_with_shiffman_tfidf.html
  • Word networks

  • Word Clouds: Let's investigate ways to make them less sucky/more tolerable.

    • Get rid of random coloring of words; use color only as an indicator
    • Idea: little bar charts beside wordclouds to show distribution of counts?
    • Idea: ordered words with counts (mocked up)
    • Extremely useful in small-multiple/multi-doc situations, but there are design issues.
      • Tf-idf sizing is most interesting in that case. See:
      • Show dynamic side-by-side and merged, with difference. Examples:
      • merged: maybe a network style? also, that NYT bubble thing with circles.
      • need a good design for the merge operation when overlapping words have different counts in each document
    • Single doc cases can become small-multiple if we allow word-clouds of POS, chars, etc.
    • Ideally perform bigram analysis first!
    • stop-word selection is an iterative process driven by the word cloud displays
  • Timeseries

    • show the location of a word in the doc over time (concordance view)
    • use a user-settable window size, count within each window, and show the counts over time
      • examples are wordclouds per chapter in a book, in order
      • who talks when in a debate / play
      • the Tarantino obscenity chart on FiveThirtyEight. We should be able to make that.
      • Simple example in timeseries.html shows just words per chapter in order in Emma.
    • Vis types over time - bar, line, even word cloud??
  • Clustering docs

    • see hclust code -- actually, this is pretty awkward in Python because it requires several libraries; R's is easier. (See my notebook code in the python dir using Pattern. scikit-learn is easier to use but harder to explain to newbies, and getting the data out of Pattern is awkward. Maybe NLTK is the best approach; I think you can save out numpy arrays easily?) A rough Python sketch follows this list.
    • R examples: https://eight2late.wordpress.com/2015/07/22/a-gentle-introduction-to-cluster-analysis-using-r/, mine in: R/tm_clustering_example.R
    • tree view for first version? see output from R networkD3 dendrogram in html_js/networkD3_hclust_output_from_R.html
    • Ideally the tree clustering allows collapsible nodes for large trees, and customizable labels on the edges.
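
As flagged in the tf-idf and clustering bullets above, here is a rough Python sketch of one path through both: tf-idf vectors per document with scikit-learn, Ward hierarchical clustering with SciPy, and a nested-dict export that a d3/networkD3-style tree view could read. The file names and the JSON shape are illustrative assumptions, not a decided format, and converting to a dense matrix only makes sense for small corpora.

```python
# Rough sketch: tf-idf across documents, then hierarchical clustering,
# exported as a nested dict that a JS tree/dendrogram view could consume.
# Assumes scikit-learn and SciPy; paths and the JSON shape are illustrative.
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, to_tree

docs = {"emma": open("data/emma.txt", encoding="utf-8").read(),
        "persuasion": open("data/persuasion.txt", encoding="utf-8").read(),
        "moby_dick": open("data/moby_dick.txt", encoding="utf-8").read()}
labels = list(docs)

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs.values())        # docs x terms, tf-idf weighted

# Ward linkage on the dense tf-idf vectors; fine for a handful of documents.
tree = to_tree(linkage(matrix.toarray(), method="ward"))

def to_dict(node):
    """Convert a SciPy ClusterNode into {name, children} for a JS tree view."""
    if node.is_leaf():
        return {"name": labels[node.id]}
    return {"name": "", "children": [to_dict(node.left), to_dict(node.right)]}

with open("cluster_tree.json", "w", encoding="utf-8") as out:
    json.dump(to_dict(tree), out, indent=2)
```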

How about a Structure Like This for the Site...

  1. Getting Set Up:
    • Options for R, Python, Pure JS
    • (Explain cons of just in-browser JS)
  2. Shapes of texts
  3. Concordance-views (simple search) -- keywords in context (a KWIC sketch follows this list)
  4. Tokenization, simple counts, stop words [We explain that fancy tokenization happens in the Python/R scripts; simple space separation can be done in JS.] Word networks come here, as do bigrams/n-grams.
  5. Parts of speech, lemmatization - show how counts change, what POS gets you.
  6. Word clouds -- this sets us up for tf-idf, since simple counts are poor for comparing documents and tf-idf does better.
    • includes a variety of word cloud types -- bubble/network, regular, maybe my ordered-count CSS version
  7. Time Series - breaking a text into sections, using multiple texts that have time ordering
    • Maybe simple sentiment via the polarity word lists I added to the repo, too.
  8. Clustering (if we get to it, or I can add it later, if you help with the cluster-output-to-tree structure for js)
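
For the concordance view in item 3, a small keyword-in-context sketch at the "simple search" level, using plain regex word splitting with a fixed context window (no NLTK needed); the file path and search word are only examples.

```python
# Simple keyword-in-context (concordance) sketch: plain word splitting,
# fixed context window. A hypothetical helper, not the eventual site code.
import re

def kwic(text, keyword, window=5):
    """Yield (left, keyword, right) context tuples for each match."""
    tokens = re.findall(r"\w+", text.lower())
    key = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok == key:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield left, tok, right

for left, word, right in kwic(open("data/emma.txt", encoding="utf-8").read(), "handsome"):
    print(f"{left:>40} | {word} | {right}")
```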

Languages for Discussion

Discuss: Scripts for pre-processing (python/r) and/or in-page analysis with js.

JS

  • RiTa for POS: http://www.ghostweather.com/files/image_replacement/; I haven't experimented with all of its capabilities yet. Not sure I believe it can be as good as NLTK/SpaCy or R's tm/NLP/SnowballC, etc.
  • shiffman's tf-idf in js (used in some of my examples)
  • wordcount.js
  • that Natural node/js lib had issues last time I used it, but Shiffman was issuing merge requests against it...
  • I included some dude's textAnalysisSuite in the html_js/js dir, but I don't understand its tf-idf. The bigram thing looks interesting/useful.
  • I don't think the JS tools are as good or as complete yet; and for sizable projects that would be slow in the browser, would we give Node instructions? (Argh)

Python

  • Note that the Python libraries generally need data downloads; SpaCy and NLTK both do.
  • I can make command-line scripts that prep data as JSON if we want that... My example script in preprocess_files.py was for another class, and I used it here to get POS for the word cloud experiments (a stripped-down sketch follows this list).
  • Need to document and test the install & run instructions thoroughly for newbies.
  • Requires command line expertise
  • SpaCy might be a good lib to use; I have used it for word2vec-related things but haven't carefully compared it with NLTK/Pattern. It won't do tf-idf for us.
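
A stripped-down sketch of the kind of command-line prep script mentioned above (not the actual preprocess_files.py): read a text file, tag POS with NLTK, and write counts out as JSON for the browser-side vis to load. Assumes NLTK plus its tokenizer/tagger data; the output shape is just one possibility.

```python
# Hypothetical minimal prep script: text file in, JSON counts out.
# Usage: python preprocess_sketch.py input.txt output.json
import json
import sys
from collections import Counter
import nltk

def preprocess(path):
    text = open(path, encoding="utf-8").read()
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    tagged = nltk.pos_tag(words)
    return {
        "source": path,
        "word_counts": Counter(words).most_common(200),
        "pos_counts": Counter(tag for _, tag in tagged).most_common(),
    }

if __name__ == "__main__":
    infile, outfile = sys.argv[1], sys.argv[2]
    with open(outfile, "w", encoding="utf-8") as out:
        json.dump(preprocess(infile), out, indent=2)
```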

R

  • Is this actually the simplest? Scripts that can be run from RStudio or the command line?
  • Still requires certain packages
  • My R is rusty, but Jim's is good, I bet.

Discuss: Should we have some kind of consistent format like JSON for output from the scripts? Or CSV, which is easier for newbies? Configuration settings could be in a simple JSON file separate from the data files.

ToDo: Compare the accuracy and quality of the results across the three languages and the multiple libraries in Python/R. There are a lot; I can do that.

Data Sets ToDos

  • Need to add a play and/or a script
  • Poetry examples
  • Maybe recipes? Some very different genre...
