### What is Text Analysis (aka Text Mining)?
> *The use of computational methods and techniques to extract knowledge from text.*

- Often discovery of previously unknown structure 
- Often very large amounts of text
- Often unstructured text
- Automatic or semi-automatic information retrieval (IR) process
- Often some form of bird's-eye view of the text

### Why is it relevant?
- Rapid growth in the amount of available unstructured text
- Digitized newspapers & cultural heritage
- Online open data, social media, blogs
- Computational methods are needed because of the sheer size of the text
- Business cases (Google etc.)

### Some applications / domains
- Document classification / clustering
- Document search (e.g. Google search)
- Text similarity (plagiarism, “viral text”)
- Find sentiment in text (Twitter, forum posts)
- Find actors and entities in text
- Word trends

### Challenges
**What’s easy for humans can be extremely hard for computers**
- Ambiguity and fuzziness of terms and phrases
- Poor data quality
- Context, metadata, domain-specific data
- Data size (too much, too little)
- Missing data

**Human-in-the-loop work (e.g. manually labelling data for supervised learning) can be very expensive**


### Representation

The text must be represented in a form that enables the use of computational methods,
e.g. bag-of-words (BOW).
(insert illustration)

BOW is a simple representation that assumes that word order is irrelevant:

- Each distinct term is assigned a unique index { t1, t2, …, tn }
- A document is then just a vector of word counts { c1, c2, …, cn }
- The count of term t2 is c2, and so on
- The result is a simple vector representation of documents
- Vector computations are something computers do far better than humans
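As a minimal sketch of the bag-of-words idea, assuming plain whitespace tokenization and lower-casing (both simplifications), the term numbering and the count vectors could be built as follows. The function names `build_vocabulary` and `bag_of_words` are made up for this illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Give each distinct term an index (t1, t2, ..., tn) in order of first appearance."""
    vocabulary = {}
    for doc in documents:
        for token in doc.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def bag_of_words(document, vocabulary):
    """Return the count vector (c1, c2, ..., cn) for one document."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        vector[index] = counts[term]  # Counter returns 0 for unseen terms
    return vector


docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Note that word order is lost: "the dog sat" and "sat the dog" would produce the same vector.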

### Table of Contents

**DESCRIBE: Sample basic text data metrics (feature extraction)**

- Number of words
- Number of characters
- Average word length
- Number of stopwords
- Number of special characters
- Number of numerics
- Number of uppercase words
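The metrics listed above can be sketched with the standard library alone. This is an illustration under simple assumptions: words are whitespace-separated, "special characters" means ASCII punctuation, and the stopword list is a tiny made-up sample rather than a real one.

```python
import string

# Tiny illustrative stopword list (an assumption; real projects use a fuller list).
STOPWORDS = {"the", "a", "an", "and", "is", "in", "of", "to"}


def basic_text_metrics(text):
    """Compute simple descriptive metrics for one text."""
    words = text.split()
    return {
        "n_words": len(words),
        "n_chars": len(text),  # includes whitespace
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "n_stopwords": sum(1 for w in words if w.lower() in STOPWORDS),
        "n_special_chars": sum(1 for ch in text if ch in string.punctuation),
        # Strip surrounding punctuation so that "2019!" still counts as a number.
        "n_numerics": sum(1 for w in words if w.strip(string.punctuation).isdigit()),
        "n_uppercase_words": sum(1 for w in words if w.isupper()),
    }


metrics = basic_text_metrics("The NASA probe sent 42 images in 2019!")
print(metrics)
```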

**PREPARE: Sample basic text data preparations**
- Lower casing
- Punctuation removal
- Stopwords removal
- Frequent words removal
- Rare words removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization
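A few of the preparation steps above (lower casing, punctuation removal, whitespace tokenization, stopword removal and rare word removal) can be sketched as below. The stopword list and the `min_count` threshold are illustrative assumptions. Spelling correction, stemming and lemmatization normally rely on an NLP library and are left out of this sketch.

```python
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real stopword resource).
STOPWORDS = {"the", "a", "an", "and", "is", "on", "of", "to"}


def prepare(text):
    """Lower-case, strip ASCII punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def drop_rare(token_lists, min_count=2):
    """Remove tokens that occur fewer than min_count times in the whole corpus."""
    totals = Counter(tok for tokens in token_lists for tok in tokens)
    return [[t for t in tokens if totals[t] >= min_count] for tokens in token_lists]


corpus = ["The cat sat on the mat.", "A cat and a dog!", "The dog sat."]
prepared = [prepare(doc) for doc in corpus]
filtered = drop_rare(prepared)
print(prepared)
print(filtered)
```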

**DESCRIBE / MODEL: Sample more advanced text modelling**

- N-grams
- Term Frequency
- Inverse Document Frequency
- Term co-occurrence
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Pointwise mutual information
- Bag of Words
- Sentiment Analysis
- Vector Space Models (VSM)
- Word Embedding
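To make two of the items above concrete, here is a small sketch of n-grams and TF-IDF on already tokenized documents. TF-IDF exists in several variants; this one assumes relative term frequency and the unsmoothed natural-log IDF `log(N / df)`, which is only one common choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(token_lists):
    """Return one {term: weight} dict per document."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in token_lists for term in set(tokens))
    weights = []
    for tokens in token_lists:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = [["cat", "sat"], ["cat", "dog"], ["cat", "barked"]]
weights = tf_idf(docs)
print(ngrams(["the", "cat", "sat"], 2))
print(weights[0])
```

A term that occurs in every document ("cat" here) gets weight 0, while a term unique to one document gets the highest weight in it.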

### Text Analysis Sample Flow

The figure below gives a view of a generic text analysis workflow, together with a few sample tasks for each step in the flow. Note that the tasks in a specific project depend on, for instance, the state and quality of the text at hand, the specific research question and the kind of text. (Note that this is not the researcher's workflow.)

<img src="./images/text-analysis_sample_tasks.svg" style="width: 75%;padding: 0; margin: 0;">

<center><i><b>Fig.</b> Sample text analysis tasks</i></center>

This notebook focuses mostly on parts of the "Evaluate & Interpret" step, but also covers visualisations that can be used in the "Narrate & Disseminate" step.
Assessing the quality of a topic model is a qualitative process that requires a "human in the loop". The system can assist the researcher in a number of ways, with features such as:

* Easy way of browsing through topic-word distributions
* Easy way of browsing through document-topic distributions
* Intuitive ways of finding conceptual interpretations of topics

* Use of metrics to highlight suspect data

  * Display similarity of topics to known distributions (uniform distribution, mean corpus distribution etc.)
  * Display similar or overlapping topics, topic clusters (for some metric)
  * Display the ubiquity of topics
  * Display document clusters
  * Display topic-topic co-occurrence (same document)
  * Use reference documents that should have some expected topic?
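As a rough sketch of one such metric, the similarity of a topic's word distribution to the uniform distribution can be measured with the Jensen-Shannon divergence. The topic distributions and the threshold of 0.01 below are made-up values for illustration, and this is not necessarily the metric used in the notebook's own implementation.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, between 0 and 1 when using log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def distance_from_uniform(topic):
    """How far a topic-word distribution is from the uniform distribution."""
    uniform = [1 / len(topic)] * len(topic)
    return js_divergence(topic, uniform)


# Hypothetical topic-word distributions over a four-word vocabulary.
topics = {
    "topic_0": [0.25, 0.25, 0.25, 0.25],  # uninformative, looks uniform
    "topic_1": [0.85, 0.05, 0.05, 0.05],  # dominated by one word
}
scores = {name: distance_from_uniform(dist) for name, dist in topics.items()}
suspect = [name for name, score in scores.items() if score < 0.01]
print(scores)
print(suspect)
```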

This notebook contains sample implementations of some of these features. See also:
> - Reading Tea Leaves: How Humans Interpret Topic Models, Chang et al. (2009) <br></br>
> - Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality, Lau et al. (2014) <br></br>
> - Evaluation Methods for Topic Models, Wallach et al. (2009), http://dirichlet.net/pdf/wallach09evaluation.pdf

### A sample high-level workflow

Open to manual steps.
Feedback loop: refinement. Check the venue with Södertörn.

<img src="./images/Södertörn_workshop_workflow.svg" alt="" width="1200"/>


0. Intro: Humlab and me

1. Introduction
   What is text analysis?
   What can you do with text analysis?

   Why is it popular today?
   "Big data", "digitization", social media analysis, "the internet"
   What are the alternatives? When are there no alternatives?

   Examples: (Ben, Johan)

2. Key concepts

3. Areas and methods
   Search, classification, sentiment, IR, ..., network analysis, statistics, geolocation, ...

4. Basic corpus statistics

Corpus key figures

Documents, word frequencies, frequencies over time,
collocation, co-occurrence

Document view: Document | Words | Unique words (Types) | Lemmas | Unique words (Types) / Words | Words / Sentences | Mean (Types/Tokens) per 1000-token chunk

Text complexity?

Text analytics is a set of techniques and approaches for turning textual content into data, which can then be mined for insights, trends and patterns.

5. "Data science"



### Distributional hypothesis

>"You shall know a word by the company it keeps" (Firth, J. R. 1957:11)

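One way to make the hypothesis concrete is to count, for each word, which other words appear within a small window around it. The sketch below assumes a symmetric window and a toy sentence; the function name `context_counts` is made up for this illustration.

```python
from collections import Counter, defaultdict


def context_counts(tokens, window=2):
    """For each word, count the words occurring within `window` positions of it."""
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[word][tokens[j]] += 1
    return contexts


tokens = "the cat drinks milk the dog drinks water".split()
contexts = context_counts(tokens, window=1)
print(contexts["cat"])
print(contexts["dog"])
```

In this toy example "cat" and "dog" keep exactly the same company ("the" and "drinks"), which is the kind of signal that distributional methods such as word embeddings build on. With `window=1` the counts cover only adjacent words.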

## Glossary


|Term|Description|
|----|-----|
|Corpus||
|Document||
|Token, term, word||
|Delimiter||
|Sentence||
|Paragraph||
|Phrase||
|Term distribution (in corpus)||
|Term frequency||
|n-gram||
|BOW, CBOW|Bag-of-words (BOW), continuous bag-of-words (CBOW)|
|Dictionary||
|Stop words||
|Topic, topic modelling||
|Keyword extraction||
|Keyword in context (KWIC)||
|Co-occurrence|When two words occur in the same context (corpus, paragraph, sentence, window). Depends on the kind of context, window size, etc.|
|Collocation|When a word is adjacent to another word, i.e. the two words are next to each other (a subset of co-occurring words)|
|Word embeddings||

Visualization:

- Word cloud: word frequency (weight)
- Networks: collocation, co-occurrence
- Table of document term frequencies
- Table of corpus term frequencies
- Correlations

Document view (columns):
Document | Words | Unique words (Types) | Lemmas | Unique words (Types) / Words | Words / Sentences | Mean (Types/Tokens) per 1000-token chunk
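Most of the columns in the document view can be computed with a few lines of code. The sketch below assumes regex-based tokenization and a naive sentence split on `.`, `!` and `?`. It skips the lemma count, which needs a lemmatizer, and the last chunk may be shorter than `chunk_size`.

```python
import re


def document_view(text, chunk_size=1000):
    """Compute one row of the document view for a single text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    types = set(tokens)
    # Type/token ratio per fixed-size chunk, so that texts of different
    # lengths can be compared.
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    chunk_ttrs = [len(set(chunk)) / len(chunk) for chunk in chunks]
    return {
        "words": len(tokens),
        "types": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "words_per_sentence": len(tokens) / len(sentences) if sentences else 0.0,
        "mean_chunk_ttr": sum(chunk_ttrs) / len(chunk_ttrs) if chunk_ttrs else 0.0,
    }


# A tiny chunk size is used here only to keep the example readable.
row = document_view("The cat sat. The dog sat too!", chunk_size=4)
print(row)
```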
