### What is Text Analysis (aka Text Mining)?

> *The use of computational methods and techniques to extract knowledge from text.* [Wikipedia](https://en.wikipedia.org/wiki/Text_mining)

The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.

- Often discovery of previously unknown structure 
- Often very large amounts of text
- Often unstructured text
- Automatic or semi-automatic information retrieval (IR) process
- Often some form of bird's-eye view of the text

### Why is it relevant?
- Rapid growth in the quantity of available unstructured text
- Digitized newspapers & cultural heritage
- Online open data, social media, blogs
- Computational methods needed due to size of text
- Business cases/opportunities

### Some applications / domains
- Document classification / clustering
- Document search (e.g. Google search)
- Text similarity (plagiarism, “viral text”)
- Sentiment detection (Twitter, forum posts)
- Author attribution / stylometrics
- Find actors and entities in text
- Word / genre trends

### Challenges
- **What’s easy for humans can be extremely hard for computers**
- **Human-in-the-loop or supervised learning can be very expensive**
- Ambiguity and fuzziness of terms and phrases
- Poor data quality, errors in data, wrong data, missing data, ambiguous data
- Context, metadata, domain-specific data
- Data size (too much, too little)
- Computational methods require a structured internal representation
- Internal models are simplified views of the data
- etc...

### A sample high-level workflow

<img src="./images/text_analysis_workflow.svg" alt="" width="1200"/>

**DESCRIBE: Sample basic text data metrics (feature extraction)**

- Number of words
- Number of characters
- Average word length
- Number of stopwords
- Number of special characters
- Number of numerics
- Number of uppercase words
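
The metrics above can be computed with the Python standard library alone. The sketch below is only illustrative: the function name `basic_metrics`, the tiny `STOPWORDS` set and the whitespace tokenization are assumptions made for the example, not part of any particular toolkit.

```python
import string

# Hypothetical, deliberately tiny stopword list; real projects use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}


def basic_metrics(text: str) -> dict:
    """Compute simple descriptive metrics for a single text."""
    words = text.split()  # naive whitespace tokenization (assumption)
    return {
        "n_words": len(words),
        "n_chars": len(text),  # counts spaces as characters
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "n_stopwords": sum(w.lower() in STOPWORDS for w in words),
        "n_special_chars": sum(ch in string.punctuation for ch in text),
        "n_numerics": sum(w.isdigit() for w in words),
        "n_uppercase_words": sum(w.isupper() for w in words),
    }


metrics = basic_metrics("The NLP course has 25 students in 2 groups!")
```

Because the text is split on whitespace only, punctuation stays attached to words (e.g. `groups!`), which slightly inflates the average word length. A proper tokenizer would avoid that.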

**PREPARE: Sample basic text data preparation**
- Lower casing
- Punctuation removal
- Stopwords removal
- Frequent words removal
- Rare words removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization
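
A few of these preparation steps can be sketched in plain Python. The example covers lower casing, punctuation removal, tokenization, stopword removal and rare words removal. Spelling correction, stemming and lemmatization normally rely on an NLP library and are left out. The names `prepare` and `drop_rare`, the stopword list and the `min_count` threshold are assumptions made for illustration.

```python
import string
from collections import Counter

# Hypothetical stopword list, kept small for the example.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "in", "to", "on"}


def prepare(text: str) -> list[str]:
    """Lower-case, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]


def drop_rare(docs: list[list[str]], min_count: int = 2) -> list[list[str]]:
    """Remove tokens that occur fewer than `min_count` times in the whole corpus."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]


docs = [prepare("The cat sat on the mat."), prepare("A cat is in the garden!")]
filtered = drop_rare(docs)
```

Frequent words removal works the same way as `drop_rare`, with the comparison reversed and an upper threshold instead of a lower one.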

**DESCRIBE / MODEL: Sample more advanced text modelling**

- N-grams
- Term Frequency
- Inverse Document Frequency
- Term co-occurrence
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Pointwise mutual information
- Bag of Words / Representation
- Sentiment Analysis
- Vector Space Models (VSM)
- Word Embedding
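
As a rough illustration of the first few items, the sketch below computes n-grams, term frequency, inverse document frequency and TF-IDF on already tokenized documents. Several TF and IDF variants exist. This example assumes relative frequency for TF and the unsmoothed natural logarithm `log(N / df)` for IDF, and all function names are chosen for the example only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf(tokens):
    """Term frequency: a term's count divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}


def idf(docs):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    n_docs = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n_docs / d) for t, d in df.items()}


def tf_idf(docs):
    """One {term: weight} dictionary per document."""
    weights = idf(docs)
    return [{t: f * weights[t] for t, f in tf(doc).items()} for doc in docs]


docs = [["cat", "sat", "mat"], ["cat", "ate", "fish"], ["dog", "sat", "log"]]
bigrams = ngrams(docs[0], 2)
scores = tf_idf(docs)
```

In the first document, `mat` receives a higher weight than `cat` because `mat` occurs in only one of the three documents while `cat` occurs in two.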

### Text Analysis Sample Flow

- A generic text analysis workflow
- Sample tasks for each step in the flow.
- Tasks depend on specific project
- Also depend on state and quality of the text at hand
- ...the specific research question
- ...the kind of text etc.
(This is not the researcher's workflow.)

> <img src="./images/text-analysis_sample_tasks.svg" style="width: 75%;padding: 0; margin: 0;">
> <center><i>**Fig**. Sample text analysis tasks</i></center>


# Some Concepts
- **Computers _only_ understand numbers - everything else is abstractions of numbers**

### Bag-of-Words

> - A simple **vector representation** of documents that discards word order and grammar
> - Each unique word is assigned a number (e.g. 1 to N where N is number of unique words)
> - A document is an N-sized vector giving the number of times each word occurs in the document
> - Index n holds the count for the word represented by number n
> - Most entries are 0, i.e. the vector is sparse and can be stored memory-efficiently
> - **Very** efficient and reliable numerical methods can be used on this representation
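
The representation described above can be sketched in a few lines of Python. The helper names are hypothetical, and indices run from 0 to N-1 (rather than 1 to N) to match Python lists. The sparse form stores only the non-zero entries as an `{index: count}` dictionary.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign each unique word an index 0..N-1 (sorted for reproducibility)."""
    words = sorted({w for doc in docs for w in doc})
    return {w: i for i, w in enumerate(words)}


def to_dense(doc, vocab):
    """N-sized count vector; most entries are 0 for a realistic vocabulary."""
    vec = [0] * len(vocab)
    for w in doc:
        if w in vocab:
            vec[vocab[w]] += 1
    return vec


def to_sparse(doc, vocab):
    """Same information, but only non-zero counts are stored."""
    return dict(Counter(vocab[w] for w in doc if w in vocab))


docs = [["the", "cat", "sat", "on", "the", "mat"], ["the", "dog", "sat"]]
vocab = build_vocabulary(docs)
dense = to_dense(docs[0], vocab)
sparse = to_sparse(docs[1], vocab)
```

Note that the word order of each document cannot be recovered from either vector, which is exactly the information a bag-of-words model gives up.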

### Distributional hypothesis

> "You shall know a word by the company it keeps" (Firth, J. R. 1957:11)

### NLP
> Natural language processing


|Term|Description|
|----|-----|
|NLP|Natural language processing|
|IR|Information retrieval (or extraction): the use of computational methods to extract information from data|
|Corpus|A collection of (related) documents or texts|
|Document|A text document (an article, web page, tweet, book)|
|Token, term, word| |
|Delimiter|A character or sequence of characters that separates parts of the text (e.g. words, sentences, paragraphs, chapters)|
|Phrase| |
|Entity|A "thing", a distinct item (can be referred to as a proper noun)|
|Term distribution (in corpus)| |
|Term frequency|Frequency of word occurrences in a document|
|n-gram| |
|BOW, CBOW|A simplified (yet powerful) representation of a document where the text has no order, structure or grammar. All the words are thrown into a "bag" and hence the contextual information is lost. A common representation is a word distribution vector.|
|Dictionary| |
|Stop words|Words (tokens) that are often discarded from text analysis to improve quality or performance, e.g. very frequent words|
|Sentiment analysis|A method that assigns a measure of sentiment (conveyed emotional meaning) to a text (or to words)|
|Parsing|The process of analysing the syntax of a text. Creates an internal representation (data structure) of the component parts/units such as words, delimiters, sentences and paragraphs.|
|Dependency parsing| |
|Topic, Topic modelling| |
|Keyword extraction| |
|Keyword in context (KWIC)| |
|Co-occurrence|When two words occur in the same context (corpus, paragraph, sentence, window). Depends on the kind of context, window size, etc.|
|Collocation|When a word is adjacent to another word, i.e. the two words are next to each other (a subset of co-occurring words)|
|Word embeddings| |
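
The difference between co-occurrence and collocation, as defined in the table, can be illustrated with a small counting sketch. It assumes a token window as the context and treats co-occurring pairs as unordered. The function names and the `window` parameter are made up for the example.

```python
from collections import Counter


def cooccurrences(tokens, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            pairs[tuple(sorted((left, right)))] += 1
    return pairs


def collocations(tokens):
    """Count adjacent word pairs only (word order preserved)."""
    return Counter(zip(tokens, tokens[1:]))


tokens = "new york is a big city and new york is busy".split()
pairs = cooccurrences(tokens, window=2)
adjacent = collocations(tokens)
```

With a window of 2, `new` and `is` co-occur even though they are never adjacent, so the pair shows up in `pairs` but not in `adjacent`.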

Visualization:

- Word cloud: word frequency (weight)
- Networks: collocation, co-occurrence
- Table of document term frequencies
- Table of corpus term frequencies

Correlations

Document view 
Document | Words | Unique words (Types) | Lemmas | Unique words (Types) / Words | Words / Sentences | Mean (Types/Tokens) per 1000 tokens chunks
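
Some of the columns in the document view can be sketched as follows. This is a minimal example assuming regex-based word and sentence splitting. Lemma counts and the per-1000-token mean are omitted because they need a lemmatizer and longer texts. The function name `document_view` and the dictionary keys are hypothetical.

```python
import re


def document_view(text):
    """Words, unique words (types), type/token ratio and words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    types = set(words)
    return {
        "words": len(words),
        "types": len(types),
        "type_token_ratio": len(types) / len(words) if words else 0.0,
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }


row = document_view("The cat sat. The dog sat too!")
```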
