## Text Analysis Workshop

### Who am I?

- Software developer and architect
- Humlab, Umeå University since 2013
- 15+ years as a consultant in the telecom and utility business
- Not a statistician, not an expert in text analysis
- As a consultant, I tackle problems in a pragmatic, less theoretical way
- Software development has well-defined workflows and high demands on traceability, versioning and change management

### Workshop Web Page

> https://open-science.humlab.umu.se

```
Username: ws_ta_180925_01 ... ws_ta_180925_30
Password: humlab
```


### What is (Computational) Text Analysis (aka Text Mining)?

> *The use of computational methods and techniques to extract knowledge from text.* [Wikipedia](https://en.wikipedia.org/wiki/Text_mining)<br>
> The overarching goal is, essentially, to **turn text into data for analysis**, via application of natural language processing (NLP) and analytical methods.
> - Often discovery of previously **unknown structure** 
> - Often very **large amounts of text**
> - Often **unstructured text**
> - Automatic or semi-automatic **information retrieval (IR) process**
> - Often some form of **bird's-eye view** (distant reading) of the text
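
As a minimal sketch of what "turning text into data" can mean in practice, the snippet below counts word frequencies. The sample sentence is made up for illustration and the regular expression is a deliberately crude tokenizer, not a recommended one.

```python
import re
from collections import Counter

# A tiny, made-up example text (not workshop data).
text = "The cat sat on the mat. The dog sat on the log."

# Lowercase the text and pull out runs of letters as word tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Count how often each word occurs: the text is now a table of numbers.
counts = Counter(tokens)

for word, n in counts.most_common(3):
    print(word, n)
```

The resulting counts can be sorted, compared between texts or plotted, which is not possible with the raw string.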

### The Data Science Buzzword 

**Data science** is defined in Wikipedia as

> "...an interdisciplinary field of scientific methods, processes, algorithms and systems **to extract knowledge or insights from data** in various forms, either structured or unstructured..."<br>

> <img src="./images/data-science_new.svg" style="width: 80%; padding: 0; margin: 0;">

- Computational text analysis **is** data science

- Computers can _only_ compute numbers - everything else is an abstraction of numbers
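
To illustrate the point about numbers, here is a small sketch of how a word is stored: each character is a numeric code point, which in turn is encoded as bytes. The word is just an example.

```python
word = "Ume\u00e5"  # the word "Umeå", written with an escape for the last letter

# Each character corresponds to a Unicode code point (an integer).
code_points = [ord(ch) for ch in word]

# On disk or over the network, the text is a sequence of bytes (here UTF-8).
utf8_bytes = list(word.encode("utf-8"))

print(code_points)  # [85, 109, 101, 229]
print(utf8_bytes)   # [85, 109, 101, 195, 165]
```

Note that the four characters become five bytes, since the last letter needs two bytes in UTF-8.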


#### What are the Jupyter Project and Jupyter Notebook?

> Project [Jupyter](http://jupyter.org/) develops open-source software for **interactive and reproducible computing**.<br>
> The **open science movement** is a driving force for Jupyter's popularity.<br>
> In part a response to the **reproducibility crisis in science** and the **statistical crisis in science** (aka data dredging, p-hacking).<br>
> Jupyter Notebooks contain **executable code, equations, visualizations and narrative text**.<br>
> It is a **web application** (can run locally) with a simple and easy to use web interface.
> <img src="./images/narrative_new.svg" style="width: 300px;padding: 0; margin: 0;"><br>
> Jupyter supports a large number of programming languages (50+, e.g. Python, R, JavaScript).

The project is sponsored by large companies such as Google and Microsoft, and by funders such as the Alfred P. Sloan Foundation. See [jupyter.org/about](http://jupyter.org/about) for all sponsors.

#### Why use Jupyter Notebooks?

- **Easy to learn**, use and deploy, lots of people know **Python** and **R**.
- Ready to use **online platform** - trivial to create simple interactivity.
- Much faster (and cheaper) development: **immediate feedback**, **agile**, **collaborative**.
- Can **defer decisions** from developer to researcher. 
- The **ability to combine data, narrative and code** into an interactive user interface.
- Fits users with **different tech skills**; some researchers want to understand and be able to tune the logic.
- Supports **Python**, which has a big ecosystem of (open source) software libraries.
- Very popular, millions of notebooks exist on GitHub.

#### Brief Instructions on How to Use Notebooks
- **Menu Help -> User Interface Tour** gives an overview of the user interface.
- **Code cells** contain the script code and have **In [x]** in the left margin.
  - **In []** indicates that the code cell hasn't been executed yet.
  - **In [n]** indicates that the code has been executed (n is an integer).
  - **In [\*]** indicates that the code is executing, or waiting to be executed (i.e. other cells are executing).
- **The current cell** is highlighted with a blue border - you make it current by clicking on it.
- **SHIFT+ENTER** or **Play button** executes the current cell. Code cells aren't executed automatically.
- **Out[n]** indicates the output (or result) of a cell's execution and is directly below the executed cell.
- **SHIFT+ENTER** automatically selects the next code cell.
- **SHIFT+ENTER** can hence be used repeatedly to execute the code cells in sequence.
- **Menu Cell -> Run All** executes the entire notebook in a single step (can take some time to finish, notice how "In [\*]" indicators change to "In [n]" ).
- **Double-Click** on a cell to edit its content.
- **ESC key** leaves edit mode (or just click on any other cell).
- **Kernel -> Restart** restarts the server-side kernel (use if the notebook seems stuck).
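
As a small illustration of the indicators described above, the following could be typed into an empty code cell and run with SHIFT+ENTER. The sentence is made up; any Python code would do.

```python
sentence = "Jupyter notebooks combine code, text and results"
words = sentence.split()

print(words[0])  # printed text appears below the cell without an Out[n] label
len(words)       # the value of the last expression is shown as Out[n]
```

While the cell runs, the margin shows **In [\*]**. Once it has finished, the margin shows **In [n]** and the word count appears next to **Out[n]**.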


#### Methods and Technologies
<img src="./images/concept_tools_new.png" style="width: 800px; padding: 0; margin: 0;">

- Tremendous "**technological noise**"
- Most time is spent on learning tools, methods and frameworks (transient knowledge).
- **Black-box use is unavoidable** - emphasizes the need for **a proper validation process**.

#### Risks and Challenges
- The risk of using tools and methods **without fully understanding** them
- The risk of using tools and methods **for non-intended purposes or in new contexts**
- How to verify **performance** (correctness of result)
- Risk of **data dredging**, p-hacking, "the statistical crisis".
- The risk that the **engineer makes micro-decisions** the researcher doesn't know about
- The risk of **reading too much into visualizations** (networks, layouts, clusters).
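
The data dredging risk can be made concrete with a small simulation. This is a sketch under simple assumptions: fair coins stand in for hypotheses where there is no real effect, and the helper `two_sided_p` is a hypothetical name for an exact binomial test written for this example. Even though every coin is fair, testing many of them yields some "significant" results by chance alone.

```python
import random
from math import comb

def two_sided_p(heads: int, flips: int) -> float:
    """Exact probability, for a fair coin, of a result at least as far
    from flips/2 as the observed number of heads."""
    deviation = abs(heads - flips / 2)
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - flips / 2) >= deviation)
    return extreme / 2 ** flips

random.seed(1)  # fixed seed so the run is repeatable
n_tests, flips = 1000, 100

false_positives = 0
for _ in range(n_tests):
    # Flip a fair coin `flips` times and count the heads.
    heads = sum(random.random() < 0.5 for _ in range(flips))
    if two_sided_p(heads, flips) < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} fair coins look 'biased' at p < 0.05")
```

Reporting only the coins that pass the threshold, without mentioning how many were tested, is the essence of p-hacking.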

### Challenges
- **What’s easy for humans can be extremely hard for computers**
- **Human-in-the-loop or supervised learning can be very expensive**
- Ambiguity and fuzziness of terms and phrases
- Poor data quality, errors in data, wrong data, missing data, ambiguous data
- Context, metadata, domain-specific data
- Data size (too much, too little)
- Computational methods require a structured internal representation
- Internal models are simplified views of the data
- etc...
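
The point that internal models are simplified views can be sketched with a bag-of-words representation, one common (but far from the only) way to structure text. The function name `bag_of_words` and the two sentences are chosen for this illustration.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Represent a text as word counts, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Two sentences with opposite meanings get identical representations.
print(a == b)  # True
```

What is obvious to a human reader (who bites whom) is lost in this representation, so any analysis built on it cannot recover that information.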

### A sample high-level workflow

<img src="./images/text_analysis_workflow.svg" alt="" width="1200"/>

### Text Analysis Sample Flow
> <img src="./images/text-analysis_sample_tasks.svg" style="width: 75%;padding: 0; margin: 0;">
