## Text Analysis Workshop

### Who am I (and what is Humlab)?

- Software developer and architect (computer science)
- Mostly worked in private companies developing business systems
- I'm not a statistician, nor am I an expert in text analysis
- As a consultant, I tackle problems in a pragmatic, less theoretical way
- Software development has well-defined workflows and high demands on traceability, versioning and change management, as do private businesses.


### Text Analysis is Data Science

**Data science** is defined in Wikipedia as

> "*...an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured...*"

- Data science is a big “buzzword” outside academia, with an estimated 1.5 million positions in the US alone. It has been the hottest job in the US for the last couple of years.
- Data science lies at the intersection of several fields.

> <img src="./images/data-science_new.svg" style="width: 80%; padding: 0; margin: 0;">
- Costly to acquire and keep these skills...
- Skills and knowledge are built into ready-to-use software tools and libraries

**But it is risky to use these kinds of libraries as black boxes.**

Text analysis **is** data science.


#### Open Science as the New Normal

An important driving force behind the increased use of Jupyter Notebooks is the current "open science" movement. This is in part caused by the so-called *reproducibility crisis* and the *statistical crisis* (aka data dredging) in science. See:
- [Presentation by Dorothy Bishop](https://www.slideshare.net/deevybishop/what-is-the-reproducibility-crisis-in-science-and-what-can-we-do-about-it)
- [Replication Crisis](https://en.wikipedia.org/wiki/Replication_crisis)
- [An EU initiative for open science e-Learning](https://www.fosteropenscience.eu/)
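- Simmons, J., L. Nelson, and U. Simonsohn. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. *Psychological Science* 22(11).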
- [An article on the statistical crisis](https://www.americanscientist.org/article/the-statistical-crisis-in-science)


#### About The Jupyter Project
The project's website at [jupyter.org](http://jupyter.org/) states that:

> Project Jupyter exists to develop open-source software, open standards, and services for **interactive and reproducible computing**
>
> The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text
 
><img src="./images/narrative_new.svg" style="width: 300px;padding: 0; margin: 0;">
> Fig. Computational narratives as the engine of collaborative data science

The **open science movement** is a driving force.

The project is sponsored by large companies such as Google and Microsoft, and by funders such as the Alfred P. Sloan Foundation. See [jupyter.org/about](http://jupyter.org/about) for all sponsors.


#### Jupyter Notebooks

- Jupyter Notebook is a **web application**
- Can be run on a local computer, or hosted in a multi-user environment.
- A computational environment in your web browser.
- A large number of programming languages are supported.

> <img src="./images/jupyter_stack.svg" style="width: 180px;">
> <i>**Fig**. Jupyter server</i>

#### Why use Jupyter Notebooks?

- Easy to learn, use and deploy; lots of people know Python.
- Ready-to-use online platform; trivial to create simple interactivity.
- Much faster (and cheaper) development - immediate feedback - agile, collaborative.
- Can defer decisions from developer to researcher. 
- The ability to combine data, narrative and code into an interactive user interface.
- Fits users with different tech skills; some researchers want to understand and be able to tune the logic.
- Supports Python, which has a large ecosystem of (open source) software libraries.
- Very popular; millions of notebooks exist on GitHub.

#### Brief Instructions on How to Use Notebooks
A notebook is a document with embedded executable code, presented in a simple and easy-to-use web interface. The most important things to note are:

- Click on the menu Help -> User Interface Tour for an overview of the Jupyter Notebook App user interface.
- The **code cells** contain the script code (Python in this case, but other languages are also supported) and are the sections marked by **In [x]** in the left margin. A cell is marked as **In []** if it hasn't been executed, and as **In [n]** when it has been executed (n is an integer). A cell marked as **In [\*]** is either executing or waiting to be executed (i.e. other cells are executing).
- The **current cell** is highlighted with a blue border (or green if in "edit" mode). You make a cell current by clicking on it.
- Code cells aren't executed automatically. Instead you execute the current cell by either pressing **shift+enter** or the **play** button in the toolbar. The output (or result) of a cell's execution is presented directly below the cell, prefixed by **Out[n]**.
- The next cell will automatically be selected (made current) after a cell has been executed. Repeatedly pressing **shift+enter** or the play button hence executes the cells in sequence.
- You can run the entire notebook in a single step by clicking on the menu Cell -> Run All. Note that this can take some time to finish. You can see how cells are executed in sequence via the indicator in the margin (i.e. "In [\*]" changes to "In [n]" where n is an integer).
- The cells can be edited if they are double-clicked, in which case the cell border turns green. Use the ESC key to escape edit mode (or click on any other cell).

To restart the kernel (i.e. the computational engine assigned to your session), click on the menu Kernel -> Restart. 
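
As a first cell to try, the following is a minimal sketch of what a code cell might contain. The sample text is made up for this illustration and is not part of the workshop data. Paste it into an empty code cell and press **shift+enter**; the value of the last expression appears below the cell as its output.

```python
from collections import Counter

# A hypothetical sample text, made up for this illustration.
text = "the cat sat on the mat and the dog sat on the log"

# Split on whitespace and count how often each word occurs.
word_counts = Counter(text.split())

# In a notebook, the value of the last expression in a cell
# is displayed as the cell's output (Out[n]).
word_counts.most_common(3)
```

Editing the text and re-running the cell updates the output, and the counter in the margin increases each time the cell is executed.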


#### Methods and Technologies

- The figures below show some of the components used in a topic modelling project workflow.
- The "stack" spans a large variety of domains and fields.
- Each **single** method or technology (i.e. word in the clouds) can have a rather steep learning curve.
- Many methods, tools and frameworks to choose from.
- Lots of overlapping/similar functionality.
- Each component needs to be configured for proper (or intended) use.
- This requires both specific knowledge and usage experience in order to safely apply the methods and tools to the problem at hand.

- Tremendous "technological noise": as a developer you spend the vast majority of your time learning tools, methods and frameworks.

> **Concepts**
> <img src="./images/wordcloud_concepts.png" style="width: 50%; padding: 0; margin: 0;">

> **Methods**
> <img src="./images/wordcloud_methods.png" style="width: 50%; padding: 0; margin: 0;">

> **Tools**
> <img src="./images/wordcloud_tools.png" style="width: 50%; padding: 0; margin: 0;">

> **Frameworks**
> <img src="./images/wordcloud_frameworks.png" style="width: 50%; padding: 0; margin: 0;">

- And then there is the actual *problem domain*, knowledge of which is required to interpret and validate the results of the topic models.

This chain of methods and tools unavoidably requires a multitude of both major and minor decisions. It can even be difficult to determine whether a decision is minor or major. To some extent, black-box use is unavoidable, which emphasizes the need for a proper validation process. The Jupyter Notebook helps transfer some of these decisions from the tech specialist to the end user, i.e. the researcher. It is a great advantage, and a hallmark of the Python ecosystem, that so many proven and battle-tested open source frameworks and tools exist.
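
To illustrate how a decision can be handed over to the researcher, here is a minimal sketch in which two small preprocessing choices (lowercasing and stop-word removal) are exposed as explicit parameters instead of being hard-coded. The function `top_words`, the stop-word list and the sample text are hypothetical and made up for this example; they are not part of any workshop notebook.

```python
from collections import Counter

# Hypothetical stop-word list, made up for this illustration.
STOPWORDS = {"the", "a", "of", "and", "is"}


def top_words(text, lowercase=True, remove_stopwords=True, n=3):
    """Return the n most frequent tokens under the given preprocessing choices."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        # Compare in lowercase so "The" is also treated as a stop word.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return Counter(tokens).most_common(n)


# Hypothetical sample text.
text = "The model of the text is a model of The Text and the text is data"

# Same text, two different sets of "minor" decisions.
print(top_words(text))
print(top_words(text, lowercase=False, remove_stopwords=False))
```

With the defaults, "text" ranks first with three occurrences. With both options switched off, no token occurs more than twice and the top of the list is dominated by function words. Neither setting is right in general; the point is that the choice is visible and can be changed by the person interpreting the result.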


#### Risks and Challenges
- The risk of using tools and methods without fully understanding them
- The risk of using tools and methods for non-intended purposes or in new contexts
- How to verify performance (correctness of results)
- The risk of data dredging, p-hacking, "the statistical crisis"
- The risk that the engineer makes micro-decisions the researcher doesn't know about
- The risk of reading too much into visualizations (networks, layouts, clusters)
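
The data dredging risk can be made concrete with a small simulation. The sketch below runs many significance tests on pure random noise, where there is nothing to find, and counts how many come out "significant" at the 5% level. The number of tests, the sample size and the choice of a simple two-sided z-test are assumptions made for this illustration.

```python
import math
import random

random.seed(42)  # fixed seed so the run is repeatable

N_TESTS = 1000     # number of hypotheses "tried" (an assumption for this demo)
SAMPLE_SIZE = 30   # observations per test (also an assumption)
Z_CRITICAL = 1.96  # two-sided 5% threshold for a standard normal z-score

false_positives = 0
for _ in range(N_TESTS):
    # Pure noise: the true mean is exactly zero, so there is nothing to find.
    sample = [random.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
    mean = sum(sample) / SAMPLE_SIZE
    # z-test with known standard deviation 1: z = mean / (1 / sqrt(n)).
    z = mean * math.sqrt(SAMPLE_SIZE)
    if abs(z) > Z_CRITICAL:
        false_positives += 1

print(f"{false_positives} of {N_TESTS} tests look 'significant' on pure noise")
```

Roughly 5% of the tests are expected to pass the threshold by chance alone. If enough variations of an analysis are tried and only the "significant" ones are reported, some findings will appear even when the data contains no real effect.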