### ST445 Managing and Visualizing Data

# Introduction to Data

### Week 1 Lecture, MT 2020 - Chengchun Shi

# What is Data?


 "Data is a set of values of subjects with respect to qualitative or quantitative variables." -- Wikipedia

-----------------------------



Data can be summarized in the form of:

* vector or matrix
* tensor (high-order matrix)
* image, or text

# Data vs Information

### Data

* raw, unorganized facts that need to be processed
* unusable until it is organized

### Information

* created when data is processed, organized, structured
* needs to be situated in an appropriate _context_ in order to become useful

<!--There are important differences in how humans and computers treat data as information-->



# Information Theory

The information content of a message depends on its probability:

$$ I(x)=-\log_2 p(x)$$

* Two independent events, with $p(x,y)=p(x)p(y)$, carry information $I(x,y)=I(x)+I(y)$, which is the sum of the information of the individual events.
    

* In transmitting a message modelled as a random variable $X$, the average amount of information received is:


$$ H[X]=-\sum_{x}p(x)\log_2 p(x)=\sum_{x}p(x)I(x) $$

* The quantity $H[X]$ is called the *entropy*.

* It measures the information in a single random variable.
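
The two formulas above are straightforward to evaluate numerically. The following is a minimal sketch, not code from the lecture: the helper names `information` and `entropy` and the two example coins are illustrative assumptions.

```python
import math

def information(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

def entropy(probs):
    """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
    # Outcomes with p = 0 are skipped, following the convention 0 * log 0 = 0.
    return sum(p * information(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

h_fair = entropy(fair_coin)      # exactly 1 bit
h_biased = entropy(biased_coin)  # less than 1 bit: the outcome is more predictable

# Additivity for independent events: I(x, y) = I(x) + I(y)
i_joint = information(0.5 * 0.25)
i_sum = information(0.5) + information(0.25)

print(h_fair, h_biased, i_joint, i_sum)
```

A fair coin carries the maximum of one bit per toss; any bias lowers the entropy.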

# Information Theory

* Joint entropy:

$$ H(X,Y)=-\sum_{x,y}p(x,y)\log_2 p(x,y) $$

* Conditional entropy:

$$ H(X|Y)=-\sum_{x,y}p(x,y)\log_2 p(x|y)=H(X,Y)-H(Y) $$

* Mutual information: 

$$ I(X;Y)=H(X)-H(X|Y)=H(X)+H(Y)-H(X,Y) $$

* Mutual information equals zero when $X$ and $Y$ are independent.

* Expect to see more on probabilistic models later in the course!
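
As a rough numerical check of these definitions, the sketch below computes the conditional entropy as $H(X,Y)-H(Y)$ and the mutual information as $H(X)$ minus that conditional entropy, for two small joint distributions. The function names and the two example tables are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """joint[i][j] = p(X = i, Y = j); rows index X, columns index Y."""
    p_x = [sum(row) for row in joint]            # marginal distribution of X
    p_y = [sum(col) for col in zip(*joint)]      # marginal distribution of Y
    h_xy = entropy([p for row in joint for p in row])
    h_x_given_y = h_xy - entropy(p_y)            # conditional entropy H(X|Y)
    return entropy(p_x) - h_x_given_y

independent = [[0.25, 0.25],
               [0.25, 0.25]]   # p(x, y) = p(x) p(y)
dependent = [[0.5, 0.0],
             [0.0, 0.5]]       # Y is an exact copy of X

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)

print(mi_independent, mi_dependent)
```

For the independent table the mutual information is zero; when $Y$ is a copy of $X$, observing $Y$ reveals the full bit of information in $X$.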

# Visualising Distributions 


This is not a statistics course, but powerful tools for visualising distributions can help you understand your data.

<img src="figs/DistributionHistogram.png" width="600">

See the code for this plot on page 281 of "Python for Data Analysis" by Wes McKinney.

# Simplest can be best...

use **matplotlib.pyplot.pie**

![Pie Chart](figs/pie_demo_features.png "Pie Chart")

Taken from https://matplotlib.org/gallery/index.html

# Exploring Data Visually

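A scatter-plot matrix combines scatter plots and histograms. The data is from the Boston Housing dataset, which shipped with scikit-learn at the time of this lecture (`load_boston` was removed in scikit-learn 1.2). Use **pandas.plotting.scatter_matrix**.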

<img src="figs/boston-housing-pairplot.png" width="500">

# As a data scientist

* Most of the time in data science (approximately 70%) is spent on cleaning and organising data


* Collecting data sets can also be time consuming


* Little time is spent refining algorithms


* The tools and techniques you will learn in the following two lectures on NumPy and Pandas are well adapted for data cleaning (and many other tasks).
    
--------------


The next slide shows the struggle to obtain good data.

# Common Data Quality Issues


* Missing data is a common problem. One common remedy is to build a simple model to estimate (impute) the missing values. Pandas has good tools for this.

    * Sequenced Treatment Alternatives to Relieve Depression (STAR\*D) data: 17% of observations are missing.
    
    * The schizophrenia study: over 50% of observations are missing.
    
    * The Nefazodone-CBASP clinical trial study: 5% of observations are missing.


* Duplicate data is another common problem. Again Pandas has tools for this.


* Incorrect values in the dataset. 


* In engineering, methods have been developed for correcting measurements from networks of sensors, some of which may have failed. This is often called _data validation and reconciliation_.
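
The first two issues can be illustrated with a few standard Pandas calls. This is a minimal sketch under stated assumptions: the toy records are invented (they are not taken from the studies listed above), and filling with the column mean stands in for the "simple model" as the crudest possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records, invented for illustration only.
df = pd.DataFrame({
    "patient": [1, 2, 2, 3, 4],
    "age": [34.0, 51.0, 51.0, np.nan, 45.0],
    "score": [12.0, 7.0, 7.0, 9.0, np.nan],
})

# Share of missing values in each column.
missing_fraction = df.isna().mean()

# Remove exact duplicate rows (the second record for patient 2).
deduplicated = df.drop_duplicates()

# Crudest possible imputation: replace each missing value by its column mean.
filled = deduplicated.fillna(deduplicated.mean())

print(missing_fraction)
print(filled)
```

In practice the imputation model should reflect why the values are missing; mean filling is only a starting point.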


# Missing data

<img src="figs/trump.jpg" width="600">

Taken from https://www.cnbc.com/2016/02/21/is-trump-vs-hillary-inevitable.html

<img src="figs/1400x-1.jpg" width="600">

Taken from https://www.bloomberg.com/features/2019-trump-or-biden-quotes-quiz/

# Changes in the world of data


* high-dimensional data ($p\gg n$), e.g., genetic data (dimension reduction, penalized regression, random projection)


* functional data, e.g., time series, images (functional principal component analysis, deep learning)



* big data/massive data (subsampling, divide and conquer, parallel computing)
    - volume of data in the modern world: 90% of the world's data was [generated in the last _two years_](https://www.sciencedaily.com/releases/2013/05/130522085217.htm)
    - and that was in 2013

# Examples of big data

<img src="figs/yahoo.JPG" width="500">


* Yahoo! Front Page Today Module User Click Log Dataset, version 1.0 (1.1 GB).


* Contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page during the first ten days of May 2009.



* a total of 45,811,883 user visits to the Today Module.

# Clever algorithms are very important...



* The Apollo landing relied on algorithmic developments such as the Kalman Filter to process noisy data from multiple sensors. 


* Big Data has been powered by algorithms such as Google's PageRank

# Examples of small data


* Sequenced Treatment Alternatives to Relieve Depression (STAR\*D) data: 383 obs.
    


* The schizophrenia study: 165 obs.
    


* The Nefazodone-CBASP clinical trial study: 681 obs. 



* ACTG 175 study: 2139 obs.



* Data from the International Warfarin Pharmacogenetics Consortium: 3848 obs.


# Basic units of data

* Bits
   - smallest unit of storage, a 0 or 1
   - with $n$ bits, can store $2^n$ patterns - so one byte can store 256 patterns


* Bytes
   - eight _bits_ = one _byte_
   - ASCII (American Standard Code for Information Interchange) represents characters as numbers, e.g. `A` is represented as 65
   
  ![ASCII](figs/ASCII.png)

### multi-byte units

| unit     | abbreviation | total bytes  | nearest decimal equivalent |
|:--------:|:------------:|-------------:|---------------------------:|
| kilobyte |     KB       | 1,024^1      |             1000^1         |
| megabyte |     MB       | 1,024^2      |             1000^2         |
| gigabyte |     GB       | 1,024^3      |             1000^3         |
| terabyte |     TB       | 1,024^4      |             1000^4         |
| petabyte |     PB       | 1,024^5      |             1000^5         |
| exabyte  |     EB       | 1,024^6      |             1000^6         |
| zettabyte|     ZB       | 1,024^7      |             1000^7         |
| yottabyte|     YB       | 1,024^8      |             1000^8         |

* this is why 1 GB, counted in binary units (1,024^3 = 1,073,741,824 bytes), is greater than 1 billion bytes
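
These facts about bits, bytes and ASCII can be checked directly in Python. The snippet is a small illustrative sketch; the variable names are arbitrary.

```python
# Number of distinct patterns that fit in one byte (8 bits).
n_patterns = 2 ** 8

# ASCII maps characters to integer codes and back.
code_of_a = ord("A")                   # 65
char_of_65 = chr(65)                   # 'A'
bits_of_a = format(code_of_a, "08b")   # the byte for 'A' written as 8 bits

# Binary versus decimal gigabyte.
gigabyte_binary = 1024 ** 3
gigabyte_decimal = 1000 ** 3
excess_bytes = gigabyte_binary - gigabyte_decimal

print(n_patterns, code_of_a, char_of_65, bits_of_a, excess_bytes)
```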

# Programming language popularity: TIOBE index


<img src="figs/PL.png" width="700"> 


Taken from https://towardsdatascience.com/visualize-programming-language-popularity-using-tiobeindexpy-f82c5a96400d

# Programming language popularity


<img src="figs/stackoverflow.png" width="600"> 


Taken from https://hackernoon.com/top-3-most-popular-programming-languages-in-2018-and-their-annual-salaries-51b4a7354e06

# Open Source Software


* Free computer software which the user can modify and distribute within the terms of a licence  


* https://www.python.org/download/releases/3.3.5/license/


* Collaborative development has created diverse and very powerful software ecosystems


* Both major data science languages, Python and R, are open source


* Python source files are saved with the `.py` extension. Each such file is called a module.


* Modular structure permits users to build an environment exactly suited to their needs.

### In Python Everything is an Object

* objects have _classes_, meaning they represent a "type" of object, for example *string* or *function*


* _attributes_ are data values attached to an object or to its class


* _methods_ are functions attached to an object or to its class
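
A short sketch of what this means in practice, using only built-in objects (the variable names are illustrative):

```python
word = "data"

word_class = type(word)          # the class of the object: str
shouted = word.upper()           # a method: a function attached to the object
real_part = (1 + 2j).real        # an attribute: a value stored on the object

# Functions are objects too: they have a class and can be bound to new names.
function_class = type(len)
measure = len
word_length = measure(word)

print(word_class, shouted, real_part, function_class, word_length)
```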

# Data types: Generically

* objects are _bound_ to an identifier, e.g.

In [7]:
temperature = 98.6
print(temperature)
print(id(temperature))

98.6
92603568


* here, `temperature` is a variable name assigned to the literal floating-point object with the value of 98.6
* in Python, this is an instance of the **float** class
* the function `id` returns the identity of an object, an integer that is unique to that object for as long as it exists

In [6]:
temperature1 = 98.6
print(id(temperature1))
print(temperature1 is temperature)
print(temperature1 == temperature)

92603616
False
True
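
Whether two equal float literals share an identity can depend on how the interpreter happens to create them, so the difference between `is` (same object) and `==` (same value) is easier to see with lists. A minimal sketch, with illustrative names `a`, `b` and `c`:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list object with the same contents
c = a           # a second name bound to the same object as `a`

same_value = (a == b)     # equality compares contents
same_object = (a is b)    # identity compares the objects themselves
alias = (c is a)          # both names refer to one object

c.append(4)               # a change made through `c` is visible through `a`

print(same_value, same_object, alias, a)
```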


* variable names in R and Python are _case-sensitive_
* some names are reserved by the language and cannot be used as variable names, e.g.
    ```Python
    False, True, None, or, and  # Python
    FALSE, TRUE, NA, NaN        # R
    ```

* All programming languages use comments, for humans to read
    - this is anything that follows the `#` character in both Python and R

> "Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."  -- Donald Knuth, _Literate Programming_ (1984)

* "immutable" objects cannot be subsequently changed
    
| **Python class** | **Immutable** | **Description**                   |  **R class** |
|:-----------------|:-------------:|:----------------------------------|:------------:|
| bool             |      Yes      | Boolean value                     |    logical   |
| int              |      Yes      | integer number                    |    integer   |
| float            |      Yes      | floating-point number             |    numeric   |
| list             |       No      | mutable sequence of objects       |     list     |
| tuple            |      Yes      | immutable sequence of objects     |       -      |
| str              |      Yes      | character string                  |   character  |
| set              |       No      | unordered set of distinct objects |       -      |
| NumPy array      |       No      | mutable array                     |       -      |
| dict             |       No      | dictionary                        | (named) list |
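
The practical difference between mutable and immutable classes can be sketched as follows; the variable names are illustrative.

```python
readings = [1, 2, 3]        # list: mutable
readings[0] = 99            # changes the existing object in place

fixed = (1, 2, 3)           # tuple: immutable
try:
    fixed[0] = 99           # item assignment is not supported for tuples
    tuple_error = None
except TypeError as err:
    tuple_error = err

name = "data"               # str: immutable
upper_name = name.upper()   # returns a new string; `name` itself is unchanged

print(readings, type(tuple_error).__name__, name, upper_name)
```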

### Indexing data

* index from 0 or from 1?

   - where an index begins counting, when addressing elements of a data object
   - [most languages index from 0](https://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28array%29#Array_system_cross-reference_list)
   - human ages - do they index from 0?

In [23]:
string_example = 'Hello World'
string_example[0:5]

'Hello'


* Python indexes from 0. Be warned!
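
A few more indexing cases on the same string, as a minimal sketch. It shows that slices are half-open (the end index is excluded) and that negative indices count from the end.

```python
string_example = "Hello World"

first = string_example[0]      # the first character has index 0
last = string_example[-1]      # negative indices count back from the end
head = string_example[0:5]     # characters 0 to 4; index 5 is excluded
tail = string_example[6:]      # from index 6 to the end
length = len(string_example)   # valid indices run from 0 to length - 1

print(first, last, head, tail, length)
```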


# git

* `git`: a version control system
* Keeps a complete history of changes and supports branching, staging areas, and flexible, distributed workflows
* simplified workflow (from [Anita Cheng's excellent blog post](http://anitacheng.com/git-for-non-developers))

   <img src="figs/git.jpg" width="400"> 

# GitHub

* a website and hosting platform for git repositories

* [GitHub classroom](https://classroom.github.com)
* Free stuff for students! https://education.github.com/pack


# More great resources for using git/GitHub

* [An easy git Cheatsheet](http://rogerdudler.github.io/git-guide/files/git_cheat_sheet.pdf), by Nina Jaeschke and Roger Dudler 
* [git - the simple guide](http://rogerdudler.github.io/git-guide/) by Roger Dudler

# git Example

### Fixing a broken Python Jupyter notebook

This Jupyter notebook needs debugging:

https://github.com/lse-st445/lectures/week01/DebugExercise.ipynb


### How to fix it:
* clone the repository
* edit the file
* stage the changes
* commit the changes
* issue a "pull request"

# Markdown (and other markup languages)

* Idea of a "markup" language: HTML, XML, LaTeX
* "Markdown"
    - Created by John Gruber as a simple way for non-programming types to write in an easy-to-read format that could be converted directly into HTML
    - No opening or closing tags
    - Plain text, and can be read when not rendered
* Markdown has [many "flavours"](https://github.com/commonmark/CommonMark/wiki/Markdown-Flavors)

# Markdown example

This is a markdown example.
* bullet list 1
* bullet list 2

> "[I love deadlines. I like the whooshing sound they make as they fly by.](https://www.brainyquote.com/quotes/quotes/d/douglasada134151.html?src=t_funny)"  
-- _Douglas Adams_

----
```
# Markdown example

This is a markdown example.
* bullet list 1
* bullet list 2

> "[I love deadlines. I like the whooshing sound they make as they fly by.](https://www.brainyquote.com/quotes/quotes/d/douglasada134151.html?src=t_funny)"  
-- _Douglas Adams_
```


A good reference for Markdown: https://ia.net/writer/support/general/markdown-guide/.

and also: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet 

# Upcoming

-------

* **Lab**: Working with Jupyter and GitHub
* **Next week**: Python and NumPy Data Structures