# PSTAT 134/234 - Statistical Data Science <a class="tocSkip">



# What is Data Science?
![DataScience](images/DS.JPG)

## (Classical) Scientific Method

* Scientific method used since ~1200s and formalized in 1500s

* Roughly, the scientific method proceeds through the following steps [[reference](https://en.wikipedia.org/wiki/Scientific_method)]:

    * Question
    * Hypothesis
    * Prediction
    * Testing
    * Analysis



## New Paradigm of Science

**Fourth Paradigm: Data-intensive scientific discovery** (Jim Gray): everything about science is changing

* Scope of Research has broadened: **Basic science vs. Business insight**

* "Some models are useful": **Precise vs. Approximate**

* What is more important? **Understanding vs. Prediction**

## Data Science is ...

* A multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data [reference](https://doi.org/10.1145/2500499)

* Merger of statistics, data analysis, machine learning and their related methods in order to understand and analyze actual phenomena with data [reference](https://doi.org/10.1007/978-4-431-65950-1_3)

* Composed of techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science. [reference](https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/)

## Beginnings of Data Science

![Cleveland](images/data-science-cleveland.png)
[[International Statistical Review](http://doi.org/10.2307/1403527)]

The article proposed a data science program to train a new generation of data analysts:

* **Domain expertise**: data analysis collaborations in subject matter areas.

* **Mathematics/Statistics**: models, estimation, and distribution based on probabilistic inference.

* **Computing**: hardware and software; computational algorithms
    
* **Theory**: foundations of data science; mathematical investigations of models and methods
    


## Data Scientific Approach

![DataScienceLifeCycle](images/DataScienceLifeCycle.jpg)  
[[UC Berkeley, School of Information](https://datascience.berkeley.edu/about/what-is-data-science/)]

# Statistics and Data Science

- Focus: Traditional statistical methods often tackle problems under a strict set of assumptions (probabilistic models)

- Methodology: Data science employs a broader range of methods, including statistical techniques, machine learning algorithms, and big data technologies

- Approach: Data science often iterates to make improvements and develops processes (often custom built)

# Computational Environment

<table style="width: 100%;">
    <tr>
    <td style="width: 50%;"> <img src="images/jupyternotebook.png" alt="Drawing" style="width: 100%;"/> </td>
    <td style="width: 50%;"> <img src="images/jupyter-lab.png" alt="Drawing" style="width: 100%;"/> </td>
    </tr>
    <tr>
    <td style="text-align: center; font-weight: bold;">Jupyter Notebook</td>
    <td style="text-align: center; font-weight: bold;">Jupyter Lab</td>
    </tr>
</table>

## Jupyter Notebooks vs. Jupyter Lab

- Jupyter cluster is here: https://pstat-134-234.lsit.ucsb.edu

- If your username is `[UCSB NetID]`, the two interfaces are at:

- Jupyter Notebook: `https://pstat-134-234.lsit.ucsb.edu/user/[UCSB NetID]/tree`  

- Jupyter Lab (default): `https://pstat-134-234.lsit.ucsb.edu/user/[UCSB NetID]/lab`   

- **Jupyter Notebook + Additional Features = Jupyter Lab**

## Jupyter Notebook

- Notebook environment: code, formatted text, and graphics

- Conversion to other formats: e.g., Markdown, LaTeX, PDF, HTML, etc.

- Interactivity: HTML widgets

- [Jupyter notebook basics](https://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb)

![Jupyter Notebook](images/jupyternotebook.png)

Jupyter Notebook runs code through [Jupyter kernels](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels) (Python, shell, R, Julia, others)  
    
![Jupyter Diagram](images/jupyter-diagram.jpg)


![Jupyter Diagram](images/jupyterhub-notebook.png)

- Read [Jupyter notebook basics](https://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb)

### Writing formatted text

- Text formatting with [markdown syntax](https://guides.github.com/features/mastering-markdown/): e.g., variations like [Github Flavored Markdown (GFM)](https://guides.github.com/features/mastering-markdown/#GitHub-flavored-markdown)

- Math equations (LaTeX): e.g., `$$ \hat \mu = \frac{1}{n}\sum_{i=1}^n x_i $$`
    $$ \hat \mu = \frac{1}{n}\sum_{i=1}^n x_i,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2 $$
    [Mathpix](https://mathpix.com/)

- Images: e.g. `![alt-text](some-image.jpg)`  
    ![Not Lazy](images/im-not-lazy-im-just-in-energy-saving-mode-sleepy-cat.jpg)   
    _Note: Markdown can display an image given its URL; however, doing so doesn't work with PDF generation_  
    `![Not Lazy](http://thoughtfulcampaigner.org/wp-content/uploads/2017/10/im-not-lazy-im-just-in-energy-saving-mode-sleepy-cat.jpg)`  
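
The two estimators typeset above can be checked numerically. The following is a minimal sketch in plain Python; the function name `mean_and_variance` and the data values are made up for illustration and are not part of the course material.

```python
def mean_and_variance(xs):
    """Return (mu_hat, sigma2_hat) using the 1/n formulas typeset above."""
    n = len(xs)
    mu_hat = sum(xs) / n                                  # (1/n) * sum of x_i
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # (1/n) * sum of squared deviations
    return mu_hat, sigma2_hat


# Made-up illustrative data (an assumption, not from the lecture)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma2_hat = mean_and_variance(data)
print(mu_hat, sigma2_hat)  # 5.0 4.0
```

Note that this variance divides by `n`, matching the formula above; the standard library's `statistics.pvariance` uses the same convention, while `statistics.variance` divides by `n - 1`.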

### More than Python

- Python interpreter

- IPython is short for interactive Python with additional functionality  
    e.g., tab completion, syntax highlighting, magic, etc. (Chapter 2 in McKinney)  

- IPython's "[line, cell] magic" commands  
    e.g., debugging, code timing, shell commands, execute external script, etc.

### Line magic: `%`

- [Demo] Debugging example: `%debug`, etc.

- [`xmode`](https://jakevdp.github.io/PythonDataScienceHandbook/01.06-errors-and-debugging.html#Controlling-Exceptions:-%xmode): Exception handler mode

- More detail on debugging: [Vanderplas - Chapter 1](https://jakevdp.github.io/PythonDataScienceHandbook/01.06-errors-and-debugging.html)

#### Example: Debugging

In [3]:
# Set exception reporting to verbose. Swap which %xmode line is commented out and rerun func2(1) to compare
#%xmode Plain
%xmode Verbose

from IPython.core.debugger import set_trace

def func1(a, b):
    return a / b

def func2(x):
    
    a = x
    # set_trace()
    b = x - 1
    return func1(a, b)

# Debugger commands: https://docs.python.org/3/library/pdb.html#debugger-commands
# (press h at the debugger prompt for help)
# The call below raises ZeroDivisionError because b = x - 1 = 0
func2(1) 

Exception reporting mode: Verbose


ZeroDivisionError: division by zero

In [4]:
# After an exception occurs, calling %debug
# starts the debugger at the last error
# (comment out the next line to skip the demo)
%debug

# enter the following at the ipdb prompt
#print(a)
#print(b)
#exit()

> c:\users\lnbar\appdata\local\temp\ipykernel_55008\3771726007.py(8)func1()



ipdb>  print(a)


1


ipdb>  print(b)


0


ipdb>  exit()
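
The session above inspects `a` and `b` in the frame where the error occurred. As a rough, non-interactive sketch of the same idea in plain Python (no IPython required), the traceback can be walked to the failing frame and its local variables read directly. The helper name `innermost_locals` is hypothetical, and this is not how `%debug` itself is implemented.

```python
import sys


def func1(a, b):
    return a / b


def func2(x):
    a = x
    b = x - 1
    return func1(a, b)


def innermost_locals(tb):
    """Walk a traceback to its last frame and return a copy of that frame's locals."""
    while tb.tb_next is not None:
        tb = tb.tb_next
    return dict(tb.tb_frame.f_locals)


try:
    func2(1)
except ZeroDivisionError:
    # sys.exc_info() returns (type, value, traceback) for the active exception
    _, _, tb = sys.exc_info()
    failing_locals = innermost_locals(tb)

print(failing_locals)  # {'a': 1, 'b': 0}
```

The innermost frame is `func1`, so the values match what `print(a)` and `print(b)` show at the `ipdb>` prompt.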


#### Other magic commands

- [many more](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)

In [22]:
# shift-tab for documentation
%lsmagic

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cd  %clear  %cls  %colors  %conda  %config  %connect_info  %copy  %ddir  %debug  %dhist  %dirs  %doctest_mode  %echo  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %macro  %magic  %matplotlib  %mkdir  %more  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %ren  %rep  %rerun  %reset  %reset_selective  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%cmd  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python 

### Cell magic: `%%`

- A cell magic applies to the entire cell: the whole cell body is passed to the magic command instead of being run as ordinary Python

#### Example: Measuring running time

In [4]:
%%timeit -n500 -r10
total = []
for i in range(1000):
    total += [i]

84.1 µs ± 9.85 µs per loop (mean ± std. dev. of 10 runs, 500 loops each)


- `%%timeit`: Cell magic that times the code in the rest of the cell.
- `-n500`: Specifies the number of loops per timing measurement. In this case, the loop (the code inside the cell) will be executed 500 times per timing measurement.
- `-r10`: Specifies the number of repetitions. The entire timing measurement process will be repeated 10 times to get more accurate and consistent results.
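
Outside a notebook, the same loops-and-repetitions structure can be approximated with the standard library's `timeit.repeat`. This is a sketch under assumptions: the helper name `time_per_loop` is hypothetical, and the summary statistics are computed here by hand rather than taken from IPython.

```python
import statistics
import timeit


def build_with_loop():
    total = []
    for i in range(1000):
        total += [i]
    return total


def build_with_comprehension():
    return [i for i in range(1000)]


def time_per_loop(func, number=500, repeat=10):
    """Time `func`: `repeat` measurements of `number` calls each (like -r and -n).

    Returns (mean, standard deviation) of seconds per single call.
    """
    totals = timeit.repeat(func, number=number, repeat=repeat)
    per_loop = [t / number for t in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


for func in (build_with_loop, build_with_comprehension):
    mean_s, sd_s = time_per_loop(func)
    print(f"{func.__name__}: {mean_s * 1e6:.1f} µs ± {sd_s * 1e6:.2f} µs per loop")
```

Timings vary by machine and run, so only the relative comparison between the two functions is meaningful.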

In [5]:
%%timeit -n500 -r10
# list comprehension version of the loop above
total = [i for i in range(1000)]

37.6 µs ± 7.53 µs per loop (mean ± std. dev. of 10 runs, 500 loops each)


## Command-Line Interface

---

* [Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)): text (command-line) interface to operating system (OS)

* The OS (e.g., Windows, macOS) handles file operations, network interfacing, etc.

* From https://pstat-134-234.lsit.ucsb.edu/, choose **Terminal** from the **Launcher** menu. Clicking the big blue + button opens a new launcher pane.

* Run system operations using text commands

    ```bash
    echo "######## hello" > somefile.txt ## prints string into file 'somefile.txt'
    ls -alh                              ## list files in directory
    cat somefile.txt                     ## print contents of 'somefile.txt'
    rm somefile.txt                      ## remove 'somefile.txt'
    ``` 

* To decode an unfamiliar command, use [Explain Shell](https://explainshell.com/) or an AI assistant.

* **Demo**: Use `git` command to clone a repository from Github  
    _Note: self-learning git is recommended but not required for the course_ [[Software carpentry lesson: git](https://swcarpentry.github.io/git-novice/)]
    
* More on command line later    
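
For comparison, the four shell commands listed above (`echo ... >`, `ls`, `cat`, `rm`) have close counterparts in Python's standard `pathlib` module. The sketch below works in a temporary directory so it leaves no files behind; the function name `shell_demo` is hypothetical.

```python
import tempfile
from pathlib import Path


def shell_demo(workdir):
    """Mirror the echo / ls / cat / rm sequence with pathlib calls."""
    target = Path(workdir) / "somefile.txt"
    target.write_text("######## hello\n")                        # echo "..." > somefile.txt
    listing = sorted(p.name for p in Path(workdir).iterdir())    # ls
    contents = target.read_text()                                # cat somefile.txt
    target.unlink()                                              # rm somefile.txt
    return listing, contents, target.exists()


with tempfile.TemporaryDirectory() as tmp:
    listing, contents, still_there = shell_demo(tmp)

print(listing)        # ['somefile.txt']
print(repr(contents)) # '######## hello\n'
print(still_there)    # False
```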

### Run CLI from Jupyter Notebook

### Example: Run bash commands: `%%bash`

Bash is a shell and scripting language. Bash code can run alongside Python in a Jupyter notebook. 

In [6]:
%%bash

echo "######## hello" > somefile.txt
cat somefile.txt
echo "######## did you see a hello?"
echo "######## listing files"
rm somefile.txt
ls -alh

######## hello
######## did you see a hello?
######## listing files
total 680K
drwxr-xr-x 1 lnbar lnbar    0 Apr  1 10:34 .
drwxr-xr-x 1 lnbar lnbar    0 Mar 31 22:20 ..
drwxr-xr-x 1 lnbar lnbar    0 Apr  1 00:30 .ipynb_checkpoints
-rw-r--r-- 1 lnbar lnbar 7.9K Mar 31 23:08 00-Syllabus.ipynb
-rw-r--r-- 1 lnbar lnbar  29K Apr  1 10:28 01-Statistical-Data-Science.ipynb
-rw-r--r-- 1 lnbar lnbar 607K Apr  1 08:30 01-Statistical-Data-Science.slides.html
drwxr-xr-x 1 lnbar lnbar    0 Mar 31 22:22 data
drwxr-xr-x 1 lnbar lnbar    0 Mar 31 23:10 images


### Example: Jupyter Notebook and Bash

- [Who is Jovyan?](https://github.com/jupyter/docker-stacks/issues/358)
- ["In science fiction, a Jovian is an inhabitant of the planet Jupiter."](https://en.wikipedia.org/wiki/Jovian_%28fiction%29)

Shell output can be saved into a Python variable

In [39]:
# store the notebook filenames in the nbfiles variable
# (the rest of a `!` line, including any trailing comment, is handed to the shell,
#  so keep comments on their own line)
nbfiles = !ls *.ipynb

for i, name in enumerate(nbfiles):
    print("file", i, ":", name)


file 0 : 00-Syllabus.ipynb
file 1 : 01-Statistical-Data-Science.ipynb
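
The `!command` syntax only works inside IPython or Jupyter. In a plain Python script, a similar effect (command output captured as a list of lines) can be sketched with the standard `subprocess` module. The helper name `capture_lines` is hypothetical, and a child Python process stands in for the shell command so that the example runs on any OS.

```python
import subprocess
import sys


def capture_lines(args):
    """Run a command and return its standard output as a list of lines."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# Stand-in command (an assumption for portability): a child Python process that prints two lines
lines = capture_lines([sys.executable, "-c", "print('hello'); print('world')"])

for i, line in enumerate(lines):
    print("line", i, ":", line)
```

Passing the command as a list of arguments avoids shell parsing altogether, which also sidesteps surprises such as a trailing comment being treated as extra arguments.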


### Example: Exporting notebooks

- Notebooks can be converted using [`nbconvert`](https://nbconvert.readthedocs.io/en/latest/)
    - Slides
    - Static web pages  
        e.g. auto-updating reporting, blogs, etc

- Shell command for automated execution  
    `jupyter nbconvert --to html --execute mynotebook.ipynb`
  

- Notebook conversion depends on other software (e.g., PDF export via LaTeX requires Pandoc and a TeX distribution).

In [37]:
  # In bash: jupyter nbconvert --to html --execute 00-Syllabus.ipynb
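
To automate conversion of several notebooks, the command line can be assembled in Python and then handed to `subprocess.run`. The sketch below only builds and prints the commands, so it runs even where `nbconvert` is not installed. The helper name `nbconvert_command` is hypothetical, and only the `--to` and `--execute` flags shown above are used.

```python
import shlex


def nbconvert_command(notebook, to="html", execute=True):
    """Build the argument list for a `jupyter nbconvert` call (does not run it)."""
    args = ["jupyter", "nbconvert", "--to", to]
    if execute:
        args.append("--execute")
    args.append(notebook)
    return args


for nb in ["00-Syllabus.ipynb", "01-Statistical-Data-Science.ipynb"]:
    # shlex.join renders the argument list as a single shell-style command string
    print(shlex.join(nbconvert_command(nb)))
```

To actually run a conversion, the list returned by `nbconvert_command` could be passed to `subprocess.run(..., check=True)` in an environment where Jupyter is installed.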