# PSTAT 134/234 - Statistical Data Science

---

## Instructor: Sang-Yun Oh

- Lectures: TR 11 am - 12:15 pm

- Office: South Hall 5514

# Course Information 

---

## Grading

* Attendance in lectures and sections is required (10%)  
    A total of five absences will be dropped. No exceptions.

* Individual in-class midterm (25%)

* Individual assignments (35%)

* Group final project & poster session (30%)


## Textbooks

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake Vanderplas

- Other resources as needed


# Tentative course outline

---

* **Week 1** (4/1-4/5): Data and uncertainty   
    - Computing: Jupyter notebook and Python primer
    - Reading: [Chapter 1 (skim)-2](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#1.-IPython:-Beyond-Normal-Python) in Vanderplas
    
* **Week 2** (4/8-4/12): Data scraping, transformation, and wrangling
    - Computing: Shell commands and Pandas
    - Reading: [Chapter 3](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#3.-Data-Manipulation-with-Pandas) in Vanderplas  
        [The Unix Shell](http://swcarpentry.github.io/shell-novice/) by Software Carpentry  
        
* **Week 3-4** (4/15-4/26): Visualization and exploratory analysis
    - Computing: Matplotlib and Scikit-learn
    - Reading: [Chapter 4 (skim) - 5](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#4.-Visualization-with-Matplotlib)

* **In-class midterm** (4/30)

* **Week 5-6** (5/1-5/10): Finance data module

* **Week 7-8** (5/13-5/23): Health data module
               
* **Week 9** (5/27-5/31): Text data module

* **Week 10** (6/3-6/7): TBD

* **Final Projects** (6/14): Final project poster session

# Data Science

---

- A Forbes magazine editorial on [History of Data Science](https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#7131af3355cf)  

- Isolated statistical/machine learning algorithms are not enough in real usage scenarios 

- Real-world challenges are much broader

## Data challenges

- Data collection (text files, web pages, pdf files, APIs, ...)

- Data storage (disk space, redundancy, data structure, databases, security, privacy, ...)

- Data access (ease of access, cloud vs. local, security, ...)

- Data uniformity and consistency (cleaning, wrangling, entity resolution, ...)

## Analysis challenges

- Not all data are useful: e.g., "signal to noise", curse of dimensionality

- Critical thinking for formulating questions  
    e.g., prediction of medical diagnosis based on predictors

- Derived features may be redundant and lead to unstable results 

- Effects of outliers on algorithms

- Getting a “feel” for the data (visualization and summary statistics)

- Setting a direction

- Getting a simple but complete analysis done is better than doing one part “perfectly”
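
As a small illustration of the outlier point above, the sketch below compares the mean and median of a sample before and after one extreme value is added. The numbers are hypothetical, made up only to show the effect: the mean moves a lot, while the median barely changes.

```python
import statistics

# Hypothetical measurements, made up for illustration
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
with_outlier = clean + [100.0]  # one extreme value added

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

# The mean is pulled toward the outlier; the median is much more stable
print("mean:  ", mean_clean, "->", mean_outlier)
print("median:", median_clean, "->", median_outlier)
```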

## Interpretation challenges

- What results do you obtain from your analysis?

- Do you believe it? What can you conclude? (analysis outcome vs. conclusion)

- Refining your analysis

## Learn by doing

- Critical thinking is crucial

- Substantial amount of programming is needed; however,

- This is not a programming course

- Many software tools will be new 
    e.g., Python, command line tools, etc

- Proactive attitude is a must!  
    e.g., asking questions, discussing, experimenting, RTM (read-the-manual)

- Diverse backgrounds mean you will have different strengths!  
    Help each other, and assess your own areas of improvement

- You don't have to be an expert at everything

- But you have to be willing to dig deeper on your own


# Computational Environment

---

## Github (optional but recommended)

* [Github Student Account](https://education.github.com/pack)


## Jupyterhub

* [Course Jupyter Hub](https://pstat134.lsit.ucsb.edu)

* For PSTAT 134/234 coursework only

* Your work can be inspected by teaching staff at any time

# Jupyter Notebook

---

- Interactive Python environment for code and text

- Accessed using a web browser

- "Kernels" accessible through Jupyter include Python, shell, R, Julia, others  
    [Jupyter kernels](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels)
    
- Cells contain either code (interpreted according to the kernel or magic) or formatted text (Markdown)

## Writing formatted text

- Text formatting with [markdown syntax](https://guides.github.com/features/mastering-markdown/)

- Math equations with LaTeX commands:  
    e.g., `$$ \hat \mu = \frac{1}{n}\sum_{i=1}^n x_i,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2 $$` produces:
    $$ \hat \mu = \frac{1}{n}\sum_{i=1}^n x_i,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2 $$

- Images with `![Not Lazy](http://thoughtfulcampaigner.org/wp-content/uploads/2017/10/im-not-lazy-im-just-in-energy-saving-mode-sleepy-cat.jpg)`
    ![Not Lazy](http://thoughtfulcampaigner.org/wp-content/uploads/2017/10/im-not-lazy-im-just-in-energy-saving-mode-sleepy-cat.jpg)


- Headings, tables, lists, bold, italic, etc.

- Markdown syntax can have variations: e.g., [Github Flavored Markdown (GFM)](https://guides.github.com/features/mastering-markdown/#GitHub-flavored-markdown)
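
The estimators in the math example above can also be computed directly. A minimal sketch in plain Python, using a made-up sample `x` (hypothetical values); the variance divides by $n$, matching the formula shown, not by $n-1$.

```python
# Hypothetical sample, made up for illustration
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

mu_hat = sum(x) / n                                   # sample mean
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # variance with 1/n

print(mu_hat, sigma2_hat)
```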

## Coding in Python and more

- Python interpreter

- IPython (short for "interactive Python") is a Python shell with additional functionality  
    e.g., tab completion, syntax highlighting, magic, etc. (Chapter 1 in Vanderplas)  
    [demo]

- IPython's "[line, cell] magic" commands  
    e.g., debugging, code timing, shell commands, execute external script, etc.

In [1]:
%lsmagic

Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python  %%python

### Line magic starts with `%`

- Debugging example (python debugger): `%pdb`  
    [Demo]


In [2]:
%pdb off

def print_string(x, y):
    print('x is lowercase:', x.islower())
    print('y is lowercase:', y.islower())

print_string('a', 'b')

Automatic pdb calling has been turned OFF
x is lowercase: True
y is lowercase: True
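
The call above succeeds, so the debugger has nothing to catch. As a sketch of the kind of error `%pdb on` would react to, the variant below passes a non-string argument (a hypothetical input), which raises `AttributeError` because integers have no `islower` method. Here the exception is caught and printed instead of opening the debugger.

```python
def print_string(x, y):
    print('x is lowercase:', x.islower())
    print('y is lowercase:', y.islower())

caught = None
try:
    print_string('a', 1)  # 1 is an int, so y.islower() fails
except AttributeError as err:
    caught = err          # keep a reference; 'err' is cleared after the block
    print('caught:', err)
```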


### Cell magic starts with `%%`

- Entire cell is interpreted differently

#### Measure the running time of a cell with `%%timeit`

In [3]:
%%timeit -n500 -r10
total = []
for i in range(1000):
    total += [i]

107 µs ± 6.38 µs per loop (mean ± std. dev. of 10 runs, 500 loops each)


In [4]:
%%timeit -n500 -r10
total = [i for i in range(1000)]

33.1 µs ± 1.22 µs per loop (mean ± std. dev. of 10 runs, 500 loops each)
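
Outside IPython, a similar comparison can be made with the standard-library `timeit` module. This is a rough sketch: the function names are made up here, and timings vary by machine, so no expected numbers are shown.

```python
import timeit

def build_with_loop():
    total = []
    for i in range(1000):
        total += [i]
    return total

def build_with_comprehension():
    return [i for i in range(1000)]

# repeat=10 and number=500 mirror the -r10 and -n500 options used above
loop_times = timeit.repeat(build_with_loop, repeat=10, number=500)
comp_times = timeit.repeat(build_with_comprehension, repeat=10, number=500)

# Each entry is the total time for 500 calls; divide to get per-call time
print("loop:         ", min(loop_times) / 500, "s per call")
print("comprehension:", min(comp_times) / 500, "s per call")
```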


#### Run bash commands: `%%bash`

Bash is a shell and scripting language, and Bash code can coexist with Python in a Jupyter notebook. The following cell is interpreted by a Bash shell in the background, and any shell output is passed back to the notebook, e.g., a list of files.

In [4]:
%%bash

echo "######## hello" > somefile.txt
cat somefile.txt
echo "######## did you see a hello?"
echo "######## listing files"
rm somefile.txt
ls -alh

######## hello
######## did you see a hello?
######## listing files
total 28M
drwxrwxr-x 10 jovyan 1000 4.0K Mar 21 04:28 .
drwxrwxr-x 14 jovyan 1000 4.0K Mar 21 03:48 ..
-rw-rw-r--  1 jovyan 1000 4.5K Mar 21 04:20 00-Syllabus.ipynb
-rw-rw-r--  1 jovyan 1000  22K Mar 21 04:26 01-Statistical-Data-Science.ipynb
-rw-rw-r--  1 jovyan 1000 300K Jul 29  2018 02-Data-and-Uncertainty.ipynb
-rw-rw-r--  1 jovyan 1000 1.3M Jul 29  2018 03-Data-collection-and-manipulation.ipynb
-rw-rw-r--  1 jovyan 1000 122K Jul 30  2018 04-Pandas-Data-Frame.ipynb
-rw-rw-r--  1 jovyan 1000 255K Jul 30  2018 05-Data-Frame-and-Visualization.ipynb
-rw-rw-r--  1 jovyan 1000  21M Jul 29  2018 06-Basketball-Analytics-Alex-Franks.pdf
drwxrwxr-x  2 jovyan 1000 4.0K Nov 22 07:33 07-Shooting-Pattern-Analysis-files
-rw-rw-r--  1 jovyan 1000 715K Jul 31  2018 07-Shooting-Pattern-Analysis.ipynb
-rw-rw-r--  1 jovyan 1000  18K Aug  2  2018 08-Web-Services-and-Data-Interfaces.ipynb
-rw-rw-r--  1 jovyan 1000  23K Jul 29  2018 09-M

- [Who is Jovyan?](https://github.com/jupyter/docker-stacks/issues/358)
- ["In science fiction, a Jovian is an inhabitant of the planet Jupiter."](https://en.wikipedia.org/wiki/Jovian_%28fiction%29)

Shell output can be saved into a Python variable

In [26]:
nbfiles = !ls *.ipynb  # store filenames in nbfiles variable

for i, name in enumerate(nbfiles):  # i is the index, name is the filename
    print("file", i, ":", name)

file 0 : 00-Syllabus.ipynb
file 1 : 01-Statistical-Data-Science.ipynb
file 2 : 02-Data-and-Uncertainty.ipynb
file 3 : 03-Data-collection-and-manipulation.ipynb
file 4 : 04-Pandas-Data-Frame.ipynb
file 5 : 05-Data-Frame-and-Visualization.ipynb
file 6 : 07-Shooting-Pattern-Analysis.ipynb
file 7 : 08-Web-Services-and-Data-Interfaces.ipynb
file 8 : 09-Markowitz-Portfolio-Selection.ipynb
file 9 : 10-Markowitz-Portfolio-Covariance-Estimation.ipynb
file 10 : 11-Text-Mining.ipynb
file 11 : FitBitAPI.ipynb
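
The `!ls` syntax only works inside IPython. In a plain Python script, a comparable listing can be obtained with `pathlib`; the helper name `list_notebooks` below is made up for this sketch.

```python
from pathlib import Path

def list_notebooks(directory="."):
    """Return the sorted names of .ipynb files in the given directory."""
    return sorted(p.name for p in Path(directory).glob("*.ipynb"))

for i, name in enumerate(list_notebooks()):
    print("file", i, ":", name)
```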


### Other magic commands

- `%load`: load outside script
- `%store`: pass variables between notebooks
- [many more](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)

## Exporting notebooks

- Notebooks can be converted using [`nbconvert`](https://nbconvert.readthedocs.io/en/latest/)
    - Slides
    - Static web pages  
        e.g., automatically updated reports, blogs, etc.

- Shell command for automated execution  
    `jupyter nbconvert --to html --execute mynotebook.ipynb`
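
If the conversion is driven from Python rather than from a shell script, the same command can be assembled and run with `subprocess`. The helper names below are hypothetical, and the sketch assumes `jupyter` is installed and on the `PATH`.

```python
import subprocess

def nbconvert_command(notebook, to="html", execute=True):
    """Build the argument list for the jupyter nbconvert command."""
    cmd = ["jupyter", "nbconvert", "--to", to]
    if execute:
        cmd.append("--execute")
    cmd.append(notebook)
    return cmd

def convert(notebook, **kwargs):
    """Run nbconvert; raises CalledProcessError if the command fails."""
    return subprocess.run(nbconvert_command(notebook, **kwargs), check=True)

# Only print the command here; call convert(...) to actually run it
print(" ".join(nbconvert_command("mynotebook.ipynb")))
```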


## Following Python Data Science Handbook notebooks

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) is written in Jupyter notebooks and [the author made the source code available on Github](https://github.com/jakevdp/PythonDataScienceHandbook)

- You can follow the book in our course JupyterHub (file upload)

- You can fork the repository to your Github and make changes

# Bash Shell

---

* Bash is a text interface to the operating system (OS)

* OS handles file operations, interfacing with network, etc

* Bash allows you to execute system operations using text commands

    ```bash
    echo "######## hello" > somefile.txt ## write string to file 'somefile.txt'
    ls -alh                              ## list files in directory
    cat somefile.txt                     ## print contents of 'somefile.txt'
    rm somefile.txt                      ## remove 'somefile.txt'
    ```
* Super cool resource: [Explain Shell](https://explainshell.com/)

* [Demo] Use git command to clone a repository on Github
    
* More on command line later    
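
For comparison, the four file operations from the Bash snippet above can be written with Python's standard library. This is only a sketch; it works in a temporary directory so that nothing in the working directory is touched.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "somefile.txt"

    path.write_text("######## hello\n")                     # echo "..." > somefile.txt
    listing = sorted(p.name for p in Path(tmp).iterdir())   # ls
    contents = path.read_text()                             # cat somefile.txt
    path.unlink()                                           # rm somefile.txt
    still_there = path.exists()                             # False once removed

print(listing)
print(contents, end="")
```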