# STA 130 Lab 1

## Goals
Today's lab is a short introduction to Python and some of its scientific computing libraries. Since most students in STA 130 should already be familiar with Python, it serves mainly as a refresher, along with an introduction to Jupyter notebooks and reproducible research.

## A brief foray into Microsoft Excel

- The popular spreadsheet software is very common in finance, consulting, and even some engineering firms. 
- It can be helpful for quick calculations, but it is not reproducible, and formulas are prone to errors and unintended changes if not properly protected.
- Many data repositories, especially those run by the federal government, still provide data downloads in the form of .xls, .xlsx, or the simpler .csv file format.
- We will download some data in Excel and demonstrate some basic data manipulation in Python.

## Getting data
- The UCI Machine Learning Repository is a large collection of data sets (of varying vintages) that are freely available to the public.
- Many government websites also provide raw data in various formats.
- We will examine some data from the CDC that is available at https://www.openintro.org/stat/data/?data=cdc

In [5]:
import pandas as pd # library for basic data manipulation tools

- Download the data by clicking the yellow "CSV Download" button
- First open the file using Microsoft Excel. We'll explore `Text to Columns`, write a quick formula, and make a graphic.
    - **Note**: Excel is not reproducible and **should not** be used as a data analysis tool for research. It can also be troublesome as a data entry tool, since it tries to be helpful by automatically converting data types. These changes often cannot be undone.
    - It is more prudent to save data in plain-text format (e.g. as a .csv file) and handle any data cleaning and analysis tasks in software like R or Python. 
    - Statistical software gives you more control and makes it much more difficult to overwrite your source data. It is also reproducible, since your well-commented script is an outline of each step that you took from the moment that you began working with the data. 
- Now upload the file to your Docker container on the OIT server and read it into your Jupyter notebook using `read_csv` from the `pandas` library.

In [6]:
cdc_df = pd.read_csv("./cdc.csv")
cdc_df.head(10) # display the first 10 records

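| | genhlth | exerany | hlthplan | smoke100 | height | weight | wtdesire | age | gender |
|---:|:---|---:|---:|---:|---:|---:|---:|---:|:---|
| 0 | good | 0 | 1 | 0 | 70 | 175 | 175 | 77 | m |
| 1 | good | 0 | 1 | 1 | 64 | 125 | 115 | 33 | f |
| 2 | good | 1 | 1 | 1 | 60 | 105 | 105 | 49 | f |
| 3 | good | 1 | 1 | 0 | 66 | 132 | 124 | 42 | f |
| 4 | very good | 0 | 1 | 0 | 61 | 150 | 130 | 55 | f |
| 5 | very good | 1 | 1 | 0 | 64 | 114 | 114 | 55 | f |
| 6 | very good | 1 | 1 | 0 | 71 | 194 | 185 | 31 | m |
| 7 | very good | 0 | 1 | 0 | 67 | 170 | 160 | 45 | m |
| 8 | good | 0 | 1 | 1 | 65 | 150 | 130 | 27 | f |
| 9 | good | 1 | 1 | 0 | 70 | 180 | 170 | 44 | m |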
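
Like a spreadsheet, `read_csv` guesses a data type for each column, but the guess never alters the source file and can be overridden. The sketch below uses a made-up two-row CSV (the `zip` and `count` columns are hypothetical and not part of the CDC data) to show how the `dtype` argument of `read_csv` can keep leading zeros that type inference would otherwise drop.

```python
import io
import pandas as pd

# Hypothetical CSV text, used only for illustration (not the CDC file)
raw_text = "zip,count\n02134,3\n90210,5\n"

# Default behaviour: pandas infers integers, so the leading zero is lost
inferred = pd.read_csv(io.StringIO(raw_text))

# Explicit dtype: the column is kept as text exactly as written in the file
as_text = pd.read_csv(io.StringIO(raw_text), dtype={"zip": str})

print(inferred["zip"].tolist())  # [2134, 90210]
print(as_text["zip"].tolist())   # ['02134', '90210']
```

In both cases the original text is unchanged, so a wrong guess can be corrected by editing the script and rerunning it.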


## Basic data summary

In [7]:
cdc_df.dtypes # automatically assigned data types

genhlth     object
exerany      int64
hlthplan     int64
smoke100     int64
height       int64
weight       int64
wtdesire     int64
age          int64
gender      object
dtype: object

In [8]:
# correct data types for categorical variables
cdc_df = cdc_df.astype(dict(genhlth = "category",
                            exerany = "category",
                            hlthplan = "category",
                            smoke100 = "category",
                            gender = "category"
                           )
                      )

In [9]:
cdc_df.dtypes

genhlth     category
exerany     category
hlthplan    category
smoke100    category
height         int64
weight         int64
wtdesire       int64
age            int64
gender      category
dtype: object
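
A possible refinement, sketched here on a small made-up sample rather than the CDC data: `genhlth` has a natural ordering, and pandas can record that ordering with `pd.CategoricalDtype(..., ordered=True)`. The level order below (worst to best) is an assumption made for this example.

```python
import pandas as pd

# Small made-up sample of health ratings (not the real CDC records)
ratings = pd.Series(["good", "very good", "poor", "excellent", "fair", "good"])

# Assumed ordering of the levels, from worst to best
health_levels = ["poor", "fair", "good", "very good", "excellent"]
health_dtype = pd.CategoricalDtype(categories=health_levels, ordered=True)

ordered_ratings = ratings.astype(health_dtype)

# min/max and comparisons now follow the stated order, not alphabetical order
print(ordered_ratings.min(), ordered_ratings.max())
print((ordered_ratings >= "good").sum())  # number of ratings at "good" or better
```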

In [10]:
# describe will return descriptive stats for numerical variables
cdc_df.describe()

| | height | weight | wtdesire | age |
|:---|---:|---:|---:|---:|
| count | 20000.0 | 20000.0 | 20000.0 | 20000.0 |
| mean | 67.1829 | 169.68295 | 155.09385 | 45.06825 |
| std | 4.125954 | 40.08097 | 32.013306 | 17.192689 |
| min | 48.0 | 68.0 | 68.0 | 18.0 |
| 25% | 64.0 | 140.0 | 130.0 | 31.0 |
| 50% | 67.0 | 165.0 | 150.0 | 43.0 |
| 75% | 70.0 | 190.0 | 175.0 | 57.0 |
| max | 93.0 | 500.0 | 680.0 | 99.0 |
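
By default `describe` skips the categorical columns. As a minimal sketch on a made-up four-row data frame (`toy_df` is hypothetical, with column names borrowed from the CDC data), the `include` argument asks for a summary of the categorical columns instead.

```python
import pandas as pd

# Hypothetical toy data frame with two categorical columns and one numeric column
toy_df = pd.DataFrame({
    "genhlth": ["good", "good", "very good", "fair"],
    "gender": ["m", "f", "f", "f"],
    "height": [70, 64, 61, 66],
}).astype({"genhlth": "category", "gender": "category"})

# For categorical columns, describe reports count, unique, top (most common level), and freq
cat_summary = toy_df.describe(include=["category"])
print(cat_summary)
```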


In [16]:
# we can get frequencies for specific variables as well (pass normalize=True for relative frequencies)
cdc_df["genhlth"].value_counts()

very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64
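
Relative frequencies come from the same method with `normalize=True`. The sketch below rebuilds a stand-in `genhlth` column from the counts printed above, so it runs without the CSV file; on the real data the call would simply be `cdc_df["genhlth"].value_counts(normalize=True)`.

```python
import pandas as pd

# Stand-in column rebuilt from the frequency table shown above
counts = {"very good": 6972, "good": 5675, "excellent": 4657, "fair": 2019, "poor": 677}
genhlth = pd.Series(
    [level for level, n in counts.items() for _ in range(n)],
    name="genhlth",
)

# normalize=True divides each count by the total number of non-missing records
rel_freq = genhlth.value_counts(normalize=True)
print(rel_freq)
```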

In [18]:
len(cdc_df)

20000

In [23]:
cdc_df.shape

(20000, 9)

**Note how the above code reflects our goal to *be reproducible*! We make explicit comments and include all code necessary to generate the output that we need for our discussion.** This will become more important as our analyses become more complex. 
- More complicated work also introduces new issues of reproducibility, like setting the seed for Python's random number generator, and keeping track of what version of a particular library or OS you are using.
- These issues will be incredibly important should you continue to do numerical computing for courses, and especially research.
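
As a rough sketch of both ideas, the block below seeds a NumPy random number generator and prints the versions in use. The seed value `130` is arbitrary, and NumPy's `default_rng` is one option among several (the standard library's `random.seed` plays the same role for the `random` module).

```python
import platform
import sys

import numpy as np
import pandas as pd

SEED = 130  # arbitrary value chosen for this sketch

# Two generators created with the same seed produce identical draws
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)
draws_a = rng_a.normal(size=5)
draws_b = rng_b.normal(size=5)
print(np.array_equal(draws_a, draws_b))

# Record the software environment alongside the results
print("Python:", sys.version.split()[0])
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("OS:", platform.platform())
```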

## Writing a function

Remember that if you are going to be doing a task repeatedly in a script, you should always start thinking about modular programming. This makes your code much more readable and may help cut down on mistakes when you inevitably need to modify your code. Below is a fairly contrived example, but it should serve to remind you of a few key concepts:
- function name should be brief and meaningful
- always include a docstring that describes the purpose of your function, inputs, and outputs
- never forget about data validation and error handling

In [11]:
def corr_fun(x, y):
    """
    This is a sample function to demonstrate the idea of modular programming in statistics.
    It takes two arrays of type int64 and calculates the correlation coefficient between them. 
    It returns corr(x,y) as a scalar, or an error message if inputs fail the data validation steps.
    """
    
    # use numpy library version of correlation coefficient
    import numpy as np
    
    # check input type - corr not defined for categorical vars
    if x.dtype == "int64" and y.dtype == "int64":
        return np.corrcoef(x, y)[0,1] # return only the correlation coefficient, not the full 2x2 correlation matrix
    else:
        return "Error: at least one input was not of type int64"

In [12]:
# good call to function
corr_fun(cdc_df.weight, cdc_df.height)

0.5553221916098966

In [13]:
# bad call to function
corr_fun(cdc_df.smoke100, cdc_df.height)

'Error: at least one input was not of type int64'
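
Returning an error message as a string means the caller has to inspect the return value to notice the problem. A common alternative is to raise an exception. The sketch below is a hypothetical variant (`corr_checked` is not part of the lab code) that uses `pandas.api.types.is_numeric_dtype` for validation and runs on small made-up Series rather than the CDC data.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def corr_checked(x, y):
    """
    Hypothetical variant of corr_fun, for illustration only.
    Returns the correlation coefficient of two numeric Series as a float.
    Raises TypeError if either input is not numeric (e.g. categorical).
    """
    for name, values in (("x", x), ("y", y)):
        if not is_numeric_dtype(values):
            raise TypeError(f"{name} must be numeric, got dtype {values.dtype}")
    return float(np.corrcoef(x, y)[0, 1])


# Made-up values for demonstration
heights = pd.Series([60, 64, 66, 70])
weights = pd.Series([105, 125, 132, 175])
smoker = pd.Series([1, 1, 0, 0]).astype("category")

print(corr_checked(weights, heights))

# The bad call now fails loudly and can be handled explicitly
try:
    corr_checked(smoker, heights)
except TypeError as err:
    print("Caught:", err)
```

Unlike the `int64`-only check, this version also accepts floating-point inputs; whether that is desirable depends on the analysis.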

## A quick overview of Markdown

### Headings

### Plain Text

Lorem ipsum...

- **bold text**
- *italic text*

### LaTeX-style math

- **In-line equations:** $y = x^2 + 2x + 3$
- **Centered, multi-line equations:**
$$
\sum_{j=1}^{n} x_{j}^2 \\
\frac{1}{n}\sum_{j=1}^{n} x_j
$$

### Basic tables

Header 1 | Header 2 | Header 3 
--------:| --------:| -------:
123 | 456 | 789
123 | 456 | 789
123 | 456 | 789
123 | 456 | 789

Header 1 | Header 2 | Header 3 
-------- | -------- | -------
123 | 456 | 789
123 | 456 | 789
123 | 456 | 789
123 | 456 | 789

## References

VanderPlas, J. (2016). *Python Data Science Handbook*. Retrieved from: https://jakevdp.github.io/PythonDataScienceHandbook/.