## Learning Goals for Module 1: Programming and data visualization
- Working with Jupyter Notebooks
- Expressions
- Variables & upcoming Data Types
    - Strings
    - Integer
    - Float
- built-in functions: print("Hello World"), max(), min(), abs(), pow(), round()
- Kernel
- errors
- check work with check('tests/q2_1.py')
- Translating science into Python formulas: Newton's equations
- importing code functionality: import math
    - math.pi
    - math.sqrt()
    - math.log()
    - math.factorial()
- arrays: from datascience import *
    - make_array(0.125, 4.75, -1.3)
- arrays: import numpy as np
    - np.array([0, 1, -1, math.pi, math.e])
- lists: [1, 2, 3, 4]
- Functions
    - customize Python
    - syntax
        - def
        - return
    - arguments
    - local variables
- Visualization
    - datascience Table
        - plot
        - scatter
        - hist
    - arrays
        - from data columns
        - matplotlib
        - plotly
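
As a quick illustration of the built-in functions and `math` tools listed above, here is a minimal sketch using only the Python standard library. The Newton's second law values (`mass`, `acceleration`) are made up for the example.

```python
import math

# Built-in functions from the list above
largest = max(3, 7, 5)        # 7
smallest = min(3, 7, 5)       # 3
distance = abs(-4.2)          # 4.2
squared = pow(2, 10)          # 1024
rounded = round(3.14159, 2)   # 3.14

# math.pi is a constant, so it is used without parentheses
circle_area = math.pi * 2 ** 2

# Functions from the math module
root = math.sqrt(16)               # 4.0
natural_log = math.log(math.e)     # natural log of e, approximately 1
arrangements = math.factorial(5)   # 120

# Translating a formula into Python: Newton's second law, F = m * a
mass = 2.0           # kg (illustrative value)
acceleration = 9.8   # m/s^2 (illustrative value)
force = mass * acceleration
print(force)
```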

# Today's Data Wrangling Example 
![Heart](data/valentines-day-2023-6753651837109573.3-law.gif)

Data from Kaggle; see [https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset). The columns are:
1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

*target* (0 = no heart disease and 1 = heart disease)

In [9]:
from datascience import *
import numpy as np
# import for plotting
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# Fix for datascience plots
import collections as collections
import collections.abc as abc
collections.Iterable = abc.Iterable

In [10]:
path = 'data/'
data = path + 'heart.csv'
heart = Table.read_table(data)
heart

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
58,0,0,100,248,0,0,122,0,1.0,1,0,2,1
58,1,0,114,318,0,2,140,0,4.4,0,3,1,0
55,1,0,160,289,0,0,145,1,0.8,1,1,3,0
46,1,0,120,249,0,0,144,0,0.8,2,0,3,0
54,1,0,122,286,0,0,116,1,3.2,1,2,2,0


In [11]:
heart.group("age")

age,count
29,4
34,6
35,15
37,6
38,12
39,14
40,11
41,32
42,26
43,26


In [12]:
heart.group("target",np.average)

target,age average,sex average,cp average,trestbps average,chol average,fbs average,restecg average,thalach average,exang average,oldpeak average,slope average,ca average,thal average
0,56.5691,0.827655,0.482966,134.106,251.293,0.164329,0.456914,139.13,0.549098,1.6002,1.16633,1.15832,2.53908
1,52.4087,0.570342,1.37833,129.245,240.979,0.134981,0.598859,158.586,0.134981,0.569962,1.59316,0.370722,2.11977


# Functions
Many functions are built into Python, and still more can be loaded with the `import` statement. This makes the language extremely flexible. Even so, you will often need a custom function, which is a powerful way to automate repetitive data handling and analysis tasks reproducibly. A function takes its arguments in parentheses `()` directly after its name. For instance, below is the built-in Python `print` function:

In [13]:
dogname = "Phineas" # Define `dogname` variable
print(dogname)      # `dogname` is the argument for the function, print

Phineas


Now let's give this a try by learning how to write our own functions.

In [None]:
def double(x):
    """ doubles """
    return 2*x

In [None]:
?double

In [None]:
def triple(xtra):
    """ triples """
    return 3*xtra

In [None]:
x = double(4)*triple(4)
x

In [None]:
x = triple(x)

In [None]:
print(x)

In [None]:
?double

In [None]:
double(10)
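
The `?double` cells above use IPython's help syntax, which displays the function's docstring. In plain Python the same text should be reachable through the `__doc__` attribute (or `help(double)`). A minimal sketch that redefines the two functions so it runs on its own:

```python
def double(x):
    """ doubles """
    return 2 * x

def triple(xtra):
    """ triples """
    return 3 * xtra

# The docstring is stored on the function object itself
print(double.__doc__)

# Function calls can be combined like any other expression
x = double(4) * triple(4)   # 8 * 12 = 96
x = triple(x)               # 3 * 96 = 288
print(x)
```

Note that the argument name `xtra` inside `triple` is unrelated to the variable `x` outside it; only the value is passed in.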

## Defining functions

Here is a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50.  (No percent sign.)

A function definition has a few parts.

##### `def`
It always starts with `def` (short for **def**ine):
```
    def
```

##### Name
Next comes the name of the function.  Let's call our function `to_percentage`.
```
    def to_percentage
```
##### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  `to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.
```
    def to_percentage(proportion)
```

We put a colon after the signature to tell Python it's over.
```
    def to_percentage(proportion):
```

##### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing a triple-quoted string:
```
    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
```
    
    
##### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function.  We can write anything we could write anywhere else.  First let's give a name to the number we multiply a proportion by to get a percentage.
```
    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
```

##### `return`
The special instruction `return` in a function's body tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:
```
    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor
```

In [None]:
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    factor = 100
    return proportion * factor

In [None]:
print(to_percentage(0.45),"%")

In [None]:
to_percentage(0.45)
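
Names assigned inside a function body, such as `factor`, are local to that function. A minimal sketch of what that means in practice (the `try`/`except` check and the name `factor_is_visible` are only there for illustration):

```python
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    factor = 100          # local variable: exists only while the function runs
    return proportion * factor

result = to_percentage(0.5)
print(result)   # 50.0

# `factor` was never defined outside the function, so using it here fails
try:
    factor
    factor_is_visible = True
except NameError:
    factor_is_visible = False

print(factor_is_visible)   # False
```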

In [None]:
import numpy as np
from datascience import *
years = np.arange(1990,2022)
table1 = Table().with_columns("years", years,"Odd",years %2)
table1.where("Odd",are.equal_to(1))

#### For [Unicode emojis](https://unicode.org/emoji/charts-14.0/full-emoji-list.html)

In [None]:
# \N{...} looks up a character by its Unicode name (this may differ from the chart's CLDR short name)
print("\N{grinning face with smiling eyes}")
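
The `\N{...}` escape looks a character up by its Unicode name. As a small sketch, the standard-library `unicodedata` module can go the other way and report the name and code point of a character:

```python
import unicodedata

face = "\N{grinning face with smiling eyes}"

# Official Unicode name and code point of the character
name = unicodedata.name(face)
code_point = hex(ord(face))

print(name, code_point)
```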

In [None]:
import numpy as np
def happy_print(n):
    """ Prints happy n times """
    for i in np.arange(n):
        # print(i+1)
        print(i+1,"\N{grinning face with smiling eyes}")
    return n

In [None]:
happy_print(60)

#### Now try a more complex function

In [None]:
import numpy as np
# Compute the fractional change (formatted as a percentage later)

def per_change(x, y):
    """Return the fractional change from y to x, rounded to two decimals.

    Takes the ratio of x to y and subtracts 1,
    so per_change(20, 16) is 0.25 (a 25% increase).
    """
    return np.round(x/y - 1, 2)

In [None]:
per_change(3.89,3.69)
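
To make the arithmetic inside `per_change` concrete, here is a sketch of the same calculation in plain Python. The name `per_change_plain` is hypothetical, and it uses the built-in `round` instead of `np.round`.

```python
def per_change_plain(x, y):
    """Fractional change from y to x, rounded to two decimals."""
    ratio = x / y          # how many times larger x is than y
    change = ratio - 1     # subtract 1 to get the change relative to y
    return round(change, 2)

print(per_change_plain(20, 16))            # 0.25
print(f"{per_change_plain(20, 16):.0%}")   # 25%
print(per_change_plain(3.89, 3.69))        # 0.05
```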

#### Now use apply to compute new Table column

In [None]:
from datascience import *
data = 'http://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/asrh/nc-est2020-agesex-res.csv'
full_census_table = Table.read_table(data)
partial_census_table = full_census_table.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2020')
partial_census_table = partial_census_table.relabeled('SEX', 'GENDER').relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2020', '2020')
partial_census_table

In [None]:
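# Keep rows where column 0 (GENDER) equals 0 and AGE is below 99
census = partial_census_table.where(0, 0).where('AGE', are.below(99))
# Apply per_change row by row to the '2020' and '2010' columns
census = census.with_columns(
    "% change", census.apply(per_change, '2020', '2010')
)
census.set_format('% change', PercentFormatter)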
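
Conceptually, `census.apply(per_change, '2020', '2010')` calls the function once per row, passing that row's values from the two named columns, and collects the results. A rough plain-Python sketch of that idea follows. The population numbers are invented for illustration and are not census data, and `per_change_plain` is a hypothetical stand-in for `per_change`.

```python
def per_change_plain(x, y):
    """Fractional change from y to x, rounded to two decimals."""
    return round(x / y - 1, 2)

# Invented example columns (not real census values)
pop_2010 = [1000, 2000, 4000]
pop_2020 = [1100, 1900, 4000]

# One function call per row, pairing the two columns
changes = [per_change_plain(new, old) for new, old in zip(pop_2020, pop_2010)]
print(changes)   # [0.1, -0.05, 0.0]
```

The order of the column labels matters: the first label feeds the function's first argument, so here the 2020 value is divided by the 2010 value.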

These imports are needed for plotting:

In [None]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# Fix for datascience plots
import collections as collections
import collections.abc as abc
collections.Iterable = abc.Iterable

In [None]:
census.select('AGE','% change').plot('AGE')

Select the columns to plot:

In [None]:
census.select('AGE','2010','2020').plot('AGE')

### `scatter` depicts the relationship between two variables

In [None]:
census.scatter('2010','2020')

In [None]:
import plotly.express as px

series1 = census.column('2010')
age = census.column('AGE')
fig = px.line(x=age, y=[series1])

fig.show()

In [None]:
census.hist('% change')

# Visualize
Start again with Census data

In [None]:
# Load data
from datascience import *
data = 'http://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/asrh/nc-est2020-agesex-res.csv'
full_census_table = Table.read_table(data)
partial_census_table = full_census_table.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2020')
partial_census_table = partial_census_table.relabeled('SEX', 'GENDER').relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2020', '2020')
partial_census_table

In [None]:
import numpy as np
# Compute the fractional change (formatted as a percentage later)
def per_change(x, y):
    """Return the fractional change from y to x, rounded to two decimals.

    Takes the ratio of x to y and subtracts 1,
    so per_change(20, 16) is 0.25 (a 25% increase).
    """
    return np.round(x/y - 1, 2)

In [None]:
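# Keep rows where column 0 (GENDER) equals 0 and AGE is below 99
census = partial_census_table.where(0, 0).where('AGE', are.below(99))
# Apply per_change row by row to the '2020' and '2010' columns
census = census.with_columns(
    "% change", census.apply(per_change, '2020', '2010')
)
census.set_format('% change', PercentFormatter)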

In [None]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# Fix for datascience plots
import collections as collections
import collections.abc as abc
collections.Iterable = abc.Iterable

In [None]:
census.select('AGE','% change').plot('AGE')

In [None]:
import plotly.express as px

series1 = census.column('2010')
age = census.column('AGE')
fig = px.line(x=age, y=[series1])

fig.show()

In [None]:
COVID_data = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us.csv'
COVID=Table.read_table(COVID_data)
COVID=COVID.set_format(0, DateFormatter(format='%Y-%m-%d',))
COVID

In [None]:
import time
time.time()

In [None]:
import time                # Python time functions
from time import strptime
# time.time() would give the current time in seconds since the epoch
time1 = time.mktime(strptime('2020-10-01', '%Y-%m-%d'))  # Start date, in seconds since the epoch
time2 = time.mktime(strptime('2022-10-01', '%Y-%m-%d'))  # End date, in seconds since the epoch
Early2022 = COVID.where('date',are.between(time1,time2))
Early2022
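
`time.strptime` parses a date string into a structured time, and `time.mktime` turns that into seconds since the epoch, interpreted in the computer's local time zone. A short sketch of that round trip using only the standard library (the names `start`, `end`, and `days_between` are illustrative):

```python
import time

start = time.mktime(time.strptime('2020-10-01', '%Y-%m-%d'))
end = time.mktime(time.strptime('2022-10-01', '%Y-%m-%d'))

# Both values are plain floats, so they can be compared and subtracted
days_between = (end - start) / (24 * 60 * 60)
print(round(days_between))

# Converting back recovers the original calendar date
start_date = time.strftime('%Y-%m-%d', time.localtime(start))
print(start_date)
```

Because the dates become ordinary numbers, a range filter such as `are.between(time1, time2)` can compare them directly.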

In [None]:
type(Early2022)

Smoothing

In [None]:
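dt = Early2022.column('deaths')
kernel_size = 10                              # Number of points in the moving average
kernel = np.ones(kernel_size) / kernel_size   # Equal weights that sum to 1
data_convolved = np.convolve(dt, kernel, mode='same')  # Output has the same length as dt

Early2022 = Early2022.with_columns('moving_avg', data_convolved)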

In [None]:
# Input Data to plot
dates = Early2022.column('date')  
deaths = Early2022.column('moving_avg') 
# mdates does the trick!
## DATE PLOTTING CODE TEMPLATE TO COPY ##
import matplotlib.dates as mdates
date = Early2022.column('date').astype('datetime64[s]') # Need to convert to a datetime64[s] object
loc = mdates.AutoDateLocator() # Fancy function for dates
fmt = mdates.AutoDateFormatter(loc)
plt.gca().xaxis.set_major_formatter(fmt)
plt.gca().xaxis.set_major_locator(loc)
## END: DATE PLOTTING CODE TEMPLATE TO COPY ##
#
# Now plot
plt.plot(date,deaths)
plt.gcf().autofmt_xdate()

In [None]:
Early2022.hist('deaths', bins=np.arange(0,5000,500))

### Data Smoothing Example
Smooth only to reveal a long-term trend, and always disclose the smoothing process.

In [None]:
import numpy as np
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.8

def smooth(y, box_pts):
    box = np.ones(box_pts)/box_pts
    y_smooth = np.convolve(y, box, mode='same')
    return y_smooth

plt.plot(x, y,'o') # Blue dots
plt.plot(x, smooth(y,3), 'r-', lw=2) # Red line
plt.plot(x, smooth(y,19), 'g-', lw=2) # Green line
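
The box filter in `smooth` is a centered moving average. A minimal plain-Python sketch (the helper `moving_average` is hypothetical and assumes an odd window) shows what happens at the ends of the series, where missing neighbors are treated as zeros. For an odd window, this is also what `np.convolve(..., mode='same')` computes with a box kernel.

```python
def moving_average(values, box_pts):
    """Centered moving average with zero padding (odd box_pts assumed)."""
    half = box_pts // 2
    padded = [0] * half + list(values) + [0] * half
    return [sum(padded[i:i + box_pts]) / box_pts for i in range(len(values))]

y = [6, 6, 6, 6, 6]
smoothed = moving_average(y, 3)
print(smoothed)   # [4.0, 6.0, 6.0, 6.0, 4.0]
```

Even for a perfectly flat series, the first and last values are pulled toward zero, so the ends of a smoothed curve deserve less trust than the middle.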

## Is COVID receding?
Table group work

In [None]:
from datascience import *
COVID_data = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us.csv'
COVID=Table.read_table(COVID_data)
COVID=COVID.set_format(0, DateFormatter(format='%Y-%m-%d',))

In [None]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import plotly.express as px
# Fix for datascience plots
import collections as collections
import collections.abc as abc
collections.Iterable = abc.Iterable

In [None]:
import time                # Python time functions
from time import strptime
# time.time() would give the current time in seconds since the epoch
time1 = time.mktime(strptime('2020-10-01', '%Y-%m-%d'))  # Start date, in seconds since the epoch
time2 = time.mktime(strptime('2022-10-01', '%Y-%m-%d'))  # End date, in seconds since the epoch

### Best visualization that shows COVID status relative to height of pandemic
Share from tables to class
- Matplotlib [reference](https://matplotlib.org/stable/gallery/showcase/anatomy.html#sphx-glr-gallery-showcase-anatomy-py)
- Plotly [reference](https://plotly.com/python/)