## Some Notebook instructions

Overview:
 - Importing Packages
 - Making a pandas DataFrame
 - Plotting the data from pandas with altair

### 1. Importing Packages

Packages are groups of pre-defined functions for specific tasks, e.g. manipulating strings or plotting graphs. Sharing packages saves people from re-writing common pieces of code and keeps commonly used functions standard.

Python has many useful packages to use when coding. Many are built in when Python is installed; others can be installed later using the _pip_ module on the command line:
 > python -m pip install SOME_PACKAGE
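
Before importing, it can help to check whether a package is already installed. The following is a minimal sketch using the standard-library `importlib`; the helper name `is_installed` is made up for illustration and is not part of any package.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can find the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

# json ships with Python; SOME_PACKAGE is the placeholder name used above
print(is_installed("json"))
print(is_installed("SOME_PACKAGE"))
```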

Inside the notebook, packages are usually imported at the start of the code. This makes it clear what will be used or required later on.
Importing has the form:
> import SOME_PACKAGE

The functions of the package can then be accessed:
> SOME_PACKAGE.SOME_FUNCTION()

You can also use an alias so it is easier to refer to the package later on:
> import SOME_PACKAGE as sp

The functions of the package can then be accessed through the alias:
> sp.SOME_FUNCTION()
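
As a concrete illustration of both forms, here is a small sketch using the built-in `math` package. The alias `m` is an arbitrary choice for this example.

```python
# plain import: functions are accessed through the package name
import math
root = math.sqrt(16)

# aliased import: the same functions are accessed through the alias
import math as m
same_root = m.sqrt(16)

print(root, same_root)
```

Both names refer to the same package, so the two calls give the same result.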



In [None]:
### import packages
import pandas as pd
import altair as alt


### 2. Making a pandas DataFrame

Pandas is a package for data analysis with many useful tools for _wrangling_ and _manipulating_ data.

__wrangling__

Before data analysis can begin, the data must be read into a common format. Often some clean-up is required, e.g. removing punctuation or changing case.

__manipulating__

Once the data is formatted, some initial adjustments can be made, e.g. setting the data type or scaling values.
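
The two stages might look like the sketch below. The DataFrame `df_raw` and its contents are made up for illustration; the pandas string methods and `astype` are just one way to do each step.

```python
import pandas as pd

# hypothetical raw data: messy text and numbers stored as strings
df_raw = pd.DataFrame({'name': ["Alice!", "BOB.", "carol?"],
                       'score': ["10", "20", "30"]})

# wrangling: remove punctuation and change case
df_raw['name'] = df_raw['name'].str.replace(r'[^\w\s]', '', regex=True).str.lower()

# manipulating: set the data type, then scale values relative to the maximum
df_raw['score'] = df_raw['score'].astype(int)
df_raw['score_scaled'] = df_raw['score'] / df_raw['score'].max()

print(df_raw)
```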


In [None]:
### define some data as a list of dictionaries
input_data=[ {'key':"this", 'value':True},
      {'key':"that", 'value':3.14},
      {'key':"other", 'value':"some text"},
      {'key':"another", 'value':[1,2,3]}
      ]
input_data

In [None]:
### convert to a DataFrame
df_input=pd.DataFrame( input_data )
df_input

Example wrangling:
- split the list of numbers (_another_) into separate rows
    - see official documentation: [link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html)
- remove non-numeric values
    - see a user example: [link](https://stackoverflow.com/questions/34855859/is-there-a-way-in-pandas-to-use-previous-row-value-in-dataframe-apply-when-previ)
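
To see what these two steps do in isolation, here is a minimal sketch on a toy DataFrame. The name `df_demo` and its contents are made up for illustration. Note that `explode` repeats the original row index for every value it splits out, which is why resetting the index afterwards is useful.

```python
import pandas as pd

# hypothetical toy data: one list value and one text value
df_demo = pd.DataFrame({'key': ["a", "b"], 'value': [[1, 2], "text"]})

# explode: the list becomes one row per element, the text row is unchanged
exploded = df_demo.explode('value')
print(exploded.index.tolist())

# to_numeric with errors='coerce' turns anything non-numeric into NaN,
# so notnull() marks the rows worth keeping
is_numeric = pd.to_numeric(exploded['value'], errors='coerce').notnull()
cleaned = exploded[is_numeric].reset_index(drop=True)
print(cleaned)
```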

In [None]:
### make array into separate values
df_input.explode('value')


In [None]:
### keep only the rows with numeric values
df_input[pd.to_numeric(df_input['value'], errors='coerce').notnull()]


In [None]:
### use both together and assign to new dataframe
# array --> rows
df_new=df_input.explode('value')
# remove non-numeric
df_new=df_new[pd.to_numeric(df_new['value'], errors='coerce').notnull()]
# reset row index
df_new=df_new.reset_index(drop=True)
# view the result by putting the variable on the last line, or with display(VARIABLE)
df_new


### 3. Plotting the data from pandas with altair

Altair is a data visualisation package designed to work with pandas DataFrames. It uses the column names of the input DataFrame to define the variables for plotting. Each variable can be mapped to a visualisation _channel_, e.g. position, colour or shape.



In [None]:
# define some data
plot_data=[{'x':1, 'y':2, 'z':"up"},
           {'x':2, 'y':4, 'z':"up"},
           {'x':3, 'y':4, 'z':"up"},
           {'x':4, 'y':6, 'z':"up"},
           {'x':5, 'y':7, 'z':"up"},
           {'x':6, 'y':9, 'z':"up"},
           {'x':7, 'y':12, 'z':"up"},
           {'x':8, 'y':10, 'z':"down"},
           {'x':9, 'y':9, 'z':"down"},
           {'x':10, 'y':10, 'z':"up"}
           ]
# convert to dataframe
df_plot=pd.DataFrame(plot_data)
df_plot

In [None]:
### plotting
chart=alt.Chart(df_plot).mark_point().encode(
    x='x:Q',
    y='y:Q',
    color='z:N'
).properties( title="Simple scatter" )
display(chart)

A more complex example:
 - plotting 4 dimensions (channels)
 - setting the scale
 - using a tooltip (shows information when the mouse hovers over a point)

In [None]:
### generate fake data
new_data=[]
# loop 20 times
for i in range(0,20,1):
    # add dictionary to data list
    new_data.append({'x':i})
# make data frame
df_new=pd.DataFrame(new_data)
df_new

In [None]:
# define y value related to each x value:
# x*3 - random value
import random  # imports would normally go at the start of the notebook
df_new['y']=df_new['x'].apply(lambda x: x*3 - random.random()*10)
df_new

In [None]:
# define z value:
# "up" when gradient positive, "down" otherwise
# (the first row has no previous value, so its gradY is NaN and z is "down")
df_new['z']=None
df_new['gradY']= df_new['y'] - df_new['y'].shift(1)
df_new['z']=df_new['gradY'].apply(lambda x: "up" if x>0 else "down")
df_new

In [None]:
### plotting: using detailed channel encoding, e.g. alt.X()
chart=alt.Chart(df_new).mark_point().encode(
    # change title
    x=alt.X('x:Q', axis=alt.Axis(title="edited title")),
    # set domain [0,50] - other points will appear outside grid
    y=alt.Y('y:Q', scale=alt.Scale(domain=[0,50])),
    # turn off legend
    color=alt.Color('z:N', legend=None),
    # use tool tip to show information per point (use gradY, not z)
    tooltip=['x:Q','y:Q','gradY:Q']
).properties( title="Another scatter" )
display(chart)