# Introduction to python 

## Table of contents
1. [What is Python?](#what_is_python)
2. [Using python for arithmetic](#arithmetic)
3. [Objects](#objects)
4. [Data types](#classes)
5. [Data structures](#structures)


## What is Python? <a name="what_is_python"></a>
Python is a general-purpose programming language that is also excellent for data analysis. It allows you to analyse and plot virtually any kind of data with complete flexibility. Python has many similarities to the R statistical programming language, but has important syntactical differences. Most people write Python code in an *integrated development environment* (IDE), an application that gives you additional tools to make programming easier. There are many IDEs you could use for Python, but I would suggest either [Positron](https://github.com/posit-dev/positron/releases) or [VS Code](https://code.visualstudio.com/download), following the guidance each provides on setting it up with Python.

I'm using VS Code. I went to File > Open Folder... and selected the folder I cloned this repository into, so that my working directory is set to the cloned directory.


## Using python for arithmetic <a name="arithmetic"></a>
Let's start with some basic arithmetic.

In [112]:
2 + 2

4

In [113]:
7 - 8

-1

In [114]:
4 * 6

24

In [115]:
144 / 12

12.0

Notice that raising one number to the power of another uses `**` in Python, rather than `^` as in R.

In [116]:
8 ** 9

134217728

In [117]:
2 + 2 * (2 ** 6) # this is a comment

130

Note that in python, anything after # on a line is a comment, and is not interpreted by python. This allows you to write human readable comments explaining what you did and why.

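As a small sketch to go with the arithmetic above: the floor division (`//`) and modulo (`%`) operators are extras that the examples do not cover, and the variable names are only illustrative. It also shows why `^` should not be used for powers in Python.

```python
# Arithmetic operators; // and % are additions not shown in the cells above
true_division = 144 / 12   # "/" always returns a float: 12.0
floor_division = 7 // 2    # "//" discards the remainder: 3
remainder = 7 % 2          # "%" gives the remainder: 1
power = 2 ** 10            # "**" is exponentiation: 1024
not_a_power = 8 ^ 9        # "^" is bitwise XOR in Python, not a power: 1

print(true_division, floor_division, remainder, power, not_a_power)
```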
## Objects <a name="objects"></a>
You can store the result of a computation in Python as an object, so that you can use the result later in the script.

### Creating objects
In python we use `=` to assign the value on the right-hand side, to an object name on the left. To reuse the value stored in the object, we simply call the object. Note that python is case sensitive and that `a` is not the same as `A`!

In [118]:
a = 10 * 6
a

60

In [119]:
A = 2 + 6
A

8

The value of an object can be overwritten by simply assigning a different value to the same object name (no warning is given).

In [120]:
a = 0
a

0

### Naming objects
How you name your objects is up to you, but there are some rules and recommendations. Firstly, you cannot start an object name with a number or punctuation character, and you should avoid punctuation characters anywhere in the name, with the exception of underscores (`_`). Object names should be a reminder of what value the object stores, so `elisa_data` is a better name than `x` for the same values. When you want your object to have two words in its name, the most common approaches are to use underscores to separate the words (known as snake case), or to capitalise the first letter of each subsequent word (known as camel case). Which you use is up to you, just remember to be consistent, and that python names are case sensitive. Try the naming options below. Can you predict which will result in an error?

In [121]:
# 1object = 3
# !object = 3
# -object = 3
# object1 = 3
# object! = 3
# my.object = 3
# my_object = 3
# myObject = 3

If you want to remove an object, you can use the `del` statement, e.g. `del a` (the form `del(a)` also works).

## Data types <a name="classes"></a>
Python can handle a variety of types of data. There are a few types to be aware of, because python will treat these types differently in different circumstances. These are:

- integers (whole numbers)
- floats (decimal numbers)
- strings (can contain letters and special characters)
- Boolean (can only be `True` or `False`, note the difference in case to R)

Python is usually able to tell what type a particular value should be. Run the code below to see what type each of the values belongs to.

In [122]:
type(12)

int

In [123]:
type(12.6)

float

In [124]:
type('cytometry')

str

In [125]:
type(True)

bool

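To illustrate how Python treats types differently, here is a minimal sketch. The values and variable names are made up for illustration; the point is that the same operator can behave differently depending on the type of its inputs.

```python
number_sum = 2 + 2            # integers are added: 4
string_sum = '2' + '2'        # strings are joined together: '22'
mixed = 12 + 0.5              # an int plus a float gives a float: 12.5
converted = int('12') + 1     # convert a string to an integer before adding: 13
true_as_number = True + True  # Booleans behave like 1 and 0 in arithmetic: 2

for value in [number_sum, string_sum, mixed, converted, true_as_number]:
    print(value, type(value))
```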
So far we’ve only looked at data one value at a time, but we usually need a way of storing and manipulating large amounts of data at the same time. We can do this using python’s data structures.

## Data structures<a name="structures"></a>
When we have multiple values or a table of values to work with, we can store them in one of a few different data structures in python. The first of these is the list.

### Lists
Python lists are used to store multiple items in a single object and are similar to lists in R in that the elements of the list don't all have to be the same type. Here are some examples below:

In [126]:
[1, 2, 3, 'hello', False]

[1, 2, 3, 'hello', False]

Lists can even contain other lists.

In [127]:
[[1, 2, 3], 1, 2, 3, 'hello', False]

[[1, 2, 3], 1, 2, 3, 'hello', False]

Sometimes you have a list of values, but you only want to use a subset of them, not the whole list. We can subset lists in python using square brackets (`[]`). We simply put the index of the element we wish to extract inside the square brackets (ranges and multiple indices are covered below). Python starts indexing from 0, so to get the first element of a list called `lst` we would use `lst[0]`.

In [128]:
days = ['mon', 'tue', 'wed', 'thu', 'fri']
days

['mon', 'tue', 'wed', 'thu', 'fri']

In [129]:
days[0]

'mon'

We can use `print()` to display some text as output. By including `f` before our string, we are able to inject the values of objects into the middle of the text by placing them inside curly brackets. For this reason, this is called an "f-string".

In [130]:
print(f'I play sport on {days[1]} and {days[3]}.')

I play sport on tue and thu.


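We can extract a series of adjacent values using the `start:end` syntax. The element at `start` is included but the element at `end` is not, so `1:5` gives the elements at indices 1 to 4.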

In [131]:
days[1:5]

['tue', 'wed', 'thu', 'fri']

If we want to extract specific elements that are not all adjacent, we do:

In [132]:
[days[i] for i in [0, 2, 4]]

['mon', 'wed', 'fri']

which can be read as "give me the `i`th value of `days` for each `i` in the list `[0, 2, 4]`". 

To get the value that is second from the end, we can use the following (note the last value has index -1):

In [133]:
days[-2]

'thu'

We can subset a list within a list by first subsetting for the list element we want, then further subsetting this with another set of square brackets.

In [134]:
lst = [[1, 2, 3], 4, 5, 6]
lst[0][1]

2

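The slicing rules can be summarised in a short sketch, reusing the `days` list from above. The step value and the omitted start or end are additions not shown in the cells above, and the variable names are illustrative.

```python
days = ['mon', 'tue', 'wed', 'thu', 'fri']

first_two = days[0:2]    # indices 0 and 1; index 2 is excluded
from_wed = days[2:]      # omitting the end runs to the end of the list
up_to_wed = days[:3]     # omitting the start begins at index 0
every_other = days[::2]  # a third value sets the step
last_two = days[-2:]     # negative indices count from the end

print(first_two, from_wed, up_to_wed, every_other, last_two)
```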
### Dictionaries
Dictionaries are used to store data in key:value pairs, allowing you to retrieve each element by its key. We create dictionaries using curly brackets (`{}`).

In [135]:
dct = {
    'a' : [1, 2, 3],
    'b' : True,
    'c' : 'laser'
}

dct

{'a': [1, 2, 3], 'b': True, 'c': 'laser'}

Dictionaries can be subsetted using square brackets and the key of the element you want.

In [136]:
dct['a']

[1, 2, 3]

For multiple entries we do the same as we did for lists earlier.

In [137]:
[dct[key] for key in ['a', 'c']]

[[1, 2, 3], 'laser']

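Beyond retrieving values, a few other dictionary operations are often useful. This is a sketch only: the key `'d'`, its value, and the missing key `'z'` are hypothetical, and adding entries or using `.get()` is not covered in the text above.

```python
dct = {'a': [1, 2, 3], 'b': True, 'c': 'laser'}

dct['d'] = 488                 # assigning to a new key adds an entry
dct['b'] = False               # assigning to an existing key overwrites its value

has_c = 'c' in dct             # membership tests look at the keys
missing = dct.get('z', 'n/a')  # .get() returns a default instead of raising KeyError
keys = list(dct.keys())        # keys are kept in insertion order

print(keys, has_c, missing)
```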
### NumPy arrays
NumPy is a python *package* that provides us with the *NumPy array* data structure that is akin to a matrix (or multi-dimensional array) for holding data all of the same type. The Python you get out of the box has a lot of functionality, but people write extensions to Python that "package" up convenient functions and data structures for specific tasks. These extensions are called *packages* and can be installed freely on your computer. 

There are many ways to install Python packages (something that I found confusing coming from R), but a very basic way of doing this is to run `pip install <some package>` in your terminal/powershell (not in Python itself). If you run `pip install numpy` and have an internet connection, your computer should automatically download and install the NumPy package for you.

Before we can access all of the functions in the NumPy package, we first need to load it into our Python session. We do this using the `import` keyword, and we can give an alias to the package that will be convenient later (many packages have a standard abbreviation like this, `np` is used for NumPy but there's no rule to say it must be this).

In [138]:
import numpy as np

Now that we have the NumPy package loaded, we have access to the array data structure. In Python, when we want to use a function or a method that belongs to a particular package, we use the `package.function()` syntax. So to make a NumPy array, we use `np.array()` (see now why that alias is a good idea?).

In [139]:
np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [140]:
np.arange(start = 1, stop = 10, step = 1)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [141]:
np.array(['mon', 'tue', 'wed', 'thu', 'fri'])

array(['mon', 'tue', 'wed', 'thu', 'fri'], dtype='<U3')

To create a 2-dimensional array (i.e. a matrix) we define each new row as another list of values:

In [142]:
arr = np.array(
    [[5, 6, 7],
     [2, 3, 4],
     [-1, 0, 1]]
)

arr

array([[ 5,  6,  7],
       [ 2,  3,  4],
       [-1,  0,  1]])

We can perform arithmetic on arrays:

In [143]:
arr * 2

array([[10, 12, 14],
       [ 4,  6,  8],
       [-2,  0,  2]])

In [144]:
arr + arr / 2

array([[ 7.5,  9. , 10.5],
       [ 3. ,  4.5,  6. ],
       [-1.5,  0. ,  1.5]])

And we can subset arrays using square brackets where the last element is the column, the second last element is the row, and so on for n-dimensional arrays (we won't go further than 1 and 2-dimensions here). Hence, the block below extracts the value in the first row and third column.

In [145]:
arr[0, 2]

7

If we want everything in the first row, we can use `:` in place of specifying the columns and we'll get all of them.

In [146]:
arr[0, :]

array([5, 6, 7])

And similarly for all the rows and specific columns:

In [147]:
arr[:, [0, 2]]

array([[ 5,  7],
       [ 2,  4],
       [-1,  1]])

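A few more array operations can be sketched with the same `arr` as above. The Boolean mask and the `axis` argument are additions not covered in the text, and the variable names are illustrative.

```python
import numpy as np

arr = np.array([[5, 6, 7],
                [2, 3, 4],
                [-1, 0, 1]])

first_two_rows = arr[0:2, :]  # rows 0 and 1, all columns
positive = arr[arr > 0]       # Boolean mask: 1-dimensional array of values above 0
col_means = arr.mean(axis=0)  # mean down each column
row_means = arr.mean(axis=1)  # mean across each row

print(arr.shape, positive, col_means, row_means)
```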
### Pandas DataFrames
Pandas is a python package that provides us with the *Pandas DataFrame* data structure that is akin to a spreadsheet for holding tabular data where values within a column are all of the same type. Once again, Pandas can be installed on your machine by running `pip install pandas` in a terminal/powershell, then every time we want to use it we must import it.

In [148]:
import pandas as pd

To create a DataFrame we use `pd.DataFrame()` giving a dictionary of key:value pairs (where the key is the column name and the values are the data).

In [149]:
resp1 = np.random.normal(25, 5, 100)
resp2 = np.random.normal(23, 5, 100)

my_data = pd.DataFrame(
    {
        'id'       : range(200), # values from 0 to 199
        'group'    : (['Vehicle'] * 100) + (['Drug'] * 100),
        'response' : np.concatenate([resp1, resp2])  
    }
)

my_data

Unnamed: 0,id,group,response
0,0,Vehicle,24.034103
1,1,Vehicle,18.688120
2,2,Vehicle,31.061088
3,3,Vehicle,17.094718
4,4,Vehicle,22.472117
...,...,...,...
195,195,Drug,39.114174
196,196,Drug,25.152804
197,197,Drug,13.310465
198,198,Drug,18.652360

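Because the responses are drawn at random, the values in a table like this change every time the code is run. One way to make them reproducible is sketched below, assuming NumPy's `default_rng()` generator is acceptable in place of `np.random.normal()`. The seed `42` is arbitrary and `group_counts` is an illustrative name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed; the same seed gives the same draws
resp1 = rng.normal(25, 5, 100)
resp2 = rng.normal(23, 5, 100)

my_data = pd.DataFrame(
    {
        'id'       : range(200),  # values from 0 to 199
        'group'    : (['Vehicle'] * 100) + (['Drug'] * 100),
        'response' : np.concatenate([resp1, resp2])
    }
)

group_counts = my_data['group'].value_counts()  # number of rows per group
print(my_data.shape)
print(group_counts)
```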

### Summarising DataFrames
You can see that a DataFrame is shown in an intuitive, tabular format. We can quickly interrogate and summarise a DataFrame using several useful methods.

In [150]:
my_data.head()

Unnamed: 0,id,group,response
0,0,Vehicle,24.034103
1,1,Vehicle,18.68812
2,2,Vehicle,31.061088
3,3,Vehicle,17.094718
4,4,Vehicle,22.472117


In [151]:
my_data.head(12)

Unnamed: 0,id,group,response
0,0,Vehicle,24.034103
1,1,Vehicle,18.68812
2,2,Vehicle,31.061088
3,3,Vehicle,17.094718
4,4,Vehicle,22.472117
5,5,Vehicle,32.074031
6,6,Vehicle,25.647312
7,7,Vehicle,11.715832
8,8,Vehicle,26.574576
9,9,Vehicle,22.725399


In [152]:
my_data.tail()

Unnamed: 0,id,group,response
195,195,Drug,39.114174
196,196,Drug,25.152804
197,197,Drug,13.310465
198,198,Drug,18.65236
199,199,Drug,15.827187


Note that when we use the `object.something()` syntax, we are applying a *method* to our object. A method is a function that is specific to the class of object we are applying it to. Below we extract the dimensions of our DataFrame using `my_data.shape`. Notice that there are no parentheses. This is because `shape` is an *attribute* of `my_data`.

In [153]:
my_data.shape

(200, 3)

We can use the `info()` method to get information about the type of each column, and the `describe()` method to get summary statistics for numeric columns. The `columns` attribute returns the column names as a Pandas `Index` (use `list(my_data.columns)` if you need a plain list).

In [154]:
my_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        200 non-null    int64  
 1   group     200 non-null    object 
 2   response  200 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 4.8+ KB


In [155]:
my_data.describe()

Unnamed: 0,id,response
count,200.0,200.0
mean,99.5,23.259946
std,57.879185,5.67659
min,0.0,7.056308
25%,49.75,19.71016
50%,99.5,23.192287
75%,149.25,27.036916
max,199.0,40.434016


In [156]:
my_data.columns

Index(['id', 'group', 'response'], dtype='object')

### Subsetting DataFrames by position
If we wish to extract particular rows and/or columns from a DataFrame, we can do so using the `iloc[]` indexer (noting that it uses square brackets instead of round ones). If we have a DataFrame called `df`, then `df.iloc[x, y]` will subset row `x` and column `y` of the DataFrame. Just like for NumPy arrays, we can use single values, lists of values, or `start:end` ranges to subset rows and/or columns. If we want all the rows or all the columns, we simply use `:`.

Note that as the columns of a DataFrame must have names, you can subset by index or by name. To subset by name we use square brackets containing the name of the column (or a list of columns) we want. In addition, notice that `df.column` is a shorthand for `df['column']`.

In [157]:
my_data.iloc[0:3, :]

Unnamed: 0,id,group,response
0,0,Vehicle,24.034103
1,1,Vehicle,18.68812
2,2,Vehicle,31.061088


In [158]:
my_data.iloc[[0, 2, 5, 10], [0, 2]]

Unnamed: 0,id,response
0,0,24.034103
2,2,31.061088
5,5,32.074031
10,10,31.656789


In [159]:
my_data['id']

0        0
1        1
2        2
3        3
4        4
      ... 
195    195
196    196
197    197
198    198
199    199
Name: id, Length: 200, dtype: int64

In [160]:
my_data.id

0        0
1        1
2        2
3        3
4        4
      ... 
195    195
196    196
197    197
198    198
199    199
Name: id, Length: 200, dtype: int64

In [161]:
cols = ['group', 'response']
my_data[cols]

Unnamed: 0,group,response
0,Vehicle,24.034103
1,Vehicle,18.688120
2,Vehicle,31.061088
3,Vehicle,17.094718
4,Vehicle,22.472117
...,...,...
195,Drug,39.114174
196,Drug,25.152804
197,Drug,13.310465
198,Drug,18.652360


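The different ways of subsetting can be compared side by side in a short sketch. The small DataFrame `df` and its values are hypothetical. Label-based slicing with `loc[]` is included for contrast because, unlike `iloc[]`, it includes the end label.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'group': ['Vehicle', 'Vehicle', 'Drug', 'Drug'],
    'response': [24.0, 18.7, 31.1, 17.1],
})

first_rows = df.iloc[0:2, :]                 # rows at positions 0 and 1 (end excluded)
one_value = df.iloc[2, 2]                    # third row, third column
by_name = df[['group', 'response']]          # a list of column names gives a DataFrame
same_column = df['id'].equals(df.id)         # both forms return the same column
by_label = df.loc[0:2, ['id', 'response']]   # loc slices by label and includes the end

print(first_rows)
print(one_value, same_column)
print(by_label)
```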
### Subsetting DataFrames by value
Sometimes, rather than subsetting DataFrames for specific row indices, we might want to filter just the rows that meet certain criteria. We can do this by creating a Boolean *series* (a single column of `True` and `False` values), then use this series to subset the rows that match this criterion, with the `loc[]` method.

In [162]:
responders = my_data.response > 26
responders

0      False
1      False
2       True
3      False
4      False
       ...  
195     True
196    False
197    False
198    False
199    False
Name: response, Length: 200, dtype: bool

In [163]:
my_data.loc[responders,:]

Unnamed: 0,id,group,response
2,2,Vehicle,31.061088
5,5,Vehicle,32.074031
8,8,Vehicle,26.574576
10,10,Vehicle,31.656789
11,11,Vehicle,26.306230
...,...,...,...
175,175,Drug,30.928453
181,181,Drug,26.486042
191,191,Drug,27.987504
193,193,Drug,27.133155


You can think of this process as lining up the rows of the data with the Boolean series, and keeping only the rows that have a value of `True` (indicating that row passes the criterion).

Storing a Boolean series as an object and then using this to subset is a clean way of doing this, but we can also do this “on the fly” by stating our criteria directly inside the square brackets. Note that the examples below use some of Python’s comparison operators, along with `&` and `|`, which combine Boolean series element by element (each criterion must be wrapped in parentheses):

- `==` means equal to
- `!=` means not equal to
- `&` means and
- `|` means or
- `<`, `>`, `<=`, `>=` mean smaller than, larger than, smaller than or equal to, and larger than or equal to

In [164]:
my_data.loc[(my_data.group == 'Vehicle') & (my_data.response <= 23), :]

Unnamed: 0,id,group,response
1,1,Vehicle,18.68812
3,3,Vehicle,17.094718
4,4,Vehicle,22.472117
7,7,Vehicle,11.715832
9,9,Vehicle,22.725399
13,13,Vehicle,22.551569
17,17,Vehicle,14.081019
18,18,Vehicle,22.132242
23,23,Vehicle,19.776522
28,28,Vehicle,21.853385


In [165]:
my_data.loc[(my_data.group == 'Vehicle') | (my_data.response >= 21), :]

Unnamed: 0,id,group,response
0,0,Vehicle,24.034103
1,1,Vehicle,18.688120
2,2,Vehicle,31.061088
3,3,Vehicle,17.094718
4,4,Vehicle,22.472117
...,...,...,...
190,190,Drug,25.263729
191,191,Drug,27.987504
193,193,Drug,27.133155
195,195,Drug,39.114174


In [166]:
my_data.loc[(my_data.group != 'Vehicle') | (my_data.response > 23), :]

Unnamed: 0,id,group,response
0,0,Vehicle,24.034103
2,2,Vehicle,31.061088
5,5,Vehicle,32.074031
6,6,Vehicle,25.647312
8,8,Vehicle,26.574576
...,...,...,...
195,195,Drug,39.114174
196,196,Drug,25.152804
197,197,Drug,13.310465
198,198,Drug,18.652360

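The same filtering logic can be checked by eye on a DataFrame that is small enough to read. In this sketch the DataFrame `df` and its values are hypothetical, and the `~` (not) operator is an addition that the text does not cover.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'group': ['Vehicle', 'Vehicle', 'Drug', 'Drug', 'Vehicle'],
    'response': [24.0, 18.7, 31.1, 17.1, 32.0],
})

# Each criterion is wrapped in parentheses before combining with & or |
low_vehicle = df.loc[(df['group'] == 'Vehicle') & (df['response'] <= 23), 'id'].tolist()
drug_or_high = df.loc[(df['group'] == 'Drug') | (df['response'] > 30), 'id'].tolist()
not_vehicle = df.loc[~(df['group'] == 'Vehicle'), 'id'].tolist()  # ~ inverts a Boolean series

print(low_vehicle, drug_or_high, not_vehicle)
```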

It’s also easy to add new columns to a DataFrame. We simply use `df[<new-column>]` to assign values to a column of our DataFrame that does not yet exist.

In [167]:
age = np.random.normal(40, 20, 200).astype('int')

age[age < 0] = 0 # set any negative ages to 0

my_data['age'] = age
my_data.head()

Unnamed: 0,id,group,response,age
0,0,Vehicle,24.034103,46
1,1,Vehicle,18.68812,39
2,2,Vehicle,31.061088,59
3,3,Vehicle,17.094718,25
4,4,Vehicle,22.472117,7


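When replacing values in an array, it helps to know that reassigning a loop variable does not change the array it came from. The sketch below uses a hypothetical `ages` array to contrast a loop with two vectorised alternatives, Boolean-mask assignment and `np.clip()`.

```python
import numpy as np

ages = np.array([46, -3, 59, 25, -12])  # hypothetical values

looped = ages.copy()
for person in looped:
    if person < 0:
        person = 0  # only rebinds the name `person`; `looped` is unchanged

masked = ages.copy()
masked[masked < 0] = 0  # Boolean-mask assignment changes the array in place

clipped = np.clip(ages, 0, None)  # returns a new array with values below 0 raised to 0

print(looped, masked, clipped)
```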
## Functions and methods
The main way you apply complex operations to data in Python is by using functions and methods (a method is just a function that belongs to a particular object class). A function or method takes some input, and outputs or *returns* the result of some computation. You know something is a function or method because you use it by calling its name, followed by a pair of round brackets (e.g. `mean()`). Functions from a package are called by giving the name of the package, followed by `.`, followed by the function name (e.g. `np.median()`). Methods are called on an object by giving the object, followed by `.`, followed by the method name (e.g. `my_array.mean()`). Any additional inputs, known as arguments, go inside the brackets.

### Using functions/methods
Let's explore some important examples below.

Note that some functions/methods return a single value only, while others return a value for each element of the input.

In [168]:
my_values = np.array(range(100))
my_values

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [169]:
my_values.mean()

49.5

In [170]:
np.median(my_values) # it's weird median isn't a method!

49.5

In [171]:
my_values.min()

0

In [172]:
my_values.max()

99

In [173]:
my_values.sum()

4950

In [174]:
my_values.std()

28.86607004772212

In [175]:
my_values.__class__ # special "built-in" methods and attributes are named __<something>__

numpy.ndarray

In [176]:
my_values.size # actually an attribute not a method, but useful nonetheless

100

In [177]:
np.log(my_values)

array([      -inf, 0.        , 0.69314718, 1.09861229, 1.38629436,
       1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458,
       2.30258509, 2.39789527, 2.48490665, 2.56494936, 2.63905733,
       2.7080502 , 2.77258872, 2.83321334, 2.89037176, 2.94443898,
       2.99573227, 3.04452244, 3.09104245, 3.13549422, 3.17805383,
       3.21887582, 3.25809654, 3.29583687, 3.33220451, 3.36729583,
       3.40119738, 3.4339872 , 3.4657359 , 3.49650756, 3.52636052,
       3.55534806, 3.58351894, 3.61091791, 3.63758616, 3.66356165,
       3.68887945, 3.71357207, 3.73766962, 3.76120012, 3.78418963,
       3.80666249, 3.8286414 , 3.8501476 , 3.87120101, 3.8918203 ,
       3.91202301, 3.93182563, 3.95124372, 3.97029191, 3.98898405,
       4.00733319, 4.02535169, 4.04305127, 4.06044301, 4.07753744,
       4.09434456, 4.11087386, 4.12713439, 4.14313473, 4.15888308,
       4.17438727, 4.18965474, 4.20469262, 4.21950771, 4.2341065 ,
       4.24849524, 4.26267988, 4.27666612, 4.29045944, 4.30406

In [178]:
np.log10(my_values)

array([      -inf, 0.        , 0.30103   , 0.47712125, 0.60205999,
       0.69897   , 0.77815125, 0.84509804, 0.90308999, 0.95424251,
       1.        , 1.04139269, 1.07918125, 1.11394335, 1.14612804,
       1.17609126, 1.20411998, 1.23044892, 1.25527251, 1.2787536 ,
       1.30103   , 1.32221929, 1.34242268, 1.36172784, 1.38021124,
       1.39794001, 1.41497335, 1.43136376, 1.44715803, 1.462398  ,
       1.47712125, 1.49136169, 1.50514998, 1.51851394, 1.53147892,
       1.54406804, 1.5563025 , 1.56820172, 1.5797836 , 1.59106461,
       1.60205999, 1.61278386, 1.62324929, 1.63346846, 1.64345268,
       1.65321251, 1.66275783, 1.67209786, 1.68124124, 1.69019608,
       1.69897   , 1.70757018, 1.71600334, 1.72427587, 1.73239376,
       1.74036269, 1.74818803, 1.75587486, 1.76342799, 1.77085201,
       1.77815125, 1.78532984, 1.79239169, 1.79934055, 1.80617997,
       1.81291336, 1.81954394, 1.8260748 , 1.83250891, 1.83884909,
       1.84509804, 1.85125835, 1.8573325 , 1.86332286, 1.86923

The output of a function/method can be stored in a new object and called.

In [179]:
my_sqrt = np.sqrt(my_values)
my_sqrt

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ,
       3.16227766, 3.31662479, 3.46410162, 3.60555128, 3.74165739,
       3.87298335, 4.        , 4.12310563, 4.24264069, 4.35889894,
       4.47213595, 4.58257569, 4.69041576, 4.79583152, 4.89897949,
       5.        , 5.09901951, 5.19615242, 5.29150262, 5.38516481,
       5.47722558, 5.56776436, 5.65685425, 5.74456265, 5.83095189,
       5.91607978, 6.        , 6.08276253, 6.164414  , 6.244998  ,
       6.32455532, 6.40312424, 6.4807407 , 6.55743852, 6.63324958,
       6.70820393, 6.78232998, 6.8556546 , 6.92820323, 7.        ,
       7.07106781, 7.14142843, 7.21110255, 7.28010989, 7.34846923,
       7.41619849, 7.48331477, 7.54983444, 7.61577311, 7.68114575,
       7.74596669, 7.81024968, 7.87400787, 7.93725393, 8.        ,
       8.06225775, 8.1240384 , 8.18535277, 8.24621125, 8.30662386,
       8.36660027, 8.42614977, 8.48528137, 8.54400375, 8.60232

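The difference between aggregating and element-wise functions can be seen in a short sketch, which reuses `my_values` from above. The `positive` subset is an addition, assuming we want to leave out zero so that the logarithm stays finite.

```python
import numpy as np

my_values = np.arange(100)  # the integers 0 to 99

total = my_values.sum()     # aggregating method: returns a single value
roots = np.sqrt(my_values)  # element-wise function: returns one value per element

positive = my_values[my_values > 0]  # drop 0, whose logarithm is -inf
logs = np.log(positive)              # 99 finite values

print(total, roots.shape, logs.shape)
```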
### Getting help on functions and methods
If you’re using a function or method for the first time and want to know how to use it, type its name preceded by `?`. This works in IPython and Jupyter notebooks; in any Python session you can use `help()` instead, e.g. `help(np.random.normal)`. In the example below, I didn’t know how to use the `np.random.normal()` function. By executing `?np.random.normal` (without the brackets), the instructions are shown in the form of a *docstring*. This shows me that this function draws random numbers from a normal distribution. It also shows me the arguments to the function. Arguments without a default value are mandatory and we must supply them. The arguments we give to a function are matched by position, or by name. It’s common to leave the first argument unnamed, and name subsequent arguments, but you can name all of them if you wish.

In [180]:
?np.random.normal

Docstring:
normal(loc=0.0, scale=1.0, size=None)

Draw random samples from a normal (Gaussian) distribution.

The probability density function of the normal distribution, first
derived by De Moivre and 200 years later by both Gauss and Laplace
independently [2]_, is often called the bell curve because of
its characteristic shape (see the example below).

The normal distributions occurs often in nature.  For example, it
describes the commonly occurring distribution of samples influenced
by a large number of tiny, random disturbances, each with its own
unique distribution [2]_.

.. note::
    New code should use the `~numpy.random.Generator.normal`
    method of a `~numpy.random.Generator` instance instead;
    please see the :ref:`random-quick-start`.

Parameters
----------
loc : float or array_like of floats
    Mean ("centre") of the distribution.
scale : float or array_like of floats
    Standard deviation (spread or "width") of the distribution. Must be
    non-negative.

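To see that arguments are matched either by position or by name, here is a sketch using the `Generator.normal()` method that the docstring recommends for new code. The seed `1` is arbitrary; reusing it makes the three calls draw the same numbers, so they can be compared directly.

```python
import numpy as np

# Generators created with the same seed produce the same random numbers
by_position = np.random.default_rng(1).normal(25, 5, 3)
by_name = np.random.default_rng(1).normal(loc=25, scale=5, size=3)
mixed = np.random.default_rng(1).normal(25, scale=5, size=3)

print(np.array_equal(by_position, by_name))  # True: same arguments, different style
print(np.array_equal(by_position, mixed))    # True
```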

### Creating your own functions
Why always rely on functions other people have written when you can easily write your own? To create your own function, simply use the `def` keyword, followed by the name of our new function (with brackets containing any arguments we wish to use), and a `:`. Then, the *body* of the function starts on a new line and is indented by 4 spaces. Once defined, the function can then be called just like any other.

Let's make our own `mean()` function. We want our function to take a single argument `x` (a 1-dimensional NumPy array), calculate the mean of its values, and then return that value using the `return` keyword (this is the value that will be output by the function).

In [181]:
def my_mean(x):
    mean = x.sum() / x.size
    return mean

my_mean(my_values)

49.5

We can include optional arguments by simply giving them a default value. This next example uses a Boolean argument `verbose` to control whether the function prints a message, using an `if` statement. Note that objects defined inside a function only exist within the function and aren't created as objects in the environment.

In [182]:
def my_mean(x, verbose = False):
    total = x.sum() # named `total` to avoid hiding Python's built-in sum()
    size = x.size
    mean = total / size
    if verbose:
        print(f'The mean is {total}/{size} = {mean}.')
    return mean

Then if we call the function without specifying the argument `verbose`, it defaults to `False` and we get the same output as before. But if we specify `verbose = True`, a message is printed in addition to the value being returned (only the returned value is stored when we assign the output to an object).

In [183]:
my_mean(my_values)

49.5

In [184]:
avg = my_mean(my_values, verbose = True)

The mean is 4950/100 = 49.5.


In [185]:
avg

49.5

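As a further sketch of a function with an optional argument, here is a hypothetical `my_sd()` function (not part of the tutorial) that calculates a standard deviation. Its triple-quoted docstring is what `help(my_sd)` would display.

```python
import numpy as np

def my_sd(x, sample=False):
    """Return the standard deviation of a 1-dimensional NumPy array.

    With sample=True the sum of squares is divided by n - 1 instead of n.
    """
    deviations = x - x.mean()
    divisor = x.size - 1 if sample else x.size
    return np.sqrt((deviations ** 2).sum() / divisor)

my_values = np.arange(100)
population_sd = my_sd(my_values)           # the default, sample=False, is used
sample_sd = my_sd(my_values, sample=True)  # the optional argument is given by name

print(population_sd, sample_sd)
```

With the default argument the result should agree with `my_values.std()`, and with `sample=True` it should agree with `my_values.std(ddof=1)`.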
### Applying a function to each element of a list or dictionary
Sometimes we may have data stored in a list or a dictionary, and we want to apply a function to each element separately, and return a new list. In the example below, we create a list of NumPy arrays (`np.arange()` just creates an array of a sequence of integers) called `arr_list`. We want to create a new list whose elements are the means of the arrays in `arr_list`. 

We start by creating a list of the same length as our result will be, filling it with just zeroes for now. We then use the keyword `for` to start a loop, that takes the form `for <something> in <somethings>:`, and place the body of the loop on subsequent lines, indented by 4 spaces. Here, we use the `enumerate()` function that returns both the index (position) and value of each element it iterates over, giving us access to both in the loop. This is why we use `for ind, value in enumerate(arr_list):`, `ind` will be a placeholder for the index, and `value` will be a placeholder for the value. We then update the placeholder list we created, indexing its position using `ind`, with the mean of the current value.

In [186]:
arr_list = [np.arange(1, 6), np.arange(1, 11), np.arange(5, 11)]
arr_list

[array([1, 2, 3, 4, 5]),
 array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
 array([ 5,  6,  7,  8,  9, 10])]

In [187]:
means_list = [0]*3

for ind, value in enumerate(arr_list):
    means_list[ind] = value.mean()

means_list

[3.0, 5.5, 7.5]

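A more compact alternative to the loop is a comprehension, using the same syntax seen earlier for extracting several list elements. In the sketch below the dictionary `arr_dict` and its keys are hypothetical, and `float()` is used only so the results print as plain Python numbers.

```python
import numpy as np

arr_list = [np.arange(1, 6), np.arange(1, 11), np.arange(5, 11)]

# List comprehension: one mean per array, with no placeholder list needed
means_list = [float(arr.mean()) for arr in arr_list]

# Dictionary comprehension: .items() yields each key together with its value
arr_dict = {'short': np.arange(1, 6), 'long': np.arange(1, 11)}
means_dict = {key: float(arr.mean()) for key, arr in arr_dict.items()}

print(means_list)
print(means_dict)
```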
## Reading data into python
While you can create DataFrames by hand using a dictionary, we will often get our data from external sources (instruments, collaborators) as a spreadsheet, .txt, or .csv file. Reading this data into python is straightforward using `read_csv()` from the Pandas package, and is probably the most common way you will create DataFrames for real projects.

We’re going to read the file `Pokemon.csv`, which should be in the `data` directory inside your working directory.

In [188]:
pokemon = pd.read_csv('data/Pokemon.csv')
pokemon

Unnamed: 0,Nat,Pokemon,HP,Atk,Def,SA,SD,Spd,Total,Type.I,Type.II,Gender,Evolves.From,Evolves.Into,Captive
0,1,Bulbasaur,45,49,49,65,65,45,318,Grass,Poison,M (87.5%),--,Ivysaur,Captive
1,2,Ivysaur,60,62,63,80,80,60,405,Grass,Poison,M (87.5%),Bulbasaur,Venusaur,Wild
2,3,Venusaur,80,82,83,100,100,80,525,Grass,Poison,M (87.5%),Ivysaur,--,Captive
3,4,Charmander,39,52,43,60,50,65,309,Fire,,M (87.5%),--,Charmeleon,Captive
4,5,Charmeleon,58,64,58,80,65,80,405,Fire,,M (87.5%),Charmander,Charizard,Captive
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
246,247,Pupitar,70,84,70,65,70,51,410,Rock,Ground,50/50,Larvitar,Tyranitar,Captive
247,248,Tyranitar,100,134,110,95,100,61,600,Rock,Dark,50/50,Pupitar,--,Captive
248,249,Lugia,106,90,130,90,154,110,680,Psychic,Flying,,--,--,Wild
249,250,Ho-oh,106,130,90,110,154,90,680,Fire,Flying,,--,--,Wild

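If the file is not to hand, `read_csv()` can be tried on a few lines of text held in memory. This sketch assumes a cut-down version of the table (three rows and six columns copied from the output above). The `na_values` argument is an optional extra, used here on the assumption that `--` should be treated as missing.

```python
import io
import pandas as pd

csv_text = """Nat,Pokemon,HP,Type.I,Type.II,Evolves.From
1,Bulbasaur,45,Grass,Poison,--
2,Ivysaur,60,Grass,Poison,Bulbasaur
4,Charmander,39,Fire,,--
"""

# io.StringIO lets read_csv() treat a string as if it were a file
small = pd.read_csv(io.StringIO(csv_text), na_values=['--'])

n_missing_from = small['Evolves.From'].isna().sum()  # '--' entries read as missing
n_missing_type = small['Type.II'].isna().sum()       # empty fields are missing too

print(small)
print(n_missing_from, n_missing_type)
```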

In [189]:
print(
    (f'The dataset has {pokemon.shape[0]} pokemon and {pokemon.shape[1]}' 
     ' variables. Here is a table of summary stats for the numeric variables:')
     )

pokemon.describe()


The dataset has 251 pokemon and 15 variables. Here is a table of summary stats for the numeric variables:


Unnamed: 0,Nat,HP,Atk,Def,SA,SD,Spd,Total
count,251.0,251.0,251.0,251.0,251.0,251.0,251.0,251.0
mean,126.0,66.729084,70.621514,68.609562,65.828685,68.258964,65.89243,405.940239
std,72.601653,29.820233,27.169324,30.401167,27.199532,27.441732,27.03909,104.37954
min,1.0,10.0,5.0,5.0,10.0,20.0,5.0,180.0
25%,63.5,49.0,50.0,49.5,44.5,50.0,45.0,320.0
50%,126.0,65.0,70.0,65.0,65.0,65.0,65.0,410.0
75%,188.5,80.0,87.5,85.0,85.0,85.0,85.0,490.0
max,251.0,255.0,134.0,230.0,154.0,230.0,140.0,680.0


We can summarise our DataFrame by groups if we wish, by combining the `groupby()` and `agg()` (for "aggregate") methods. We give `groupby()` a string or a list of strings indicating the column(s) we wish to group by. Then we provide `agg()` a dictionary of summary statistics we want, where the key is the name of the column we want to summarise, and the value is the function (or list of functions) we wish to calculate for that column. The example below returns the median of the `Atk` variable for each value of the `Type.I` variable.

In [190]:
pokemon.groupby('Type.I').agg({'Atk' : 'median'})

Unnamed: 0_level_0,Atk
Type.I,Unnamed: 1_level_1
Bug,65.0
Dark,85.0
Dragon,84.0
Electric,60.0
Fighting,100.0
Fire,80.0
Ghost,55.0
Grass,62.0
Ground,80.0
Ice,52.5


We do the same below but add an additional grouping variable.

In [191]:
pokemon.groupby(['Type.I', 'Type.II']).agg({'Atk' : 'median'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Atk
Type.I,Type.II,Unnamed: 2_level_1
Bug,Fighting,125.0
Bug,Flying,45.0
Bug,Grass,82.5
Bug,Poison,60.0
Bug,Rock,10.0
Bug,Steel,110.0
Dark,Fire,75.0
Dark,Flying,85.0
Dark,Ice,95.0
Dragon,Flying,134.0


We can return as many summary statistics per column as we like by giving them as a list. We can define our own functions too. Below we define a function to return the interquartile range of an array and use this with the `agg()` method. Notice that built-in aggregations such as `'median'` are given as strings, whereas custom functions are passed as the function object itself.

In [192]:
def iqr(x):
    first = np.percentile(x, 25)
    third = np.percentile(x, 75)
    return third - first

pokemon.groupby('Type.I').agg({'Atk' : ['median', iqr], 'Def' : ['median', iqr]})

Unnamed: 0_level_0,Atk,Atk,Def,Def
Unnamed: 0_level_1,median,iqr,median,iqr
Type.I,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Bug,65.0,55.0,55.0,33.75
Dark,85.0,25.0,50.0,13.0
Dragon,84.0,35.0,65.0,25.0
Electric,60.0,34.0,57.0,27.5
Fighting,100.0,25.0,60.0,29.0
Fire,80.0,39.0,59.0,34.0
Ghost,55.0,15.0,52.5,18.75
Grass,62.0,31.0,65.0,33.0
Ground,80.0,25.0,95.0,37.5
Ice,52.5,27.5,42.5,35.0
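
When several functions are requested per column, the result has two levels of column names: the original column, then the statistic. A single summary column can be selected with a tuple of both names. This is a minimal sketch on made-up numbers, chosen so the result is easy to check by hand; `toy` is invented for illustration.

```python
import numpy as np
import pandas as pd

def iqr(x):
    # interquartile range: 75th percentile minus 25th percentile
    return np.percentile(x, 75) - np.percentile(x, 25)

# hypothetical toy data, invented for this sketch
toy = pd.DataFrame({
    'Type': ['Fire'] * 4 + ['Water'] * 4,
    'Atk': [10, 20, 30, 40, 1, 2, 3, 4],
})

summary = toy.groupby('Type').agg({'Atk': ['median', iqr]})

# select one statistic using (column name, statistic name)
print(summary[('Atk', 'iqr')])  # Fire 15.0, Water 1.5
```

The custom function's column is labelled with the name of the function, so giving functions short, descriptive names keeps the output readable.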


We can add new columns by performing operations on existing columns.

In [193]:
pokemon['Atk_per_Def'] = pokemon['Atk'] / pokemon['Def']
pokemon

Unnamed: 0,Nat,Pokemon,HP,Atk,Def,SA,SD,Spd,Total,Type.I,Type.II,Gender,Evolves.From,Evolves.Into,Captive,Atk_per_Def
0,1,Bulbasaur,45,49,49,65,65,45,318,Grass,Poison,M (87.5%),--,Ivysaur,Captive,1.000000
1,2,Ivysaur,60,62,63,80,80,60,405,Grass,Poison,M (87.5%),Bulbasaur,Venusaur,Wild,0.984127
2,3,Venusaur,80,82,83,100,100,80,525,Grass,Poison,M (87.5%),Ivysaur,--,Captive,0.987952
3,4,Charmander,39,52,43,60,50,65,309,Fire,,M (87.5%),--,Charmeleon,Captive,1.209302
4,5,Charmeleon,58,64,58,80,65,80,405,Fire,,M (87.5%),Charmander,Charizard,Captive,1.103448
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
246,247,Pupitar,70,84,70,65,70,51,410,Rock,Ground,50/50,Larvitar,Tyranitar,Captive,1.200000
247,248,Tyranitar,100,134,110,95,100,61,600,Rock,Dark,50/50,Pupitar,--,Captive,1.218182
248,249,Lugia,106,90,130,90,154,110,680,Psychic,Flying,,--,--,Wild,0.692308
249,250,Ho-oh,106,130,90,110,154,90,680,Fire,Flying,,--,--,Wild,1.444444
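
The division above is carried out row by row, and assigning to a new column name changes the DataFrame in place. If we would rather leave the original untouched, the `assign()` method returns a copy with the extra column instead. The sketch below uses made-up data; `toy` and `Atk_minus_Def` are names invented for illustration.

```python
import pandas as pd

# hypothetical toy data, invented for this sketch
toy = pd.DataFrame({'Atk': [49, 62, 82], 'Def': [49, 63, 82]})

# arithmetic on columns is element-wise: one result per row
toy['Atk_per_Def'] = toy['Atk'] / toy['Def']

# assign() returns a new DataFrame and leaves toy as it was
with_diff = toy.assign(Atk_minus_Def = toy['Atk'] - toy['Def'])

print(toy.columns.tolist())
print(with_diff.columns.tolist())
```

Which style to use is largely a matter of taste; `assign()` can be convenient when we want to keep the original DataFrame unchanged.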


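We can sort the DataFrame by one or more of its variables with the `sort_values()` method. Rows are sorted in ascending order by default; we can give `ascending = False` to sort in descending order instead. When we give a list of variables, subsequent variables are used to break ties. The three examples below sort by `Total`, by `Total` in descending order, and by `Total` with ties broken by `HP`.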

In [194]:
pokemon.sort_values('Total')

Unnamed: 0,Nat,Pokemon,HP,Atk,Def,SA,SD,Spd,Total,Type.I,Type.II,Gender,Evolves.From,Evolves.Into,Captive,Atk_per_Def
190,191,Sunkern,30,30,30,30,30,30,180,Grass,,50/50,--,Sunflora,Captive,1.000000
12,13,Weedle,40,35,30,20,20,50,195,Bug,Poison,50/50,--,Kakuna,Captive,1.166667
9,10,Caterpie,45,30,35,20,20,45,195,Bug,,50/50,--,Metapod,Wild,0.857143
128,129,Magikarp,20,10,55,15,20,80,200,Water,,50/50,--,Gyarados,Captive,0.181818
171,172,Pichu,20,40,15,35,35,60,205,Electric,,50/50,--,Pikachu,Wild,2.666667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148,149,Dragonite,91,134,95,100,100,80,600,Dragon,Flying,50/50,Dragonair,--,Captive,1.410526
250,251,Celebi,100,100,100,100,100,100,600,Psychic,Grass,,--,--,Captive,1.000000
249,250,Ho-oh,106,130,90,110,154,90,680,Fire,Flying,,--,--,Wild,1.444444
149,150,Mewtwo,106,110,90,154,90,130,680,Psychic,,,--,--,Captive,1.222222


In [195]:
pokemon.sort_values('Total', ascending = False)

Unnamed: 0,Nat,Pokemon,HP,Atk,Def,SA,SD,Spd,Total,Type.I,Type.II,Gender,Evolves.From,Evolves.Into,Captive,Atk_per_Def
249,250,Ho-oh,106,130,90,110,154,90,680,Fire,Flying,,--,--,Wild,1.444444
248,249,Lugia,106,90,130,90,154,110,680,Psychic,Flying,,--,--,Wild,0.692308
149,150,Mewtwo,106,110,90,154,90,130,680,Psychic,,,--,--,Captive,1.222222
250,251,Celebi,100,100,100,100,100,100,600,Psychic,Grass,,--,--,Captive,1.000000
247,248,Tyranitar,100,134,110,95,100,61,600,Rock,Dark,50/50,Pupitar,--,Captive,1.218182
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
171,172,Pichu,20,40,15,35,35,60,205,Electric,,50/50,--,Pikachu,Wild,2.666667
128,129,Magikarp,20,10,55,15,20,80,200,Water,,50/50,--,Gyarados,Captive,0.181818
12,13,Weedle,40,35,30,20,20,50,195,Bug,Poison,50/50,--,Kakuna,Captive,1.166667
9,10,Caterpie,45,30,35,20,20,45,195,Bug,,50/50,--,Metapod,Wild,0.857143


In [196]:
pokemon.sort_values(['Total', 'HP'])

Unnamed: 0,Nat,Pokemon,HP,Atk,Def,SA,SD,Spd,Total,Type.I,Type.II,Gender,Evolves.From,Evolves.Into,Captive,Atk_per_Def
190,191,Sunkern,30,30,30,30,30,30,180,Grass,,50/50,--,Sunflora,Captive,1.000000
12,13,Weedle,40,35,30,20,20,50,195,Bug,Poison,50/50,--,Kakuna,Captive,1.166667
9,10,Caterpie,45,30,35,20,20,45,195,Bug,,50/50,--,Metapod,Wild,0.857143
128,129,Magikarp,20,10,55,15,20,80,200,Water,,50/50,--,Gyarados,Captive,0.181818
171,172,Pichu,20,40,15,35,35,60,205,Electric,,50/50,--,Pikachu,Wild,2.666667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247,248,Tyranitar,100,134,110,95,100,61,600,Rock,Dark,50/50,Pupitar,--,Captive,1.218182
250,251,Celebi,100,100,100,100,100,100,600,Psychic,Grass,,--,--,Captive,1.000000
149,150,Mewtwo,106,110,90,154,90,130,680,Psychic,,,--,--,Captive,1.222222
248,249,Lugia,106,90,130,90,154,110,680,Psychic,Flying,,--,--,Wild,0.692308
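
The `ascending` argument can also take a list with one value per sorting variable, so that each variable has its own direction. Below is a minimal sketch on made-up data (`toy` and its `Name` column are invented for illustration) that sorts by `Total` from highest to lowest, breaking ties by `HP` from lowest to highest.

```python
import pandas as pd

# hypothetical toy data, invented for this sketch
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Total': [680, 195, 680, 195],
    'HP': [106, 45, 100, 40],
})

# Total descending, then HP ascending within tied Totals
ranked = toy.sort_values(['Total', 'HP'], ascending = [False, True])
print(ranked['Name'].tolist())  # ['C', 'A', 'D', 'B']
```

Note that `sort_values()` returns a sorted copy and does not change `toy` itself, so we store the result in a new object if we want to keep it.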
