In [None]:
# Introduction to python 

## Table of contents
1. [What is python?](#what_is_python)
2. [Using python for arithmetic](#arithmetic)
3. [Objects](#objects)
4. [Data types](#classes)
5. [Data structures](#structures)


## What is python? <a name="what_is_python"></a>
Python is a general-purpose programming language that is also excellent for data analysis. It lets you analyse and plot virtually any kind of data with a great deal of flexibility. Python has many similarities to the R statistical programming language, but there are important differences in syntax. Most people write python code in an *integrated development environment* (IDE), which is an application that gives you some additional tools to make programming easier. There are many IDEs you could use for python, but I would suggest either [Positron](https://github.com/posit-dev/positron/releases) or [VS Code](https://code.visualstudio.com/download), following the guidance each provides on how to set it up with python.

I'm using VS Code. I went to File > Open Folder... and selected the folder I cloned this repository into, so that my working directory is set to the cloned directory.


## Using python for arithmetic <a name="arithmetic"></a>
Let's start with some basic arithmetic.

In [None]:
2 + 2

In [None]:
7 - 8

In [None]:
4 * 6

In [None]:
144 / 12

Notice that raising one number to the power of another is written differently than in R: python uses `**` rather than `^`.

In [None]:
8 ** 9
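If you're coming from R, it's worth knowing that `^` is still a valid operator in python, it just does something else (a bitwise "exclusive or"), so using it by mistake won't raise an error. A minimal sketch of the difference, with variable names made up for illustration:

```python
# ** raises a number to a power
power = 2 ** 3   # 2 * 2 * 2

# ^ is bitwise XOR, not exponentiation: binary 10 XOR binary 11 gives binary 01
xor = 2 ^ 3

print(power, xor)
```

The two results are different, so it pays to double-check any formula you translate from R.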

In [None]:
2 + 2 * (2 ** 6) # this is a comment

Note that in python, anything after # on a line is a comment, and is not interpreted by python. This allows you to write human readable comments explaining what you did and why.

## Objects <a name="objects"></a>
You can store the result of some kind of computation in python as an object, so that you can call the result later in the script.

### Creating objects
In python we use `=` to assign the value on the right-hand side to an object name on the left. To reuse the value stored in the object, we simply call the object. Note that python is case sensitive and that `a` is not the same as `A`!

In [None]:
a = 10 * 6
a

In [None]:
A = 2 + 6
A

The value of an object can be overwritten by simply assigning a different value to the same object name (no warning is given).

In [None]:
a = 0
a

### Naming objects
How you name your objects is up to you, but there are some rules and recommendations. Firstly, you cannot start an object name with a number or punctuation character, and you should avoid punctuation characters anywhere in the name, with the exception of underscores (`_`). Object names should be a reminder of what value the object stores, so `elisa_data` is a better name than `x` for the same values. When you want your object to have two words in its name, the most common approaches are to use underscores to separate the words (known as snake case), or to capitalise the first letter of each subsequent word (known as camel case). Which you use is up to you, just remember to be consistent, and that python names are case sensitive. Try the naming options below. Can you predict which will result in an error?

In [None]:
# 1object = 3
# !object = 3
# -object = 3
# object1 = 3
# object! = 3
# my.object = 3
# my_object = 3
# myObject = 3

If you want to remove an object, you can use `del object`, e.g. `del a` (writing it with brackets, as `del(a)`, also works).
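To make the naming and removal rules concrete, here is a small sketch. The object names and values are made up for illustration; the point is that names differing only in case are separate objects, and that a removed object can no longer be called.

```python
elisa_data = [0.12, 0.48, 0.95]   # snake case
elisaData = [0.12, 0.48, 0.95]    # camel case

# the two names refer to two separate objects
same_object = elisa_data is elisaData

del elisa_data                    # remove the snake case object

# calling a removed object raises a NameError
try:
    elisa_data
    still_exists = True
except NameError:
    still_exists = False

print(same_object, still_exists)
```

The camel case object is unaffected by the `del`, because python treats it as a completely different name.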

## Data types <a name="classes"></a>
Python can handle a variety of types of data. There are a few types to be aware of, because python will treat these types differently in different circumstances. These are:

- integers (whole numbers)
- floats (decimal numbers)
- strings (can contain letters and special characters)
- Booleans (can only be `True` or `False`, note the difference in case to R)

Python is usually able to tell what type a particular value should be. Run the code below to see what class each of the values belongs to.

In [None]:
type(12)

In [None]:
type(12.6)

In [None]:
type('cytometry')

In [None]:
type(True)
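As a rough illustration of what "treated differently" means in practice, the sketch below applies the same operators to different types. The variable names are my own, chosen just for this example.

```python
int_sum = 2 + 2             # integers are added
str_sum = '2' + '2'         # strings are joined together
division = 144 / 12         # / returns a float, even for whole-number results
converted = int('12') + 1   # convert a string to an integer explicitly

print(int_sum, str_sum, division, converted)
print(type(division))
```

If a calculation gives a surprising result, checking the `type()` of the inputs is a good first step.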

So far we’ve only looked at data one value at a time, but we usually need a way of storing and manipulating large amounts of data at the same time. We can do this using python’s data structures.

## Data structures <a name="structures"></a>
When we have multiple values or a table of values to work with, we can store them in one of a few different data structures in python. The first of which is the list.

### Lists
Python lists are used to store multiple items in a single object and are similar to lists in R in that the elements of the list don't all have to be the same type. Here are some examples below:

In [None]:
[1, 2, 3, 'hello', False]

Lists can even contain other lists.

In [None]:
[[1, 2, 3], 1, 2, 3, 'hello', False]

Sometimes you have a list of values, but you only want to use a subset of them, not the whole list. We can subset lists in python using square brackets (`[]`). We simply put the index of the element we wish to extract inside the square brackets. Python starts indexing from 0, so to get the first element of a list called `lst` we would use `lst[0]`.

In [None]:
days = ['mon', 'tue', 'wed', 'thu', 'fri']
days

In [None]:
days[0]

We can use `print()` to display some text as output. By including `f` before our string, we are able to inject the values of objects in the middle of the text by placing them inside curly brackets. For this reason, this is called an "f-string".

In [None]:
print(f'I play sport on {days[1]} and {days[3]}.')

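We can extract a series of adjacent values using the `start:end` syntax. The element at `start` is included but the element at `end` is not, so `days[1:5]` returns the elements at indices 1 to 4.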

In [None]:
days[1:5]
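Slicing has a few variations that are easy to get wrong when coming from R, so here is a short sketch using the same `days` list (the names on the left are just for illustration):

```python
days = ['mon', 'tue', 'wed', 'thu', 'fri']

middle = days[1:3]        # indices 1 and 2; index 3 is not included
first_two = days[:2]      # leaving out the start means "from the beginning"
from_wed = days[2:]       # leaving out the end means "to the end"
every_other = days[::2]   # a third value sets the step size

print(middle, first_two, from_wed, every_other)
```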

If we want to extract specific elements that are not all adjacent, we do:

In [None]:
[days[i] for i in [0, 2, 4]]

which can be read as "give me the `i`th value of `days` for each `i` in the list `[0, 2, 4]`". 

To get the value that is second from the end, we can use the following (note the last value has index -1):

In [None]:
days[-2]

We can subset a list within a list by first subsetting for the list element we want, then further subsetting this with another set of square brackets.

In [None]:
lst = [[1, 2, 3], 4, 5, 6]
lst[0][1]

### Dictionaries
Dictionaries are used to store data in key:value pairs, allowing you to retrieve each element by its key. We create dictionaries using curly brackets (`{}`).

In [None]:
dct = {
    'a' : [1, 2, 3],
    'b' : True,
    'c' : 'laser'
}

dct

Dictionaries can be subsetted using square brackets and the key of the element you want.

In [None]:
dct['a']

For multiple entries we do the same as we did for lists earlier.

In [None]:
[dct[key] for key in ['a', 'c']]
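Dictionaries can also be changed after they are created. The sketch below rebuilds the same `dct` and shows a few common operations; the new key `'d'` and its value are made up for illustration.

```python
dct = {'a': [1, 2, 3], 'b': True, 'c': 'laser'}

dct['d'] = 488             # assigning to a new key adds an entry
dct['b'] = False           # assigning to an existing key overwrites its value

has_c = 'c' in dct         # check whether a key exists
keys = list(dct.keys())    # the keys, in the order they were added
missing = dct.get('z', 'not found')   # .get() returns a fallback instead of raising a KeyError

print(keys, has_c, missing)
```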

### NumPy arrays
NumPy is a python *package* that provides us with the *NumPy array* data structure that is akin to a matrix (or multi-dimensional array) for holding data all of the same type. The Python you get out of the box has a lot of functionality, but people write extensions to Python that "package" up convenient functions and data structures for specific tasks. These extensions are called *packages* and can be installed freely on your computer. 

There are many ways to install Python packages (something that I've found confusing coming as an R user), but a very basic way of doing this is to run `pip install <some package>` in your terminal/powershell (not in Python itself). If you run `pip install numpy` and have an internet connection, your computer should automatically download and install the NumPy package for you.

Before we can access all of the functions in the NumPy package, we first need to load it into our Python session. We do this using the `import` keyword, and we can give an alias to the package that will be convenient later (many packages have a standard abbreviation like this, `np` is used for NumPy but there's no rule to say it must be this).

In [None]:
import numpy as np

Now we have the NumPy package loaded, we have access to the array data structure. In Python, when we want to use a function or a method that belongs to a particular package, we use the `package.function()` syntax. So to make a NumPy array, we use `np.array()` (see now why that alias is a good idea?).

In [None]:
np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

In [None]:
np.arange(start = 1, stop = 10, step = 1) # stop is not included, so this gives 1 to 9

In [None]:
np.array(['mon', 'tue', 'wed', 'thu', 'fri'])

To create a 2-dimensional array (i.e. a matrix) we define each new row as another list of values:

In [None]:
arr = np.array(
    [[5, 6, 7],
     [2, 3, 4],
     [-1, 0, 1]]
)

arr

We can perform arithmetic on arrays:

In [None]:
arr * 2

In [None]:
arr + arr / 2
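Array arithmetic works element by element, which is not how plain python lists behave. A small sketch of the difference, using made-up values and names:

```python
import numpy as np

values = [1, 2, 3]
arr1d = np.array(values)

list_times_two = values * 2    # repeats the list
array_times_two = arr1d * 2    # multiplies each element by 2
array_sum = arr1d + arr1d      # adds the arrays element by element

print(list_times_two)
print(array_times_two)
print(array_sum)
```

The usual order of operations also applies to arrays, so in `arr + arr / 2` the division happens before the addition.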

And we can subset arrays using square brackets where the last element is the column, the second last element is the row, and so on for n-dimensional arrays (we won't go further than 1 and 2-dimensions here). Hence, the block below extracts the value in the first row and third column.

In [None]:
arr[0, 2]

If we want everything in the first row, we can use `:` in place of specifying the columns and we'll get all of them.

In [None]:
arr[0, :]

And similarly for all the rows and specific columns:

In [None]:
arr[:, [0, 2]]

### Pandas DataFrames
Pandas is a python package that provides us with the *Pandas DataFrame* data structure that is akin to a spreadsheet for holding tabular data where values within a column are all of the same type. Once again, Pandas can be installed on your machine by running `pip install pandas` in a terminal/powershell. Then, every time we want to use it, we must import it.

In [None]:
import pandas as pd

To create a DataFrame we use `pd.DataFrame()` giving a dictionary of key:value pairs (where the key is the column name and the values are the data).

In [None]:
resp1 = np.random.normal(25, 5, 100)
resp2 = np.random.normal(23, 5, 100)

my_data = pd.DataFrame(
    {
        'id'        : range(200), # values from 0 to 199
        'group'     : (['Vehicle'] * 100) + (['Drug'] * 100),
        'response' : np.concatenate([resp1, resp2])  
    }
)

my_data
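Because the responses are drawn at random, the table will contain different values every time the cell is run. If you want the same values each time, one option is to use a seeded random number generator. This is only a sketch: the seed value and the name `seeded_data` are arbitrary choices of mine, and it uses NumPy's `default_rng()` rather than `np.random.normal()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)   # arbitrary seed, any integer works

resp1 = rng.normal(loc=25, scale=5, size=100)   # mean 25, sd 5, 100 values
resp2 = rng.normal(loc=23, scale=5, size=100)   # mean 23, sd 5, 100 values

seeded_data = pd.DataFrame(
    {
        'id': range(200),
        'group': (['Vehicle'] * 100) + (['Drug'] * 100),
        'response': np.concatenate([resp1, resp2]),
    }
)

print(seeded_data.shape)
```

Re-running this block with the same seed should reproduce the same `response` column.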

### Summarising DataFrames
You can see that a DataFrame is shown in an intuitive, tabular format. We can quickly interrogate and summarise a DataFrame using several useful methods.

In [None]:
my_data.head()

In [None]:
my_data.head(12)

In [None]:
my_data.tail()

Note that when we use the `object.something()` syntax, we are applying a *method* to our object. A method is a function that is specific to the class of object we are applying it to. Below we extract the dimensions of our DataFrame using `my_data.shape`. Notice that there are no parentheses. This is because `shape` is an *attribute* of `my_data`.

In [None]:
my_data.shape

We can use the `info()` method to get information about the type of each column, and the `describe()` method to get summary statistics for numeric columns. The `columns` attribute returns the column names (as a pandas `Index` object rather than a plain list).

In [None]:
my_data.info()

In [None]:
my_data.describe()

In [None]:
my_data.columns

### Subsetting DataFrames by position
If we wish to extract particular rows and/or columns from a DataFrame, we can do so using the `iloc[]` method (noting that it uses square brackets instead of round ones). If we have a DataFrame called `df`, then `df.iloc[x, y]` will subset row `x` and column `y` of the DataFrame. Just like for NumPy arrays, we can either use single values or lists of values to subset several rows and/or columns. If we want all the rows or all the columns, we simply use `:`.

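Note that as the columns of a DataFrame must have names, you can subset by position or by name. To subset by name we use square brackets containing the name of the column we want, as a string. In addition, notice that `df.column` is a shorthand for `df['column']` (the shorthand only works when the column name is a valid python name that doesn't clash with an existing DataFrame attribute or method).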

In [None]:
my_data.iloc[0:3, :]

In [None]:
my_data.iloc[[0, 2, 5, 10], [0, 2]]

In [None]:
my_data['id']

In [None]:
my_data.id
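With 200 random rows it can be hard to check that a subset is what you expected, so here is a sketch on a tiny DataFrame with made-up values (`small` and the names on the left are just for illustration):

```python
import pandas as pd

small = pd.DataFrame(
    {
        'id': [0, 1, 2],
        'group': ['Vehicle', 'Vehicle', 'Drug'],
        'response': [24.1, 26.3, 22.8],
    }
)

by_position = small.iloc[0:2, [0, 2]]   # first two rows, columns 0 and 2
by_name = small[['id', 'response']]     # a list of names selects several columns
single_value = small.iloc[1, 2]         # row 1, column 2
same_column = small['group'].equals(small.group)   # both forms give the same column

print(by_position)
print(single_value, same_column)
```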

### Subsetting DataFrames by value
Sometimes, rather than subsetting DataFrames for specific row indices, we might want to filter just the rows that meet certain criteria. We can do this by creating a Boolean *series* (a single column of `True` and `False` values), then use this series to subset the rows that match this criterion, with the `loc[]` method.

In [None]:
responders = my_data.response > 26
responders

In [None]:
my_data.loc[responders,:]

You can think of this process as lining up the rows of the data with the Boolean series, and keeping only the rows that have a value of `True` (indicating that row passes the criterion).

Storing a Boolean series as an object and then using this to subset is a clean way of doing this, but we can also do this “on the fly” by stating our criteria directly inside the square brackets. Note that in the example below, we use some of Python’s comparison and logical operators:

- `==` means equal to
- `!=` means not equal to
- `&` means and
- `|` means or
- `<`, `>`, `<=`, `>=` mean smaller than, larger than, smaller than or equal to, and larger than or equal to

In [None]:
my_data.loc[(my_data.group == 'Vehicle') & (my_data.response <= 23), :]

In [None]:
my_data.loc[(my_data.group == 'Vehicle') | (my_data.response >= 21), :]

In [None]:
my_data.loc[(my_data.group != 'Vehicle') | (my_data.response > 23), :]
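To see exactly which rows each combination keeps, it can help to try the operators on a handful of known values. This is a sketch with made-up data; it also uses `~`, which negates a Boolean series. The round brackets around each criterion matter because `&` and `|` are evaluated before comparisons like `==` and `>`.

```python
import pandas as pd

small = pd.DataFrame(
    {
        'group': ['Vehicle', 'Vehicle', 'Drug', 'Drug'],
        'response': [20.0, 27.0, 22.0, 30.0],
    }
)

is_vehicle = small['group'] == 'Vehicle'   # True, True, False, False
is_high = small['response'] > 25           # False, True, False, True

both = small.loc[is_vehicle & is_high, :]          # rows meeting both criteria
either = small.loc[is_vehicle | is_high, :]        # rows meeting at least one criterion
neither = small.loc[~(is_vehicle | is_high), :]    # rows meeting neither criterion

print(both)
print(either)
print(neither)
```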

It’s also easy to add new columns to a DataFrame. We simply use `df[<new-column>]` to assign values to a column of our DataFrame that does not yet exist.

In [None]:
age = np.random.normal(40, 20, 200).astype('int')

for i, person in enumerate(age): # for every position and value in the age array
    if person < 0:               # if the value is less than 0
        age[i] = 0               # set that element of the array to 0

my_data['age'] = age
my_data.head()
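Negative ages can also be set to 0 without writing a loop, because NumPy can operate on the whole array at once. Two possible ways are sketched below on a small made-up array; `example_age` is a hypothetical stand-in for the `age` array.

```python
import numpy as np

example_age = np.array([35, -4, 61, -12, 18])   # made-up values

# option 1: np.clip() limits values to a range (here a minimum of 0 and no maximum)
clipped = np.clip(example_age, 0, None)

# option 2: use a Boolean mask to overwrite only the negative elements of a copy
masked = example_age.copy()
masked[masked < 0] = 0

print(clipped)
print(masked)
```

Both approaches give the same result; which reads better is largely a matter of taste.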

## Functions and methods
The main way you apply complex operations to data in Python is by using functions and methods (a method is just a function that belongs to a particular object class). A function or method takes some input, and outputs or *returns* the result of some computation. You know something is a function or method because you use it by calling its name, followed by a pair of round brackets (e.g. `mean()`). A function that comes from a package is called by giving the name of the package (or its alias), followed by `.`, followed by the function name (e.g. `np.median()`). Methods are called on an object by giving the object, followed by `.`, followed by the method name (e.g. `my_array.mean()`). Any additional inputs to the function or method, known as arguments, go inside the brackets.
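As a quick sketch of the difference between the two calling styles, and of how arguments are passed, consider the block below. The array `readings` and its values are made up for illustration.

```python
import numpy as np

readings = np.array([[1.26, 2.51], [3.14, 4.49]])   # made-up values

overall_mean = readings.mean()             # a method called on the object, no arguments
function_mean = np.mean(readings)          # the same computation as a package function
column_means = readings.mean(axis=0)       # a named argument: one mean per column
rounded = np.round(readings, decimals=1)   # first argument by position, second by name

print(overall_mean, function_mean)
print(column_means)
print(rounded)
```

Here `readings.mean()` and `np.mean(readings)` do the same job; many NumPy operations are available in both forms, but not all of them.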

### Using functions/methods
Let's explore some important examples below.

Note that some functions/methods return a single value only, while others return a value for each element of the input.

In [156]:
my_values = np.array(range(100))
my_values

In [157]:
my_values.mean()

49.5

In [160]:
np.median(my_values) # it's weird median isn't a method!

49.5

In [162]:
my_values.min()

0

In [163]:
my_values.max()

99

In [164]:
my_values.sum()

4950

In [165]:
my_values.std()

28.86607004772212
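One thing to watch if you're comparing against R: by default, the NumPy `std()` method divides by the number of values, whereas R's `sd()` divides by one less than that. The `ddof` argument controls this. A short sketch (the names `population_sd` and `sample_sd` are my own):

```python
import numpy as np

my_values = np.array(range(100))

population_sd = my_values.std()       # default ddof=0: divides by n
sample_sd = my_values.std(ddof=1)     # ddof=1: divides by n - 1, as R's sd() does

print(population_sd, sample_sd)
```

The difference is small for 100 values, but it becomes more noticeable with small samples.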

In [166]:
my_values.__class__ # special "built-in" methods and attributes use __<something>__

numpy.ndarray

In [169]:
my_values.size # actually an attribute not a method, but useful nonetheless

100

In [172]:
np.log(my_values)

array([      -inf, 0.        , 0.69314718, 1.09861229, 1.38629436,
       1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458,
       2.30258509, 2.39789527, 2.48490665, 2.56494936, 2.63905733,
       2.7080502 , 2.77258872, 2.83321334, 2.89037176, 2.94443898,
       2.99573227, 3.04452244, 3.09104245, 3.13549422, 3.17805383,
       3.21887582, 3.25809654, 3.29583687, 3.33220451, 3.36729583,
       3.40119738, 3.4339872 , 3.4657359 , 3.49650756, 3.52636052,
       3.55534806, 3.58351894, 3.61091791, 3.63758616, 3.66356165,
       3.68887945, 3.71357207, 3.73766962, 3.76120012, 3.78418963,
       3.80666249, 3.8286414 , 3.8501476 , 3.87120101, 3.8918203 ,
       3.91202301, 3.93182563, 3.95124372, 3.97029191, 3.98898405,
       4.00733319, 4.02535169, 4.04305127, 4.06044301, 4.07753744,
       4.09434456, 4.11087386, 4.12713439, 4.14313473, 4.15888308,
       4.17438727, 4.18965474, 4.20469262, 4.21950771, 4.2341065 ,
       4.24849524, 4.26267988, 4.27666612, 4.29045944, 4.30406

In [173]:
np.log10(my_values)

array([      -inf, 0.        , 0.30103   , 0.47712125, 0.60205999,
       0.69897   , 0.77815125, 0.84509804, 0.90308999, 0.95424251,
       1.        , 1.04139269, 1.07918125, 1.11394335, 1.14612804,
       1.17609126, 1.20411998, 1.23044892, 1.25527251, 1.2787536 ,
       1.30103   , 1.32221929, 1.34242268, 1.36172784, 1.38021124,
       1.39794001, 1.41497335, 1.43136376, 1.44715803, 1.462398  ,
       1.47712125, 1.49136169, 1.50514998, 1.51851394, 1.53147892,
       1.54406804, 1.5563025 , 1.56820172, 1.5797836 , 1.59106461,
       1.60205999, 1.61278386, 1.62324929, 1.63346846, 1.64345268,
       1.65321251, 1.66275783, 1.67209786, 1.68124124, 1.69019608,
       1.69897   , 1.70757018, 1.71600334, 1.72427587, 1.73239376,
       1.74036269, 1.74818803, 1.75587486, 1.76342799, 1.77085201,
       1.77815125, 1.78532984, 1.79239169, 1.79934055, 1.80617997,
       1.81291336, 1.81954394, 1.8260748 , 1.83250891, 1.83884909,
       1.84509804, 1.85125835, 1.8573325 , 1.86332286, 1.86923

The output of a function/method can be stored in a new object and called.

In [174]:
my_sqrt = np.sqrt(my_values)
my_sqrt

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ,
       3.16227766, 3.31662479, 3.46410162, 3.60555128, 3.74165739,
       3.87298335, 4.        , 4.12310563, 4.24264069, 4.35889894,
       4.47213595, 4.58257569, 4.69041576, 4.79583152, 4.89897949,
       5.        , 5.09901951, 5.19615242, 5.29150262, 5.38516481,
       5.47722558, 5.56776436, 5.65685425, 5.74456265, 5.83095189,
       5.91607978, 6.        , 6.08276253, 6.164414  , 6.244998  ,
       6.32455532, 6.40312424, 6.4807407 , 6.55743852, 6.63324958,
       6.70820393, 6.78232998, 6.8556546 , 6.92820323, 7.        ,
       7.07106781, 7.14142843, 7.21110255, 7.28010989, 7.34846923,
       7.41619849, 7.48331477, 7.54983444, 7.61577311, 7.68114575,
       7.74596669, 7.81024968, 7.87400787, 7.93725393, 8.        ,
       8.06225775, 8.1240384 , 8.18535277, 8.24621125, 8.30662386,
       8.36660027, 8.42614977, 8.48528137, 8.54400375, 8.60232

### Getting help on functions and methods
If you’re using a function or method for the first time and want to know how to use it, type the name of the function/method preceded by `?` (this works in Jupyter and IPython; in plain Python you can use `help()` instead). In the example below, I didn’t know how to use the `np.random.normal()` function. By executing `?np.random.normal` (without the brackets), the instructions are shown in the form of a *docstring*. This shows me that the function draws random numbers from a normal distribution. It also shows me the arguments to the function. Arguments without a default value are mandatory and we must supply them. The arguments we give to a function are matched by position, or by name. It’s common to leave the first argument unnamed, and name subsequent arguments, but you can name all of them if you wish.

In [177]:
?np.random.normal

Docstring:
normal(loc=0.0, scale=1.0, size=None)

Draw random samples from a normal (Gaussian) distribution.

The probability density function of the normal distribution, first
derived by De Moivre and 200 years later by both Gauss and Laplace
independently [2]_, is often called the bell curve because of
its characteristic shape (see the example below).

The normal distributions occurs often in nature.  For example, it
describes the commonly occurring distribution of samples influenced
by a large number of tiny, random disturbances, each with its own
unique distribution [2]_.

.. note::
    New code should use the `~numpy.random.Generator.normal`
    method of a `~numpy.random.Generator` instance instead;
    please see the :ref:`random-quick-start`.

Parameters
----------
loc : float or array_like of floats
    Mean ("centre") of the distribution.
scale : float or array_like of floats
    Standard deviation (spread or "width") of the distribution. Must be
    non-negative.
