## Introduction to Python for Data Analysis

### Module 1: DataFrames and Python
Duration: ~ 45'

In this module, we will explore:

- What a pandas DataFrame is
- How to read and export data in pandas
- How to perform basic operations with a pandas DataFrame, such as:
    - Sorting
    - Locating by index and position
- How to perform more complex operations, such as grouping and filtering


### What is pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. In particular, it offers data structures and operations for manipulating numerical tables and time series. 

Its core object is the *DataFrame*: a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. Its advantage over traditional arrays (such as the ones used by *numpy*) is that each column can hold data **of a different type**, e.g. one column can contain strings (names, occupations, marital status) while adjacent columns contain integers, floats, datetimes, etc.
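
As a minimal sketch of a table with mixed column types, the toy DataFrame below (hypothetical names and values, not taken from any real dataset) holds strings, integers, floats and datetimes side by side:

```python
import pandas as pd

# Toy data: every name and value here is made up for illustration
people = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],        # strings
        "age": [36, 45, 29],                      # integers
        "height_m": [1.65, 1.70, 1.80],           # floats
        "joined": pd.to_datetime(
            ["2020-01-15", "2021-06-01", "2022-03-30"]
        ),                                        # datetimes
    }
)

# Each column keeps its own data type
print(people.dtypes)
```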

It is built **on top** of numpy (the core numerical library of Python), which makes it highly efficient. Also, some core parts of pandas are written in Cython (a superset of the Python language that additionally supports calling C functions and declaring C types on variables and class attributes).

### Loading pandas

Once installed, you can import it e.g. using the alias `pd` as follows:

In [None]:
import pandas as pd

### Reading datasets with `pandas`

We are going to use the METABRIC dataset, an open source dataset which contains targeted sequencing, clinical and genomic data from patients with breast cancer. More information about the dataset can be found [here](https://ega-archive.org/studies/EGAS00000000083) and [here](https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric).

Pandas can import data from various file formats such as CSV, Excel, JSON and SQL. The reader functions all share the prefix `pd.read_`, e.g. `pd.read_csv()`, `pd.read_excel()`, `pd.read_json()` and `pd.read_sql()`.

To read a csv file, use the function `pd.read_csv()`:

In [None]:
metabric = pd.read_csv("../data/metabric_clinical_and_expression_data.csv")

If you forget to include `../data/` above, or if you include it but your copy of the file is saved somewhere else, you will get an error that ends with a line like this: 

`FileNotFoundError: File b'metabric_clinical_and_expression_data.csv' does not exist`

Generally, rows in a `DataFrame` are the **observations** (patients in the case of METABRIC) whereas columns are known as the observed **variables** (Cohort, Age_at_diagnosis ...). 

Looking at the column on the far left, you can see the row names of the DataFrame `metabric` assigned using the known 0-based indexing used in Python.

Note that `pd.read_csv()` is not limited to reading comma-separated files. For example, you can also read Tab Separated Value (TSV) files by adding the argument `sep='\t'`.
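
A short sketch of reading tab-separated data follows. To stay self-contained it reads from an in-memory string rather than a file; the column names and values are hypothetical:

```python
import io

import pandas as pd

# Hypothetical tab-separated content, held in memory instead of on disk
tsv_text = "patient_id\tage\tcohort\nP1\t61.5\t1\nP2\t48.0\t2\n"

# sep="\t" tells read_csv that columns are separated by tabs
toy = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(toy)
```

With a real file, the first argument would simply be the path to the `.tsv` file.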

### Exploring the dataset

DataFrames in pandas are objects (instances of the `DataFrame` class), which means they carry specific attributes and methods.

**Attributes** are properties stored on a DataFrame, such as `.shape`. They are accessed without parentheses.

**Methods** are functions that are associated with a DataFrame. Because they are functions, you use `()` to call them, and you can add arguments inside the parentheses to control their behaviour. For example, `.info()`, used below, is a method.
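
The difference between an attribute and a method can be sketched on a toy DataFrame (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"age": [61.5, 48.0, 70.2], "cohort": [1, 2, 1]})

# Attribute: no parentheses, it simply looks up a stored property
print(toy.shape)

# Method: called with parentheses, optionally with arguments
print(toy.head(2))
```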


The most important methods for initially examining a dataset are:

- `.info()`: shows basic information about the DataFrame object, like the type of each column and the number of non-missing entries.
- `.describe()`: shows some statistical measures for the numerical variables.

In [None]:
metabric.info()

In [None]:
metabric.describe()

You can access the dimensions of your DataFrame using the `.shape` attribute. The first value is the number of rows, and the second the number of columns:

In [None]:
print(metabric.shape)
print(f"Number of rows: {metabric.shape[0]}")
print(f"Number of columns: {metabric.shape[1]}")

The row and column names can be accessed using the attributes `.index` and `.columns` respectively:

In [None]:
print(metabric.index)
print(metabric.columns)

Accessing individual columns of a pandas DataFrame can be achieved in two ways:

- The first is using the name of the DataFrame `metabric` followed by a `.` and then followed by the name of the column. 
- The second is using square brackets.

*Note: If your column name contains spaces, the first method is **not** applicable*

In [None]:
print(metabric.Survival_time, metabric["Survival_time"])
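
To illustrate the note about spaces, here is a small sketch on toy data. The column name `"Survival time"` (with a space) is hypothetical and chosen only to show the limitation of dot notation:

```python
import pandas as pd

# Toy data; "Survival time" (with a space) is a made-up column name
toy = pd.DataFrame({"Survival time": [140.5, 84.6, 163.7], "Cohort": [1, 1, 2]})

# Square brackets work for any column name
survival = toy["Survival time"]

# Dot notation only works for names that are valid Python identifiers
cohort = toy.Cohort

print(survival)
print(cohort)
```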

We can also compute metrics on specific columns or on the entire DataFrame:

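*Note: Calling a metric method on a whole DataFrame that contains non-numeric (e.g. categorical) columns behaves differently across pandas versions: older versions omit those columns automatically, sometimes with a warning, whereas pandas 2.0 and later raise a `TypeError`. Passing `numeric_only=True` restricts the computation to the numeric columns in either case.*

In [None]:
print(metabric['Survival_time'].mean())
print(metabric['Survival_time'].std())
print(metabric.mean(numeric_only=True))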

### Selecting columns and rows

The [pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) can be very helpful for recalling basic pandas operations.

To select rows and columns in a DataFrame, we use square brackets `[ ]`. There are two ways to do this: with **positional** indexing, which uses index numbers, and **label-based** indexing which uses column or row names.

To select the first three rows using their numeric index (the colon `:` here defines the range "from the beginning up to, but not including, position 3"):

In [None]:
metabric[:3]

We can combine row and column selection like this:

In [None]:
metabric[:3]['Mutation_count']

However, the following raises an error, although the equivalent syntax works in languages like R:

In [None]:
metabric[:3,'Mutation_count']

To do **positional** indexing for both rows and columns, use `.iloc[]`. The first argument is the numeric index of the rows, and the second the numeric index of the columns:

In [None]:
metabric.iloc[:3,2]

For **label-based** indexing, use `.loc[]` with the row and column labels:

In [None]:
metabric.loc[:3,"Age_at_diagnosis"]

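**Note**: because the rows have numeric labels in this DataFrame, we may think that selecting rows with `.iloc[]` and `.loc[]` is the same. As observed above, this is not the case: `.iloc[:3]` excludes the end position and returns 3 rows, whereas `.loc[:3]` includes the end label and returns 4 rows.

To select more than one column, pass either a list of column names or a label slice (which includes both endpoints):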
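
The difference between `.iloc[]` and `.loc[]` slicing can be checked on a toy DataFrame with the default 0-based row labels (hypothetical values):

```python
import pandas as pd

# Toy data with the default integer row labels 0..4 (hypothetical values)
toy = pd.DataFrame({"age": [75.6, 43.2, 48.9, 47.7, 77.0]})

by_position = toy.iloc[:3]  # positions 0, 1, 2 (end excluded)
by_label = toy.loc[:3]      # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))  # prints: 3 4
```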

In [None]:
metabric.loc[:3, ['Cohort', 'Chemotherapy']]

In [None]:
metabric.loc[:3, 'Cohort':'Chemotherapy']

### Filtering rows

You can choose rows from a DataFrame that match some specified criteria. The criteria are based on values of variables and can make use of comparison operators such as `==`, `>`, `<` and `!=`.

For example, to filter `metabric` so that it only contains observations for those patients who died of breast cancer:

In [None]:
metabric[metabric.Vital_status=="Died of Disease"]

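To filter based on more than one condition, you can use the operators `&` (and) and `|` (or). Each condition must be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators.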

In [None]:
metabric[(metabric.Vital_status=="Died of Disease") & (metabric.Age_at_diagnosis>70)]
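
A small sketch of the other logical operators on toy data follows (hypothetical values; the column names mirror the ones used above). Storing each condition in a variable first avoids the need for nested parentheses, and `~` negates a condition:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "Vital_status": ["Living", "Died of Disease", "Died of Other Causes", "Living"],
        "Age_at_diagnosis": [45.0, 72.3, 81.1, 76.4],
    }
)

# Each condition is a boolean Series with one True/False per row
died_of_disease = toy.Vital_status == "Died of Disease"
over_70 = toy.Age_at_diagnosis > 70

either = toy[died_of_disease | over_70]  # at least one condition holds
not_over_70 = toy[~over_70]              # negation with ~

print(either)
print(not_over_70)
```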

For categorical variables, e.g. `Vital_status` or `Cohort`, it may be useful to list the categories with `.unique()` and to count how many occurrences there are of each with `.value_counts()`:

In [None]:
metabric['Vital_status'].unique()

In [None]:
metabric['Vital_status'].value_counts()

To filter by more than one category, use the `.isin()` method.

In [None]:
metabric[metabric.Vital_status.isin(['Died of Disease', 'Died of Other Causes'])]

### Sort data

To sort the entire DataFrame according to one of the columns, we can use the `.sort_values()` method. We can store the sorted DataFrame using a new variable name such as `metabric_sorted`:

In [None]:
metabric_sorted = metabric.sort_values('Tumour_size')
metabric_sorted

We can also sort the DataFrame in descending order:

In [None]:
metabric_sorted = metabric.sort_values('Tumour_size', ascending=False)
metabric_sorted
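
Sorting by more than one column works the same way. As a sketch on toy data (hypothetical values), `.sort_values()` accepts a list of columns and a matching list of sort directions:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "Cohort": [2, 1, 2, 1],
        "Tumour_size": [22.0, 10.0, 30.0, 15.0],
    }
)

# Sort by Cohort ascending, then by Tumour_size descending within each cohort
toy_sorted = toy.sort_values(["Cohort", "Tumour_size"], ascending=[True, False])
print(toy_sorted)
```

Note that `.sort_values()` returns a new DataFrame and leaves the original one unchanged.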

### Missing data

Pandas primarily uses `NaN` to represent missing data. By default, missing values are excluded from computations.
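
A minimal sketch of this default on a toy Series (hypothetical values), using the `skipna` argument of `.mean()` to switch the behaviour off:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value
sizes = pd.Series([10.0, np.nan, 30.0])

print(sizes.mean())              # NaN is skipped: (10 + 30) / 2 = 20.0
print(sizes.mean(skipna=False))  # NaN propagates when skipping is turned off
```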

The `.info()` method shown above already gave us a way to find columns containing missing data:

In [None]:
metabric.info()

To get the locations where values are missing (the function `pd.isna()` and the method `.isnull()` give the same result):

In [None]:
pd.isna(metabric)

In [None]:
metabric.isnull()
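
Because the result is a table of `True`/`False` values, it can be summed to count missing entries per column. A sketch on toy data (hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "Tumour_size": [10.0, np.nan, 30.0],
        "Tumour_stage": [np.nan, np.nan, 2.0],
    }
)

# True counts as 1 and False as 0, so the sum is the number of missing values
missing_per_column = toy.isna().sum()
print(missing_per_column)
```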

To drop any rows containing at least one column with missing data:

In [None]:
metabric.dropna()

Conversely, to remove columns with at least one missing value, use the `axis` argument:

In [None]:
metabric.dropna(axis=1)

To drop rows only when values are missing in specific columns, use the `subset` argument:

In [None]:
metabric.dropna(subset = ["Tumour_size"])

In [None]:
metabric.dropna(subset = ["Tumour_size", "Tumour_stage"])
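
The three variants of `.dropna()` can be compared side by side on a toy DataFrame (hypothetical values) that is small enough to check by eye:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "Tumour_size": [10.0, np.nan, 30.0],
        "Tumour_stage": [1.0, 2.0, np.nan],
        "Cohort": [1, 1, 2],
    }
)

rows_complete = toy.dropna()                     # keeps only row 0
cols_complete = toy.dropna(axis=1)               # keeps only the Cohort column
size_known = toy.dropna(subset=["Tumour_size"])  # keeps rows 0 and 2

print(rows_complete)
print(cols_complete)
print(size_known)
```

In all three cases `.dropna()` returns a new DataFrame, and `toy` itself is left unchanged.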

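To fill missing data, use `.fillna()` with either a single value for the whole DataFrame or a dictionary mapping column names to fill values: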

In [None]:
metabric.fillna(value=0)

In [None]:
metabric.fillna(value={'Tumour_size':0, 'Tumour_stage':5})
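
As with the other methods in this section, `.fillna()` returns a new DataFrame, so the result needs to be assigned to a variable to be kept. A sketch on toy data (hypothetical values; the fill values are arbitrary):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "Tumour_size": [10.0, np.nan],
        "Tumour_stage": [np.nan, 2.0],
    }
)

# Fill each column with its own replacement value
filled = toy.fillna(value={"Tumour_size": 0, "Tumour_stage": 5})

print(filled)
print(toy)  # the original still contains the missing values
```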

### Grouping

Grouping patients by Cohort and then applying the `.mean()` function to the resulting groups:

*Note: There are more *

In [None]:
metabric.groupby('Cohort')

In [None]:
# numeric_only=True skips non-numeric columns (needed in pandas >= 2.0)
metabric.groupby('Cohort').mean(numeric_only=True)

Grouping by multiple columns forms a hierarchical index, and again we can apply the `.mean()` function:

In [None]:
# numeric_only=True skips non-numeric columns (needed in pandas >= 2.0)
metabric.groupby(['Cohort', 'Vital_status']).mean(numeric_only=True)
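
When only one variable is of interest, the column can be selected before aggregating. A sketch on toy data (hypothetical values; the column names mirror the ones used above):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "Cohort": [1, 1, 2, 2],
        "Vital_status": ["Living", "Died of Disease", "Living", "Living"],
        "Tumour_size": [10.0, 20.0, 30.0, 50.0],
    }
)

# Mean tumour size per cohort: one value per group
mean_size = toy.groupby("Cohort")["Tumour_size"].mean()
print(mean_size)

# Grouping by two columns gives a result with a hierarchical (two-level) index
by_both = toy.groupby(["Cohort", "Vital_status"])["Tumour_size"].mean()
print(by_both)
```

Individual groups of the hierarchical result can be looked up with a tuple, e.g. `by_both.loc[(2, "Living")]`.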

### Q&A

Duration: ~10'