# pandas

<div class="admonition danger">
    <p class="admonition-title">DRAFT</p>
    <p style="padding-top: 1em">
        This page is a work in progress and is subject to change at any moment.
    </p>
</div>

Pandas has so many uses that it might make sense to list the things it can't do instead of what it can do. 
Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your computer.
Pandas will extract the data from that CSV into a DataFrame&mdash;a table, basically&mdash;then let you do things like:

-   Calculate statistics and answer questions about the data, like
    -   What's the average, median, max, or min of each column?
    -   Does column A correlate with column B?
    -   What does the distribution of data in column C look like?
-   Clean the data by removing missing values and filtering rows or columns by some criteria
-   Visualize the data with help from Matplotlib: plot bars, lines, histograms, bubbles, and more
-   Store the cleaned, transformed data back into a CSV, another file, or a database
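
As a rough illustration of the tasks listed above, here is a minimal sketch. The CSV content and the column names `a`, `b`, and `c` are made up for this example, and the data is read from an in-memory string rather than a file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path to read_csv.
csv_text = """a,b,c
1.0,2.0,10.0
2.0,4.1,
3.0,5.9,30.0
4.0,8.2,40.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Summary statistics (count, mean, min, max, quartiles) for each numeric column.
summary = df.describe()

# Pearson correlation between column a and column b.
corr_ab = df["a"].corr(df["b"])

# Clean: drop rows with missing values, then keep only rows where a > 1.
cleaned = df.dropna()
cleaned = cleaned[cleaned["a"] > 1]

# Store: with no path, to_csv returns the CSV as a string instead of writing a file.
csv_out = cleaned.to_csv(index=False)

print(summary)
print(corr_ab)
print(csv_out)
```

Passing a path such as `cleaned.to_csv("cleaned.csv", index=False)` would write the result to disk instead; the file name here is only a placeholder.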


Before you jump into modeling or complex visualizations, you need a good understanding of the nature of your dataset, and pandas is one of the best tools for building that understanding.

Key resources:

-   [Documentation](https://pandas.pydata.org/docs/)
-   [Learn](https://pandas.pydata.org/docs/user_guide/index.html)

First things first, let's import NumPy and pandas.

In [1]:
import numpy as np
import pandas as pd

The two main components that pandas adds are [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and [DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

## Series

[Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) is a one-dimensional labeled array.
In other words, it is like a column in an Excel file.

First, let us create some synthetic data to use.
The data represents a snapshot of gene expression levels and treatments in a hypothetical biological study.
Each gene has a unique identifier, such as `"G1"` or `"G2"`, and a numerical expression level.
Each gene is also assigned to a treatment group: `"Control"`, `"TreatmentA"`, or `"TreatmentB"`.

In [2]:
gene_id = np.array(["G1", "G2", "G3", "G4", "G5"])
expression_Level = np.array([2.5, 1.8, 3.2, 2.0, 3.5])
treatment = np.array(["Control", "TreatmentA", "TreatmentB", "Control", "TreatmentA"])

In [3]:
s_gene = pd.Series(gene_id)
print(s_gene)

0    G1
1    G2
2    G3
3    G4
4    G5
dtype: object


We can also assign custom labels to the index instead of the default integers.

In [4]:
s_gene_other = pd.Series(gene_id, index=["A", "B", "C", "D", "E"])
print(s_gene_other)

A    G1
B    G2
C    G3
D    G4
E    G5
dtype: object
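
With custom labels in place, values can be looked up either by label or by position. The snippet below is a small sketch of that difference; it rebuilds the same Series so that it runs on its own.

```python
import numpy as np
import pandas as pd

gene_id = np.array(["G1", "G2", "G3", "G4", "G5"])
s_gene_other = pd.Series(gene_id, index=["A", "B", "C", "D", "E"])

# .loc looks a value up by its index label.
by_label = s_gene_other.loc["C"]

# .iloc looks a value up by integer position, regardless of the labels.
by_position = s_gene_other.iloc[2]

print(by_label, by_position)
print(list(s_gene_other.index))
```

Both lookups return the third element here, because the label `"C"` sits at position 2.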


However, we generally keep the default integer index, which starts at 0.

In [5]:
s_express = pd.Series(expression_Level)
print(s_express)

0    2.5
1    1.8
2    3.2
3    2.0
4    3.5
dtype: float64


In [6]:
s_treatment = pd.Series(treatment)
print(s_treatment)

0       Control
1    TreatmentA
2    TreatmentB
3       Control
4    TreatmentA
dtype: object


## DataFrame

[`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) is a 2-dimensional labeled data structure with columns of potentially different types.
You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
It is generally the most commonly used pandas object.
Like Series, DataFrame accepts many different kinds of input:

-   Dict of 1D ndarrays, lists, dicts, or Series
-   2-D numpy.ndarray
-   Structured or record ndarray
-   A Series
-   Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.
If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame.
Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
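
To make the discarding behavior concrete, here is a small sketch. The Series below use gene identifiers as their index, which is an assumption for this example and differs from the integer-indexed Series created earlier.

```python
import pandas as pd

# Two Series labeled by gene identifier (hypothetical labels for this sketch).
s_express = pd.Series([2.5, 1.8, 3.2], index=["G1", "G2", "G3"])
s_treatment = pd.Series(["Control", "TreatmentA"], index=["G1", "G2"])

# Only the labels passed as index appear in the result:
# "G3" is discarded, and "G4" (present in neither Series) is filled with NaN.
df_subset = pd.DataFrame(
    {"expression_level": s_express, "treatment": s_treatment},
    index=["G1", "G2", "G4"],
)
print(df_subset)
```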

If axis labels are not passed, they will be constructed from the input data based on common sense rules.
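
As a brief sketch of how labels are chosen: a 2-D array carries no labels, so pandas numbers the rows and columns from 0, while a dict supplies the column names through its keys. The labels `"rep1"` and `"rep2"` are placeholders invented for this example.

```python
import numpy as np
import pandas as pd

values = np.array([[2.5, 1.8], [3.2, 2.0]])

# No labels passed: rows and columns both get integer labels starting at 0.
df_default = pd.DataFrame(values)

# Explicit labels passed through the index and columns arguments.
df_labeled = pd.DataFrame(values, index=["G1", "G2"], columns=["rep1", "rep2"])

# A dict provides column labels through its keys; rows still default to integers.
df_from_dict = pd.DataFrame({"expression_level": [2.5, 1.8]})

print(df_default)
print(df_labeled)
print(df_from_dict)
```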

First, let's use the series we created.

In [7]:
df = pd.DataFrame(
    data={"gene_id": s_gene, "expression_level": s_express, "treatment": s_treatment}
)
print(df)

  gene_id  expression_level   treatment
0      G1               2.5     Control
1      G2               1.8  TreatmentA
2      G3               3.2  TreatmentB
3      G4               2.0     Control
4      G5               3.5  TreatmentA


Creating each `Series` first and then the `DataFrame` is tedious; instead, we can pass a dictionary of lists directly.

In [8]:
df = pd.DataFrame(
    data={
        "gene_id": ["G1", "G2", "G3", "G4", "G5"],
        "expression_level": [2.5, 1.8, 3.2, 2.0, 3.5],
        "treatment": ["Control", "TreatmentA", "TreatmentB", "Control", "TreatmentA"],
    }
)
print(df)

  gene_id  expression_level   treatment
0      G1               2.5     Control
1      G2               1.8  TreatmentA
2      G3               3.2  TreatmentB
3      G4               2.0     Control
4      G5               3.5  TreatmentA
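
Since a DataFrame can be thought of as a dict of Series, selecting a single column by name gives back a Series. The following sketch rebuilds the same DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "gene_id": ["G1", "G2", "G3", "G4", "G5"],
        "expression_level": [2.5, 1.8, 3.2, 2.0, 3.5],
        "treatment": ["Control", "TreatmentA", "TreatmentB", "Control", "TreatmentA"],
    }
)

# Selecting one column by name returns a Series that shares the DataFrame's index.
col = df["expression_level"]
print(type(col).__name__)

# Each column keeps its own dtype.
print(df.dtypes)

# Series methods such as mean() work on the selected column.
mean_expression = col.mean()
print(mean_expression)
```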


## Acknowledgements

This content was adapted from the following sources:

-   [LearnDataSci](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)
-   [Kaggle](https://www.kaggle.com/learn/pandas)
-   [DataCamp](https://www.datacamp.com/tutorial/pandas)
