# Differential gene expression - Instructor



## Obtaining the data

In computational biology, working with large datasets is a fundamental part of data analysis. One of the most common formats for storing structured data is the CSV (Comma-Separated Values) file.
A CSV file is essentially a plain text file where each row represents a record, and columns are separated by commas. This format makes it easy to share and manipulate tabular data.

In this course, we will use [Pandas](https://pandas.pydata.org/docs/), a powerful Python library designed for data manipulation and analysis, to handle our data. Before we can analyze any dataset, we must first obtain and load it into a format that allows efficient processing.

### Fetching the Dataset from GitHub

The dataset we will be working with contains gene expression data and is hosted on GitHub. To download and load this dataset into memory, we can use the `requests` library to make an HTTP request and retrieve the file. Below is a Python function that performs this task:

In [1]:
import requests
import io
import pandas as pd


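def get_gene_expr_data() -> pd.DataFrame:
    """Download the gene expression CSV from GitHub and return it as a DataFrame."""
    csv_path = "https://github.com/oasci/pitt-biosc1540-2025s/raw/refs/heads/main/content/data/gene-expr/SSvLIS-day3/ground-day3-gene-expr-SS-and-LIS.csv"
    # A timeout keeps the call from hanging indefinitely on a stalled connection
    response = requests.get(csv_path, timeout=30)
    if response.status_code == 200:
        csv_text = response.text
        # Wrap the text in a file-like object so read_csv can parse it
        data = pd.read_csv(io.StringIO(csv_text))
        return data
    else:
        # Raise instead of printing so the caller never silently receives None
        raise RuntimeError(f"Failed to fetch file. Status code: {response.status_code}")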

This function does the following:

1. Uses the [`requests`](https://docs.python-requests.org/en/latest/) library to send an HTTP GET request to the GitHub-hosted CSV file.
2. Checks if the request was successful by verifying the status code.
3. Reads the CSV file's content and converts it into a Pandas DataFrame using `pd.read_csv()`.
4. Returns the DataFrame for further analysis.

A DataFrame in Pandas is a two-dimensional, table-like data structure similar to an Excel spreadsheet. It consists of:

- Rows (each row represents an observation or record).
- Columns (each column represents a variable or feature).
- Indexing (a way to reference rows and columns efficiently).

You can think of a DataFrame as an enhanced Excel sheet that allows programmatic manipulation, filtering, and analysis of data. Unlike an Excel sheet, however, Pandas provides powerful tools to handle missing data, apply complex transformations, and perform statistical computations efficiently.
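
To make the rows, columns, and index concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The gene names and values are made up for illustration; only the column naming mimics the course dataset.

```python
import pandas as pd

# Hypothetical toy values; they are not taken from the real dataset.
toy = pd.DataFrame(
    {
        "Gene": ["gene_a", "gene_b", "gene_c"],
        "SS316.1": [10.5, 3.0, 120.0],
        "LIS.1": [0.0, 7.25, 98.5],
    }
)

print(toy.shape)    # (number of rows, number of columns)
print(toy.columns)  # column labels
print(toy.index)    # row labels (a default RangeIndex here)
```

Each dictionary key becomes a column, and each position in the lists becomes a row.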

Once we retrieve the CSV file, we use [`pd.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to load the text data into a DataFrame:

```python
data = pd.read_csv(io.StringIO(csv_text))
```

Here, `io.StringIO(csv_text)` treats the downloaded CSV content as if it were a file, allowing `pd.read_csv()` to parse it directly.
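
The same idea can be tried without any network access. In this sketch, `csv_text` is a short, hypothetical string standing in for the downloaded file content.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the body of the HTTP response.
csv_text = "Gene,SS316.1,LIS.1\ngene_a,10.5,0.0\ngene_b,3.0,7.25\n"

# io.StringIO wraps the string in a file-like object that read_csv can parse.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
```

The first line of the text is used as the header row, and numeric fields are parsed as floats.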

In [2]:
df = get_gene_expr_data()

## Exploring the data

Now that we have successfully loaded our dataset into a Pandas DataFrame, it is essential to understand how to explore, manipulate, and extract meaningful information from it. In this section, we will cover fundamental operations such as viewing data, indexing, slicing, filtering, and selecting subsets of a DataFrame.

### Inspecting the Data

After loading a dataset, the first step is to get a sense of its structure. Pandas provides several methods to examine the contents of a DataFrame.

#### Viewing the First and Last Few Rows

We can use the `.head()` and `.tail()` methods to preview the data:

In [3]:
# Display the first five rows
print(df.head())

         Gene     SS316.1     SS316.2     SS316.3     SS316.4       LIS.1  \
0  PA14_00010  248.431118  204.666946  199.842429  255.357432  434.670071   
1  PA14_00020  215.984949  203.513892  163.323000  206.023071   68.632116   
2  PA14_00030  154.966779  182.182408  119.026286  196.295169    0.000000   
3  PA14_00050  277.003118  331.502799  211.677429  301.217542  114.386861   
4  PA14_00060   50.364203   53.040448   51.397714   69.485015   22.877372   

        LIS.2       LIS.3       LIS.4  
0  281.916087  279.049317  293.841709  
1  250.592078   24.265158  159.569670  
2  178.994341    0.000000  142.055926  
3  331.139531  643.026687  301.625595  
4   31.324010   54.596606   38.919432  


In [4]:
# Display the last five rows
print(df.tail())

            Gene     SS316.1     SS316.2     SS316.3     SS316.4       LIS.1  \
5959  PA14_73370  222.280474  212.161791  194.263072  233.122227  160.141605   
5960  PA14_73390   81.357559   93.397310  109.051072  119.861652    0.000000   
5961  PA14_73400  107.023932   54.770028   93.834643  130.631829  297.405838   
5962  PA14_73410  406.303525  266.931819  330.027429  345.340527   68.632116   
5963  PA14_73420   80.873288  104.351316   83.352214   84.076869   22.877372   

           LIS.2       LIS.3       LIS.4  
5959  246.117219  430.706555  231.570618  
5960   13.424576   18.198869   72.000949  
5961  107.396605   48.530316  128.434124  
5962  259.541795  218.386422  229.624647  
5963  116.346322  103.126922   36.973460  


By default, `.head()` and `.tail()` return the first and last five rows, respectively. You can specify a different number of rows as an argument, e.g., `df.head(10)` for the first ten rows.
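
As a small sketch of the row-count argument, using a hypothetical 100-row frame rather than the course data:

```python
import pandas as pd

# Hypothetical frame with a single column counting from 0 to 99.
toy = pd.DataFrame({"value": range(100)})

first_ten = toy.head(10)         # first ten rows
last_three = toy.tail(3)         # last three rows
all_but_last_two = toy.head(-2)  # a negative n drops rows from the end

print(len(first_ten), len(last_three), len(all_but_last_two))
```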

#### Checking the Structure of the Data

To understand the columns, data types, and non-null values, we use:

In [5]:
# .info() prints its report directly and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5964 entries, 0 to 5963
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Gene     5964 non-null   object 
 1   SS316.1  5964 non-null   float64
 2   SS316.2  5964 non-null   float64
 3   SS316.3  5964 non-null   float64
 4   SS316.4  5964 non-null   float64
 5   LIS.1    5964 non-null   float64
 6   LIS.2    5964 non-null   float64
 7   LIS.3    5964 non-null   float64
 8   LIS.4    5964 non-null   float64
dtypes: float64(8), object(1)
memory usage: 419.5+ KB


This provides details such as:

- The number of rows and columns.
- Column names and their data types.
- The number of non-null values in each column.
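
The same facts can also be pulled out as values to use in later code. This sketch uses a hypothetical toy frame with one deliberately missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; one expression value is missing on purpose.
toy = pd.DataFrame(
    {
        "Gene": ["gene_a", "gene_b", "gene_c"],
        "SS316.1": [10.5, np.nan, 120.0],
    }
)

n_rows, n_cols = toy.shape            # number of rows and columns
column_types = toy.dtypes             # data type of each column
missing_per_column = toy.isna().sum() # count of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
```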

#### Summarizing the Data

To obtain summary statistics of numeric columns, we use:

In [6]:
print(df.describe())

            SS316.1       SS316.2        SS316.3        SS316.4         LIS.1  \
count  5.964000e+03  5.964000e+03    5964.000000    5964.000000  5.964000e+03   
mean   8.264716e+02  6.592480e+02     424.724098     237.139402  1.315407e+03   
std    2.591986e+04  2.034955e+04   11502.870293    5155.394148  3.931560e+04   
min    1.937085e+00  0.000000e+00       0.169071       0.000000  0.000000e+00   
25%    3.583607e+01  3.271788e+01      36.012214      34.742508  0.000000e+00   
50%    6.150244e+01  6.284140e+01      62.049214      63.231364  2.287737e+01   
75%    1.079925e+02  1.129992e+02     107.360357     110.828600  9.150949e+01   
max    1.521413e+06  1.136283e+06  664108.346036  283818.494270  2.154682e+06   

              LIS.2         LIS.3          LIS.4  
count  5.964000e+03  5.964000e+03    5964.000000  
mean   3.355261e+03  9.296802e+02     392.419657  
std    1.006808e+05  2.520583e+04    9065.883674  
min    0.000000e+00  0.000000e+00       0.000000  
25%    2.684915

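The `.describe()` method reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column. In the output above, the means are far larger than the medians (the `50%` row), which indicates that a small number of very highly expressed genes dominate the averages.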
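
The result of `.describe()` is itself a DataFrame, so individual statistics can be looked up by row and column label. A minimal sketch with hypothetical values, including one large outlier:

```python
import pandas as pd

# Hypothetical values; the last one is an outlier that pulls the mean upward.
toy = pd.DataFrame({"SS316.1": [10.0, 20.0, 30.0, 1000.0]})

summary = toy.describe()
mean_value = summary.loc["mean", "SS316.1"]
median_value = summary.loc["50%", "SS316.1"]

print(mean_value, median_value)
```

Comparing the mean with the median is a quick way to spot skewed columns.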

### Selecting and Indexing Data

Pandas allows you to select specific rows and columns using different methods.

#### Selecting Columns

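You can select a column using bracket notation (`[]`) or, when the column name is a valid Python identifier, dot notation (`.`). Names such as `SS316.1` contain a dot and can only be reached with brackets, so bracket notation is the safer default: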

In [8]:
genes = df["Gene"]
print(genes)

0       PA14_00010
1       PA14_00020
2       PA14_00030
3       PA14_00050
4       PA14_00060
           ...    
5959    PA14_73370
5960    PA14_73390
5961    PA14_73400
5962    PA14_73410
5963    PA14_73420
Name: Gene, Length: 5964, dtype: object


To select multiple columns, pass a list:

In [9]:
ss_data = df[["SS316.1", "SS316.2", "SS316.3", "SS316.4"]]
print(ss_data)

         SS316.1     SS316.2     SS316.3     SS316.4
0     248.431118  204.666946  199.842429  255.357432
1     215.984949  203.513892  163.323000  206.023071
2     154.966779  182.182408  119.026286  196.295169
3     277.003118  331.502799  211.677429  301.217542
4      50.364203   53.040448   51.397714   69.485015
...          ...         ...         ...         ...
5959  222.280474  212.161791  194.263072  233.122227
5960   81.357559   93.397310  109.051072  119.861652
5961  107.023932   54.770028   93.834643  130.631829
5962  406.303525  266.931819  330.027429  345.340527
5963   80.873288  104.351316   83.352214   84.076869

[5964 rows x 4 columns]
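
The sketch below compares bracket and dot notation on a hypothetical toy frame, and shows one way to collect all columns that share a prefix without typing each name. The prefix-matching step is an illustration, not something the dataset requires.

```python
import pandas as pd

# Hypothetical toy frame with column names shaped like the course dataset.
toy = pd.DataFrame(
    {
        "Gene": ["gene_a", "gene_b"],
        "SS316.1": [10.5, 3.0],
        "SS316.2": [11.0, 2.5],
        "LIS.1": [0.0, 7.25],
    }
)

genes_bracket = toy["Gene"]
genes_dot = toy.Gene  # works because "Gene" is a valid Python identifier

# toy.SS316.1 would be a syntax error, so brackets are required here.
ss1 = toy["SS316.1"]

# Build the list of stainless steel columns from the column labels.
ss_cols = [name for name in toy.columns if name.startswith("SS316")]
ss_data = toy[ss_cols]

print(ss_data)
```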


#### Selecting Rows

To access specific rows, Pandas provides two primary methods:

- `.loc[]` (label-based selection)
- `.iloc[]` (integer position-based selection)

In [None]:
# Select a row by index label
row_5 = df.loc[5]
print(row_5)

In [None]:
# Select rows labeled 5 through 10 (label slices include both endpoints)
rows_5_to_10 = df.loc[5:10]
print(rows_5_to_10)

#### Slicing the DataFrame

You can slice both rows and columns using `.iloc[]`:

In [None]:
# Select rows 5 to 10 and columns 1 to 3
subset = df.iloc[5:11, 1:4]
print(subset)

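This follows Python's standard slicing rules (`start:stop`, where `stop` is exclusive), so `5:11` covers positions 5 through 10. Label slices with `.loc[]`, by contrast, include both endpoints.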
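
The difference between labels and positions is easiest to see on a small example. This sketch uses a hypothetical 20-row frame, then reorders it so that labels and positions no longer coincide.

```python
import pandas as pd

# Hypothetical frame whose default labels 0..19 match its positions.
toy = pd.DataFrame({"value": range(20)})

by_label = toy.loc[5:10]      # labels 5 through 10, both included
by_position = toy.iloc[5:10]  # positions 5 through 9, stop excluded

print(len(by_label), len(by_position))

# After sorting, the row labels travel with the rows.
shuffled = toy.sort_values("value", ascending=False)
first_by_position = shuffled.iloc[0]  # the row that is now first
row_with_label_0 = shuffled.loc[0]    # the row still labeled 0

print(first_by_position["value"], row_with_label_0["value"])
```

With the default `RangeIndex`, as in our dataset, labels and positions start out identical, which is why the two indexers are easy to confuse.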

### Filtering Data

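A powerful feature of Pandas is the ability to filter data based on conditions. Comparing a DataFrame to a value produces a Boolean mask of the same shape. Indexing `df` with that mask keeps every row but replaces each entry that fails the condition, and every entry in columns outside the mask (such as `Gene`), with `NaN`:

In [12]:
# Keep SS316 values above 300; all other entries become NaN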
high_expression = df[df[["SS316.1", "SS316.2", "SS316.3", "SS316.4"]] > 300]
print(high_expression)

     Gene     SS316.1     SS316.2     SS316.3     SS316.4  LIS.1  LIS.2  \
0     NaN         NaN         NaN         NaN         NaN    NaN    NaN   
1     NaN         NaN         NaN         NaN         NaN    NaN    NaN   
2     NaN         NaN         NaN         NaN         NaN    NaN    NaN   
3     NaN         NaN  331.502799         NaN  301.217542    NaN    NaN   
4     NaN         NaN         NaN         NaN         NaN    NaN    NaN   
...   ...         ...         ...         ...         ...    ...    ...   
5959  NaN         NaN         NaN         NaN         NaN    NaN    NaN   
5960  NaN         NaN         NaN         NaN         NaN    NaN    NaN   
5961  NaN         NaN         NaN         NaN         NaN    NaN    NaN   
5962  NaN  406.303525         NaN  330.027429  345.340527    NaN    NaN   
5963  NaN         NaN         NaN         NaN         NaN    NaN    NaN   

      LIS.3  LIS.4  
0       NaN    NaN  
1       NaN    NaN  
2       NaN    NaN  
3       NaN    

To apply multiple conditions, use logical operators:

- `&` (AND)
- `|` (OR)
- `~` (NOT)
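
A minimal sketch of these operators, using a hypothetical toy frame and an arbitrary threshold of 300. Indexing with a Boolean Series, one value per row, drops the rows that fail the condition.

```python
import pandas as pd

# Hypothetical toy values chosen so each condition matches different rows.
toy = pd.DataFrame(
    {
        "Gene": ["gene_a", "gene_b", "gene_c", "gene_d"],
        "SS316.1": [350.0, 20.0, 400.0, 5.0],
        "LIS.1": [10.0, 500.0, 450.0, 2.0],
    }
)

# Each comparison yields a Boolean Series with one entry per row.
high_ss = toy["SS316.1"] > 300
high_lis = toy["LIS.1"] > 300

both = toy[high_ss & high_lis]        # AND: high in both conditions
either = toy[high_ss | high_lis]      # OR: high in at least one
neither = toy[~(high_ss | high_lis)]  # NOT: high in none

# Reduce a multi-column mask to one Boolean per row with .any(axis=1).
any_high = toy[(toy[["SS316.1", "LIS.1"]] > 300).any(axis=1)]

print(both["Gene"].tolist())
print(either["Gene"].tolist())
print(neither["Gene"].tolist())
```

When conditions are written inline, wrap each comparison in parentheses, for example `toy[(toy["SS316.1"] > 300) & (toy["LIS.1"] > 300)]`, because `&` and `|` bind more tightly than `>`.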