# Pandas for Biological Data Analysis
This notebook explores some of the commonly used features of Pandas for data science. In the field of biology, data can be incredibly diverse, ranging from genomic sequences to ecological observations. Efficient handling of this data is crucial for meaningful analysis. This guide will cover the basics of importing, viewing, filtering, and grouping data in Pandas.

# Importing Data

The first step in data analysis is importing the data. Biological datasets often come in structured formats like CSV and Excel files. Understanding how to import these files correctly is essential for any further data manipulation and analysis.

## Reading from CSV
CSV files are commonly used in biology for storing tabular data such as gene expression levels, phenotypic data, or sequencing results. To import a CSV file in Pandas:
```python
import pandas as pd
gene_df = pd.read_csv('filename.csv')
```
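
Real files often mark missing measurements with a custom placeholder. The sketch below is illustrative only: the CSV text, the column names, and the `n/d` marker are made up, and an in-memory string stands in for a file on disk. It shows how the `na_values` argument of `read_csv` can turn such placeholders into `NaN`.

```python
import io
import pandas as pd

# Hypothetical CSV content (made-up values); 'n/d' marks a missing measurement.
csv_text = """gene_id,sample,expression
geneA,s1,12.5
geneB,s1,n/d
geneC,s2,7.0
"""

# read_csv accepts a file path or any file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text), na_values=["n/d"])

print(toy_df.shape)                        # (rows, columns)
print(toy_df["expression"].isna().sum())   # number of missing values
```

Because the placeholder is converted to `NaN`, the `expression` column is read as numeric rather than as text.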

## Reading from Excel

Importing data from Excel is just as straightforward (reading `.xlsx` files requires the optional `openpyxl` package):
```python
clinical_data = pd.read_excel('filename.xlsx')
```

In this workbook, we'll be using the Palmer penguins data set as an example. This dataset contains information about penguins from three different species that were observed on the Palmer Archipelago, Antarctica. The data includes information about the penguins' species, island, bill length and depth, flipper length, body mass, and sex.

The data originally appeared in:

Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

To load in the data from palmer_penguins.csv, we type:

In [1]:
import pandas as pd
penguin_df=pd.read_csv("palmer_penguins.csv")

# Viewing and Inspecting Biological Data
Once the data is imported into a Pandas DataFrame, the next step is to inspect the data. This is crucial for understanding the structure and quality of the data before proceeding to more complex analyses.

## Displaying Data
To get a quick overview of the data, you can display the first few rows with the `head()` method or the last few rows with the `tail()` method:
```python
# Display the first 5 rows
print(penguin_df.head())

# Alternatively, for a more interactive display in Jupyter Notebooks
display(penguin_df.head())
```
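
Both methods take an optional row count. As a small sketch on a toy DataFrame (the values are made up and serve only to show the calls):

```python
import pandas as pd

# Toy data (made-up values), used only to illustrate head() and tail().
toy_df = pd.DataFrame({
    "id": range(1, 11),
    "mass_g": [3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250, 3300],
})

first_three = toy_df.head(3)   # first 3 rows instead of the default 5
last_two = toy_df.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```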


In [2]:
print(penguin_df.head())
# or
display(penguin_df.head())

   rowid species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0      1  Adelie  Torgersen            39.1           18.7              181.0   
1      2  Adelie  Torgersen            39.5           17.4              186.0   
2      3  Adelie  Torgersen            40.3           18.0              195.0   
3      4  Adelie  Torgersen             NaN            NaN                NaN   
4      5  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  year  
0       3750.0    male  2007  
1       3800.0  female  2007  
2       3250.0  female  2007  
3          NaN     NaN  2007  
4       3450.0  female  2007  


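|   | rowid | species | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    | year |
|---|-------|---------|-----------|----------------|---------------|-------------------|-------------|--------|------|
| 0 | 1     | Adelie  | Torgersen | 39.1           | 18.7          | 181.0             | 3750.0      | male   | 2007 |
| 1 | 2     | Adelie  | Torgersen | 39.5           | 17.4          | 186.0             | 3800.0      | female | 2007 |
| 2 | 3     | Adelie  | Torgersen | 40.3           | 18.0          | 195.0             | 3250.0      | female | 2007 |
| 3 | 4     | Adelie  | Torgersen | NaN            | NaN           | NaN               | NaN         | NaN    | 2007 |
| 4 | 5     | Adelie  | Torgersen | 36.7           | 19.3          | 193.0             | 3450.0      | female | 2007 |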


# Checking the dataframe

You can use `df.shape` (an attribute, so no parentheses) to check the dimensions of the dataframe, or `df.info()` to show the column names, non-null counts, and data types stored in the columns (variables).
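
As a minimal sketch on a toy DataFrame (made-up values), the checks can be combined with `isna().sum()` to count missing values per column. Note that `shape` is an attribute that returns a `(rows, columns)` tuple; calling it with parentheses raises a `TypeError`.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with one missing measurement.
toy_df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 3800.0],
})

n_rows, n_cols = toy_df.shape     # attribute access, no parentheses
print(n_rows, n_cols)
print(toy_df.dtypes)              # data type of each column
print(toy_df.isna().sum())        # missing values per column
```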


In [3]:
#checking the data
penguin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              344 non-null    int64  
 1   species            344 non-null    object 
 2   island             344 non-null    object 
 3   bill_length_mm     342 non-null    float64
 4   bill_depth_mm      342 non-null    float64
 5   flipper_length_mm  342 non-null    float64
 6   body_mass_g        342 non-null    float64
 7   sex                333 non-null    object 
 8   year               344 non-null    int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 24.3+ KB


# Column selection

To access a single column in the dataframe, you can use the following syntax:

```python
df['column_name']
```

For example, to access the 'species' column in the Palmer Penguins dataset, you can use the following code:

```python
species = penguin_df['species']
print(species)
```
This will print the 'species' column of the dataframe. You can also access multiple columns by passing in a list of column names:

```python
columns = ['species', 'island', 'bill_length_mm']
subset = penguin_df[columns]
print(subset)
```
This will print a subset of the dataframe containing the 'species', 'island', and 'bill_length_mm' columns.
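
One detail worth knowing is that a single column name returns a Series, while a list of names returns a DataFrame, even when the list holds only one name. A small sketch on toy data (made-up values):

```python
import pandas as pd

# Toy data (made-up values), used only to show the return types.
toy_df = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "island": ["Torgersen", "Biscoe"],
    "bill_length_mm": [39.1, 46.1],
})

one_col = toy_df["species"]                 # single label -> Series
one_col_df = toy_df[["species"]]            # list with one label -> DataFrame
two_cols = toy_df[["species", "island"]]    # list of labels -> DataFrame

print(type(one_col).__name__, type(one_col_df).__name__)
```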

In [4]:
penguin_df['species']

0         Adelie
1         Adelie
2         Adelie
3         Adelie
4         Adelie
         ...    
339    Chinstrap
340    Chinstrap
341    Chinstrap
342    Chinstrap
343    Chinstrap
Name: species, Length: 344, dtype: object


### Accessing individual Rows
#### loc[ ]

The `loc[ ]` indexer selects rows and columns by label. You can combine row labels (in the example below, index labels 0 to 4 are selected) with a list of column names. Unlike ordinary Python slices, a label slice in `loc` includes both endpoints, so `0:4` returns five rows.

In [5]:
# Select the first 5 rows and the 'species' and 'island' columns
penguin_df.loc[0:4, ['species', 'island']]


|   | species | island    |
|---|---------|-----------|
| 0 | Adelie  | Torgersen |
| 1 | Adelie  | Torgersen |
| 2 | Adelie  | Torgersen |
| 3 | Adelie  | Torgersen |
| 4 | Adelie  | Torgersen |


#### iloc[ ]
The `iloc[ ]` indexer selects rows and columns by integer position. Position slices follow the usual Python rule and exclude the endpoint, so `0:5` returns the rows at positions 0 to 4.

In [6]:
# Select the first 5 rows and the first 3 columns
penguin_df.iloc[0:5, 0:3]


|   | rowid | species | island    |
|---|-------|---------|-----------|
| 0 | 1     | Adelie  | Torgersen |
| 1 | 2     | Adelie  | Torgersen |
| 2 | 3     | Adelie  | Torgersen |
| 3 | 4     | Adelie  | Torgersen |
| 4 | 5     | Adelie  | Torgersen |
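
In the penguin data the index labels happen to equal the row positions, which hides the difference between the two indexers. The sketch below uses a toy DataFrame (made-up values) whose index starts at 10, so labels and positions no longer coincide.

```python
import pandas as pd

# Toy data (made-up values) with index labels that differ from positions.
toy_df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "island": ["Torgersen", "Dream", "Biscoe", "Biscoe"],
    },
    index=[10, 11, 12, 13],
)

# loc slices by label and includes both endpoints: labels 10, 11, 12.
by_label = toy_df.loc[10:12, ["species"]]

# iloc slices by position and excludes the endpoint: positions 0 and 1.
by_position = toy_df.iloc[0:2, 0:1]

print(by_label)
print(by_position)
```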


# Getting unique values in column: .unique()


The .unique() method in pandas is used to find the unique values from a column in a DataFrame or a Series. This function is incredibly useful for exploring and understanding your dataset, especially when dealing with categorical data. It helps in identifying the distinct categories or values present in a column.

For example, to identify all the unique entries in the 'species' column, we'd enter:


In [7]:
print( penguin_df['species'].unique() )


['Adelie' 'Gentoo' 'Chinstrap']
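
When a column contains missing values, `.unique()` reports `NaN` as one of the values, whereas the related `.nunique()` method counts distinct values and skips `NaN` by default. A minimal sketch on a toy Series (made-up values):

```python
import numpy as np
import pandas as pd

# Toy Series (made-up values) with one missing entry.
sex = pd.Series(["male", "female", np.nan, "female"])

distinct_values = sex.unique()    # includes the missing value
n_distinct = sex.nunique()        # excludes missing values by default

print(distinct_values)
print(n_distinct)
```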


# .value_counts()

`value_counts()` is a method in the pandas library for counting the number of occurrences of each unique value in a column of a DataFrame or a Series.

In [8]:
# .value_counts()

penguin_df['species'].value_counts()

species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

In [9]:
#Class exercise: How many male penguins and female penguins were included in the study?
penguin_df['sex'].value_counts()

sex
male      168
female    165
Name: count, dtype: int64
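
In the `sex` counts above, 168 + 165 = 333, which matches the 333 non-null entries reported by `info()`. The remaining 11 of the 344 rows have no recorded sex, and `value_counts()` leaves them out by default. The sketch below, on a toy Series with made-up values, shows the `dropna=False` option for counting missing entries and `normalize=True` for proportions.

```python
import numpy as np
import pandas as pd

# Toy Series (made-up values) with one missing entry.
sex = pd.Series(["male", "female", "female", np.nan])

counts_with_missing = sex.value_counts(dropna=False)   # NaN gets its own row
proportions = sex.value_counts(normalize=True)         # fractions of non-missing rows

print(counts_with_missing)
print(proportions)
```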

# .crosstab()

`pd.crosstab()` is a top-level function in the pandas library (not a DataFrame method) for creating a cross-tabulation (or "contingency table") of two or more factors. It is used to analyze the relationship between two categorical variables.

In [10]:
cont_table=pd.crosstab(penguin_df['island'], penguin_df['species'])
print(cont_table)

species    Adelie  Chinstrap  Gentoo
island                              
Biscoe         44          0     124
Dream          56         68       0
Torgersen      52          0       0
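
Two optional arguments are often useful with contingency tables: `margins=True` adds row and column totals, and `normalize="index"` converts each row to fractions. A small sketch on toy data (made-up rows, not the penguin counts):

```python
import pandas as pd

# Toy data (made-up rows), used only to illustrate the options.
toy_df = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"],
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"],
})

# Counts with an extra 'All' row and column holding the totals.
with_totals = pd.crosstab(toy_df["island"], toy_df["species"], margins=True)

# Each row rescaled so that it sums to 1.
row_fractions = pd.crosstab(toy_df["island"], toy_df["species"], normalize="index")

print(with_totals)
print(row_fractions)
```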


# .query()

The query() method is used to filter rows of a DataFrame based on a query expression. The query expression is a string that can contain variables, comparison operators, and logical operators.

In [11]:
# Filter the DataFrame to select rows where the 'body_mass_g' column is greater than 4000 and the 'species' column is equal to 'Gentoo'
penguin_df.query("body_mass_g > 4000 and species == 'Gentoo'")

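|     | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    | year |
|-----|-------|---------|--------|----------------|---------------|-------------------|-------------|--------|------|
| 152 | 153   | Gentoo  | Biscoe | 46.1           | 13.2          | 211.0             | 4500.0      | female | 2007 |
| 153 | 154   | Gentoo  | Biscoe | 50.0           | 16.3          | 230.0             | 5700.0      | male   | 2007 |
| 154 | 155   | Gentoo  | Biscoe | 48.7           | 14.1          | 210.0             | 4450.0      | female | 2007 |
| 155 | 156   | Gentoo  | Biscoe | 50.0           | 15.2          | 218.0             | 5700.0      | male   | 2007 |
| 156 | 157   | Gentoo  | Biscoe | 47.6           | 14.5          | 215.0             | 5400.0      | male   | 2007 |
| ... | ...   | ...     | ...    | ...            | ...           | ...               | ...         | ...    | ...  |
| 270 | 271   | Gentoo  | Biscoe | 47.2           | 13.7          | 214.0             | 4925.0      | female | 2009 |
| 272 | 273   | Gentoo  | Biscoe | 46.8           | 14.3          | 215.0             | 4850.0      | female | 2009 |
| 273 | 274   | Gentoo  | Biscoe | 50.4           | 15.7          | 222.0             | 5750.0      | male   | 2009 |
| 274 | 275   | Gentoo  | Biscoe | 45.2           | 14.8          | 212.0             | 5200.0      | female | 2009 |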
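
A query string can also refer to a Python variable by prefixing its name with `@`, and the same filter can be written with a boolean mask. The sketch below uses toy data (made-up values) and a hypothetical threshold variable `min_mass` to show that both forms select the same rows.

```python
import pandas as pd

# Toy data (made-up values), used only to compare the two filtering styles.
toy_df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 4500.0, 3900.0, 3800.0],
})

min_mass = 4000  # hypothetical threshold, referenced in the query with @

# Filter with a query string.
via_query = toy_df.query("body_mass_g > @min_mass and species == 'Gentoo'")

# Equivalent filter with a boolean mask; each condition needs parentheses.
via_mask = toy_df[(toy_df["body_mass_g"] > min_mass) & (toy_df["species"] == "Gentoo")]

print(via_query)
```

The mask form is more verbose but works in contexts where a string expression is inconvenient, for example when column names contain characters that `query()` cannot parse without backticks.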


# .groupby()

`.groupby()` allows you to group large sets of data and compute operations on these groups.

The groupby operation involves some combination of splitting the data into groups, applying a function to each group independently, and then combining the results.
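
To make the split, apply, and combine steps concrete, the sketch below carries them out by hand on toy data (made-up values) and compares the result with the one-line `groupby` call.

```python
import pandas as pd

# Toy data (made-up values), used only to illustrate split-apply-combine.
toy_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.0, 41.0, 46.0, 48.0],
})

# Split: iterating over a groupby yields (group name, sub-DataFrame) pairs.
# Apply: compute the mean within each group.
# Combine: collect the per-group results in one dictionary.
manual = {}
for name, group in toy_df.groupby("species"):
    manual[name] = group["bill_length_mm"].mean()

# The same three steps in a single expression.
grouped_means = toy_df.groupby("species")["bill_length_mm"].mean()

print(manual)
print(grouped_means)
```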


In [12]:
#What is the mean bill length of each penguin species?
penguin_grouped_by_species=penguin_df.groupby('species')['bill_length_mm'].mean().reset_index()
penguin_grouped_by_species

|   | species   | bill_length_mm |
|---|-----------|----------------|
| 0 | Adelie    | 38.791391      |
| 1 | Chinstrap | 48.833824      |
| 2 | Gentoo    | 47.504878      |


In [13]:
#What is the median body mass of male and female penguins of each species in the dataset?
penguin_df.groupby(['species', 'sex'])['body_mass_g'].median().reset_index()


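|   | species   | sex    | body_mass_g |
|---|-----------|--------|-------------|
| 0 | Adelie    | female | 3400.0      |
| 1 | Adelie    | male   | 4000.0      |
| 2 | Chinstrap | female | 3550.0      |
| 3 | Chinstrap | male   | 3950.0      |
| 4 | Gentoo    | female | 4700.0      |
| 5 | Gentoo    | male   | 5500.0      |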
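
Note that rows whose grouping key is missing (such as penguins with no recorded sex) are dropped from the groups by default. The sketch below, on toy data with made-up values, shows the `dropna=False` option of `groupby` together with `.agg()`, which computes several summary statistics in one call.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values); one row has no recorded sex.
toy_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo"],
    "sex": ["male", "female", np.nan, "male"],
    "body_mass_g": [4000.0, 3400.0, 3600.0, 5500.0],
})

# Default behaviour: the row with a missing 'sex' key is excluded.
default_groups = toy_df.groupby(["species", "sex"])["body_mass_g"].agg(["count", "median"])

# dropna=False keeps missing keys as their own group.
with_missing = toy_df.groupby(["species", "sex"], dropna=False)["body_mass_g"].agg(
    ["count", "median"]
)

print(default_groups)
print(with_missing)
```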


# Exercise: Exploring Gene Expression Dataset with pandas

In this exercise, you will practice data manipulation and analysis skills using the pandas library in Python. You will explore a dataset of gene expression values using various pandas methods, including .info(), .query(), and .groupby().

Dataset:

The dataset you will be working with is a CSV file named 'gene_expression_dataset.csv', which contains 200 records of gene expression values across different categories and treatment groups. Each record includes a gene ID, expression levels, and various categorizations.

Tasks:

Dataset Overview:
* Load the gene expression dataset into a pandas DataFrame.
* Use the .info() method to understand the structure of the DataFrame. Note the number of entries, the data types of each column, and whether there are any missing values.

Filtering Data with Query:
* Use the .query() method to select the records where the Expression_Level is 'High' and the Treatment_Group is 'Treatment1'.
* Store the filtered data in a new DataFrame and display the first 5 rows.

Grouping and Aggregating Data:
* Use the .groupby() method to group the data by Treatment_Group.
* Calculate the mean of 'Category_A', 'Category_B', and 'Category_C' for each treatment group.
* Display the results in a well-structured format.

Deliverables:

* A Jupyter Notebook containing the completed tasks, including the code and output for each step.
* A brief summary of your findings from the .groupby() analysis, particularly noting any differences in mean values across treatment groups.

Hints:

* Remember to import pandas before starting your analysis.
* For the .query() method, ensure your query string is correctly formatted.
* After grouping the data, you can use .mean() to calculate the mean for the grouped data.

Load the dataset 

Here are the answers for the exercise:

1. **Dataset Overview:**
   - The dataset contains 200 entries across 7 columns. There are no missing values. The columns are as follows:
     - `Gene_ID`: object type
     - `Category_A`: float64
     - `Category_B`: float64
     - `Category_C`: float64
     - `Expression_Level`: object
     - `Treatment_Group`: object
     - `Time_Point`: object

2. **Filtering Data with Query:**
   - After filtering the data where `Expression_Level` is 'High' and `Treatment_Group` is 'Treatment1', the first five rows are:

    | Gene_ID | Category_A | Category_B | Category_C | Expression_Level | Treatment_Group | Time_Point |
    |---------|------------|------------|------------|------------------|-----------------|------------|
    | Gene_1  | 0.422814   | 0.719947   | 0.551052   | High             | Treatment1      | T0         |
    | Gene_12 | 0.733659   | 0.710027   | 0.053329   | High             | Treatment1      | T72        |
    | Gene_16 | 0.536511   | 0.262844   | 0.359716   | High             | Treatment1      | T0         |
    | Gene_22 | 0.022283   | 0.018544   | 0.465654   | High             | Treatment1      | T24        |
    | Gene_24 | 0.993123   | 0.523973   | 0.514865   | High             | Treatment1      | T0         |

3. **Grouping and Aggregating Data:**
   - The mean of 'Category_A', 'Category_B', and 'Category_C' for each treatment group is:

    | Treatment_Group | Category_A | Category_B | Category_C |
    |-----------------|------------|------------|------------|
    | Control         | 0.533084   | 0.458343   | 0.543662   |
    | Treatment1      | 0.552448   | 0.506707   | 0.485414   |
    | Treatment2      | 0.511768   | 0.515504   | 0.498134   |

From the grouping and aggregation, we can observe the differences in mean values across different treatment groups for each category, providing insights into how treatments may influence gene expression in different categories.

In [14]:
import pandas as pd

# Load the dataset
gene_df = pd.read_csv('gene_expression_dataset.csv')

# Task 1: Dataset Overview
print("Dataset Overview:")
gene_df.info()

# Task 2: Filtering Data with Query
print("\nFiltered Data (Expression_Level = High and Treatment_Group = Treatment1):")
filtered_df = gene_df.query('Expression_Level == "High" and Treatment_Group == "Treatment1"')
print(filtered_df.head())

# Task 3: Grouping and Aggregating Data
print("\nGrouped Data Mean Values by Treatment Group:")
grouped_data = gene_df.groupby('Treatment_Group')[['Category_A', 'Category_B', 'Category_C']].mean()
print(grouped_data)


Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Gene_ID           200 non-null    object 
 1   Category_A        200 non-null    float64
 2   Category_B        200 non-null    float64
 3   Category_C        200 non-null    float64
 4   Expression_Level  200 non-null    object 
 5   Treatment_Group   200 non-null    object 
 6   Time_Point        200 non-null    object 
dtypes: float64(3), object(4)
memory usage: 11.1+ KB

Filtered Data (Expression_Level = High and Treatment_Group = Treatment1):
    Gene_ID  Category_A  Category_B  Category_C Expression_Level  \
0    Gene_1    0.422814    0.719947    0.551052             High   
11  Gene_12    0.733659    0.710027    0.053329             High   
15  Gene_16    0.536511    0.262844    0.359716             High   
21  Gene_22    0.022283    0.018544    0.465654        