### Introduction to Data Analytics in Python

Data analytics is the process of extracting insights from raw data. In Python, we use pandas, NumPy, Matplotlib, and other libraries to clean, process, and visualize data. In bioinformatics, data analytics is crucial for analyzing genomic data, antimicrobial resistance (AMR) profiles, sequencing results, and variant interpretations.

Steps in Data Analytics
When analyzing biological datasets, we typically follow these key steps:

### Data Collection & Loading
Read data from CSV, VCF, FASTA, TSV, or databases.
Load data into Python using pandas.read_csv(), open(), or json.load().
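
As a minimal sketch of the loading calls listed above, the snippet below reads a small CSV string, a TSV string, and a JSON string into DataFrames. The inline text and its values are made up for illustration (they are not the course dataset), and `io.StringIO` stands in for a real file path so the example runs without any files on disk. `json.loads` is the string counterpart of `json.load`.

```python
import io
import json

import pandas as pd

# Hypothetical inline data standing in for files on disk
csv_text = "Sample_ID,Gene,MIC\nS1,mecA,16\nS2,tetM,\n"
tsv_text = "Sample_ID\tSpecies\nS1\tS. aureus\nS2\tE. faecalis\n"
json_text = '[{"Sample_ID": "S1", "Gene": "mecA"}, {"Sample_ID": "S2", "Gene": "tetM"}]'

csv_df = pd.read_csv(io.StringIO(csv_text))            # comma-separated by default
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-separated
json_df = pd.DataFrame(json.loads(json_text))          # list of records -> rows

print(csv_df)
print(tsv_df)
print(json_df)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.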

### Data Exploration & Inspection
Check dataset structure using .head(), .info(), .describe().
Identify missing values using .isnull().sum().
Understand data types (int, float, str, bool).
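
The inspection calls above can be tried on a tiny DataFrame first. This is only a sketch: `toy_df` and its three rows are typed in by hand for illustration and are not the course dataset.

```python
import pandas as pd

# Toy rows typed in by hand for illustration (not the course dataset)
toy_df = pd.DataFrame({
    "Gene": ["mecA", "tetM", "vanA"],
    "Detected": [True, False, True],
    "MIC": [16.0, None, 32.0],
})

print(toy_df.head(2))         # first two rows
toy_df.info()                 # prints column names, non-null counts, and dtypes
print(toy_df.describe())      # summary statistics for numeric columns
print(toy_df.isnull().sum())  # missing values per column
print(toy_df.dtypes)          # one dtype per column
```

Note that `.info()` prints its report directly and returns `None`, so it does not need to be wrapped in `print()`.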

### Data Cleaning
Handle missing values (fillna(), dropna()).
Convert data types (astype() for numerical operations).
Remove duplicates (drop_duplicates()).
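
A short sketch of these three cleaning steps on made-up data follows. The DataFrame `raw_df` is hypothetical: it has one repeated row, one missing MIC, and MIC values stored as text, so each call has something to do.

```python
import pandas as pd

# Toy rows typed in by hand for illustration (not the course dataset)
raw_df = pd.DataFrame({
    "Gene": ["mecA", "mecA", "tetM", "vanA"],
    "MIC": ["16", "16", None, "32"],  # numbers stored as text, one missing
})

deduped_df = raw_df.drop_duplicates()             # drops the repeated mecA row
complete_df = deduped_df.dropna(subset=["MIC"])   # drops the row with no MIC
complete_df = complete_df.astype({"MIC": float})  # text -> float for arithmetic

print(complete_df)
print(complete_df["MIC"].mean())  # 24.0
```

Here `dropna()` is used for the missing value; `fillna()` would keep the row and substitute a value instead.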

### Data Filtering & Transformation
Filter specific genes, variants, or resistance markers using .loc[].
Create new columns with calculations (df["new_col"] = df["col1"] * 2).
Aggregate and group data (groupby()).
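
The three operations above can be sketched as follows. The rows in `toy_df` are typed in by hand for illustration, and the column name `MIC_doubled` is a made-up example of a derived column.

```python
import pandas as pd

# Toy rows typed in by hand for illustration (not the course dataset)
toy_df = pd.DataFrame({
    "Gene": ["blaCTX-M", "blaTEM", "mecA", "vanA"],
    "Resistance_Class": ["Beta-lactam", "Beta-lactam", "Methicillin", "Vancomycin"],
    "MIC": [4.0, 2.0, 16.0, 32.0],
})

# .loc[row_condition, columns]: keep beta-lactam rows, two columns only
beta_df = toy_df.loc[toy_df["Resistance_Class"] == "Beta-lactam", ["Gene", "MIC"]]

# New column computed from an existing one
toy_df["MIC_doubled"] = toy_df["MIC"] * 2

# Mean MIC per resistance class
mean_mic = toy_df.groupby("Resistance_Class")["MIC"].mean()

print(beta_df)
print(mean_mic)
```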

### Data Visualization (Next Module)
Generate bar charts, histograms, box plots using matplotlib and seaborn.
Create scatter plots for gene expression.
Plot variant frequencies in sequencing data.

### Insights & Reporting (Next Module)
Identify trends and patterns in AMR datasets.
Export processed data to CSV or JSON for further analysis.
Generate summary statistics for decision-making.

# Let's dive in!

In [4]:
# Starting by importing pandas and loading the CSV file.

import pandas as pd

# Load a CSV file
amr_df = pd.read_csv("data/Synthetic_AMR_Dataset.csv")

# Display the first few rows
print(amr_df.head())


  Sample_ID      Gene Resistance_Class  Detected  MIC_µg/mL
0  Sample_1  blaCTX-M      Beta-lactam      True        4.0
1  Sample_2      mecA      Methicillin      True       16.0
2  Sample_3      tetM     Tetracycline     False        NaN
3  Sample_4     aadA1   Aminoglycoside      True        8.0
4  Sample_5      ermB        Macrolide     False        NaN


In [None]:
# CODE PLAYGROUND

In [None]:
# How to load a TSV file
tsv_df = pd.read_csv("metadata.tsv", sep="\t")

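# Show basic info (.info() prints its report itself and returns None,
# so wrapping it in print() would add a stray "None" line)
tsv_df.info()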


In [None]:
# CODE PLAYGROUND

In [1]:
# How to load a FASTA file.
# WARNING: FASTA files can be very large, and printing every record might crash your notebook!

from Bio import SeqIO  # Requires Biopython

# Load a FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}, Sequence: {record.seq[:50]}...")  # Print first 50 bases


FileNotFoundError: [Errno 2] No such file or directory: 'sequences.fasta'
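
The error above means that `sequences.fasta` is not in the working directory. The sketch below checks that the file exists first and then looks at only the first few records, which also helps with large files. It is an illustration, not Biopython: `parse_fasta` is a hypothetical, standard-library-only stand-in for `SeqIO.parse`, and the limit of 5 records is an arbitrary choice. Since `SeqIO.parse` also returns an iterator, wrapping it in `islice` in the same way should work too.

```python
from itertools import islice
from pathlib import Path


def parse_fasta(lines):
    """Yield (record_id, sequence) pairs one at a time from FASTA-formatted lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


fasta_path = Path("sequences.fasta")  # same file name as in the cell above
if fasta_path.exists():
    with open(fasta_path) as handle:
        # islice stops after 5 records, so later records are never parsed
        for record_id, seq in islice(parse_fasta(handle), 5):
            print(f"ID: {record_id}, Sequence: {seq[:50]}...")
else:
    print(f"{fasta_path} not found; check the path before parsing.")
```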

In [None]:
# CODE PLAYGROUND

In [6]:
# Let's continue with amr_df

amr_df.head()

  Sample_ID      Gene Resistance_Class  Detected  MIC_µg/mL
0  Sample_1  blaCTX-M      Beta-lactam      True        4.0
1  Sample_2      mecA      Methicillin      True       16.0
2  Sample_3      tetM     Tetracycline     False        NaN
3  Sample_4     aadA1   Aminoglycoside      True        8.0
4  Sample_5      ermB        Macrolide     False        NaN


In [7]:
amr_df.describe()

       MIC_µg/mL
count        7.0
mean    9.285714
std    11.294752
min          1.0
25%          2.0
50%          4.0
75%         12.0
max         32.0


In [10]:
renamed_amr_df = amr_df.rename(columns={"Gene":"AMR_Gene"})

renamed_amr_df.head()

  Sample_ID  AMR_Gene Resistance_Class  Detected  MIC_µg/mL
0  Sample_1  blaCTX-M      Beta-lactam      True        4.0
1  Sample_2      mecA      Methicillin      True       16.0
2  Sample_3      tetM     Tetracycline     False        NaN
3  Sample_4     aadA1   Aminoglycoside      True        8.0
4  Sample_5      ermB        Macrolide     False        NaN


In [None]:
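# Renaming all the columns at once
# (work on a copy so later cells can still use the original column names)

renamed_all_df = amr_df.copy()
renamed_all_df.columns = ["Sample", "AMR_Gene", "Resistance_Type", "Detected", "MIC"]
print(renamed_all_df.head())

renamed_all_df.to_csv("renamed_amr_data.csv", index=False)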


In [None]:
# CODE PLAYGROUND

In [11]:
import pandas as pd

# Count missing values in each column
print(amr_df.isnull().sum())


Sample_ID           0
Gene                0
Resistance_Class    0
Detected            0
MIC_µg/mL           3
dtype: int64


In [13]:
# Let's find the rows that contain at least one missing value

print(amr_df[amr_df.isnull().any(axis=1)])


  Sample_ID    Gene Resistance_Class  Detected  MIC_µg/mL
2  Sample_3    tetM     Tetracycline     False        NaN
4  Sample_5    ermB        Macrolide     False        NaN
7  Sample_8  blaOXA      Beta-lactam     False        NaN


In [15]:
# Let's check missing values in a specific column

print(amr_df[amr_df["MIC_µg/mL"].isnull()])


  Sample_ID    Gene Resistance_Class  Detected  MIC_µg/mL
2  Sample_3    tetM     Tetracycline     False        NaN
4  Sample_5    ermB        Macrolide     False        NaN
7  Sample_8  blaOXA      Beta-lactam     False        NaN


In [16]:
# Checking if the entire dataset has missing values

print(amr_df.isnull().values.any())

# What is the output?


True


In [18]:
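# Drop every row that has at least one missing value
amr_df_cleaned = amr_df.dropna()
# .sum() adds numbers, counts True values, and concatenates strings
amr_df_cleaned.head().sum()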


Sample_ID                    Sample_1Sample_2Sample_4Sample_6Sample_7
Gene                                      blaCTX-MmecAaadA1blaTEMvanA
Resistance_Class    Beta-lactamMethicillinAminoglycosideBeta-lacta...
Detected                                                            5
MIC_µg/mL                                                        62.0
dtype: object

In [19]:
amr_df.sum()

Sample_ID           Sample_1Sample_2Sample_3Sample_4Sample_5Sample...
Gene                blaCTX-MmecAtetMaadA1ermBblaTEMvanAblaOXAmphAsul1
Resistance_Class    Beta-lactamMethicillinTetracyclineAminoglycosi...
Detected                                                            7
MIC_µg/mL                                                        65.0
dtype: object
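
In the two outputs above, the text columns are concatenated because `.sum()` applies `+` to every column. A minimal sketch of one way to restrict the sum to columns where it is meaningful is shown below. The DataFrame `head_df` is rebuilt by hand from the first five rows printed earlier in this notebook (with `Resistance_Class` left out for brevity), so its totals differ from those of the full dataset.

```python
import pandas as pd

# First five rows, copied from the head() output earlier in this notebook
head_df = pd.DataFrame({
    "Sample_ID": ["Sample_1", "Sample_2", "Sample_3", "Sample_4", "Sample_5"],
    "Gene": ["blaCTX-M", "mecA", "tetM", "aadA1", "ermB"],
    "Detected": [True, True, False, True, False],
    "MIC_µg/mL": [4.0, 16.0, None, 8.0, None],
})

# Restrict the sum to numeric and boolean columns; NaN is skipped by default
totals = head_df.sum(numeric_only=True)
print(totals)

# Summing a single column is often clearer
print(head_df["MIC_µg/mL"].sum())  # 28.0
print(head_df["Detected"].sum())   # 3 samples with the gene detected
```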

In [21]:
# Rather than deleting those rows, can we fill the missing values with a number (here, 0)?

amr_df["MIC_µg/mL"] = amr_df["MIC_µg/mL"].fillna(0)
amr_df.head()

  Sample_ID      Gene Resistance_Class  Detected  MIC_µg/mL
0  Sample_1  blaCTX-M      Beta-lactam      True        4.0
1  Sample_2      mecA      Methicillin      True       16.0
2  Sample_3      tetM     Tetracycline     False        0.0
3  Sample_4     aadA1   Aminoglycoside      True        8.0
4  Sample_5      ermB        Macrolide     False        0.0
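
Filling with 0 changes summary statistics, because the zeros are then counted as real measurements. The sketch below compares three options on the MIC values of the first five rows shown earlier (copied by hand, so the numbers differ from the full dataset). Which option is appropriate depends on the analysis; the median fill is shown only as one possible alternative.

```python
import pandas as pd

# MIC values of the first five rows, copied from the head() output earlier
mic = pd.Series([4.0, 16.0, None, 8.0, None], name="MIC_µg/mL")

mean_skipping_nan = mic.mean()            # NaN ignored: 28 / 3
mean_zero_filled = mic.fillna(0).mean()   # zeros counted: 28 / 5
median_filled = mic.fillna(mic.median())  # fill with the median (8.0) instead

print(round(mean_skipping_nan, 2))  # 9.33
print(mean_zero_filled)             # 5.6
print(median_filled.tolist())       # [4.0, 16.0, 8.0, 8.0, 8.0]
```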


In [25]:
# Okay, now let's filter the rows by their MIC values

# First, make sure the values are numeric

amr_df["MIC_µg/mL"] = pd.to_numeric(amr_df["MIC_µg/mL"], errors="coerce")

# Second is the filtering
filtered_mic_df = amr_df[amr_df["MIC_µg/mL"] > 8]
print(filtered_mic_df)


  Sample_ID  Gene Resistance_Class  Detected  MIC_µg/mL
1  Sample_2  mecA      Methicillin      True       16.0
6  Sample_7  vanA       Vancomycin      True       32.0
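
Two details of this filter are easy to miss: `> 8` leaves out a value of exactly 8.0, and after `fillna(0)` the filled rows behave like real zeros in any comparison. The sketch below illustrates both on the first five rows (copied by hand from the outputs above, after filling with 0) and shows how to combine conditions with `&`. The variable names are made up for this example.

```python
import pandas as pd

# Rows copied from the outputs above, after missing MICs were filled with 0
df = pd.DataFrame({
    "Gene": ["blaCTX-M", "mecA", "tetM", "aadA1", "ermB"],
    "Detected": [True, True, False, True, False],
    "MIC_µg/mL": [4.0, 16.0, 0.0, 8.0, 0.0],
})

strictly_above = df[df["MIC_µg/mL"] > 8]  # excludes aadA1 (exactly 8.0)
at_or_above = df[df["MIC_µg/mL"] >= 8]    # includes aadA1

# Combine conditions with & and wrap each one in parentheses;
# requiring Detected keeps the zero-filled rows out of the "low MIC" group
low_but_detected = df[(df["MIC_µg/mL"] < 8) & (df["Detected"])]

print(strictly_above["Gene"].tolist())    # ['mecA']
print(at_or_above["Gene"].tolist())       # ['mecA', 'aadA1']
print(low_but_detected["Gene"].tolist())  # ['blaCTX-M']
```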


In [None]:
# CODE PLAYGROUND

In [30]:
# How can we change the type of the data? Let's see

# Convert boolean column 'Detected' to object type
amr_df["Detected"] = amr_df["Detected"].astype(object)

# Display the updated dataframe
print(amr_df.head())

print(amr_df.dtypes)



  Sample_ID      Gene Resistance_Class Detected  MIC_µg/mL
0  Sample_1  blaCTX-M      Beta-lactam     True        4.0
1  Sample_2      mecA      Methicillin     True       16.0
2  Sample_3      tetM     Tetracycline    False        0.0
3  Sample_4     aadA1   Aminoglycoside     True        8.0
4  Sample_5      ermB        Macrolide    False        0.0
Sample_ID            object
Gene                 object
Resistance_Class     object
Detected             object
MIC_µg/mL           float64
dtype: object
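
`astype()` accepts other target types as well. As a small sketch on a hand-typed boolean column (not the course dataset), the conversions below turn True/False into 1/0, into text, and back into booleans.

```python
import pandas as pd

# Toy column typed in by hand for illustration
detected = pd.Series([True, True, False], name="Detected")

as_int = detected.astype(int)       # True/False -> 1/0
as_text = detected.astype(str)      # True/False -> "True"/"False"
back_to_bool = as_int.astype(bool)  # 1/0 -> True/False

print(as_int.tolist())        # [1, 1, 0]
print(as_text.tolist())       # ['True', 'True', 'False']
print(back_to_bool.tolist())  # [True, True, False]
```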


In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND

# Problem 1: Identifying Highly Resistant AMR Genes
### Scenario

You are analyzing **Antimicrobial Resistance (AMR) data** and need to identify **genes with high resistance levels**. A gene is considered "highly resistant" if the **MIC value is greater than 16 µg/mL**.

### Task
Write a Python program that:
1. **Reads an AMR dataset (CSV file)**.
2. **Filters AMR genes where MIC_µg/mL > 16**.
3. **Writes the filtered dataset to a new CSV file**.


In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND


---

# Problem 2: Handling Missing Values in AMR Dataset
### Scenario
Your dataset contains missing **MIC values**, and you want to handle them by replacing missing values with `"Not Available"`.

### Task
Write a Python program that:
1. **Reads an AMR dataset (CSV file)**.
2. **Identifies missing values** in the `MIC_µg/mL` column.
3. **Replaces missing values with `"Not Available"`**.
4. **Writes the cleaned dataset to a new CSV file**.

### Example Input (AMR Dataset Sample)


In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND

In [None]:
# If you feel stuck, save your work and come back to it later.



In [None]:

## **Solution for Problem 1: Identifying Highly Resistant AMR Genes**

import pandas as pd

# Load the AMR dataset
amr_df = pd.read_csv("data/amr_dataset.csv")

# Convert MIC column to numeric (entries that cannot be parsed become NaN)
amr_df["MIC_µg/mL"] = pd.to_numeric(amr_df["MIC_µg/mL"], errors="coerce")

# Filter AMR genes with MIC > 16
high_resistance_df = amr_df[amr_df["MIC_µg/mL"] > 16]

# Save filtered dataset
high_resistance_df.to_csv("high_resistance_amr.csv", index=False)

print("Filtered highly resistant AMR genes saved to 'high_resistance_amr.csv'")


In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND

In [None]:
## **Solution for Problem 2: Handling Missing Values in AMR Dataset**

import pandas as pd

# Load the AMR dataset
amr_df = pd.read_csv("data/amr_dataset.csv")

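# Identify missing values in the MIC column
print("Missing MIC values:", amr_df["MIC_µg/mL"].isnull().sum())

# Replace missing values in MIC column with "Not Available"
amr_df["MIC_µg/mL"] = amr_df["MIC_µg/mL"].fillna("Not Available")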

# Save the cleaned dataset
amr_df.to_csv("cleaned_amr_data.csv", index=False)

print("Cleaned AMR dataset saved to 'cleaned_amr_data.csv'")
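
One side effect of this solution is that the MIC column now mixes numbers and text, so it can no longer be used directly in numeric filters. The sketch below shows one possible way to recover the numbers when reading the file back: the `na_values` argument of `pd.read_csv`. The two rows are typed in by hand for illustration, and an in-memory buffer is used instead of `cleaned_amr_data.csv` so the example runs on its own.

```python
import io

import pandas as pd

# Toy rows typed in by hand for illustration (not the course dataset)
df = pd.DataFrame({"Gene": ["mecA", "tetM"], "MIC_µg/mL": [16.0, None]})
df["MIC_µg/mL"] = df["MIC_µg/mL"].fillna("Not Available")
print(df["MIC_µg/mL"].dtype)  # object: the column now mixes numbers and text

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# na_values tells read_csv to treat the placeholder as missing again
reloaded = pd.read_csv(buffer, na_values=["Not Available"])
print(reloaded["MIC_µg/mL"].dtype)  # float64
```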


In [None]:
# CODE PLAYGROUND

In [None]:
# CODE PLAYGROUND

## Any Questions?