1. Assignment: DNA Sequence Analysis. Task: Analyze a given DNA sequence and perform basic sequence
manipulation, including finding motifs, calculating GC content, and identifying coding regions. Deliverable: A
report summarizing the analysis results and any insights gained from the sequence.

In [None]:
import re

# Read the FASTA file: the first line is the header, the remaining lines hold the sequence.
with open("BI_1_sequence.fasta") as file:
    lines = file.readlines()
sequence = ''.join(line.strip() for line in lines[1:]).upper()
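
The cell above assumes the file holds a single record with its header on the first line. As a minimal sketch for files with several records, assuming standard `>`-prefixed headers, the reader below collects each record separately. The helper name `read_fasta` and the toy records are made up for illustration and are not part of the assignment data.

```python
import io

def read_fasta(handle):
    """Return a dict mapping each header (without '>') to its uppercase sequence."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}

# Toy input standing in for a real FASTA file (made-up records).
example = io.StringIO(">seq1 demo\nacgt\nACGT\n\n>seq2\nggcc\n")
records = read_fasta(example)
print(records)
```

Passing an open file object instead of the `io.StringIO` stand-in would work the same way.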

In [None]:
g_count = sequence.count('G')
c_count = sequence.count('C')
total_count = len(sequence)
gc_percent = ((g_count + c_count) / total_count) * 100
print(f"G Count: {g_count}")
print(f"C Count: {c_count}")
print(f"Total Count: {total_count}")
print(f"GC Percent: {gc_percent:.2f}%")

G Count: 1130
C Count: 1067
Total Count: 3598
GC Percent: 61.06%
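
The percentage above divides by the full sequence length, which is fine here because the four base counts in this notebook sum to the total of 3598. For sequences that contain ambiguous bases such as `N`, one possible variant divides by the called bases only. The function name `gc_content` and the toy sequence are assumptions for illustration.

```python
def gc_content(seq):
    """Return the GC percentage over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    if called == 0:
        return 0.0  # avoid dividing by zero on empty or all-N input
    return 100.0 * gc / called

# Toy sequence with two ambiguous bases (made up for illustration).
print(f"GC Percent: {gc_content('ATGCGCNNAT'):.2f}%")
```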


In [None]:
a_count = sequence.count('A')
t_count = sequence.count('T')
at_percent = ((a_count + t_count) / total_count) * 100
print(f"A Count: {a_count}")
print(f"T Count: {t_count}")
print(f"AT Percent: {at_percent:.2f}%")

A Count: 667
T Count: 734
AT Percent: 38.94%


In [None]:
ratio = (a_count + t_count) / (g_count + c_count)
print(f"AT/GC Ratio: {ratio:.2f}")

AT/GC Ratio: 0.64


In [None]:
def find_motifs(sequence, motif):
    print(f"\nSearching for motif: {motif}")
    matches = [match.start() for match in re.finditer(motif, sequence)]
    if matches:
        print(f"Motif '{motif}' found at positions: {matches}")
    else:
        print(f"Motif '{motif}' not found")

In [None]:
motif = "TATAA"
find_motifs(sequence, motif)


Searching for motif: TATAA
Motif 'TATAA' found at positions: [1799]
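
Two details of `find_motifs` are worth keeping in mind: `re.finditer` reports non-overlapping matches only, and the motif string is interpreted as a regular expression. The positions are 0-based. A hedged sketch of a variant that treats the motif literally and also reports overlapping hits, using a zero-width lookahead, is shown below. The name `find_motif_positions` and the toy sequence are assumptions for illustration.

```python
import re

def find_motif_positions(sequence, motif):
    """Return 0-based start positions of all literal matches, overlapping ones included."""
    # A lookahead matches without consuming characters, so overlapping hits are kept.
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [m.start() for m in pattern.finditer(sequence)]

toy = "ATATATA"  # made-up sequence with overlapping occurrences of ATA
overlapping = find_motif_positions(toy, "ATA")
non_overlapping = [m.start() for m in re.finditer("ATA", toy)]
print(overlapping)
print(non_overlapping)
```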


In [None]:
start_codon = 'ATG'
stop_codons = ['TAA', 'TAG', 'TGA']
coding_regions = []
start_index = sequence.find(start_codon)
print(start_index)

73


In [None]:
while start_index != -1:
    for stop_codon in stop_codons:
        stop_index = sequence.find(stop_codon, start_index + 3)
        if stop_index != -1 and (stop_index - start_index) % 3 == 0:
            coding_region = sequence[start_index:stop_index + 3]
            coding_regions.append(coding_region)
            break
    start_index = sequence.find(start_codon, start_index + 1)
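
Note that the loop above looks only at the first occurrence of each stop codon, in list order (TAA, then TAG, then TGA), so a reported region can contain an earlier in-frame stop. Region 4 in the output below, for example, begins with ATG followed directly by TGA. A minimal sketch of an alternative that walks the reading frame codon by codon and ends each region at the first in-frame stop is given here. The name `find_orfs` and the toy sequence are assumptions, and this is one possible approach rather than the method required by the assignment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, start_codon="ATG"):
    """Return (start, end, orf) tuples; each ORF ends at the first in-frame stop codon."""
    orfs = []
    start = sequence.find(start_codon)
    while start != -1:
        # Walk codon by codon in the reading frame set by this start codon.
        for pos in range(start + 3, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3, sequence[start:pos + 3]))
                break
        start = sequence.find(start_codon, start + 1)
    return orfs

# Toy sequence (made up) with two start codons.
orfs = find_orfs("CCATGTGAAGGATGAAATAGCC")
for start, end, orf in orfs:
    print(start, end, orf)
```

Start codons with no in-frame stop before the end of the sequence are skipped in this sketch.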

In [None]:
if coding_regions:
    print("Coding Regions Found")
    for i, coding_region in enumerate(coding_regions, start=1):
        print(f"\nRegion {i}: {coding_region}\nLength: {len(coding_region)}")
else:
    print("No coding regions found")

Coding Regions Found

Region 1: ATGAGCTCAGGGGCCTCTAGAAAGAGCTGGGACCCTGGGAACCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACTCGGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCAAGCAGCAGGGGACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCAGCTGGACTGGGATGTAA
Length: 210

Region 2: ATGTAA
Length: 6

Region 3: ATGGAACACGGCGCTTAA
Length: 18

Region 4: ATGTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGTGGGGGGATGGAATTTGAACCCCGGGAGAGGAAGATGGAATTTTCTATGGAGGCCGACCTGGGGATGGGGAGATAAGAGAAGACCAGGAGGGAGTTAAATAG
Length: 165

Region 5: ATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGTGGGGGGATGGAATTTGAACCCCGGGAGAGGAAGATGGAATTTTCTATGGAGGCCGACCTGGGGATGGGGAGATAA
Length: 126

Region 6: ATGCGAGACTGGGACTGA
Length: 18

Region 7: ATGGAACCGGCGGTGGGGAGGGGGTGGGGGGATGGAATTTGAACCCCGGGAGAGGAAGATGGAATTTTCTATGGAGGCCGACCTGGGGATGGGGAGATAAGAGAAGACCAGGAGGGAGTTAAATAG
Length: 126

Region 8: ATGGAATTTGAACCCCGGGAGAGGAAGATGGAATTTTCTATGGAGGCCGACCTGGGGATGGGGAGATAA
Length: 69

Region 9: ATGGAATTTTCTATGGAGGCCGACCTGGGGATGGGGAGATAA


2. Assignment: RNA-Seq Data Analysis. Task: Analyze a provided RNA-Seq dataset and perform differential
gene expression analysis. Deliverable: A detailed report presenting the differentially expressed genes, their
functional annotations, and any potential biological interpretations.

In [None]:
#R Studio

3. Assignment: Protein Structure Prediction. Task: Predict the 3D structure of a given protein sequence using
homology modeling or threading techniques. Deliverable: A report presenting the predicted protein structure,
along with an analysis of its potential functions and interactions.

In [None]:
#Tools

5. Assignment: Machine Learning for Genomic Data. Task: Apply machine learning algorithms, such as random
forests or support vector machines, to classify genomic data based on specific features or markers. Deliverable:
A comprehensive analysis report presenting the classification results, model performance evaluation, and
insights into the predictive features.

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

In [None]:
# Step 2: Load the dataset
data = pd.read_csv('METABRIC_RNA_Mutation.csv')

In [None]:
# Step 3: Explore the dataset
print(data.head())

   patient_id  age_at_diagnosis type_of_breast_surgery    cancer_type  \
0           0             75.65             MASTECTOMY  Breast Cancer   
1           2             43.19      BREAST CONSERVING  Breast Cancer   
2           5             48.87             MASTECTOMY  Breast Cancer   
3           6             47.68             MASTECTOMY  Breast Cancer   
4           8             76.97             MASTECTOMY  Breast Cancer   

                        cancer_type_detailed cellularity  chemotherapy  \
0           Breast Invasive Ductal Carcinoma         NaN             0   
1           Breast Invasive Ductal Carcinoma        High             0   
2           Breast Invasive Ductal Carcinoma        High             1   
3  Breast Mixed Ductal and Lobular Carcinoma    Moderate             1   
4  Breast Mixed Ductal and Lobular Carcinoma        High             1   

  pam50_+_claudin-low_subtype  cohort er_status_measured_by_ihc  ... mtap_mut  \
0                 claudin-low     1

In [None]:
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 715 entries, 0 to 714
Data columns (total 693 columns):
 #    Column                          Dtype  
---   ------                          -----  
 0    patient_id                      int64  
 1    age_at_diagnosis                float64
 2    type_of_breast_surgery          object 
 3    cancer_type                     object 
 4    cancer_type_detailed            object 
 5    cellularity                     object 
 6    chemotherapy                    int64  
 7    pam50_+_claudin-low_subtype     object 
 8    cohort                          float64
 9    er_status_measured_by_ihc       object 
 10   er_status                       object 
 11   neoplasm_histologic_grade       float64
 12   her2_status_measured_by_snp6    object 
 13   her2_status                     object 
 14   tumor_other_histologic_subtype  object 
 15   hormone_therapy                 int64  
 16   inferred_menopausal_state       object 
 17   integrative_cl

In [None]:
columns = data.columns
columns

Index(['patient_id', 'age_at_diagnosis', 'type_of_breast_surgery',
       'cancer_type', 'cancer_type_detailed', 'cellularity', 'chemotherapy',
       'pam50_+_claudin-low_subtype', 'cohort', 'er_status_measured_by_ihc',
       ...
       'mtap_mut', 'ppp2cb_mut', 'smarcd1_mut', 'nras_mut', 'ndfip1_mut',
       'hras_mut', 'prps2_mut', 'smarcb1_mut', 'stmn2_mut', 'siah1_mut'],
      dtype='object', length=693)

In [None]:
for i, column in enumerate(columns, start=1):
  print(f"{i}: {column} has unique values: {data[column].nunique()}")

1: patient_id has unique values: 715
2: age_at_diagnosis has unique values: 664
3: type_of_breast_surgery has unique values: 2
4: cancer_type has unique values: 2
5: cancer_type_detailed has unique values: 6
6: cellularity has unique values: 3
7: chemotherapy has unique values: 2
8: pam50_+_claudin-low_subtype has unique values: 7
9: cohort has unique values: 2
10: er_status_measured_by_ihc has unique values: 2
11: er_status has unique values: 2
12: neoplasm_histologic_grade has unique values: 3
13: her2_status_measured_by_snp6 has unique values: 3
14: her2_status has unique values: 2
15: tumor_other_histologic_subtype has unique values: 8
16: hormone_therapy has unique values: 2
17: inferred_menopausal_state has unique values: 2
18: integrative_cluster has unique values: 11
19: primary_tumor_laterality has unique values: 2
20: lymph_nodes_examined_positive has unique values: 27
21: mutation_count has unique values: 19
22: nottingham_prognostic_index has unique values: 198
23: oncotree

In [None]:
data = data.set_index('patient_id')
df_expression = data.iloc[:, 30:519].join(data['overall_survival'], how='inner')
df_expression

patient_id,brca1,brca2,palb2,pten,tp53,atm,cdh1,chek2,nbn,nf1,...,srd5a2,srd5a3,st7,star,tnk2,tulp4,ugt2b15,ugt2b17,ugt2b7,overall_survival
0,-1.3990,-0.5738,-1.6217,1.4524,0.3504,1.1517,0.0348,0.1266,-0.8361,-0.8578,...,-0.0194,-1.6345,-0.2142,-0.5698,-1.1741,-1.4779,-0.5954,-0.8847,-0.3354,1
2,-1.3800,0.2777,-1.2154,0.5296,-0.0136,-0.2659,1.3594,0.7961,0.5419,-2.6059,...,0.4534,0.4068,0.7634,0.0231,0.9121,-0.9538,-0.2264,0.5398,-0.8920,1
5,0.0670,-0.8426,0.2114,-0.3326,0.5141,-0.0803,1.1398,0.4187,-0.4030,-1.1305,...,0.0668,0.8344,1.7227,0.4024,-3.7172,-1.5538,1.3701,-0.1078,0.3655,0
6,0.6744,-0.5428,-1.6592,0.6369,1.6708,-0.8880,1.2491,-1.1889,-0.4174,-0.6165,...,-0.7078,0.8228,0.6819,-0.1948,-2.3286,-0.9924,-0.3154,0.2320,-0.4828,1
8,1.2932,-0.9039,-0.7219,0.2168,0.3484,0.3897,0.9131,0.9356,0.7675,-0.2940,...,-0.3544,-1.0150,2.2961,0.1817,-0.1572,0.0427,5.0048,3.8476,1.3223,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3436,-0.2133,0.3584,-0.6478,0.1600,0.3098,-1.2561,1.9327,-0.8945,-0.4816,-0.6254,...,-0.5488,0.5662,0.9458,0.6918,-1.1666,0.3349,1.2187,1.6076,0.4282,0
3437,0.4405,-0.3679,-1.1658,-0.5324,-0.1581,0.8400,1.0522,-0.1539,-0.9428,0.3777,...,0.8002,0.8992,2.0359,0.2968,0.3446,3.4423,-0.8026,0.0751,-0.9454,1
3439,-0.5298,0.7814,-0.0255,-0.6881,0.3107,0.4589,0.3527,0.2621,-0.7245,-0.0428,...,0.7607,1.0186,1.0176,0.7558,-0.9423,0.1702,1.0170,1.0133,0.6045,1
3450,0.4089,-2.0066,0.7001,0.6006,2.0187,-1.2255,1.2207,0.8971,0.2744,-0.4220,...,0.8414,0.3233,0.1921,2.2530,-0.0245,-0.4587,0.4001,0.0177,3.0080,1


In [None]:
# Step 4: Preprocess the data (Convert categorical features to numerical using Label Encoding)
# Create a LabelEncoder object
label_encoder = LabelEncoder()

In [None]:
# Apply Label Encoding to each column
for col in data.columns:
    data[col] = label_encoder.fit_transform(data[col])
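
The loop above passes every column through the encoder, including numeric ones such as `age_at_diagnosis` and the expression values, whose numbers are then replaced by integer codes for their sorted unique values. The three classes that appear later in the classification report, for a target with two unique values, are also consistent with missing values having been encoded as a class of their own. As a hedged sketch of one alternative, the cell below encodes only the non-numeric columns and leaves numeric ones untouched. It swaps `LabelEncoder` for pandas category codes, and the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with made-up values; the column names mirror the dataset.
toy = pd.DataFrame({
    "age_at_diagnosis": [75.65, 43.19, 48.87],
    "type_of_breast_surgery": ["MASTECTOMY", "BREAST CONSERVING", None],
    "cellularity": [None, "High", "High"],
})

encoded = toy.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    # Category codes follow the sorted category order; missing values become -1.
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Rows whose target is coded -1 can then be dropped or imputed explicitly before training, instead of being learned as a separate class.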

In [None]:
# Step 5: Define Features (X) and Target (y)
X = data.drop(columns=['type_of_breast_surgery'])  # Features
y = data['type_of_breast_surgery']

In [None]:
# Step 6: Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Step 7: Train the Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

In [None]:
# Step 8: Make predictions on the test set
y_pred = rfc.predict(X_test)

In [None]:
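# Classification report
# zero_division=0 reports 0.00 for classes that are never predicted, without raising warnings
print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=0))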


Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.23      0.36        78
           1       0.63      0.96      0.76       110
           2       0.00      0.00      0.00         3

    accuracy                           0.65       191
   macro avg       0.48      0.40      0.37       191
weighted avg       0.70      0.65      0.58       191





In [None]:
# Step 9: Evaluate the model
# The target has more than two classes, so precision and recall need an explicit
# averaging mode; 'weighted' averages the per-class scores by class support.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
error_rate = 1 - accuracy
print(f"Model Accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")
print(f"Error Rate: {error_rate * 100:.2f}%")

In [None]:
# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[ 18  60   0]
 [  4 106   0]
 [  0   3   0]]
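
The per-class figures in the classification report can be read directly off this matrix: rows are the true classes, columns are the predicted classes, precision divides the diagonal entry by its column sum, and recall divides it by its row sum. The sketch below recomputes them in plain Python from the matrix printed above. The variable names are illustrative, and a class that is never predicted is given a precision of 0.0, as in the report.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
conf_matrix = [
    [18, 60, 0],
    [4, 106, 0],
    [0, 3, 0],
]

n_classes = len(conf_matrix)
total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(n_classes))
accuracy = correct / total
print(f"accuracy: {accuracy:.2f}")

per_class = []
for k in range(n_classes):
    true_pos = conf_matrix[k][k]
    predicted_k = sum(conf_matrix[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(conf_matrix[k])                                   # row sum (support)
    precision = true_pos / predicted_k if predicted_k else 0.0
    recall = true_pos / actual_k if actual_k else 0.0
    per_class.append((precision, recall, actual_k))
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} support={actual_k}")
```

Most errors come from true class 0 being predicted as class 1 (60 of 78 samples), which is why recall for class 0 is low while recall for class 1 is high.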
