
In [7]:
pip install scispacy

Collecting spacy<3.8.0,>=3.7.0 (from scispacy)
  Using cached spacy-3.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting thinc<8.3.0,>=8.2.2 (from spacy<3.8.0,>=3.7.0->scispacy)
  Using cached thinc-8.2.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Using cached spacy-3.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.0 MB)
Using cached thinc-8.2.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (922 kB)
Installing collected packages: thinc, spacy
  Attempting uninstall: thinc
    Found existing installation: thinc 8.0.17
    Uninstalling thinc-8.0.17:
      Successfully uninstalled thinc-8.0.17
  Attempting uninstall: spacy
    Found existing installation: spacy 3.2.6
    Uninstalling spacy-3.2.6:
      Successfully uninstalled spacy-3.2.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the f

In [8]:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz


Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz (120.2 MB)
ERROR: Operation cancelled by user

In [1]:
# Data extraction from unstructured text, data integration, and basic analysis.
# This example assumes you have a dataset containing both structured and unstructured data related to oncology.

### 1. **Extract Information from Unstructured Data:**

import spacy
from spacy import displacy

# Load scispaCy's pre-trained model for scientific/biomedical text.
# The en_core_sci_md package must already be installed (see the pip cell above).
nlp = spacy.load("en_core_sci_md")

# Example unstructured text (a clinical note)
clinical_note = """
The patient presents with a mass in the left lung. Differential diagnoses include lung carcinoma,
tuberculosis, or pneumonia. An oncology panel is ordered, and the patient shows symptoms of dyspnea
and hemoptysis. The ICD code for lung carcinoma is C34.90.
"""

# Process the text
doc = nlp(clinical_note)

# Extract named entities. en_core_sci_md assigns every span the generic label
# ENTITY; strip() removes line breaks captured inside a span.
for ent in doc.ents:
    print(ent.text.strip(), ent.label_)

# Optional: Visualize entities
displacy.render(doc, style="ent")



patient ENTITY
mass ENTITY
left lung ENTITY
Differential diagnoses ENTITY
lung carcinoma ENTITY
pneumonia ENTITY
oncology panel ENTITY
ordered ENTITY
patient ENTITY
symptoms ENTITY
dyspnea ENTITY
hemoptysis ENTITY
ICD ENTITY
code ENTITY
lung carcinoma ENTITY
C34.90 ENTITY
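
The entity list above tags `C34.90` only as a generic `ENTITY`, so the ICD code still has to be picked out of the note separately. A minimal sketch of one way to do that, assuming a plain regular expression is good enough for ICD-10-style codes (a letter, a digit, one more alphanumeric character, and an optional dotted suffix). The helper name `extract_icd_codes` and the pattern are illustrative assumptions; the pattern does not validate codes against the official code set and can match look-alike tokens.

```python
import re

# Assumed shape of an ICD-10-style code: letter, digit, alphanumeric,
# then an optional "." followed by 1-4 alphanumeric characters.
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")

def extract_icd_codes(text):
    """Return ICD-10-like codes found in text, in order, without duplicates."""
    codes = []
    for match in ICD10_PATTERN.findall(text):
        if match not in codes:
            codes.append(match)
    return codes

note = (
    "The patient presents with a mass in the left lung. "
    "The ICD code for lung carcinoma is C34.90."
)
print(extract_icd_codes(note))
```

A list like this could populate a field such as `extracted_icd_code` instead of a hard-coded value.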


In [2]:
### 2. **Integrate Structured and Unstructured Data:**
# Combining extracted data with structured patient records.

import pandas as pd

# Example structured data
patient_data = pd.DataFrame({
    'patient_id': [1, 2],
    'age': [65, 70],
    'gender': ['M', 'F'],
    'diagnosis': ['Lung Cancer', 'Breast Cancer'],
    'icd_code': ['C34.90', 'C50.919']
})

# Add extracted information to the patient data
# Assuming extracted entities and ICD codes are stored in variables
extracted_data = {
    'patient_id': 1,
    'extracted_icd_code': 'C34.90',
    'symptoms': ['dyspnea', 'hemoptysis'],
    'differential_diagnoses': ['lung carcinoma', 'tuberculosis', 'pneumonia']
}

# Convert to DataFrame and merge
extracted_df = pd.DataFrame([extracted_data])
combined_data = pd.merge(patient_data, extracted_df, on='patient_id', how='left')

print(combined_data)

   patient_id  age gender      diagnosis icd_code extracted_icd_code  \
0           1   65      M    Lung Cancer   C34.90             C34.90   
1           2   70      F  Breast Cancer  C50.919                NaN   

                symptoms                     differential_diagnoses  
0  [dyspnea, hemoptysis]  [lung carcinoma, tuberculosis, pneumonia]  
1                    NaN                                        NaN  
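
The merged table above leaves `NaN` wherever a patient has no extracted note. A small sketch of how the same left merge could flag those rows explicitly and compare the recorded ICD code with the extracted one, using the `indicator` and `validate` parameters of `pd.merge`. The column names `has_note` and `icd_match` are hypothetical, and the frames are trimmed copies of the example data.

```python
import pandas as pd

patient_data = pd.DataFrame({
    "patient_id": [1, 2],
    "icd_code": ["C34.90", "C50.919"],
})
extracted_df = pd.DataFrame([
    {"patient_id": 1, "extracted_icd_code": "C34.90"},
])

# indicator=True adds a "_merge" column ("both" or "left_only" for a left join);
# validate="one_to_one" raises if patient_id is duplicated on either side.
checked = pd.merge(
    patient_data,
    extracted_df,
    on="patient_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Hypothetical helper columns for downstream checks
checked["has_note"] = checked["_merge"] == "both"
checked["icd_match"] = checked["icd_code"] == checked["extracted_icd_code"]

print(checked[["patient_id", "has_note", "icd_match"]])
```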


In [3]:
### 3. **Perform Differential Diagnosis Analysis:**
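# Using simple keyword matching for differential diagnosis.
# The match is an exact, case-insensitive string comparison, so "Lung Cancer"
# does not match "lung carcinoma" and patient 1 is reported as False below.

def check_differential_diagnosis(diagnosis, differential_list):
    # A missing differential list arrives as NaN (a float) after the left merge
    if isinstance(differential_list, float):
        return False
    return diagnosis.lower() in [d.lower() for d in differential_list]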

# Example usage
combined_data['correct_diagnosis'] = combined_data.apply(
    lambda row: check_differential_diagnosis(row['diagnosis'], row['differential_diagnoses']),
    axis=1
)

print(combined_data[['patient_id', 'diagnosis', 'correct_diagnosis']])

   patient_id      diagnosis  correct_diagnosis
0           1    Lung Cancer              False
1           2  Breast Cancer              False
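
Patient 1 is reported as `False` above even though "Lung Cancer" and "lung carcinoma" describe the same condition, because the check compares strings exactly. A minimal sketch of one possible fix, assuming a small hand-maintained synonym table. `DIAGNOSIS_SYNONYMS`, `normalize_diagnosis`, and `matches_differential` are hypothetical names, and the table entries are illustrative, not a clinical vocabulary.

```python
# Hypothetical synonym table mapping variant terms to one canonical form.
DIAGNOSIS_SYNONYMS = {
    "lung carcinoma": "lung cancer",
    "breast carcinoma": "breast cancer",
}

def normalize_diagnosis(term):
    """Lowercase, trim, and map a term to its canonical form if one is known."""
    key = term.strip().lower()
    return DIAGNOSIS_SYNONYMS.get(key, key)

def matches_differential(diagnosis, differential_list):
    """True if the diagnosis appears in the differential list after normalization."""
    # A missing list shows up as NaN (a float) after a left merge
    if not isinstance(differential_list, (list, tuple)):
        return False
    normalized = {normalize_diagnosis(d) for d in differential_list}
    return normalize_diagnosis(diagnosis) in normalized

print(matches_differential("Lung Cancer", ["lung carcinoma", "tuberculosis", "pneumonia"]))
print(matches_differential("Breast Cancer", float("nan")))
```

A production pipeline would more likely map terms to a standard vocabulary than to a hand-written dictionary; the dictionary only keeps the sketch short.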


In [4]:
### 4. **Order and Analyze Test Panels:**
# Simulate ordering test panels based on patient diagnosis.

# Define test panels based on diagnosis
test_panels = {
    'Lung Cancer': ['CT Scan', 'Biopsy', 'Oncology Panel'],
    'Breast Cancer': ['Mammogram', 'Biopsy', 'Oncology Panel']
}

# Add test panel data to patient records
combined_data['ordered_tests'] = combined_data['diagnosis'].map(test_panels)

# Simulate test results. random.choice is unseeded here, so results differ between runs.
import random

def simulate_test_results(tests):
    # Pick one outcome for each ordered test
    return {
        test: random.choice(['Positive', 'Negative', 'Inconclusive'])
        for test in tests
    }

combined_data['test_results'] = combined_data['ordered_tests'].apply(simulate_test_results)

print(combined_data[['patient_id', 'diagnosis', 'ordered_tests', 'test_results']])

   patient_id      diagnosis                        ordered_tests  \
0           1    Lung Cancer    [CT Scan, Biopsy, Oncology Panel]   
1           2  Breast Cancer  [Mammogram, Biopsy, Oncology Panel]   

                                        test_results  
0  {'CT Scan': 'Inconclusive', 'Biopsy': 'Negativ...  
1  {'Mammogram': 'Inconclusive', 'Biopsy': 'Negat...  
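
The simulated results above change on every run, and `simulate_test_results` would raise an error for a diagnosis with no entry in `test_panels` (the mapped value would be `NaN`). A hedged sketch of a variant that takes a seeded `random.Random` instance and returns an empty dict when no panel was ordered. The seed value and the `rng` parameter are assumptions added for illustration.

```python
import random

RESULT_OPTIONS = ["Positive", "Negative", "Inconclusive"]

def simulate_test_results(tests, rng):
    """Pick a random result per test; return {} when no panel was ordered."""
    if not isinstance(tests, list):
        return {}
    return {test: rng.choice(RESULT_OPTIONS) for test in tests}

panel = ["CT Scan", "Biopsy", "Oncology Panel"]

# Two generators built from the same (arbitrary) seed yield the same results
first = simulate_test_results(panel, random.Random(42))
second = simulate_test_results(panel, random.Random(42))
print(first == second)

# A missing panel (NaN) no longer raises
print(simulate_test_results(float("nan"), random.Random(42)))
```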


In [5]:
### 5. **Analyze and Report Results:**
# Create a summary of the analysis.

# Summary report
summary = combined_data[['patient_id', 'diagnosis', 'correct_diagnosis', 'ordered_tests', 'test_results']]
print(summary)

### Additional Considerations:
# - **Data Storage:** Use a database or other storage solution that handles patient data securely.
# - **Model Training:** If more complex models are required, train machine learning models with libraries such as `scikit-learn`, `TensorFlow`, or `PyTorch`.
# - **Visualization:** Consider `Matplotlib` or `Seaborn` for static charts and `Plotly` for interactive ones.

### Final Note:
# This code is a basic framework that can be expanded to fit specific requirements. More complex data or analyses may call for more advanced techniques and tools.

   patient_id      diagnosis  correct_diagnosis  \
0           1    Lung Cancer              False   
1           2  Breast Cancer              False   

                         ordered_tests  \
0    [CT Scan, Biopsy, Oncology Panel]   
1  [Mammogram, Biopsy, Oncology Panel]   

                                        test_results  
0  {'CT Scan': 'Inconclusive', 'Biopsy': 'Negativ...  
1  {'Mammogram': 'Inconclusive', 'Biopsy': 'Negat...  
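
The printed summary truncates the `test_results` dictionaries, which makes the outcomes hard to compare. One possible follow-up is to tally outcomes across patients. The sketch below assumes records shaped like the `test_results` column; the sample values are made up for illustration and are not the notebook's actual output, and `tally_outcomes` is a hypothetical helper.

```python
from collections import Counter

# Made-up sample records shaped like the test_results column; not real output.
sample_results = [
    {"CT Scan": "Inconclusive", "Biopsy": "Negative", "Oncology Panel": "Positive"},
    {"Mammogram": "Inconclusive", "Biopsy": "Negative", "Oncology Panel": "Negative"},
]

def tally_outcomes(results):
    """Count how often each outcome appears across all patients and tests."""
    counts = Counter()
    for patient_results in results:
        counts.update(patient_results.values())
    return counts

outcome_counts = tally_outcomes(sample_results)
for outcome, count in outcome_counts.most_common():
    print(f"{outcome}: {count}")
```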
