<a href="https://colab.research.google.com/github/maradeben/oop-tutorial-mit/blob/main/Data_science_notebook_mit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **The Complete Companion Guide for the Healthcare AI & Data Literacy Challenge**

![welcome](https://imgs.search.brave.com/6AUCT3KxHZPH8bXlACEFDmV5kIHmrWvE5uBt4H4GtSA/rs:fit:500:0:1:0/g:ce/aHR0cHM6Ly90NC5m/dGNkbi5uZXQvanBn/LzEyLzQ3LzU1LzU5/LzM2MF9GXzEyNDc1/NTU5MDNfMTN6dW4y/Nmt5d0dsR1Q2SmFY/UVUwTHB1NFJGRDhu/YksuanBn)

**Introduction**

Welcome to the Healthcare AI & Data Literacy Challenge! Think of this guide as a bridge between your DataCamp courses and your own clinical world. Its goal is to translate the core ideas of data science into practical, relatable healthcare scenarios and then show you how to bring them to life with Python.

We'll start with the "why" and the "what" in Part I, and then Part II will guide you through the "how" with hands-on code.

-----

### **Part I: Understanding Data Science in a Clinical Context**

#### **What Exactly is Data Science? 🧬**

At its heart, data science is the art of finding meaningful patterns in data to make better predictions. In a hospital, you're surrounded by an ocean of data: Electronic Health Records (EHRs), lab results, medical images, and patient feedback. Data science provides the tools to transform this raw information into life-saving insights.

  * **Clinical Analogy:** Imagine you could analyze the records of every patient who ever developed sepsis in your hospital. By identifying subtle, early warning signs that humans might miss, you could build a system that flags at-risk patients the moment they're admitted. That's the power of data science—turning historical data into a proactive tool for patient care.

-----

#### **The Data Science Team in a Hospital**

Data science isn't a one-person job; it's a team effort, much like a clinical care team.

  * **The Data Engineer (The Architect):** This is the person who builds the hospital's data plumbing. They create the systems (**pipelines**) that safely and efficiently collect data from the EHR, pharmacy, and lab systems, ensuring it's clean and organized for everyone else to use.
  * **The Data Analyst (The Storyteller):** The analyst takes the organized data and tells a story with it. They create the charts and dashboards you might see in a hospital command center—like a real-time graph of ER wait times—that help administrators understand what's happening *right now*.
  * **The Data Scientist (The Forecaster):** The scientist uses advanced statistics and AI to predict the future. They build the models that can forecast next month's ICU demand or analyze a tumor's genetic data to recommend a personalized treatment plan.

-----

#### **Medical Data: Sources and Types**

  * **Data Sources**:
      * **Internal Data**: Your hospital's own information, like EHRs, billing records, and patient satisfaction surveys.
      * **Open Data**: Freely available public datasets from organizations like the World Health Organization (WHO) or research databases like The Cancer Genome Atlas.
      * **APIs (Application Programming Interfaces)**: Think of these as secure data taps. They allow systems to get real-time data from other sources, like a patient's continuous glucose monitor or a public health database.
  * **Data Types**:
      * **Quantitative (Numerical)**: Anything you can measure with a number, like a patient's heart rate, blood pressure, or creatinine level.
      * **Qualitative (Categorical)**: Descriptive information, like a doctor's clinical notes, a diagnosis code, or patient-reported symptoms.
      * **Other Forms**: Healthcare is full of complex data, including **images** (X-rays, CT scans), **text** (clinical notes, research papers), and **geospatial data** (for mapping disease outbreaks).
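
To make the quantitative/qualitative distinction concrete, here is a minimal sketch of how pandas can separate the two kinds of columns. The table, its column names, and its values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical records; column names and values are illustrative only.
records = pd.DataFrame({
    "heart_rate_bpm": [72, 88, 95],               # quantitative
    "creatinine_mg_dl": [0.9, 1.4, 1.1],          # quantitative
    "diagnosis_code": ["E11.9", "I10", "E11.9"],  # qualitative (categorical)
})

# Tell pandas explicitly that diagnosis codes are categories, not free text.
records["diagnosis_code"] = records["diagnosis_code"].astype("category")

# Ask pandas which columns fall into each group.
numeric_cols = records.select_dtypes(include="number").columns.tolist()
categorical_cols = records.select_dtypes(include="category").columns.tolist()

print("Quantitative columns:", numeric_cols)
print("Qualitative columns:", categorical_cols)
```

Knowing which group a column belongs to matters later on, because numerical columns are typically scaled while categorical columns are encoded.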

-----

#### **The Data Science Workflow: A Step-by-Step Clinical Project**

1.  **Data Collection & Storage**: First, data is gathered from all sources. Structured data (like lab values) is often stored in **Relational Databases**, while unstructured data (like doctor's notes) goes into **Document Databases**.
2.  **Data Preparation (The "Scrub-In")**: Real-world clinical data is messy. Just as a surgeon scrubs in before a procedure, a data scientist must "clean" the data. This means correcting typos, standardizing units, and addressing missing values.
3.  **Exploration & Visualization (EDA)**: This is the diagnostic phase. Before building a model, you must explore the data to understand its patterns and limitations. This involves creating charts and summary statistics. A **dashboard** is a common tool here, providing a high-level view of key hospital metrics.
4.  **Experimentation & Prediction**: This is where you generate your key findings.
      * **A/B Testing**: This is the data science version of a Randomized Controlled Trial (RCT). For instance, a clinic could test two different appointment reminders to see which one reduces no-shows more effectively.
      * **Time Series Forecasting**: Using past data to predict the future. A classic example is forecasting seasonal flu cases to ensure proper staffing and supplies.
      * **Supervised Machine Learning**: Training an AI model on labeled data. You could feed a model thousands of retinal scans labeled "diabetic retinopathy" or "healthy" to teach it to diagnose the condition on its own.
      * **Unsupervised Machine Learning**: Finding hidden patterns in unlabeled data. This can be used to discover new patient subgroups (phenotypes) who might respond differently to a medication, even if you didn't know those groups existed beforehand.
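
As a rough illustration of the A/B testing idea above, the sketch below compares no-show rates for two appointment reminders with a two-proportion z-test. The counts are hypothetical, the helper name `two_proportion_z_test` is made up for this example, and a real study would also need proper randomization and a pre-planned sample size.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Pooled proportion under the assumption that both groups share one true rate.
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: no-shows out of scheduled appointments for each reminder.
z, p_value = two_proportion_z_test(events_a=60, n_a=400, events_b=40, n_b=400)
print(f"No-show rate, reminder A: {60 / 400:.1%}")
print(f"No-show rate, reminder B: {40 / 400:.1%}")
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

A small p-value suggests the difference between the two reminders is unlikely to be chance alone, much as it would in a clinical trial.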

-----

#### **Frontiers in Healthcare AI**

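  * **Advanced Predictive Modeling**: Beyond basic models, data scientists often compare powerful algorithms like **XGBoost** and **Random Forests** to find the most accurate one. This requires careful data prep, including **scaling** features to a comparable range. Scaling matters most for distance- and gradient-based models such as K-Nearest-Neighbor and Logistic Regression; tree-based models like XGBoost and Random Forests are far less sensitive to it.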
  * **Explainable AI (XAI)**: A major goal in clinical AI is avoiding the "black box" problem. Tools like **SHAP (SHapley Additive exPlanations)** help explain *why* a model made a decision—for example, by showing that a high HbA1C value was the main reason it flagged a patient for diabetes risk. For image-based models, techniques like **Grad-CAM** create heatmaps that highlight which parts of an image (like a specific region on a chest X-ray) the model focused on to make its diagnosis. These methods are crucial for building trust and ensuring clinical adoption.
  * **Natural Language Processing (NLP)**: This is AI that understands human language. In healthcare, it's used for **sentiment analysis** (e.g., is a drug review positive or negative?) and creating advanced visualizations with tools like **Scattertext** to explore the relationships between medications and the conditions they treat.

-----

### **Part II: Applying Data Science in Python - A Coding Reference**

This section serves as a practical, hands-on guide to the coding concepts discussed in the course materials. It is designed to be followed in a Google Colab or Jupyter Notebook environment.

#### **1. Foundational Coding Concepts**

These are the essential building blocks for working with data in Python.

##### **A. Setting Up and Loading Data**

First, we import the necessary libraries and create a simulated patient DataFrame using **pandas**. In a real project, you would typically load this data from a CSV file.

## Foundational Coding Concepts 🐍

These concepts cover the basics of loading, inspecting, filtering, and visualizing data, primarily using the **pandas** and **matplotlib** libraries.

### Data Loading and Inspection (pandas)
* **`pd.read_csv()`**: Loads data from a CSV file into a DataFrame.
* **`.head()`**: Displays the first few rows of a DataFrame to quickly inspect the data.
* **`.info()`**: Provides a technical summary of a DataFrame, including data types and non-null value counts for each column.

### Data Manipulation and Selection (pandas)
* **Column Selection**: Accessing a specific column of data using bracket notation (e.g., `df['hba1c']`) or dot notation (e.g., `df.hba1c`).
* **Logical Filtering**: Selecting rows that meet a specific condition (e.g., `patient_df[patient_df['hba1c'] >= 6.5]`).

### Data Visualization (matplotlib) 📊
* **`plt.scatter()`**: Creates a scatter plot to visualize the relationship between two numerical variables.
* **`plt.hist()`**: Creates a histogram to show the distribution of a single numerical variable.
* **`plt.bar()`**: Creates a bar chart to compare numerical values across different categories.
* **`plt.title()`**, **`plt.xlabel()`**, **`plt.ylabel()`**: Functions used to add a title and axis labels to a plot for clarity.

***

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

print("Libraries imported successfully!")

# Create a dictionary with our simulated patient data relevant to Type II Diabetes
data = {
    'age': [50, 65, 45, 72, 38, 55, 61, 48, 79, 31, 58, 68],
    'bmi': [30.1, 33.5, 28.0, 35.2, 25.5, 29.8, 31.0, 27.3, 36.1, 23.1, 32.4, 34.0],
    'hba1c': [7.1, 8.2, 5.5, 9.1, 5.1, 6.9, 7.5, 6.0, 9.5, 4.9, 7.8, 8.5],
    'blood_glucose_mg_dl': [155, 180, 105, 210, 95, 140, 165, 120, 230, 88, 170, 190],
    'has_diabetes': [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1]
}
patient_df = pd.DataFrame(data)

##### **B. Data Inspection and Selection**

Before analysis, we inspect our data with `.head()` to see the first few rows and `.info()` for a technical summary. We can then select specific columns for analysis or filter rows based on logical conditions.

In [None]:
# Display the first 5 rows and technical info
print("First 5 patient records:")
print(patient_df.head())
print("\nTechnical information about the dataset:")
patient_df.info()

# Filter for patients with HbA1c >= 6.5%, a diagnostic criterion for diabetes
diabetic_patients = patient_df[patient_df['hba1c'] >= 6.5]
print("\nPatients meeting HbA1c criteria for diabetes:")
print(diabetic_patients)
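
Filters can also be combined. The following is a minimal sketch, assuming a shortened copy of the simulated patient table; the variable names `older_high_hba1c`, `flagged_age_bmi`, and `same_rows` are illustrative and not part of the original notebook.

```python
import pandas as pd

# A shortened stand-in for the simulated patient table (first six rows).
patient_df = pd.DataFrame({
    "age": [50, 65, 45, 72, 38, 55],
    "bmi": [30.1, 33.5, 28.0, 35.2, 25.5, 29.8],
    "hba1c": [7.1, 8.2, 5.5, 9.1, 5.1, 6.9],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
older_high_hba1c = patient_df[(patient_df["hba1c"] >= 6.5) & (patient_df["age"] >= 60)]
print(older_high_hba1c)

# .loc filters rows and picks columns in a single step.
flagged_age_bmi = patient_df.loc[patient_df["hba1c"] >= 6.5, ["age", "bmi"]]
print(flagged_age_bmi)

# .query expresses the same combined filter as a readable string.
same_rows = patient_df.query("hba1c >= 6.5 and age >= 60")
print(same_rows.equals(older_high_hba1c))
```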

##### **C. Basic Data Visualization**

Visualizations help uncover insights. A **scatter plot** can show relationships between two numerical variables, a **histogram** shows the distribution of a single variable, and a **bar chart** compares values across categories.

In [None]:
# Scatter plot of BMI vs. HbA1c
plt.style.use('seaborn-v0_8-whitegrid')
plt.scatter(patient_df['bmi'], patient_df['hba1c'], color='red')
plt.title('Patient BMI vs. HbA1c Level')
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Glycated Hemoglobin (HbA1c %)')
plt.show()
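
The histogram and bar chart mentioned above can be sketched in the same way. This is an illustrative example that re-creates two columns of the simulated data so it runs on its own; the colors, bin count, and category labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Two columns of the simulated patient data, re-created so this cell is self-contained.
patient_df = pd.DataFrame({
    "hba1c": [7.1, 8.2, 5.5, 9.1, 5.1, 6.9, 7.5, 6.0, 9.5, 4.9, 7.8, 8.5],
    "has_diabetes": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1],
})

# Histogram: the distribution of a single numerical variable.
plt.hist(patient_df["hba1c"], bins=5, color="steelblue", edgecolor="white")
plt.title("Distribution of HbA1c")
plt.xlabel("Glycated Hemoglobin (HbA1c %)")
plt.ylabel("Number of patients")
plt.show()

# Bar chart: mean HbA1c compared across the two diabetes-status categories.
# groupby sorts the keys, so index 0 (no diabetes) comes before index 1 (diabetes).
mean_hba1c = patient_df.groupby("has_diabetes")["hba1c"].mean()
plt.bar(["No diabetes", "Diabetes"], mean_hba1c.values, color=["gray", "red"])
plt.title("Mean HbA1c by Diabetes Status")
plt.ylabel("Mean HbA1c (%)")
plt.show()
```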

#### **2. Advanced Coding Concepts**

This reference covers a more complete machine learning workflow, from advanced preprocessing to model explanation and NLP.

  * **Data Cleaning and Preprocessing**
      * `.str.strip()`: A pandas method used to remove leading and trailing spaces from text data, which is useful for cleaning inconsistent labels.
      * `LabelEncoder`: A tool from **scikit-learn** to convert categorical text labels (e.g., "Male", "Female") into numerical values that can be used by a model.
      * `MinMaxScaler`: A **scikit-learn** tool that scales all numerical features to a common range (typically 0 to 1), which improves the performance of many machine learning models.
  * **Exploratory Data Analysis (EDA)**
      * **Seaborn Heatmap**: A visualization used to show the correlation matrix of numerical features, making it easy to spot relationships between variables.
  * **Machine Learning Modeling**
      * `train_test_split`: A function from **scikit-learn** to divide a dataset into training and testing sets, which is essential for evaluating a model's performance on unseen data.
      * **Model Training**: Implementing and training various classification algorithms, such as Logistic Regression, K-Nearest-Neighbor, and a powerful algorithm called **XGBoost Classifier**.
      * `classification_report`: A **scikit-learn** function that provides key evaluation metrics like precision, recall, and F1-score for each class in a classification problem.
      * `confusion_matrix`: A **scikit-learn** tool that gives a detailed breakdown of a model's correct and incorrect predictions for each class.
  * **Model Explainability and NLP**
      * **SHAP (SHapley Additive exPlanations)**: A library used to explain the output of any machine learning model by quantifying the contribution of each feature to a specific prediction.
      * **Hugging Face `transformers`**: A powerful library for NLP tasks. It can be used to load pre-trained AI models to perform **sentiment analysis** on text data like drug reviews.
      * **Word Clouds**: A visualization technique used to display the most frequent words in a body of text, helping to quickly identify prominent medications or conditions.
      * **Scattertext**: A Python library for creating interactive visualizations that reveal relationships and distinguishing terms between different categories of text data.
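
To show how the evaluation metrics listed above relate to one another, here is a small worked sketch. The true labels and predictions are hypothetical; the point is only that precision, recall, and F1 all fall out of the four cells of a confusion matrix.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and model predictions (1 = diabetes, 0 = no diabetes).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the patients flagged, how many truly have diabetes
recall = tp / (tp + fn)     # of the patients with diabetes, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(f"By hand:      precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
print(
    "scikit-learn: "
    f"precision={precision_score(y_true, y_pred):.2f}, "
    f"recall={recall_score(y_true, y_pred):.2f}, "
    f"f1={f1_score(y_true, y_pred):.2f}"
)
```

In a screening setting, recall is often the number to watch, because a missed case (a false negative) is usually costlier than an unnecessary follow-up.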

-----

### **Detailed Healthcare Use Cases in Practice**

#### **Use Case: Diabetes Prediction using Machine Learning 🩺**

This use case demonstrates how to build a model to predict Type II diabetes using a patient dataset. Diabetes is a chronic disease characterized by high blood sugar, and its diagnosis can be based on metrics like fasting plasma glucose, oral glucose tolerance tests, or an HbA1C value of 6.5% or greater.

The machine learning workflow involves several key steps:

  * **Data Preprocessing**: The data is loaded into a pandas DataFrame. Initial cleaning includes fixing inconsistent labels by stripping extra spaces from values. Categorical data like gender and age ranges are converted into numerical codes using `LabelEncoder` from scikit-learn. Numerical features, which are on different scales, are normalized to a range between zero and one using `MinMaxScaler`.
  * **Model Training and Comparison**: The dataset is split, with 20% reserved for testing. Seven different machine learning algorithms, including Logistic Regression, K-Nearest-Neighbor, and XGBoost Classifier, are compared using cross-validation to find the best performer. In this scenario, **XGBoost** achieves the highest accuracy score.
  * **Evaluation**: The trained XGBoost model achieves a high accuracy score on the test set. The model's performance is further detailed using a classification report, which provides precision, recall, and F1-scores, and a confusion matrix that compares the true labels versus the predicted labels.
  * **Explainability**: To understand the model's decisions, **SHAP (SHapley Additive exPlanations)** is used. This technique quantifies how much each feature contributed to the prediction, confirming that HbA1C and BMI were key parameters the model learned for diagnosing diabetes, which aligns with clinical knowledge.
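
The model comparison step can be sketched roughly as follows. This is not the original analysis: it uses synthetic data from `make_classification` as a stand-in for the patient dataset, compares only two of the algorithms named above, and the 5-fold setting is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; a real project would use the cleaned patient dataset.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest-Neighbor": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline, so it is re-fit on each training fold.
    model = make_pipeline(MinMaxScaler(), estimator)
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std {scores.std():.3f})")

best_model = max(results, key=results.get)
print("Best by cross-validated accuracy:", best_model)
```

Adding another algorithm is a matter of adding one more entry to `candidates`, which keeps the comparison fair because every model sees the same folds and the same preprocessing.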

-----

#### **Use Case: Sentiment Analysis of Drug Reviews 📝**

This use case applies sentiment analysis to a drug review dataset to classify patient feedback as positive, negative, or neutral.

The process uses **pre-trained transformer models** from the Hugging Face library to analyze the text.

  * **Model Comparison**: Three different models (`bio_clinicbert`, `rubert_classifier`, and `roberta_classifier`) are compared on sample reviews.
  * **Application**: After comparison, the best-performing models are applied to the full dataset of reviews. The output for each review includes a sentiment label (e.g., positive, negative) and a confidence score for that prediction. An interesting next step is to compare the model's predicted sentiment with the numerical ratings provided by the original reviewers.
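
The suggested comparison between predicted sentiment and reviewer ratings could look something like the sketch below. The reviews, the 1 to 10 rating scale, and the cut-off of 6 for a "positive" rating are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical reviews: a 1-10 rating from the reviewer plus a model's predicted label.
reviews = pd.DataFrame({
    "rating": [9, 10, 2, 1, 8, 3, 7, 4],
    "predicted_sentiment": [
        "positive", "positive", "negative", "negative",
        "positive", "positive", "negative", "negative",
    ],
})

# Assumed convention for this sketch: ratings of 6 or more count as positive.
reviews["rating_sentiment"] = reviews["rating"].apply(
    lambda r: "positive" if r >= 6 else "negative"
)

# Cross-tabulate the rating-derived sentiment against the model's prediction.
agreement_table = pd.crosstab(reviews["rating_sentiment"], reviews["predicted_sentiment"])
agreement_rate = (reviews["rating_sentiment"] == reviews["predicted_sentiment"]).mean()

print(agreement_table)
print(f"Agreement rate: {agreement_rate:.0%}")
```

Reviews where the two disagree are often the most interesting to read by hand, for example a low rating attached to text that praises the drug but complains about cost.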

-----

#### **Use Case: Text Visualization for Diseases and Medications 📊**

This use case focuses on using text visualization as a tool for exploratory data analysis to understand the complex relationships between diseases and medications. The same medication can often treat multiple diseases, and one disease can be treated by many medications.

Two primary visualization techniques are used:

  * **Word Clouds**: Word clouds are generated to quickly visualize the most prominent medications and conditions within the dataset. For example, the visualization might show that medications related to birth control and anxiolytics are highly frequent, as are conditions like depression and high blood pressure.
  * **Scattertext**: This Python library is used to create interactive visualizations that reveal relationships between text data. It can generate a plot showing which conditions are most frequently associated with a specific medication. For example, it can confirm that Metoclopramide is strongly associated with conditions like "Migraine" and "nausea," aligning with its clinical use as an antiemetic.
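
Before reaching for an interactive tool, the same medication-to-condition question can be explored with a plain frequency count. The sketch below uses pandas rather than Scattertext, and the review records are hypothetical.

```python
import pandas as pd

# Hypothetical review records; the drug and condition pairs are illustrative only.
reviews = pd.DataFrame({
    "drug": [
        "Metoclopramide", "Metoclopramide", "Metoclopramide",
        "Sertraline", "Sertraline", "Lisinopril",
    ],
    "condition": [
        "Migraine", "Nausea", "Migraine",
        "Depression", "Anxiety", "High Blood Pressure",
    ],
})

# Count how often each condition appears alongside each drug.
pair_counts = reviews.groupby(["drug", "condition"]).size().sort_values(ascending=False)
print(pair_counts)

# Most frequent condition for a single drug.
is_metoclopramide = reviews["drug"] == "Metoclopramide"
top_condition = reviews.loc[is_metoclopramide, "condition"].value_counts().idxmax()
print("Most common condition for Metoclopramide:", top_condition)
```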

-----

### **Specialty-Specific Module: Application Examples**

  * **Nursing**:
      * **Data Cleaning**: Standardizing notes on patient fall risk assessments.
      * **Dashboarding**: Creating a ward dashboard to track patient-to-nurse ratios and workloads.
  * **General/Family Medicine**:
      * **Supervised ML**: Building a model to predict a patient's 10-year risk of cardiovascular disease based on their current health metrics.
      * **Data Exploration**: Analyzing patient demographics to identify populations that are overdue for preventative screenings.
  * **Surgery**:
      * **Time Series Forecasting**: Predicting the demand for specific surgical instruments to optimize inventory.
      * **A/B Testing**: Comparing post-operative pain scores between two different pain management protocols.
  * **Obstetrics and Gynecology**:
      * **Supervised ML**: Developing an algorithm to detect signs of pre-eclampsia from routine monitoring data.
      * **Data Visualization**: Plotting fetal growth charts against population averages.
  * **Pediatrics**:
      * **Time Series Forecasting**: Predicting seasonal peaks in RSV infections to staff clinics appropriately.
      * **Clustering**: Grouping children with developmental delays based on symptom profiles to identify distinct phenotypes.
  * **Anesthesiology**:
      * **Data Pipelines**: Creating real-time data streams from operating room monitors into a central research database.
      * **Supervised ML**: Predicting which patients are most likely to experience post-operative nausea.
  * **Pharmacy**:
      * **Data Cleaning**: Standardizing medication names from different EHR systems (e.g., "Tylenol" vs. "Acetaminophen").
      * **A/B Testing**: Testing two different patient-facing leaflets about a new medication to see which one improves adherence.
  * **Medical Laboratory Science**:
      * **Data Cleaning**: Identifying and flagging outlier or erroneous lab results automatically.
      * **Time Series Forecasting**: Predicting the demand for specific lab reagents to prevent shortages.
  * **Dentistry**:
      * **Supervised ML**: Training a model on X-rays to automatically detect and flag potential cavities for review.
      * **Data Exploration**: Analyzing patient records to find correlations between oral hygiene habits and periodontal disease.
  * **Radiology/Medical Imaging**:
      * **Supervised ML**: Building a model to classify chest X-rays for signs of pneumonia.
      * **Clustering**: Grouping brain MRI scans to identify different patterns of tumor growth.
  * **Oncology**:
      * **Clustering**: Analyzing genomic data from tumors to identify distinct molecular subtypes that respond differently to therapies.
      * **Supervised ML**: Predicting patient response to a specific chemotherapy regimen.
  * **Cardiology**:
      * **Time Series Forecasting**: Analyzing ECG data to predict the onset of atrial fibrillation.
      * **Data Visualization**: Creating scatter plots to visualize the relationship between blood pressure and age.
  * **Public Health**:
      * **Geospatial Data Analysis**: Mapping COVID-19 cases to identify hotspots and inform public policy.
      * **Time Series Forecasting**: Predicting the spread of a seasonal flu outbreak.
  * **Community Health**:
      * **Data Exploration**: Analyzing local data to identify neighborhoods with low access to healthcare services.
      * **Dashboarding**: Creating a community dashboard showing vaccination rates by zip code.
  * **Mental Health**:
      * **Supervised ML (NLP)**: Analyzing text from therapy session transcripts to predict patient outcomes.
      * **Clustering**: Grouping patients based on their responses to psychiatric questionnaires to identify different depression subtypes.
  * **Biomedical Engineering**:
      * **Data Pipelines**: Building systems to stream data from new medical devices for clinical trials.
      * **Time Series Analysis**: Analyzing sensor data from a prosthetic limb to improve its performance.
  * **Physiotherapy**:
      * **Time Series Analysis**: Tracking a patient's range of motion over time using wearable sensors.
      * **Data Visualization**: Plotting a patient's recovery progress against their rehabilitation goals.
  * **Nutrition and Dietetics**:
      * **Clustering**: Grouping patients based on their dietary logs to identify common eating patterns.
      * **A/B Testing**: Comparing weight loss outcomes between two different diet plans.
  * **Radiography**:
      * **Supervised ML**: Developing AI tools to automatically position patients for optimal X-ray imaging.
      * **Data Exploration**: Analyzing image metadata to identify factors that contribute to low-quality scans.
  * **Physiology**:
      * **Time Series Analysis**: Studying high-frequency data from physiological experiments to understand cellular responses.
      * **Data Visualization**: Plotting the relationship between oxygen saturation and heart rate during exercise.
  * **Optometry**:
      * **Supervised ML**: Training an algorithm on retinal fundus images to screen for glaucoma.
      * **Data Exploration**: Analyzing patient data to find risk factors for age-related macular degeneration.
  * **Environmental Health Science**:
      * **Geospatial Data Analysis**: Mapping the correlation between air pollution levels and hospital admissions for asthma.
      * **Data Exploration**: Analyzing data to link exposure to certain chemicals with health outcomes.
  * **Health Information Management**:
      * **Data Cleaning**: Leading projects to de-duplicate patient records and ensure data integrity in the EHR.
      * **Data Pipelines**: Overseeing the flow of data between different hospital IT systems.
  * **Occupational Therapy**:
      * **Data Visualization**: Charting a patient's progress in activities of daily living (ADLs) over time.
      * **Time Series Analysis**: Using sensor data to analyze the ergonomics of a worker's movements.
  * **Medical Physics**:
      * **Supervised ML**: Building models to optimize radiation therapy plans for cancer patients.
      * **Image Data Analysis**: Developing new algorithms to improve the quality of MRI scans.

-----

### **Part II (continued): Advanced Use Cases & End-to-End Workflows**

**Introduction**

Welcome to the advanced section! The foundational section covered the basic building blocks. Here, we'll walk through more complete, real-world workflows that mirror the process a data scientist in a hospital or research center would follow. We will explore three specific use cases: building a diabetes prediction model, analyzing sentiment in drug reviews, and visualizing relationships in text data.

#### **Use Case 1: End-to-End Diabetes Prediction Model**

Our goal is to build a machine learning model that can predict whether a patient has Type II diabetes based on their clinical data. We will go through a more realistic workflow that includes data cleaning, preprocessing, model training, and a crucial final step: model explanation.



##### **1. Advanced Data Cleaning & Preprocessing**

Real-world data often has formatting issues. For instance, categorical data might have extra spaces that need to be removed. Furthermore, most machine learning models require all input to be numerical.

  * **Label Encoding**: We convert categorical columns (like 'gender') into numbers.
  * **Scaling**: Clinical measurements are on different scales (e.g., BMI vs. HbA1C). We use a `MinMaxScaler` to transform all numerical features to a common range of 0 to 1, which helps the model perform better.

In [None]:
# This is a conceptual code block to illustrate the steps.
# In a real scenario, you'd have more data. We'll use our small DataFrame to demonstrate.
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# --- Cleaning ---
# Imagine a 'gender' column was added with extra spaces
patient_df['gender'] = [' Male', 'Female', 'Female ', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', ' Male', 'Male', 'Female']
print("Gender before cleaning:\n", patient_df['gender'].unique())
patient_df['gender'] = patient_df['gender'].str.strip()
print("Gender after cleaning:\n", patient_df['gender'].unique())

# --- Label Encoding ---
le = LabelEncoder()
patient_df['gender_encoded'] = le.fit_transform(patient_df['gender'])
print("\nDataFrame with encoded gender:")
print(patient_df.head())

# --- Scaling ---
# Select only numerical columns for scaling
numerical_cols = ['age', 'bmi', 'hba1c', 'blood_glucose_mg_dl']
scaler = MinMaxScaler()
patient_df[numerical_cols] = scaler.fit_transform(patient_df[numerical_cols])
print("\nDataFrame after scaling numerical features:")
print(patient_df.head())
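
Min-max scaling maps each value with x' = (x - min) / (max - min). One practical caveat: when a model will be evaluated on held-out data, the minimum and maximum are usually learned from the training rows only, so no information from the test rows leaks into preprocessing. The sketch below illustrates that pattern under stated assumptions: it re-creates two columns of the simulated data, and the 25% test size is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Two feature columns and the label from the simulated patient data.
features = pd.DataFrame({
    "age": [50, 65, 45, 72, 38, 55, 61, 48, 79, 31, 58, 68],
    "bmi": [30.1, 33.5, 28.0, 35.2, 25.5, 29.8, 31.0, 27.3, 36.1, 23.1, 32.4, 34.0],
})
labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1]

# Split first; stratify keeps both classes represented in each part.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

scaler = MinMaxScaler()
# Learn each column's min and max from the training rows only...
X_train_scaled = scaler.fit_transform(X_train)
# ...then reuse those same min/max values on the test rows.
X_test_scaled = scaler.transform(X_test)

print("Training min per column:", scaler.data_min_)
print("Training max per column:", scaler.data_max_)
print("Scaled training range:", X_train_scaled.min(), "to", X_train_scaled.max())
```

With this pattern, scaled test values can occasionally fall slightly below 0 or above 1. That is expected and simply means a test patient lies outside the range seen during training.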

##### **2. EDA with a Correlation Heatmap**

A heatmap is a great way to quickly see which variables are correlated. With the `coolwarm` colormap used in the cell below, deeper red indicates a stronger positive correlation and deeper blue a stronger negative one. For example, we'd expect HbA1C and BMI to be correlated.

In [None]:
import seaborn as sns

# Re-create a simple numerical DataFrame for the heatmap
data = {'age': [50, 65, 45, 72], 'bmi': [30.1, 33.5, 28.0, 35.2], 'hba1c': [7.1, 8.2, 5.5, 9.1], 'blood_glucose_mg_dl': [155, 180, 105, 210]}
corr_df = pd.DataFrame(data)

# Calculate the correlation matrix
correlation_matrix = corr_df.corr()

# Create the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Clinical Features')
plt.show()

##### **3. Model Training & Evaluation**

We'll now train an XGBoost Classifier, a powerful and popular algorithm that performed best in the example analysis. We first split our data into a training set (for the model to learn from) and a test set (to evaluate its performance on unseen data).

In [None]:
# This is a conceptual code block. XGBoost requires installation.
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Define features (X) and the target label (y)
features = ['age', 'bmi', 'hba1c', 'blood_glucose_mg_dl', 'gender_encoded']
target = 'has_diabetes'

X = patient_df[features]
y = patient_df[target]

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the XGBoost model
# (the use_label_encoder argument is deprecated in recent XGBoost releases, so it is omitted)
model = XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)

# Make predictions on the test data
predictions = model.predict(X_test)

# Evaluate the model
print("--- Model Evaluation ---")
# A confusion matrix shows a comparison of true vs. predicted labels
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
# A classification report shows precision, recall, and F1-score for each class
print("\nClassification Report:\n", classification_report(y_test, predictions))
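
The final step promised for this use case is model explanation. SHAP requires a separate installation, so the sketch below swaps in a simpler, different technique that ships with scikit-learn: permutation importance, which shuffles one feature at a time and measures how much test accuracy drops. The data is synthetic and the label is constructed to depend only on HbA1c, so the expected ranking is known in advance; a Random Forest stands in for XGBoost.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: the label depends only on HbA1c by construction.
rng = np.random.default_rng(0)
n_patients = 300
X = pd.DataFrame({
    "age": rng.integers(30, 80, n_patients),
    "bmi": rng.normal(29, 4, n_patients),
    "hba1c": rng.normal(6.5, 1.5, n_patients),
})
y = (X["hba1c"] >= 6.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle one feature at a time and record the average drop in test accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances)
```

Unlike SHAP, permutation importance gives one global score per feature rather than an explanation for each individual patient, so it is a starting point rather than a replacement.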

#### **Use Case 2: Sentiment Analysis on Drug Reviews**

Analyzing patient reviews can provide valuable insights. Using pre-trained transformer models from libraries like Hugging Face, we can perform sentiment analysis to automatically classify text as positive, negative, or neutral.

In [None]:
# This code requires installing the transformers library: !pip install transformers
from transformers import pipeline

# Load a pre-trained sentiment analysis model
sentiment_classifier = pipeline('sentiment-analysis')

# Example drug review
review = "I was hesitant at first, but this medicine has completely changed my life for the better. I have no side effects."

# Get the sentiment prediction
result = sentiment_classifier(review)
print(f"Review: '{review}'")
print(f"Predicted Sentiment: {result}")

#### **Use Case 3: Visualizing Text Data Relationships**

When dealing with thousands of reviews, we need ways to explore the data quickly.

##### **1. Word Clouds**

A word cloud is a simple yet powerful visualization where the size of each word is proportional to its frequency in the text. This can help us quickly identify the most commonly discussed medications or conditions in a large dataset.

In [None]:
# This code requires installing the wordcloud library: !pip install wordcloud
from wordcloud import WordCloud

# Sample text data of conditions mentioned in reviews
text_data = "birth control pain depression anxiety high blood pressure pain birth control acne pain depression birth control"

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_data)

# Display the image
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
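
Since word size in a word cloud reflects frequency, it can help to look at the raw counts as well. The sketch below uses only the Python standard library and the same sample text; it is a simplified illustration, not how the `wordcloud` library computes its layout.

```python
from collections import Counter

# The same sample text of conditions used for the word cloud.
text_data = (
    "birth control pain depression anxiety high blood pressure "
    "pain birth control acne pain depression birth control"
)

# Split on whitespace and count how often each word appears.
word_counts = Counter(text_data.split())

# Show the four most frequent words with their counts.
for word, count in word_counts.most_common(4):
    print(f"{word}: {count}")
```

Note that this simple split treats "birth" and "control" as separate words, so multi-word terms such as "high blood pressure" would need to be joined or tagged before counting in a real analysis.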