# Introduction to AI for DOST Researchers

**DOST-ITDI AI Training Workshop**  
**Day 1 - Session 1: Introduction to Python and Machine Learning**

---

## Welcome!

This notebook will introduce you to:
- Google Colab environment
- Python programming basics (variables, data structures, control flow, functions)
- Machine Learning fundamentals
- AI applications in scientific research

## Learning Objectives
1. Navigate and use Google Colab effectively
2. Understand Python syntax and data structures
3. Learn fundamental Machine Learning concepts
4. Explore AI applications relevant to chemistry and materials science

## 1. Google Colab Basics

### What is Google Colab?
- Free cloud-based Jupyter notebook environment
- No setup required, runs in your browser
- Free access to GPUs and TPUs, subject to usage limits and availability
- Easy sharing and collaboration

### Key Features:
- **Code cells**: Execute Python code
- **Text cells**: Write documentation using Markdown
- **Runtime**: Virtual machine that executes your code

### Keyboard Shortcuts:
- `Shift + Enter`: Run cell and move to next
- `Ctrl + Enter`: Run cell and stay on same cell
- `Ctrl + M B`: Insert cell below
- `Ctrl + M A`: Insert cell above
- `Ctrl + M D`: Delete cell

In [None]:
# Your first Python code - Run this cell by pressing Shift+Enter
print("Hello, DOST Researchers!")
print("Welcome to AI Training Workshop")
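
Because the runtime is a remote virtual machine, it can help to check what it provides before running heavier code. The sketch below uses only the Python standard library; the values it prints depend on the runtime Colab assigns to you.

```python
import platform
import sys

# Ask the runtime which Python interpreter and operating system it is using.
python_version = platform.python_version()
os_name = platform.system()

print(f"Python version: {python_version}")
print(f"Operating system: {os_name}")
print(f"Interpreter path: {sys.executable}")
```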

## 2. Python Basics for Data Science

### 2.1 Variables and Data Types

In [None]:
# Numbers
temperature = 25.5  # float (decimal number)
ph_level = 7  # integer (whole number)

# Strings (text)
compound_name = "Ethanol"
chemical_formula = "C2H5OH"

# Boolean (True/False)
is_reactive = True
is_toxic = False

print(f"Compound: {compound_name}")
print(f"Formula: {chemical_formula}")
print(f"Temperature: {temperature}°C")
print(f"pH Level: {ph_level}")
print(f"Reactive: {is_reactive}")

In [None]:
# Basic arithmetic operations
a = 10
b = 3

print(f"Addition: {a} + {b} = {a + b}")
print(f"Subtraction: {a} - {b} = {a - b}")
print(f"Multiplication: {a} * {b} = {a * b}")
print(f"Division: {a} / {b} = {a / b}")
print(f"Integer Division: {a} // {b} = {a // b}")
print(f"Modulus (remainder): {a} % {b} = {a % b}")
print(f"Power: {a} ** {b} = {a ** b}")

### 2.2 Lists - Storing Multiple Values

In [None]:
# List of molecular weights (g/mol)
molecular_weights = [18.015, 46.07, 58.44, 180.16]  # H2O, Ethanol, NaCl, Glucose
compounds = ["Water", "Ethanol", "Sodium Chloride", "Glucose"]

# Accessing elements (indexing starts at 0)
print(f"First compound: {compounds[0]}")
print(f"Second compound: {compounds[1]}")
print(f"Last compound: {compounds[-1]}")
print(f"Second to last: {compounds[-2]}")

# Slicing (getting a range)
print(f"First two compounds: {compounds[0:2]}")
print(f"From second onwards: {compounds[1:]}")

# List operations
print(f"\nNumber of compounds: {len(compounds)}")
print(f"Average molecular weight: {sum(molecular_weights)/len(molecular_weights):.2f} g/mol")

In [None]:
# Adding and removing items from lists
experiments = ["Sample A", "Sample B"]
print("Original:", experiments)

# Add item to end
experiments.append("Sample C")
print("After append:", experiments)

# Insert at specific position
experiments.insert(1, "Sample X")
print("After insert:", experiments)

# Remove item
experiments.remove("Sample X")
print("After remove:", experiments)

# Remove and return the last item (pop() also accepts an index, e.g. pop(0))
last_item = experiments.pop()
print(f"Removed: {last_item}")
print("Final list:", experiments)

### 2.3 Dictionaries - Storing Key-Value Pairs

In [None]:
# Dictionary for a chemical compound
aspirin = {
    "name": "Aspirin",
    "formula": "C9H8O4",
    "molecular_weight": 180.16,
    "melting_point": 135,  # °C
    "decomposition_point": 140,  # °C (aspirin decomposes near this temperature instead of boiling)
    "use": "Analgesic"
}

# Accessing values
print(f"Compound: {aspirin['name']}")
print(f"Formula: {aspirin['formula']}")
print(f"Molecular Weight: {aspirin['molecular_weight']} g/mol")

# Adding new key-value pair
aspirin['solubility'] = 'Low in water'
print(f"Solubility: {aspirin['solubility']}")

# Get all keys and values
print("\nAll keys:", list(aspirin.keys()))
print("All values:", list(aspirin.values()))
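
Two dictionary patterns come up often with compound records: looping over key-value pairs, and reading a key that might be missing. The following is a minimal sketch; the `compound` record is an illustrative stand-in with the same layout as the `aspirin` dictionary, and the `"not recorded"` default is an arbitrary choice.

```python
# Illustrative record with the same layout as the aspirin dictionary.
compound = {
    "name": "Caffeine",
    "formula": "C8H10N4O2",
    "molecular_weight": 194.19,
}

# .items() yields (key, value) pairs, one per entry.
for key, value in compound.items():
    print(f"{key}: {value}")

# .get() returns a fallback instead of raising KeyError when a key is absent.
melting_point = compound.get("melting_point", "not recorded")
print(f"Melting point: {melting_point}")

# The `in` operator checks whether a key exists.
has_formula = "formula" in compound
print(f"Has formula: {has_formula}")
```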

### 2.4 Conditional Statements (if/elif/else)

In [None]:
# pH classification
ph = 6.5

if ph < 7:
    print(f"pH {ph} is ACIDIC")
elif ph == 7:
    print(f"pH {ph} is NEUTRAL")
else:
    print(f"pH {ph} is BASIC")
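
Measured pH values are floating-point numbers, so a test like `ph == 7` only matches an exact 7. One possible refinement is to treat readings within a small tolerance of 7 as neutral. The sketch below assumes a tolerance of 0.05 pH units, which is an arbitrary illustrative value and not a laboratory standard; `classify_ph` is a hypothetical helper name.

```python
import math

def classify_ph(ph, tolerance=0.05):
    """Classify a pH reading, treating values within `tolerance` of 7 as neutral."""
    # The neutral check comes first so near-7 readings are not labeled acidic or basic.
    if math.isclose(ph, 7.0, abs_tol=tolerance):
        return "NEUTRAL"
    elif ph < 7:
        return "ACIDIC"
    else:
        return "BASIC"

for reading in [6.5, 6.98, 7.0, 7.4]:
    print(f"pH {reading} is {classify_ph(reading)}")
```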

In [None]:
# Temperature safety check
temperature = 85
pressure = 2.5  # atm

if temperature > 100 and pressure > 2:
    print("⚠️ WARNING: High temperature AND high pressure!")
elif temperature > 100 or pressure > 2:
    print("⚠️ CAUTION: Either temperature or pressure is high")
else:
    print("✓ Conditions are safe")

# Comparison operators: ==, !=, <, >, <=, >=
# Logical operators: and, or, not

### 2.5 Loops - Repeating Operations

In [None]:
# For loop - iterate through a list
temperatures = [20, 25, 30, 35, 40]
print("Converting Celsius to Fahrenheit:")

for temp_c in temperatures:
    temp_f = (temp_c * 9/5) + 32
    print(f"{temp_c}°C = {temp_f}°F")

In [None]:
# For loop with range
print("Trial numbers:")
for i in range(1, 6):  # range(start, stop) - stop is not included
    print(f"Trial {i}")

print("\nEven numbers from 0 to 10:")
for i in range(0, 11, 2):  # range(start, stop, step)
    print(i, end=" ")

In [None]:
# While loop - repeat until condition is false
concentration = 1.0  # mol/L
dilution_factor = 0.5
count = 0

print("Serial Dilution:")
while concentration > 0.01:
    count += 1
    print(f"Step {count}: {concentration:.4f} mol/L")
    concentration *= dilution_factor

In [None]:
# Loop with enumerate (get index and value)
samples = ["Control", "Test A", "Test B", "Test C"]

for index, sample in enumerate(samples, start=1):
    print(f"Sample {index}: {sample}")

### 2.6 Functions - Reusable Code

In [None]:
# Simple function
def celsius_to_fahrenheit(celsius):
    """Convert Celsius to Fahrenheit"""
    fahrenheit = (celsius * 9/5) + 32
    return fahrenheit

# Using the function
temp_c = 25
temp_f = celsius_to_fahrenheit(temp_c)
print(f"{temp_c}°C = {temp_f}°F")

In [None]:
# Function with multiple parameters and default values
def calculate_molarity(moles, volume_ml, unit="M"):
    """
    Calculate molarity from moles and volume

    Parameters:
    - moles: number of moles
    - volume_ml: volume in milliliters
    - unit: output unit (default is M for molar)
    """
    volume_l = volume_ml / 1000  # convert mL to L
    molarity = moles / volume_l
    return f"{molarity:.4f} {unit}"

# Using the function
result = calculate_molarity(0.5, 250)
print(f"Molarity: {result}")

In [None]:
# Function returning multiple values
def analyze_ph(ph):
    """Classify pH and return category and hydrogen ion concentration"""
    if ph < 7:
        category = "Acidic"
    elif ph == 7:
        category = "Neutral"
    else:
        category = "Basic"

    h_concentration = 10 ** (-ph)

    return category, h_concentration

# Unpacking multiple return values
ph_value = 5.5
classification, h_ion = analyze_ph(ph_value)
print(f"pH {ph_value} is {classification}")
print(f"[H+] = {h_ion:.2e} M")

### 2.7 List Comprehension - Compact Way to Create Lists

In [None]:
# Traditional way
celsius_temps = [0, 10, 20, 30, 40]
fahrenheit_temps = []

for temp in celsius_temps:
    fahrenheit_temps.append((temp * 9/5) + 32)

print("Traditional way:", fahrenheit_temps)

# List comprehension (more concise)
fahrenheit_temps = [(temp * 9/5) + 32 for temp in celsius_temps]
print("List comprehension:", fahrenheit_temps)

In [None]:
# List comprehension with condition
all_ph_values = [3.5, 6.8, 7.0, 7.2, 9.1, 5.5, 8.3]

# Get only acidic pH values (< 7)
acidic_values = [ph for ph in all_ph_values if ph < 7]
print("Acidic pH values:", acidic_values)

# Get only basic pH values (> 7)
basic_values = [ph for ph in all_ph_values if ph > 7]
print("Basic pH values:", basic_values)
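
The same comprehension syntax can also build dictionaries. As a small sketch, the lists from section 2.2 are repeated here so the cell runs on its own; `zip()` pairs each name with its molecular weight, and the 50 g/mol cutoff is an arbitrary example threshold.

```python
compounds = ["Water", "Ethanol", "Sodium Chloride", "Glucose"]
molecular_weights = [18.015, 46.07, 58.44, 180.16]  # g/mol

# Dictionary comprehension: map each compound name to its molecular weight.
mw_lookup = {name: mw for name, mw in zip(compounds, molecular_weights)}
print(mw_lookup)

# List comprehension over the dictionary, keeping only the heavier compounds.
heavy_compounds = [name for name, mw in mw_lookup.items() if mw > 50]
print("Heavier than 50 g/mol:", heavy_compounds)
```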

### 2.8 String Operations

In [None]:
# String methods
compound = "sodium chloride"

print(f"Original: {compound}")
print(f"Uppercase: {compound.upper()}")
print(f"Title case: {compound.title()}")
print(f"Replace: {compound.replace('sodium', 'potassium')}")

# String splitting and joining
formula = "C6H12O6"
elements = "C,H,O,N"

print(f"\nFormula length: {len(formula)}")
print(f"Starts with C: {formula.startswith('C')}")
print(f"Split elements: {elements.split(',')}")

In [None]:
# String formatting (multiple ways)
name = "Glucose"
mw = 180.156

# f-strings (recommended)
print(f"{name} has molecular weight of {mw:.2f} g/mol")

# .format() method
print("{} has molecular weight of {:.2f} g/mol".format(name, mw))

# % formatting (older style)
print("%s has molecular weight of %.2f g/mol" % (name, mw))

### 2.9 Error Handling (Try/Except)

In [None]:
# Handling potential errors
def safe_divide(a, b):
    try:
        result = a / b
        return result
    except ZeroDivisionError:
        print("Error: Cannot divide by zero!")
        return None
    except TypeError:
        print("Error: Invalid data type!")
        return None

# Test cases
print(safe_divide(10, 2))   # Normal case
print(safe_divide(10, 0))   # Division by zero
print(safe_divide(10, "a")) # Type error
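
A common use of `try`/`except` with lab data is converting text entries to numbers while skipping entries that are not numeric. The sketch below is one way to do that; `parse_measurement` and the sample entries are hypothetical, and the `else` branch runs only when no exception was raised.

```python
def parse_measurement(text):
    """Convert a text entry to a float, or return None if it is not numeric."""
    try:
        value = float(text)
    except ValueError:
        # float() raises ValueError for text such as "n/a".
        print(f"Skipping non-numeric entry: {text!r}")
        return None
    else:
        return value

raw_entries = ["7.2", "6.9", "n/a", "8.1"]
parsed = [parse_measurement(entry) for entry in raw_entries]

# Keep only the entries that converted successfully.
valid_values = [value for value in parsed if value is not None]
print("Valid values:", valid_values)
```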

## 3. Essential Python Libraries for Data Science

### 3.1 NumPy - Numerical Computing

In [None]:
import numpy as np

# Create arrays of experimental data
concentrations = np.array([0.1, 0.2, 0.3, 0.4, 0.5])  # mol/L
absorbances = np.array([0.15, 0.29, 0.44, 0.58, 0.73])

print("Concentrations:", concentrations)
print("Absorbances:", absorbances)
print(f"\nMean absorbance: {np.mean(absorbances):.3f}")
print(f"Std deviation: {np.std(absorbances):.3f}")
print(f"Min value: {np.min(absorbances)}")
print(f"Max value: {np.max(absorbances)}")
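
Note that `np.std` computes the population standard deviation by default (`ddof=0`). For a small set of replicate measurements, the sample standard deviation (`ddof=1`) is often the one reported. The sketch below compares the two on the same absorbance values; which one is appropriate depends on your analysis.

```python
import numpy as np

absorbances = np.array([0.15, 0.29, 0.44, 0.58, 0.73])

# ddof=0 (default): divide by N, the population formula.
population_std = np.std(absorbances)

# ddof=1: divide by N - 1, the sample formula.
sample_std = np.std(absorbances, ddof=1)

print(f"Population std (ddof=0): {population_std:.4f}")
print(f"Sample std (ddof=1):     {sample_std:.4f}")
```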

In [None]:
# NumPy array operations
temps_celsius = np.array([20, 25, 30, 35, 40])

# Vectorized operations (apply to all elements at once)
temps_fahrenheit = (temps_celsius * 9/5) + 32
temps_kelvin = temps_celsius + 273.15

print("Celsius:", temps_celsius)
print("Fahrenheit:", temps_fahrenheit)
print("Kelvin:", temps_kelvin)

In [None]:
# Creating arrays with NumPy
zeros = np.zeros(5)  # Array of zeros
ones = np.ones(5)    # Array of ones
range_array = np.arange(0, 10, 2)  # Array with range
linspace = np.linspace(0, 1, 5)  # 5 evenly spaced numbers between 0 and 1

print("Zeros:", zeros)
print("Ones:", ones)
print("Range:", range_array)
print("Linspace:", linspace)

### 3.2 Pandas - Data Analysis

In [None]:
import pandas as pd

# Create a DataFrame (like an Excel spreadsheet)
data = {
    'Compound': ['Benzene', 'Toluene', 'Ethanol', 'Methanol', 'Acetone'],
    'Formula': ['C6H6', 'C7H8', 'C2H5OH', 'CH3OH', 'C3H6O'],
    'MW': [78.11, 92.14, 46.07, 32.04, 58.08],
    'BP_C': [80.1, 110.6, 78.4, 64.7, 56.5],
    'Density': [0.879, 0.867, 0.789, 0.792, 0.791]
}

df = pd.DataFrame(data)
print(df)

In [None]:
# Basic DataFrame operations
print("First 3 rows:")
print(df.head(3))

print("\nDataFrame info:")
df.info()  # info() prints its report itself and returns None, so no print() is needed

print("\nStatistical summary:")
print(df.describe())

In [None]:
# Selecting columns
print("Compound names:")
print(df['Compound'])

print("\nMultiple columns:")
print(df[['Compound', 'MW', 'BP_C']])

In [None]:
# Filtering data
print("Compounds with BP > 70°C:")
high_bp = df[df['BP_C'] > 70]
print(high_bp)

print("\nCompounds with MW between 40 and 80:")
medium_mw = df[(df['MW'] > 40) & (df['MW'] < 80)]
print(medium_mw)

In [None]:
# Adding new columns
df['BP_F'] = (df['BP_C'] * 9/5) + 32  # Convert to Fahrenheit
df['BP_K'] = df['BP_C'] + 273.15      # Convert to Kelvin

print(df[['Compound', 'BP_C', 'BP_F', 'BP_K']])

In [None]:
# Sorting data
print("Sorted by Boiling Point (ascending):")
print(df.sort_values('BP_C')[['Compound', 'BP_C']])

print("\nSorted by Molecular Weight (descending):")
print(df.sort_values('MW', ascending=False)[['Compound', 'MW']])

### 3.3 Matplotlib - Data Visualization

In [None]:
import matplotlib.pyplot as plt

# Simple line plot
concentrations = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
absorbances = np.array([0.15, 0.29, 0.44, 0.58, 0.73])

plt.figure(figsize=(10, 6))
plt.plot(concentrations, absorbances, 'o-', linewidth=2, markersize=8)
plt.xlabel('Concentration (mol/L)', fontsize=12)
plt.ylabel('Absorbance', fontsize=12)
plt.title('Beer-Lambert Law: Concentration vs Absorbance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()
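
Under the Beer-Lambert law, absorbance is proportional to concentration (A = εlc), so the plotted points can be summarized by a straight line and used as a calibration curve. The sketch below fits that line with `np.polyfit` and estimates the concentration of an unknown sample; the unknown absorbance of 0.50 is a made-up value for illustration.

```python
import numpy as np

concentrations = np.array([0.1, 0.2, 0.3, 0.4, 0.5])  # mol/L
absorbances = np.array([0.15, 0.29, 0.44, 0.58, 0.73])

# Fit a degree-1 polynomial: absorbance = slope * concentration + intercept.
slope, intercept = np.polyfit(concentrations, absorbances, 1)
print(f"Slope: {slope:.3f} L/mol, intercept: {intercept:.3f}")

# Invert the calibration line to estimate an unknown concentration.
unknown_absorbance = 0.50  # hypothetical measurement
estimated_concentration = (unknown_absorbance - intercept) / slope
print(f"Estimated concentration: {estimated_concentration:.3f} mol/L")
```

The estimate is only meaningful inside the calibrated range (here 0.1 to 0.5 mol/L).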

In [None]:
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['MW'], df['BP_C'], s=150, alpha=0.6, edgecolors='black', linewidth=2)

# Add labels for each point
for i, txt in enumerate(df['Compound']):
    plt.annotate(txt, (df['MW'].iloc[i], df['BP_C'].iloc[i]),
                 xytext=(5, 5), textcoords='offset points', fontsize=9)

plt.xlabel('Molecular Weight (g/mol)', fontsize=12)
plt.ylabel('Boiling Point (°C)', fontsize=12)
plt.title('Molecular Weight vs Boiling Point', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Bar plot
plt.figure(figsize=(10, 6))
plt.bar(df['Compound'], df['BP_C'], color='steelblue', edgecolor='black', alpha=0.7)
plt.xlabel('Compound', fontsize=12)
plt.ylabel('Boiling Point (°C)', fontsize=12)
plt.title('Boiling Points of Common Solvents', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Multiple subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# First subplot
axes[0].scatter(df['MW'], df['BP_C'], s=100, alpha=0.6, color='red')
axes[0].set_xlabel('Molecular Weight')
axes[0].set_ylabel('Boiling Point (°C)')
axes[0].set_title('MW vs BP')
axes[0].grid(True, alpha=0.3)

# Second subplot
axes[1].scatter(df['MW'], df['Density'], s=100, alpha=0.6, color='blue')
axes[1].set_xlabel('Molecular Weight')
axes[1].set_ylabel('Density (g/mL)')
axes[1].set_title('MW vs Density')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Introduction to Machine Learning

### What is Machine Learning?

Machine Learning is a subset of AI that enables computers to learn from data without being explicitly programmed.

**Traditional Programming:**
```
Input Data + Rules → Output
```

**Machine Learning:**
```
Input Data + Output → Rules (Model)
```
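
The contrast can be sketched in code. This is a minimal illustration under stated assumptions: the pH readings and labels are made-up values, and a depth-1 decision tree from scikit-learn is used only because the rule it learns is a single threshold that is easy to read.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Traditional programming: a person writes the rule.
def is_acidic_rule(ph):
    return ph < 7

# Machine learning: supply example inputs and their known answers,
# and let the model infer the rule.
ph_readings = np.array([2.0, 4.5, 6.0, 6.8, 7.2, 8.0, 9.5, 12.0]).reshape(-1, 1)
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = acidic, 0 = not acidic

model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(ph_readings, labels)

# A depth-1 tree stores its single decision threshold at the root node.
learned_threshold = model.tree_.threshold[0]
print(f"Hand-written rule: pH < 7")
print(f"Learned rule:      pH <= {learned_threshold:.2f}")
print("Prediction for pH 5.0:", model.predict([[5.0]])[0])
```

The learned threshold falls between the closest acidic and non-acidic examples, so it depends entirely on the data supplied.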

### Types of Machine Learning

1. **Supervised Learning**
   - Learn from labeled data
   - Examples: Predicting molecular properties, classifying compounds
   - Algorithms: Linear Regression, Decision Trees, Neural Networks

2. **Unsupervised Learning**
   - Find patterns in unlabeled data
   - Examples: Clustering similar molecules, dimensionality reduction
   - Algorithms: K-Means, PCA

3. **Reinforcement Learning**
   - Learn through trial and error
   - Examples: Optimizing chemical reactions, drug discovery

### Key ML Terminology

- **Features (X)**: Input variables (e.g., molecular descriptors)
- **Target (y)**: Output variable (e.g., boiling point, toxicity)
- **Training**: Process of learning patterns from data
- **Model**: Mathematical representation of learned patterns
- **Prediction**: Using the model on new data

### Simple ML Example: Linear Regression

Let's predict boiling point from molecular weight for a small set of straight-chain alkanes.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Sample data: Alkanes (CH4, C2H6, C3H8, C4H10, C5H12)
molecular_weights = np.array([16, 30, 44, 58, 72]).reshape(-1, 1)
boiling_points = np.array([-161, -89, -42, -0.5, 36])

# Create and train the model
model = LinearRegression()
model.fit(molecular_weights, boiling_points)

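# Make predictions on the same data used for training
predictions = model.predict(molecular_weights)

# Calculate metrics (these describe the fit to the training data, not accuracy on unseen compounds)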
r2 = r2_score(boiling_points, predictions)
mae = mean_absolute_error(boiling_points, predictions)

print(f"Model Coefficient (slope): {model.coef_[0]:.3f}")
print(f"Model Intercept: {model.intercept_:.3f}")
print(f"R² Score: {r2:.4f}")
print(f"Mean Absolute Error: {mae:.2f}°C")

# Predict new value
new_mw = np.array([[86]])  # C6H14 (Hexane)
predicted_bp = model.predict(new_mw)
print(f"\nPredicted BP for MW={new_mw[0][0]}: {predicted_bp[0]:.1f}°C")
print("Actual BP of hexane: about 69°C")

In [None]:
# Visualize the model
plt.figure(figsize=(10, 6))
plt.scatter(molecular_weights, boiling_points, s=120, alpha=0.7,
            label='Actual Data', color='blue', edgecolors='black', linewidth=2)
plt.plot(molecular_weights, predictions, 'r-', linewidth=3,
         label='ML Model Prediction', alpha=0.7)

# Add the new prediction
plt.scatter(new_mw, predicted_bp, s=200, marker='*', color='gold',
            edgecolors='black', linewidth=2, label='Prediction', zorder=5)

plt.xlabel('Molecular Weight (g/mol)', fontsize=12)
plt.ylabel('Boiling Point (°C)', fontsize=12)
plt.title('Predicting Boiling Point from Molecular Weight', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()
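
Metrics computed on the training data tend to look better than performance on compounds the model has not seen. One way to check this, sketched below, is to train on the four lightest alkanes and test on pentane and hexane, using the same values as the example above. With so few points this is only an illustration of the idea, not a proper validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Training set: CH4, C2H6, C3H8, C4H10
X_train = np.array([16, 30, 44, 58]).reshape(-1, 1)
y_train = np.array([-161, -89, -42, -0.5])

# Held-out set: C5H12 (pentane) and C6H14 (hexane)
X_test = np.array([72, 86]).reshape(-1, 1)
y_test = np.array([36, 69])

holdout_model = LinearRegression().fit(X_train, y_train)

# Compare the error on data the model saw with data it did not see.
train_mae = mean_absolute_error(y_train, holdout_model.predict(X_train))
test_mae = mean_absolute_error(y_test, holdout_model.predict(X_test))

print(f"Training MAE: {train_mae:.1f}°C")
print(f"Held-out MAE: {test_mae:.1f}°C")
```

The held-out error is larger here because boiling point does not rise linearly with molecular weight, and the test compounds lie outside the training range.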

## 5. AI Applications in Chemistry and Materials Science

### Current Applications:

1. **Drug Discovery**
   - Predicting molecular properties (ADME, toxicity)
   - Virtual screening of compounds
   - Optimizing drug candidates
   - Example: Atomwise, Insilico Medicine

2. **Materials Science**
   - Predicting material properties
   - Discovering new materials (batteries, catalysts)
   - Optimizing synthesis conditions
   - Example: Materials Project, Citrine Informatics

3. **Chemical Reactions**
   - Reaction outcome prediction
   - Retrosynthesis planning
   - Yield optimization
   - Example: IBM RXN for Chemistry

4. **Spectroscopy**
   - Spectrum interpretation (NMR, IR, MS)
   - Compound identification
   - Quality control
   - Example: NMRShiftDB, Mass spectra prediction

5. **Process Optimization**
   - Parameter optimization
   - Quality prediction
   - Fault detection
   - Example: Automated experimental design

### Real-World Examples:

- **DeepMind's AlphaFold**: Protein structure prediction (recognized with the 2024 Nobel Prize in Chemistry)
- **IBM RXN**: Chemical reaction prediction and synthesis planning
- **Materials Project**: Open database of computed properties for 150,000+ materials
- **Schrödinger & BenevolentAI**: AI-powered drug discovery

## 6. Hands-on Exercises

### Exercise 1: Working with Chemical Data

In [None]:
# Exercise: Create your own compound database
# TODO: Add 3 more compounds with their properties

compounds_db = [
    {"name": "Water", "formula": "H2O", "mw": 18.015, "bp": 100, "mp": 0},
    {"name": "Ethanol", "formula": "C2H5OH", "mw": 46.07, "bp": 78.4, "mp": -114},
    # Add your compounds here (example: Acetone, Benzene, etc.)
]

# Convert to DataFrame
my_df = pd.DataFrame(compounds_db)
print(my_df)

# Calculate statistics
print(f"\nAverage Molecular Weight: {my_df['mw'].mean():.2f} g/mol")
print(f"Average Boiling Point: {my_df['bp'].mean():.2f}°C")

### Exercise 2: Temperature Converter Function

In [None]:
# TODO: Complete this function to convert between temperature units

def convert_temperature(value, from_unit, to_unit):
    """
    Convert temperature between Celsius, Fahrenheit, and Kelvin

    Parameters:
    - value: temperature value
    - from_unit: 'C', 'F', or 'K'
    - to_unit: 'C', 'F', or 'K'
    """
    # Your code here
    pass

# Test your function
# print(convert_temperature(25, 'C', 'F'))  # Should be 77
# print(convert_temperature(298.15, 'K', 'C'))  # Should be 25

### Exercise 3: Data Visualization

In [None]:
# TODO: Create a bar chart showing the melting points of your compounds
# Use the my_df DataFrame created in Exercise 1

# Your code here

## Summary

In this notebook, you learned:
- ✓ Google Colab environment
- ✓ Python fundamentals (variables, lists, dictionaries, loops, functions)
- ✓ Control flow (if/elif/else, try/except)
- ✓ Essential libraries (NumPy, Pandas, Matplotlib)
- ✓ Data visualization techniques
- ✓ Machine Learning basics
- ✓ AI applications in chemistry

## Next Steps

In the next notebook (Session 2), we'll:
- Work with real chemistry datasets
- Perform exploratory data analysis (EDA)
- Visualize molecular properties
- Prepare data for machine learning
- Use chemical informatics libraries (RDKit)

---

## Additional Resources

### Python Learning:
- [Python.org Official Tutorial](https://docs.python.org/3/tutorial/)
- [Real Python](https://realpython.com/)
- [W3Schools Python](https://www.w3schools.com/python/)

### Data Science:
- [NumPy Documentation](https://numpy.org/doc/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Matplotlib Gallery](https://matplotlib.org/stable/gallery/)

### Machine Learning:
- [Scikit-learn Documentation](https://scikit-learn.org/)
- [Google's ML Crash Course](https://developers.google.com/machine-learning/crash-course)

### Chemistry + AI:
- [RDKit](https://www.rdkit.org/) - Cheminformatics library
- [DeepChem](https://deepchem.io/) - Deep learning for chemistry
- [MoleculeNet](http://moleculenet.org/) - Chemistry datasets
- [ChemBERTa](https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1) - Transformer models for chemistry

---

**Questions? Feel free to ask during the workshop!**