# Modern Data Science Tools

This notebook introduces tools and libraries that are widely used in modern data science workflows.
We'll explore tools for data manipulation, visualization, machine learning, feature engineering, model interpretation, experiment tracking, and deployment.

## 1. Advanced Data Manipulation with Polars

Polars is a DataFrame library implemented in Rust with a Python interface. Its multithreaded, columnar engine and lazy query optimizer often make it considerably faster than pandas on large datasets.

In [None]:
# Install and import polars (if not already installed)
try:
    import polars as pl
    print(f"Polars version: {pl.__version__}")
except ImportError:
    print("Installing Polars...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "polars"])
    import polars as pl

from datetime import datetime

import numpy as np
import time

# Create a large dataset for the timing demo below
n_rows = 1_000_000
data = {
    'id': range(n_rows),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_rows),
    'value1': np.random.randn(n_rows),
    'value2': np.random.randn(n_rows),
    # 30-second datetime grid starting 2023-01-01, truncated to n_rows entries
    'timestamp': pl.datetime_range(
        datetime(2023, 1, 1), datetime(2023, 12, 31), interval='30s', eager=True
    ).head(n_rows)
}

# Create Polars DataFrame
df_pl = pl.DataFrame(data)
print(f"Polars DataFrame shape: {df_pl.shape}")
print("\nFirst few rows:")
print(df_pl.head())

In [None]:
# Demonstrate Polars' lazy evaluation and performance
print("=== Polars Lazy API ===")

# Lazy operations - no computation yet
lazy_df = (
    df_pl.lazy()
    .filter(pl.col('category') == 'A')
    .group_by('category')
    .agg([
        pl.col('value1').mean().alias('mean_value1'),
        pl.col('value2').std().alias('std_value2'),
        pl.len().alias('count')
    ])
    .sort('mean_value1', descending=True)
)

# Execute the lazy query
start_time = time.time()
result = lazy_df.collect()
polars_time = time.time() - start_time

print("Result:")
print(result)
print(f"\nPolars execution time: {polars_time:.4f} seconds")
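
The lazy query above only describes work; nothing runs until `collect()` is called. As a rough illustration of that idea, and not of how Polars is implemented, the sketch below uses a hypothetical `LazyQuery` class (pure Python, standard library only) that records steps and executes them only on `collect()`.

```python
class LazyQuery:
    """Toy query plan: records operations and runs them only on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Record the step and return a new plan; nothing is evaluated yet.
        return LazyQuery(self._rows, self._steps + (("filter", predicate),))

    def select(self, func):
        return LazyQuery(self._rows, self._steps + (("select", func),))

    def explain(self):
        # Human-readable description of the recorded plan.
        return " -> ".join(["scan"] + [name for name, _ in self._steps])

    def collect(self):
        # Only here do the recorded steps touch the data.
        rows = iter(self._rows)
        for name, func in self._steps:
            rows = filter(func, rows) if name == "filter" else map(func, rows)
        return list(rows)


calls = []  # records which rows the predicate has actually seen


def is_category_a(row):
    calls.append(row["id"])
    return row["category"] == "A"


rows = [
    {"id": 0, "category": "A", "value": 1.5},
    {"id": 1, "category": "B", "value": 2.0},
    {"id": 2, "category": "A", "value": -0.5},
]

query = LazyQuery(rows).filter(is_category_a).select(lambda r: r["value"])
print(query.explain())  # scan -> filter -> select
print(len(calls))       # 0, the predicate has not run yet
result = query.collect()
print(result)           # [1.5, -0.5]
```

A real engine such as Polars can also rewrite the recorded plan before running it; this toy version only defers execution.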

## 2. Interactive Visualization with Plotly

Plotly creates interactive, publication-quality visualizations that can be embedded in web applications.

In [None]:
try:
    import plotly
    import plotly.express as px
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots
    print(f"Plotly version: {plotly.__version__}")
except ImportError:
    print("Installing Plotly...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "plotly"])
    import plotly.express as px
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots

import pandas as pd

# Create sample data for visualization
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365, freq='D')
sales_data = pd.DataFrame({
    'date': dates,
    'sales': np.cumsum(np.random.randn(365) * 100 + 500),
    'category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], 365),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 365)
})

# Interactive line plot
fig = px.line(sales_data, x='date', y='sales', 
              title='Cumulative Sales Trend (Interactive)',
              labels={'sales': 'Sales ($)', 'date': 'Date'})
fig.update_layout(hovermode='x unified')
fig.show()

In [None]:
# Interactive scatter plot with faceting
fig = px.scatter(sales_data, x='date', y='sales', 
                 color='category', facet_col='region',
                 title='Sales by Category and Region',
                 labels={'sales': 'Sales ($)', 'date': 'Date'})
fig.update_layout(height=400)
fig.show()

## 3. Machine Learning with XGBoost and LightGBM

XGBoost and LightGBM are gradient boosting libraries that often match or outperform other classical algorithms on tabular data while training quickly.
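
As a minimal sketch of the idea behind gradient boosting, assuming squared-error loss and one-feature decision stumps: each round fits a small tree to the current residuals and adds a shrunken copy of it to the ensemble. The helper names `fit_stump` and `gradient_boost` are illustrative only; XGBoost and LightGBM use far more elaborate tree learners and regularization.

```python
import math


def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return (predict_fn, per-round training MSE)."""
    base = sum(ys) / len(ys)          # initial prediction: the mean
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]
predict, history = gradient_boost(xs, ys)
print(f"training MSE after round 1:  {history[0]:.4f}")
print(f"training MSE after round 50: {history[-1]:.4f}")
```

The training error shrinks round by round because every stump is fitted to what the ensemble still gets wrong; the learning rate controls how much of each correction is applied.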

In [None]:
try:
    import xgboost as xgb
    import lightgbm as lgb
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import mean_squared_error, r2_score
    print(f"XGBoost version: {xgb.__version__}")
    print(f"LightGBM version: {lgb.__version__}")
except ImportError:
    print("Installing gradient boosting libraries...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "xgboost", "lightgbm"])
    import xgboost as xgb
    import lightgbm as lgb
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import mean_squared_error, r2_score

# Create a regression dataset
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, 
                       noise=0.1, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

In [None]:
# Compare different models
models = {
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42, eval_metric='rmse'),
    'LightGBM': lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1)
}

results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train model
    start_time = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start_time
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    
    results[name] = {
        'MSE': mse,
        'R²': r2,
        'CV R²': cv_scores.mean(),
        'Training Time': training_time
    }
    
    print(f"  MSE: {mse:.4f}")
    print(f"  R²: {r2:.4f}")
    print(f"  CV R²: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    print(f"  Training Time: {training_time:.4f}s")

# Display comparison
import pandas as pd
results_df = pd.DataFrame(results).T
print("\n=== Model Comparison ===")
print(results_df.round(4))
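
For reference, the two test-set metrics in the comparison can be computed by hand. The sketch below uses hypothetical helper names (`mse_by_hand`, `r2_by_hand`) and a four-point toy example rather than the notebook's data; it follows the standard definitions of mean squared error and the coefficient of determination.

```python
def mse_by_hand(y_true, y_pred):
    """Mean squared error: average of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]
print(mse_by_hand(y_true_demo, y_pred_demo))           # 0.375
print(round(r2_by_hand(y_true_demo, y_pred_demo), 4))  # 0.9486
```

An R² of 1 means perfect predictions, while always predicting the mean of the targets gives an R² of 0.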

## 4. Feature Engineering with Featuretools

Featuretools is an automated feature engineering library that creates features from temporal and relational datasets.

In [None]:
try:
    import featuretools as ft
    print(f"Featuretools version: {ft.__version__}")
except ImportError:
    print("Installing Featuretools...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "featuretools"])
    import featuretools as ft

# Create sample entity set
es = ft.EntitySet(id="customer_data")

# Create customer data
customers_df = pd.DataFrame({
    'customer_id': range(100),
    'age': np.random.randint(18, 80, 100),
    'gender': np.random.choice(['M', 'F'], 100),
    'signup_date': pd.date_range('2020-01-01', periods=100, freq='D')
})

# Create transaction data
transactions_df = pd.DataFrame({
    'transaction_id': range(500),
    'customer_id': np.random.randint(0, 100, 500),
    'amount': np.random.uniform(10, 1000, 500),
    'transaction_date': pd.date_range('2020-01-01', periods=500, freq='6h')
})

# Add entities to entity set
es = es.add_dataframe(
    dataframe_name='customers',
    dataframe=customers_df,
    index='customer_id',
    time_index='signup_date'
)

es = es.add_dataframe(
    dataframe_name='transactions',
    dataframe=transactions_df,
    index='transaction_id',
    time_index='transaction_date'
)

# Add relationship
es = es.add_relationship('customers', 'customer_id', 'transactions', 'customer_id')

print("Entity Set created:")
print(es)

In [None]:
# Generate automated features
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    max_depth=2,
    verbose=True
)

print(f"\nGenerated {len(feature_defs)} features")
print("\nFeature matrix shape:", feature_matrix.shape)
print("\nSample features:")
print(feature_matrix.head())
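
Many of the generated features are aggregations of child rows (transactions) rolled up to the parent (customers), which Featuretools typically names along the lines of `SUM(transactions.amount)`. As a hand-written sketch of that rollup, assuming a tiny made-up transaction list and a hypothetical `aggregate_by_parent` helper (standard library only, not the Featuretools API):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: each row points at its parent via customer_id.
transactions = [
    {"transaction_id": 0, "customer_id": 1, "amount": 20.0},
    {"transaction_id": 1, "customer_id": 1, "amount": 30.0},
    {"transaction_id": 2, "customer_id": 2, "amount": 100.0},
]


def aggregate_by_parent(rows, parent_key, value_key):
    """Group child rows by parent and compute a few aggregation features."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[parent_key]].append(row[value_key])
    return {
        parent: {
            "COUNT": len(values),
            "SUM": sum(values),
            "MEAN": mean(values),
            "MAX": max(values),
        }
        for parent, values in groups.items()
    }


features = aggregate_by_parent(transactions, "customer_id", "amount")
print(features[1])  # {'COUNT': 2, 'SUM': 50.0, 'MEAN': 25.0, 'MAX': 30.0}
```

Deep feature synthesis automates this across every relationship and primitive, and with `max_depth=2` it can also stack primitives on top of each other.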

## 5. Model Interpretation with SHAP

SHAP (SHapley Additive exPlanations) explains individual model predictions by attributing each one to the input features, using Shapley values from cooperative game theory.
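
To make the idea of a Shapley value concrete, the sketch below computes exact values for a three-feature toy function by averaging each feature's marginal contribution over all feature orderings. It assumes a single all-zero baseline and a hypothetical `toy_model`; this is the textbook definition, not the algorithm the `shap` library uses for tree models.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every feature ordering.

    Features not yet "switched on" take their baseline value.
    Cost grows factorially with the number of features.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = x[i]                    # switch feature i on
            output = model(current)
            totals[i] += output - previous_output  # marginal contribution
            previous_output = output
    return [total / len(orderings) for total in totals]


def toy_model(features):
    a, b, c = features
    return 2 * a + b * c  # one additive term, one interaction term


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # [2.0, 3.0, 3.0]
print(sum(phi), toy_model(x) - toy_model(baseline))  # 8.0 8.0
```

The interaction term `b * c` is split evenly between `b` and `c`, and the attributions sum to the difference between the prediction and the baseline output, which is the additivity property that SHAP plots rely on.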

In [None]:
try:
    import shap
    print(f"SHAP version: {shap.__version__}")
except ImportError:
    print("Installing SHAP...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "shap"])
    import shap

# Use the XGBoost model from the comparison above (check results_df to see which model scored best on your run)
best_model = models['XGBoost']
best_model.fit(X_train, y_train)

# Create SHAP explainer
explainer = shap.Explainer(best_model)
shap_values = explainer(X_test)

# Summary plot
print("SHAP Summary Plot:")
shap.summary_plot(shap_values, X_test, plot_type="bar")

In [None]:
# Detailed explanation for a single prediction
print("SHAP Waterfall Plot for Single Prediction:")
shap.waterfall_plot(shap_values[0])

## 6. MLflow for Experiment Tracking

MLflow tracks experiments, reproduces runs, and deploys models.

In [None]:
try:
    import mlflow
    import mlflow.sklearn
    print(f"MLflow version: {mlflow.__version__}")
except ImportError:
    print("Installing MLflow...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "mlflow"])
    import mlflow
    import mlflow.sklearn

# Set up MLflow experiment
mlflow.set_experiment("Data Science Tools Comparison")

# Log an experiment
with mlflow.start_run(run_name="Random_Forest_Experiment") as run:
    # Log parameters
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    
    # Train and evaluate
    rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    
    # Log metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)
    
    # Log the model
    mlflow.sklearn.log_model(rf, "random_forest_model")
    
    print(f"Experiment logged with run ID: {run.info.run_id}")
    print(f"MSE: {mse:.4f}, R²: {r2:.4f}")

## 7. Streamlit for Web Applications

Streamlit turns data scripts into shareable web apps in minutes.

In [None]:
# Example Streamlit app code (save as app.py to run)
streamlit_code = '''
import streamlit as st
import pandas as pd
import numpy as np
import plotly.express as px

st.title("Data Science Tools Dashboard")
st.write("Interactive dashboard built with Streamlit")

# Sidebar controls
st.sidebar.header("Controls")
data_size = st.sidebar.slider("Data Size", 100, 1000, 500)
noise_level = st.sidebar.slider("Noise Level", 0.0, 1.0, 0.1)

# Generate data
np.random.seed(42)
x = np.linspace(0, 10, data_size)
y = np.sin(x) + noise_level * np.random.randn(data_size)

# Create DataFrame
df = pd.DataFrame({"x": x, "y": y})

# Display data
st.subheader("Generated Data")
st.dataframe(df.head())

# Interactive plot
st.subheader("Interactive Plot")
fig = px.line(df, x="x", y="y", title="Sine Wave with Noise")
st.plotly_chart(fig, use_container_width=True)

# Statistics
st.subheader("Statistics")
col1, col2 = st.columns(2)
with col1:
    st.metric("Mean Y", f"{np.mean(y):.3f}")
    st.metric("Std Y", f"{np.std(y):.3f}")
with col2:
    st.metric("Min Y", f"{np.min(y):.3f}")
    st.metric("Max Y", f"{np.max(y):.3f}")
'''

print("Streamlit app code generated. Save this as 'app.py' and run with: streamlit run app.py")
print("\nSample app code:")
print(streamlit_code[:500] + "...")
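
The cell above only prints the app source. One way to write it to disk is sketched below, assuming a hypothetical `save_app` helper and a short placeholder app string in place of the notebook's `streamlit_code`; the demo writes into a temporary directory so it does not overwrite an existing `app.py`.

```python
import tempfile
from pathlib import Path


def save_app(code, path="app.py"):
    """Write app source to `path` and return the resulting Path."""
    target = Path(path)
    # Drop the leading newline left by the opening triple quote.
    target.write_text(code.lstrip("\n"), encoding="utf-8")
    return target


# Placeholder app source for the demo (not the full dashboard above).
demo_code = '''
import streamlit as st
st.title("Hello")
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_app(demo_code, Path(tmp_dir) / "app.py")
    first_line = saved.read_text(encoding="utf-8").splitlines()[0]
    print(saved.name, "starts with:", first_line)
```

In the notebook itself, `save_app(streamlit_code)` would write `app.py` to the working directory, after which `streamlit run app.py` starts the app.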

## 8. Modern Python Data Science Stack

### Essential Libraries for 2024+:

**Data Manipulation:**
- `polars` - Fast DataFrames with lazy evaluation
- `pandas` - Still the standard for many workflows
- `dask` - Parallel computing with pandas-like API

**Machine Learning:**
- `scikit-learn` - Traditional ML algorithms
- `xgboost` - Gradient boosting
- `lightgbm` - Fast gradient boosting
- `catboost` - Gradient boosting with categorical support

**Deep Learning:**
- `tensorflow`/`keras` - Production-ready deep learning
- `pytorch` - Research-friendly deep learning
- `jax` - High-performance numerical computing

**Visualization:**
- `plotly` - Interactive visualizations
- `altair` - Declarative statistical visualization
- `seaborn` - Statistical plots
- `matplotlib` - Foundation plotting library

**MLOps:**
- `mlflow` - Experiment tracking
- `dvc` - Data version control
- `bentoml` - Model serving
- `streamlit` - Rapid app development

**Feature Engineering:**
- `featuretools` - Automated feature engineering
- `tsfresh` - Time series feature extraction
- `category_encoders` - Advanced categorical encoding

## Best Practices for Modern Data Science

1. **Use lazy evaluation** when possible (Polars, Dask)
2. **Leverage GPU acceleration** for deep learning and large datasets
3. **Track experiments** systematically with MLflow or similar tools
4. **Automate feature engineering** to reduce manual effort
5. **Interpret models** using SHAP or LIME for transparency
6. **Build interactive dashboards** for stakeholder communication
7. **Use version control** for both code and data
8. **Containerize environments** with Docker for reproducibility
9. **Monitor model performance** in production
10. **Stay updated** with rapidly evolving tools and techniques
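
As a lightweight complement to the reproducibility practices above, a notebook can record which package versions it ran with. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `snapshot_versions` and the package list are illustrative assumptions.

```python
from importlib import metadata


def snapshot_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Output depends on the environment, so no expected values are shown.
print(snapshot_versions(["numpy", "pandas", "polars"]))
```

Unlike reading `__version__`, this does not import the packages, so it is cheap to run and works for packages that are not installed.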

## Resources for Learning Modern Tools

- [Polars Documentation](https://pola.rs/docs/)
- [Plotly Documentation](https://plotly.com/python/)
- [XGBoost Guide](https://xgboost.readthedocs.io/)
- [MLflow Tracking](https://mlflow.org/docs/latest/tracking.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [Featuretools Guide](https://featuretools.alteryx.com/)
- [SHAP Documentation](https://shap.readthedocs.io/)

Together, these tools cover the main stages of a modern data science workflow and can help you build more efficient and scalable solutions.