# Logging in Data Science Projects

In this notebook, we'll explore why logging matters in data science projects, how to implement it with Python's built-in `logging` module, and best practices for keeping logs comprehensive and useful. Good logs help you debug, monitor, and understand how your code behaves over time, which matters for data science workflows that often involve long-running processes and complex data transformations.

**Table of Contents**
1. [Why Logging is Important](#1)
2. [Setting Up Logging in Python](#2)
3. [Logging Best Practices](#3)
4. [Step-by-Step Example](#4)
5. [Exercise](#5)

---
## 1. Why Logging is Important <a id="1"></a>

Logging is a critical component of any robust data science project for several reasons:

- **Debugging**: Logs provide insights into the code's execution flow and help identify where errors occur.
- **Monitoring**: Logs allow you to monitor the performance and behavior of your code over time.
- **Audit Trail**: Logs create a historical record of what your code did, which is essential for reproducibility and accountability.
- **Communication**: Logs can help communicate the state and progress of your code to team members and stakeholders.

---
## 2. Setting Up Logging in Python <a id="2"></a>

Python's built-in logging module provides a flexible framework for emitting log messages from Python programs. It is part of the standard library, so you don't need to install anything extra to use it.


### Basic Logging Configuration

Here is a basic example of how to set up logging in Python:

In [None]:
import logging

# Set up basic configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Example log messages
logging.debug("This is a debug message")  # not shown: DEBUG is below the INFO threshold
logging.info("This is an info message")
logging.warning("This is a warning message")
logging.error("This is an error message")
logging.critical("This is a critical message")
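
The calls above go through the root logger. As a minimal sketch of a common alternative, the snippet below creates a named logger with `logging.getLogger` and gives it its own handler. The logger name `demo.pipeline` and the in-memory `StringIO` stream are assumptions chosen for illustration, so that the output is easy to inspect. It also shows that messages below the configured level are dropped.

```python
import io
import logging

# Hypothetical logger name, chosen for illustration only
logger = logging.getLogger("demo.pipeline")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep this demo's records out of the root logger's handlers

# Write to an in-memory buffer so the result is easy to inspect
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))
logger.handlers.clear()  # avoid duplicate handlers when the cell is re-run
logger.addHandler(handler)

logger.debug("Dropped: DEBUG is below the INFO threshold")
# Arguments are only merged into the message if the record is actually emitted
logger.info("Kept: loaded %d rows", 100)

captured = buffer.getvalue()
print(captured)
```

In a real project the handler would usually write to the console or a file rather than a buffer; the named logger simply makes it clear which part of the code produced each message.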


### Logging to a File

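To log messages to a file instead of the console, pass a `filename` to `basicConfig`. Because `basicConfig` does nothing when the root logger already has handlers (for example, after running the previous cell), pass `force=True` (Python 3.8+) to replace the existing configuration:

In [None]:
import logging

# Set up logging to a file; force=True replaces any handlers configured earlier
logging.basicConfig(filename='example.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)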

# Example log messages
logging.info("This is an info message logged to a file")

---
## 3. Logging Best Practices <a id="3"></a>

- **Use Appropriate Log Levels**: Choose the level (DEBUG, INFO, WARNING, ERROR, CRITICAL) that matches the severity and purpose of each message.
- **Log with Context**: Include context in your log messages to make them more informative. For example, log variable values and the state of your application.
- **Avoid Logging Sensitive Information**: Keep passwords, personal data, and proprietary information out of your log messages.
- **Configure Rotating Logs**: For long-running applications, configure log rotation to prevent log files from growing indefinitely.
- **Use Structured Logging**: Use structured logging (e.g., JSON format) for better log parsing and analysis.
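
The last two practices can be sketched together. The example below is one possible setup, not the only one: it pairs the standard library's `RotatingFileHandler` with a small formatter that writes one JSON object per line. The `JsonFormatter` class, the logger name, the temporary log path, and the size limits are all assumptions made for illustration.

```python
import json
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Hypothetical helper: render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


# Assumed location and limits, chosen for illustration:
# rotate at roughly 1 MB and keep at most 3 old files
log_path = os.path.join(tempfile.mkdtemp(), "pipeline.log")
handler = RotatingFileHandler(log_path, maxBytes=1_000_000, backupCount=3)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("demo.structured")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers.clear()  # avoid duplicate handlers when the cell is re-run
logger.addHandler(handler)

# Context (row counts) goes into the message; no sensitive values are logged
logger.info("Dropped %d of %d rows with missing values", 12, 500)
handler.flush()

# Each line of the file can be parsed back into a dictionary
with open(log_path) as f:
    first_record = json.loads(f.readline())
print(first_record["level"], first_record["message"])
```

Because every line is valid JSON, the log file can later be read with standard tools instead of being parsed with regular expressions.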

---
## 4. Step-by-Step Example <a id="4"></a>

Let's create a more detailed example of logging in a data science project. We'll log the steps of data loading, preprocessing, and model training.

In [None]:
import logging
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Set up logging; force=True (Python 3.8+) replaces handlers configured in earlier cells
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)

# Step 1: Load Data
logging.info("Loading data...")
data = pd.read_csv('data/raw/sample_data.csv')
logging.info(f"Data loaded with shape {data.shape}")

# Step 2: Preprocess Data
logging.info("Preprocessing data...")
data.dropna(inplace=True)
X = data.drop('target', axis=1)
y = data['target']
logging.info(f"Data preprocessed. Features shape: {X.shape}, Target shape: {y.shape}")

# Step 3: Split Data
logging.info("Splitting data into training and test sets...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
logging.info(f"Data split. Training set shape: {X_train.shape}, Test set shape: {X_test.shape}")

# Step 4: Train Model
logging.info("Training model...")
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
logging.info("Model training completed")

# Step 5: Evaluate Model
logging.info("Evaluating model...")
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
logging.info(f"Model accuracy: {accuracy}")

# Step 6: Save Model
logging.info("Saving model...")
import joblib
joblib.dump(model, 'models/random_forest_model.pkl')
logging.info("Model saved successfully")
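
The pipeline above logs the start and end of each step by hand and does not record failures. As a rough sketch of one way to reduce that repetition, the hypothetical `log_step` context manager below logs when a step starts, how long it took, and the traceback if it raises. The helper names, the logger name, and the list-based handler are assumptions for illustration; the steps use plain Python so the example runs without any data files.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("demo.steps")  # hypothetical logger name


@contextmanager
def log_step(name):
    """Hypothetical helper: log the start, duration, and any failure of a step."""
    logger.info("%s started", name)
    start = time.perf_counter()
    try:
        yield
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback
        logger.exception("%s failed", name)
        raise
    else:
        logger.info("%s finished in %.2f s", name, time.perf_counter() - start)


class ListHandler(logging.Handler):
    """Collect messages in a list so the behaviour can be checked without a file."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


list_handler = ListHandler()
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers.clear()  # avoid duplicate handlers when the cell is re-run
logger.addHandler(list_handler)

# A step that succeeds
with log_step("Preprocessing"):
    cleaned = [x for x in [1.0, None, 3.0] if x is not None]

# A step that fails: the error is logged, then re-raised for the caller to handle
try:
    with log_step("Loading"):
        open("path/that/does/not/exist.csv")
except FileNotFoundError:
    pass  # the failure has already been logged

print(list_handler.messages)
```

Re-raising inside the helper keeps the normal error behaviour intact, so the log records the failure without hiding it from the rest of the program.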

---
## 5. Exercise <a id="5"></a>

In this exercise, you'll implement a logging system for a data science workflow. Here is sample code without any logging:

**Sample Code (Without Logging)**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

# Step 1: Load Data
data = pd.read_csv('data/raw/sample_data.csv')

# Step 2: Preprocess Data
data.dropna(inplace=True)
X = data.drop('target', axis=1)
y = data['target']

# Step 3: Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train Model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 5: Evaluate Model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy}")

# Step 6: Save Model
joblib.dump(model, 'models/random_forest_model.pkl')
print("Model saved successfully")

**Task**
1. Set Up Logging: Configure a logging system that logs messages to both the console and a file.
2. Implement Logging: Add logging to the sample code to log each step of the data loading, preprocessing, splitting, training, evaluating, and saving processes.
3. Review Logs: Run your code and review the generated logs to ensure they provide a clear and comprehensive overview of the code execution.

**Your Code (With Logging)**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

# Step 1: Load Data
data = pd.read_csv('data/raw/sample_data.csv')

# Step 2: Preprocess Data
data.dropna(inplace=True)
X = data.drop('target', axis=1)
y = data['target']

# Step 3: Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train Model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 5: Evaluate Model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy}")

# Step 6: Save Model
joblib.dump(model, 'models/random_forest_model.pkl')
print("Model saved successfully")

><details>
><summary>Do you need some help?</summary>
>
> Here is a working solution:
>
>
>```python
>
>import logging
>import pandas as pd
>from sklearn.model_selection import train_test_split
>from sklearn.ensemble import RandomForestClassifier
>from sklearn.metrics import accuracy_score
>import joblib
>
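># Set up logging to both the console and a file, as the task requires
>logging.basicConfig(
>    level=logging.INFO,
>    format='%(asctime)s - %(levelname)s - %(message)s',
>    handlers=[logging.FileHandler('project.log'), logging.StreamHandler()],
>    force=True,  # replace handlers left over from earlier cells (Python 3.8+)
>)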
>
># Step 1: Load Data
>logging.info("Loading data...")
>data = pd.read_csv('data/raw/sample_data.csv')
>logging.info(f"Data loaded with shape {data.shape}")
>
># Step 2: Preprocess Data
>logging.info("Preprocessing data...")
>data.dropna(inplace=True)
>X = data.drop('target', axis=1)
>y = data['target']
>logging.info(f"Data preprocessed. Features shape: {X.shape}, Target shape: {y.shape}")
>
># Step 3: Split Data
>logging.info("Splitting data into training and test sets...")
>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>logging.info(f"Data split. Training set shape: {X_train.shape}, Test set shape: {X_test.shape}")
>
># Step 4: Train Model
>logging.info("Training model...")
>model = RandomForestClassifier(random_state=42)
>model.fit(X_train, y_train)
>logging.info("Model training completed")
>
># Step 5: Evaluate Model
>logging.info("Evaluating model...")
>y_pred = model.predict(X_test)
>accuracy = accuracy_score(y_test, y_pred)
>logging.info(f"Model accuracy: {accuracy}")
>
># Step 6: Save Model
>logging.info("Saving model...")
>joblib.dump(model, 'models/random_forest_model.pkl')
>logging.info("Model saved successfully")
>```
></details>