# Data Analysis and Feature Engineering

This notebook implements data analysis and feature engineering for the challenge.

## Table of Contents
1. [Setup and Data Loading](#Step-1:-Setup-and-Data-Loading)
2. [Exploratory Data Analysis (EDA)](#Step-2:-Exploratory-Data-Analysis-(EDA))
3. [Feature Engineering](#Step-3:-Feature-Engineering)
4. [Data Preparation for Modeling](#Step-4:-Data-Preparation-for-Modeling)

## Step 1: Setup and Data Loading

Import necessary libraries and load the challenge data.

In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set up visualization style
plt.style.use('default')
sns.set_palette('husl')

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("Libraries imported successfully!")

Libraries imported successfully!


In [3]:
# Load the CSV file into a pandas DataFrame
df = pd.read_csv('challenge_data-18-ago.csv')

print(f"Data loaded successfully! Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

ParserError: Error tokenizing data. C error: Expected 14 fields in line 9, saw 30
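
A `ParserError` of this kind usually means the file does not match the default comma-separated layout: for example, it may use a different delimiter, contain unquoted commas inside fields, or start with metadata lines before the header. The actual cause here is not known from the traceback alone. The sketch below shows one way to test the delimiter hypothesis: passing `sep=None` with `engine="python"` asks pandas to sniff the separator. The sample text is hypothetical and only stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical sample (not the challenge data): semicolon-delimited text with a
# comma inside a field, which the default comma parser would mis-tokenize.
sample = "id;name;score\n1;Ana, Maria;3.5\n2;Luis;4.0\n"

# sep=None together with the Python engine lets pandas detect the delimiter.
df_sniffed = pd.read_csv(io.StringIO(sample), sep=None, engine="python")

print(df_sniffed.shape)
print(list(df_sniffed.columns))
```

If sniffing does not help on the real file, printing the first ten raw lines of the file is a cheap way to see what line 9 actually contains before choosing `sep`, `skiprows`, or `quotechar`.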


## Step 2: Exploratory Data Analysis (EDA)

Perform initial data inspection to understand the dataset structure and characteristics.

In [None]:
# View the first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Get DataFrame information
print("\nDataFrame Information:")
df.info()

In [None]:
# Get descriptive statistics
print("\nDescriptive Statistics:")
df.describe()

In [None]:
# Check for missing values
print("\nMissing Values Count:")
missing_values = df.isnull().sum()
print(missing_values)

print(f"\nTotal missing values: {missing_values.sum()}")
print(f"Percentage of missing data: {(missing_values.sum() / (df.shape[0] * df.shape[1])) * 100:.2f}%")
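
The totals above hide which columns are responsible for the gaps. A per-column summary, sorted by missing percentage, can be sketched as follows. The `toy` frame and its column names are hypothetical, since the real columns are unknown until the file loads.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "y", "z"],
    "c": [1, 2, 3, 4],
})

# Count and percentage of missing values per column, worst columns first.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```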

### Data Visualization

Create various visualizations to understand data distributions and relationships.

In [None]:
# Histograms for numerical features
numerical_cols = df.select_dtypes(include=[np.number]).columns

if len(numerical_cols) > 0:
    print(f"Creating histograms for {len(numerical_cols)} numerical columns...")
    
    # Calculate grid dimensions
    n_cols = 3
    n_rows = (len(numerical_cols) + n_cols - 1) // n_cols

    # squeeze=False always returns a 2D array of axes, so flattening it
    # works for any grid size (including a single row)
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 5*n_rows), squeeze=False)
    axes = axes.ravel()

    for ax, col in zip(axes, numerical_cols):
        df[col].hist(bins=30, ax=ax)
        ax.set_title(f'Distribution of {col}')
        ax.set_xlabel(col)
        ax.set_ylabel('Frequency')

    # Hide any unused axes in the last row
    for ax in axes[len(numerical_cols):]:
        ax.set_visible(False)

    plt.tight_layout()
    plt.show()
else:
    print("No numerical columns found for histograms.")

In [None]:
# Box plots to identify outliers
if len(numerical_cols) > 0:
    print("Creating box plots to identify outliers...")
    
    plt.figure(figsize=(15, 8))
    sns.boxplot(data=df[numerical_cols])
    plt.title('Box Plots for Numerical Features')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("No numerical columns found for box plots.")
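
Box plots show outliers visually but do not count them. A numeric companion, assuming the same 1.5 × IQR rule that box plot whiskers use by default, might look like the sketch below. The helper name `iqr_outlier_counts` and the `toy` data are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare every column against its own bounds (aligned on column labels).
    mask = frame.lt(lower, axis=1) | frame.gt(upper, axis=1)
    return mask.sum()

# Hypothetical toy data: column "x" contains one obvious outlier.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [10, 11, 12, 13, 14]})
outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

On the notebook's data this would be called as `iqr_outlier_counts(df[numerical_cols])`.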

In [None]:
# Correlation matrix heatmap
if len(numerical_cols) > 1:
    print("Creating correlation matrix heatmap...")
    
    plt.figure(figsize=(12, 8))
    correlation_matrix = df[numerical_cols].corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', 
                linewidths=0.5, cbar_kws={'shrink': 0.8})
    plt.title('Correlation Matrix Heatmap')
    plt.tight_layout()
    plt.show()
    
    # Print highly correlated pairs
    print("\nHighly correlated pairs (|correlation| > 0.7):")
    high_corr_pairs = []
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            if abs(correlation_matrix.iloc[i, j]) > 0.7:
                high_corr_pairs.append({
                    'Variable 1': correlation_matrix.columns[i],
                    'Variable 2': correlation_matrix.columns[j],
                    'Correlation': correlation_matrix.iloc[i, j]
                })
    
    if high_corr_pairs:
        high_corr_df = pd.DataFrame(high_corr_pairs)
        print(high_corr_df)
    else:
        print("No highly correlated pairs found.")
else:
    print("Insufficient numerical columns for correlation analysis.")
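
The nested loop above can also be written without explicit iteration. The following sketch is an alternative formulation, not the notebook's method: it uses `np.triu_indices` to visit each column pair once and applies the same 0.7 threshold. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" and "b" are almost perfectly correlated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
})
corr = toy.corr()

# Indices of the strict upper triangle, so each pair appears exactly once.
rows, cols = np.triu_indices(len(corr), k=1)
pairs = pd.DataFrame({
    "Variable 1": corr.index[rows],
    "Variable 2": corr.columns[cols],
    "Correlation": corr.to_numpy()[rows, cols],
})

# Keep only strongly correlated pairs.
high_corr = pairs[pairs["Correlation"].abs() > 0.7].reset_index(drop=True)
print(high_corr)
```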

In [None]:
# Pair plots for smaller datasets
# Note: Pair plots can be computationally expensive for large datasets
max_pairplot_cols = 5  # Limit to prevent performance issues

if len(numerical_cols) <= max_pairplot_cols and len(numerical_cols) > 1:
    print(f"Creating pair plots for {len(numerical_cols)} numerical variables...")
    sns.pairplot(df[numerical_cols], diag_kind='kde')
    plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
    plt.show()
elif len(numerical_cols) > max_pairplot_cols:
    print(f"Dataset has {len(numerical_cols)} numerical columns. Pair plot skipped to avoid performance issues.")
    print("Consider sampling or selecting specific columns for pair plot analysis.")
else:
    print("Insufficient numerical columns for pair plots.")

## Step 3: Feature Engineering

Based on the EDA results, perform feature engineering operations.

In [None]:
# Create a copy of the dataframe for feature engineering
df_engineered = df.copy()

print("Starting feature engineering process...")
print(f"Original dataset shape: {df_engineered.shape}")

In [None]:
# Handle Missing Values
print("\nHandling missing values...")

# Get columns with missing values
missing_cols = df_engineered.columns[df_engineered.isnull().any()]

if len(missing_cols) > 0:
    for col in missing_cols:
        missing_count = df_engineered[col].isnull().sum()
        missing_pct = (missing_count / len(df_engineered)) * 100
        
        print(f"\nColumn '{col}': {missing_count} missing values ({missing_pct:.2f}%)")
        
        if pd.api.types.is_numeric_dtype(df_engineered[col]):
            # Numerical column - use median
            median_val = df_engineered[col].median()
            # Assign the result back; in-place fillna on a selected column is
            # deprecated in recent pandas and may not modify the DataFrame
            df_engineered[col] = df_engineered[col].fillna(median_val)
            print(f"  -> Filled with median: {median_val}")
        else:
            # Categorical column - use mode
            mode_val = df_engineered[col].mode()[0]
            df_engineered[col] = df_engineered[col].fillna(mode_val)
            print(f"  -> Filled with mode: {mode_val}")
else:
    print("No missing values found in the dataset.")

print(f"\nAfter handling missing values, shape: {df_engineered.shape}")

In [None]:
# Encode Categorical Variables
print("\nEncoding categorical variables...")

categorical_cols = df_engineered.select_dtypes(include=['object', 'category']).columns
print(f"Found {len(categorical_cols)} categorical columns: {list(categorical_cols)}")

if len(categorical_cols) > 0:
    from sklearn.preprocessing import LabelEncoder
    
    # Apply label encoding to categorical columns
    label_encoders = {}
    
    for col in categorical_cols:
        unique_values = df_engineered[col].nunique()
        print(f"\nColumn '{col}': {unique_values} unique values")
        
        # For columns with 2 unique values, use label encoding
        if unique_values == 2:
            le = LabelEncoder()
            df_engineered[col] = le.fit_transform(df_engineered[col])
            label_encoders[col] = le
            print(f"  -> Applied label encoding")
        
        # For other low-cardinality columns (up to 10 unique values), use one-hot encoding
        elif unique_values <= 10:  # Reasonable limit for one-hot encoding
            # Create dummy variables
            dummies = pd.get_dummies(df_engineered[col], prefix=col, drop_first=True)
            df_engineered = pd.concat([df_engineered, dummies], axis=1)
            df_engineered.drop(col, axis=1, inplace=True)
            print(f"  -> Applied one-hot encoding, created {dummies.shape[1]} new columns")
        
        else:
            print(f"  -> Column has too many unique values ({unique_values}), consider grouping or alternative encoding")
    
    print(f"\nAfter categorical encoding, shape: {df_engineered.shape}")
else:
    print("No categorical columns found.")
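
The encoding cell leaves columns with more than 10 unique values untouched and only suggests an alternative encoding. One such alternative is frequency encoding, which replaces each category with its relative frequency. This is a swapped-in technique shown as a minimal sketch, not something the notebook applies, and the `city` column is hypothetical.

```python
import pandas as pd

# Hypothetical high-cardinality column (kept tiny for illustration).
toy = pd.DataFrame({"city": ["A", "B", "A", "C", "A", "B"]})

# Relative frequency of each category in the column.
freq = toy["city"].value_counts(normalize=True)

# Map every row's category to its frequency, producing one numeric column.
toy["city_freq"] = toy["city"].map(freq)
print(toy)
```

Frequency encoding keeps the column count fixed, at the cost of giving identical values to categories that happen to be equally common.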

In [None]:
# Create New Features (Domain-specific)
print("\nCreating new features...")

# Get numerical columns for feature creation
numerical_cols = df_engineered.select_dtypes(include=[np.number]).columns
print(f"Available numerical columns for feature engineering: {list(numerical_cols)}")

new_features_created = 0

# Example feature engineering based on common patterns
if len(numerical_cols) >= 2:
    # Create interaction features (ratios, products, differences)
    for i, col1 in enumerate(numerical_cols):
        for col2 in numerical_cols[i+1:]:
            # Only build interactions when col2 is strictly positive, so the ratio never divides by zero
            if df_engineered[col2].min() > 0:
                # Ratio feature
                ratio_name = f"{col1}_div_{col2}"
                df_engineered[ratio_name] = df_engineered[col1] / df_engineered[col2]
                new_features_created += 1
                
                # Product feature
                product_name = f"{col1}_times_{col2}"
                df_engineered[product_name] = df_engineered[col1] * df_engineered[col2]
                new_features_created += 1
                
                # Difference feature
                diff_name = f"{col1}_minus_{col2}"
                df_engineered[diff_name] = df_engineered[col1] - df_engineered[col2]
                new_features_created += 1
    
    print(f"Created {new_features_created} new interaction features")
else:
    print("Insufficient numerical columns for interaction feature creation.")

# Create statistical features if we have enough columns
if len(numerical_cols) >= 3:
    # Row-wise statistics
    df_engineered['row_mean'] = df_engineered[numerical_cols].mean(axis=1)
    df_engineered['row_std'] = df_engineered[numerical_cols].std(axis=1)
    df_engineered['row_min'] = df_engineered[numerical_cols].min(axis=1)
    df_engineered['row_max'] = df_engineered[numerical_cols].max(axis=1)
    new_features_created += 4
    print("Created row-wise statistical features (mean, std, min, max)")

print(f"\nFeature engineering completed. Final shape: {df_engineered.shape}")
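
The `min() > 0` guard above skips every pair whose second column contains a zero or a negative value. If ratios are still wanted for such columns, one option is to turn zero denominators into `NaN` so that only the affected rows are left undefined. This is a sketch under that assumption, and the column names `num` and `den` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: the denominator contains a zero and a negative value.
toy = pd.DataFrame({"num": [1.0, 2.0, 3.0], "den": [2.0, 0.0, -4.0]})

# Replace zeros with NaN before dividing, so the result is NaN instead of inf.
toy["num_div_den"] = toy["num"] / toy["den"].replace(0, np.nan)
print(toy)
```

The resulting `NaN` values would then need the same missing-value handling as any other column before scaling.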

In [None]:
# Feature Scaling
print("\nApplying feature scaling...")

# Separate numerical columns for scaling
numerical_cols_final = df_engineered.select_dtypes(include=[np.number]).columns
print(f"Scaling {len(numerical_cols_final)} numerical columns")

if len(numerical_cols_final) > 0:
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    
    # Create copies for different scaling methods
    df_standardized = df_engineered.copy()
    df_normalized = df_engineered.copy()
    
    # Standardization (Z-score normalization)
    print("Applying standardization (Z-score normalization)...")
    scaler_standard = StandardScaler()
    df_standardized[numerical_cols_final] = scaler_standard.fit_transform(df_standardized[numerical_cols_final])
    
    # Normalization (Min-Max scaling)
    print("Applying normalization (Min-Max scaling)...")
    scaler_minmax = MinMaxScaler()
    df_normalized[numerical_cols_final] = scaler_minmax.fit_transform(df_normalized[numerical_cols_final])
    
    print("\nScaling completed successfully!")
    print(f"Standardized data shape: {df_standardized.shape}")
    print(f"Normalized data shape: {df_normalized.shape}")
else:
    print("No numerical columns found for scaling.")
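
The cell above fits both scalers on the full dataset. If the data is later split into training and test sets, the scaling statistics will already include information from the test rows. A common way to avoid that is to split first and fit the scaler on the training portion only, as in the sketch below. The `toy` frame, the 80/20 split, and the random seeds are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features standing in for df_engineered.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "f1": rng.normal(10, 2, 100),
    "f2": rng.normal(0, 5, 100),
})

# Split first, then learn the mean and standard deviation from the training rows only.
train, test = train_test_split(toy, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(train)

# Apply the same fitted transformation to both splits.
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled.shape, test_scaled.shape)
```

The same pattern applies to `MinMaxScaler`. Any target column should also be excluded from the features before scaling.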

## Step 4: Data Preparation for Modeling

Prepare the engineered datasets for machine learning modeling.

In [None]:
# Summary of all datasets
print("=== DATA ANALYSIS AND FEATURE ENGINEERING SUMMARY ===\n")

print(f"Original dataset shape: {df.shape}")
print(f"Engineered dataset shape: {df_engineered.shape}")
if 'df_standardized' in locals():
    print(f"Standardized dataset shape: {df_standardized.shape}")
if 'df_normalized' in locals():
    print(f"Normalized dataset shape: {df_normalized.shape}")

print(f"\nNew features created: {new_features_created}")

# Display final column types
print("\nFinal dataset column types:")
column_types = df_engineered.dtypes.value_counts()
for dtype, count in column_types.items():
    print(f"  {dtype}: {count} columns")

# Check for any remaining issues
remaining_missing = df_engineered.isnull().sum().sum()
if remaining_missing == 0:
    print("\n✅ No missing values remain in the dataset")
else:
    print(f"\n⚠️  {remaining_missing} missing values still remain")

print("\n=== DATASETS READY FOR MODELING ===")
print("Available datasets:")
print("1. df_engineered - Engineered dataset (unscaled)")
print("2. df_standardized - Standardized (Z-score normalized) dataset")
print("3. df_normalized - Normalized (Min-Max scaled) dataset")

In [None]:
# Export processed datasets (optional)
print("\nExporting processed datasets...")

try:
    df_engineered.to_csv('engineered_data.csv', index=False)
    print("✅ Engineered dataset exported to 'engineered_data.csv'")
    
    if 'df_standardized' in locals():
        df_standardized.to_csv('standardized_data.csv', index=False)
        print("✅ Standardized dataset exported to 'standardized_data.csv'")
    
    if 'df_normalized' in locals():
        df_normalized.to_csv('normalized_data.csv', index=False)
        print("✅ Normalized dataset exported to 'normalized_data.csv'")
        
    print("\nAll datasets have been successfully exported!")
except Exception as e:
    print(f"\n⚠️  Error exporting datasets: {e}")

In [None]:
# Final recommendations for modeling
print("\n=== MODELING RECOMMENDATIONS ===")
print("Based on the analysis, here are recommendations for next steps:")
print("\n1. Choose appropriate dataset based on your ML algorithm:")
print("   - Use standardized_data.csv for algorithms sensitive to feature scales (SVM, KNN, PCA)")
print("   - Use normalized_data.csv for neural networks and algorithms requiring bounded inputs")
print("   - Use engineered_data.csv for tree-based algorithms (Random Forest, XGBoost)")

print("\n2. Consider feature selection techniques:")
print("   - Check correlation matrix for multicollinearity")
print("   - Use feature importance from tree-based models")
print("   - Apply dimensionality reduction (PCA) if needed")

print("\n3. Next modeling steps:")
print("   - Split data into training and testing sets")
print("   - Train baseline models")
print("   - Evaluate and tune model performance")
print("   - Consider ensemble methods for improved accuracy")

print("\n🎉 Data analysis and feature engineering pipeline completed successfully!")