# Linear Regression - Practical Exercises

This notebook contains practical exercises for understanding and applying Linear Regression.

## 📚 Learning Objectives
After completing these exercises, you will be able to:
1. Implement Linear Regression from scratch
2. Use Linear Regression with scikit-learn
3. Perform exploratory data analysis (EDA)
4. Evaluate regression models
5. Understand the assumptions of Linear Regression

## 🛠️ Setup and Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import make_regression
import warnings
warnings.filterwarnings('ignore')

# Set the plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")

## 📊 Exercise 1: Simple Linear Regression from Scratch

**Task**: Implement Simple Linear Regression using the Normal Equation

In [None]:
# TODO: Implement the SimpleLinearRegression class
class SimpleLinearRegression:
    def __init__(self):
        self.slope = None
        self.intercept = None
    
    def fit(self, X, y):
        """
        TODO: Implement fitting using the normal equation
        Formula: slope = Σ((xi - x̄)(yi - ȳ)) / Σ((xi - x̄)²)
                intercept = ȳ - slope * x̄
        """
        # Your code here
        pass
    
    def predict(self, X):
        """
        TODO: Implement prediction
        Formula: y = slope * X + intercept
        """
        # Your code here
        pass
    
    def score(self, X, y):
        """
        TODO: Compute the R² score
        Formula: R² = 1 - (SS_res / SS_tot)
        """
        # Your code here
        pass

### Test Implementation

In [None]:
# Generate test data
np.random.seed(42)
X_simple = np.linspace(0, 10, 50)
y_simple = 2 * X_simple + 3 + np.random.normal(0, 2, 50)

# Test your implementation
model = SimpleLinearRegression()
model.fit(X_simple, y_simple)

print(f"Slope: {model.slope:.4f}")
print(f"Intercept: {model.intercept:.4f}")
print(f"R² Score: {model.score(X_simple, y_simple):.4f}")

# Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X_simple, y_simple, alpha=0.7, label='Data points')
y_pred = model.predict(X_simple)
plt.plot(X_simple, y_pred, 'r-', linewidth=2, label='Regression Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression - Your Implementation')
plt.legend()
plt.grid(True)
plt.show()
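
To check your own class against a known answer, here is a minimal sketch of the closed-form formulas from the docstrings above. The helper names `fit_line` and `r_squared` and the `_demo` variables are illustrative and not part of the exercise template. `np.polyfit` appears only as an independent cross-check.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit of y ≈ slope * x + intercept (illustrative helper)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Same recipe as the test cell: y = 2x + 3 plus Gaussian noise
np.random.seed(42)
x_demo = np.linspace(0, 10, 50)
y_demo = 2 * x_demo + 3 + np.random.normal(0, 2, 50)

slope_demo, intercept_demo = fit_line(x_demo, y_demo)
ref_slope, ref_intercept = np.polyfit(x_demo, y_demo, deg=1)  # independent cross-check

print(f"closed form: slope={slope_demo:.4f}, intercept={intercept_demo:.4f}")
print(f"np.polyfit : slope={ref_slope:.4f}, intercept={ref_intercept:.4f}")
print(f"R²: {r_squared(y_demo, slope_demo * x_demo + intercept_demo):.4f}")
```

If your `SimpleLinearRegression` is correct, its slope, intercept, and R² should agree with these values to several decimal places.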

## 📊 Exercise 2: Multiple Linear Regression with a Real Dataset

**Task**: Use the California Housing dataset to predict house prices

In [None]:
# Load California Housing dataset
from sklearn.datasets import fetch_california_housing

# TODO: Load and explore the dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Create DataFrame for easier handling
df = pd.DataFrame(X, columns=housing.feature_names)
df['price'] = y

print("Dataset Info:")
print(f"Shape: {df.shape}")
print(f"\nFeatures: {housing.feature_names}")
print("\nTarget: median house value per district, in units of $100,000")

# TODO: Preview the first rows
df.head()

### Exploratory Data Analysis

In [None]:
# TODO: Perform EDA
# 1. Check basic statistics
print("Basic Statistics:")
print(df.describe())  # a bare expression is only displayed when it is the last line of a cell

# 2. Check for missing values
print(f"\nMissing values: {df.isnull().sum().sum()}")

# 3. Visualize target distribution
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.hist(y, bins=50, alpha=0.7)
plt.title('Distribution of House Prices')
plt.xlabel('Price (in 100k$)')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.boxplot(y)
plt.title('Box Plot of House Prices')
plt.ylabel('Price (in 100k$)')

plt.tight_layout()
plt.show()

In [None]:
# TODO: Correlation analysis
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Print absolute correlations with the target, strongest first
print("Absolute correlation with house prices:")
price_corr = correlation_matrix['price'].abs().sort_values(ascending=False)
for feature, corr in price_corr.items():
    if feature != 'price':
        print(f"{feature:15}: {corr:.4f}")

### Model Training and Evaluation

In [None]:
# TODO: Split the data and train the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make predictions
y_train_pred = lr_model.predict(X_train)
y_test_pred = lr_model.predict(X_test)

# Evaluate model
print("Model Performance:")
print(f"Training R²: {r2_score(y_train, y_train_pred):.4f}")
print(f"Test R²: {r2_score(y_test, y_test_pred):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred)):.4f}")
print(f"Test MAE: {mean_absolute_error(y_test, y_test_pred):.4f}")
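
To make the printed metrics less of a black box, the sketch below recomputes RMSE, MAE, and R² with plain NumPy on four made-up values. The numbers are purely illustrative and are not taken from the housing data.

```python
import numpy as np

# Made-up targets and predictions, chosen so the arithmetic is easy to follow
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_hat_demo = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true_demo - y_hat_demo              # [0.5, 0.0, -1.0, -1.0]
rmse_demo = np.sqrt(np.mean(errors ** 2))      # sqrt(2.25 / 4)
mae_demo = np.mean(np.abs(errors))             # 2.5 / 4

ss_res = np.sum(errors ** 2)                               # 2.25
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # 14.75
r2_demo = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_demo:.4f}")  # 0.7500
print(f"MAE : {mae_demo:.4f}")   # 0.6250
print(f"R²  : {r2_demo:.4f}")    # 0.8475
```

RMSE and MAE are in the same units as the target (here, $100,000), whereas R² is unitless and compares the model against always predicting the mean.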

In [None]:
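# TODO: Feature importance analysis
# Note: the features are not standardized here, so coefficient magnitudes depend on
# each feature's units and are not directly comparable across features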
feature_importance = pd.DataFrame({
    'feature': housing.feature_names,
    'coefficient': lr_model.coef_,
    'abs_coefficient': np.abs(lr_model.coef_)
}).sort_values('abs_coefficient', ascending=False)

print("Feature Importance (based on coefficients):")
print(feature_importance)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['coefficient'])
plt.xlabel('Coefficient Value')
plt.title('Feature Coefficients in Linear Regression')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
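
Ranking features by raw coefficient size can mislead when features use different units. The sketch below illustrates this on synthetic data with two hypothetical features (`income_demo` and `population_demo` are made-up names and values, not the housing columns). It fits ordinary least squares with NumPy instead of scikit-learn so the effect of standardizing is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo = 500
# Two made-up features on very different scales
income_demo = rng.normal(4.0, 2.0, n_demo)
population_demo = rng.normal(1500.0, 1000.0, n_demo)
target_demo = 0.4 * income_demo + 0.002 * population_demo + rng.normal(0, 0.1, n_demo)

def ols_coefficients(X, y):
    """Least-squares slopes, fitted with an intercept column that is then dropped."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

X_raw = np.column_stack([income_demo, population_demo])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

coef_raw = ols_coefficients(X_raw, target_demo)
coef_std = ols_coefficients(X_std, target_demo)

for name, raw, std in zip(["income", "population"], coef_raw, coef_std):
    print(f"{name:12} raw={raw:9.4f}  standardized={std:7.4f}")
```

On the raw scale the income coefficient looks far larger, yet per standard deviation the population feature moves the target more. Each standardized coefficient equals the raw coefficient multiplied by that feature's standard deviation.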

### Model Diagnostics

In [None]:
# TODO: Create diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Predictions vs Actual
axes[0, 0].scatter(y_test, y_test_pred, alpha=0.6)
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual Values')
axes[0, 0].set_ylabel('Predicted Values')
axes[0, 0].set_title('Predictions vs Actual Values')
axes[0, 0].grid(True)

# 2. Residuals vs Predicted
residuals = y_test - y_test_pred
axes[0, 1].scatter(y_test_pred, residuals, alpha=0.6)
axes[0, 1].axhline(y=0, color='r', linestyle='--')
axes[0, 1].set_xlabel('Predicted Values')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('Residuals vs Predicted Values')
axes[0, 1].grid(True)

# 3. Residuals Distribution
axes[1, 0].hist(residuals, bins=30, alpha=0.7)
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].grid(True)

# 4. Q-Q plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot of Residuals')
axes[1, 1].grid(True)

plt.tight_layout()
plt.show()
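
When reading the residual plots, it helps to know what least squares guarantees and what it does not. The sketch below uses synthetic data whose noise deliberately grows with `x`, an assumption made only for illustration. On the training data the residuals always average to zero and are uncorrelated with the fitted values, so what the plots can reveal is a change in spread or a curved pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 10, 2000)
noise_scale = 0.2 + 0.3 * x_demo  # heteroscedastic by construction
y_demo = 1.0 + 2.0 * x_demo + rng.normal(0.0, noise_scale)

slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, deg=1)
fitted = slope_demo * x_demo + intercept_demo
resid = y_demo - fitted

# Guaranteed by least squares with an intercept (training data only)
print(f"mean residual         : {resid.mean():.2e}")
print(f"corr(residual, fitted): {np.corrcoef(resid, fitted)[0, 1]:.2e}")

# Not guaranteed: constant spread. Compare the two halves of the fitted range.
low = fitted <= np.median(fitted)
print(f"residual std, low half : {resid[low].std():.3f}")
print(f"residual std, high half: {resid[~low].std():.3f}")
```

A funnel shape in the residuals-vs-predicted plot corresponds to the unequal spreads printed here and suggests the constant-variance assumption does not hold.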

## 📊 Exercise 3: Cross Validation and Model Comparison

**Task**: Compare model performance with and without feature scaling

In [None]:
# TODO: Compare models with and without scaling
from sklearn.pipeline import Pipeline

# Model 1: Without scaling
model_no_scaling = LinearRegression()

# Model 2: With scaling
model_with_scaling = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Perform cross-validation
cv_scores_no_scaling = cross_val_score(model_no_scaling, X, y, cv=5, scoring='r2')
cv_scores_with_scaling = cross_val_score(model_with_scaling, X, y, cv=5, scoring='r2')

print("Cross-Validation Results (5-fold):")
print(f"\nNo Scaling:")
print(f"  Mean R²: {cv_scores_no_scaling.mean():.4f} (+/- {cv_scores_no_scaling.std() * 2:.4f})")
print(f"  Scores: {cv_scores_no_scaling}")

print(f"\nWith Scaling:")
print(f"  Mean R²: {cv_scores_with_scaling.mean():.4f} (+/- {cv_scores_with_scaling.std() * 2:.4f})")
print(f"  Scores: {cv_scores_with_scaling}")

# Visualize CV results
plt.figure(figsize=(10, 6))
plt.boxplot([cv_scores_no_scaling, cv_scores_with_scaling], 
            labels=['No Scaling', 'With Scaling'])
plt.ylabel('R² Score')
plt.title('Cross-Validation Comparison')
plt.grid(True, alpha=0.3)
plt.show()
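
The two sets of cross-validation scores should come out essentially identical, because standardizing the inputs only re-parameterizes an ordinary least-squares fit. Below is a minimal NumPy sketch of that argument on synthetic data (the data and helper names are assumptions for illustration). The scaling statistics are taken from the training rows only, which mirrors what the `Pipeline` does inside each fold.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([1.0, 50.0, 1000.0])  # features on very different scales
X_demo = rng.normal(size=(250, 3)) * scales
y_demo = X_demo @ np.array([2.0, -0.1, 0.003]) + 5.0 + rng.normal(0, 0.5, 250)
X_tr, X_te, y_tr = X_demo[:200], X_demo[200:], y_demo[:200]

def ols_fit(X, y):
    """Return [intercept, slopes...] from a least-squares fit."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def ols_predict(beta, X):
    return beta[0] + X @ beta[1:]

mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)  # fitted on training rows only
beta_raw = ols_fit(X_tr, y_tr)
beta_scaled = ols_fit((X_tr - mu) / sigma, y_tr)

pred_raw = ols_predict(beta_raw, X_te)
pred_scaled = ols_predict(beta_scaled, (X_te - mu) / sigma)

print("max prediction difference:", np.max(np.abs(pred_raw - pred_scaled)))
print("raw slopes   :", np.round(beta_raw[1:], 4))
print("scaled slopes:", np.round(beta_scaled[1:], 4))
```

The predictions match up to floating-point error while the slopes differ. This invariance is specific to unregularized least squares. It does not carry over to Ridge, Lasso, or ElasticNet, whose penalties depend on the coefficient scale.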

## 📊 Exercise 4: Advanced - Polynomial Regression

**Task**: Explore Polynomial Regression for non-linear data

In [None]:
# TODO: Create synthetic non-linear data
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
X_poly = np.linspace(-2, 2, 100).reshape(-1, 1)
y_poly = 0.5 * X_poly.ravel() ** 3 + X_poly.ravel() ** 2 - 2 * X_poly.ravel() + np.random.normal(0, 0.5, 100)

# Split data
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(
    X_poly, y_poly, test_size=0.2, random_state=42
)

# Try different polynomial degrees
degrees = [1, 2, 3, 4, 5]
results = []

plt.figure(figsize=(15, 12))

for i, degree in enumerate(degrees):
    # Create polynomial pipeline
    poly_model = Pipeline([
        # include_bias=False: LinearRegression already fits the intercept
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('regressor', LinearRegression())
    ])
    
    # Fit model
    poly_model.fit(X_train_poly, y_train_poly)
    
    # Make predictions
    y_train_pred_poly = poly_model.predict(X_train_poly)
    y_test_pred_poly = poly_model.predict(X_test_poly)
    
    # Calculate scores
    train_r2 = r2_score(y_train_poly, y_train_pred_poly)
    test_r2 = r2_score(y_test_poly, y_test_pred_poly)
    
    results.append({
        'degree': degree,
        'train_r2': train_r2,
        'test_r2': test_r2
    })
    
    # Plot
    plt.subplot(3, 2, i + 1)
    plt.scatter(X_train_poly, y_train_poly, alpha=0.6, label='Training data')
    plt.scatter(X_test_poly, y_test_poly, alpha=0.6, color='red', label='Test data')
    
    X_plot = np.linspace(-2, 2, 100).reshape(-1, 1)
    y_plot = poly_model.predict(X_plot)
    plt.plot(X_plot, y_plot, 'g-', linewidth=2, label=f'Degree {degree}')
    
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title(f'Polynomial Degree {degree}\nTrain R²: {train_r2:.3f}, Test R²: {test_r2:.3f}')
    plt.legend()
    plt.grid(True)

plt.tight_layout()
plt.show()

# Display results table
results_df = pd.DataFrame(results)
print("\nPolynomial Regression Results:")
print(results_df)
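
Polynomial regression is still linear regression, just on expanded features. As a rough sketch of what the pipeline computes for a single input column, the code below builds the columns 1, x, x², x³ by hand with `np.vander` and solves the least-squares problem directly. It uses NumPy instead of `PolynomialFeatures`, and the `_demo` names are illustrative.

```python
import numpy as np

# Same recipe as the synthetic data above: 0.5x³ + x² - 2x plus noise
np.random.seed(42)
x_demo = np.linspace(-2, 2, 100)
y_demo = 0.5 * x_demo**3 + x_demo**2 - 2 * x_demo + np.random.normal(0, 0.5, 100)

degree_demo = 3
# Design matrix with columns 1, x, x², x³ (increasing=True puts the constant first)
design = np.vander(x_demo, N=degree_demo + 1, increasing=True)
coef_demo, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

print("estimated [c0, c1, c2, c3]:", np.round(coef_demo, 2))
print("true      [c0, c1, c2, c3]: [0, -2, 1, 0.5]")
```

The estimates should land near the true coefficients, with the gap driven by the noise. The model is non-linear in `x` but linear in the coefficients, which is why the same solver applies.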

### Bias-Variance Tradeoff Analysis

In [None]:
# TODO: Visualize bias-variance tradeoff
plt.figure(figsize=(10, 6))
plt.plot(results_df['degree'], results_df['train_r2'], 'o-', label='Training R²', linewidth=2)
plt.plot(results_df['degree'], results_df['test_r2'], 'o-', label='Test R²', linewidth=2)
plt.xlabel('Polynomial Degree')
plt.ylabel('R² Score')
plt.title('Bias-Variance Tradeoff in Polynomial Regression')
plt.legend()
plt.grid(True)
plt.xticks(degrees)
plt.show()

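# Find the degree with the highest test R²
# (selecting on the test set makes that score optimistic; a validation set
# or cross-validation is a safer basis for the choice)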
optimal_degree = results_df.loc[results_df['test_r2'].idxmax(), 'degree']
print(f"\nOptimal polynomial degree based on test R²: {optimal_degree}")
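
Picking the degree by test R² reuses the test set for model selection. One common alternative is to choose the degree by k-fold cross-validation. The sketch below does this with plain NumPy on the same kind of synthetic data. The helper `cv_r2`, the shuffled 5-fold split, and the use of `np.polyfit` in place of the scikit-learn pipeline are all assumptions made for illustration.

```python
import numpy as np

np.random.seed(42)
x_demo = np.linspace(-2, 2, 100)
y_demo = 0.5 * x_demo**3 + x_demo**2 - 2 * x_demo + np.random.normal(0, 0.5, 100)

def cv_r2(x, y, degree, n_splits=5, seed=0):
    """Mean validation R² over shuffled k-fold splits for one polynomial degree."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, n_splits)
    scores = []
    for k in range(n_splits):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        coefs = np.polyfit(x[train], y[train], degree)  # fit on the training folds
        pred = np.polyval(coefs, x[val])                # score on the held-out fold
        ss_res = np.sum((y[val] - pred) ** 2)
        ss_tot = np.sum((y[val] - y[val].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))

cv_scores_demo = {d: cv_r2(x_demo, y_demo, d) for d in [1, 2, 3, 4, 5]}
best_degree_demo = max(cv_scores_demo, key=cv_scores_demo.get)

for d, score in cv_scores_demo.items():
    print(f"degree {d}: mean CV R² = {score:.3f}")
print(f"degree chosen by cross-validation: {best_degree_demo}")
```

With the degree fixed this way, the held-out test set can be used once at the end to report an unbiased score.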

## 🎯 Bonus Task (Challenge)

**Task**: Implement and compare Linear Regression with regularization

In [None]:
# TODO: Compare Linear Regression variants
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Use California housing data
models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=1.0)': Ridge(alpha=1.0),
    'Lasso (α=1.0)': Lasso(alpha=1.0),
    'ElasticNet (α=1.0)': ElasticNet(alpha=1.0, random_state=42)
}

results_comparison = []

for name, model in models.items():
    # Perform cross-validation
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    
    results_comparison.append({
        'Model': name,
        'Mean_CV_R2': cv_scores.mean(),
        'Std_CV_R2': cv_scores.std()
    })

# Display results
comparison_df = pd.DataFrame(results_comparison)
print("Model Comparison (5-fold CV):")
print(comparison_df)

# Visualize comparison
plt.figure(figsize=(10, 6))
plt.bar(comparison_df['Model'], comparison_df['Mean_CV_R2'], 
        yerr=comparison_df['Std_CV_R2'], capsize=5, alpha=0.7)
plt.ylabel('R² Score')
plt.title('Model Comparison - Cross Validation Performance')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
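
To see what the Ridge penalty does to the coefficients, here is a minimal sketch of the closed-form ridge solution on synthetic, centered data. The data and the `ridge_coef` helper are illustrative assumptions. The sketch minimizes ‖y − Xw‖² + α‖w‖², which is the objective scikit-learn documents for `Ridge`, but it is not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1.0, 100)

# Center both sides so the intercept stays out of the penalty
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()

def ridge_coef(Xc, yc, alpha):
    """Closed-form ridge solution (XᵀX + αI)⁻¹ Xᵀy on centered data."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

alphas_demo = [0.0, 1.0, 100.0, 10000.0]
norms_demo = []
for alpha in alphas_demo:
    w = ridge_coef(Xc, yc, alpha)
    norms_demo.append(np.linalg.norm(w))
    print(f"alpha={alpha:>8}: coef={np.round(w, 3)}, norm={norms_demo[-1]:.3f}")
```

With α = 0 this reduces to ordinary least squares, and the coefficient norm shrinks toward zero as α grows. Because the penalty acts on the coefficients themselves, regularized models are sensitive to feature scale, so they are usually paired with a `StandardScaler` step.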

## 📝 Summary and Conclusions

### Key Learnings:
1. **Linear Regression** suits data with a linear relationship between the features and the target
2. **Feature scaling** does not change the predictions or R² of ordinary Linear Regression, but it makes the coefficients comparable and easier to interpret
3. **Polynomial features** can capture non-linear relationships but are prone to overfitting
4. **Regularization** (Ridge, Lasso) helps reduce overfitting
5. **Cross-validation** is important for robust model evaluation

### Next Steps:
- Study **Logistic Regression** for classification problems
- Explore **Regularization techniques** in more depth
- Practice with other **real-world datasets**

---
**🎉 Congratulations!** You have completed the Linear Regression exercises.

Continue to the next algorithm: **Logistic Regression**