# Assignment 2: Life Satisfaction Prediction 🌍😊

## 📚 Learning Objectives
- Load and visualize real-world data.
- Train and compare **Linear Regression** and **K-Nearest Neighbors (KNN)** models.
- Interpret model outputs and discuss generalization.

## Part 1: Data Exploration and Visualization (20 marks)

### Q1 (5 marks)
Load the dataset `lifesat.csv` and display the first 5 rows.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the local file; if it is missing, fall back to a small sample dataset for demonstration
try:
    df = pd.read_csv('lifesat.csv')
except FileNotFoundError:
    print("⚠️ 'lifesat.csv' not found. Creating a sample dataset for demonstration purposes.")
    data = {
        'Country': ['Russia', 'Turkey', 'Hungary', 'Poland', 'Slovakia', 'Estonia', 'Greece', 'Portugal', 'Slovenia', 'Spain', 'Korea', 'Italy', 'Israel', 'New Zealand', 'France', 'Belgium', 'Germany', 'Finland', 'Canada', 'Netherlands', 'Austria', 'United Kingdom', 'Sweden', 'Iceland', 'Australia', 'Ireland', 'Denmark', 'United States', 'Switzerland', 'Norway', 'Luxembourg'],
        'GDP per capita': [9054.914, 9437.372, 12239.894, 12495.334, 15991.736, 17288.083, 18064.288, 19121.592, 20732.482, 25864.721, 27195.197, 29866.581, 35343.336, 37044.891, 37675.006, 40106.632, 40996.511, 41973.988, 43331.961, 43603.115, 43724.031, 43770.688, 51109.804, 50854.583, 50961.865, 51350.744, 52114.165, 55805.204, 80675.308, 74822.106, 101994.093],
        'Life satisfaction': [6.0, 5.6, 4.9, 5.8, 6.1, 5.6, 4.8, 5.1, 5.7, 6.5, 5.8, 6.0, 7.4, 7.3, 6.5, 6.9, 7.0, 7.4, 7.3, 7.3, 6.9, 6.8, 7.2, 7.5, 7.3, 7.0, 7.5, 7.2, 7.5, 7.4, 6.9]
    }
    df = pd.DataFrame(data)

df.head()

### Q2 (5 marks)
Print the dataset's basic information and summary statistics.

In [None]:
print("--- Info ---")
df.info()

print("\n--- Summary Statistics ---")
df.describe()

### Q3 (10 marks)
Plot GDP per capita vs. Life Satisfaction. Add appropriate labels and discuss the observed relationship.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='GDP per capita', y='Life satisfaction', data=df, s=100, color='teal')
plt.title('GDP per Capita vs. Life Satisfaction', fontsize=16)
plt.xlabel('GDP per Capita (USD)', fontsize=12)
plt.ylabel('Life Satisfaction', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

**Observation:**
There appears to be a **positive relationship** between GDP per capita and life satisfaction: countries with higher GDP per capita generally report higher life satisfaction. The trend looks roughly linear over most of the range, but it may flatten at very high income levels and a few countries sit well off the trend, so a straight line is only an approximation.
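
One way to put a number on this observation is the Pearson correlation coefficient. The sketch below is illustrative only: it uses eight rows copied from the fallback sample in Q1 rather than the full `lifesat.csv`, and the names `sample`, `r_linear`, and `r_log` are made up for this example. It also computes the correlation against log GDP, as a rough check on whether the relationship flattens at high incomes.

```python
import numpy as np
import pandas as pd

# Eight rows copied from the fallback sample in Q1 (an assumption: not the full lifesat.csv)
sample = pd.DataFrame({
    "GDP per capita": [9054.914, 12495.334, 19121.592, 27195.197,
                       37675.006, 43724.031, 55805.204, 101994.093],
    "Life satisfaction": [6.0, 5.8, 5.1, 5.8, 6.5, 6.9, 7.2, 6.9],
})

# Pearson r measures the strength of the *linear* association (between -1 and 1)
r_linear = sample["GDP per capita"].corr(sample["Life satisfaction"])

# Repeating the calculation on log(GDP) hints at whether a curved trend fits better
r_log = np.log(sample["GDP per capita"]).corr(sample["Life satisfaction"])

print(f"Pearson r (GDP):      {r_linear:.3f}")
print(f"Pearson r (log GDP):  {r_log:.3f}")
```

A value of `r_log` noticeably larger than `r_linear` would suggest diminishing returns at high GDP, but with so few rows any such comparison is only indicative.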

## Part 2: Linear Regression Model (30 marks)

### Q4 (5 marks)
Extract the feature matrix `X` and target vector `y`. Print their shapes.

In [None]:
X = df[['GDP per capita']].values
y = df['Life satisfaction'].values

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
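
The double brackets in `df[['GDP per capita']]` matter because scikit-learn estimators expect a two-dimensional feature matrix of shape `(n_samples, n_features)`. The following minimal sketch, using a hypothetical three-row frame named `toy`, shows the difference between the two selections and how to reshape a one-dimensional array when needed.

```python
import pandas as pd

# Hypothetical three-row frame, with values taken from the fallback sample in Q1
toy = pd.DataFrame({
    "GDP per capita": [9054.914, 27195.197, 55805.204],
    "Life satisfaction": [6.0, 5.8, 7.2],
})

X_2d = toy[["GDP per capita"]].values  # list of columns -> 2-D array
x_1d = toy["GDP per capita"].values    # single column   -> 1-D array

print(X_2d.shape)  # (3, 1)
print(x_1d.shape)  # (3,)

# A 1-D array can be turned into a single-feature matrix with reshape
X_reshaped = x_1d.reshape(-1, 1)
print(X_reshaped.shape)  # (3, 1)
```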

### Q5 (10 marks)
Train a Linear Regression model. Display the model's coefficient and intercept.

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(f"Intercept (theta_0): {lin_reg.intercept_:.4f}")
print(f"Coefficient (theta_1): {lin_reg.coef_[0]:.8f}")

### Q6 (10 marks)
Plot the regression line over the scatter plot of the data.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='GDP per capita', y='Life satisfaction', data=df, s=100, color='teal', label='Data')

# Plot regression line
X_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
y_pred_line = lin_reg.predict(X_range)
plt.plot(X_range, y_pred_line, color='red', linewidth=2, label='Linear Regression')

plt.title('Linear Regression: GDP vs Life Satisfaction', fontsize=16)
plt.xlabel('GDP per Capita (USD)')
plt.ylabel('Life Satisfaction')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

### Q7 (5 marks)
Predict Life Satisfaction for a GDP of $37,655.2. Comment on the result.

In [None]:
X_new = [[37655.2]]
pred_lin = lin_reg.predict(X_new)[0]

print(f"Predicted Life Satisfaction for GDP $37,655.2: {pred_lin:.2f}")

**Comment:**
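The prediction is the point on the fitted line at a GDP per capita of $37,655.2, that is, `intercept + coefficient * 37655.2`. It can be read as the expected life satisfaction for a country with that GDP, provided the linear relationship holds; it says nothing about how far an individual country may deviate from the line.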
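
To make the link between the fitted parameters and the prediction concrete, the sketch below recomputes the prediction by hand. It fits on eight rows copied from the fallback sample in Q1 (an assumption, so the numbers will differ from a fit on the full dataset), and the names `X_demo`, `y_demo`, and `demo_reg` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Eight rows copied from the fallback sample in Q1 (not the full lifesat.csv)
X_demo = np.array([[9054.914], [12495.334], [19121.592], [27195.197],
                   [37675.006], [43724.031], [55805.204], [101994.093]])
y_demo = np.array([6.0, 5.8, 5.1, 5.8, 6.5, 6.9, 7.2, 6.9])

demo_reg = LinearRegression().fit(X_demo, y_demo)

gdp = 37655.2
# Prediction from the line equation: theta_0 + theta_1 * x
manual = demo_reg.intercept_ + demo_reg.coef_[0] * gdp
# Prediction from the estimator's own predict method
from_model = demo_reg.predict([[gdp]])[0]

print(f"Manual:  {manual:.4f}")
print(f"predict: {from_model:.4f}")
```

The two printed values should agree up to floating-point rounding, which confirms that `predict` does nothing more than evaluate the fitted line.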

## Part 3: K-Nearest Neighbors Regression (25 marks)

### Q8 (5 marks)
Train a `KNeighborsRegressor` with `n_neighbors=3`.

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X, y)

### Q9 (10 marks)
Predict Life Satisfaction for a GDP of $37,655.2 using the KNN model and compare it with the Linear Regression prediction.

In [None]:
pred_knn = knn_reg.predict(X_new)[0]

print(f"Linear Regression Prediction: {pred_lin:.2f}")
print(f"KNN (k=3) Prediction:       {pred_knn:.2f}")
print(f"Difference:                 {abs(pred_lin - pred_knn):.2f}")
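
The two predictions differ because KNN ignores the global trend: with the default `weights='uniform'`, `KNeighborsRegressor` returns the plain average of the targets of the `k` training points closest to the query. The sketch below checks this by hand using `kneighbors`. It fits on eight rows copied from the fallback sample in Q1 (an assumption, so the neighbors are not the same as with the full dataset), and the names `X_demo`, `y_demo`, and `demo_knn` are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Eight rows copied from the fallback sample in Q1 (not the full lifesat.csv)
X_demo = np.array([[9054.914], [12495.334], [19121.592], [27195.197],
                   [37675.006], [43724.031], [55805.204], [101994.093]])
y_demo = np.array([6.0, 5.8, 5.1, 5.8, 6.5, 6.9, 7.2, 6.9])

demo_knn = KNeighborsRegressor(n_neighbors=3).fit(X_demo, y_demo)

query = [[37655.2]]
# kneighbors returns distances and row indices of the k closest training points,
# ordered from nearest to farthest
distances, indices = demo_knn.kneighbors(query)
neighbor_gdp = X_demo[indices[0], 0]
neighbor_targets = y_demo[indices[0]]

# With uniform weights the prediction is the mean of the neighbors' targets
manual = neighbor_targets.mean()
from_model = demo_knn.predict(query)[0]

print("Neighbor GDPs:   ", neighbor_gdp)
print("Neighbor targets:", neighbor_targets)
print(f"Mean of targets: {manual:.4f}")
print(f"predict:         {from_model:.4f}")
```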

### Q10 (10 marks)
Evaluate the KNN model with `n_neighbors` values of [1, 3, 5, 10]. Plot the fitted prediction curves to compare how the models behave.

In [None]:
k_values = [1, 3, 5, 10]
predictions = []

plt.figure(figsize=(12, 8))
sns.scatterplot(x='GDP per capita', y='Life satisfaction', data=df, s=100, color='teal', label='Data', alpha=0.6)

X_range = np.linspace(X.min(), X.max(), 500).reshape(-1, 1)

colors = ['orange', 'green', 'purple', 'brown']

for k, color in zip(k_values, colors):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X, y)
    y_pred = model.predict(X_range)
    plt.plot(X_range, y_pred, label=f'k={k}', linewidth=2, color=color)
    
    # Predict for the query point X_new (GDP of $37,655.2)
    pred = model.predict(X_new)[0]
    predictions.append(pred)
    print(f"Prediction for k={k}: {pred:.2f}")

plt.title('KNN Regression with Different k Values', fontsize=16)
plt.xlabel('GDP per Capita (USD)')
plt.ylabel('Life Satisfaction')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
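
The curves above show how each model behaves on the training data, but they do not measure how well each value of `k` generalizes. One possible way to attach a number to that comparison is cross-validation. The sketch below is not part of the assignment: it assumes leave-one-out cross-validation with mean absolute error as the metric, runs on sixteen rows copied from the fallback sample in Q1, and uses illustrative names (`X_cv`, `y_cv`, `mae_by_k`).

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Every second row of the fallback sample in Q1 (an assumption: not the full lifesat.csv)
X_cv = np.array([9054.914, 12239.894, 15991.736, 18064.288, 20732.482, 27195.197,
                 35343.336, 37675.006, 40996.511, 43331.961, 43724.031, 51109.804,
                 50961.865, 52114.165, 80675.308, 101994.093]).reshape(-1, 1)
y_cv = np.array([6.0, 4.9, 6.1, 4.8, 5.7, 5.8, 7.4, 6.5,
                 7.0, 7.3, 6.9, 7.2, 7.3, 7.5, 7.5, 6.9])

mae_by_k = {}
for k in [1, 3, 5, 10]:
    # Each fold holds out one country and trains on the remaining 15
    scores = cross_val_score(
        KNeighborsRegressor(n_neighbors=k),
        X_cv, y_cv,
        cv=LeaveOneOut(),
        scoring="neg_mean_absolute_error",
    )
    # scikit-learn reports negated errors, so flip the sign back
    mae_by_k[k] = -scores.mean()

for k, mae in mae_by_k.items():
    print(f"k={k:>2}: leave-one-out MAE = {mae:.3f}")
```

A small `k` tends to follow individual points closely and a large `k` tends to smooth toward the overall mean, so the held-out error is usually lowest somewhere in between. With a dataset this small, the ranking can easily change when rows are added or removed.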