# Linear Regression Demo: Training Data Size vs Training Time

This notebook explains **Linear Regression using a real-world tech example**.

**Objective:** Predict training time (in minutes) from training dataset size (in MB).

Designed for beginners and suitable for Kaggle.

## Step 1: Load Dataset

In [None]:
import pandas as pd
df = pd.read_csv('training_data_vs_training_time.csv')
df.head()

## Step 2: Understand the Data

In [None]:
df.info()

df.describe()

## Step 3: Scatter Plot (Reality Check)

In [None]:
import matplotlib.pyplot as plt
plt.scatter(df['training_data_size_mb'], df['training_time_minutes'])
plt.xlabel('Training Data Size (MB)')
plt.ylabel('Training Time (Minutes)')
plt.title('Training Time vs Training Data Size')
plt.show()

## Step 4: Train Linear Regression Model

In [None]:
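from sklearn.linear_model import LinearRegression
# scikit-learn expects a 2D feature matrix, hence the double brackets
X = df[['training_data_size_mb']]
y = df['training_time_minutes']
model = LinearRegression()
model.fit(X, y)
# Slope: extra minutes of training time per additional MB of data
print('Slope:', model.coef_[0])
# Intercept: predicted training time at 0 MB
print('Intercept:', model.intercept_)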
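
To make the slope and intercept less of a black box, here is a minimal sketch of the closed-form least-squares formulas for a single feature, cross-checked against scikit-learn. The data is synthetic (the 0.02 slope, 5.0 intercept, and noise level are assumptions for illustration, not values from the CSV), and the names `size_mb`, `time_min`, and `model_check` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the dataset (assumed values, not the real CSV)
rng = np.random.default_rng(0)
size_mb = rng.uniform(100, 5000, size=200)
time_min = 0.02 * size_mb + 5.0 + rng.normal(0, 3.0, size=200)

# Closed-form least squares for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = size_mb.mean(), time_min.mean()
slope = np.sum((size_mb - x_mean) * (time_min - y_mean)) / np.sum((size_mb - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check: scikit-learn should arrive at the same line
model_check = LinearRegression().fit(size_mb.reshape(-1, 1), time_min)

print('Slope (by hand):', slope, '| sklearn:', model_check.coef_[0])
print('Intercept (by hand):', intercept, '| sklearn:', model_check.intercept_)
```

The two pairs of numbers should agree up to floating-point error, which shows that `LinearRegression` is solving the same least-squares problem.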

## Step 5: Plot Best Fit Line

In [None]:
y_pred = model.predict(X)
plt.scatter(X, y, label='Actual')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.xlabel('Training Data Size (MB)')
plt.ylabel('Training Time (Minutes)')
plt.legend()
plt.show()

## Step 6: Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
print('Mean Squared Error:', mean_squared_error(y, y_pred))
print('R² Score:', r2_score(y, y_pred))
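
The metrics in Step 6 are computed on the same rows the model was trained on, which tends to give an optimistic picture. A common alternative is to hold out part of the data for evaluation. The sketch below does that on synthetic data (the generating values are assumptions, and `X_demo`, `y_demo`, and `holdout_model` are hypothetical names), and also recomputes MSE and R² by hand to show what the scikit-learn functions measure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (assumed values, not the real CSV)
rng = np.random.default_rng(42)
X_demo = rng.uniform(100, 5000, size=(300, 1))
y_demo = 0.02 * X_demo[:, 0] + 5.0 + rng.normal(0, 3.0, size=300)

# Keep 20% of the rows aside; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
holdout_model = LinearRegression().fit(X_train, y_train)
y_test_pred = holdout_model.predict(X_test)

test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# The same metrics by hand:
#   MSE = mean of squared errors
#   R²  = 1 - (residual sum of squares / total sum of squares)
manual_mse = np.mean((y_test - y_test_pred) ** 2)
manual_r2 = 1 - np.sum((y_test - y_test_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print('Test MSE:', test_mse, '| by hand:', manual_mse)
print('Test R²:', test_r2, '| by hand:', manual_r2)
```

In the notebook itself, the same pattern would apply to `X` and `y` from Step 4 in place of the synthetic arrays.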

## Step 7: Sample Prediction

In [None]:
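# Use a DataFrame with the training column name so scikit-learn does not warn about missing feature names
sample = pd.DataFrame({'training_data_size_mb': [2500]})  # 2500 MB training data
print('Predicted Training Time (minutes):', model.predict(sample)[0])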
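
A prediction from a fitted linear model is just `slope * x + intercept`. The sketch below checks this on synthetic data (assumed values, not the real CSV; `demo_df`, `demo_model`, and `sample_df` are hypothetical names) by comparing `predict` with the hand-computed value for a 2500 MB input.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in with the same column names as the notebook's dataset
rng = np.random.default_rng(1)
demo_df = pd.DataFrame({'training_data_size_mb': rng.uniform(100, 5000, size=100)})
demo_df['training_time_minutes'] = (
    0.02 * demo_df['training_data_size_mb'] + 5.0 + rng.normal(0, 3.0, size=100)
)

demo_model = LinearRegression().fit(
    demo_df[['training_data_size_mb']], demo_df['training_time_minutes']
)

# Predict for 2500 MB, passing a DataFrame so the feature name matches training
sample_df = pd.DataFrame({'training_data_size_mb': [2500]})
predicted = demo_model.predict(sample_df)[0]

# The same prediction written out as the line equation
by_hand = demo_model.coef_[0] * 2500 + demo_model.intercept_

print('predict():', predicted)
print('slope * 2500 + intercept:', by_hand)
```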

## Final Notes
- Linear Regression is a reasonable choice here as long as the scatter plot shows a roughly straight-line trend and the R² score is high
- Noise reflects real-world system variability
- Suitable as a beginner ML project
- Ready to publish on Kaggle
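
One way to look at the noise mentioned above is through the residuals, the differences between actual and predicted values. The sketch below uses synthetic data (assumed values; `resid_x`, `resid_y`, and `resid_model` are hypothetical names). For a least-squares fit with an intercept, the residuals average to zero and are uncorrelated with the feature, so what remains is the scatter the line cannot explain.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the dataset (assumed values, not the real CSV)
rng = np.random.default_rng(7)
resid_x = rng.uniform(100, 5000, size=250)
resid_y = 0.02 * resid_x + 5.0 + rng.normal(0, 3.0, size=250)

resid_model = LinearRegression().fit(resid_x.reshape(-1, 1), resid_y)
residuals = resid_y - resid_model.predict(resid_x.reshape(-1, 1))

# Mean is ~0 by construction; the spread estimates the noise level in minutes
residual_mean = residuals.mean()
residual_std = residuals.std(ddof=2)  # ddof=2: two fitted parameters (slope, intercept)
corr_with_x = np.corrcoef(resid_x, residuals)[0, 1]

print('Residual mean:', residual_mean)
print('Residual std (minutes):', residual_std)
print('Correlation with data size:', corr_with_x)
```

A visible pattern in a residual plot, such as a curve or a widening spread, would suggest that a straight line is missing something.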