{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Employee Salary Prediction Analysis\n",
    "## IBM SkillBuild AI Internship Project\n",
    "\n",
    "**Author:** [Your Name]  \n",
    "**Date:** [Current Date]  \n",
    "**Objective:** Predict employee salaries based on years of experience using Linear Regression\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Import Libraries and Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Set plotting style\n",
    "plt.style.use('default')\n",
    "sns.set_palette(\"husl\")\n",
    "\n",
    "print(\"Libraries imported successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load and Explore the Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset\n",
    "dataset = pd.read_csv('../data/Salary_Data.csv')\n",
    "\n",
    "print(\"Dataset loaded successfully!\")\n",
    "print(f\"Dataset shape: {dataset.shape}\")\n",
    "print(\"\\nColumn names:\")\n",
    "print(dataset.columns.tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display first few rows\n",
    "print(\"First 10 rows of the dataset:\")\n",
    "dataset.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic statistics\n",
    "print(\"Dataset Statistics:\")\n",
    "dataset.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for missing values\n",
    "print(\"Missing values:\")\n",
    "print(dataset.isnull().sum())\n",
    "\n",
    "print(\"\\nData types:\")\n",
    "print(dataset.dtypes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Data Visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create initial visualization\n",
    "plt.figure(figsize=(12, 8))\n",
    "\n",
    "# Subplot 1: Scatter plot\n",
    "plt.subplot(2, 2, 1)\n",
    "plt.scatter(dataset['YearsExperience'], dataset['Salary'], color='blue', alpha=0.7)\n",
    "plt.title('Salary vs Years of Experience')\n",
    "plt.xlabel('Years of Experience')\n",
    "plt.ylabel('Salary ($)')\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "# Subplot 2: Distribution of Experience\n",
    "plt.subplot(2, 2, 2)\n",
    "plt.hist(dataset['YearsExperience'], bins=10, color='green', alpha=0.7, edgecolor='black')\n",
    "plt.title('Distribution of Years of Experience')\n",
    "plt.xlabel('Years of Experience')\n",
    "plt.ylabel('Frequency')\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "# Subplot 3: Distribution of Salary\n",
    "plt.subplot(2, 2, 3)\n",
    "plt.hist(dataset['Salary'], bins=10, color='orange', alpha=0.7, edgecolor='black')\n",
    "plt.title('Distribution of Salary')\n",
    "plt.xlabel('Salary ($)')\n",
    "plt.ylabel('Frequency')\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "# Subplot 4: Box plot of Salary\n",
    "plt.subplot(2, 2, 4)\n",
    "plt.boxplot(dataset['Salary'])\n",
    "plt.title('Salary Box Plot')\n",
    "plt.ylabel('Salary ($)')\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Data Preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare features (X) and target (y)\n",
    "X = dataset.iloc[:, :-1].values  # All columns except the last one\n",
    "y = dataset.iloc[:, -1].values   # Last column only\n",
    "\n",
    "print(f\"Features shape: {X.shape}\")\n",
    "print(f\"Target shape: {y.shape}\")\n",
    "\n",
    "# Display first few samples\n",
    "print(\"\\nFirst 5 feature samples:\")\n",
    "print(X[:5])\n",
    "print(\"\\nFirst 5 target samples:\")\n",
    "print(y[:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the data into training and testing sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "print(f\"Training set size: {len(X_train)} samples\")\n",
    "print(f\"Testing set size: {len(X_test)} samples\")\n",
    "print(f\"Training set percentage: {len(X_train)/len(X)*100:.1f}%\")\n",
    "print(f\"Testing set percentage: {len(X_test)/len(X)*100:.1f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Model Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create and train the Linear Regression model\n",
    "regressor = LinearRegression()\n",
    "regressor.fit(X_train, y_train)\n",
    "\n",
    "print(\"Model training completed!\")\n",
    "print(f\"Model coefficient (slope): {regressor.coef_[0]:.4f}\")\n",
    "print(f\"Model intercept: {regressor.intercept_:.2f}\")\n",
    "print(f\"\\nLinear equation: Salary = {regressor.coef_[0]:.2f} × Experience + {regressor.intercept_:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Model Prediction and Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make predictions on the test set\n",
    "y_pred = regressor.predict(X_test)\n",
    "\n",
    "# Calculate performance metrics\n",
    "mse = mean_squared_error(y_test, y_pred)\n",
    "rmse = np.sqrt(mse)\n",
    "mae = mean_absolute_error(y_test, y_pred)\n",
    "r2 = r2_score(y_test, y_pred)\n",
    "\n",
    "print(\"MODEL PERFORMANCE METRICS:\")\n",
    "print(\"=\" * 30)\n",
    "print(f\"Mean Squared Error (MSE): {mse:.2f}\")\n",
    "print(f\"Root Mean Squared Error (RMSE): {rmse:.2f}\")\n",
    "print(f\"Mean Absolute Error (MAE): {mae:.2f}\")\n",
    "print(f\"R-squared (R²) Score: {r2:.4f}\")\n",
    "print(f\"Model Accuracy: {r2:.2%}\")"
   ]
  },
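  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The R² reported above can be cross-checked by hand. The next cell is a minimal sketch, assuming the previous cell has been run so that `y_test` and `y_pred` exist; `r2_manual` is a hypothetical helper name, not a scikit-learn function. It computes R² as 1 - SS_res / SS_tot and prints it next to the `r2_score` result, which should agree up to floating-point rounding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import r2_score\n",
    "\n",
    "def r2_manual(y_true, y_hat):\n",
    "    \"\"\"Hypothetical helper: R² = 1 - SS_res / SS_tot.\"\"\"\n",
    "    y_true = np.asarray(y_true, dtype=float)\n",
    "    y_hat = np.asarray(y_hat, dtype=float)\n",
    "    ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares\n",
    "    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean\n",
    "    return 1.0 - ss_res / ss_tot\n",
    "\n",
    "# Assumes y_test and y_pred were created by the evaluation cell above\n",
    "print(f'Manual R²:  {r2_manual(y_test, y_pred):.4f}')\n",
    "print(f'sklearn R²: {r2_score(y_test, y_pred):.4f}')"
   ]
  },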
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display predictions vs actual values\n",
    "results_df = pd.DataFrame({\n",
    "    'Experience': X_test.flatten(),\n",
    "    'Actual_Salary': y_test,\n",
    "    'Predicted_Salary': y_pred,\n",
    "    'Difference': y_test - y_pred\n",
    "})\n",
    "\n",
    "results_df['Absolute_Error'] = np.abs(results_df['Difference'])\n",
    "results_df = results_df.sort_values('Experience')\n",
    "\n",
    "print(\"PREDICTION RESULTS:\")\n",
    "print(results_df.round(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Model Visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create comprehensive visualization\n",
    "plt.figure(figsize=(15, 10))\n",
    "\n",
    "# Plot 1: Training set with regression line\n",
    "plt.subplot(2, 3, 1)\n",
    "plt.scatter(X_train, y_train, color='red', alpha=0.7, label='Training Data')\n",
    "plt.plot(X_train, regressor.predict(X_train), color='blue', linewidth=2, label='Regression Line')\n",
    "plt.title('Training Set: Salary vs Experience')\n",
    "plt.xlabel('Years of Experience')\n",
    "plt.ylabel('Salary ($)')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "# Plot 2: Test set predictions\n",
    "plt.subplot(2, 3, 2)\n",
    "plt.scatter(X_test, y_test, color='red', alpha=0.7, label='Actual Salary')\n",
    "plt.scatter(X_test, y_pred, color='green', alpha=0.7, label='Predicted Salary')\n",
    "plt.title('Test Set: Actual vs Predicted')\n",
    "plt.xlabel('Years of Experience')\n",
    "plt.ylabel('Salary ($)')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "# Plot 3: Actual vs Predicted scatter\n",
    "plt.subplot(2, 3, 3)\n",
    "plt.scatter(y_test, y_pred, alpha=0.7, color='purple')\n",
    "plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)\n",
    "plt.title('Actual vs Predicted Values')\n",
    "plt.xlabel('Actual Salary ($)')\n",
    "plt.ylabel('Predicted Salary ($)')\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "# Plot 4: Residuals plot\n",
    "plt.subplot(2, 3, 4)\n",
    "residuals = y_test - y_pred\n",
    "plt.scatter(y_pred, residuals, alpha=0.7, color='orange')\n",
    "plt.axhline(y=0, color='red', linestyle='--')\n",
    "plt.title('Residuals Plot')\n",
    "plt.xlabel('Predicted Salary ($)')\n",
    "plt.ylabel('Residuals')\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "# Plot 5: Residuals distribution\n",
    "plt.subplot(2, 3, 5)\n",
    "plt.hist(residuals, bins=8, alpha=0.7, color='cyan', edgecolor='black')\n",
    "plt.title('Distribution of Residuals')\n",
    "plt.xlabel('Residuals')\n",
    "plt.ylabel('Frequency')\n",
    "plt.grid(True, alpha=0.3)\n",
    "\n",
    "# Plot 6: Error metrics visualization\n",
    "plt.subplot(2, 3, 6)\n",
    "metrics = ['MSE', 'RMSE', 'MAE', 'R²']\n",
    "values = [mse/1000, rmse/100, mae/100, r2]  # Scaled for better visualization\n",
    "colors = ['red', 'orange', 'yellow', 'green']\n",
    "bars = plt.bar(metrics, values, color=colors, alpha=0.7)\n",
    "plt.title('Model Performance Metrics')\n",
    "plt.ylabel('Scaled Values')\n",
    "\n",
    "# Add value labels on bars\n",
    "for bar, metric, original_value in zip(bars, metrics, [mse, rmse, mae, r2]):\n",
    "    height = bar.get_height()\n",
    "    plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,\n",
    "             f'{original_value:.3f}', ha='center', va='bottom', fontsize=8)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Making Sample Predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make predictions for different experience levels\n",
    "sample_experiences = [1, 2.5, 5, 7.5, 10, 12]\n",
    "\n",
    "print(\"SAMPLE PREDICTIONS:\")\n",
    "print(\"=\" * 40)\n",
    "\n",
    "predictions_list = []\n",
    "for exp in sample_experiences:\n",
    "    predicted_salary = regressor.predict([[exp]])[0]\n",
    "    predictions_list.append([exp, predicted_salary])\n",
    "    print(f\"Experience: {exp:4.1f} years → Predicted Salary: ${predicted_salary:,.2f}\")\n",
    "\n",
    "# Create a DataFrame for better visualization\n",
    "pred_df = pd.DataFrame(predictions_list, columns=['Experience', 'Predicted_Salary'])\n",
    "print(\"\\nPredictions DataFrame:\")\n",
    "print(pred_df)"
   ]
  },
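  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the sample inputs above may lie outside the experience range the model was trained on, in which case the prediction is an extrapolation. The next cell is a minimal sketch of a range check, assuming `X_train`, `regressor`, and `sample_experiences` from the earlier cells are still in memory; the `check_df` name and the `Extrapolated` column are illustrative choices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Range of experience values actually seen during training\n",
    "train_min = float(X_train.min())\n",
    "train_max = float(X_train.max())\n",
    "\n",
    "check_df = pd.DataFrame({'Experience': sample_experiences})\n",
    "# .values yields a 2D array of shape (n_samples, 1), as predict() expects\n",
    "check_df['Predicted_Salary'] = regressor.predict(check_df[['Experience']].values)\n",
    "# True where the input falls outside the training range (bounds inclusive)\n",
    "check_df['Extrapolated'] = ~check_df['Experience'].between(train_min, train_max)\n",
    "\n",
    "print(f'Training range: {train_min:.1f} to {train_max:.1f} years')\n",
    "print(check_df.round(2))"
   ]
  },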
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Interactive prediction function\n",
    "def predict_salary(years):\n",
    "    \"\"\"Function to predict salary for given years of experience\"\"\"\n",
    "    prediction = regressor.predict([[years]])[0]\n",
    "    return prediction\n",
    "\n",
    "# Example usage\n",
    "user_experience = 6.5  # You can change this value\n",
    "predicted = predict_salary(user_experience)\n",
    "print(f\"\\nCustom Prediction:\")\n",
    "print(f\"For {user_experience} years of experience: ${predicted:,.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Model Analysis and Insights"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze the linear relationship\n",
    "slope = regressor.coef_[0]\n",
    "intercept = regressor.intercept_\n",
    "\n",
    "print(\"MODEL ANALYSIS:\")\n",
    "print(\"=\" * 30)\n",
    "print(f\"Linear Equation: Salary = {slope:.2f} × Experience + {intercept:.2f}\")\n",
    "print(f\"\\nInterpretation:\")\n",
    "print(f\"• Base salary (0 years experience): ${intercept:,.2f}\")\n",
    "print(f\"• Salary increase per year of experience: ${slope:,.2f}\")\n",
    "print(f\"• The model explains {r2:.2%} of the salary variance\")\n",
    "\n",
    "# Calculate salary range predictions\n",
    "min_exp = dataset['YearsExperience'].min()\n",
    "max_exp = dataset['YearsExperience'].max()\n",
    "min_salary_pred = predict_salary(min_exp)\n",
    "max_salary_pred = predict_salary(max_exp)\n",
    "\n",
    "print(f\"\\nSalary Range Predictions:\")\n",
    "print(f\"• Minimum experience ({min_exp} years): ${min_salary_pred:,.2f}\")\n",
    "print(f\"• Maximum experience ({max_exp} years): ${max_salary_pred:,.2f}\")\n",
    "print(f\"• Total range: ${max_salary_pred - min_salary_pred:,.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Conclusions and Recommendations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Key Findings:\n",
    "\n",
    "1. **Strong Linear Relationship**: The model shows a strong positive linear relationship between years of experience and salary.\n",
    "\n",
    "2. **High Accuracy**: With an R² score of approximately 95%, the model explains most of the variance in salary data.\n",
    "\n",
    "3. **Practical Insights**: \n",
    "   - Each additional year of experience increases salary by approximately $9,000-10,000\n",
    "   - The base salary (theoretical 0 years) is around $25,000-30,000\n",
    "\n",
    "4. **Model Limitations**:\n",
    "   - Simple linear model may not capture complex real-world factors\n",
    "   - Limited to the experience range in training data\n",
    "   - Assumes linear relationship continues beyond observed data\n",
    "\n",
    "### Recommendations:\n",
    "\n",
    "1. **For HR Applications**: This model can be used as a baseline for salary negotiations and budgeting\n",
    "2. **For Further Development**: Consider adding more features like education, location, industry, etc.\n",
    "3. **For Validation**: Test the model with more diverse datasets\n",
    "4. **For Improvement**: Explore polynomial regression or other algorithms for better accuracy\n",
    "\n",
    "### Project Success:\n",
    "This project successfully demonstrates:\n",
    "- Data loading and preprocessing\n",
    "- Model training and evaluation\n",
    "- Visualization and interpretation\n",
    "- Practical application of machine learning concepts\n",
    "\n",
    "**This completes the Employee Salary Prediction analysis for the IBM SkillBuild AI Internship project.**"
   ]
  }
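  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recommendations 3 and 4 above can be explored with a short experiment. The next cell is a sketch, assuming `X` and `y` from Section 4 are still in memory; the 5 folds and the degree-2 polynomial are illustrative choices, not something the analysis above prescribes. It uses k-fold cross-validation to compare the plain linear model with a polynomial one, so the comparison does not depend on a single train/test split.\n",
    "\n",
    "If the two mean scores are similar, the simpler linear model is the more defensible choice; a clearly higher score for the polynomial model would hint at curvature in the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import KFold, cross_val_score\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "# Shuffle before splitting because the rows may be ordered by experience\n",
    "cv = KFold(n_splits=5, shuffle=True, random_state=42)\n",
    "\n",
    "candidates = {\n",
    "    'linear': LinearRegression(),\n",
    "    'polynomial (degree 2)': make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),\n",
    "}\n",
    "\n",
    "# Assumes X (2D features) and y (1D target) from the data preparation cell\n",
    "for name, model in candidates.items():\n",
    "    scores = cross_val_score(model, X, y, cv=cv, scoring='r2')\n",
    "    print(f'{name}: mean R² = {scores.mean():.4f}, std = {scores.std():.4f}')"
   ]
  }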
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}