<a href="https://colab.research.google.com/github/nowknowing/text-classification/blob/main/Copy_of_01_Regression_Capstone_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://weclouddata.s3.amazonaws.com/images/logos/wcd_logo_new_2.png"  width='25%'>  

Developed by WeCloudData

# Capstone Project: Your First Regression Model

Welcome to your first capstone project in machine learning! In this notebook, you will build a complete regression pipeline using Python and the Kaggle House Prices dataset. You will learn how to:

- Load and explore the dataset
- Clean and preprocess the data
- Engineer new features
- Split the data into training and testing sets
- Train various regression models
- Evaluate model performance

After following along with the House Prices example, you'll be asked to apply these steps on a dataset of your choice from Kaggle.

Let's get started!

## Step 1: Choosing a Dataset

For this project, we are using the **House Prices: Advanced Regression Techniques** dataset from Kaggle. Download the dataset from the following link:

[Kaggle House Prices Competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

Make sure to download the `train.csv` file and place it in the same folder as this notebook.

Once you are comfortable with these steps, try choosing another Kaggle regression dataset and follow the same process.

### Other Kaggle Regression Dataset Suggestions

Here are some other interesting Kaggle regression datasets that you can implement using similar steps:

- **Bike Sharing Demand:** Predict the count of bike rentals.
  - [Kaggle Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand)

- **Wine Quality:** Predict the quality score of red wines.
  - [Kaggle Wine Quality Dataset](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009)

- **Medical Cost Personal Datasets:** Predict individual medical costs billed by health insurance.
  - [Kaggle Medical Cost Personal Datasets](https://www.kaggle.com/mirichoi0218/insurance)

Feel free to use one of these datasets for your own project once you have completed the House Prices example.

In [None]:
# Example: Loading the House Prices dataset
import pandas as pd

# Load the House Prices dataset (ensure train.csv is in your working directory)
df = pd.read_csv('train.csv')

print('Example Dataset: House Prices')
display(df.head())

### Exercise

1. **Download a dataset from Kaggle** of your choice (for example, one of the suggestions above).
2. Replace the example code above with code to load your chosen dataset into a Pandas DataFrame.
3. Ensure your dataset has a target column for regression (for House Prices, the target is `SalePrice`).
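
The third exercise item can be checked in code before you go further. The sketch below is one possible way to do it; the helper name `check_regression_target` and the toy columns are assumptions made for illustration and are not part of any library.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_regression_target(df, target):
    """Return a list of problems that would prevent using `target` for regression."""
    if target not in df.columns:
        return [f"column '{target}' not found"]
    problems = []
    if not is_numeric_dtype(df[target]):
        problems.append(f"column '{target}' is not numeric")
    if df[target].isnull().any():
        problems.append(f"column '{target}' contains missing values")
    return problems


# Toy data standing in for a real Kaggle file (values are made up)
toy = pd.DataFrame({
    'rooms': [3, 4, 2],
    'city': ['Toronto', 'Ottawa', 'Toronto'],
    'price': [350000.0, 420000.0, 280000.0],
})

print(check_regression_target(toy, 'price'))   # no problems expected for a numeric column
print(check_regression_target(toy, 'city'))    # a text column is flagged
```

With a real dataset you would load the file with `pd.read_csv(...)` first and then pass the resulting DataFrame and your target column name to the helper.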

## Step 2: Loading and Exploring the Data

In this step, we will explore the House Prices dataset by:

- Viewing the first few rows
- Getting summary statistics using `.describe()`
- Checking the data structure with `.info()`
- Looking for missing values

This will help you understand the dataset before diving into cleaning and modeling.

In [None]:
# View the first few rows
print('First 5 rows of the House Prices dataset:')
display(df.head())

# Summary statistics
print('Summary statistics:')
display(df.describe())

# Data structure and non-null counts (info() prints directly and returns None)
print('Dataset info:')
df.info()

print('Missing values in each column:')
display(df.isnull().sum())
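
With many columns, the raw output of `isnull().sum()` is hard to scan. The sketch below shows one way to keep only the columns that actually have missing values and to report them as percentages. The helper name `missing_summary` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return count, percentage, and dtype for every column with missing values."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(1),
        'dtype': df.dtypes.astype(str),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)


# Toy frame using House Prices column names (values are made up)
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'Alley': [None, None, 'Grvl', None],
    'SalePrice': [208500, 181500, 223500, 140000],
})

result = missing_summary(toy)
print(result)
```

On the real dataset you would call `missing_summary(df)` after loading `train.csv`.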

### Exercise

Using your own regression dataset from Kaggle, perform the same exploratory steps:

- Print the first few rows
- Display summary statistics and dataset info
- Check for missing values

## Step 3: Data Cleaning and Preprocessing

Real-world data is often messy. In this step, you'll learn to clean the House Prices dataset by:

- Handling missing values
- Converting categorical variables into numerical format
- Scaling numerical features (if needed)

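Let's implement these preprocessing steps. In this example, we will fill missing numeric values with the median and categorical values with the mode, and then convert categorical variables to dummy variables. Scaling is not applied in this example; it matters mainly for distance-based models such as KNN and SVR, while tree-based models do not need it.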

In [None]:
# Create a copy of the DataFrame for cleaning
df_clean = df.copy()

# Fill missing numeric columns with the median
# (assign the result back; calling fillna with inplace=True on a selected column is deprecated in recent pandas)
numeric_cols = df_clean.select_dtypes(include="number").columns
for col in numeric_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# Fill missing categorical columns with the mode (the most frequent value)
categorical_cols = df_clean.select_dtypes(include=["object"]).columns
for col in categorical_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

# Convert categorical variables into dummy/indicator variables
df_clean = pd.get_dummies(df_clean, drop_first=True)

print('Cleaned House Prices dataset:')
display(df_clean.head())

# Verify that there are no missing values
print('Missing values after cleaning:')
display(df_clean.isnull().sum())
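
The same three steps (imputing, encoding, scaling) can also be expressed with scikit-learn transformers, which makes it possible to fit them on the training data only and reuse them on the test data. The sketch below is an alternative to the pandas approach above, not the method this notebook uses; the toy values and the step names (`'impute'`, `'scale'`, `'encode'`, `'num'`, `'cat'`) are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numeric and one categorical column (values are made up)
toy = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, np.nan, 9550.0],
    'Street': ['Pave', 'Grvl', 'Pave', np.nan],
})

numeric_features = ['LotArea']
categorical_features = ['Street']

# Numeric columns: fill with the median, then standardize to mean 0 and variance 1
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: fill with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer(
    [
        ('num', numeric_steps, numeric_features),
        ('cat', categorical_steps, categorical_features),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

In a full workflow you would call `preprocessor.fit_transform` on the training features and `preprocessor.transform` on the test features, so that medians, modes, and scaling statistics come from the training data alone.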

### Exercise

Apply similar cleaning and preprocessing steps to your own regression dataset:

1. Handle any missing values (either fill or drop them).
2. Convert categorical variables to numeric (using encoding methods such as get_dummies).
3. Scale numerical features if necessary.

## Step 4: Feature Engineering

Feature engineering is the process of creating new features that might help improve your model's performance. For the House Prices dataset, a common new feature is **TotalSF** (total square footage), which can be calculated as:

```
TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF
```

Let's create this new feature and visualize its distribution.

In [None]:
# Create a new feature 'TotalSF'
df_clean['TotalSF'] = df_clean['TotalBsmtSF'] + df_clean['1stFlrSF'] + df_clean['2ndFlrSF']

print('Dataset with new feature (TotalSF):')
display(df_clean[['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'TotalSF']].head())

# Plot the distribution of TotalSF
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6,4))
sns.histplot(df_clean['TotalSF'], kde=True)
plt.title('Distribution of TotalSF')
plt.xlabel('Total Square Footage')
plt.show()

### Exercise

For your dataset, try to create at least one new feature. For example, consider combining existing features or computing a ratio/difference that might be useful for predicting the target variable.

Add your new feature(s) to your DataFrame and visualize its/their distribution.
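
As a sketch of the difference and ratio ideas mentioned above, the example below builds two features on a toy frame. The column names come from the House Prices dataset, but the values and the new feature names `HouseAge` and `AreaPerRoom` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows using House Prices column names (values are made up)
toy = pd.DataFrame({
    'YrSold': [2008, 2007, 2010],
    'YearBuilt': [2003, 1976, 2010],
    'GrLivArea': [1710, 1262, 1500],
    'TotRmsAbvGrd': [8, 6, 0],
})

# Difference feature: age of the house at the time of sale
toy['HouseAge'] = toy['YrSold'] - toy['YearBuilt']

# Ratio feature: living area per room.
# Zero room counts are replaced with NaN first so the division never divides by zero.
rooms = toy['TotRmsAbvGrd'].replace(0, np.nan)
toy['AreaPerRoom'] = toy['GrLivArea'] / rooms

print(toy[['HouseAge', 'AreaPerRoom']])
```

Any NaN values produced by a ratio feature need the same missing-value handling as the rest of the data before training.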

## Step 5: Splitting the Data

Before training your models, split your dataset into a training set and a testing set. The training set is used to build your model, while the testing set evaluates its performance on unseen data.

For the House Prices dataset, our target variable is `SalePrice`.

In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y). For House Prices, target is 'SalePrice'.
# 'Id' is only a row identifier, so it is dropped from the features as well.
X = df_clean.drop(['SalePrice', 'Id'], axis=1)
y = df_clean['SalePrice']

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Training set shape:', X_train.shape)
print('Testing set shape:', X_test.shape)
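
To see what `test_size` and `random_state` do, the sketch below splits a small synthetic array twice with the same seed. The names `X_demo` and `y_demo` are placeholders, not part of the House Prices workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, and a target that doubles as a row label
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)        # 80 training rows and 20 testing rows
print(np.array_equal(X_te, X_te2))   # the same seed reproduces the same split
print(set(y_tr) & set(y_te))         # no row appears in both sets
```

Fixing `random_state` keeps results comparable between runs; changing it gives a different but equally valid split.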

### Exercise

Using your own regression dataset, perform the train-test split:

- Define your features (X) and target (y).
- Split the data into training and testing sets (80/20 or as appropriate).

## Step 6: Training Different Regression Models

Now it's time to build some models! In this step, we'll train several regression algorithms using the House Prices training data. We'll work with the following models:

- **Linear Regression**
- **Decision Tree Regressor**
- **Random Forest Regressor**
- **k-Nearest Neighbors Regressor (KNN)**
- **Support Vector Regressor (SVR)**

Let's train these models using our preprocessed House Prices dataset.

## Step 6.5: Setting Hyperparameters

Before training the regression models, it's important to configure their hyperparameters. Adjusting these values can significantly affect your model's performance. Here are some example settings:

- **Decision Tree Regressor:**  
  - *max_depth:* Limits the maximum depth of the tree (e.g., `max_depth=5`).  
  - *min_samples_split:* Minimum number of samples required to split an internal node (e.g., `min_samples_split=10`).

- **Random Forest Regressor:**  
  - *n_estimators:* The number of trees in the forest (e.g., `n_estimators=100`).  
  - *max_depth:* Maximum depth of each tree (e.g., `max_depth=7`).  
  - *random_state:* A seed value for reproducibility (e.g., `random_state=42`).

- **K-Nearest Neighbors Regressor:**  
  - *n_neighbors:* The number of neighbors to consider (e.g., `n_neighbors=5`).

- **Support Vector Regressor (SVR):**  
  - *kernel:* The type of kernel to use (e.g., `'rbf'`).  
  - *C:* Regularization parameter. Larger values penalize training errors more heavily (weaker regularization), while smaller values favor a smoother model (e.g., `C=1.0`).  
  - *epsilon:* Half-width of the epsilon-tube around the prediction. Errors smaller than `epsilon` incur no penalty in the training loss (e.g., `epsilon=0.1`).

Feel free to experiment with these hyperparameters to see how they impact the performance of your regression models.
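
One simple way to experiment is to loop over a few values of a single hyperparameter and compare training and testing scores. The sketch below does this for `max_depth` on synthetic data; the data, the chosen depths, and the `results` dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve with one feature
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Map each max_depth to its (train R2, test R2); None means no depth limit
results = {}
for depth in [1, 3, 5, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R2={results[depth][0]:.2f}, test R2={results[depth][1]:.2f}")
```

A large gap between the training and testing scores usually indicates overfitting, which is why an unlimited tree is not necessarily the best choice.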


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Create each model with its hyperparameters
lr_model = LinearRegression()

dt_model = DecisionTreeRegressor(max_depth=5, min_samples_split=10)
rf_model = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=42)
knn_model = KNeighborsRegressor(n_neighbors=5)
svr_model = SVR(kernel='rbf', C=1.0, epsilon=0.1)

# Collect the models in a dictionary so we can loop over them
models = {
    'Linear Regression': lr_model,
    'Decision Tree': dt_model,
    'Random Forest': rf_model,
    'K-Nearest Neighbors': knn_model,
    'SVR': svr_model
}

# Train each model and store predictions
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    predictions[name] = pred
    print(f"{name} model trained.")

print('\nAll models have been trained on the House Prices dataset!')
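
KNN and SVR rely on distances between rows, so features on very different scales can dominate the result. The sketch below compares a KNN regressor with and without `StandardScaler` on synthetic data built to show the effect. The data and variable names are assumptions for illustration, and how much scaling helps on a real dataset will vary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
informative = rng.normal(0, 1, size=500)        # small-scale feature that drives the target
noise_feature = rng.normal(0, 1000, size=500)   # large-scale feature unrelated to the target
X_demo = np.column_stack([informative, noise_feature])
y_demo = 3 * informative + rng.normal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Without scaling, distances are dominated by the large-scale noise feature
raw_knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)

# The pipeline fits the scaler on the training data and applies it before KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)).fit(X_tr, y_tr)

r2_raw = raw_knn.score(X_te, y_te)
r2_scaled = scaled_knn.score(X_te, y_te)
print(f"R2 without scaling: {r2_raw:.2f}")
print(f"R2 with scaling:    {r2_scaled:.2f}")
```

A pipeline built this way can be placed in the `models` dictionary in place of the plain estimator, since it exposes the same `fit` and `predict` methods.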


### Exercise

Repeat this step using your own regression dataset:

1. Initialize similar regression models.
2. Train each model on the training set of your dataset.
3. Save the predictions for later evaluation.

## Step 7: Evaluating Model Performance

After training your models, evaluate how well they perform on the testing data. For regression tasks, common metrics include:

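- **Mean Squared Error (MSE):** the average of the squared errors, in squared target units; it penalizes large errors heavily.
- **Mean Absolute Error (MAE):** the average absolute error, in the same units as the target.
- **R² Score:** the proportion of the target's variance explained by the model; 1.0 is a perfect fit, and the value can be negative for very poor models.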

Let's compute these metrics for the models trained on the House Prices dataset.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

for name, pred in predictions.items():
    mse = mean_squared_error(y_test, pred)
    mae = mean_absolute_error(y_test, pred)
    r2 = r2_score(y_test, pred)
    print(f"\nModel: {name}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"Mean Absolute Error (MAE): {mae:.2f}")
    print(f"R² Score: {r2:.2f}")
    print('---------------------------')
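
To make the metrics less of a black box, the sketch below computes them by hand with NumPy on four made-up prices and checks the results against scikit-learn. It also adds the root mean squared error (RMSE), which is the square root of MSE and is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted sale prices
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 185000.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # back in the units of the target
mae = np.mean(np.abs(errors))    # average absolute error
# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}, MAE: {mae:.2f}, R2: {r2:.4f}")

# The manual values agree with scikit-learn's implementations
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

For house prices, RMSE and MAE are often easier to interpret than MSE because they read directly as an error in currency units.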

### Exercise

For your dataset, evaluate your trained regression models using similar metrics:

1. Calculate and review metrics such as MSE, MAE, and R² Score for each model.
2. Compare the performance of the different models.

## Final Thoughts and Next Steps

Great job! You have now built a complete regression pipeline using the House Prices dataset:

1. **Choosing a Dataset:** Downloaded the House Prices dataset from Kaggle.
2. **Loading and Exploring the Data:** Loaded and explored the dataset.
3. **Data Cleaning and Preprocessing:** Cleaned and prepared the data.
4. **Feature Engineering:** Created new features (e.g., TotalSF).
5. **Splitting the Data:** Divided the data into training and testing sets.
6. **Training Models:** Built several regression models.
7. **Evaluating Performance:** Evaluated the models on unseen data.

### Next Steps

- Experiment with other Kaggle regression datasets and try replicating these steps.
- Explore additional preprocessing techniques and feature engineering ideas.
- Once comfortable, try incorporating more advanced techniques such as hyperparameter tuning.
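
As a starting point for hyperparameter tuning, the sketch below uses scikit-learn's `GridSearchCV` to try a small grid of Random Forest settings with 3-fold cross-validation. The synthetic data and the grid values are assumptions chosen to keep the example fast; they are not recommended settings for the House Prices dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with three features
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(0, 1, size=200)

# Every combination of these values is evaluated
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, 7],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,                               # 3-fold cross-validation
    scoring='neg_mean_squared_error',   # scikit-learn maximizes scores, so MSE is negated
)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best cross-validated MSE:', -search.best_score_)
```

On a real project you would fit the search on the training set only and keep the test set for a final evaluation of `search.best_estimator_`.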

Keep experimenting and enjoy your journey into machine learning!

## References and Further Reading

Here are some useful resources for the modules and functions used in this notebook:

- **Pandas:** [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
- **NumPy:** [NumPy Documentation](https://numpy.org/doc/stable/)
- **Matplotlib:** [Matplotlib API Reference](https://matplotlib.org/stable/api/index.html)
- **Seaborn:** [Seaborn API Reference](https://seaborn.pydata.org/api.html)
- **Scikit-Learn:** [Scikit-Learn Documentation](https://scikit-learn.org/stable/documentation.html)
  - [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
  - [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
  - [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
  - [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
  - [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
  - [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)
  - [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
  - [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html)
  - [r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)

These resources will help you learn more about the functions and libraries used throughout the notebook.