
## Demonstration 4.1.2 Identifying types of data and features

A radio station ran an online questionnaire to measure listener satisfaction with a new radio show. The questionnaire covered three categories: discussion topics, the popularity of the presenter, and music choice. Each category had five questions, and the average score per category was recorded for analysis. A total of 100 questionnaires were completed.

Follow the demonstration to explore the first step in feature engineering – identifying and selecting features/inputs that are relevant to the problem. In this demonstration, you will learn:

* the benefit of identifying the relevant input features
* the impact of identifying the wrong input features.

In [1]:
# Import the necessary libraries
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

In [2]:
# Load the identify_features.csv data set from GitHub via its URL.
url = "https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/main/identify_features.csv"

# Read the CSV file into a new dataframe
df_read = pd.read_csv(url)

# Display the first five rows of the dataframe
df_read.head()

Unnamed: 0,x1,x2,x3,y
0,2.133727,0.088822,0.863038,5.932389
1,2.950237,2.546523,0.579258,15.258863
2,4.25996,1.9868,1.654105,17.519463
3,4.848423,2.490406,4.351971,28.220118
4,2.813239,1.299174,0.470025,10.409317


In [3]:
# Specify the independent (X) and dependent (y) variables.
y = df_read['y']

# Build X from every column except 'y', leaving df_read unchanged.
# Note: X still contains the 'Unnamed: 0' column read from the CSV.
X = df_read.drop(columns='y')

# View the dataframe
X.head()

Unnamed: 0,x1,x2,x3
0,2.133727,0.088822,0.863038
1,2.950237,2.546523,0.579258
2,4.25996,1.9868,1.654105
3,4.848423,2.490406,4.351971
4,2.813239,1.299174,0.470025
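
The `Unnamed: 0` column in the output above looks like a row index that was saved into the CSV file. If that is the case, it is passed to the regression as a fourth input alongside `x1`, `x2`, and `x3`. The following is a minimal sketch, using a small made-up CSV string rather than the real file, of how `index_col=0` in `pd.read_csv` keeps such a column out of the feature set. The names `csv_text`, `df_default`, and `df_indexed` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row CSV whose first header field is empty,
# mimicking a file written with a saved row index.
csv_text = ",x1,x2,y\n0,1.0,2.0,3.0\n1,4.0,5.0,6.0\n2,7.0,8.0,9.0\n"

# Default read: the unnamed first column becomes an ordinary data column.
df_default = pd.read_csv(io.StringIO(csv_text))
print(list(df_default.columns))

# index_col=0 tells pandas to use the first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_indexed.columns))
```

Applying the same option to the real data set would change the inputs to the first model, so the metrics reported later in this demonstration might differ slightly.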


> **Variables:**
> - `y`: Target (dependent variable) representing listener satisfaction
> - `X`: Features representing discussion topics (`x1`), popularity of the presenter (`x2`), and music choice (`x3`)
>
> We will create three linear regression models as follows:
> - Model 1: Three features (`x1`, `x2`, and `x3`)
> - Model 2: Two features (`x1` and `x2`)
> - Model 3: One feature (`x1`)
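
For reference, the three models listed above can be written as the following equations. The coefficient symbols $\beta_j^{(m)}$ are notation introduced here for illustration, not taken from the source. Each model estimates its own coefficients, so the superscript identifies the model.

$$
\begin{aligned}
\text{Model 1:}\quad \hat{y} &= \beta_0^{(1)} + \beta_1^{(1)} x_1 + \beta_2^{(1)} x_2 + \beta_3^{(1)} x_3 \\
\text{Model 2:}\quad \hat{y} &= \beta_0^{(2)} + \beta_1^{(2)} x_1 + \beta_2^{(2)} x_2 \\
\text{Model 3:}\quad \hat{y} &= \beta_0^{(3)} + \beta_1^{(3)} x_1
\end{aligned}
$$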

In [4]:
# Only include two of the independent variables in the model.
X_train_missing_1 = X[['x1', 'x2']]

# Only include one independent variable in the model.
X_train_missing_2 = X[['x1']]

In [5]:
# Fit a linear regression model on all columns of X (no train/test split is used)
model_all_features = LinearRegression()
model_all_features.fit(X, y)

# Fit a linear regression model with 1 feature missing (x3 excluded)
model_missing_feature_1 = LinearRegression()
model_missing_feature_1.fit(X_train_missing_1, y)

# Fit a linear regression model with 2 features missing (x2 and x3 excluded)
model_missing_features_2 = LinearRegression()
model_missing_features_2.fit(X_train_missing_2, y)

In [6]:
# Make a prediction on the complete data
y_pred_all = model_all_features.predict(X)

# Make a prediction using only 2 independent variables
y_pred_missing_1 = model_missing_feature_1.predict(X_train_missing_1)

# Make a prediction using only 1 independent variable
y_pred_missing_2 = model_missing_features_2.predict(X_train_missing_2)

In [16]:
# Calculate the mean squared error (MSE) to evaluate each model.
# Model that predicts from all independent variables
mse_all = np.mean((y_pred_all - y)**2)

# Model that predicts from 2 independent variables
mse_missing_1 = np.mean((y_pred_missing_1 - y)**2)

# Model that predicts from 1 independent variable
mse_missing_2 = np.mean((y_pred_missing_2 - y)**2)

In [10]:
# Calculate the R-squared
# R-squared for the model based on all independent variables
r2_all = r2_score(y, y_pred_all)

# R-squared for the model based on 2 independent variables
r2_missing_1 = r2_score(y, y_pred_missing_1)

# R-squared for the model based on 1 independent variable
r2_missing_2 = r2_score(y, y_pred_missing_2)
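
To make the two metrics concrete, here is a minimal sketch that computes the MSE and R-squared by hand and compares them with scikit-learn's `mean_squared_error` and `r2_score`. The arrays `y_true` and `y_hat` hold made-up toy values, not questionnaire data, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical toy values, not taken from the questionnaire data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: the mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual values should match the scikit-learn functions.
print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

A lower MSE means smaller average squared error, while an R-squared closer to 1 means the model explains more of the variance in `y`.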

In [17]:
# Print the evaluation metrics

# Evaluate model with all independent variables
print("Model with all independent variables:")
print("Mean Squared Error:", mse_all)
print("R-squared:", r2_all)
print()

# Evaluate the model with 2 independent variables
print("Model with 2 independent variables:")
print("Mean Squared Error:", mse_missing_1)
print("R-squared:", r2_missing_1)
print()

# Evaluate the model with 1 independent variable
print("Model with 1 independent variable:")
print("Mean Squared Error:", mse_missing_2)
print("R-squared:", r2_missing_2)

Model with all independent variables:
Mean Squared Error: 0.10153555204559928
R-squared: 0.9983938107350903

Model with 2 independent variables:
Mean Squared Error: 19.311539904453003
R-squared: 0.6945110608205607

Model with 1 independent variable:
Mean Squared Error: 61.496802104042516
R-squared: 0.027183076510646442
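
The metrics above are computed on the same rows that were used to fit each model. The notebook imports `train_test_split` but does not call it, so the following is only a sketch of how a held-out evaluation could look. It uses synthetic stand-in data (an assumption, not the questionnaire file), and the coefficients, noise level, and names such as `X_demo` and `results` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the questionnaire data):
# y depends on all three features plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(100, 3))
noise = rng.normal(0, 0.3, size=100)
y_demo = 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] + 4.0 * X_demo[:, 2] + noise

# Hold out 20% of the rows for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training rows, score on the held-out rows,
# using the first 3, 2, and 1 feature columns in turn.
results = {}
for n_features in (3, 2, 1):
    cols = list(range(n_features))
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    y_hat = model.predict(X_te[:, cols])
    results[n_features] = (mean_squared_error(y_te, y_hat), r2_score(y_te, y_hat))
    print(f"{n_features} feature(s): MSE={results[n_features][0]:.3f}, "
          f"R-squared={results[n_features][1]:.3f}")
```

Scoring on held-out rows estimates how each model would perform on responses it has not seen, which in-sample metrics cannot show.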


## Conclusion

The results clearly demonstrate the importance of identifying and selecting relevant input features when building regression models.

The model that included all three questionnaire categories (discussion topics, presenter popularity, and music choice) performed best, with an MSE of about 0.10 and an R-squared of about 0.998. This indicates that listener satisfaction is influenced by a combination of factors rather than a single aspect of the show.
* With music choice (`x3`) removed, the MSE rose to about 19.3 and R-squared fell to about 0.69. With only discussion topics (`x1`) left, the MSE rose to about 61.5 and R-squared fell to about 0.03.
* Excluding relevant features can therefore lead to underfitting and misleading conclusions about what drives listener satisfaction.
* All metrics were computed on the same data used to fit the models, so they measure in-sample fit rather than performance on unseen responses.

Overall, the evaluation shows that careful feature selection is critical to accurately capture the underlying drivers of satisfaction and to support reliable, data-driven decision-making.