
**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!

## Demonstration 4.1.2 Identifying types of data and features

A radio station ran an online questionnaire to measure listener satisfaction with a new radio show. The questionnaire covered three categories: discussion topics, the popularity of the presenter, and music choice. A total of 100 questionnaires were completed. Each category contained five questions, and the average score for each category was recorded for analysis.

Follow the demonstration to explore the first step in feature engineering – identifying and selecting features/inputs that are relevant to the problem. In this demonstration, you will learn:
*   the benefit of identifying the relevant input features
*   the impact of identifying the wrong input features.

In [None]:
# Import the necessary libraries.
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

In [None]:
# Import the identify_features.csv file (data set) from GitHub with a url.
url = "https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/main/identify_features.csv"

# Read the CSV file into a new DataFrame.
df_read = pd.read_csv(url)

# Display the first few rows of the DataFrame.
df_read.head()

Unnamed: 0,x1,x2,x3,y
0,2.133727,0.088822,0.863038,5.932389
1,2.950237,2.546523,0.579258,15.258863
2,4.25996,1.9868,1.654105,17.519463
3,4.848423,2.490406,4.351971,28.220118
4,2.813239,1.299174,0.470025,10.409317
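
Before choosing features, it can help to confirm what type of data each column holds and whether any values are missing. The following is a minimal sketch of such a check. The DataFrame `df_demo` and its values are hypothetical stand-ins, not the contents of `identify_features.csv`; the same calls could be applied to `df_read`.

```python
import pandas as pd

# Hypothetical stand-in for the questionnaire data (illustrative values only).
df_demo = pd.DataFrame({
    "x1": [2.1, 3.0, 4.3],
    "x2": [0.1, 2.5, 2.0],
    "x3": [0.9, 0.6, 1.7],
    "y": [5.9, 15.3, 17.5],
})

# Data type of each column; averaged scores should be continuous (float64).
column_types = df_demo.dtypes

# Number of missing values in each column.
missing_counts = df_demo.isna().sum()

# Shape as (rows, columns).
n_rows, n_cols = df_demo.shape

print(column_types)
print(missing_counts)
print(n_rows, n_cols)
```

If a column reports an `object` type or a non-zero missing count, it would usually need attention before being used as a regression input.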


In [None]:
# Specify the independent (X) and dependent (y) variables.
y = df_read['y']

# Drop the y column to form X, leaving df_read unchanged.
X = df_read.drop(columns='y')

# View the DataFrame.
X.head()

Unnamed: 0,x1,x2,x3
0,2.133727,0.088822,0.863038
1,2.950237,2.546523,0.579258
2,4.25996,1.9868,1.654105
3,4.848423,2.490406,4.351971
4,2.813239,1.299174,0.470025


> **Variables:**
> - `y`: Target (dependent variable) representing listener satisfaction
> - `X`: Features representing discussion topics (`x1`), popularity of the presenter (`x2`), and music choice (`x3`)
>
> We will create three linear regression models as follows:
> - Model 1: Three features (`x1`, `x2`, and `x3`)
> - Model 2: Two features (`x1` and `x2`)
> - Model 3: One feature (`x1`)
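
One rough way to get a first impression of how strongly each candidate feature relates to the target is to look at pairwise correlations. The sketch below uses synthetic data: the coefficients `0.5`, `3`, and `2`, the noise level, and the names `features` and `target` are all assumptions chosen for illustration, not the process behind the questionnaire data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the coefficients 0.5, 3 and 2 are assumptions
# for illustration, not the process behind identify_features.csv.
rng = np.random.default_rng(42)
features = pd.DataFrame(
    rng.uniform(0, 5, size=(500, 3)), columns=["x1", "x2", "x3"]
)
noise = rng.normal(0, 0.3, size=500)
target = 0.5 * features["x1"] + 3 * features["x2"] + 2 * features["x3"] + noise

# Pearson correlation of each feature column with the target.
correlations = features.corrwith(target)
print(correlations.sort_values(ascending=False))
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is a screening aid rather than a replacement for the business knowledge used to decide which features belong in the model.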

In [None]:
# Only include two of the independent variables in the model.
X_train_missing_1 = X[['x1', 'x2']]

# Only include one of the independent variables in the model.
X_train_missing_2 = X[['x1']]

In [None]:
# Fit a linear regression model on all three features.
model_all_features = LinearRegression()
model_all_features.fit(X, y)

# Fit a linear regression model with one feature missing (x3).
model_missing_features_1 = LinearRegression()
model_missing_features_1.fit(X_train_missing_1, y)

# Fit a linear regression model with two features missing (x2 and x3).
model_missing_features_2 = LinearRegression()
model_missing_features_2.fit(X_train_missing_2, y)

In [None]:
# Make a prediction on the complete data.
# All the independent variables
y_pred_all = model_all_features.predict(X)

# Only two independent variables
y_pred_missing_1 = model_missing_features_1.predict(X_train_missing_1)

# Only one independent variable
y_pred_missing_2 = model_missing_features_2.predict(X_train_missing_2)

In [None]:
# Calculate the mean squared error.
# All the independent variables
mse_all = np.mean((y_pred_all - y)**2)

# Only two independent variables
mse_missing_1 = np.mean((y_pred_missing_1 - y)**2)

# Only one independent variable
mse_missing_2 = np.mean((y_pred_missing_2 - y)**2)
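
The manual calculation above is the average of the squared differences between predictions and actual values. As a small check on made-up numbers (the arrays `y_true` and `y_hat` are hypothetical), the sketch below shows that it should agree with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# Manual MSE: mean of the squared errors (0.25 + 0 + 1 + 0.25) / 4.
mse_manual = np.mean((y_hat - y_true) ** 2)

# The same quantity computed by scikit-learn.
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)
```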

In [None]:
# Calculate the R-squared.
# All the independent variables
r2_all = r2_score(y, y_pred_all)

# Only two independent variables
r2_missing_1 = r2_score(y, y_pred_missing_1)

# Only one independent variable
r2_missing_2 = r2_score(y, y_pred_missing_2)
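
R-squared compares the model's squared error with the squared error of always predicting the mean of `y`: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on made-up numbers (the arrays `y_true` and `y_hat` are hypothetical) shows the arithmetic.

```python
import numpy as np

# Hypothetical actual and predicted values (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# Residual sum of squares: error left over after using the model.
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: error from always predicting the mean of y_true.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R-squared: the share of variation in y_true that the predictions account for.
r2_manual = 1 - ss_res / ss_tot
print(r2_manual)
```

A value near 1 means the predictions account for almost all of the variation in the target, while a value near 0 means the model does little better than predicting the mean.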

In [None]:
# Print the mean squared error.
# All the independent variables
print('MSE ALL Features:', mse_all)

# Only two independent variables
print('MSE Missing 1 Feature:', mse_missing_1)

# Only one independent variable
print('MSE Missing 2 Features:', mse_missing_2)

MSE ALL Features: 0.10153555204559928
MSE Missing 1 Feature: 19.311539904453003
MSE Missing 2 Features: 61.496802104042516


In [None]:
# Print R-squared.
# All the independent variables
print('R-squared ALL Features:', r2_all)

# Only two independent variables
print('R-squared Missing 1 Feature:', r2_missing_1)

# Only one independent variable
print('R-squared Missing 2 Features:', r2_missing_2)

R-squared ALL Features: 0.9983938107350903
R-squared Missing 1 Feature: 0.6945110608205607
R-squared Missing 2 Features: 0.027183076510646442


## Key information
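This demonstration illustrates the importance of identifying and selecting the correct features/inputs based on sound business knowledge. Which features you include directly affects the accuracy of the model's predictions: with all three features the model reached an R-squared of about 0.998 (MSE about 0.10), dropping `x3` reduced it to about 0.69 (MSE about 19.3), and using `x1` alone reduced it to about 0.03 (MSE about 61.5). Note that these metrics were calculated on the same data used to fit the models.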
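
The same comparison can also be made on held-out data, which gives a better indication of how each model would perform on new responses. The sketch below is one way this might look, using `train_test_split` on synthetic data. The coefficients, noise level, and names such as `X_demo` and `feature_sets` are assumptions for illustration, not the questionnaire data or its true relationships.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; coefficients are assumptions for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(1000, 3))  # columns play the role of x1, x2, x3
y_demo = (
    0.5 * X_demo[:, 0] + 3 * X_demo[:, 1] + 2 * X_demo[:, 2]
    + rng.normal(0, 0.3, size=1000)
)

# Hold out 25% of the rows so each model is scored on unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Column indices used by each of the three models.
feature_sets = {"x1, x2, x3": [0, 1, 2], "x1, x2": [0, 1], "x1": [0]}

test_r2 = {}
for name, cols in feature_sets.items():
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    test_r2[name] = r2_score(y_te, model.predict(X_te[:, cols]))
    print(f"Held-out R-squared ({name}): {test_r2[name]:.3f}")
```

In this synthetic setup the held-out R-squared should fall as features are removed, mirroring the pattern seen in the demonstration.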

## Reflect
**What are the practical applications of this technique?**

This demonstration highlights that omitting relevant input features significantly limits a model's capability, no matter how much it is fine-tuned.

If an important feature is omitted at the start, whether knowingly or unknowingly, poor performance stems from the missing information rather than from the model or the techniques applied to it.

In conclusion, selecting an appropriate set of features is a crucial first step before proceeding with the rest of the feature engineering process, because it directly influences a model's accuracy, complexity, and interpretability.