<a href="https://colab.research.google.com/github/muskan-asudani/customer-satisfaction-score-prediction/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customer Satisfaction Score

Project made by:

* Muskan Asudani, roll no. 06
* Aditya Shukla, roll no. 42


                

The Customer Feedback and Satisfaction Dataset is designed for analyzing and predicting customer satisfaction from demographic and behavioral factors. It contains records for 38,444 customers, capturing their feedback on products and services in a structured format.

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns   #visualisation
import matplotlib.pyplot as plt    #visualisation
%matplotlib inline
#to display the plots immediately below the code
sns.set(color_codes=True)
#to enable us to use shorthand color codes such as 'b', 'g', 'r'

## Load dataset

The data was downloaded from Kaggle and uploaded from the local device.

In [None]:
data=pd.read_csv('customer_feedback_satisfaction.csv')

## Exploratory data analysis

In [None]:
data.shape

In [None]:
data.describe() #summary statistics (count, mean, std, min, quartiles, max) for the numeric columns

In [None]:
data.head() #to show the first five entries in the dataset

In [None]:
data.tail() #to show the last entries in the dataset  default=5

In [None]:
data.dtypes #check the types of data present

In [None]:
data.values

## Data Preprocessing

**Dropping irrelevant columns**

In [None]:
#dropping the customer id column here because it is only an identifier and won't factor in for the prediction
#axis=1 implies we are dropping a column
data=data.drop(['CustomerID'],axis=1)
data.head()

**Renaming the columns**

In [None]:
#this can be used to give the columns short alternative names so they are easier to access
data=data.rename(columns={'SatisfactionScore':'SS'})
data.head()

**Dropping duplicate rows**

In [None]:
if data.duplicated().any():
  data=data.drop_duplicates()

**Dropping null values**

data.isna(): Returns a DataFrame of the same shape as data, with True indicating the presence of a NaN in each cell.

data.isna().any(): For each column, it checks if there are any True values (i.e., any NaN values). This results in a Series where each entry is True if the column contains at least one NaN.

data.isna().any().any(): The second .any() checks across the entire Series of columns, returning True if any column has at least one missing value. Essentially, this checks if there's any missing value in the entire DataFrame.

data.dropna(): If any missing value is found, dropna() is called to remove all rows containing NaNs.

In [None]:
if data.isna().any().any():
  data=data.dropna()

**Or run the cells below one at a time to see what each call does**

In [None]:
#data.isna()

In [None]:
#data.isna().any()

In [None]:
#na=data.isna().any().any()

In [None]:
#if na:
  #data = data.dropna()
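
The chain of checks described above can also be traced on a tiny frame where the result is easy to verify by eye. This is a minimal sketch; the `toy` frame and its column names are made up for illustration and are not part of the project dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the project dataset) with one missing value
toy = pd.DataFrame({"Age": [25, np.nan, 40], "Income": [50000, 62000, 58000]})

cell_mask = toy.isna()        # same shape as toy, True where a value is missing
col_mask = cell_mask.any()    # one boolean per column
has_missing = col_mask.any()  # single boolean for the whole frame

# Drop rows with missing values only when there is something to drop
cleaned = toy.dropna() if has_missing else toy

print(col_mask)
print(has_missing, cleaned.shape)
```

Only the row with the missing Age is removed, so the cleaned frame keeps two of the three rows.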

**Encoding categorical data**

Linear Regression requires numeric inputs, so the categorical columns are converted to numbers before training.

Import the necessary library

In [None]:
from sklearn.preprocessing import OneHotEncoder

Encoding the Gender column

In [None]:
# Create a OneHotEncoder object
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # sparse_output=False returns a dense array

# Fit the encoder to the 'Gender' column
encoder.fit(data[['Gender']])

# Transform the 'Gender' column
encoded_gender = encoder.transform(data[['Gender']])

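# Create new columns from the encoded data
# Reuse data.index so the rows stay aligned if duplicates or nulls were dropped earlier
gender_df = pd.DataFrame(encoded_gender, columns=encoder.get_feature_names_out(['Gender']), index=data.index)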

# Concatenate the encoded columns with the original DataFrame
data = pd.concat([data, gender_df], axis=1)

# Drop the original 'Gender' column
data = data.drop('Gender', axis=1)

Encoding the Country column

In [None]:
# Create a OneHotEncoder object
encoder_country = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit the encoder to the 'Country' column
encoder_country.fit(data[['Country']])

# Transform the 'Country' column
encoded_country = encoder_country.transform(data[['Country']])

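# Create new columns from the encoded data
# Reuse data.index so the rows stay aligned if duplicates or nulls were dropped earlier
country_df = pd.DataFrame(encoded_country, columns=encoder_country.get_feature_names_out(['Country']), index=data.index)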

# Concatenate the encoded columns with the original DataFrame
data = pd.concat([data, country_df], axis=1)

# Drop the original 'Country' column
data = data.drop('Country', axis=1)

In [None]:
feedback_mapping = {'Low': 1, 'Medium': 2, 'High': 3}

# Apply the mapping to the 'FeedbackScore' column
data['FeedbackScore'] = data['FeedbackScore'].map(feedback_mapping)

In [None]:
loyalty_mapping={'Gold':3,'Silver':2,'Bronze':1}
data['LoyaltyLevel']=data['LoyaltyLevel'].map(loyalty_mapping)
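
Because `Series.map` with a dictionary turns any label missing from the dictionary into NaN, it can help to check for unmapped labels after an ordinal mapping like the ones above. A minimal sketch, using made-up sample values in which 'Platinum' is deliberately left out of the mapping:

```python
import pandas as pd

# Hypothetical sample values; 'Platinum' is not a key in the mapping
levels = pd.Series(["Gold", "Silver", "Bronze", "Platinum"])
loyalty_mapping = {"Gold": 3, "Silver": 2, "Bronze": 1}

mapped = levels.map(loyalty_mapping)

# Labels that had no entry in the mapping end up as NaN
unmapped = levels[mapped.isna()].unique()

print(mapped.tolist())
print("Unmapped labels:", list(unmapped))
```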

## Data Visualisation

In [None]:
plt.scatter(data['ServiceQuality'], data['SS'])  # Create scatter plot
plt.xlabel('Service Quality')  # Set x-axis label
plt.ylabel('Satisfaction Score')  # Set y-axis label
plt.title('Service Quality vs Satisfaction Score')  # Set plot title
plt.show()  # Display the plot

In [None]:
plt.scatter(data['ProductQuality'], data['SS'])  # Create scatter plot
plt.xlabel('Product Quality')  # Set x-axis label
plt.ylabel('Satisfaction Score')  # Set y-axis label
plt.title('Product Quality vs Satisfaction Score')  # Set plot title
plt.show()  # Display the plot

## Train-Test Split

**train_test_split randomly splits the data into the desired proportions (here 80% for training and 20% for testing).**

In [None]:
#importing necessary modules and functions
from sklearn.model_selection import train_test_split

In [None]:
# Features and target variable
x = data.drop('SS', axis=1) #enter target variable here and in the code below
y = data['SS']


In [None]:
## Split the data
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=1,test_size=0.2)
x_train.count(),x_test.count(),y_train.count(),y_test.count()

In [None]:
x_train.head(),y_train.head()
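
To see how `test_size` and `random_state` behave in isolation, the split can be tried on a small array. This is only an illustrative sketch; `X_toy` and `y_toy` are hypothetical and unrelated to the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` makes the results of later cells reproducible between runs.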

## Model Training

Importing libraries

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

**Linear Regression**

In [None]:
modellr = LinearRegression()  # Create a Linear Regression object
modellr.fit(x_train, y_train)  # Train the model

**Decision Tree Regression**

In [None]:
modeldt = DecisionTreeRegressor()
modeldt.fit(x_train, y_train)

**Random Forest Regression**

In [None]:
modelrf = RandomForestRegressor()
modelrf.fit(x_train, y_train)

**Support Vector Regression (SVR)**

In [None]:
modelsvr = SVR()
modelsvr.fit(x_train, y_train)

## Model Testing

In [None]:
from sklearn import metrics

In [None]:
# Make predictions on the test data
y_pred_lr = modellr.predict(x_test)
y_pred_dt = modeldt.predict(x_test)
y_pred_rt = modelrf.predict(x_test)
y_pred_svr = modelsvr.predict(x_test)

In [None]:
print(f'Linear Regression prediction: {y_pred_lr}')
print(f'Decision Tree Regression prediction: {y_pred_dt}')
print(f'Random Forest Regression prediction: {y_pred_rt}')
print(f'Support Vector Regression prediction: {y_pred_svr}')

In [None]:
print(y_test)

**Cross-Validation**

In [None]:
from sklearn.model_selection import cross_val_score
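# Perform 5-fold cross-validation on the linear regression model
# (swap in modeldt, modelrf or modelsvr to check the others; the default score for regressors is R-squared)
cv_scores = cross_val_score(modellr, x, y, cv=5)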

print("Cross-validation scores:", cv_scores)
print("Average cross-validation score:", cv_scores.mean())
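
By default `cross_val_score` reports the estimator's own score, which is R-squared for regressors. Passing `scoring='neg_mean_squared_error'` returns negated MSE values instead, so they need a sign flip before reading. The sketch below shows both on synthetic data; the data, coefficients and noise level are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical): a linear signal plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Default scoring for a regressor is R-squared, one value per fold
r2_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5)

# scikit-learn negates error metrics so that "higher is better" holds for every scorer
neg_mse_scores = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5, scoring="neg_mean_squared_error"
)
mse_scores = -neg_mse_scores

print("R-squared per fold:", r2_scores)
print("MSE per fold:", mse_scores)
```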

**Mean squared error and R-squared**

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
mse1 = mean_squared_error(y_test, y_pred_lr)  # Calculate Mean Squared Error for linear regression
r2_1 = r2_score(y_test, y_pred_lr)  # Calculate R-squared

mse2 = mean_squared_error(y_test, y_pred_dt)  # Calculate Mean Squared Error for decision tree regression
r2_2 = r2_score(y_test, y_pred_dt)  # Calculate R-squared

mse3 = mean_squared_error(y_test, y_pred_rt)  # Calculate Mean Squared Error for random forest regression
r2_3 = r2_score(y_test, y_pred_rt)  # Calculate R-squared

mse4 = mean_squared_error(y_test, y_pred_svr)  # Calculate Mean Squared Error for Support vector regression
r2_4 = r2_score(y_test, y_pred_svr)  # Calculate R-squared


mse={'Linear Regression':mse1,'Decision Tree Regression':mse2,'Random Forest Regression':mse3,'Support Vector Regression':mse4}

# Stored as r2_scores so the imported r2_score function is not overwritten
r2_scores={'Linear Regression':r2_1,'Decision Tree Regression':r2_2,'Random Forest Regression':r2_3,'Support Vector Regression':r2_4}

print('mean_squared_error: ')
for key, value in mse.items():
    print(f"{key}: {value}")

print('\nr2_score: ')
for key, value in r2_scores.items():
    print(f"{key}: {value}")
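
For reference, both metrics can be computed by hand. MSE is the mean of the squared residuals, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks the manual formulas against scikit-learn on made-up numbers; the arrays are illustrative and are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 6.5])

# MSE: average of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

A lower MSE and an R-squared closer to 1 both indicate predictions that track the true values more closely.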

## Hyperparameter Tuning

Best Model: The model with the lowest MSE and the highest R2 is generally considered the best-performing model for regression tasks.

We chose the random forest regressor and the linear regression model for tuning here.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define the model
model = RandomForestRegressor()

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create GridSearchCV object
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit to training data
grid_search.fit(x_train, y_train)

# Get the best model
best_rf_model = grid_search.best_estimator_
print(best_rf_model)
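
The grid above has 3 x 3 x 3 x 3 = 81 parameter combinations, so with cv=5 the search trains 405 models before the final refit, which can take a while on the full dataset. The sketch below runs the same kind of search on a much smaller grid with synthetic data (both are assumptions made for speed) and shows how to read the winning parameters and the cross-validated MSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical): the target depends mainly on the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

# Deliberately small grid: 2 x 2 = 4 combinations
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 3]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    small_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_toy, y_toy)

# best_score_ is the negated MSE, so flip the sign before reading it
print(search.best_params_)
print("Best cross-validated MSE:", -search.best_score_)
print("Combinations tried:", len(search.cv_results_["params"]))
```

`best_estimator_` is already refit on all of the data passed to `fit`, so it can be used directly for prediction.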

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Define the model
model = LinearRegression()

# Define the hyperparameter grid
param_grid = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'positive': [True, False]
}

# Create GridSearchCV object
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit to training data
grid_search.fit(x_train, y_train)

# Get the best model
best_linear_model = grid_search.best_estimator_
print(best_linear_model)