## Assignment 1: Customer's Shopping Data

The goal of this assignment is to develop machine learning models that predict the spending behavior of a shop's customers from their characteristics. Such predictions can help the shop target customers with strategies tailored to their spending behavior, and thereby increase its profits. To this end, you should load the customer dataset attached to this assignment, develop linear and logistic regression models to predict spending, and finally evaluate the performance of your trained models on test data.

*Note: Please include comments in your code so that it can be easily followed and understood.*

### Loading the Data

The customer shopping dataset is split into training and test data. The training data will be used to build the linear and logistic regression models, and the test data will be used to evaluate how well they predict customer spending.


Importing the necessary Python libraries to load the data.

In [1]:
#import libraries
import pandas as pd

In [2]:
#Load the train_data.csv file in python, and assign it to a variable named "train".
train = pd.read_csv('./train_data.csv')

In [3]:
#Load the test_data.csv file in python, and assign it to a variable named "test".
test = pd.read_csv('./test_data.csv')

In [4]:
# Encode 'Gender' and 'Profession' using pd.get_dummies
train = pd.get_dummies(train, columns=['Gender', 'Profession'])
test = pd.get_dummies(test, columns=['Gender', 'Profession'])
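# Align the test columns with the training columns: dummy columns missing
# from test are added and filled with 0, and columns absent from train are dropped
test = test.reindex(columns=train.columns, fill_value=0)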
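
The alignment step in the cell above matters because `pd.get_dummies` only creates columns for the categories it actually sees in each dataframe. The following is a minimal sketch on made-up data (the frames `toy_train` and `toy_test` and their values are hypothetical, not taken from the assignment files) showing how `reindex` makes the test columns match the training columns.

```python
import pandas as pd

# Hypothetical toy frames: column names mirror the assignment, values are made up
toy_train = pd.DataFrame({"Age": [25, 40, 31],
                          "Profession": ["Doctor", "Artist", "Lawyer"]})
toy_test = pd.DataFrame({"Age": [52, 19],
                         "Profession": ["Artist", "Engineer"]})

# One-hot encode each frame separately, as in the cell above
toy_train = pd.get_dummies(toy_train, columns=["Profession"])
toy_test = pd.get_dummies(toy_test, columns=["Profession"])

# Before alignment the two frames disagree on their dummy columns
print(list(toy_train.columns))
print(list(toy_test.columns))

# Align test to train: missing columns are added with 0,
# columns unknown to train (here Profession_Engineer) are dropped
toy_test = toy_test.reindex(columns=toy_train.columns, fill_value=0)
print(list(toy_test.columns))
```

Note that a category appearing only in the test data is silently discarded by this approach, so the affected row ends up with all of its dummy columns for that feature set to 0.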

### Linear Regression

In this section, you will train a linear regression model to predict the spending score of the customers in the shop. You should use the “Spending Score” column as the target variable and all the remaining columns as the independent variables (i.e., features).

*Note: Make sure to exclude the “Spending Category” column in this section of the assignment, given that it is derived from the “Spending Score” column (i.e., to avoid data leakage).*



Importing the necessary Python libraries to call the linear regression model and the MSE metric.

In [6]:
#import libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Dropping the "Spending Category" column from the training and testing data.

In [7]:
#Drop Spending Category column
linear_reg_train = train.drop(columns="Spending Category")
linear_reg_test = test.drop(columns="Spending Category")

In the remaining part of this section, you should use the linear_reg_train and linear_reg_test dataframes instead of the train and test dataframes.

In [10]:
#Train a linear regression model to predict the spending score of a customer
#You should use default values for all parameters

# Separate target values for training
target_values_train = linear_reg_train.pop("Spending Score")

# Create and train the Linear Regression model
model = LinearRegression()
model.fit(linear_reg_train, target_values_train)

# Separate target values for testing
target_values_test = linear_reg_test.pop("Spending Score")
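
One detail of the cell above is worth spelling out: `DataFrame.pop` returns the requested column and also removes it from the dataframe in place, so running the same cell a second time raises a `KeyError`. The sketch below illustrates this on a made-up frame (the names `toy`, `toy2`, `X`, and `y` are hypothetical) and shows a non-mutating alternative, assuming you prefer cells that can be re-run safely.

```python
import pandas as pd

# Hypothetical toy frame: values are made up
toy = pd.DataFrame({"Age": [25, 40], "Spending Score": [61, 18]})

# pop returns the column and removes it from toy in place
target = toy.pop("Spending Score")
print(list(toy.columns))
print(target.tolist())

# A second pop of the same column fails because it is already gone
try:
    toy.pop("Spending Score")
    second_pop_failed = False
except KeyError:
    second_pop_failed = True
print(second_pop_failed)

# Non-mutating alternative: the original frame keeps all of its columns
toy2 = pd.DataFrame({"Age": [25, 40], "Spending Score": [61, 18]})
X = toy2.drop(columns="Spending Score")  # features only
y = toy2["Spending Score"]               # target only
```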

In [11]:
#Compute the MSE metric to evaluate the trained model using the test data

# Predict on test data
linear_reg_test_predict = model.predict(linear_reg_test)

# Compute MSE using actual target values of test data and predicted values
mse = mean_squared_error(target_values_test, linear_reg_test_predict)
print('MSE:', mse)

MSE: 780.7824887124046
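
To make the metric concrete, the sketch below computes the MSE by hand on a few made-up values (the arrays `y_true` and `y_pred` are hypothetical, not the assignment data) and checks it against `mean_squared_error`. Because the MSE is expressed in squared units of the target, its square root (the RMSE) is often easier to read; for the MSE reported above, the square root is roughly 27.9 in the units of the spending score.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted spending scores (made-up values)
y_true = np.array([50.0, 20.0, 80.0, 65.0])
y_pred = np.array([45.0, 30.0, 70.0, 65.0])

# MSE is the mean of the squared differences between actual and predicted values
mse_manual = np.mean((y_true - y_pred) ** 2)

# The scikit-learn function computes the same quantity
mse_sklearn = mean_squared_error(y_true, y_pred)

# RMSE brings the error back to the units of the target
rmse = np.sqrt(mse_manual)

print("Manual MSE:", mse_manual)
print("sklearn MSE:", mse_sklearn)
print("RMSE:", rmse)
```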


### Logistic Regression

In this section, you will train a logistic regression model to predict the spending category of a customer in the shop, as either High or Low. The target variable will be the “Spending Category” column, and all the remaining columns will be the independent variables (i.e., features).

*Note: Don’t use the “Spending Score” column in this section of the assignment, given that the “Spending Category” column is derived from it.*




Importing the necessary Python libraries to call the logistic regression model and the Accuracy metric.



In [12]:
#import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Dropping the "Spending Score" column from the training and testing data.

In [13]:
#Drop Spending Score column
logistic_reg_train = train.drop(columns="Spending Score")
logistic_reg_test = test.drop(columns="Spending Score")

In the remaining part of this section, you should use the logistic_reg_train and logistic_reg_test dataframes instead of the train and test dataframes.

In [14]:
#Train a logistic regression model to predict the spending category of a customer
#You should use default values for all parameters

# Separate target values and features for training
target_values_train = logistic_reg_train.pop("Spending Category")

# Create and train the Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(logistic_reg_train, target_values_train)

# Prepare test data by separating target values
target_values_test = logistic_reg_test.pop("Spending Category")

# Predict on test data
logistic_reg_test_predict = logistic_model.predict(logistic_reg_test)

In [15]:
#Compute the Accuracy metric to evaluate the trained model on the test data

# Compute Accuracy using actual target values of test data and predicted values
accuracy = accuracy_score(target_values_test, logistic_reg_test_predict)
print('Accuracy:', accuracy)

Accuracy: 0.455470737913486
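
An accuracy value is easier to interpret next to a reference point. The sketch below uses made-up labels (the arrays `y_true` and `y_pred` are hypothetical, and the class balance of the actual test set is not shown in this notebook) to compute accuracy by hand and compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical actual and predicted spending categories (made-up values)
y_true = np.array(["High", "Low", "Low", "High", "Low"])
y_pred = np.array(["High", "Low", "Low", "Low", "Low"])

# Accuracy is the fraction of predictions that match the actual labels
acc_manual = np.mean(y_true == y_pred)
acc_sklearn = accuracy_score(y_true, y_pred)

# Majority-class baseline: always predict the most frequent actual label
labels, counts = np.unique(y_true, return_counts=True)
majority_label = labels[np.argmax(counts)]
baseline_acc = np.mean(y_true == majority_label)

print("Manual accuracy:", acc_manual)
print("sklearn accuracy:", acc_sklearn)
print("Majority label:", majority_label)
print("Baseline accuracy:", baseline_acc)
```

Running the same comparison on the real test labels would show whether the trained model does better than this simple baseline.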
