# A Market Basket Analysis: Mall Customers in the U.S.

## Introduction

Many businesses want to target specific groups of customers and allocate their marketing resources effectively. One strategy for doing so is customer segmentation, which partitions customers into groups of individuals with similar characteristics. This project develops a customer segmentation model based on unsupervised learning (clustering) for a mall in the U.S. to find hidden patterns or structures in the data that can be used to target the right audience and hence increase the profit margin. Typically, one group of mall members might contain customers who are high-profit and low-risk, that is, more likely to purchase products or subscribe to a service; another group might include customers from non-profit organizations. The overall goal of this project is to show how a mall business can use machine learning models to retain those customers.

The specific project goals are as follows:
1. Learn customer segmentation concepts
2. Apply unsupervised machine learning techniques
3. Identify customers who are likely to convert
4. Explore marketing strategy from a real-world perspective

In [1]:
# Import the required libraries; this list is extended in later sections as needed.
import numpy as np
import pandas as pd
# Libraries for visualization
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the dataset
data = pd.read_csv('Mall_Customers.csv')

### Data Exploration
I begin the project by exploring the data. This is done through data wrangling: reading in the dataset and checking the data types, any missing values, and so on.

In [3]:
data.head() 

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
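
The checks mentioned above (data types and missing values) can be sketched as follows. This is only an illustration: the `sample` frame is a hypothetical stand-in rebuilt from the five rows shown by `data.head()`, so that the snippet runs without `Mall_Customers.csv`. With the real file, the same calls would be made on `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`: the five rows printed by data.head() above.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

n_rows, n_cols = sample.shape                  # number of rows and columns
column_types = sample.dtypes                   # data type of each column
missing_per_column = sample.isnull().sum()     # count of missing (NaN) values per column
n_duplicates = int(sample.duplicated().sum())  # number of fully duplicated rows

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print(n_duplicates)
```

On this small sample there are no missing values or duplicated rows; whether that holds for the full dataset has to be checked on `data` itself.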


The next sections cover preprocessing and modeling.

## Pre-Processing

In [4]:
# Standardize the numerical columns so that they share a common scale
from sklearn.preprocessing import StandardScaler
df = data.copy()  # work on a copy of the original dataset
df_num = df.select_dtypes(np.number)  # select the numerical columns (this excludes 'Gender')
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df_num)  # fit the scaler and transform the numerical columns
scaled_df = pd.DataFrame(scaled_df, columns=df_num.columns)  # back to a DataFrame with the original column names
scaled_df

Unnamed: 0,CustomerID,Age,Annual Income (k$),Spending Score (1-100)
0,-1.723412,-1.424569,-1.738999,-0.434801
1,-1.706091,-1.281035,-1.738999,1.195704
2,-1.688771,-1.352802,-1.700830,-1.715913
3,-1.671450,-1.137502,-1.700830,1.040418
4,-1.654129,-0.563369,-1.662660,-0.395980
...,...,...,...,...
195,1.654129,-0.276302,2.268791,1.118061
196,1.671450,0.441365,2.497807,-0.861839
197,1.688771,-0.491602,2.497807,0.923953
198,1.706091,-0.491602,2.917671,-1.250054
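
To make the scaling step concrete, here is a minimal sketch of the standardization that `StandardScaler` applies to each column: subtract the column mean and divide by the population standard deviation (`ddof=0`). The `ages` series is an assumption for illustration and holds only the five ages shown by `data.head()`, so its values will not match the table above, which was computed on the full dataset.

```python
import pandas as pd

# Hypothetical sample: the first five ages from data.head().
ages = pd.Series([19, 21, 20, 23, 31], name="Age")

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is the convention StandardScaler follows.
z = (ages - ages.mean()) / ages.std(ddof=0)

print(z.round(3).tolist())
print(z.mean(), z.std(ddof=0))  # approximately 0 and 1 after standardization
```

Note that pandas' own `Series.std()` defaults to `ddof=1`, so calling it without `ddof=0` gives slightly different values from the scaler.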


In [5]:
# Encode the Gender column as integer category codes
df['Gender'] = df['Gender'].astype('category').cat.codes
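
Because `cat.codes` replaces the labels with integers, it helps to record which code stands for which label. The sketch below uses a hypothetical `gender` series holding the five values from `data.head()`; string categories are sorted alphabetically by default, so `Female` becomes 0 and `Male` becomes 1.

```python
import pandas as pd

# Hypothetical sample: the Gender values of the first five customers.
gender = pd.Series(["Male", "Male", "Female", "Female", "Female"])

as_category = gender.astype("category")
codes = as_category.cat.codes                                # integer code per row
code_to_label = dict(enumerate(as_category.cat.categories))  # code -> original label

print(code_to_label)   # {0: 'Female', 1: 'Male'}
print(codes.tolist())  # [1, 1, 0, 0, 0]
```

Keeping `code_to_label` around makes it possible to translate the codes back to readable labels when interpreting the clusters later.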

In [6]:
# Rejoin the scaled numerical columns with the encoded Gender column to get the final dataset
final_df = pd.concat([scaled_df, df['Gender']], axis=1)
final_df

Unnamed: 0,CustomerID,Age,Annual Income (k$),Spending Score (1-100),Gender
0,-1.723412,-1.424569,-1.738999,-0.434801,1
1,-1.706091,-1.281035,-1.738999,1.195704,1
2,-1.688771,-1.352802,-1.700830,-1.715913,0
3,-1.671450,-1.137502,-1.700830,1.040418,0
4,-1.654129,-0.563369,-1.662660,-0.395980,0
...,...,...,...,...,...
195,1.654129,-0.276302,2.268791,1.118061,0
196,1.671450,0.441365,2.497807,-0.861839,0
197,1.688771,-0.491602,2.497807,0.923953,1
198,1.706091,-0.491602,2.917671,-1.250054,1
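
One detail worth keeping in mind for this step: `pd.concat(..., axis=1)` lines rows up by index label, not by position. It works here because `scaled_df` and `df` both carry the default 0-based index. The sketch below, with made-up toy values, shows what would happen if the two indexes did not match.

```python
import pandas as pd

# Toy frames with made-up values, for illustration only.
left = pd.DataFrame({"Age": [0.1, 0.2]})                   # index 0, 1
same_index = pd.Series([1, 0], name="Gender")              # index 0, 1
shifted_index = pd.Series([1, 0], index=[1, 2], name="Gender")

aligned = pd.concat([left, same_index], axis=1)        # rows match one-to-one
misaligned = pd.concat([left, shifted_index], axis=1)  # union of labels, gaps become NaN

print(aligned)
print(misaligned)
```

If the original data had been filtered or re-indexed before scaling, calling `reset_index(drop=True)` on both parts first would be one way to avoid such gaps.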


In [7]:
# Rename the columns and drop the unused CustomerID column
# (index=str also converts the row labels to strings)
final_df.rename(index=str, columns={'Annual Income (k$)': 'Income',
                                    'Spending Score (1-100)': 'Score'}, inplace=True)

final_df.drop('CustomerID', axis=1, inplace=True)
final_df

Unnamed: 0,Age,Income,Score,Gender
0,-1.424569,-1.738999,-0.434801,1
1,-1.281035,-1.738999,1.195704,1
2,-1.352802,-1.700830,-1.715913,0
3,-1.137502,-1.700830,1.040418,0
4,-0.563369,-1.662660,-0.395980,0
...,...,...,...,...
195,-0.276302,2.268791,1.118061,0
196,0.441365,2.497807,-0.861839,0
197,-0.491602,2.497807,0.923953,1
198,-0.491602,2.917671,-1.250054,1
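
A side effect of the `rename` call above is easy to miss: passing `index=str` applies `str` to every row label, so the integer index becomes an index of strings. The sketch below demonstrates this on a hypothetical two-row frame (`toy`), using the same column mapping as the notebook.

```python
import pandas as pd

# Hypothetical two-row frame with the original column names.
toy = pd.DataFrame({
    "Annual Income (k$)": [15, 16],
    "Spending Score (1-100)": [39, 6],
})

renamed = toy.rename(index=str, columns={"Annual Income (k$)": "Income",
                                         "Spending Score (1-100)": "Score"})

print(toy.index.tolist())      # [0, 1]
print(renamed.index.tolist())  # ['0', '1']
print(list(renamed.columns))   # ['Income', 'Score']
```

After such a rename, label-based lookups need string keys (`renamed.loc['0']`), while positional access with `iloc` is unaffected. Omitting `index=str` would keep the integer labels.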
