Dataset name: Supermarket Sales

Source: Kaggle

Link: https://www.kaggle.com/datasets/lovishbansal123/sales-of-a-supermarket

Dataset description: This dataset contains 1,000 transactions from a supermarket, with customer, product, and payment information. The columns are Invoice ID, Branch, City, Customer type, Gender, Product line, Unit price, Quantity, Tax 5%, Total, Date, Time, Payment, cogs, gross margin percentage, gross income, and Rating.

# **BUSINESS UNDERSTANDING**

## **Business Objective**

The supermarket accepts several payment methods, such as cash, credit card, and e-wallet. However, it is not yet known which method is used most often or whether its use follows any pattern. This analysis aims to understand how customers choose a payment method and how the supermarket can make its payment system more effective. By understanding customers' transaction patterns, the supermarket can increase revenue, reduce operating costs, and improve customer satisfaction.

## **Assess Situation**

In the digital era, payment options at supermarkets are increasingly varied. Many customers are switching to cashless payment because it is seen as faster and more convenient, while others still pay in cash. Competition among supermarkets is also intensifying, with some competitors offering promotions tied to specific payment methods. It is therefore important to understand the factors that influence customers' choice of payment method so that the supermarket can make sound business decisions.

## **Data Mining Goals**

This analysis aims to identify patterns in payment method usage at the supermarket. The main focus is to determine which payment method is used most often and whether payment method is related to a customer's total spending. The analysis also explores how transaction time affects the choice of payment method and whether preferences differ by customer type.

## **Project Plan**

The analysis begins by collecting supermarket transaction data covering payment method, transaction time, total spending, and customer information. The collected data is then processed by cleaning missing or invalid values and adjusting formats so that it is ready for analysis.

Once the data is ready, it is explored to identify patterns in payment method usage. This covers the distribution of payment methods used by customers, the relationship between payment method and total spending, and possible differences in preference by transaction time or customer type.

The exploration results are visualized with charts and tables to make the patterns easier to interpret. Based on the findings, recommendations are prepared for the supermarket, which can then make strategic decisions such as increasing promotions for particular payment methods or adding digital payment facilities.
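
The exploration steps described in this plan can be sketched in pandas as follows. This is only an illustration on five sample rows copied from the dataset preview, not the analysis itself; the variable names (`sample`, `payment_counts`, `mean_total`, `by_customer`) are placeholders introduced here.

```python
import pandas as pd

# Five sample transactions copied from the first rows of the dataset;
# the real analysis would use the full 1,000-row DataFrame instead.
sample = pd.DataFrame({
    "Customer type": ["Member", "Normal", "Normal", "Member", "Normal"],
    "Payment": ["Ewallet", "Cash", "Credit card", "Ewallet", "Ewallet"],
    "Total": [548.9715, 80.2200, 340.5255, 489.0480, 634.3785],
})

# How often each payment method is used
payment_counts = sample["Payment"].value_counts()

# Average total spending per payment method
mean_total = sample.groupby("Payment")["Total"].mean()

# Payment method preference split by customer type
by_customer = pd.crosstab(sample["Customer type"], sample["Payment"])

print(payment_counts)
print(mean_total.round(2))
print(by_customer)
```

With only five rows the numbers carry no meaning; the point is the shape of each summary (a frequency table, a per-group mean, and a cross-tabulation).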

# **DATA UNDERSTANDING**

In [1]:
import pandas as pd

In [4]:
file_path = "/content/supermarket_sales.csv"

df = pd.read_csv(file_path)
df

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.8200,80.2200,3/8/2019,10:29,Cash,76.40,4.761905,3.8200,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.2880,489.0480,1/27/2019,20:33,Ewallet,465.76,4.761905,23.2880,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2.0175,42.3675,1/29/2019,13:46,Ewallet,40.35,4.761905,2.0175,6.2
996,303-96-2227,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,48.6900,1022.4900,3/2/2019,17:16,Ewallet,973.80,4.761905,48.6900,4.4
997,727-02-1313,A,Yangon,Member,Male,Food and beverages,31.84,1,1.5920,33.4320,2/9/2019,13:22,Cash,31.84,4.761905,1.5920,7.7
998,347-56-2442,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,3.2910,69.1110,2/22/2019,15:33,Cash,65.82,4.761905,3.2910,4.1


## Data Structure Inspection

In [23]:
print("\nShowing the first 5 rows with df.head():")
df.head()


Showing the first 5 rows with df.head():


Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,76.4,4.761905,3.82,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,1/27/2019,20:33,Ewallet,465.76,4.761905,23.288,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3


In [24]:
print("\nShowing the last 5 rows with df.tail():")
df.tail()


Showing the last 5 rows with df.tail():


Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
995,233-67-5758,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2.0175,42.3675,1/29/2019,13:46,Ewallet,40.35,4.761905,2.0175,6.2
996,303-96-2227,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,48.69,1022.49,3/2/2019,17:16,Ewallet,973.8,4.761905,48.69,4.4
997,727-02-1313,A,Yangon,Member,Male,Food and beverages,31.84,1,1.592,33.432,2/9/2019,13:22,Cash,31.84,4.761905,1.592,7.7
998,347-56-2442,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,3.291,69.111,2/22/2019,15:33,Cash,65.82,4.761905,3.291,4.1
999,849-09-3807,A,Yangon,Member,Female,Fashion accessories,88.34,7,30.919,649.299,2/18/2019,13:28,Cash,618.38,4.761905,30.919,6.6


In [27]:
print("\nShowing a summary of the dataset with df.info():")
df.info()


Showing a summary of the dataset with df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Invoice ID               1000 non-null   object 
 1   Branch                   1000 non-null   object 
 2   City                     1000 non-null   object 
 3   Customer type            1000 non-null   object 
 4   Gender                   1000 non-null   object 
 5   Product line             1000 non-null   object 
 6   Unit price               1000 non-null   float64
 7   Quantity                 1000 non-null   int64  
 8   Tax 5%                   1000 non-null   float64
 9   Total                    1000 non-null   float64
 10  Date                     1000 non-null   object 
 11  Time                     1000 non-null   object 
 12  Payment                  1000 non-null   object 
 13  cogs                

In [28]:
print("\nShowing the number of rows and columns with df.shape:")
print(df.shape)


Showing the number of rows and columns with df.shape:
(1000, 17)


In [32]:
print("\nShowing the data type of each column with df.dtypes:")
print(df.dtypes)


Showing the data type of each column with df.dtypes:
Invoice ID                  object
Branch                      object
City                        object
Customer type               object
Gender                      object
Product line                object
Unit price                 float64
Quantity                     int64
Tax 5%                     float64
Total                      float64
Date                        object
Time                        object
Payment                     object
cogs                       float64
gross margin percentage    float64
gross income               float64
Rating                     float64
dtype: object


In [33]:
print("\nShowing descriptive statistics with df.describe():")
df.describe()


Showing descriptive statistics with df.describe():


Unnamed: 0,Unit price,Quantity,Tax 5%,Total,cogs,gross margin percentage,gross income,Rating
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,55.67213,5.51,15.379369,322.966749,307.58738,4.761905,15.379369,6.9727
std,26.494628,2.923431,11.708825,245.885335,234.17651,6.131498e-14,11.708825,1.71858
min,10.08,1.0,0.5085,10.6785,10.17,4.761905,0.5085,4.0
25%,32.875,3.0,5.924875,124.422375,118.4975,4.761905,5.924875,5.5
50%,55.23,5.0,12.088,253.848,241.76,4.761905,12.088,7.0
75%,77.935,8.0,22.44525,471.35025,448.905,4.761905,22.44525,8.5
max,99.96,10.0,49.65,1042.65,993.0,4.761905,49.65,10.0
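
The constant `gross margin percentage` (4.761905 in every quantile, with a standard deviation that is effectively zero) follows from how the monetary columns appear to be derived from each other. The sketch below checks this on the first displayed row. The relationships it uses (tax is 5% of `cogs`, `Total` is `cogs` plus tax, `gross income` equals the tax) are inferred from the displayed values and are assumptions, not something documented by the dataset.

```python
# Values from the first displayed row (Invoice ID 750-67-8428)
unit_price, quantity = 74.69, 7
tax, total, cogs, gross_income = 26.1415, 548.9715, 522.83, 26.1415

# Assumed relationships, inferred from the displayed values
cogs_calc = unit_price * quantity      # cost of goods sold
tax_calc = 0.05 * cogs_calc            # the "Tax 5%" column
total_calc = cogs_calc + tax_calc      # the "Total" column
margin_pct = gross_income / total * 100

print(round(cogs_calc, 4), round(tax_calc, 4), round(total_calc, 4))
print(round(margin_pct, 6))  # 4.761905, i.e. 100 / 21
```

If these relationships hold for every row, `Tax 5%`, `Total`, `cogs`, and `gross income` are all fixed multiples of one another, so they carry the same information.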


In [37]:
print("\nComputing the mean with df.mean():")
print(df.mean(numeric_only=True))

print("\nComputing the median with df.median():")
print(df.median(numeric_only=True))

print("\nComputing the standard deviation with df.std():")
print(df.std(numeric_only=True))


Computing the mean with df.mean():
Unit price                  55.672130
Quantity                     5.510000
Tax 5%                      15.379369
Total                      322.966749
cogs                       307.587380
gross margin percentage      4.761905
gross income                15.379369
Rating                       6.972700
dtype: float64

Computing the median with df.median():
Unit price                  55.230000
Quantity                     5.000000
Tax 5%                      12.088000
Total                      253.848000
cogs                       241.760000
gross margin percentage      4.761905
gross income                12.088000
Rating                       7.000000
dtype: float64

Computing the standard deviation with df.std():
Unit price                 2.649463e+01
Quantity                   2.923431e+00
Tax 5%                     1.170883e+01
Total                      2.458853e+02
cogs                       2.341765e+02
gross margin percentage    6.131498e-1

In [38]:
print("\nComputing correlations between numeric columns with df.corr():")
print(df.corr(numeric_only=True))


Computing correlations between numeric columns with df.corr():
                         Unit price  Quantity    Tax 5%     Total      cogs  \
Unit price                 1.000000  0.010778  0.633962  0.633962  0.633962   
Quantity                   0.010778  1.000000  0.705510  0.705510  0.705510   
Tax 5%                     0.633962  0.705510  1.000000  1.000000  1.000000   
Total                      0.633962  0.705510  1.000000  1.000000  1.000000   
cogs                       0.633962  0.705510  1.000000  1.000000  1.000000   
gross margin percentage         NaN       NaN       NaN       NaN       NaN   
gross income               0.633962  0.705510  1.000000  1.000000  1.000000   
Rating                    -0.008778 -0.015815 -0.036442 -0.036442 -0.036442   

                         gross margin percentage  gross income    Rating  
Unit price                                   NaN      0.633962 -0.008778  
Quantity                                     NaN      0.705510 -0.015815  
Tax 

# **DATA PREPARATION**

In [9]:
import pandas as pd

In [10]:
df = pd.read_csv("/content/supermarket_sales.csv")
df

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.8200,80.2200,3/8/2019,10:29,Cash,76.40,4.761905,3.8200,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.2880,489.0480,1/27/2019,20:33,Ewallet,465.76,4.761905,23.2880,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2.0175,42.3675,1/29/2019,13:46,Ewallet,40.35,4.761905,2.0175,6.2
996,303-96-2227,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,48.6900,1022.4900,3/2/2019,17:16,Ewallet,973.80,4.761905,48.6900,4.4
997,727-02-1313,A,Yangon,Member,Male,Food and beverages,31.84,1,1.5920,33.4320,2/9/2019,13:22,Cash,31.84,4.761905,1.5920,7.7
998,347-56-2442,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,3.2910,69.1110,2/22/2019,15:33,Cash,65.82,4.761905,3.2910,4.1


## **Data Cleaning**

Data cleaning is the process of removing or modifying data that is incomplete, duplicated, inaccurate, or incorrectly formatted. This ensures that the data being processed is of good quality, so that it supports more accurate decisions.

### Missing Values

Missing values are entries where data is absent or could not be read.

#### Checking for Missing Values

In [113]:
print((df.isna().sum() / len(df)) * 100)

Invoice ID                 0.0
Branch                     0.0
City                       0.0
Customer type              0.0
Gender                     0.0
Product line               0.0
Unit price                 0.0
Quantity                   0.0
Tax 5%                     0.0
Total                      0.0
Date                       0.0
Time                       0.0
Payment                    0.0
cogs                       0.0
gross margin percentage    0.0
gross income               0.0
Rating                     0.0
dtype: float64


There are no missing values in this dataset.
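
Because every column reports 0.0 here, it may help to see what the same check returns when gaps exist. The following is a minimal sketch on a made-up frame (`demo` and its values are hypothetical and are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, for illustration only (the real dataset has none)
demo = pd.DataFrame({
    "Payment": ["Cash", None, "Ewallet", "Cash"],
    "Rating": [9.1, 7.4, np.nan, np.nan],
})

# isna().mean() is equivalent to isna().sum() / len(demo)
missing_pct = demo.isna().mean() * 100
print(missing_pct)
```

Here one of four `Payment` values and two of four `Rating` values are missing, so the check reports 25% and 50% respectively.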

### Duplicated Values

Duplicated values are records that appear more than once in a dataset.

In [115]:
df[df.duplicated()]

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating


There are no duplicated values in this dataset.
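
Note that `df.duplicated()` only flags rows that match an earlier row in every column. A stricter check on a key column is sometimes useful as well; the sketch below contrasts the two on a made-up frame (the invoice IDs and values are hypothetical).

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly, row 3 reuses an ID with a different payment
demo = pd.DataFrame({
    "Invoice ID": ["INV-001", "INV-002", "INV-001", "INV-002"],
    "Payment": ["Ewallet", "Cash", "Ewallet", "Credit card"],
})

# Rows identical to an earlier row in every column
full_dupes = demo[demo.duplicated()]

# Rows whose Invoice ID already appeared, regardless of the other columns
id_dupes = demo[demo.duplicated(subset=["Invoice ID"])]

print(len(full_dupes), len(id_dupes))
```

Whether a repeated `Invoice ID` should count as a duplicate depends on how invoices are issued, which the dataset description does not specify.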

### Outliers

Outliers are values that differ markedly from the rest of the dataset, either far lower or far higher. They can arise from errors or from unexpected events.

Checking for outliers

In [118]:
results = []

# Names of the numeric columns to check
cols = df.select_dtypes(include=['float64', 'int64']).columns

for col in cols:
  # IQR rule: flag values more than 1.5 * IQR beyond the quartiles
  q1 = df[col].quantile(0.25)
  q3 = df[col].quantile(0.75)
  iqr = q3 - q1
  lower_bound = q1 - 1.5*iqr
  upper_bound = q3 + 1.5*iqr
  outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
  percent_outliers = (len(outliers)/len(df))*100
  results.append({'Column': col, 'Outlier percentage': percent_outliers})

# Build a dataframe from the list of results
results_df = pd.DataFrame(results)
results_df.set_index('Column', inplace=True)
results_df = results_df.rename_axis(None, axis=0).rename_axis('Column', axis=1)

# Display the dataframe
display(results_df)

Column,Outlier percentage
Unit price,0.0
Quantity,0.0
Tax 5%,0.9
Total,0.9
cogs,0.9
gross margin percentage,0.0
gross income,0.9
Rating,0.0
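
To make the IQR rule concrete, the bounds for `Total` can be worked out by hand from the quartiles that `df.describe()` reported earlier (25% = 124.422375, 75% = 471.35025). This is just the arithmetic of the loop above applied to one column.

```python
# Quartiles of "Total" as reported by df.describe()
q1, q3 = 124.422375, 471.35025

iqr = q3 - q1                    # interquartile range
lower_bound = q1 - 1.5 * iqr     # negative here, so no Total can fall below it
upper_bound = q3 + 1.5 * iqr

print(round(iqr, 4), round(lower_bound, 4), round(upper_bound, 4))

# The largest Total in the describe() output is 1042.65
print(1042.65 > upper_bound)
```

The upper bound comes out at roughly 991.74, so only the largest totals (such as the 1022.49 and 1042.65 values seen in the outputs) are flagged, all on the high side.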


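The four columns flagged with outliers (Tax 5%, Total, cogs, and gross income) are dropped:

In [15]:
df = df.drop(columns=['Tax 5%', 'Total', 'cogs', 'gross income'])


Re-checking for outliers after dropping those columns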

In [121]:
results = []

# Names of the remaining numeric columns
cols = df.select_dtypes(include=['float64', 'int64']).columns

for col in cols:
  # IQR rule: flag values more than 1.5 * IQR beyond the quartiles
  q1 = df[col].quantile(0.25)
  q3 = df[col].quantile(0.75)
  iqr = q3 - q1
  lower_bound = q1 - 1.5*iqr
  upper_bound = q3 + 1.5*iqr
  outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
  percent_outliers = (len(outliers)/len(df))*100
  results.append({'Column': col, 'Outlier percentage': percent_outliers})

# Build a dataframe from the list of results
results_df = pd.DataFrame(results)
results_df.set_index('Column', inplace=True)
results_df = results_df.rename_axis(None, axis=0).rename_axis('Column', axis=1)

# Display the dataframe
display(results_df)

Column,Outlier percentage
Unit price,0.0
Quantity,0.0
gross margin percentage,0.0
Rating,0.0
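
Since the same outlier check runs twice, it could be wrapped in a small helper. The function below is one possible sketch, not the notebook's own code: the name `outlier_percentages` and the parameter `k` are introduced here, and the demo frame is made up.

```python
import pandas as pd

def outlier_percentages(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Percentage of rows outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the matching columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.mean() * 100

# Hypothetical data: "value" has one extreme entry, "flat" is constant
demo = pd.DataFrame({"value": [1, 2, 3, 4, 100], "flat": [5, 5, 5, 5, 5]})
result = outlier_percentages(demo)
print(result)
```

Calling `outlier_percentages(df)` before and after dropping columns would then replace both copies of the loop.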


In [122]:
df

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Date,Time,Payment,gross margin percentage,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,1/5/2019,13:08,Ewallet,4.761905,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3/8/2019,10:29,Cash,4.761905,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,3/3/2019,13:23,Credit card,4.761905,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,1/27/2019,20:33,Ewallet,4.761905,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,2/8/2019,10:37,Ewallet,4.761905,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,1/29/2019,13:46,Ewallet,4.761905,6.2
996,303-96-2227,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,3/2/2019,17:16,Ewallet,4.761905,4.4
997,727-02-1313,A,Yangon,Member,Male,Food and beverages,31.84,1,2/9/2019,13:22,Cash,4.761905,7.7
998,347-56-2442,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,2/22/2019,15:33,Cash,4.761905,4.1


### Inconsistent Values

Inconsistent values occur when the entries in a column do not follow the expected format or have mixed data types.

In [123]:
df

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Date,Time,Payment,gross margin percentage,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,1/5/2019,13:08,Ewallet,4.761905,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3/8/2019,10:29,Cash,4.761905,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,3/3/2019,13:23,Credit card,4.761905,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,1/27/2019,20:33,Ewallet,4.761905,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,2/8/2019,10:37,Ewallet,4.761905,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,1/29/2019,13:46,Ewallet,4.761905,6.2
996,303-96-2227,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,3/2/2019,17:16,Ewallet,4.761905,4.4
997,727-02-1313,A,Yangon,Member,Male,Food and beverages,31.84,1,2/9/2019,13:22,Cash,4.761905,7.7
998,347-56-2442,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,2/22/2019,15:33,Cash,4.761905,4.1


In [13]:
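# Dates are stored as month/day/year strings (e.g. 1/27/2019), so parse them with an explicit format
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")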

In [16]:
df

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Date,Time,Payment,gross margin percentage,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,2019-01-05,13:08,Ewallet,4.761905,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,2019-03-08,10:29,Cash,4.761905,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,2019-03-03,13:23,Credit card,4.761905,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,2019-01-27,20:33,Ewallet,4.761905,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,2019-02-08,10:37,Ewallet,4.761905,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2019-01-29,13:46,Ewallet,4.761905,6.2
996,303-96-2227,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,2019-03-02,17:16,Ewallet,4.761905,4.4
997,727-02-1313,A,Yangon,Member,Male,Food and beverages,31.84,1,2019-02-09,13:22,Cash,4.761905,7.7
998,347-56-2442,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,2019-02-22,15:33,Cash,4.761905,4.1
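
When a date column might contain malformed entries, an explicit `format` combined with `errors="coerce"` turns them into `NaT` instead of raising, which makes them easy to count. A minimal sketch follows; the third value is a made-up bad entry, not something found in the dataset.

```python
import pandas as pd

# Two real date strings from the preview plus one hypothetical malformed entry
raw = pd.Series(["1/5/2019", "3/8/2019", "2019-13-40"])

# Values that do not match month/day/year become NaT rather than raising an error
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

print(parsed)
print(parsed.isna().sum())
```

Counting `NaT` values after parsing is a quick way to confirm that every date in a column followed the expected format.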


## **Construct Data**

Construct Data refers to building new features from the existing data, or restructuring the data, to suit the analysis or data mining model that will be used.

In [17]:
df

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Date,Time,Payment,gross margin percentage,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,2019-01-05,13:08,Ewallet,4.761905,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,2019-03-08,10:29,Cash,4.761905,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,2019-03-03,13:23,Credit card,4.761905,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,2019-01-27,20:33,Ewallet,4.761905,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,2019-02-08,10:37,Ewallet,4.761905,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2019-01-29,13:46,Ewallet,4.761905,6.2
996,303-96-2227,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,2019-03-02,17:16,Ewallet,4.761905,4.4
997,727-02-1313,A,Yangon,Member,Male,Food and beverages,31.84,1,2019-02-09,13:22,Cash,4.761905,7.7
998,347-56-2442,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,2019-02-22,15:33,Cash,4.761905,4.1


In [19]:
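# "Date" was already converted to datetime in an earlier cell, so this call leaves it unchanged
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y", errors="coerce")

# Day of the week for each transaction (e.g. "Saturday")
df["Day"] = df["Date"].dt.day_name()

# Hour of the day (0-23) extracted from the "HH:MM" time string
df["Hour"] = pd.to_datetime(df["Time"], format="%H:%M", errors="coerce").dt.hour

df.head()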


Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Date,Time,Payment,gross margin percentage,Rating,Day,Hour
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,2019-01-05,13:08,Ewallet,4.761905,9.1,Saturday,13
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,2019-03-08,10:29,Cash,4.761905,9.6,Friday,10
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,2019-03-03,13:23,Credit card,4.761905,7.4,Sunday,13
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,2019-01-27,20:33,Ewallet,4.761905,8.4,Sunday,20
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,2019-02-08,10:37,Ewallet,4.761905,5.3,Friday,10
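
A further feature that could be built from `Hour` is a coarse time-of-day label, which may be easier to compare across payment methods than 24 separate hours. The sketch below is only a suggestion: the `period` name and the bin edges (before 12:00, 12:00 to 16:59, 17:00 onward) are assumptions, not part of the notebook.

```python
import pandas as pd

# Hour values from the five rows shown above
hours = pd.Series([13, 10, 13, 20, 10], name="Hour")

# Hypothetical bins: [0, 12) morning, [12, 17) afternoon, [17, 24) evening
period = pd.cut(
    hours,
    bins=[0, 12, 17, 24],
    labels=["Morning", "Afternoon", "Evening"],
    right=False,  # intervals are closed on the left, open on the right
)

print(period.tolist())
```

With `right=False`, an hour of exactly 12 falls into the afternoon bin and 17 into the evening bin.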


## **Data Reduction**

Data reduction is a technique used in data mining to shrink a dataset while preserving the important information.

In [20]:
df = df.drop(['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender', 'Product line', 'Unit price', 'Quantity', 'gross margin percentage', 'Rating'], axis=1)

In [21]:
df

Unnamed: 0,Date,Time,Payment,Day,Hour
0,2019-01-05,13:08,Ewallet,Saturday,13
1,2019-03-08,10:29,Cash,Friday,10
2,2019-03-03,13:23,Credit card,Sunday,13
3,2019-01-27,20:33,Ewallet,Sunday,20
4,2019-02-08,10:37,Ewallet,Friday,10
...,...,...,...,...,...
995,2019-01-29,13:46,Ewallet,Tuesday,13
996,2019-03-02,17:16,Ewallet,Saturday,17
997,2019-02-09,13:22,Cash,Saturday,13
998,2019-02-22,15:33,Cash,Friday,15
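
An equivalent way to express this reduction is to list the columns to keep rather than the columns to drop, which states the intended result directly. The sketch below uses a single row with the column layout reached after Construct Data; `row`, `keep`, and `reduced` are names introduced for illustration.

```python
import pandas as pd

# One row with the columns present before the reduction (values from row 0)
row = pd.DataFrame([{
    "Invoice ID": "750-67-8428", "Branch": "A", "City": "Yangon",
    "Customer type": "Member", "Gender": "Female",
    "Product line": "Health and beauty", "Unit price": 74.69, "Quantity": 7,
    "Date": pd.Timestamp("2019-01-05"), "Time": "13:08", "Payment": "Ewallet",
    "gross margin percentage": 4.761905, "Rating": 9.1,
    "Day": "Saturday", "Hour": 13,
}])

# Keep only the columns used in the reduced dataset
keep = ["Date", "Time", "Payment", "Day", "Hour"]
reduced = row[keep]

print(reduced.columns.tolist())
```

Selecting by a keep-list also fails loudly with a `KeyError` if an expected column is missing, which can catch mistakes earlier than a drop-list.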
