# **Analysis of the Sales Dataset**

## **About Data**

Title       : Sales Dataset

Dataset     : [link](https://www.kaggle.com/datasets/sahilislam007/sales-dataset)

## **Import Libraries**

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## **Data Exploration**

### **Download and Load CSV**

In [3]:
# Download csv

import kagglehub

path = kagglehub.dataset_download("sahilislam007/sales-dataset")

In [4]:
# Load the CSV from the downloaded dataset directory
import os

df = pd.read_csv(os.path.join(path, "Sales Dataset.csv"))

### **Sneak Peek at the Data**

In [5]:
# Show the first 5 rows of the data
df.head()

| | Unnamed: 0 | Date | Gender | Age | Product Category | Quantity | Price per Unit | Total Amount |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2023-11-24 | Male | 34 | Beauty | 3 | 50 | 150 |
| 1 | 1 | 2023-02-27 | Female | 26 | Clothing | 2 | 500 | 1000 |
| 2 | 2 | 2023-01-13 | Male | 50 | Electronics | 1 | 30 | 30 |
| 3 | 3 | 2023-05-21 | Male | 37 | Clothing | 1 | 500 | 500 |
| 4 | 4 | 2023-05-06 | Male | 30 | Beauty | 2 | 50 | 100 |


In [6]:
# List the column names
df.columns

Index(['Unnamed: 0', 'Date', 'Gender', 'Age', 'Product Category', 'Quantity',
       'Price per Unit', 'Total Amount'],
      dtype='object')

In [7]:
# See the data's shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns") 

There are 1000 rows and 8 columns


In [8]:
# Show column details (dtype and non-null count)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Unnamed: 0        1000 non-null   int64 
 1   Date              1000 non-null   object
 2   Gender            1000 non-null   object
 3   Age               1000 non-null   int64 
 4   Product Category  1000 non-null   object
 5   Quantity          1000 non-null   int64 
 6   Price per Unit    1000 non-null   int64 
 7   Total Amount      1000 non-null   int64 
dtypes: int64(5), object(3)
memory usage: 62.6+ KB


In [9]:
# Count null values
df.isna().sum()

Unnamed: 0          0
Date                0
Gender              0
Age                 0
Product Category    0
Quantity            0
Price per Unit      0
Total Amount        0
dtype: int64

### **Findings**
1. There are 1000 rows and 8 columns.
2. The columns of the dataset are `Unnamed: 0`, `Date`, `Gender`, `Age`, `Product Category`, `Quantity`, `Price per Unit`, and `Total Amount`.
3. The `Date` column is stored as `object` (string) instead of a datetime type.
4. There are no missing or null values.
5. One column has an uninformative name, `Unnamed: 0`.
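
The findings above can also be checked in code instead of being read off the printed output. The following is a minimal sketch, assuming a small hand-typed frame (`sample`, a name that is not part of the original notebook) that mirrors the five rows shown by `df.head()`. On the full dataset the same three checks would be run against `df`.

```python
import pandas as pd

# Hand-typed sample mirroring the first five rows shown above (illustrative only)
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "Date": ["2023-11-24", "2023-02-27", "2023-01-13", "2023-05-21", "2023-05-06"],
    "Gender": ["Male", "Female", "Male", "Male", "Male"],
    "Age": [34, 26, 50, 37, 30],
    "Product Category": ["Beauty", "Clothing", "Electronics", "Clothing", "Beauty"],
    "Quantity": [3, 2, 1, 1, 2],
    "Price per Unit": [50, 500, 30, 500, 50],
    "Total Amount": [150, 1000, 30, 500, 100],
})

# Finding 3: columns not stored as numbers are the candidates for a dtype fix
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()

# Finding 4: total count of missing values across the whole frame
missing_total = int(sample.isna().sum().sum())

# Finding 5: check whether the unnamed column simply repeats the row index
mirrors_index = bool(
    (sample["Unnamed: 0"].to_numpy() == sample.index.to_numpy()).all()
)

print(non_numeric, missing_total, mirrors_index)
```

If `mirrors_index` is true on the full data as well, the unnamed column carries no information beyond the row position, which supports renaming it to a row number.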

### **Change Column Name**

In [10]:
# Rename the unnamed column
df.rename(columns={'Unnamed: 0' : 'Row Number'}, inplace=True)

In [11]:
# Verify the change
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Row Number        1000 non-null   int64 
 1   Date              1000 non-null   object
 2   Gender            1000 non-null   object
 3   Age               1000 non-null   int64 
 4   Product Category  1000 non-null   object
 5   Quantity          1000 non-null   int64 
 6   Price per Unit    1000 non-null   int64 
 7   Total Amount      1000 non-null   int64 
dtypes: int64(5), object(3)
memory usage: 62.6+ KB
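
A column called `Unnamed: 0` usually appears when a CSV was written with its index included and then read back without saying so. Whether that is what happened to this dataset is an assumption. The sketch below reproduces the effect on a tiny made-up frame (`original` is illustrative, not the real data) and shows `index_col=0` as one way to avoid the extra column at load time.

```python
import io
import pandas as pd

# Tiny illustrative frame (not the real dataset)
original = pd.DataFrame({"Gender": ["Male", "Female"], "Age": [34, 26]})

# Writing with the default index=True stores the index under an empty header
csv_text = original.to_csv()

# Reading it back without options yields an 'Unnamed: 0' column
naive = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the extra column
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.columns.tolist())
print(clean.columns.tolist())
```

This notebook keeps the column and renames it, which is equally valid when a visible row number is wanted.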


### **Change Columns Datatype**

In [12]:
# Change Date column datatype from object to datetime
df['Date'] = pd.to_datetime(df['Date'])

In [13]:
# Verify the change
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Row Number        1000 non-null   int64         
 1   Date              1000 non-null   datetime64[ns]
 2   Gender            1000 non-null   object        
 3   Age               1000 non-null   int64         
 4   Product Category  1000 non-null   object        
 5   Quantity          1000 non-null   int64         
 6   Price per Unit    1000 non-null   int64         
 7   Total Amount      1000 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(2)
memory usage: 62.6+ KB
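
The conversion above lets pandas infer the date layout. As a hedged alternative, the sketch below passes an explicit `format` and uses `errors="coerce"` so that malformed entries become `NaT` and can be counted. The series `raw` and its deliberately bad last value are made up for illustration; the real column has no such entry as far as the output above shows.

```python
import pandas as pd

# Illustrative values; the last entry is deliberately malformed
raw = pd.Series(["2023-11-24", "2023-02-27", "not a date"])

# An explicit format documents the expected layout; errors="coerce" turns
# unparseable entries into NaT instead of raising an exception
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Number of entries that failed to parse
n_bad = int(parsed.isna().sum())
print(n_bad)
```

A non-zero `n_bad` would signal rows to inspect before any date-based analysis.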


### **Change Column Values**

The row numbers start at 0, so I add 1 to every value to make them start at 1.

In [14]:
# Change row number values
df['Row Number'] = df['Row Number'] + 1

In [None]:
# Verify the row number change
df.head()

| | Row Number | Date | Gender | Age | Product Category | Quantity | Price per Unit | Total Amount |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-11-24 | Male | 34 | Beauty | 3 | 50 | 150 |
| 1 | 2 | 2023-02-27 | Female | 26 | Clothing | 2 | 500 | 1000 |
| 2 | 3 | 2023-01-13 | Male | 50 | Electronics | 1 | 30 | 30 |
| 3 | 4 | 2023-05-21 | Male | 37 | Clothing | 1 | 500 | 500 |
| 4 | 5 | 2023-05-06 | Male | 30 | Beauty | 2 | 50 | 100 |
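
One caveat of the `+ 1` approach in a notebook is that re-running the cell shifts the numbers again. A minimal sketch of a repeatable alternative follows, assuming the rows are in their original order and should simply be numbered 1 to N; the frame `demo` is illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative frame with a zero-based row number, as in the original data
demo = pd.DataFrame({"Row Number": [0, 1, 2], "Quantity": [3, 2, 1]})

# Deriving the values from the frame's length gives the same result
# however many times the cell is run, unlike an in-place "+ 1"
demo["Row Number"] = range(1, len(demo) + 1)
demo["Row Number"] = range(1, len(demo) + 1)  # a second run changes nothing

print(demo["Row Number"].tolist())  # [1, 2, 3]
```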
