# Introduction to Data Analysis & EDA

Exploratory Data Analysis (EDA) is usually the first step of a data science workflow.

Before cleaning or modeling data, we must understand:

- the structure of the dataset
- the number of records
- the column types
- any missing values
- patterns in the data

In this notebook we only observe; we do NOT modify the data.

In [7]:
# Import pandas
import pandas as pd

## Loading Dataset

We start with a small sales dataset.

In [8]:
# Load CSV file
df = pd.read_csv("../data/raw/sales.csv")

df

|   | order_id | customer_id | product | category | price | quantity | city | date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | C101 | Laptop | Electronics | 55000 | 1 | Delhi | 2024-01-05 |
| 1 | 1002 | C102 | Phone | Electronics | 20000 | 2 | Mumbai | 2024-01-06 |
| 2 | 1003 | C103 | Shoes | Fashion | 3000 | 1 | Pune | 2024-01-07 |
| 3 | 1004 | C101 | Headphones | Electronics | 2000 | 3 | Delhi | 2024-01-07 |
| 4 | 1005 | C104 | Tshirt | Fashion | 800 | 2 | Bangalore | 2024-01-08 |
| 5 | 1006 | C105 | Watch | Accessories | 2500 | 1 | Chennai | 2024-01-09 |
| 6 | 1007 | C102 | Laptop | Electronics | 60000 | 1 | Mumbai | 2024-01-10 |
| 7 | 1008 | C106 | Backpack | Accessories | 1500 | 2 | Delhi | 2024-01-10 |
| 8 | 1009 | C103 | Phone | Electronics | 18000 | 1 | Pune | 2024-01-11 |
| 9 | 1010 | C104 | Shoes | Fashion | 3500 | 1 | Bangalore | 2024-01-11 |
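As a side note on loading, the sketch below reads the same kind of CSV from an in-memory string so it can run without the file on disk. The `io.StringIO` stand-in and the `parse_dates` argument are assumptions for illustration; the notebook itself reads the file with default settings, which leaves `date` as a text column.

```python
import io

import pandas as pd

# In-memory stand-in for ../data/raw/sales.csv (first three rows only).
csv_text = """order_id,customer_id,product,category,price,quantity,city,date
1001,C101,Laptop,Electronics,55000,1,Delhi,2024-01-05
1002,C102,Phone,Electronics,20000,2,Mumbai,2024-01-06
1003,C103,Shoes,Fashion,3000,1,Pune,2024-01-07
"""

# parse_dates asks read_csv to convert the named column to datetime64.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(sample.dtypes)
print(sample["date"].min(), sample["date"].max())
```

Parsing dates at load time does not change the source file, so it stays within the "observe only" rule of this notebook.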


## Dataset Shape

`df.shape` returns a tuple with the number of rows and the number of columns.

In [9]:
# (rows, columns)
df.shape

(10, 8)
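Because `shape` is a plain tuple, it can be unpacked into separate variables. The sketch below uses a small stand-in DataFrame built from the first three rows above; the variable names `n_rows` and `n_cols` are illustrative.

```python
import pandas as pd

# Stand-in built from the first three rows of the sales data.
df = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "product": ["Laptop", "Phone", "Shoes"],
    "price": [55000, 20000, 3000],
})

n_rows, n_cols = df.shape      # unpack the (rows, columns) tuple
print(f"{n_rows} rows x {n_cols} columns")

print(len(df))                 # number of rows only
print(df.size)                 # total number of cells (rows * columns)
```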

## Column Names

In [10]:
df.columns

Index(['order_id', 'customer_id', 'product', 'category', 'price', 'quantity',
       'city', 'date'],
      dtype='str')
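Column names can also be turned into a regular list or filtered by type. The following is a minimal sketch on a stand-in DataFrame; `select_dtypes` is not used elsewhere in this notebook and is shown only as one possible way to separate numeric from non-numeric columns.

```python
import pandas as pd

# Stand-in built from the first three rows of the sales data.
df = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "product": ["Laptop", "Phone", "Shoes"],
    "price": [55000, 20000, 3000],
    "quantity": [1, 2, 1],
    "city": ["Delhi", "Mumbai", "Pune"],
})

column_names = df.columns.tolist()     # plain Python list of names
has_price = "price" in df.columns      # membership test on the Index

# Split the columns by dtype without changing the DataFrame.
numeric_columns = df.select_dtypes(include="number").columns.tolist()
text_columns = df.select_dtypes(exclude="number").columns.tolist()

print(column_names)
print(numeric_columns)
print(text_columns)
```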

## First & Last Records

In [11]:
df.head()

|   | order_id | customer_id | product | category | price | quantity | city | date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | C101 | Laptop | Electronics | 55000 | 1 | Delhi | 2024-01-05 |
| 1 | 1002 | C102 | Phone | Electronics | 20000 | 2 | Mumbai | 2024-01-06 |
| 2 | 1003 | C103 | Shoes | Fashion | 3000 | 1 | Pune | 2024-01-07 |
| 3 | 1004 | C101 | Headphones | Electronics | 2000 | 3 | Delhi | 2024-01-07 |
| 4 | 1005 | C104 | Tshirt | Fashion | 800 | 2 | Bangalore | 2024-01-08 |


In [12]:
df.tail()

|   | order_id | customer_id | product | category | price | quantity | city | date |
|---|---|---|---|---|---|---|---|---|
| 5 | 1006 | C105 | Watch | Accessories | 2500 | 1 | Chennai | 2024-01-09 |
| 6 | 1007 | C102 | Laptop | Electronics | 60000 | 1 | Mumbai | 2024-01-10 |
| 7 | 1008 | C106 | Backpack | Accessories | 1500 | 2 | Delhi | 2024-01-10 |
| 8 | 1009 | C103 | Phone | Electronics | 18000 | 1 | Pune | 2024-01-11 |
| 9 | 1010 | C104 | Shoes | Fashion | 3500 | 1 | Bangalore | 2024-01-11 |
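Both `head()` and `tail()` return five rows by default and accept a row count. The sketch below shows that on a two-column stand-in; `sample()` is an addition not used in this notebook, included as one way to peek at rows that are neither first nor last.

```python
import pandas as pd

# Two columns of the sales data are enough to show row selection.
df = pd.DataFrame({
    "order_id": list(range(1001, 1011)),
    "price": [55000, 20000, 3000, 2000, 800, 2500, 60000, 1500, 18000, 3500],
})

first_three = df.head(3)                        # first 3 rows
last_two = df.tail(2)                           # last 2 rows
random_three = df.sample(n=3, random_state=0)   # 3 rows picked at random

print(first_three)
print(last_two)
print(random_three)
```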


## Basic Dataset Information

`df.info()` reports:
- data types
- non-null values
- memory usage

In [13]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   order_id     10 non-null     int64
 1   customer_id  10 non-null     str  
 2   product      10 non-null     str  
 3   category     10 non-null     str  
 4   price        10 non-null     int64
 5   quantity     10 non-null     int64
 6   city         10 non-null     str  
 7   date         10 non-null     str  
dtypes: int64(3), str(5)
memory usage: 772.0 bytes
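The non-null counts above imply that no values are missing. A more direct check, sketched here on a four-column stand-in, is to count missing cells per column with `isna().sum()`; the variable names are illustrative.

```python
import pandas as pd

# Four columns of the sales data, copied from the table above.
df = pd.DataFrame({
    "order_id": list(range(1001, 1011)),
    "price": [55000, 20000, 3000, 2000, 800, 2500, 60000, 1500, 18000, 3500],
    "quantity": [1, 2, 1, 3, 2, 1, 1, 2, 1, 1],
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Bangalore",
             "Chennai", "Mumbai", "Delhi", "Pune", "Bangalore"],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing cells:", total_missing)
```

This only reads the data; deciding what to do about any missing values belongs to the cleaning step.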


## Statistical Summary

By default, `describe()` summarizes only the numeric columns.

In [14]:
df.describe()

|   | order_id | price | quantity |
|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 |
| mean | 1005.5 | 16630.0 | 1.5 |
| std | 3.02765 | 22651.516996 | 0.707107 |
| min | 1001.0 | 800.0 | 1.0 |
| 25% | 1003.25 | 2125.0 | 1.0 |
| 50% | 1005.5 | 3250.0 | 1.0 |
| 75% | 1007.75 | 19500.0 | 2.0 |
| max | 1010.0 | 60000.0 | 3.0 |
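Text columns are left out of the default summary, but `describe()` can include them when asked. The sketch below passes `include="all"` on a three-column stand-in; for text columns it fills in `unique`, `top` and `freq`, and leaves the numeric statistics empty.

```python
import pandas as pd

# Three columns of the sales data, copied from the table above.
df = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Fashion", "Electronics",
                 "Fashion", "Accessories", "Electronics", "Accessories",
                 "Electronics", "Fashion"],
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Bangalore",
             "Chennai", "Mumbai", "Delhi", "Pune", "Bangalore"],
    "price": [55000, 20000, 3000, 2000, 800, 2500, 60000, 1500, 18000, 3500],
})

# include="all" summarizes every column, numeric or not.
summary = df.describe(include="all")
print(summary)

# Individual statistics can be read by row label and column name.
print(summary.loc["top", "category"], summary.loc["freq", "category"])
```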


## Unique Values in Columns

In [17]:
df["city"].unique()

<StringArray>
['Delhi', 'Mumbai', 'Pune', 'Bangalore', 'Chennai']
Length: 5, dtype: str

In [16]:
df["category"].unique()

<StringArray>
['Electronics', 'Fashion', 'Accessories']
Length: 3, dtype: str
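When only the number of distinct values matters, `nunique()` avoids listing them. The sketch below is a minimal example on a three-column stand-in; calling it on the whole DataFrame gives one count per column.

```python
import pandas as pd

# Three columns of the sales data, copied from the table above.
df = pd.DataFrame({
    "customer_id": ["C101", "C102", "C103", "C101", "C104",
                    "C105", "C102", "C106", "C103", "C104"],
    "category": ["Electronics", "Electronics", "Fashion", "Electronics",
                 "Fashion", "Accessories", "Electronics", "Accessories",
                 "Electronics", "Fashion"],
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Bangalore",
             "Chennai", "Mumbai", "Delhi", "Pune", "Bangalore"],
})

n_cities = df["city"].nunique()    # distinct values in one column
unique_per_column = df.nunique()   # distinct values in every column

print(n_cities)
print(unique_per_column)
```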

## Value Counts

`value_counts()` returns how often each distinct value occurs, sorted from most to least frequent.

In [18]:
df["category"].value_counts()

category
Electronics    5
Fashion        3
Accessories    2
Name: count, dtype: int64
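Counts are often easier to compare as proportions. The sketch below passes `normalize=True` to `value_counts()` on a stand-in `category` column; the variable name `shares` is illustrative.

```python
import pandas as pd

# The category column of the sales data, copied from the table above.
df = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Fashion", "Electronics",
                 "Fashion", "Accessories", "Electronics", "Accessories",
                 "Electronics", "Fashion"],
})

counts = df["category"].value_counts()                 # absolute frequencies
shares = df["category"].value_counts(normalize=True)   # fractions summing to 1

print(counts)
print(shares)
```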

## Conclusion

EDA helps us understand dataset structure before performing cleaning or transformation.

Next step:<br>
Working with Series & DataFrame operations.