# Mastering Pandas for Data Analysis: A Comprehensive Guide

## What is Pandas?

Pandas is a powerful open-source Python library used for data manipulation, data cleaning, and data analysis.

Built on top of NumPy, it supports fast, vectorized operations on tabular data such as spreadsheets or SQL tables.

## Pandas Core Data Structures

### 1. Series

A one-dimensional labeled array capable of holding any data type.

In [52]:
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

a    10
b    20
c    30
d    40
dtype: int64


### Key Features:

- Assigns a default integer index (0, 1, 2, ...) unless you provide your own labels.

- Can hold any data type (int, float, str, etc.)
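Both points can be checked with a minimal sketch. The variable names below are illustrative and are not part of the original notebook.

```python
import pandas as pd

# No index given: pandas assigns the default labels 0, 1, 2
default_idx = pd.Series([10, 20, 30])
print(list(default_idx.index))   # [0, 1, 2]

# Explicit labels allow label-based lookup
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(labeled['b'])              # 20

# Mixed values are stored with the generic object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)               # object
```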

### 2. DataFrame

A two-dimensional labeled data structure (like a table in Excel or a database).

In [53]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['Lagos', 'Abuja', 'Ibadan']
}
df = pd.DataFrame(data)
print(df)

      Name  Age    City
0    Alice   25   Lagos
1      Bob   30   Abuja
2  Charlie   35  Ibadan


### Key Features:

- Think of it as a dictionary of Series that share the same index.

- Labeled rows (index) and columns.
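The dictionary-of-Series view can be made concrete with a small sketch. The names `ages`, `cities`, and `people` are made up for this illustration.

```python
import pandas as pd

# Two Series sharing the same index become two columns
ages = pd.Series([25, 30], index=['alice', 'bob'])
cities = pd.Series(['Lagos', 'Abuja'], index=['alice', 'bob'])
people = pd.DataFrame({'Age': ages, 'City': cities})

# Selecting one column returns a Series
age_col = people['Age']
print(type(age_col).__name__)      # Series

# A row label and a column label together address one cell
print(people.loc['bob', 'City'])   # Abuja
```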

## Creating DataFrames

In [54]:
# From a dictionary:

pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

   A  B
0  1  3
1  2  4


In [55]:
# From a list of lists:

pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

   A  B
0  1  2
1  3  4


In [56]:
# From a CSV file:

df = pd.read_csv('sales.csv')
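Since `sales.csv` itself is not included here, the following sketch reads the same kind of data from an in-memory string instead. The two rows are copied from the output shown later in this notebook; the use of `io.StringIO` and `parse_dates` is an assumption about how one might load the file, not what the original cell does.

```python
import io
import pandas as pd

# Stand-in for a file on disk
csv_text = """sale_id,sale_date,quantity_sold
1,2023-03-01,3
2,2023-03-11,9
"""

# parse_dates converts the named column to a datetime dtype while reading
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=['sale_date'])
print(sales.dtypes)
```

Without `parse_dates`, the `sale_date` column would be read as plain text.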

## DataFrame Basic Operations

In [57]:
# 1. Viewing Data

df.head()      # first 5 rows
df.tail()      # last 5 rows
df.info()      # summary info
df.describe()  # statistical summary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   sale_id        20 non-null     int64 
 1   product_id     20 non-null     int64 
 2   employee_id    20 non-null     int64 
 3   sale_date      20 non-null     object
 4   quantity_sold  20 non-null     int64 
 5   total_price    20 non-null     int64 
dtypes: int64(5), object(1)
memory usage: 1.1+ KB


       sale_id  product_id  employee_id  quantity_sold  total_price
count     20.0        20.0         20.0           20.0         20.0
mean      10.5         3.2          2.9           5.55        984.5
std    5.91608    1.361114     1.333772       2.723678   483.480202
min        1.0         1.0          1.0            1.0        221.0
25%       5.75        2.75          2.0           3.75       698.75
50%       10.5         3.5          2.5            5.5        926.0
75%      15.25         4.0          4.0           7.25       1297.0
max       20.0         5.0          5.0            9.0       1993.0


In [58]:
# 2. Accessing Columns and Rows

df['total_price']         # Access column
df[['product_id', 'quantity_sold']]  # Multiple columns
df.iloc[0]         # First row (by position)
df.loc[0]          # First row (by label/index)

sale_id                   1
product_id                3
employee_id               2
sale_date        2023-03-01
quantity_sold             3
total_price            1993
Name: 0, dtype: object
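With the default index, `iloc[0]` and `loc[0]` return the same row, which can hide the difference between them. A minimal sketch with a hypothetical frame whose labels are not 0, 1, 2 makes it visible:

```python
import pandas as pd

# Hypothetical frame whose index labels are 10, 20, 30
scores = pd.DataFrame({'score': [90, 80, 70]}, index=[10, 20, 30])

print(scores.iloc[0]['score'])   # 90: first row by position
print(scores.loc[10]['score'])   # 90: row whose label is 10

# No row is labeled 0, so a label lookup for 0 fails
try:
    scores.loc[0]
except KeyError:
    print('label 0 not found')
```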

In [59]:
# 3. Filtering

df[df['quantity_sold'] > 5]

    sale_id  product_id  employee_id   sale_date  quantity_sold  total_price
1         2           4            2  2023-03-11              9          912
4         5           1            2  2023-04-10              8          490
5         6           4            3  2023-04-20              9          717
6         7           1            4  2023-04-30              7          903
7         8           3            4  2023-05-10              7         1094
14       15           4            2  2023-07-19              7         1074
15       16           2            2  2023-07-29              9          221
17       18           3            5  2023-08-18              7          644
18       19           5            5  2023-08-28              9          728
19       20           4            3  2023-09-07              6         1552
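Filters can also combine several conditions. The sketch below rebuilds three of the rows from the output above (only some columns) and keeps sales that are both large and expensive; the threshold of 900 is chosen only for illustration.

```python
import pandas as pd

# Three rows copied from the filtered output above
sales = pd.DataFrame({
    'sale_id': [2, 5, 20],
    'quantity_sold': [9, 8, 6],
    'total_price': [912, 490, 1552],
})

# Each condition needs its own parentheses; use & for AND, | for OR
big_and_pricey = sales[(sales['quantity_sold'] > 5) & (sales['total_price'] > 900)]
print(big_and_pricey['sale_id'].tolist())   # [2, 20]
```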


## Data Cleaning with Pandas

In [60]:
# 1. Handling Missing Data

df.isnull().sum()              # Count missing values per column
df.fillna(0)                   # Return a copy with missing values replaced by 0
df.dropna()                    # Return a copy without rows that contain missing values

   sale_id  product_id  employee_id   sale_date  quantity_sold  total_price
0        1           3            2  2023-03-01              3         1993
1        2           4            2  2023-03-11              9          912
2        3           4            3  2023-03-21              3          940
3        4           1            1  2023-03-31              5          963
4        5           1            2  2023-04-10              8          490
5        6           4            3  2023-04-20              9          717
6        7           1            4  2023-04-30              7          903
7        8           3            4  2023-05-10              7         1094
8        9           1            3  2023-05-20              1         1321
9       10           3            2  2023-05-30              5         1506

In [61]:
# 2. Renaming Columns

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['Lagos', 'Abuja', 'Ibadan']
}
df = pd.DataFrame(data)

# renaming the Name column
df.rename(columns={'Name': 'FullName'}, inplace=True)

# Printing the updated DataFrame
print(df)


  FullName  Age    City
0    Alice   25   Lagos
1      Bob   30   Abuja
2  Charlie   35  Ibadan


In [62]:
# 3. Changing Data Types

df['Age'] = df['Age'].astype(float)
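Type conversion matters most for columns that arrive as text, such as the `sale_date` column reported as `object` in the `df.info()` output earlier. A minimal sketch, using made-up string values and the assumption that the dates are in ISO format:

```python
import pandas as pd

# Made-up values stored as strings, as they often are after reading a file
frame = pd.DataFrame({'sale_date': ['2023-03-01', '2023-03-11'],
                      'quantity_sold': ['3', '9']})

# astype handles simple casts; to_datetime parses date strings
frame['quantity_sold'] = frame['quantity_sold'].astype(int)
frame['sale_date'] = pd.to_datetime(frame['sale_date'])

print(frame.dtypes)
print(frame['sale_date'].dt.month.tolist())   # [3, 3]
```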

## DataFrame Operations

In [63]:
# 1. Adding New Columns

df['Salary'] = [5000, 6000, 7000]

In [64]:
# 2. Applying Functions

df['AgePlusTen'] = df['Age'].apply(lambda x: x + 10)
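For simple arithmetic like this, operating on the whole column gives the same result as `apply` and is usually faster. A short sketch comparing the two on a standalone Series (the values mirror the `Age` column):

```python
import pandas as pd

ages = pd.Series([25.0, 30.0, 35.0])

via_apply = ages.apply(lambda x: x + 10)   # calls the lambda once per element
vectorized = ages + 10                     # one operation on the whole column

print(via_apply.equals(vectorized))        # True
```

`apply` remains useful when the logic cannot be expressed with built-in column operations.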

In [65]:
# 3. Sorting

df.sort_values('Age', ascending=False)

   FullName   Age    City  Salary  AgePlusTen
2   Charlie  35.0  Ibadan    7000        45.0
1       Bob  30.0   Abuja    6000        40.0
0     Alice  25.0   Lagos    5000        35.0


## Combining DataFrames

In [None]:
# 1. Concatenation (stacks rows by default)

pd.concat([df1, df2])
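The notebook does not define `df1` and `df2`, so here is a runnable sketch with two hypothetical frames that have the same columns:

```python
import pandas as pd

# df1 and df2 are hypothetical; the notebook does not define them
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3], 'Name': ['Charlie']})

# Stack the rows; ignore_index=True renumbers the result from 0
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```

Without `ignore_index=True`, each row keeps the label it had in its source frame, so labels can repeat.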

In [None]:
# 2. Merging (like SQL join)

pd.merge(df1, df2, on='ID', how='inner')
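As a sketch of how `how` changes the result, assume two hypothetical tables that share an `ID` column but do not contain exactly the same IDs:

```python
import pandas as pd

# Hypothetical tables that share an 'ID' column
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [6000, 7000, 8000]})

inner = pd.merge(df1, df2, on='ID', how='inner')   # only IDs present in both
left = pd.merge(df1, df2, on='ID', how='left')     # every row of df1

print(inner['ID'].tolist())          # [2, 3]
print(left['Salary'].isna().sum())   # 1: ID 1 has no match in df2
```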

In [None]:
# 3. Joining (aligns rows on the index)

df1.join(df2, how='outer')
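Unlike `pd.merge` with `on=`, `join` matches rows by index label. A minimal sketch with two hypothetical frames indexed by name:

```python
import pandas as pd

# Hypothetical frames indexed by name, with different columns
df1 = pd.DataFrame({'Age': [25, 30]}, index=['Alice', 'Bob'])
df2 = pd.DataFrame({'City': ['Abuja', 'Ibadan']}, index=['Bob', 'Charlie'])

# how='outer' keeps the labels from both frames and fills gaps with NaN
joined = df1.join(df2, how='outer')
print(joined)
```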

## GroupBy and Aggregations

In [None]:
df.groupby('City')['Age'].mean()

City
Abuja     30.0
Ibadan    35.0
Lagos     25.0
Name: Age, dtype: float64

Common aggregation functions:

- sum()

- mean()

- count()

- min()

- max()
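Several of these functions can be applied in one pass with `agg`. The rows below are made up for illustration, reusing the city names from the earlier examples:

```python
import pandas as pd

# Made-up rows with two people in Lagos and one in Abuja
people = pd.DataFrame({
    'City': ['Lagos', 'Lagos', 'Abuja'],
    'Age': [25, 35, 30],
})

# agg accepts a list of function names and returns one column per function
summary = people.groupby('City')['Age'].agg(['count', 'mean', 'max'])
print(summary)
```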

## Pivot Tables

In [None]:
# Assumes df also has a 'Gender' column, which the example DataFrame above does not
df.pivot_table(index='City', columns='Gender', values='Salary', aggfunc='mean')
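A runnable sketch of the same call, using made-up data that includes the assumed `Gender` column:

```python
import pandas as pd

# Made-up data; 'Gender' is an assumed column for this illustration
staff = pd.DataFrame({
    'City': ['Lagos', 'Lagos', 'Abuja', 'Abuja'],
    'Gender': ['F', 'M', 'F', 'F'],
    'Salary': [5000, 6000, 7000, 9000],
})

# One row per city, one column per gender, mean salary in each cell
table = staff.pivot_table(index='City', columns='Gender',
                          values='Salary', aggfunc='mean')
print(table)
```

Combinations with no rows, such as Abuja and M here, appear as NaN.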

## Reshaping and Melting

In [None]:
# Reshape:

df.pivot(index='FullName', columns='City', values='Salary')  # 'Name' was renamed to 'FullName' earlier

In [None]:
# Melt:

pd.melt(df, id_vars=['FullName'], value_vars=['Salary', 'Age'])  # 'Name' was renamed to 'FullName' earlier
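The two operations are roughly inverses of each other. The sketch below melts a small made-up frame into long form and pivots it back; `variable` and `value` are the default column names that `pd.melt` produces.

```python
import pandas as pd

wide = pd.DataFrame({
    'FullName': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Salary': [5000, 6000],
})

# Wide to long: one row per (FullName, variable) pair
long = pd.melt(wide, id_vars=['FullName'], value_vars=['Salary', 'Age'])
print(long)

# Long back to wide: each distinct 'variable' becomes a column again
back = long.pivot(index='FullName', columns='variable', values='value')
print(back)
```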

## Read:

- pd.read_csv()

- pd.read_excel()

- pd.read_json()

## Write:

- df.to_csv('filename.csv')

- df.to_excel('filename.xlsx')
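A write followed by a read can be sketched without touching the disk by using an in-memory buffer in place of a filename. The buffer and the `index=False` choice are assumptions made for this illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# index=False leaves the row labels out of the written text
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
print(buffer.getvalue())

# Reading the text back gives an equal DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.equals(frame))   # True
```

If the index is written (the default), reading the file back produces an extra unnamed first column unless `index_col=0` is passed to `pd.read_csv`.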