# **Pandas Basics**

### **Install pandas package**

In [47]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.





### **Import pandas**

In [1]:
import pandas as pd

## **DataFrames**
A DataFrame is a two-dimensional labeled data structure whose columns can hold 
different data types, similar to a spreadsheet or SQL table. 
It is the main pandas object for manipulating and analyzing structured data in Python: 
each column has a name, and each row has an index label.
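
As a minimal sketch of the "different data types per column" idea, the frame below mixes text, integer, and float columns. The `Height (m)` column and its values are made up for illustration only.

```python
import pandas as pd

# Illustrative values; the "Height (m)" column is invented for this sketch
people = pd.DataFrame(
    {
        "Name": ["John", "Jane", "Smith"],
        "Age": [30, 28, 26],
        "Height (m)": [1.75, 1.62, 1.80],
    }
)

print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # one data type per column
```

`dtypes` reports one type per column, which is what distinguishes a DataFrame from a plain two-dimensional array with a single type.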

In [2]:
df = pd.DataFrame()
df

In [3]:
# Pass the data as a list of lists (one inner list per row)
# and the column names through the columns argument
row_data = [["John", 30], ["Jane", 28], ["Smith", 26]]
headers = ["Name", "Age"]
df = pd.DataFrame(row_data, columns=headers)
df

,Name,Age
0,John,30
1,Jane,28
2,Smith,26


In [4]:
# Pass the data as a dictionary of lists
# {column_name: [values]}
data = {
    "Name": ["John", "Jane", "Smith"],
    "Age": [30,28,26]
}
df = pd.DataFrame(data)
df

,Name,Age
0,John,30
1,Jane,28
2,Smith,26


In [5]:
type(df)

pandas.core.frame.DataFrame

## **Series**

A pandas Series is a one-dimensional labeled array capable of 
holding data of any type (integer, string, float, etc.). 
It's similar to a one-column table or an array with associated labels, 
providing powerful indexing and manipulation capabilities in Python.
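
To make the "associated labels" part concrete, here is a small sketch of a Series with a custom index. The variable names are hypothetical; the values reuse the names and ages from the DataFrame example.

```python
import pandas as pd

# A Series whose labels are names instead of the default 0, 1, 2, ...
ages = pd.Series([30, 28, 26], index=["John", "Jane", "Smith"], name="Age")

jane_age = ages["Jane"]       # look up a value by its label
first_age = ages.iloc[0]      # look up a value by its position
over_27 = ages[ages > 27]     # keep only the values that meet a condition

print(jane_age, first_age)
print(over_27)
```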

In [6]:
series = pd.Series([1,2,3,4,5])
series

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [7]:
type(series)

pandas.core.series.Series

### **Pandas Data Types**

Common data types:
- Integer (int64): Represents whole numbers (e.g., 10, -5). 
    This is the default integer type in pandas (64-bit integer).
- Float (float64): Represents numbers with decimals (e.g., 3.14, -12.5).
- Boolean (bool): Represents logical True or False values.
- Object (object): A versatile but less efficient type that can hold any Python object, 
    such as strings, lists, or custom objects. 
    Pandas falls back to this type when it cannot infer a more specific data type, 
    for example when a Series mixes numbers and strings.

In [8]:
# Integer (int64)
int_series = pd.Series([1, 2, 3, 4, 5])
int_series

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [9]:
# Float (float64)
float_series = pd.Series([3.14, -3.14, 0.0001, -0.0001])
float_series

0    3.1400
1   -3.1400
2    0.0001
3   -0.0001
dtype: float64

In [10]:
# Boolean (bool)
boolean_series = pd.Series([True, False, False, True])
boolean_series

0     True
1    False
2    False
3     True
dtype: bool

In [11]:
# Object (object): mixed data types in one Series
object_series = pd.Series([30, 3.14, True, "John"])
object_series

0      30
1    3.14
2    True
3    John
dtype: object

Specialized Data Types:
- Categorical: Represents data that takes a limited set of possible values (categories). 
    Each category is stored once and the values are stored as integer codes, which saves memory when values repeat.
- Sparse: Represents data in which most values equal a single fill value (such as NaN or 0). 
    Only the values that differ from the fill value are stored.
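
The sketch below illustrates both points under simple assumptions: a made-up list of department names repeated many times, and a small integer array that is mostly zeros. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data: six department names repeated 1,000 times each
departments = ["Marketing", "Sales", "Operations", "IT", "Finance", "HR"] * 1000

object_series = pd.Series(departments, dtype="object")
category_series = object_series.astype("category")

# deep=True also counts the memory used by the Python string objects
object_bytes = object_series.memory_usage(deep=True)
category_bytes = category_series.memory_usage(deep=True)
print(object_bytes, category_bytes)

# A sparse array that stores only the values different from the fill value (0 here)
sparse_values = pd.arrays.SparseArray([0, 0, 0, 5, 0, 0, 7, 0], fill_value=0)
print(sparse_values.density)  # fraction of values that are actually stored
```

The exact byte counts depend on the platform and pandas version, so they are not shown here; the categorical version should be noticeably smaller because each name is stored only once.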

In [12]:
# Build a categorical from a list of department names
categorical_list = pd.Categorical(["Marketing", "Sales", "Operations", "IT", "Finance", "HR"])
categorical_series = pd.Series(categorical_list)
categorical_series

0     Marketing
1         Sales
2    Operations
3            IT
4       Finance
5            HR
dtype: category
Categories (6, object): ['Finance', 'HR', 'IT', 'Marketing', 'Operations', 'Sales']

In [13]:
sparse_series = pd.Series(pd.arrays.SparseArray([30,31,32,pd.NA, 29, 42, pd.NA]))
sparse_series

0     30
1     31
2     32
3    NaN
4     29
5     42
6    NaN
dtype: Sparse[object, nan]

### **Changing Data Types**

In [14]:
# astype() converts a Series to another data type
# Convert integer to float
float_series = int_series.astype('float64')
float_series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [15]:
# Convert float to the pandas string dtype
string_series = float_series.astype('string')
string_series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: string

**Example: Sales Data Analysis**

You have a dataset of sales transactions that includes the product name, quantity sold, and sale price. 
You want to analyze the data to find the total revenue per product.

In [16]:
# Step 1: Define the data as a dictionary of lists
data = {
    'Product Name':['Iced Tea', 'Hot Chocolate' , 'Lemonade', 'Coffee', 'Milkshake', 'Tea', 'Smoothie', 'Soda', 'Protein Shake', 'Matcha Latte'],
    'Type': ['Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Cold', 'Cold', 'Hot'],
    'Stock': [15, 15, 15, 15, 15, 15, 15, 15, 15, 15],
    'Quantity Sold':[6, 9, 13, 11, 8, 6, 14, 10, 8, 10],
    'Manufacturing Cost':[7, 10, 6, 8, 9, 7, 10, 11, 8, 9],
    'Market Price':[13, 20, 11, 15, 19, 14, 17, 18, 20, 12],
    'Rating': [1, 3, 5, 4, 3, 2, 5, 3, 2, 3]
}

# Create a dataframe
sales_df = pd.DataFrame(data)
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating
0,Iced Tea,Cold,15,6,7,13,1
1,Hot Chocolate,Hot,15,9,10,20,3
2,Lemonade,Cold,15,13,6,11,5
3,Coffee,Hot,15,11,8,15,4
4,Milkshake,Cold,15,8,9,19,3
5,Tea,Hot,15,6,7,14,2
6,Smoothie,Cold,15,14,10,17,5
7,Soda,Cold,15,10,11,18,3
8,Protein Shake,Cold,15,8,8,20,2
9,Matcha Latte,Hot,15,10,9,12,3


In [17]:
# Step 2: Get the specific column Product Name
sales_df['Product Name']

0         Iced Tea
1    Hot Chocolate
2         Lemonade
3           Coffee
4        Milkshake
5              Tea
6         Smoothie
7             Soda
8    Protein Shake
9     Matcha Latte
Name: Product Name, dtype: object

In [18]:
# Step 3: Calculate the total revenue
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Market Price']
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue
0,Iced Tea,Cold,15,6,7,13,1,78
1,Hot Chocolate,Hot,15,9,10,20,3,180
2,Lemonade,Cold,15,13,6,11,5,143
3,Coffee,Hot,15,11,8,15,4,165
4,Milkshake,Cold,15,8,9,19,3,152
5,Tea,Hot,15,6,7,14,2,84
6,Smoothie,Cold,15,14,10,17,5,238
7,Soda,Cold,15,10,11,18,3,180
8,Protein Shake,Cold,15,8,8,20,2,160
9,Matcha Latte,Hot,15,10,9,12,3,120


### **Data Selection**

Pandas provides numerous methods for selecting and indexing data in Series and DataFrames, 
including label-based indexing with .loc, integer-position based indexing with .iloc, and conditional selection.
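
A minimal sketch contrasting the three approaches, using three products from the sales data with the product name as the index. Setting a string index is an assumption made here so that labels and positions are visibly different.

```python
import pandas as pd

# Three rows from the sales data, indexed by product name for illustration
prices = pd.DataFrame(
    {"Market Price": [13, 20, 11], "Rating": [1, 3, 5]},
    index=["Iced Tea", "Hot Chocolate", "Lemonade"],
)

by_label = prices.loc["Hot Chocolate", "Market Price"]  # label-based
by_position = prices.iloc[1, 0]                         # position-based (row 1, column 0)
high_rated = prices[prices["Rating"] >= 3]              # conditional selection

print(by_label, by_position)
print(high_rated)
```

Both `by_label` and `by_position` point at the same cell; only the way of addressing it differs.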

In [19]:
# Total Cost of Goods Sold (COGS)
# Quantity sold multiplied by manufacturing cost
sales_df['COGS'] = sales_df['Quantity Sold'] * sales_df['Manufacturing Cost']
# Gross Profit
# Total revenue minus cost of goods sold
sales_df['Gross Profit'] = sales_df['Total Revenue'] - sales_df['COGS']
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,42,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,90,90
2,Lemonade,Cold,15,13,6,11,5,143,78,65
3,Coffee,Hot,15,11,8,15,4,165,88,77
4,Milkshake,Cold,15,8,9,19,3,152,72,80
5,Tea,Hot,15,6,7,14,2,84,42,42
6,Smoothie,Cold,15,14,10,17,5,238,140,98
7,Soda,Cold,15,10,11,18,3,180,110,70
8,Protein Shake,Cold,15,8,8,20,2,160,64,96
9,Matcha Latte,Hot,15,10,9,12,3,120,90,30


### **Data Selection in Series**

In [20]:
# [starting_index:ending_index(excluded):step]
sales_df['Product Name'][0:2]

0         Iced Tea
1    Hot Chocolate
Name: Product Name, dtype: object

In [21]:
sales_df['Product Name'][::2]

0         Iced Tea
2         Lemonade
4        Milkshake
6         Smoothie
8    Protein Shake
Name: Product Name, dtype: object

### **Data Selection in DataFrame**

#### **Index Location (.iloc)**
- Selects rows (and optionally columns) by integer position, starting at 0.
- A slice or a list of positions returns a DataFrame; a single integer returns that row as a Series.
> Syntax: .iloc[starting_index:ending_index(excluded):step]
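
The return type depends on what is passed to `.iloc`. The sketch below, which reuses the small name/age frame from earlier, shows a few common forms.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Jane", "Smith"], "Age": [30, 28, 26]})

single_row = df.iloc[0]     # one integer   -> Series holding that row
row_slice = df.iloc[0:1]    # a slice       -> DataFrame (end position excluded)
every_other = df.iloc[::2]  # step of 2     -> rows at positions 0 and 2
last_row = df.iloc[-1]      # negative integers count from the end

print(type(single_row).__name__, type(row_slice).__name__)
print(every_other)
```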

In [22]:
# .iloc[starting_index:ending_index(excluded):step]
sales_df.iloc[0:3]

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,42,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,90,90
2,Lemonade,Cold,15,13,6,11,5,143,78,65


In [23]:
# Access rows and columns using slicing parameters
# .iloc[row_starting_index:row_ending_index(excluded), column_starting_index:column_ending_index(excluded)]
sales_df.iloc[0:3, 0:2]

,Product Name,Type
0,Iced Tea,Cold
1,Hot Chocolate,Hot
2,Lemonade,Cold


In [24]:
# Access rows and columns using list
sales_df.iloc[[0,1], [0,1,2]]

,Product Name,Type,Stock
0,Iced Tea,Cold,15
1,Hot Chocolate,Hot,15


#### **Location (.loc)**
- Access a group of rows and columns by label(s) or a boolean array.
> Syntax: .loc[starting_label:ending_label(included):step]
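
A short sketch of the two points above: the inclusive end label, and selection with a boolean array. It uses four rows from the sales data; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade", "Coffee"],
        "Type": ["Cold", "Hot", "Cold", "Hot"],
        "Market Price": [13, 20, 11, 15],
    }
)

loc_rows = df.loc[0:2]    # labels 0, 1, and 2 (end label included)
iloc_rows = df.iloc[0:2]  # positions 0 and 1 (end position excluded)

# A boolean Series can be used as the row selector
is_hot = df["Type"] == "Hot"
hot_prices = df.loc[is_hot, ["Product Name", "Market Price"]]

print(len(loc_rows), len(iloc_rows))
print(hot_prices)
```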

In [25]:
sales_df.loc[0:4, ["Product Name", "Quantity Sold", "Market Price", "Total Revenue"]]

,Product Name,Quantity Sold,Market Price,Total Revenue
0,Iced Tea,6,13,78
1,Hot Chocolate,9,20,180
2,Lemonade,13,11,143
3,Coffee,11,15,165
4,Milkshake,8,19,152


In [26]:
sales_df.loc[0:5, "Product Name":"Market Price"]

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price
0,Iced Tea,Cold,15,6,7,13
1,Hot Chocolate,Hot,15,9,10,20
2,Lemonade,Cold,15,13,6,11
3,Coffee,Hot,15,11,8,15
4,Milkshake,Cold,15,8,9,19
5,Tea,Hot,15,6,7,14


## **Conditional Filtering** 

In [27]:
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,42,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,90,90
2,Lemonade,Cold,15,13,6,11,5,143,78,65
3,Coffee,Hot,15,11,8,15,4,165,88,77
4,Milkshake,Cold,15,8,9,19,3,152,72,80
5,Tea,Hot,15,6,7,14,2,84,42,42
6,Smoothie,Cold,15,14,10,17,5,238,140,98
7,Soda,Cold,15,10,11,18,3,180,110,70
8,Protein Shake,Cold,15,8,8,20,2,160,64,96
9,Matcha Latte,Hot,15,10,9,12,3,120,90,30


In [28]:
# Get all products that have total revenue greater than or equal to 150
# df[df[column_name] condition]
sales_df[sales_df["Total Revenue"] >= 150] 

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit
1,Hot Chocolate,Hot,15,9,10,20,3,180,90,90
3,Coffee,Hot,15,11,8,15,4,165,88,77
4,Milkshake,Cold,15,8,9,19,3,152,72,80
6,Smoothie,Cold,15,14,10,17,5,238,140,98
7,Soda,Cold,15,10,11,18,3,180,110,70
8,Protein Shake,Cold,15,8,8,20,2,160,64,96


In [29]:
# Task: Get all the cold beverages that have a total revenue greater than or equal to 150
# AND - &, OR - |, NOT - ~ (wrap each condition in parentheses)
sales_df[(sales_df["Type"] == "Cold") & (sales_df["Total Revenue"] >= 150)]

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit
4,Milkshake,Cold,15,8,9,19,3,152,72,80
6,Smoothie,Cold,15,14,10,17,5,238,140,98
7,Soda,Cold,15,10,11,18,3,180,110,70
8,Protein Shake,Cold,15,8,8,20,2,160,64,96
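
The OR and NOT cases follow the same pattern as AND. The sketch below uses four rows from the sales data; the threshold of 170 is an arbitrary value chosen for illustration.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade", "Coffee"],
        "Type": ["Cold", "Hot", "Cold", "Hot"],
        "Total Revenue": [78, 180, 143, 165],
    }
)

# Naming each condition keeps the combined filter readable
is_cold = sales["Type"] == "Cold"
high_revenue = sales["Total Revenue"] >= 170

cold_or_high = sales[is_cold | high_revenue]  # OR: either condition is true
not_cold = sales[~is_cold]                    # NOT: invert a condition with ~

print(cold_or_high)
print(not_cold)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the `&`, `|`, and `~` operators are used instead.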


## **Apply**

The apply method runs a custom function over your data. 
On a Series, it calls the function on each value; on a DataFrame, it calls the function 
on each column (axis=0, the default) or on each row (axis=1). 
It returns a new Series or DataFrame built from the results.
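
As a rough sketch of applying a function along each axis of a DataFrame, the example below uses three rows from the sales data. The function and variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Quantity Sold": [6, 9, 13], "Market Price": [13, 20, 11]})


def revenue(row):
    # With axis=1, each call receives one row as a Series
    return row["Quantity Sold"] * row["Market Price"]


row_wise = df.apply(revenue, axis=1)

# With the default axis=0, each call receives one column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

print(row_wise.tolist())
print(col_range)
```

For simple arithmetic like this, multiplying the columns directly (`df["Quantity Sold"] * df["Market Price"]`) gives the same result and is usually faster; `apply` is most useful when the logic cannot be expressed with column operations.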

In [30]:
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,42,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,90,90
2,Lemonade,Cold,15,13,6,11,5,143,78,65
3,Coffee,Hot,15,11,8,15,4,165,88,77
4,Milkshake,Cold,15,8,9,19,3,152,72,80
5,Tea,Hot,15,6,7,14,2,84,42,42
6,Smoothie,Cold,15,14,10,17,5,238,140,98
7,Soda,Cold,15,10,11,18,3,180,110,70
8,Protein Shake,Cold,15,8,8,20,2,160,64,96
9,Matcha Latte,Hot,15,10,9,12,3,120,90,30


In [31]:
def discount(market_price):
    discount_rate = 0.10
    discounted_amount = market_price * discount_rate
    discounted_price = market_price - discounted_amount
    return discounted_price

sales_df["10% Discount"] = sales_df['Market Price'].apply(discount)
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit,10% Discount
0,Iced Tea,Cold,15,6,7,13,1,78,42,36,11.7
1,Hot Chocolate,Hot,15,9,10,20,3,180,90,90,18.0
2,Lemonade,Cold,15,13,6,11,5,143,78,65,9.9
3,Coffee,Hot,15,11,8,15,4,165,88,77,13.5
4,Milkshake,Cold,15,8,9,19,3,152,72,80,17.1
5,Tea,Hot,15,6,7,14,2,84,42,42,12.6
6,Smoothie,Cold,15,14,10,17,5,238,140,98,15.3
7,Soda,Cold,15,10,11,18,3,180,110,70,16.2
8,Protein Shake,Cold,15,8,8,20,2,160,64,96,18.0
9,Matcha Latte,Hot,15,10,9,12,3,120,90,30,10.8


## **Pandas Methods**
Data Analysis:

- sum(): Calculates the sum of a Series or DataFrame
- mean(): Calculates the mean of a Series or DataFrame
- median(): Calculates the median of a Series or DataFrame
- std(): Calculates the standard deviation of a Series or DataFrame
- var(): Calculates the variance of a Series or DataFrame

Data Exploration:

- head(): Shows the first few rows of a DataFrame
- tail(): Shows the last few rows of a DataFrame
- describe(): Generates summary statistics for each column (mean, standard deviation, etc.)
- info(): Displays information about the DataFrame, including data types and memory usage
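
Several of these statistics can also be computed in one call with `agg`, which is not used elsewhere in this notebook. The sketch below applies it to the Rating values from the sales data.

```python
import pandas as pd

# Rating values copied from the sales data
ratings = pd.Series([1, 3, 5, 4, 3, 2, 5, 3, 2, 3], name="Rating")

# agg accepts a list of method names and returns one value per name
summary = ratings.agg(["sum", "mean", "median", "std", "var"])

print(summary.round(2))
```

Note that `std()` and `var()` compute the sample statistics by default (`ddof=1`), so they divide by n - 1 rather than n.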

In [32]:
print("Total revenue: ", sales_df['Total Revenue'].sum())

Total revenue:  1500


In [33]:
print("Mean of Rating Column: ", sales_df['Rating'].mean())

Mean of Rating Column:  3.1


In [34]:
print("Median of Rating: ", sales_df["Rating"].median())

Median of Rating:  3.0


In [35]:
print("Standard Deviation of Rating: ", sales_df["Rating"].std().round(2))
print("Variance of Rating: ", sales_df["Rating"].var().round(2))

Standard Deviation of Rating:  1.29
Variance of Rating:  1.66


In [36]:
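# head() returns the first 5 rows by default; pass a number to change that.
# Only the last expression in a cell is displayed, so the output below is head(3).
sales_df.head()
sales_df.head(3)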

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit,10% Discount
0,Iced Tea,Cold,15,6,7,13,1,78,42,36,11.7
1,Hot Chocolate,Hot,15,9,10,20,3,180,90,90,18.0
2,Lemonade,Cold,15,13,6,11,5,143,78,65,9.9


In [37]:
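# tail() returns the last 5 rows by default; pass a number to change that.
# Only the last expression in a cell is displayed, so the output below is tail(3).
sales_df.tail()
sales_df.tail(3)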

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit,10% Discount
7,Soda,Cold,15,10,11,18,3,180,110,70,16.2
8,Protein Shake,Cold,15,8,8,20,2,160,64,96,18.0
9,Matcha Latte,Hot,15,10,9,12,3,120,90,30,10.8


In [38]:
sales_df.describe()

,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit,10% Discount
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,15.0,9.5,8.5,15.9,3.1,150.0,81.6,68.4,14.31
std,0.0,2.677063,1.581139,3.3483,1.286684,47.56516,29.721672,24.829194,3.01347
min,15.0,6.0,6.0,11.0,1.0,78.0,42.0,30.0,9.9
25%,15.0,8.0,7.25,13.25,2.25,125.75,66.0,47.75,11.925
50%,15.0,9.5,8.5,16.0,3.0,156.0,83.0,73.5,14.4
75%,15.0,10.75,9.75,18.75,3.75,176.25,90.0,87.5,16.875
max,15.0,14.0,11.0,20.0,5.0,238.0,140.0,98.0,18.0


In [39]:
# Information of the table
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Product Name        10 non-null     object 
 1   Type                10 non-null     object 
 2   Stock               10 non-null     int64  
 3   Quantity Sold       10 non-null     int64  
 4   Manufacturing Cost  10 non-null     int64  
 5   Market Price        10 non-null     int64  
 6   Rating              10 non-null     int64  
 7   Total Revenue       10 non-null     int64  
 8   COGS                10 non-null     int64  
 9   Gross Profit        10 non-null     int64  
 10  10% Discount        10 non-null     float64
dtypes: float64(1), int64(8), object(2)
memory usage: 1012.0+ bytes


### **Aggregating Data** (.groupby)

Aggregating data involves summarizing data points into meaningful statistics, 
such as averages, sums, or counts, which can be achieved using GroupBy operations or pivot tables. 
This helps in understanding the dataset at a higher level.
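
A minimal sketch of both approaches on four rows from the sales data. The output column names `total_revenue` and `total_quantity` are hypothetical, and named aggregation with `agg` is an alternative to the step-by-step approach used in the cells of this section.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "Type": ["Cold", "Hot", "Cold", "Hot"],
        "Quantity Sold": [6, 9, 13, 11],
        "Total Revenue": [78, 180, 143, 165],
    }
)

# GroupBy with named aggregation: new_column=(source_column, function)
by_type = sales.groupby("Type").agg(
    total_revenue=("Total Revenue", "sum"),
    total_quantity=("Quantity Sold", "sum"),
)

# The same revenue totals expressed as a pivot table
pivot = sales.pivot_table(index="Type", values="Total Revenue", aggfunc="sum")

print(by_type)
print(pivot)
```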

In [40]:
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,COGS,Gross Profit,10% Discount
0,Iced Tea,Cold,15,6,7,13,1,78,42,36,11.7
1,Hot Chocolate,Hot,15,9,10,20,3,180,90,90,18.0
2,Lemonade,Cold,15,13,6,11,5,143,78,65,9.9
3,Coffee,Hot,15,11,8,15,4,165,88,77,13.5
4,Milkshake,Cold,15,8,9,19,3,152,72,80,17.1
5,Tea,Hot,15,6,7,14,2,84,42,42,12.6
6,Smoothie,Cold,15,14,10,17,5,238,140,98,15.3
7,Soda,Cold,15,10,11,18,3,180,110,70,16.2
8,Protein Shake,Cold,15,8,8,20,2,160,64,96,18.0
9,Matcha Latte,Hot,15,10,9,12,3,120,90,30,10.8


In [41]:
# Step 1: Get the unique values
sales_df['Type'].unique()


array(['Cold', 'Hot'], dtype=object)

In [42]:
# Step 2: Change the data type to category
sales_df['Type'] = sales_df['Type'].astype('category')

In [43]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Product Name        10 non-null     object  
 1   Type                10 non-null     category
 2   Stock               10 non-null     int64   
 3   Quantity Sold       10 non-null     int64   
 4   Manufacturing Cost  10 non-null     int64   
 5   Market Price        10 non-null     int64   
 6   Rating              10 non-null     int64   
 7   Total Revenue       10 non-null     int64   
 8   COGS                10 non-null     int64   
 9   Gross Profit        10 non-null     int64   
 10  10% Discount        10 non-null     float64 
dtypes: category(1), float64(1), int64(8), object(1)
memory usage: 1.0+ KB


In [44]:
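# Task: Get the total revenue for cold and hot beverages
# 'Type' is categorical, so pass observed explicitly to avoid the FutureWarning raised by recent pandas 2.x versions
sales_df.groupby('Type', observed=False)['Total Revenue'].sum()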



Type
Cold    951
Hot     549
Name: Total Revenue, dtype: int64

In [45]:
total_revenue_df = pd.DataFrame()
total_revenue_df['Total Revenue'] = sales_df.groupby('Type', observed=False)['Total Revenue'].sum()
total_revenue_df



Type,Total Revenue
Cold,951
Hot,549


In [46]:
total_revenue_df['Total Quantity Sold'] = sales_df.groupby('Type', observed=False)['Quantity Sold'].sum()
total_revenue_df



Type,Total Revenue,Total Quantity Sold
Cold,951,59
Hot,549,36
