# **Pandas Basics**

### **Install pandas package**

In [77]:
%pip install pandas







### **Import pandas**

In [78]:
import pandas as pd

## **DataFrames**
A DataFrame is a two-dimensional labeled data structure whose columns can hold 
different data types, similar to a spreadsheet or SQL table. 
It is the main pandas object for manipulating and analyzing structured data in Python: 
rows and columns are both labeled, so data can be selected, filtered, and aggregated by name.

In [79]:
# Empty DataFrame
df = pd.DataFrame()
df

In [80]:
# Create a DataFrame from a list of lists
row_data = [["Anam", 18], ["Jericho", 18], ["Justin", 18]]
columns = ["Name", "Age"]
df = pd.DataFrame(row_data, columns=columns)
df

Unnamed: 0,Name,Age
0,Anam,18
1,Jericho,18
2,Justin,18


In [81]:
# Create a DataFrame from a list of dictionaries
data = [{"Name": "Anam", "Age": 18}, 
 {"Name": "Jericho", "Age": 18}, 
 {"Name": "Justin", "Age": 18}]
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Anam,18
1,Jericho,18
2,Justin,18


In [82]:
# Create a DataFrame from a dictionary of lists
data = {
   "Name": ["Anam", "Jericho", "Justin"], 
   "Age" : [18, 18, 18]
}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Anam,18
1,Jericho,18
2,Justin,18


## **Series**

A pandas Series is a one-dimensional labeled array capable of 
holding data of any type (integer, string, float, etc.). 
It's similar to a one-column table or an array with associated labels, 
providing powerful indexing and manipulation capabilities in Python.
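
To make the "labeled" part concrete, here is a minimal sketch of a Series with custom index labels. The ages and the `name="Age"` argument are hypothetical values chosen for illustration, not taken from the cells in this notebook.

```python
import pandas as pd

# Hypothetical values: a Series whose labels are names instead of 0, 1, 2
ages = pd.Series([18, 19, 20], index=["Anam", "Jericho", "Justin"], name="Age")

by_label = ages["Jericho"]    # look up a value by its label
by_position = ages.iloc[0]    # look up a value by its integer position
print(by_label, by_position)
```

When no index is passed, pandas assigns the default labels 0, 1, 2, ... as in the cells that follow.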

In [83]:
series = pd.Series([1, 2, 3, 4, 5])
series

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [84]:
print(type(series))

<class 'pandas.core.series.Series'>


### **Pandas Data Types**

Basic Data Types:
- Integer (int64): Represents whole numbers (e.g., 10, -5). 
    This is the default integer type in pandas (64-bit integer).
- Float (float64): Represents numbers with decimals (e.g., 3.14, -12.5).
- Boolean (bool): Represents logical True or False values.
- Object: A versatile but less efficient type that can store arbitrary Python objects, 
    such as strings, lists, or custom objects. 
    Pandas falls back to this type when it cannot infer a more specific data type.

In [85]:
# Integer (int64)
int_series = pd.Series([1, 2, 3, 4, 5])
int_series

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [86]:
float_series = pd.Series([4.245, 76.563])
float_series

0     4.245
1    76.563
dtype: float64

In [87]:
# Boolean (bool): True = 1, False = 0
boolean_series = pd.Series([True, False, True, False])
boolean_series

0     True
1    False
2     True
3    False
dtype: bool

In [88]:
# Object (Mixed Data Types)
object_series = pd.Series([30, 3.14, True, False, "Anam"])
object_series

0       30
1     3.14
2     True
3    False
4     Anam
dtype: object

Specialized Data Types:
- Datetime (datetime64[ns]): Represents dates and times with nanosecond precision. 
    Useful for time-series data analysis.
- Timedelta (timedelta64[ns]): Represents durations between timestamps.
- Categorical: Represents categorical data with predefined categories. 
    Efficient for storing limited sets of categories.
- Sparse: Represents data in which one value (often NaN or 0) is repeated many times. 
    Stores data efficiently by keeping only the values that differ from that fill value.
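
The efficiency claim for the categorical type can be checked with a rough sketch like the one below. The repeated department list is hypothetical data, and the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column: 6 department names repeated 1000 times each
departments = ["Marketing", "Sales", "Operations", "IT", "Finance", "HR"] * 1000
as_object = pd.Series(departments, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the memory used by the Python string objects
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.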

In [89]:
# Datetime (datetime64[ns])
datetime_series = pd.Series([
        pd.to_datetime("2024-11-05 02:00:00"),
        pd.to_datetime("2024-11-05 05:00:00"),
        pd.to_datetime("2024-11-05 05:30:00")
    ])
datetime_series

0   2024-11-05 02:00:00
1   2024-11-05 05:00:00
2   2024-11-05 05:30:00
dtype: datetime64[ns]

In [90]:
# Datetime: without an explicit format, "05-11-2024" is parsed month-first (May 11, 2024)
datetime_series = pd.Series([
        pd.to_datetime("2024-11-05 02:00:00"),
        pd.to_datetime("2024-11-05 05:00:00"),
        pd.to_datetime("2024-11-05 05:30:00"),
        pd.to_datetime("05-11-2024 05:30:00")
    ])
datetime_series

0   2024-11-05 02:00:00
1   2024-11-05 05:00:00
2   2024-11-05 05:30:00
3   2024-05-11 05:30:00
dtype: datetime64[ns]

In [91]:
# Datetime: an explicit format removes the day/month ambiguity
datetime_series = pd.Series([
        pd.to_datetime("2024-11-05 02:00:00"),
        pd.to_datetime("2024-11-05 05:00:00"),
        pd.to_datetime("2024-11-05 05:30:00"),
        pd.to_datetime("05-11-2024 05:30:00", format="%d-%m-%Y %H:%M:%S"),
        pd.to_datetime("05-2024-11 05:30:00", format="%d-%Y-%m %H:%M:%S"),
    ])
datetime_series

0   2024-11-05 02:00:00
1   2024-11-05 05:00:00
2   2024-11-05 05:30:00
3   2024-11-05 05:30:00
4   2024-11-05 05:30:00
dtype: datetime64[ns]

In [92]:
# Timedelta - duration between timestamps
timedelta_series = pd.Series([
    pd.Timedelta(days=8, hours=3, minutes=15),
    pd.Timedelta(days=4, hours=3, minutes=15),
    pd.Timedelta(days=1, hours=3, minutes=15)])
timedelta_series

0   8 days 03:15:00
1   4 days 03:15:00
2   1 days 03:15:00
dtype: timedelta64[ns]

In [93]:
# Categorical
categorical_series = pd.Series(
    pd.Categorical(["Marketing", "Sales", "Operations", "IT", "Finance", "HR"])
    )
categorical_series

0     Marketing
1         Sales
2    Operations
3            IT
4       Finance
5            HR
dtype: category
Categories (6, object): ['Finance', 'HR', 'IT', 'Marketing', 'Operations', 'Sales']

In [94]:
# Sparse - missing/null values inside of a series
sparse_series = pd.Series(
    pd.arrays.SparseArray([30, 31, 32, pd.NA, 29, 42, pd.NA])
)
sparse_series
# NaN - Not a Number

0     30
1     31
2     32
3    NaN
4     29
5     42
6    NaN
dtype: Sparse[object, nan]

### **Changing Data Types**

In [95]:
# Step 1: Check the datatype
int_series.dtype

dtype('int64')

In [96]:
# Step 2: Change the data type
# astype() returns a new Series with the requested dtype.
# The original Series (int_series) is left unchanged.
float_series = int_series.astype('float64')
float_series.dtype

dtype('float64')

In [97]:
string_series = float_series.astype('string')
string_series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: string

In [98]:
# Casting float to int truncates toward zero (it does not round)
float_series = pd.Series([3.14, -3.14])
integer_series = float_series.astype('int64')
integer_series

0    3
1   -3
dtype: int64
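
If rounding is wanted rather than truncation, one option is to call `round()` before casting. This is a minimal sketch with hypothetical values, chosen so the two approaches give different results.

```python
import pandas as pd

prices = pd.Series([3.7, -3.7])  # hypothetical values

truncated = prices.astype("int64")          # drops the fractional part
rounded = prices.round().astype("int64")    # rounds to the nearest whole number first
print(truncated.tolist(), rounded.tolist())
```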

In [99]:
another_float_series = string_series.astype('float64')
another_float_series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [100]:
another_integer_series = another_float_series.astype('int64')
another_integer_series

0    1
1    2
2    3
3    4
4    5
dtype: int64
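
The conversions above work because every string is a valid number. For messy input, a common alternative is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a hypothetical Series containing one unparseable entry.

```python
import pandas as pd

# Hypothetical messy input: the last entry is not a number
raw = pd.Series(["1.5", "2", "n/a"], dtype="object")

# raw.astype("float64") would raise a ValueError on "n/a";
# errors="coerce" turns unparseable entries into NaN instead.
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)
```

The NaN entries can then be found with `numbers.isna()` and handled explicitly.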

**Example: Sales Data Analysis**

You have a dataset of sales transactions that includes the product name, quantity sold, and sale price. 
You want to analyze the data to find the total revenue per product.

In [102]:
# Create the DataFrame from a dictionary of lists
data = {
    'Product Name':['Iced Tea', 'Hot Chocolate', 'Lemonade', 'Coffee', 'Milkshake', 'Tea', 'Smoothie', 'Soda', 'Protein Shake', 'Matcha Latte'],
    'Type': ['Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Hot'],
    'Stock': [15, 15, 15, 15, 15, 15, 15, 15, 15, 15],
    'Quantity Sold':[6, 9, 13, 11, 8, 6, 14, 10, 8, 10],
    'Cost of Goods Sold':[7, 10, 6, 8, 9, 7, 10, 11, 8, 9],
    'Sale Price':[13, 20, 11, 15, 19, 14, 17, 18, 20, 12],
    'Rating': [1, 3, 5, 4, 3, 2, 5, 3, 3, 3]
}

sales_df = pd.DataFrame(data)
sales_df


Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating
0,Iced Tea,Cold,15,6,7,13,1
1,Hot Chocolate,Hot,15,9,10,20,3
2,Lemonade,Cold,15,13,6,11,5
3,Coffee,Hot,15,11,8,15,4
4,Milkshake,Cold,15,8,9,19,3
5,Tea,Hot,15,6,7,14,2
6,Smoothie,Cold,15,14,10,17,5
7,Soda,Hot,15,10,11,18,3
8,Protein Shake,Cold,15,8,8,20,3
9,Matcha Latte,Hot,15,10,9,12,3


In [103]:
# Total Revenue = Quantity Sold * Sale Price
sales_df["Total Revenue"] = sales_df["Quantity Sold"] * sales_df["Sale Price"]
sales_df

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue
0,Iced Tea,Cold,15,6,7,13,1,78
1,Hot Chocolate,Hot,15,9,10,20,3,180
2,Lemonade,Cold,15,13,6,11,5,143
3,Coffee,Hot,15,11,8,15,4,165
4,Milkshake,Cold,15,8,9,19,3,152
5,Tea,Hot,15,6,7,14,2,84
6,Smoothie,Cold,15,14,10,17,5,238
7,Soda,Hot,15,10,11,18,3,180
8,Protein Shake,Cold,15,8,8,20,3,160
9,Matcha Latte,Hot,15,10,9,12,3,120


In [104]:
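# Gross Profit (naive): total revenue minus the cost of goods sold.
# This subtracts the per-unit cost only once and ignores the quantity sold;
# the next cell corrects it in "Gross Profit V2".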
sales_df["Gross Profit"] = sales_df["Total Revenue"] - sales_df["Cost of Goods Sold"]
sales_df

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,71
1,Hot Chocolate,Hot,15,9,10,20,3,180,170
2,Lemonade,Cold,15,13,6,11,5,143,137
3,Coffee,Hot,15,11,8,15,4,165,157
4,Milkshake,Cold,15,8,9,19,3,152,143
5,Tea,Hot,15,6,7,14,2,84,77
6,Smoothie,Cold,15,14,10,17,5,238,228
7,Soda,Hot,15,10,11,18,3,180,169
8,Protein Shake,Cold,15,8,8,20,3,160,152
9,Matcha Latte,Hot,15,10,9,12,3,120,111


In [105]:
# Gross Profit V2: total revenue minus total cost (quantity sold * per-unit cost)
sales_df["Gross Profit V2"] = sales_df["Total Revenue"] - (sales_df["Quantity Sold"] * sales_df["Cost of Goods Sold"])
sales_df

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2
0,Iced Tea,Cold,15,6,7,13,1,78,71,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90
2,Lemonade,Cold,15,13,6,11,5,143,137,65
3,Coffee,Hot,15,11,8,15,4,165,157,77
4,Milkshake,Cold,15,8,9,19,3,152,143,80
5,Tea,Hot,15,6,7,14,2,84,77,42
6,Smoothie,Cold,15,14,10,17,5,238,228,98
7,Soda,Hot,15,10,11,18,3,180,169,70
8,Protein Shake,Cold,15,8,8,20,3,160,152,96
9,Matcha Latte,Hot,15,10,9,12,3,120,111,30
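
The two gross profit columns differ because of how the cost is treated. The sketch below reworks the Iced Tea row by hand, assuming (as the V2 formula implies) that "Cost of Goods Sold" is a per-unit cost. The variable names are illustrative only.

```python
import pandas as pd

# The Iced Tea row from the table above
row = pd.DataFrame({"Quantity Sold": [6], "Cost of Goods Sold": [7], "Sale Price": [13]})

revenue = row["Quantity Sold"] * row["Sale Price"]              # 6 * 13
total_cost = row["Quantity Sold"] * row["Cost of Goods Sold"]   # 6 * 7
gross_profit = revenue - total_cost

# Equivalent form: per-unit margin multiplied by the quantity sold
per_unit_form = (row["Sale Price"] - row["Cost of Goods Sold"]) * row["Quantity Sold"]
print(gross_profit.iloc[0], per_unit_form.iloc[0])
```

Both forms give 36 for this row, which matches the "Gross Profit V2" column.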


### **Data Selection**

Pandas provides numerous methods for selecting and indexing data in Series and DataFrames, 
including label-based indexing with .loc, integer-position based indexing with .iloc, and conditional selection.

In [106]:
sales_df

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2
0,Iced Tea,Cold,15,6,7,13,1,78,71,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90
2,Lemonade,Cold,15,13,6,11,5,143,137,65
3,Coffee,Hot,15,11,8,15,4,165,157,77
4,Milkshake,Cold,15,8,9,19,3,152,143,80
5,Tea,Hot,15,6,7,14,2,84,77,42
6,Smoothie,Cold,15,14,10,17,5,238,228,98
7,Soda,Hot,15,10,11,18,3,180,169,70
8,Protein Shake,Cold,15,8,8,20,3,160,152,96
9,Matcha Latte,Hot,15,10,9,12,3,120,111,30


### **Data Selection in Series**

In [107]:
# [start:end(excluded):step]
sales_df["Product Name"][0:2]

0         Iced Tea
1    Hot Chocolate
Name: Product Name, dtype: object

In [108]:
# [start:end(excluded):step]
sales_df["Product Name"][::2]

0         Iced Tea
2         Lemonade
4        Milkshake
6         Smoothie
8    Protein Shake
Name: Product Name, dtype: object

### **Data Selection in DataFrame**

#### **Index Location (.iloc)**
- Selects rows (and optionally columns) by integer position.
- A slice returns a DataFrame; a single integer returns that row as a Series.
> Syntax: [starting_index:ending_index(excluded):step]

In [109]:
sales_df

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2
0,Iced Tea,Cold,15,6,7,13,1,78,71,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90
2,Lemonade,Cold,15,13,6,11,5,143,137,65
3,Coffee,Hot,15,11,8,15,4,165,157,77
4,Milkshake,Cold,15,8,9,19,3,152,143,80
5,Tea,Hot,15,6,7,14,2,84,77,42
6,Smoothie,Cold,15,14,10,17,5,238,228,98
7,Soda,Hot,15,10,11,18,3,180,169,70
8,Protein Shake,Cold,15,8,8,20,3,160,152,96
9,Matcha Latte,Hot,15,10,9,12,3,120,111,30


In [110]:
# [start:end(excluded):step]
# Get the first 3 rows/records of the DataFrame
sales_df.iloc[0:3]

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2
0,Iced Tea,Cold,15,6,7,13,1,78,71,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90
2,Lemonade,Cold,15,13,6,11,5,143,137,65


In [111]:
# [start:end(excluded):step]
# Get the first 5 rows/records and the columns from Product Name through Sale Price
sales_df.iloc[0:5, 0:6]

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price
0,Iced Tea,Cold,15,6,7,13
1,Hot Chocolate,Hot,15,9,10,20
2,Lemonade,Cold,15,13,6,11
3,Coffee,Hot,15,11,8,15
4,Milkshake,Cold,15,8,9,19


#### **Location (.loc)**
- Access a group of rows and columns by label(s) or a boolean array.
> Syntax: [starting_index:ending_index(included):step]

In [112]:
sales_df

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2
0,Iced Tea,Cold,15,6,7,13,1,78,71,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90
2,Lemonade,Cold,15,13,6,11,5,143,137,65
3,Coffee,Hot,15,11,8,15,4,165,157,77
4,Milkshake,Cold,15,8,9,19,3,152,143,80
5,Tea,Hot,15,6,7,14,2,84,77,42
6,Smoothie,Cold,15,14,10,17,5,238,228,98
7,Soda,Hot,15,10,11,18,3,180,169,70
8,Protein Shake,Cold,15,8,8,20,3,160,152,96
9,Matcha Latte,Hot,15,10,9,12,3,120,111,30


In [113]:
# [start:end(included):step]
sales_df.loc[1:5, "Product Name":"Sale Price"]

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price
1,Hot Chocolate,Hot,15,9,10,20
2,Lemonade,Cold,15,13,6,11
3,Coffee,Hot,15,11,8,15
4,Milkshake,Cold,15,8,9,19
5,Tea,Hot,15,6,7,14


In [114]:
# [start:end(included):step]
sales_df.loc[1:5, ["Product Name", "Quantity Sold", "Sale Price"]]

Unnamed: 0,Product Name,Quantity Sold,Sale Price
1,Hot Chocolate,9,20
2,Lemonade,13,11
3,Coffee,11,15
4,Milkshake,8,19
5,Tea,6,14
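
With the default 0..n-1 index, labels and positions look identical, which can hide the difference between `.loc` and `.iloc`. The sketch below uses a hypothetical frame indexed by product name to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame indexed by product name instead of 0..n-1
prices = pd.DataFrame(
    {"Sale Price": [13, 20, 11]},
    index=["Iced Tea", "Hot Chocolate", "Lemonade"],
)

by_label = prices.loc["Iced Tea":"Hot Chocolate"]  # label slice: end label is included
by_position = prices.iloc[0:1]                     # position slice: end position is excluded
print(len(by_label), len(by_position))
```

The label slice returns two rows while the position slice returns one, even though both start at the first row.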


## **Conditional Filtering** 

In [115]:
sales_df[sales_df["Total Revenue"] >= 150]


Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90
3,Coffee,Hot,15,11,8,15,4,165,157,77
4,Milkshake,Cold,15,8,9,19,3,152,143,80
6,Smoothie,Cold,15,14,10,17,5,238,228,98
7,Soda,Hot,15,10,11,18,3,180,169,70
8,Protein Shake,Cold,15,8,8,20,3,160,152,96


In [116]:
# Task: Get all the cold beverages that have a total revenue greater than or equal to 150
sales_df[(sales_df["Type"] == "Cold") & (sales_df["Total Revenue"] >= 150)]
# & - ampersand: element-wise AND
# | - pipe: element-wise OR
# ~ - tilde: element-wise NOT (pandas masks use ~, not !)

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2
4,Milkshake,Cold,15,8,9,19,3,152,143,80
6,Smoothie,Cold,15,14,10,17,5,238,228,98
8,Protein Shake,Cold,15,8,8,20,3,160,152,96
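
The other mask operators work the same way as `&`. This is a small sketch on four rows copied from the sales table. The 170 threshold and the `isin` list are arbitrary choices for illustration.

```python
import pandas as pd

# Four rows copied from the sales table above
df = pd.DataFrame({
    "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade", "Coffee"],
    "Type": ["Cold", "Hot", "Cold", "Hot"],
    "Total Revenue": [78, 180, 143, 165],
})

# OR: cold beverages, or anything with a total revenue of at least 170
cold_or_high = df[(df["Type"] == "Cold") | (df["Total Revenue"] >= 170)]
# NOT: everything that is not a cold beverage
not_cold = df[~(df["Type"] == "Cold")]
# Membership test against a list of values
selected = df[df["Product Name"].isin(["Coffee", "Lemonade"])]
```

Each condition needs its own parentheses, because `&`, `|`, and `~` bind more tightly than comparison operators such as `==` and `>=`.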


## **Apply**

The apply method in pandas runs a custom function over your data. 
On a Series it calls the function once per element; on a DataFrame it calls it once per column (or once per row with axis=1). 
It returns a new Series or DataFrame built from the results.

In [117]:
sales_df

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2
0,Iced Tea,Cold,15,6,7,13,1,78,71,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90
2,Lemonade,Cold,15,13,6,11,5,143,137,65
3,Coffee,Hot,15,11,8,15,4,165,157,77
4,Milkshake,Cold,15,8,9,19,3,152,143,80
5,Tea,Hot,15,6,7,14,2,84,77,42
6,Smoothie,Cold,15,14,10,17,5,238,228,98
7,Soda,Hot,15,10,11,18,3,180,169,70
8,Protein Shake,Cold,15,8,8,20,3,160,152,96
9,Matcha Latte,Hot,15,10,9,12,3,120,111,30


In [118]:
def discount(original_price):
    """Return the price after a 10% discount."""
    discount_rate = 0.1
    discount_amount = original_price * discount_rate
    discounted_price = original_price - discount_amount
    return discounted_price

# Series.apply calls discount() once for each Sale Price value
sales_df["10% Discounted Price"] = sales_df["Sale Price"].apply(discount)
sales_df

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2,10% Discounted Price
0,Iced Tea,Cold,15,6,7,13,1,78,71,36,11.7
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90,18.0
2,Lemonade,Cold,15,13,6,11,5,143,137,65,9.9
3,Coffee,Hot,15,11,8,15,4,165,157,77,13.5
4,Milkshake,Cold,15,8,9,19,3,152,143,80,17.1
5,Tea,Hot,15,6,7,14,2,84,77,42,12.6
6,Smoothie,Cold,15,14,10,17,5,238,228,98,15.3
7,Soda,Hot,15,10,11,18,3,180,169,70,16.2
8,Protein Shake,Cold,15,8,8,20,3,160,152,96,18.0
9,Matcha Latte,Hot,15,10,9,12,3,120,111,30,10.8
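
Two variations are worth knowing: row-wise `apply` on a DataFrame, and the vectorized alternative for simple arithmetic. The sketch below uses the first two rows of the sales table. The column name "Revenue (apply)" is made up for this example.

```python
import pandas as pd

# First two rows of the sales table above
df = pd.DataFrame({"Quantity Sold": [6, 9], "Sale Price": [13, 20]})

# Row-wise apply: with axis=1 the function receives one row (a Series) at a time
df["Revenue (apply)"] = df.apply(
    lambda row: row["Quantity Sold"] * row["Sale Price"], axis=1
)

# Vectorized equivalent of a 10% discount: no Python-level loop over the values
df["10% Discounted Price"] = df["Sale Price"] * 0.9
print(df)
```

For plain arithmetic the vectorized form is usually faster and shorter; `apply` is most useful when the logic cannot be written as column operations.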


## **Common Pandas Methods**
Data Analysis:

- sum(): Calculates the sum of a Series or DataFrame
- mean(): Calculates the mean of a Series or DataFrame
- median(): Calculates the median of a Series or DataFrame
- std(): Calculates the standard deviation of a Series or DataFrame
- var(): Calculates the variance of a Series or DataFrame

Data Loading and Exploration:

- head(): Shows the first few rows of a DataFrame
- tail(): Shows the last few rows of a DataFrame
- info(): Displays information about the DataFrame, including data types and memory usage
- describe(): Generates summary statistics for each column (mean, standard deviation, etc.)

In [119]:
sales_df["Total Revenue"].sum()

np.int64(1500)

In [120]:
print("Sum of Total Revenue:", sales_df["Total Revenue"].sum())

Sum of Total Revenue: 1500


In [121]:
print("Average of Total Revenue:", sales_df["Total Revenue"].mean())

Average of Total Revenue: 150.0


In [122]:
total_revenue_list = sales_df["Total Revenue"].tolist()
total_revenue_list.sort()
print(total_revenue_list)

[78, 84, 120, 143, 152, 160, 165, 180, 180, 238]


In [None]:
print("Median of Total Revenue:", sales_df["Total Revenue"].median())

Median of Total Revenue: 156.0


In [126]:
print("Average Rating:", sales_df["Rating"].mean())

Average Rating: 3.2


In [None]:
print("Standard Deviation for Rating:", sales_df["Rating"].std())

Standard Deviation for Rating: 1.2292725943057183


In [129]:
print("Variance for Rating:", sales_df["Rating"].var())

Variance for Rating: 1.5111111111111113
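
By default, `std()` and `var()` in pandas compute the sample statistics (dividing by n - 1, i.e. `ddof=1`). The sketch below rebuilds the Rating column as a standalone Series to show how the two results relate and how to get the population variance instead.

```python
import pandas as pd

# The Rating column from the sales table above
ratings = pd.Series([1, 3, 5, 4, 3, 2, 5, 3, 3, 3])

sample_var = ratings.var()             # default ddof=1: divides by n - 1
population_var = ratings.var(ddof=0)   # divides by n
sample_std = ratings.std()             # square root of the sample variance
print(sample_var, population_var, sample_std)
```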


In [131]:
sales_df.head()

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2,10% Discounted Price
0,Iced Tea,Cold,15,6,7,13,1,78,71,36,11.7
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90,18.0
2,Lemonade,Cold,15,13,6,11,5,143,137,65,9.9
3,Coffee,Hot,15,11,8,15,4,165,157,77,13.5
4,Milkshake,Cold,15,8,9,19,3,152,143,80,17.1


In [132]:
sales_df.head(3)

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2,10% Discounted Price
0,Iced Tea,Cold,15,6,7,13,1,78,71,36,11.7
1,Hot Chocolate,Hot,15,9,10,20,3,180,170,90,18.0
2,Lemonade,Cold,15,13,6,11,5,143,137,65,9.9


In [133]:
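# tail() returns the last 5 rows by default; tail(3) returns the last 3.
# Only the last expression in a cell is displayed, so the output below is from tail(3).
sales_df.tail()
sales_df.tail(3)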

Unnamed: 0,Product Name,Type,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2,10% Discounted Price
7,Soda,Hot,15,10,11,18,3,180,169,70,16.2
8,Protein Shake,Cold,15,8,8,20,3,160,152,96,18.0
9,Matcha Latte,Hot,15,10,9,12,3,120,111,30,10.8


In [134]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Product Name          10 non-null     object 
 1   Type                  10 non-null     object 
 2   Stock                 10 non-null     int64  
 3   Quantity Sold         10 non-null     int64  
 4   Cost of Goods Sold    10 non-null     int64  
 5   Sale Price            10 non-null     int64  
 6   Rating                10 non-null     int64  
 7   Total Revenue         10 non-null     int64  
 8   Gross Profit          10 non-null     int64  
 9   Gross Profit V2       10 non-null     int64  
 10  10% Discounted Price  10 non-null     float64
dtypes: float64(1), int64(8), object(2)
memory usage: 1012.0+ bytes


In [135]:
sales_df.describe()

Unnamed: 0,Stock,Quantity Sold,Cost of Goods Sold,Sale Price,Rating,Total Revenue,Gross Profit,Gross Profit V2,10% Discounted Price
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,15.0,9.5,8.5,15.9,3.2,150.0,141.5,68.4,14.31
std,0.0,2.677063,1.581139,3.3483,1.229273,47.56516,46.528964,24.829194,3.01347
min,15.0,6.0,6.0,11.0,1.0,78.0,71.0,30.0,9.9
25%,15.0,8.0,7.25,13.25,3.0,125.75,117.5,47.75,11.925
50%,15.0,9.5,8.5,16.0,3.0,156.0,147.5,73.5,14.4
75%,15.0,10.75,9.75,18.75,3.75,176.25,166.0,87.5,16.875
max,15.0,14.0,11.0,20.0,5.0,238.0,228.0,98.0,18.0


### **Aggregating Data** (.groupby)

Aggregating data involves summarizing data points into meaningful statistics, 
such as averages, sums, or counts, which can be achieved using GroupBy operations or pivot tables. 
This helps in understanding the dataset at a higher level.

In [136]:
# Get the unique values
sales_df["Type"].unique()

array(['Cold', 'Hot'], dtype=object)

In [137]:
# Change the data type to category
sales_df["Type"] = sales_df["Type"].astype("category")
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   Product Name          10 non-null     object  
 1   Type                  10 non-null     category
 2   Stock                 10 non-null     int64   
 3   Quantity Sold         10 non-null     int64   
 4   Cost of Goods Sold    10 non-null     int64   
 5   Sale Price            10 non-null     int64   
 6   Rating                10 non-null     int64   
 7   Total Revenue         10 non-null     int64   
 8   Gross Profit          10 non-null     int64   
 9   Gross Profit V2       10 non-null     int64   
 10  10% Discounted Price  10 non-null     float64 
dtypes: category(1), float64(1), int64(8), object(1)
memory usage: 1.0+ KB


In [138]:
sales_df.groupby("Type")["Total Revenue"].sum()


Type
Cold    771
Hot     729
Name: Total Revenue, dtype: int64

In [140]:
# Create a new DataFrame that contains the total revenue per type of beverage
total_revenue_based_on_type_df = pd.DataFrame()
total_revenue_based_on_type_df["Total Revenue"] = sales_df.groupby("Type")["Total Revenue"].sum()
total_revenue_based_on_type_df


Type,Total Revenue
Cold,771
Hot,729
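
Several statistics can be computed per group in one call with `agg`, and the same totals can be produced with a pivot table. The sketch below rebuilds only the two relevant columns from the sales table. The names `sales`, `summary`, and `pivot` are illustrative.

```python
import pandas as pd

# Type and Total Revenue columns from the sales table above
sales = pd.DataFrame({
    "Type": ["Cold", "Hot", "Cold", "Hot", "Cold", "Hot", "Cold", "Hot", "Cold", "Hot"],
    "Total Revenue": [78, 180, 143, 165, 152, 84, 238, 180, 160, 120],
})

# One row per beverage type, one column per statistic
summary = sales.groupby("Type")["Total Revenue"].agg(["sum", "mean", "count"])

# The same per-type totals expressed as a pivot table
pivot = sales.pivot_table(index="Type", values="Total Revenue", aggfunc="sum")
print(summary)
print(pivot)
```

Both results are DataFrames indexed by Type, so there is no need to create an empty DataFrame first and assign the grouped column into it.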
