# **Pandas Basics**

### **Install pandas package**

In [28]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.





### **Import pandas**

In [29]:
import pandas as pd

## **DataFrames**
A DataFrame is a two-dimensional labeled data structure whose columns can hold 
different data types, similar to a spreadsheet or SQL table. 
It provides a flexible way to manipulate and analyze structured data in Python.

In [30]:
df = pd.DataFrame()
df

In [31]:
# Create a DataFrame using list of lists
row_data = [["Eric",18],["Kim",18],["Shane",18]]
df = pd.DataFrame(row_data,columns = ["Name","Age"])
df

,Name,Age
0,Eric,18
1,Kim,18
2,Shane,18


In [32]:
# Create a DataFrame using a dictionary of lists
data = {
    "Name" : ["Eric","Kim","Shane"],
    "Age" : [18, 18, 18]
}
df = pd.DataFrame(data)
df

,Name,Age
0,Eric,18
1,Kim,18
2,Shane,18


In [33]:
# Create a DataFrame using a list of dictionaries
data = [
    { "Name" : "Eric", "Age": 18},
    { "Name" : "Kim", "Age": 18},
    { "Name" : "Shane", "Age": 18}
]
df = pd.DataFrame(data)
df

,Name,Age
0,Eric,18
1,Kim,18
2,Shane,18


In [34]:
print("Data type of DataFrame:", type(df))

Data type of DataFrame: <class 'pandas.core.frame.DataFrame'>
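
Each column of a DataFrame keeps its own data type. A minimal sketch, reusing the toy data from the cells above (the variable name `people_df` is only illustrative):

```python
import pandas as pd

# Same toy data as above: one text column and one integer column
people_df = pd.DataFrame(
    [["Eric", 18], ["Kim", 18], ["Shane", 18]],
    columns=["Name", "Age"],
)

print(people_df.shape)    # (number of rows, number of columns)
print(people_df.dtypes)   # one data type per column
```

`shape` and `dtypes` are attributes, not methods, so they are used without parentheses.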


## **Series**

A pandas Series is a one-dimensional labeled array capable of 
holding data of any type (integer, string, float, etc.). 
It's similar to a one-column table or an array with associated labels, 
providing powerful indexing and manipulation capabilities in Python.

In [35]:
# pd.Series() is a constructor function that is used to create one-dimensional Series objects
series = pd.Series([1, 2, 3, 4, 5])
series

0    1
1    2
2    3
3    4
4    5
dtype: int64
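
The labels of a Series default to 0, 1, 2, and so on, as in the output above. A minimal sketch of a Series with custom labels, assuming illustrative names and ages that are not part of the original data:

```python
import pandas as pd

# Illustrative data: ages labeled by name instead of the default 0..n-1 index
ages = pd.Series([18, 19, 20], index=["Eric", "Kim", "Shane"])

print(ages["Kim"])           # look up a value by its label
print(ages.index.tolist())   # the labels themselves
```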

### **Pandas Data Types**

Basic Data Types:
- Integer (int64): Represents whole numbers (e.g., 10, -5). 
    This is the default integer type in pandas (a 64-bit integer).
- Float (float64): Represents numbers with decimals (e.g., 3.14, -12.5).
- Boolean (bool): Represents logical True or False values.
- Object: A versatile but less efficient type that can store various kinds of data, 
    such as strings, lists, or custom objects. 
    Pandas uses this type when it cannot infer a more specific data type.

In [36]:
# Use type() to check whether an object is a Series or a DataFrame
print(type(series))

<class 'pandas.core.series.Series'>


In [37]:
# Integer (int64)
series = pd.Series([1, 2, 3, 4, 5])
series

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [38]:
# Float (float64)
float_series = pd.Series([3.14, -3.14, 0.0001, -0.0001])
float_series

0    3.1400
1   -3.1400
2    0.0001
3   -0.0001
dtype: float64

In [39]:
# Boolean (bool): (True = 1 or False = 0)
bool_series = pd.Series([True, False, True, False])
bool_series

0     True
1    False
2     True
3    False
dtype: bool

In [40]:
# Objects (Object/Mixed Data Types)
object_series = pd.Series([30, 3.14, True, "John"])
object_series

0      30
1    3.14
2    True
3    John
dtype: object

Specialized Data Types:
- Datetime (datetime64[ns]): Represents dates and times with nanosecond precision. 
    Useful for time-series data analysis.
- Timedelta (timedelta64[ns]): Represents durations between timestamps.
- Categorical: Represents categorical data with predefined categories. 
    Efficient for storing limited sets of categories.
- Sparse: Represents data in which most entries share a single fill value (such as NaN or 0). 
    Stores data efficiently by keeping only the values that differ from the fill value.

In [41]:
# DateTime (Timestamp/ datetime64)
# to_datetime - convert string to datetime
datetime_series = pd.Series([
  pd.to_datetime("2024-08-27 07:00:00"),
  pd.to_datetime("2024-08-27 10:00:00"),
  pd.to_datetime("2024-08-27 12:00:00"),
  pd.to_datetime('now'),
  pd.to_datetime('today'),
])
datetime_series

0   2024-08-27 07:00:00.000000
1   2024-08-27 10:00:00.000000
2   2024-08-27 12:00:00.000000
3   2024-12-03 15:14:45.408469
4   2024-12-03 15:14:45.408652
dtype: datetime64[ns]

In [42]:
# Timedelta
timedelta_series = pd.Series([
  pd.Timedelta(days=8, hours=3, minutes=15, seconds=30),
  pd.Timedelta(days=4, hours=3, minutes=15),
  pd.Timedelta(days=1, hours=3, minutes=15)
])
timedelta_series

0   8 days 03:15:30
1   4 days 03:15:00
2   1 days 03:15:00
dtype: timedelta64[ns]
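
The Categorical type from the list above can be sketched in the same way. The beverage types below are illustrative values, not taken from an earlier cell:

```python
import pandas as pd

# Categorical: values drawn from a small, fixed set of categories
category_series = pd.Series(["Cold", "Hot", "Cold", "Cold"], dtype="category")

print(category_series.dtype)                     # the categorical dtype
print(category_series.cat.categories.tolist())   # the distinct categories
print(category_series.cat.codes.tolist())        # integer code stored for each row
```

Each row stores a small integer code that points to a category, which is why this type tends to save memory when a column repeats a few values many times.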

In [43]:
# Sparse - missing/null values act as the fill value and are not stored individually
sparse_series = pd.Series(
  pd.arrays.SparseArray([30, 31, 32, pd.NA, 29, 42, pd.NA])
)
sparse_series

0     30
1     31
2     32
3    NaN
4     29
5     42
6    NaN
dtype: Sparse[object, nan]

In [44]:
# Integer (int64)
int_series = pd.Series([1, 2, 3, 4, 5])
int_series

0    1
1    2
2    3
3    4
4    5
dtype: int64

### **Changing Data Types**

In [45]:
# Step 1: Check datatype
# dtype is short for 'data type'
int_series.dtype

dtype('int64')

In [46]:
# Step 2: Change the datatype (float)

# astype() "as type": returns a new Series (or DataFrame) whose 
#           data is converted to the specified data type.

# Convert the integer to float
float_series = int_series.astype('float64')
float_series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [47]:
# Converting a float series to string
string_series = float_series.astype('string')
string_series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: string

In [48]:
# Converting the string series back to float
converted_to_float = string_series.astype('float64')
converted_to_float

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
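
The same method works on a single DataFrame column. A minimal sketch, assuming a small frame in which the ages were stored as text (the name `people_df` is illustrative):

```python
import pandas as pd

# Ages stored as text, as can happen when data is typed in or read from a file
people_df = pd.DataFrame({"Name": ["Eric", "Kim", "Shane"], "Age": ["18", "18", "18"]})

# Convert only the Age column; astype returns a new Series, so assign it back
people_df["Age"] = people_df["Age"].astype("int64")

print(people_df["Age"].dtype)
print(people_df["Age"].sum())   # numeric operations now work on the column
```

Note that `astype` raises an error if a value cannot be converted, for example a string that does not contain a number.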

**Example: Sales Data Analysis**

You have a dataset of sales transactions that includes the product name, quantity sold, and sale price. 
You want to analyze the data to find the total revenue per product.

In [49]:
data = {
    'Product Name':['Iced Tea', 'Hot Chocolate' , 'Lemonade', 'Coffee', 'Milkshake', 'Tea', 'Smoothie', 'Soda', 'Protein Shake', 'Matcha Latte'],
    'Type': ['Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Hot', 'Cold', 'Cold', 'Cold', 'Hot'],
    'Stock': [15, 15, 15, 15, 15, 15, 15, 15, 15, 15],
    'Quantity Sold':[6, 9, 13, 11, 8, 6, 14, 10, 8, 10],
    'Manufacturing Cost':[7, 10, 6, 8, 9, 7, 10, 11, 8, 9],
    'Market Price':[13, 20, 11, 15, 19, 14, 17, 18, 20, 12],
    'Rating': [1, 3, 5, 4, 3, 2, 5, 3, 2, 3]
}

In [52]:
# Step 1: Create the DataFrame from the dictionary
sales_df = pd.DataFrame(data)
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating
0,Iced Tea,Cold,15,6,7,13,1
1,Hot Chocolate,Hot,15,9,10,20,3
2,Lemonade,Cold,15,13,6,11,5
3,Coffee,Hot,15,11,8,15,4
4,Milkshake,Cold,15,8,9,19,3
5,Tea,Hot,15,6,7,14,2
6,Smoothie,Cold,15,14,10,17,5
7,Soda,Cold,15,10,11,18,3
8,Protein Shake,Cold,15,8,8,20,2
9,Matcha Latte,Hot,15,10,9,12,3


In [53]:
# Step 2: Get the Product Name column
sales_df['Product Name']

0         Iced Tea
1    Hot Chocolate
2         Lemonade
3           Coffee
4        Milkshake
5              Tea
6         Smoothie
7             Soda
8    Protein Shake
9     Matcha Latte
Name: Product Name, dtype: object

In [54]:
# Step 3: Calculate the Total Revenue.
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Market Price']
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue
0,Iced Tea,Cold,15,6,7,13,1,78
1,Hot Chocolate,Hot,15,9,10,20,3,180
2,Lemonade,Cold,15,13,6,11,5,143
3,Coffee,Hot,15,11,8,15,4,165
4,Milkshake,Cold,15,8,9,19,3,152
5,Tea,Hot,15,6,7,14,2,84
6,Smoothie,Cold,15,14,10,17,5,238
7,Soda,Cold,15,10,11,18,3,180
8,Protein Shake,Cold,15,8,8,20,2,160
9,Matcha Latte,Hot,15,10,9,12,3,120


In [55]:
# Gross Profit: the difference between total revenue and the cost of goods sold (manufacturing cost per unit multiplied by the quantity sold).
sales_df["Gross Profit"] = sales_df["Total Revenue"] - (sales_df["Quantity Sold"] * sales_df["Manufacturing Cost"])
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,90
2,Lemonade,Cold,15,13,6,11,5,143,65
3,Coffee,Hot,15,11,8,15,4,165,77
4,Milkshake,Cold,15,8,9,19,3,152,80
5,Tea,Hot,15,6,7,14,2,84,42
6,Smoothie,Cold,15,14,10,17,5,238,98
7,Soda,Cold,15,10,11,18,3,180,70
8,Protein Shake,Cold,15,8,8,20,2,160,96
9,Matcha Latte,Hot,15,10,9,12,3,120,30


### **Data Selection**

Pandas provides numerous methods for selecting and indexing data in Series and DataFrames, 
including label-based indexing with .loc, integer-position based indexing with .iloc, and conditional selection.

In [57]:
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,90
2,Lemonade,Cold,15,13,6,11,5,143,65
3,Coffee,Hot,15,11,8,15,4,165,77
4,Milkshake,Cold,15,8,9,19,3,152,80
5,Tea,Hot,15,6,7,14,2,84,42
6,Smoothie,Cold,15,14,10,17,5,238,98
7,Soda,Cold,15,10,11,18,3,180,70
8,Protein Shake,Cold,15,8,8,20,2,160,96
9,Matcha Latte,Hot,15,10,9,12,3,120,30


### **Data Selection in Series**

In [58]:
# [starting_index:ending_index(excluded):step]
# Get the first two values of the Product Name column
sales_df["Product Name"][0:2]

0         Iced Tea
1    Hot Chocolate
Name: Product Name, dtype: object

In [61]:
# Select a custom range: the elements at positions 1 through 3
sales_df['Product Name'][1:4]

1    Hot Chocolate
2         Lemonade
3           Coffee
Name: Product Name, dtype: object

In [62]:
# Select every second element (step of 2)
sales_df['Product Name'][::2]

0         Iced Tea
2         Lemonade
4        Milkshake
6         Smoothie
8    Protein Shake
Name: Product Name, dtype: object

In [63]:
# OPTIONAL
# Sum of the first two rows in Quantity Sold Column
print("Sum:", sales_df['Quantity Sold'][0:2].sum())


Sum: 15


### **Data Selection in DataFrame**

#### **Index Location (.iloc)**
- Selects rows (and optionally columns) by integer position.
- A slice returns a DataFrame; a single integer returns that row as a Series.
> Syntax: [starting_index:ending_index(excluded):step]

In [64]:
# Note: iloc is a property
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,90
2,Lemonade,Cold,15,13,6,11,5,143,65
3,Coffee,Hot,15,11,8,15,4,165,77
4,Milkshake,Cold,15,8,9,19,3,152,80
5,Tea,Hot,15,6,7,14,2,84,42
6,Smoothie,Cold,15,14,10,17,5,238,98
7,Soda,Cold,15,10,11,18,3,180,70
8,Protein Shake,Cold,15,8,8,20,2,160,96
9,Matcha Latte,Hot,15,10,9,12,3,120,30


In [65]:
# [starting_index:ending_index(excluded):step]
# Getting the first three rows of the DataFrame
sales_df.iloc[0:3]

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,90
2,Lemonade,Cold,15,13,6,11,5,143,65


In [66]:
# Without using iloc, plain slicing also selects rows by position.
sales_df[0:2]


,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,Gross Profit
0,Iced Tea,Cold,15,6,7,13,1,78,36
1,Hot Chocolate,Hot,15,9,10,20,3,180,90


#### **Location (.loc)**
- Access a group of rows and columns by label(s) or a boolean array.
> Syntax: [starting_label:ending_label(included):step]
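
A minimal sketch of `.loc`, assuming a small illustrative frame (`drinks_df`, a four-row subset of the sales data) with the default 0..n-1 index labels:

```python
import pandas as pd

# A small illustrative frame with the default 0..n-1 index labels
drinks_df = pd.DataFrame({
    "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade", "Coffee"],
    "Type": ["Cold", "Hot", "Cold", "Hot"],
    "Market Price": [13, 20, 11, 15],
})

# Label-based row slice: rows labeled 0 through 2, end label INCLUDED
first_three = drinks_df.loc[0:2]

# Rows and columns together: row labels first, then column labels
names_and_prices = drinks_df.loc[0:2, ["Product Name", "Market Price"]]

# A boolean Series also works as the row selector
hot_names = drinks_df.loc[drinks_df["Type"] == "Hot", "Product Name"]

print(first_three)
print(names_and_prices)
print(hot_names.tolist())
```

Because the end label is included, `.loc[0:2]` returns three rows, while `.iloc[0:2]` returns two.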


## **Conditional Filtering** 

In [67]:
# Get all products that have total revenue greater than or equal to 150
sales_df[sales_df["Total Revenue"] >= 150]

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,Gross Profit
1,Hot Chocolate,Hot,15,9,10,20,3,180,90
3,Coffee,Hot,15,11,8,15,4,165,77
4,Milkshake,Cold,15,8,9,19,3,152,80
6,Smoothie,Cold,15,14,10,17,5,238,98
7,Soda,Cold,15,10,11,18,3,180,70
8,Protein Shake,Cold,15,8,8,20,2,160,96


In [68]:
# Task: Get all cold beverages that have a total revenue greater than or equal to 150
sales_df[(sales_df["Type"] == "Cold") & (sales_df["Total Revenue"] >= 150)]

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,Gross Profit
4,Milkshake,Cold,15,8,9,19,3,152,80
6,Smoothie,Cold,15,14,10,17,5,238,98
7,Soda,Cold,15,10,11,18,3,180,70
8,Protein Shake,Cold,15,8,8,20,2,160,96


In [69]:
# Task: Get all hot beverages that have a total revenue greater than or equal to 150
sales_df[(sales_df['Type'] == 'Hot') & (sales_df['Total Revenue'] >= 150)]

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,Gross Profit
1,Hot Chocolate,Hot,15,9,10,20,3,180,90
3,Coffee,Hot,15,11,8,15,4,165,77


## **Apply**

The apply method in pandas is a powerful tool for working with Series and DataFrames. 
On a Series, it applies a custom function to each value; on a DataFrame, it applies the function 
to each column (or to each row with axis=1). It returns a new Series or DataFrame built from the results.

In [70]:
def discount(original_price):
  discount_rate = 0.1
  discount_amount = original_price * discount_rate
  discounted_price = original_price - discount_amount
  return discounted_price
sales_df["10% Discounted Price"] = sales_df["Market Price"].apply(discount)
sales_df

,Product Name,Type,Stock,Quantity Sold,Manufacturing Cost,Market Price,Rating,Total Revenue,Gross Profit,10% Discounted Price
0,Iced Tea,Cold,15,6,7,13,1,78,36,11.7
1,Hot Chocolate,Hot,15,9,10,20,3,180,90,18.0
2,Lemonade,Cold,15,13,6,11,5,143,65,9.9
3,Coffee,Hot,15,11,8,15,4,165,77,13.5
4,Milkshake,Cold,15,8,9,19,3,152,80,17.1
5,Tea,Hot,15,6,7,14,2,84,42,12.6
6,Smoothie,Cold,15,14,10,17,5,238,98,15.3
7,Soda,Cold,15,10,11,18,3,180,70,16.2
8,Protein Shake,Cold,15,8,8,20,2,160,96,18.0
9,Matcha Latte,Hot,15,10,9,12,3,120,30,10.8
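
The example above applies a function to each value of one column. A minimal sketch of the row-wise form, assuming an illustrative three-row subset of the sales data and a hypothetical helper named `row_revenue`:

```python
import pandas as pd

# Illustrative subset of the sales data
sales = pd.DataFrame({
    "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade"],
    "Quantity Sold": [6, 9, 13],
    "Market Price": [13, 20, 11],
})

def row_revenue(row):
    # With axis=1, each call receives one row as a Series
    return row["Quantity Sold"] * row["Market Price"]

sales["Total Revenue"] = sales.apply(row_revenue, axis=1)
print(sales)
```

For simple arithmetic like this, multiplying the two columns directly (as in the Total Revenue step earlier) is usually faster; row-wise apply is more useful when the logic needs several columns and conditions.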


## **Pandas Operators**
Data Analysis:

- sum(): Calculates the sum of a Series or DataFrame
- mean(): Calculates the mean of a Series or DataFrame
- median(): Calculates the median of a Series or DataFrame
- std(): Calculates the standard deviation of a Series or DataFrame
- var(): Calculates the variance of a Series or DataFrame

Data Exploration:

- head(): Shows the first few rows of a DataFrame
- tail(): Shows the last few rows of a DataFrame
- info(): Displays information about the DataFrame, including data types and memory usage
- describe(): Generates summary statistics for each column (mean, standard deviation, etc.)

In [71]:
print("Sum of the Total Revenue Column:", sales_df["Total Revenue"].sum())

Sum of the Total Revenue Column: 1500


In [72]:
print("Mean/Average of Total Revenue Column:", sales_df["Total Revenue"].mean())

Mean/Average of Total Revenue Column: 150.0


In [73]:
# How the median is computed:
# 1. Sort the values.
# 2. If the count is odd, take the middle value;
#    if it is even, take the mean of the two middle values.
print("Median of Rating Column:", sales_df["Rating"].median())

Median of Rating Column: 3.0


In [76]:
# Average of Rating
print(sales_df["Rating"].mean())

3.1


In [74]:
print("Standard Deviation for Rating Column:", sales_df["Rating"].std().round(2))

Standard Deviation for Rating Column: 1.29
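
The remaining methods from the lists above can be sketched on the same ratings. The Series below copies the Rating values from the sales data so that the cell runs on its own:

```python
import pandas as pd

# Rating values copied from the sales data above
ratings = pd.Series([1, 3, 5, 4, 3, 2, 5, 3, 2, 3], name="Rating")

print(ratings.head(3))      # first three values
print(ratings.tail(3))      # last three values
print(ratings.var())        # sample variance (the square of std())
print(ratings.describe())   # count, mean, std, min, quartiles, max
```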


### **Aggregating Data** (.groupby)

Aggregating data involves summarizing data points into meaningful statistics, 
such as averages, sums, or counts, which can be achieved using GroupBy operations or pivot tables. 
This helps in understanding the dataset at a higher level.
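
A minimal sketch of a GroupBy aggregation, assuming an illustrative four-row subset of the sales data (the names `sales`, `revenue_by_type`, and `summary` are not from the original cells):

```python
import pandas as pd

# Illustrative subset of the sales data
sales = pd.DataFrame({
    "Product Name": ["Iced Tea", "Hot Chocolate", "Lemonade", "Coffee"],
    "Type": ["Cold", "Hot", "Cold", "Hot"],
    "Total Revenue": [78, 180, 143, 165],
})

# Split the rows by Type, then sum the revenue within each group
revenue_by_type = sales.groupby("Type")["Total Revenue"].sum()
print(revenue_by_type)

# Several statistics at once with agg()
summary = sales.groupby("Type")["Total Revenue"].agg(["sum", "mean", "count"])
print(summary)
```

The result is indexed by the group labels, so `revenue_by_type["Cold"]` gives the total for cold beverages.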