# 📘 Mastering Pandas: A Comprehensive Guide for Data Analysts and Data Scientists


Welcome to this comprehensive Pandas tutorial. This notebook is designed for students who have already been introduced to Python basics, data structures, functions, loops, conditional statements, and NumPy. Pandas is one of the most powerful libraries in Python for data manipulation and analysis. By the end of this notebook, you'll be proficient in using Pandas for real-world data analytics tasks.

---
**What You Will Learn:**
- Pandas Series and DataFrame
- Importing and Exporting Data
- Data Selection and Indexing
- Filtering and Conditional Logic
- Handling Missing Data
- Data Transformation
- Grouping and Aggregation
- Merging and Joining
- Time Series Analysis (Intro)
- Useful Pandas Functions
- Exercises
    

## What is Pandas?

Pandas is a powerful open-source Python library used for data manipulation, data cleaning, and data analysis.

Built on top of NumPy, it allows for fast operations on tabular data (data in tables), like spreadsheets or SQL tables.

## 🧱 1. Introduction to Pandas

## Importing Pandas
Once Pandas is installed, import it with the `import` statement. The alias `pd` is the usual convention:

In [191]:

import pandas as pd
import numpy as np


Pandas provides two primary data structures:
- `Series`: One-dimensional labeled array.
- `DataFrame`: Two-dimensional labeled data structure (like a table).

## Checking Pandas Version

The version string is stored in the `__version__` attribute.

In [192]:
# Example

import pandas as pd

print(pd.__version__)

2.0.3


## Pandas Series

What is a Series?

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [193]:
# Example

# Create a simple Pandas Series from a list:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)


0    1
1    7
2    2
dtype: int64


### Labels

If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second has index 1, and so on.

This label can be used to access a specified value.

In [194]:
# Example

# Return the first value of the Series:

print(myvar[0])

1


### Create Labels

With the index argument, you can name your own labels.

In [195]:
# Example
# Create your own labels:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)


x    1
y    7
z    2
dtype: int64


Once you have created labels, you can access an item by referring to its label.

In [196]:
# Example

# Return the value of "y":

print(myvar["y"])

7
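
Label-based and position-based access can be kept explicit with `.loc` and `.iloc`. The sketch below is only an illustration; the variable names `s`, `first_by_position`, and `value_by_label` are chosen here and do not come from the notebook. On a Series with string labels, a plain `s[0]` relies on a positional fallback that newer Pandas releases warn about, so `.iloc[0]` is the safer spelling.

```python
import pandas as pd

# Same data as above, with custom string labels
s = pd.Series([1, 7, 2], index=["x", "y", "z"])

first_by_position = s.iloc[0]   # position 0, regardless of the labels
value_by_label = s.loc["y"]     # the item whose label is "y"

print(first_by_position)
print(value_by_label)
```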


### Key/Value Objects as Series

You can also use a key/value object, like a dictionary, when creating a Series.

In [197]:
# Example

# Create a simple Pandas Series from a dictionary:

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


- Note: The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the `index` argument and list only the keys you want to include in the Series.

In [198]:
# Example

# Create a Series using only data from "day1" and "day2":

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

day1    420
day2    380
dtype: int64
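
One detail worth knowing: if the `index` argument names a key that is not in the dictionary, Pandas does not raise an error but fills the slot with `NaN`. A minimal sketch, where the label `"day4"` and the variable name `partial` are made up for illustration:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is not a key in the dictionary, so its value becomes NaN
partial = pd.Series(calories, index=["day1", "day4"])

print(partial)
```

Because `NaN` is a floating-point value, the resulting Series has dtype `float64` rather than `int64`.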


# Pandas DataFrames

What is a DataFrame?

- A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

## DataFrames

Data sets in Pandas are usually two-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

In [199]:
# Example

# Create a DataFrame from a dictionary of two lists (each list becomes a column):

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)


   calories  duration
0       420        50
1       380        40
2       390        45


In [200]:
# Create a simple Pandas DataFrame:

import pandas as pd

data = {
  "Weight": [48, 71, 55],
  "Ages": [27, 30, 37]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df) 

   Weight  Ages
0      48    27
1      71    30
2      55    37


### Locate Row

As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the `loc` attribute to return one or more specified rows.

In [201]:
#refer to the row index:
print(df.loc[0])

Weight    48
Ages      27
Name: 0, dtype: int64


- Note: This example returns a Pandas Series.

In [202]:
# Return row 0 and 1:

#use a list of indexes:
print(df.loc[[0, 1]])

   Weight  Ages
0      48    27
1      71    30


- Note: When you pass a list of labels to `loc` (the inner `[]`), the result is a Pandas DataFrame, even if the list contains a single label.
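
The `loc` attribute also accepts label slices and an optional column selector. The following sketch rebuilds the small Weight/Ages table under the hypothetical name `df_people`; the other variable names are likewise chosen for illustration.

```python
import pandas as pd

df_people = pd.DataFrame({"Weight": [48, 71, 55], "Ages": [27, 30, 37]})

rows_0_to_1 = df_people.loc[0:1]         # label slice: both 0 and 1 are included
one_row_frame = df_people.loc[[0]]       # list with one label -> DataFrame
weights = df_people.loc[0:1, "Weight"]   # rows 0..1 of a single column -> Series

print(rows_0_to_1)
print(type(one_row_frame).__name__)
print(weights)
```

Unlike ordinary Python slicing, a `loc` slice includes its end label.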

## Named Indexes

With the index argument, you can name your own indexes.

In [203]:
# Example

# Add a list of names to give each row a name:

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45


### Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).

In [204]:
# Example

# Return "day2":

#refer to the named index:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64
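
Named indexes work with the same selectors. As a rough sketch (the names `df_days`, `subset`, and `cell` are illustrative), several named rows can be combined with a column label, and `.at` reads a single cell:

```python
import pandas as pd

df_days = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

subset = df_days.loc[["day1", "day3"], "calories"]  # two named rows, one column
cell = df_days.at["day2", "duration"]               # a single scalar value

print(subset)
print(cell)
```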


## Creating Series and DataFrame

In [205]:
### Series
s = pd.Series([10, 20, 30, 40], name="Numbers")
print(s)


### DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
print(df)

0    10
1    20
2    30
3    40
Name: Numbers, dtype: int64
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000


## Load Files Into a DataFrame

If your data sets are stored in a file, Pandas can load them into a DataFrame.

In [206]:
# Example

# Load a comma separated file (CSV file) into a DataFrame:

import pandas as pd
df = pd.read_csv('C:/Users/DELL 5520/Desktop/My Data Analytics Training Kit/sales_data.csv')



print(df) 

    Transaction_ID        Date     Product     Category  Quantity  Unit_Price  \
0             1000  2024-02-21       Phone  Accessories         4         526   
1             1001  2024-04-02      Laptop  Electronics         1          95   
2             1002  2024-01-15  Headphones  Accessories         1         422   
3             1003  2024-03-12  Headphones  Electronics         4         567   
4             1004  2024-03-01  Headphones  Electronics         3         148   
5             1005  2024-01-21  Smartwatch  Electronics         3         941   
6             1006  2024-03-23      Laptop  Electronics         1         794   
7             1007  2024-03-27  Smartwatch  Accessories         1          86   
8             1008  2024-03-15  Smartwatch  Electronics         3         329   
9             1009  2024-03-15      Laptop  Electronics         3         398   
10            1010  2024-03-28      Laptop  Electronics         4         546   
11            1011  2024-04-

In [207]:
# Load the CSV into a DataFrame:

import pandas as pd

df = pd.read_csv('sales.csv')

print(df.to_string()) 

    sale_id  product_id  employee_id   sale_date  quantity_sold  total_price
0         1           3            2  2023-03-01              3         1993
1         2           4            2  2023-03-11              9          912
2         3           4            3  2023-03-21              3          940
3         4           1            1  2023-03-31              5          963
4         5           1            2  2023-04-10              8          490
5         6           4            3  2023-04-20              9          717
6         7           1            4  2023-04-30              7          903
7         8           3            4  2023-05-10              7         1094
8         9           1            3  2023-05-20              1         1321
9        10           3            2  2023-05-30              5         1506
10       11           4            5  2023-06-09              1          332
11       12           5            5  2023-06-19              1          765

- Tip: use to_string() to print the entire DataFrame.
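
`read_csv` takes optional arguments that shape the result while loading. Since the CSV files used above are not bundled with this text, the sketch below builds a tiny CSV in memory with `io.StringIO`; its three rows copy the first rows of the `sales.csv` output, and the name `sales` is just an illustration.

```python
import io
import pandas as pd

# A small in-memory stand-in for a CSV file on disk
csv_text = """sale_id,sale_date,quantity_sold,total_price
1,2023-03-01,3,1993
2,2023-03-11,9,912
3,2023-03-21,3,940
"""

sales = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["sale_id", "sale_date", "total_price"],  # load only these columns
    parse_dates=["sale_date"],                        # convert text to datetimes
)

print(sales.dtypes)
```

With a real file, the first argument would simply be the file path instead of the `StringIO` object.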

If you have a large DataFrame with many rows, `print(df)` shows only the first 5 rows and the last 5 rows:

In [208]:
# Example

# Print the DataFrame without the to_string() method:

import pandas as pd

df = pd.read_csv('amazon.csv')

print(df) 

      product_id                                       product_name  \
0     B07JW9H4J1  Wayona Nylon Braided USB to Lightning Fast Cha...   
1     B098NS6PVG  Ambrane Unbreakable 60W / 3A Fast Charging 1.5...   
2     B096MSW6CT  Sounce Fast Phone Charging Cable & Data Sync U...   
3     B08HDJ86NZ  boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...   
4     B08CF3B7N1  Portronics Konnect L 1.2M Fast Charging 3A 8 P...   
5     B08Y1TFSP6  pTron Solero TB301 3A Type-C Data and Fast Cha...   
6     B08WRWPM22  boAt Micro USB 55 Tangle-free, Sturdy Micro US...   
7     B08DDRGWTJ             MI Usb Type-C Cable Smartphone (Black)   
8     B008IFXQFU  TP-Link USB WiFi Adapter for PC(TL-WN725N), N1...   
9     B082LZGK39  Ambrane Unbreakable 60W / 3A Fast Charging 1.5...   
10    B08CF3D7QR  Portronics Konnect L POR-1081 Fast Charging 3A...   
11    B0789LZTCJ  boAt Rugged v3 Extra Tough Unbreakable Braided...   
12    B07KSMBL2H  AmazonBasics Flexible Premium HDMI Cable (Blac...   
13    

### max_rows

The number of rows displayed is controlled by the Pandas option settings.

You can check the current maximum with the `pd.options.display.max_rows` attribute.

In [209]:
# Example

# Check the number of maximum returned rows:

import pandas as pd

print(pd.options.display.max_rows) 

9999


- The default value is 60, which means that if the DataFrame contains more than 60 rows, the `print(df)` statement shows only the headers and the first and last 5 rows. (The cell above prints 9999, presumably because the option had already been raised earlier in this session.)

You can change the maximum rows number with the same statement.

In [210]:
# Example

# Increase the maximum number of rows to display the entire DataFrame:

import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('amazon.csv')

print(df) 

      product_id                                       product_name  \
0     B07JW9H4J1  Wayona Nylon Braided USB to Lightning Fast Cha...   
1     B098NS6PVG  Ambrane Unbreakable 60W / 3A Fast Charging 1.5...   
2     B096MSW6CT  Sounce Fast Phone Charging Cable & Data Sync U...   
3     B08HDJ86NZ  boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...   
4     B08CF3B7N1  Portronics Konnect L 1.2M Fast Charging 3A 8 P...   
5     B08Y1TFSP6  pTron Solero TB301 3A Type-C Data and Fast Cha...   
6     B08WRWPM22  boAt Micro USB 55 Tangle-free, Sturdy Micro US...   
7     B08DDRGWTJ             MI Usb Type-C Cable Smartphone (Black)   
8     B008IFXQFU  TP-Link USB WiFi Adapter for PC(TL-WN725N), N1...   
9     B082LZGK39  Ambrane Unbreakable 60W / 3A Fast Charging 1.5...   
10    B08CF3D7QR  Portronics Konnect L POR-1081 Fast Charging 3A...   
11    B0789LZTCJ  boAt Rugged v3 Extra Tough Unbreakable Braided...   
12    B07KSMBL2H  AmazonBasics Flexible Premium HDMI Cable (Blac...   
13    
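
Assigning to `pd.options.display.max_rows` changes the setting for the rest of the session. If the change is only needed for one printout, `pd.option_context` restores the previous value afterwards. A small sketch, using a made-up 100-row DataFrame named `numbers`:

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(100)})   # illustrative data only

before = pd.get_option("display.max_rows")

# The option is changed only inside the with-block
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    print(numbers)   # truncated view with "..." in the middle

after = pd.get_option("display.max_rows")
print(before, inside, after)
```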

## Pandas Read JSON

### Read JSON

- Large data sets are often stored or exchanged as JSON.

JSON is plain text with an object-like structure. It is widely used in programming, and Pandas can read it directly.



In [211]:
# Example

# Load the JSON file into a DataFrame:

import pandas as pd

df = pd.read_json('data.json')

print(df.to_string()) 

FileNotFoundError: File data.json does not exist

- Tip: use to_string() to print the entire DataFrame.

## Dictionary as JSON

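JSON objects look very similar to Python dictionaries.

A JSON object maps string keys to values, much like a Python dictionary does, although JSON itself is text and a dictionary is an in-memory object.

If your data is not in a JSON file but already in a Python dictionary, you can load it into a DataFrame directly: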

In [None]:
# Example

# Load a Python Dictionary into a DataFrame: 

import pandas as pd

data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)

print(df)

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
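
When the JSON arrives as text (for example from a web response) rather than as a file or a dictionary, one option is to parse it with the standard `json` module first. This is a sketch under that assumption; the shortened `json_text` string and the name `df_json` are illustrative.

```python
import json
import pandas as pd

# JSON as plain text: keys and nested keys are strings
json_text = '{"Duration": {"0": 60, "1": 60, "2": 45}, "Pulse": {"0": 110, "1": 117, "2": 109}}'

parsed = json.loads(json_text)   # text -> Python dictionary
df_json = pd.DataFrame(parsed)   # dictionary -> DataFrame

print(df_json)
print(df_json.index.tolist())    # the row labels are the strings "0", "1", "2"
```

Note that the row labels stay strings here, so `df_json.loc["0"]` works while `df_json.loc[0]` would not.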


# Pandas - Analyzing DataFrames

## Viewing the Data

One of the most commonly used methods for getting a quick overview of a DataFrame is the `head()` method.

- The `head()` method returns the headers and a specified number of rows, starting from the top.

In [None]:
# Example

# Get a quick overview by printing the first 10 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('sales.csv')

print(df.head(10))

   sale_id  product_id  employee_id   sale_date  quantity_sold  total_price
0        1           3            2  2023-03-01              3         1993
1        2           4            2  2023-03-11              9          912
2        3           4            3  2023-03-21              3          940
3        4           1            1  2023-03-31              5          963
4        5           1            2  2023-04-10              8          490
5        6           4            3  2023-04-20              9          717
6        7           1            4  2023-04-30              7          903
7        8           3            4  2023-05-10              7         1094
8        9           1            3  2023-05-20              1         1321
9       10           3            2  2023-05-30              5         1506


* **Note: if the number of rows is not specified, the head() method will return the top 5 rows.**

In [None]:
# Example

# Print the first 5 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('sales.csv')

print(df.head())

   sale_id  product_id  employee_id   sale_date  quantity_sold  total_price
0        1           3            2  2023-03-01              3         1993
1        2           4            2  2023-03-11              9          912
2        3           4            3  2023-03-21              3          940
3        4           1            1  2023-03-31              5          963
4        5           1            2  2023-04-10              8          490


There is also a `tail()` method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the bottom.

In [None]:
# Example

# Print the last 5 rows of the DataFrame:

print(df.tail()) 

    sale_id  product_id  employee_id   sale_date  quantity_sold  total_price
15       16           2            2  2023-07-29              9          221
16       17           3            1  2023-08-08              4         1289
17       18           3            5  2023-08-18              7          644
18       19           5            5  2023-08-28              9          728
19       20           4            3  2023-09-07              6         1552


## Info About the Data

The DataFrame object has a method called `info()` that gives you more information about the data set.

In [None]:
# Example

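# Print information about the data.
# Note: info() prints its report itself and returns None, so wrapping it
# in print() also prints "None" at the end; calling df.info() alone is enough.

print(df.info()) 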

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   sale_id        20 non-null     int64 
 1   product_id     20 non-null     int64 
 2   employee_id    20 non-null     int64 
 3   sale_date      20 non-null     object
 4   quantity_sold  20 non-null     int64 
 5   total_price    20 non-null     int64 
dtypes: int64(5), object(1)
memory usage: 1.1+ KB
None


### Null Values
The `info()` method also tells us how many non-null values are present in each column.
In our data set, every column has 20 non-null values.

This means that there are no missing values in the dataset being used.

Empty values, or null values, can distort an analysis, so you should consider removing rows that contain them or filling them in.
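
To count missing values directly instead of reading them off the `info()` report, `isna()` can be combined with `sum()`. The DataFrame below is invented for illustration (`sample` and its values are not from `sales.csv`):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "quantity_sold": [3, np.nan, 5],
    "total_price": [1993, 912, np.nan],
})

missing_per_column = sample.isna().sum()   # number of NaN values in each column
complete_rows = sample.dropna()            # keep only rows without any NaN

print(missing_per_column)
print(complete_rows)
```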

# Importing and Exporting Data


### Reading from CSV
`df = pd.read_csv('data.csv')`

### Exporting to CSV
`df.to_csv('output.csv', index=False)`

### Excel
`df.to_excel('output.xlsx', index=False)` (writing `.xlsx` files requires an Excel engine such as `openpyxl` to be installed)
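
A quick way to check an export is to write the file and read it straight back. The sketch below does this for CSV in a temporary directory so that nothing is left on disk; the `people` DataFrame and the variable names are illustrative.

```python
import os
import tempfile
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.csv")
    people.to_csv(path, index=False)   # index=False: do not write the row labels
    restored = pd.read_csv(path)       # read the same file back

print(restored)
```

Without `index=False`, the row labels would be written as an extra unnamed column and would show up again when the file is read.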

## 4. Selecting and Indexing Data

In [None]:
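```python
import pandas as pd

# Recreate the small example DataFrame used earlier
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 60000, 70000]})

# Selecting a column (returns a Series)
df['Name']

# Selecting multiple columns (returns a DataFrame)
df[['Name', 'Salary']]

# Selecting rows with iloc and loc
df.iloc[0]  # by integer position
df.loc[0]   # by index label (here the labels are the default 0, 1, 2)
```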
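
With the default index, `iloc[0]` and `loc[0]` return the same row, which hides the difference between them. A minimal sketch with a non-default index (the labels 10, 20, 30 and the name `staff` are made up for illustration) makes it visible:

```python
import pandas as pd

staff = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Salary": [50000, 60000, 70000]},
    index=[10, 20, 30],
)

first_row = staff.iloc[0]   # first row by position -> Alice
row_20 = staff.loc[20]      # row whose label is 20 -> Bob
# staff.loc[0] would raise a KeyError here, because no row is labeled 0

print(first_row["Name"], row_20["Name"])
```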

## 🧮 5. Filtering and Conditional Logic

In [None]:
```python
# Filter rows where age > 30
df[df['Age'] > 30]

# Combine conditions
df[(df['Age'] > 25) & (df['Salary'] > 50000)]
```
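
Besides `&` (and), conditions can be combined with `|` (or) and negated with `~`; `isin()` tests membership in a list. The sketch below reuses the Name/Age/Salary example data under the illustrative name `staff`. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

# OR: younger than 30, or earning at least 70000
young_or_top = staff[(staff["Age"] < 30) | (staff["Salary"] >= 70000)]

# NOT: everyone except Bob
not_bob = staff[~(staff["Name"] == "Bob")]

# Membership test against a list of values
selected = staff[staff["Name"].isin(["Alice", "Bob"])]

print(young_or_top)
print(not_bob)
print(selected)
```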

## ❌ 6. Handling Missing Data

In [None]:
```python
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
})

# Check missing
df.isnull()

# Fill missing values
df.fillna(0)

# Drop missing rows
df.dropna()
```
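
Two points are easy to miss here: `fillna()` and `dropna()` return new objects and leave the original unchanged, and the fill value does not have to be a constant. As a sketch (the names `df_missing`, `filled`, and `dropped_a` are illustrative), missing values can be replaced by each column's mean, and rows can be dropped based on a single column:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    "A": [1, 2, np.nan],
    "B": [5, np.nan, np.nan],
})

# Fill each column's gaps with that column's mean (A: 1.5, B: 5.0)
filled = df_missing.fillna(df_missing.mean())

# Drop rows only where column A is missing
dropped_a = df_missing.dropna(subset=["A"])

print(filled)
print(dropped_a)
print(df_missing.isna().sum().sum())   # the original still has its gaps
```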

## 🔁 7. Data Transformation

In [None]:
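```python
import pandas as pd

# Recreate the small example DataFrame used earlier
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 60000, 70000]})

# Apply a function to every value in a column
df['Salary'] = df['Salary'].apply(lambda x: x * 1.1)

# Rename columns
df.rename(columns={'Salary': 'Updated Salary'}, inplace=True)

# Change data types
df['Age'] = df['Age'].astype('float')
```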
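
For simple arithmetic, operating on the whole column at once is usually shorter than `apply` with a lambda, and `np.where` is one common way to build a column from a condition. The sketch below assumes the Name/Salary example data; the column names `Raised` and `Band` and the 60000 threshold are invented for illustration.

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Salary": [50000, 60000, 70000],
})

# Vectorized arithmetic: multiplies every value in the column
staff["Raised"] = staff["Salary"] * 1.1

# Conditional column: "senior" where the condition holds, "junior" elsewhere
staff["Band"] = np.where(staff["Salary"] >= 60000, "senior", "junior")

print(staff)
```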

## 🔄 8. Grouping and Aggregation

In [None]:
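```python
# Recreate the example DataFrame (the previous section renamed 'Salary')
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 60000, 70000]})

# Total salary per name
grouped = df.groupby('Name')['Salary'].sum()
print(grouped)

# Multiple aggregations
df.groupby('Name').agg({'Age': 'mean', 'Salary': 'max'})
```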
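
Grouping becomes more informative when the key repeats. The sketch below adds a hypothetical `Department` column (not part of the original data) so that two rows fall into the same group, and uses named aggregation to give the result columns readable names.

```python
import pandas as pd

staff = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT"],   # hypothetical grouping column
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

# One output row per department; each keyword defines an output column
summary = staff.groupby("Department").agg(
    mean_age=("Age", "mean"),
    max_salary=("Salary", "max"),
)

print(summary)
```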

## 🔗 9. Merging, Joining and Concatenation

In [None]:
```python
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Score': [90, 80]})

# Merge
merged = pd.merge(df1, df2, on='ID')

# Concatenate side by side (axis=1); note that both 'ID' columns are kept
concat = pd.concat([df1, df2], axis=1)
```
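
In the example above every `ID` appears in both tables, so the type of join does not matter. When the keys only partly overlap, the `how` argument decides which rows survive. A sketch with made-up tables `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"ID": [1, 2, 4], "Score": [90, 80, 70]})

inner = pd.merge(left, right, on="ID")                 # default: only IDs in both
left_join = pd.merge(left, right, on="ID", how="left") # all rows from left
outer = pd.merge(left, right, on="ID", how="outer", indicator=True)  # all IDs

print(inner)
print(left_join)
print(outer)
```

The `indicator=True` option adds a `_merge` column that records whether each row came from the left table, the right table, or both.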

## ⏰ 10. Time Series Basics

In [None]:
```python
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
```
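
Dates loaded from a file usually arrive as text (the `sale_date` column showed dtype `object` in the `info()` output earlier). A rough sketch of the usual next steps, with invented values and the illustrative name `daily`: convert with `pd.to_datetime`, use the dates as the index, then slice or resample by time.

```python
import pandas as pd

daily = pd.DataFrame({
    "sale_date": ["2023-03-01", "2023-03-02", "2023-03-03", "2023-03-04"],
    "total_price": [100, 200, 300, 400],
})

daily["sale_date"] = pd.to_datetime(daily["sale_date"])   # text -> datetime
daily = daily.set_index("sale_date")

# Sum the values in consecutive two-day bins
two_day_totals = daily["total_price"].resample("2D").sum()

# Date-string slicing on a datetime index includes both endpoints
window = daily.loc["2023-03-02":"2023-03-03"]

print(two_day_totals)
print(window)
```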

## 🛠️ 11. Useful Pandas Functions

In [None]:
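```python
# Recreate the example DataFrame (df was reassigned in earlier sections)
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 60000, 70000]})

df.describe()                                 # summary statistics for numeric columns
df.info()                                     # column dtypes and non-null counts
df.value_counts()                             # count of each unique row
df.sort_values(by='Salary', ascending=False)  # sort rows by a column, largest first
df.duplicated()                               # boolean mask marking repeated rows
df.drop_duplicates()                          # remove repeated rows
```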

## 🧪 12. Exercises

In [None]:
1. Create a DataFrame of your own with 3 columns and 5 rows.
2. Load a CSV file and perform the following:
   - Show the first 5 rows
   - Count nulls
   - Show summary statistics
   - Filter all rows where a numeric column > 50
3. Group the data and show the average of one column.
4. Merge two datasets on a common key.
5. Create a new column based on conditional logic.
6. Drop all rows with missing data.
7. Save your final DataFrame to a new Excel file.