# **Introduction to Pandas**

## __Objective:__

Understand Pandas and its core data structures: Series and DataFrame. Learn how to create, access, and manipulate these structures.


## __Agenda__
- Fundamentals of Pandas
  * Purpose of Pandas
  * Features of Pandas
- Data Structures
- Introduction to Series
- Introduction to Pandas DataFrame
  * Creating a DataFrame Using Different Methods
  * Accessing the DataFrame
  * Understanding DataFrame Basics

**1. What is Pandas? Why is it Useful?**

**Overview:**

Pandas is a Python library built on top of NumPy and is used for data manipulation and analysis. It provides powerful tools to work with structured data efficiently. Businesses, researchers, and data professionals use Pandas to clean, transform, and analyze datasets.

- It introduces data structures like DataFrame and Series that make working with structured data more efficient.

**Key Benefits:**

* Easy data handling with DataFrames and Series

* Fast and efficient data manipulation

* Supports data import/export from multiple file formats

* Integrated with NumPy and other data science libraries


**2. Installing Pandas**

Ensure you have Pandas installed before proceeding.

In [None]:
%pip install pandas

In [1]:
import pandas as pd
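
A quick way to confirm the installation worked is to print the library version. This is only a minimal check; the variable name `installed_version` is illustrative, and the version string you see depends on your environment.

```python
import pandas as pd

# pd.__version__ is a plain string; its exact value depends on the installed release
installed_version = pd.__version__
print("pandas version:", installed_version)
```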

**3. Understanding Pandas Data Structures**

Pandas provides two main data structures, Series and DataFrame:

![Pandas data structures](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Data_Structures.png)

**A. Series (1D Data Structure)**

A Series is a one-dimensional labeled array that can hold any data type (integers, floats, strings, etc.). Think of it as a single column in a spreadsheet.

It can be created with different data inputs:
![Introduction to Series](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Updated_Images/Lesson_4/4_01/Introduction_to_Series.png)

Creating a Series from a List:

In [5]:
data = [10, 20, 30, 40, 50]
data

[10, 20, 30, 40, 50]

In [6]:
import numpy as np

In [7]:
array = np.array(data)
array

array([10, 20, 30, 40, 50])

In [9]:
series = pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
4    50
dtype: int64


Creating a Series with Custom Indexes:

In [10]:
stock_prices = pd.Series([150, 155, 160, 162, 158], 
                         index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"])
print(stock_prices)

Monday       150
Tuesday      155
Wednesday    160
Thursday     162
Friday       158
dtype: int64
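
Once a Series has labels, values can be read by label or by position. The sketch below rebuilds the `stock_prices` Series from the cell above and shows four common access patterns; the result names (`wednesday_price`, `midweek`, and so on) are illustrative only.

```python
import pandas as pd

stock_prices = pd.Series(
    [150, 155, 160, 162, 158],
    index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
)

wednesday_price = stock_prices["Wednesday"]        # lookup by label
first_price = stock_prices.iloc[0]                 # lookup by integer position
midweek = stock_prices.loc["Tuesday":"Thursday"]   # label slices include both endpoints
above_155 = stock_prices[stock_prices > 155]       # boolean mask keeps matching rows

print(wednesday_price, first_price)
print(midweek)
print(above_155)
```

Note that a label slice keeps the end label, unlike ordinary Python list slicing.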


Creating a Series from a Dictionary:

In [11]:
sales = {"Jan": 2000, "Feb": 3000, "Mar": 2500}
sales

{'Jan': 2000, 'Feb': 3000, 'Mar': 2500}

In [12]:
sales_series = pd.Series(sales)
print(sales_series)

Jan    2000
Feb    3000
Mar    2500
dtype: int64
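
When a dictionary is combined with an explicit `index`, Pandas looks up each requested label in the dictionary. The sketch below reuses the `sales` dictionary; the label `"Apr"` is a made-up key that is deliberately missing, to show what happens in that case.

```python
import pandas as pd

sales = {"Jan": 2000, "Feb": 3000, "Mar": 2500}

# Request "Apr" (not a key in the dictionary) and leave out "Feb"
selected = pd.Series(sales, index=["Jan", "Mar", "Apr"])
print(selected)
```

The missing label is filled with `NaN`, and because `NaN` is a floating-point value the Series dtype becomes `float64`.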


Creating a Series from a NumPy Array:

In [13]:
import numpy as np
arr = np.array([5, 10, 15, 20])
arr

array([ 5, 10, 15, 20])

In [14]:
num_series = pd.Series(arr)
print(num_series)

0     5
1    10
2    15
3    20
dtype: int32
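
The `int32` shown above is inherited from the NumPy array, whose default integer width depends on the platform and NumPy version; on many systems the same code prints `int64`. If a specific dtype matters, it can be requested explicitly. A minimal sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

arr = np.array([5, 10, 15, 20])

inherited = pd.Series(arr)                  # dtype follows the array (platform dependent)
explicit = pd.Series(arr, dtype="int64")    # dtype fixed at creation time
as_float = explicit.astype("float64")       # dtype converted after creation

print(inherited.dtype, explicit.dtype, as_float.dtype)
```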


**B. DataFrame (2D Data Structure)**

A DataFrame is a two-dimensional table similar to an Excel spreadsheet or SQL table. It consists of rows and columns.

It is a primary data structure in the Pandas library, providing a versatile and efficient way to handle and manipulate data in Python.

![Introduction to Pandas DataFrame](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/2_Introduction_to_DataFrame/Introduction_to_Pandas_DataFrame.png)

### __Key Features:__
- __Tabular structure:__ The DataFrame is organized as a table with rows and columns, similar to a spreadsheet or SQL table.

- __Labeled axes:__ Both rows and columns are labeled, allowing for easy indexing and referencing of data.

- __Heterogeneous data types:__ Each column in a DataFrame can contain different types of data, such as integers, floats, strings, or even complex objects.

- __Versatility:__ DataFrames can be loaded from and saved to a wide range of sources, including CSV files, Excel workbooks, and SQL databases.

- __Data alignment:__ Operations between DataFrames (or Series) align data by row and column labels. A label that appears in only one operand produces a missing value (NaN) rather than an error.
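
Two of the features above, heterogeneous column types and label alignment, can be seen in a few lines. This is a small sketch with made-up data; the names `people`, `q1`, and `q2` are illustrative.

```python
import pandas as pd

# Each column keeps its own dtype
people = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "Height": [1.65, 1.80],
})
print(people.dtypes)

# Arithmetic aligns on labels, not on position
q1 = pd.Series({"Alice": 10, "Bob": 20})
q2 = pd.Series({"Bob": 5, "Charlie": 7})
total = q1 + q2                      # labels present on only one side become NaN
filled = q1.add(q2, fill_value=0)    # treat a missing side as 0 instead
print(total)
print(filled)
```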

### __Creating a DataFrame Using Different Methods__
Creating a DataFrame is usually the first step in data analysis and manipulation.
- Pandas offers several ways to build a DataFrame, so the right method depends on where the data comes from and how it is structured.
- Python dictionaries, lists, and NumPy arrays can be passed directly to `pd.DataFrame`, while external files such as CSV and Excel are loaded with reader functions such as `pd.read_csv` and `pd.read_excel`.
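
External files are not used elsewhere in this lesson, so here is a minimal sketch of the CSV route. To keep it self-contained, the "file" is an in-memory string wrapped in `io.StringIO`; with a real file you would pass its path to `pd.read_csv` instead. The sample rows are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
print(df_from_csv.dtypes)  # Age is parsed as an integer column
```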

Creating a DataFrame from a Dictionary:

In [16]:
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
data

{'Name': ['Alice', 'Bob', 'Charlie'],
 'Age': [25, 30, 35],
 'City': ['New York', 'Los Angeles', 'Chicago']}

In [18]:
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [19]:
df

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


Creating a DataFrame from a List of Lists:

In [21]:
data = [["Alice", 25, "New York"], 
        ["Bob", 30, "Los Angeles"], 
        ["Charlie", 35, "Chicago"]]

data

[['Alice', 25, 'New York'],
 ['Bob', 30, 'Los Angeles'],
 ['Charlie', 35, 'Chicago']]

In [30]:
df = pd.DataFrame(data, columns=["Name", "Age", "City"])
df

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
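
A closely related input is a list of dictionaries, where each dictionary is one row and the keys supply the column names. The sketch below is an illustration with made-up records; the second record deliberately omits `"City"`.

```python
import pandas as pd

records = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30},                       # no "City" key
    {"Name": "Charlie", "Age": 35, "City": "Chicago"},
]

df_records = pd.DataFrame(records)
print(df_records)
```

Keys missing from a record show up as missing values in that row, so no `columns` argument is needed.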


Creating a DataFrame from Lists Using zip():

In [53]:
names = ["Alice", "Bob", "Charlie"]
ages = [25, 30, 35]

list(zip(names,ages))

[('Alice', 25), ('Bob', 30), ('Charlie', 35)]

In [34]:
df = pd.DataFrame(list(zip(names, ages)), columns=["Name", "Age"])
print(df)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [70]:
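employee_data = {
    "Employee ID": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "Name": ["John Doe", "Jane Smith", "Emily Davis", "Michael Brown", "Mary Hey",
             "John Mark", "Joy Williams", "Esther Solomon", "Mattew Philips", "Solomon Promise"],
    # Note: there is no comma after the second "Marketing", so Python joins the adjacent
    # string literals "Marketing" and "IT" into the single value "MarketingIT".
    # The list still has 10 items, which is why the outputs below show "MarketingIT".
    "Department": ["IT", "HR", "Finance", "Marketing", "IT", "HR", "Finance", "Marketing"
                   "IT", "HR", "IT"],
    "Salary": [75000, 80000, 65000, 72000, 75000, 80000, 65000, 72000, 75000, 80000]
}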


employee_data

{'Employee ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
 'Name': ['John Doe',
  'Jane Smith',
  'Emily Davis',
  'Michael Brown',
  'Mary Hey',
  'John Mark',
  'Joy Williams',
  'Esther Solomon',
  'Mattew Philips',
  'Solomon Promise'],
 'Department': ['IT',
  'HR',
  'Finance',
  'Marketing',
  'IT',
  'HR',
  'Finance',
  'MarketingIT',
  'HR',
  'IT'],
 'Salary': [75000,
  80000,
  65000,
  72000,
  75000,
  80000,
  65000,
  72000,
  75000,
  80000]}

In [71]:
employees_df = pd.DataFrame(employee_data)
print(employees_df)

   Employee ID             Name   Department  Salary
0          101         John Doe           IT   75000
1          102       Jane Smith           HR   80000
2          103      Emily Davis      Finance   65000
3          104    Michael Brown    Marketing   72000
4          105         Mary Hey           IT   75000
5          106        John Mark           HR   80000
6          107     Joy Williams      Finance   65000
7          108   Esther Solomon  MarketingIT   72000
8          109   Mattew Philips           HR   75000
9          110  Solomon Promise           IT   80000


In [72]:
employees_df

   Employee ID             Name   Department  Salary
0          101         John Doe           IT   75000
1          102       Jane Smith           HR   80000
2          103      Emily Davis      Finance   65000
3          104    Michael Brown    Marketing   72000
4          105         Mary Hey           IT   75000
5          106        John Mark           HR   80000
6          107     Joy Williams      Finance   65000
7          108   Esther Solomon  MarketingIT   72000
8          109   Mattew Philips           HR   75000
9          110  Solomon Promise           IT   80000


In [46]:
# Creating a DataFrame from a NumPy array
import numpy as np
data_array = np.array([['Alice', 25, 50000],
                       ['Bob', 30, 60000],
                       ['Charlie', 22, 45000]])

data_array

array([['Alice', '25', '50000'],
       ['Bob', '30', '60000'],
       ['Charlie', '22', '45000']], dtype='<U11')

In [47]:
pd.DataFrame(data_array)

         0   1      2
0    Alice  25  50000
1      Bob  30  60000
2  Charlie  22  45000


In [48]:
# Defining column names and row labels
columns = ['Name', 'Age', 'Salary']
index = ['Mon', 'Tue', 'Wed']

df_array = pd.DataFrame(data_array, columns=columns, index=index)
df_array

        Name  Age  Salary
Mon    Alice   25   50000
Tue      Bob   30   60000
Wed  Charlie   22   45000
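
The `dtype='<U11'` in the array output above means NumPy stored every value, including the ages and salaries, as text, because a NumPy array holds a single type. The sketch below shows one way to restore numeric columns with `astype`; `df_typed` is an illustrative name.

```python
import numpy as np
import pandas as pd

data_array = np.array([['Alice', 25, 50000],
                       ['Bob', 30, 60000],
                       ['Charlie', 22, 45000]])
df_array = pd.DataFrame(data_array, columns=['Name', 'Age', 'Salary'])

# At this point every cell is a string, e.g. '25' rather than 25
print(type(df_array.loc[0, 'Age']))

# Convert the numeric-looking columns back to integers
df_typed = df_array.astype({'Age': 'int64', 'Salary': 'int64'})
print(df_typed.dtypes)
print(df_typed['Salary'].mean())
```

Building the DataFrame from a dictionary or a list of lists avoids the problem, since each column then keeps its own type.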


**4. Basic DataFrame Operations**

**A. Viewing Data**

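In [54]:
employees_df

   Employee ID             Name   Department  Salary
0          101         John Doe           IT   75000
1          102       Jane Smith           HR   80000
2          103      Emily Davis      Finance   65000
3          104    Michael Brown    Marketing   72000
4          105         Mary Hey           IT   75000
5          106        John Mark           HR   80000
6          107     Joy Williams      Finance   65000
7          108   Esther Solomon  MarketingIT   72000
8          109   Mattew Philips           HR   75000
9          110  Solomon Promise           IT   80000


In [63]:
employees_df.size

40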

In [61]:
stock_prices.shape

(5,)

In [65]:
stock_prices.size

5
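
The attributes `shape`, `size`, and `ndim` are related: for a DataFrame, `size` is the number of rows times the number of columns. A small sketch on a made-up frame (the name `frame` is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = frame.shape   # (rows, columns)
print(frame.shape)             # (3, 2)
print(frame.size)              # 6 cells in total
print(frame.ndim)              # 2 dimensions for a DataFrame
print(len(frame))              # number of rows
print(list(frame.columns))     # column labels
```

A Series has a one-element shape such as `(5,)`, so its `size` equals its length.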

In [80]:
employees_df.head()

   Employee ID           Name Department  Salary
0          101       John Doe         IT   75000
1          102     Jane Smith         HR   80000
2          103    Emily Davis    Finance   65000
3          104  Michael Brown  Marketing   72000
4          105       Mary Hey         IT   75000


In [82]:
employees_df.tail()

   Employee ID             Name   Department  Salary
5          106        John Mark           HR   80000
6          107     Joy Williams      Finance   65000
7          108   Esther Solomon  MarketingIT   72000
8          109   Mattew Philips           HR   75000
9          110  Solomon Promise           IT   80000


In [84]:
employees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Employee ID  10 non-null     int64 
 1   Name         10 non-null     object
 2   Department   10 non-null     object
 3   Salary       10 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 452.0+ bytes


In [87]:
employees_df.describe().T

             count     mean          std      min      25%      50%      75%      max
Employee ID   10.0    105.5      3.02765    101.0   103.25    105.5   107.75    110.0
Salary        10.0  73900.0  5586.690533  65000.0  72000.0  75000.0  78750.0  80000.0
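
By default `describe()` on a DataFrame summarizes only the numeric columns, which is why `Name` and `Department` are absent above. Calling `describe()` on a single text column gives a different summary. A minimal sketch on made-up data (`staff` and the result names are illustrative):

```python
import pandas as pd

staff = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "Finance"],
    "Salary": [75000, 80000, 75000, 65000],
})

# Text column: count, number of unique values, most frequent value and its frequency
dept_summary = staff["Department"].describe()
print(dept_summary)

# Numeric column: pick only the statistics you need
salary_stats = staff["Salary"].agg(["min", "max", "mean"])
print(salary_stats)
```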


In [64]:
print(df.head())  # First 5 rows
print("-----------------------------------------")
print(df.tail())  # Last 5 rows
print("-----------------------------------------")
df.info()  # Summary of the DataFrame; info() prints directly and returns None, so print() is not needed
print("-----------------------------------------")
print(df.describe())  # Statistical summary

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
-----------------------------------------
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
-----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes
-----------------------------------------
        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0


**B. Selecting Columns and Rows**

In [99]:
employees_df

   Employee ID             Name   Department  Salary
0          101         John Doe           IT   75000
1          102       Jane Smith           HR   80000
2          103      Emily Davis      Finance   65000
3          104    Michael Brown    Marketing   72000
4          105         Mary Hey           IT   75000
5          106        John Mark           HR   80000
6          107     Joy Williams      Finance   65000
7          108   Esther Solomon  MarketingIT   72000
8          109   Mattew Philips           HR   75000
9          110  Solomon Promise           IT   80000


In [100]:
employees_df['Department'].unique()

array(['IT', 'HR', 'Finance', 'Marketing', 'MarketingIT'], dtype=object)

In [102]:
employees_df['Department'].nunique()

5

In [103]:
employees_df['Department'].value_counts()

Department
IT             3
HR             3
Finance        2
Marketing      1
MarketingIT    1
Name: count, dtype: int64
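
The `MarketingIT` category in the counts above looks like a data-entry slip rather than a real department. Assuming it was meant to be `Marketing` (an assumption, not something the data confirms), one way to clean it is `replace`. The sketch rebuilds just the Department column so it runs on its own.

```python
import pandas as pd

departments = pd.Series(
    ["IT", "HR", "Finance", "Marketing", "IT", "HR", "Finance", "MarketingIT", "HR", "IT"],
    name="Department",
)

# Assumption: "MarketingIT" should have been "Marketing"
cleaned = departments.replace({"MarketingIT": "Marketing"})

print(cleaned.value_counts())                 # absolute counts per department
print(cleaned.value_counts(normalize=True))   # the same counts as fractions of the total
```

`unique()`, `nunique()`, and `value_counts()` are useful for spotting this kind of stray category in the first place.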

In [104]:
employees_df.Name

0           John Doe
1         Jane Smith
2        Emily Davis
3      Michael Brown
4           Mary Hey
5          John Mark
6       Joy Williams
7     Esther Solomon
8     Mattew Philips
9    Solomon Promise
Name: Name, dtype: object
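
Attribute access such as `employees_df.Name` only works when the column name is a valid Python identifier. Bracket notation works for any name and can also select several columns at once. A small sketch on a three-row copy of the employee data (the variable names are illustrative):

```python
import pandas as pd

employees = pd.DataFrame({
    "Employee ID": [101, 102, 103],
    "Name": ["John Doe", "Jane Smith", "Emily Davis"],
    "Salary": [75000, 80000, 65000],
})

names = employees["Name"]                         # one column -> Series
name_and_salary = employees[["Name", "Salary"]]   # list of columns -> DataFrame
ids = employees["Employee ID"]                    # brackets are required: the name contains a space

print(type(names).__name__, type(name_and_salary).__name__)
print(ids.tolist())
```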

In [113]:
employees_df.head()

   Employee ID           Name Department  Salary
0          101       John Doe         IT   75000
1          102     Jane Smith         HR   80000
2          103    Emily Davis    Finance   65000
3          104  Michael Brown  Marketing   72000
4          105       Mary Hey         IT   75000


In [111]:
employees_df.loc[1:3]

   Employee ID           Name Department  Salary
1          102     Jane Smith         HR   80000
2          103    Emily Davis    Finance   65000
3          104  Michael Brown  Marketing   72000


In [117]:
employees_df.iloc[1:3]

   Employee ID         Name Department  Salary
1          102   Jane Smith         HR   80000
2          103  Emily Davis    Finance   65000
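
The two outputs above differ because `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. With the default 0, 1, 2, ... index the labels and positions look identical, which hides the difference. The sketch below uses made-up letter labels to make it visible.

```python
import pandas as pd

scores = pd.DataFrame({"Score": [90, 85, 95, 78]}, index=["a", "b", "c", "d"])

by_label = scores.loc["b":"c"]        # labels b through c, end label included
by_position = scores.iloc[1:3]        # positions 1 and 2, end position excluded
one_cell = scores.loc["c", "Score"]   # a single value by row label and column name

# With a default integer index the same slice bounds return different row counts
default_index = scores.reset_index(drop=True)
print(len(default_index.loc[1:3]), len(default_index.iloc[1:3]))
```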


In [119]:
employees_df['Bonus'] = employees_df['Salary']*0.2
employees_df.head()

   Employee ID           Name Department  Salary    Bonus
0          101       John Doe         IT   75000  15000.0
1          102     Jane Smith         HR   80000  16000.0
2          103    Emily Davis    Finance   65000  13000.0
3          104  Michael Brown  Marketing   72000  14400.0
4          105       Mary Hey         IT   75000  15000.0
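
Assigning to `employees_df['Bonus']` changes the DataFrame in place. When the original should stay untouched, `assign` returns a new DataFrame with the extra column instead. A minimal sketch on a two-row frame (`pay` and `with_bonus` are illustrative names):

```python
import pandas as pd

pay = pd.DataFrame({"Name": ["John Doe", "Jane Smith"], "Salary": [75000, 80000]})

# assign() returns a copy with the new column; `pay` itself is not modified
with_bonus = pay.assign(Bonus=pay["Salary"] * 0.2)

print(list(pay.columns))
print(with_bonus)
```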


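In [125]:
# Remove the 'Bonus' column in place (modifies employees_df directly)
employees_df.drop('Bonus', axis=1, inplace=True)

In [None]:
# Equivalent alternative: reassign the result instead of using inplace=True.
# Run only one of the two cells; dropping a column that is already gone raises a KeyError.
# employees_df = employees_df.drop('Bonus', axis=1)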

In [126]:
employees_df.head()

   Employee ID           Name Department  Salary
0          101       John Doe         IT   75000
1          102     Jane Smith         HR   80000
2          103    Emily Davis    Finance   65000
3          104  Michael Brown  Marketing   72000
4          105       Mary Hey         IT   75000


In [17]:
print(df["Name"])  # Select a single column
print("-----------------------------------------")
print(df.loc[0])  # Select a row by its index label
print("-----------------------------------------")
print(df.iloc[1])  # Select a row by its integer position

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
-----------------------------------------
Name    Alice
Age        25
Name: 0, dtype: object
-----------------------------------------
Name    Bob
Age      30
Name: 1, dtype: object


**C. Adding and Removing Columns**

In [18]:
df["Salary"] = [50000, 60000, 55000]  # Add a new column
print(df.drop("Age", axis=1))  # Remove a column

      Name  Salary
0    Alice   50000
1      Bob   60000
2  Charlie   55000
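
Note that `drop` without `inplace=True` returns a new DataFrame, so `df` in the cell above still contains `Age` after the call. The sketch below makes that visible; `without_age` is an illustrative name, and `columns="Age"` is an equivalent spelling of `"Age", axis=1`.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})
df["Salary"] = [50000, 60000, 55000]      # adds a column in place

without_age = df.drop(columns="Age")      # returns a new DataFrame

print(list(df.columns))           # original still has Age
print(list(without_age.columns))  # copy does not
```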


## __5. Hands-on Exercises__

### __Exercise 1: Creating Series (Beginner)__

* Create a Series from a list of five numbers.

* Assign custom index labels (e.g., "A", "B", "C", "D", "E").

### __Exercise 2: Creating DataFrames (Intermediate)__

* Create a DataFrame with three columns: "Product", "Price", "Stock".

* Populate it with at least four rows of data.

### __Exercise 3: Using Different Data Sources (Advanced)__

* Create a DataFrame from a dictionary containing student names, grades, and subjects.

* Convert two lists into a DataFrame using zip().

### __Exercise 4: Working with Real-World Data (Advanced)__

* Create a DataFrame for an online store with columns: "Order ID", "Customer Name", "Total Amount", and "Status".

* Populate it with five sample orders.

* Retrieve only the orders where "Status" is "Completed".

### __Exercise 5: Modifying DataFrames (Expert)__

* Add a new column to an existing DataFrame with calculated values (e.g., Tax = Price * 0.08).

* Remove a column from the DataFrame.

* Retrieve the first two rows of the modified DataFrame.

### __Summary:__

* Pandas provides Series (1D) and DataFrame (2D) data structures.

* We can create DataFrames from dictionaries, lists, zipped lists, and NumPy arrays; external files such as CSV and Excel are also supported.

* Basic operations include selecting, modifying, and summarizing data.

In the next module, we will explore how to import and export data using Pandas.

### __Sample Solutions to the Exercises__

In [130]:
series_1 = pd.Series(["A", "B", "C", "D", "E"])
series_1

0    A
1    B
2    C
3    D
4    E
dtype: object

In [131]:
# Exercise 1: Series from a list of five numbers with custom index labels
import pandas as pd
numbers = list(range(5))
s = pd.Series(numbers, index=['A', 'B', 'C', 'D', 'E'])
s

A    0
B    1
C    2
D    3
E    4
dtype: int64

In [133]:
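# Exercise 3: DataFrame from a dictionary and from two zipped lists
import pandas as pd

# Part 1: dictionary of equal-length lists, one key per column
data_dict = {
    'Student': ['Alice', 'Bob', 'Charlie', 'David'],
    'Grade': ['A', 'B', 'A', 'C'],
    'Subject': ['Math', 'Science', 'History', 'English']
}
df_dict = pd.DataFrame(data_dict)

# Part 2: zip() pairs the two lists element by element, one tuple per row
names = ['Alice', 'Bob', 'Charlie', 'David']
scores = [90, 85, 95, 78]
df_zip = pd.DataFrame(list(zip(names, scores)), columns=['Student', 'Score'])
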
print(df_dict)
print("---------------------------------------------")
print(df_zip)

   Student Grade  Subject
0    Alice     A     Math
1      Bob     B  Science
2  Charlie     A  History
3    David     C  English
---------------------------------------------
   Student  Score
0    Alice     90
1      Bob     85
2  Charlie     95
3    David     78


In [134]:
# Exercise 2: DataFrame with "Product", "Price", and "Stock" columns
import pandas as pd

# Define the data
data = {
    "Product": ["Laptop", "Smartphone", "Headphones", "Tablet"],
    "Price": [1200, 800, 150, 500],
    "Stock": [10, 25, 50, 15]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

      Product  Price  Stock
0      Laptop   1200     10
1  Smartphone    800     25
2  Headphones    150     50
3      Tablet    500     15


In [137]:
# Exercise 3, part 2: convert two lists into a DataFrame using zip()
import pandas as pd

names = ['Alice', 'Bob', 'Charlie', 'David']
scores = [85, 90, 95, 88]

# Zip the two lists and create a DataFrame
zip_data = list(zip(names, scores))

df_zip = pd.DataFrame(zip_data, columns=['Name', 'Score'])

# Display the DataFrame
print(df_zip)

      Name  Score
0    Alice     85
1      Bob     90
2  Charlie     95
3    David     88


In [140]:
# Exercise 4: one inner list per order (Order ID, Customer Name, Total Amount, Status)
arr = [
    [100, 'Siby', 900.1, 'Done'],
    [200, 'Mathew', 890.9, 'Done'],
    [210, 'Siby', 40.1, 'Done'],
    [300, 'Math', 0.1, 'Rejected'],
    [400, 'Tari', 9.1, 'InProgress'],
]

In [141]:
df = pd.DataFrame(arr, columns=["Order ID", "Customer Name", "Total Amount","Status"])

In [142]:
df

   Order ID Customer Name  Total Amount      Status
0       100          Siby         900.1        Done
1       200        Mathew         890.9        Done
2       210          Siby          40.1        Done
3       300          Math           0.1    Rejected
4       400          Tari           9.1  InProgress


In [143]:
df[df['Status'] == "Done"]

   Order ID Customer Name  Total Amount Status
0       100          Siby         900.1   Done
1       200        Mathew         890.9   Done
2       210          Siby          40.1   Done


In [144]:
data = {
    "Order ID": [1, 2, 3, 4, 5],
    "Customer Name": ["Virat Kohli", "Rohit Sharma", "MS Dhoni", "Hardik Pandya", "KL Rahul"],
    "Total Amount": [250.50, 120.75, 310.00, 99.99, 499.00],
    "Status": ["Completed", "Pending", "Completed", "Completed", "Pending"]
}

orders_df = pd.DataFrame(data)


In [146]:
completed_orders = orders_df[orders_df["Status"] == "Completed"]

completed_orders

   Order ID  Customer Name  Total Amount     Status
0         1    Virat Kohli        250.50  Completed
2         3       MS Dhoni        310.00  Completed
3         4  Hardik Pandya         99.99  Completed


In [147]:
# Define sample data
data = {
    "Order ID": [101, 102, 103, 104, 105],
    "Customer Name": ["Name1", "Name2", "Name3", "Name4", "Name5"],
    "Total Amount": [250.50, 120.00, 75.99, 300.00, 150.75],
    "Status": ["Completed", "Pending", "Completed", "Cancelled", "Completed"]
}

# Create DataFrame
df = pd.DataFrame(data)

# Retrieve only orders where "Status" is "Completed"
completed_orders = df[df["Status"] == "Completed"]

# Print the filtered DataFrame
print(completed_orders)

   Order ID Customer Name  Total Amount     Status
0       101         Name1        250.50  Completed
2       103         Name3         75.99  Completed
4       105         Name5        150.75  Completed
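
The same boolean-mask idea extends to several conditions. Each condition goes in its own parentheses and they are combined with `&` (and) or `|` (or); `isin` checks membership in a list of values. A sketch on order data similar to the cell above (the 100 threshold and the result names are arbitrary choices for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    "Order ID": [101, 102, 103, 104, 105],
    "Total Amount": [250.50, 120.00, 75.99, 300.00, 150.75],
    "Status": ["Completed", "Pending", "Completed", "Cancelled", "Completed"],
})

# Completed orders worth more than 100
large_completed = orders[(orders["Status"] == "Completed") & (orders["Total Amount"] > 100)]

# Orders whose status is either Pending or Cancelled
not_completed = orders[orders["Status"].isin(["Pending", "Cancelled"])]

print(large_completed)
print(not_completed)
```

Filtered results keep their original index labels, which is why the output above shows rows 0, 2, and 4.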
