<center><h1>Introduction to Pandas: A Powerful Data Manipulation and Analysis Library</h1></center>

## Introduction
Welcome to today's lecture on **Pandas in Python**. In this session, we will look at Pandas, a library widely used for data manipulation and analysis in Python. Pandas provides tools for handling structured (tabular) data, which makes it a standard part of the toolkit for data scientists and analysts. By the end of this lecture, you should have a solid foundation in Pandas and the skills to load, inspect, and transform data with it.

## Agenda
1. What is Pandas?
2. Key Data Structures in Pandas
   - Series
   - DataFrame
3. Loading and Saving Data
4. Data Manipulation with Pandas
   - Selection and Filtering
   - Data Transformation
   - Handling Missing Data
   - Merging and Joining Data
5. Exploratory Data Analysis with Pandas
   - Descriptive Statistics
   - Grouping and Aggregation
   - Sorting and Ranking
6. Advanced Functionalities
   - Time Series Analysis
   - Categorical Data Handling
   - Applying Mathematical Functions
7. Data Visualization with Pandas and Matplotlib
8. Best Practices for Working with Pandas
9. Conclusion and Next Steps

## 1. What is Pandas?
- Pandas is an open-source library for data manipulation and analysis in Python.
- It provides high-performance, easy-to-use data structures and data analysis tools.
- Pandas is built on top of NumPy, another popular library for numerical computing in Python.
- It is widely used in data science, machine learning, and data analysis projects.

## 2. Key Data Structures in Pandas
### - Series
- A one-dimensional labeled array capable of holding any data type.
- Series can be created from a list, array, or dictionary.
- It provides labeled indexing, which makes data manipulation and analysis more convenient.
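
As a minimal sketch of labeled indexing, the example below builds a Series from a dictionary. The names and marks are hypothetical sample values chosen for illustration.

```python
import pandas as pd

# Hypothetical data: marks keyed by student name (the keys become the index labels).
marks = pd.Series({"John": 85, "Emma": 92, "Michael": 78}, name="Maths Marks")

emma_mark = marks["Emma"]      # look up a value by its label
top_two = marks.iloc[:2]       # look up values by integer position
above_80 = marks[marks > 80]   # boolean filtering keeps the labels attached

print(emma_mark)
print(above_80)
```

Because the labels travel with the values, the filtered result still shows which student each mark belongs to.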

### - DataFrame
- A two-dimensional labeled data structure, similar to a table in a relational database.
- DataFrame consists of rows and columns, where each column can have a different data type.
- It is widely used for data cleaning, exploration, and analysis tasks.
- DataFrames can be created from various data sources like CSV files, Excel sheets, SQL databases, and more.

## 3. Loading and Saving Data
- Pandas supports reading data from and writing data to various file formats.
- Common file formats include CSV, Excel, JSON, SQL databases, and more.
- We will explore how to load data into Pandas DataFrames and save DataFrame contents to files.

## 4. Data Manipulation with Pandas
### - Selection and Filtering
- Pandas provides powerful mechanisms to select and filter data based on conditions.
- We will learn how to extract specific rows or columns from a DataFrame.

### - Data Transformation
- Pandas allows us to transform data by applying functions, mathematical operations, or custom operations on columns or rows.
- We will explore how to perform these transformations efficiently.

### - Handling Missing Data
- Missing data is a common issue in real-world datasets.
- Pandas provides functionalities to handle missing data, including filling missing values and removing rows or columns with missing data.
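
The options above can be sketched on a small DataFrame that actually contains gaps. The data here is hypothetical, and filling `Age` with the column mean and `City` with `"Unknown"` is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing City.
df_missing = pd.DataFrame({
    "Name": ["John", "Emma", "Michael"],
    "Age": [25, np.nan, 28],
    "City": ["New York", "London", None],
})

# Count missing values per column.
missing_per_column = df_missing.isna().sum()

# Fill each column with its own replacement value.
filled = df_missing.fillna({"Age": df_missing["Age"].mean(), "City": "Unknown"})

# Drop rows (axis=0) or columns (axis=1) that contain any missing value.
rows_dropped = df_missing.dropna(axis=0)
cols_dropped = df_missing.dropna(axis=1)

print(missing_per_column)
print(filled)
```

Dropping is simpler but discards data: here only one row survives `dropna(axis=0)` and only the `Name` column survives `dropna(axis=1)`.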

### - Merging and Joining Data
- Pandas allows us to combine multiple DataFrames based on common columns or indices.
- We will learn how to merge and join DataFrames to create a unified dataset.
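
A minimal sketch of merging on a common column follows. The two tables and their values are hypothetical; `pd.merge` with `on` and `how` is the standard Pandas call for this.

```python
import pandas as pd

# Hypothetical tables that share the 'Roll Number' column.
students = pd.DataFrame({
    "Roll Number": [1, 2, 3],
    "Student Name": ["John", "Emma", "Michael"],
})
marks = pd.DataFrame({
    "Roll Number": [2, 3, 4],
    "Maths Marks": [92, 78, 90],
})

# Inner join: keep only roll numbers present in both tables.
inner = pd.merge(students, marks, on="Roll Number", how="inner")

# Left join: keep every student; missing marks become NaN.
left = pd.merge(students, marks, on="Roll Number", how="left")

print(inner)
print(left)
```

The `how` argument controls which keys survive: the inner join keeps roll numbers 2 and 3, while the left join keeps all three students and leaves John's mark empty.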

## 5. Exploratory Data Analysis with Pandas
### - Descriptive Statistics
- Pandas provides easy-to-use methods for calculating descriptive statistics of numeric data.
- We will explore functions like mean, median, standard deviation, and more.

### - Grouping and Aggregation
- Pandas supports grouping data based on one or more columns, enabling aggregation operations.
- We will learn how to group data and perform aggregation functions like sum, count, and mean.
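
Grouping and aggregation can be sketched as follows. The `staff` table is hypothetical sample data with a `Gender` and a `Salary` column.

```python
import pandas as pd

# Hypothetical staff data.
staff = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Male"],
    "Salary": [50000, 60000, 55000, 65000, 52000, 30000],
})

# Group by one column, then apply several aggregations to another.
summary = staff.groupby("Gender")["Salary"].agg(["sum", "count", "mean"])

print(summary)
```

The result has one row per group and one column per aggregation function.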

### - Sorting and Ranking
- Sorting data is crucial for understanding patterns and trends.
- Pandas allows us to sort data based on one or more columns and rank the data accordingly.
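
Here is a short sketch of sorting on two columns and then ranking. The ratings are hypothetical, and `method="min"` is one of several tie-breaking options that `rank()` supports.

```python
import pandas as pd

# Hypothetical ratings with a tie between John and Michael.
scores = pd.DataFrame({
    "Name": ["John", "Emma", "Michael", "Sophia"],
    "Rating": [4.5, 3.8, 4.5, 3.9],
})

# Sort by Rating (highest first), breaking ties alphabetically by Name.
ranked = scores.sort_values(["Rating", "Name"], ascending=[False, True])

# Rank by Rating; with method="min", tied rows share the lowest rank of the tie.
ranked["Rank"] = ranked["Rating"].rank(method="min", ascending=False).astype(int)

print(ranked)
```

With this tie-breaking rule the two 4.5 ratings both receive rank 1, and the next rating receives rank 3.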

## 6. Advanced Functionalities
### - Time Series Analysis
- Pandas provides specialized tools for handling time series data.
- We will explore how to work with dates, perform time-based indexing, and calculate rolling statistics.
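
The sketch below illustrates time-based indexing and a rolling statistic. The dates and sales figures are hypothetical, and the 3-day window is an arbitrary choice.

```python
import pandas as pd

# Hypothetical daily sales indexed by date.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.Series([10, 20, 30, 40, 50, 60], index=dates, name="Sales")

# Time-based indexing: slice by date strings (both endpoints are included).
first_three_days = sales.loc["2024-01-01":"2024-01-03"]

# Rolling statistic: mean over a 3-day window (the first two values are NaN).
rolling_mean = sales.rolling(window=3).mean()

print(first_three_days)
print(rolling_mean)
```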

### - Categorical Data Handling
- Categorical data requires special treatment during data analysis.
- Pandas provides categorical data types and functions for handling categorical data effectively.
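
As a small sketch of the categorical type, the example below defines an ordered category. The size labels and their order are hypothetical.

```python
import pandas as pd

# Hypothetical text data with a natural order.
sizes = pd.Series(["medium", "small", "large", "small"])

# Declare the allowed categories and their order.
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes_cat = sizes.astype(size_type)

# Each value is stored as a small integer code pointing into the category list.
codes = sizes_cat.cat.codes

# Ordered categories support comparisons and sort in category order.
at_least_medium = sizes_cat[sizes_cat >= "medium"]
sorted_sizes = sizes_cat.sort_values()

print(codes.tolist())
print(sorted_sizes.tolist())
```

Sorting follows the declared order (small, medium, large) rather than alphabetical order.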

### - Applying Mathematical Functions
- Pandas allows us to apply mathematical functions or custom functions to data efficiently.
- We will learn how to apply functions to columns or rows to derive meaningful insights.

## 7. Data Visualization with Pandas and Matplotlib
- Pandas offers a built-in `.plot()` method on Series and DataFrames, which uses Matplotlib as its default backend; it also integrates well with other visualization libraries.
- We will explore how to create visualizations directly from Pandas DataFrames using Matplotlib.

## 8. Best Practices for Working with Pandas
- We will discuss some best practices to follow while working with Pandas to ensure efficient and effective data analysis.
- These practices include optimizing memory usage, using vectorized operations, and writing efficient code.
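
Two of these practices can be sketched on hypothetical data: replacing a row-by-row `apply()` with a vectorized operation, and storing a repetitive text column as a category. The exact memory savings depend on the data and the Pandas version.

```python
import pandas as pd

# Hypothetical data, repeated so the memory difference is visible.
staff = pd.DataFrame({
    "Salary": [50000, 60000, 55000, 65000] * 500,
    "Gender": ["Male", "Female", "Male", "Female"] * 500,
})

# Row by row: calls a Python function once per value.
bonus_apply = staff["Salary"].apply(lambda salary: salary + 1000)

# Vectorized: one operation on the whole column, with the same result.
bonus_vectorized = staff["Salary"] + 1000

# A text column with few distinct values usually needs less memory as a category.
bytes_as_text = staff["Gender"].memory_usage(index=False, deep=True)
bytes_as_category = staff["Gender"].astype("category").memory_usage(index=False, deep=True)

print(bonus_apply.equals(bonus_vectorized))
print(bytes_as_text, bytes_as_category)
```

The vectorized form is generally preferred when an equivalent column-wise operation exists; `apply()` remains useful for logic that cannot be expressed that way.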

## 9. Conclusion and Next Steps
- By the end of this lecture, you will have a solid understanding of Pandas and its capabilities for data manipulation and analysis.
- You will be equipped with the necessary skills to explore and analyze datasets efficiently.
- Take this knowledge and continue practicing and exploring more advanced topics to further enhance your data analysis skills.

I hope this lecture on Pandas provides you with valuable insights and equips you with the necessary skills to work with data effectively. Best of luck with your future data-related endeavors!


In [2]:
import pandas as pd
import numpy as np

# Step 1: Create an array
data = np.array([10, 20, 30, 40, 50])

# Step 2: Convert the array to a Series
series = pd.Series(data)
series

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [3]:

# Step 3: Save the Series as a CSV file
series.to_csv('data.csv', index=False)

# Step 4: Save the Series as an Excel file
series.to_excel('data.xlsx', index=False)


In [5]:
data = pd.read_csv("data.csv")
data

    0
0  10
1  20
2  30
3  40
4  50


In [7]:
data1 = pd.read_excel("data.xlsx")
data1

    0
0  10
1  20
2  30
3  40
4  50


In [21]:
import pandas as pd
import numpy as np

# Step 1: Create an array of numbers, names, and roll numbers
numbers = np.array([10, 20, 30, 40, 50])
names = np.array(['John', 'Emma', 'Michael', 'Sophia', 'William'])
roll_numbers = np.array([101, 102, 103, 104, 105])

# Step 2: Convert arrays into Series
numbers_series = pd.Series(numbers, name='Numbers')
names_series = pd.Series(names, name='Names')
roll_numbers_series = pd.Series(roll_numbers, name='Roll Numbers')

# Step 3: Merge the Series into a single DataFrame
df = pd.concat([numbers_series, names_series, roll_numbers_series], axis=1)

# Step 4: Print the merged DataFrame
print(df)
df.ndim

   Numbers    Names  Roll Numbers
0       10     John           101
1       20     Emma           102
2       30  Michael           103
3       40   Sophia           104
4       50  William           105


2

In [35]:
import pandas as pd
import numpy as np

# Step 1: Create an array of numbers, names, and roll numbers
numbers = np.array([10, 20, 30, 40, 50])
names = np.array(['John', 'Emma', 'Michael', 'Sophia', 'William'])
roll_numbers = np.array([101, 102, 103, 104, 105])

# Step 2: Convert arrays into Series
numbers_series = pd.Series(numbers, name='Numbers')
names_series = pd.Series(names, name='Names')
roll_numbers_series = pd.Series(roll_numbers, name='Roll Numbers')

# Step 3: Merge the Series into a single DataFrame
df = pd.concat([numbers_series, names_series, roll_numbers_series], axis=1)

# Step 4: Save the DataFrame in different file formats
df.to_excel('merged_data.xlsx', index=False)  # Save as XLSX
df.to_csv('merged_data.csv', index=False)  # Save as CSV
df.to_json('merged_data.json', orient='records')  # Save as JSON
# df.to_parquet('merged_data.parquet', engine='pyarrow')  # Save as Parquet
#df.to_feather('merged_data.feather')  # Save as Feather

# Optional: Save as HTML for viewing in a web browser
df.to_html('merged_data.html', index=False)

# Optional: Save as Markdown for documentation purposes
#df.to_markdown('merged_data.md', index=False)


In [23]:
!pip install pyarrow



In [36]:
import pandas as pd

# Step 1: Create a dictionary of student data
data = {
    'Roll Number': [1, 2, 3, 4, 5],
    'Student Name': ['John', 'Emma', 'Michael', 'Sophia', 'William'],
    'Maths Marks': [85, 92, 78, 90, 88],
    'Science Marks': [90, 88, 92, 85, 80],
    'English Marks': [78, 85, 88, 90, 92]
}

# Step 2: Create a DataFrame (each dictionary key becomes a column)
df = pd.DataFrame(data)

# For comparison: pd.Series(data) keeps one element per key, each holding a whole list
df_series = pd.Series(data)
df

   Roll Number Student Name  Maths Marks  Science Marks  English Marks
0            1         John           85             90             78
1            2         Emma           92             88             85
2            3      Michael           78             92             88
3            4       Sophia           90             85             90
4            5      William           88             80             92


In [9]:
df_series

Roll Number                             [1, 2, 3, 4, 5]
Student Name     [John, Emma, Michael, Sophia, William]
Maths Marks                        [85, 92, 78, 90, 88]
Science Marks                      [90, 88, 92, 85, 80]
English Marks                      [78, 85, 88, 90, 92]
dtype: object

"""df[['Maths Marks', 'Science Marks', 'English Marks']] selects the columns 'Maths Marks', 'Science Marks', 
and 'English Marks' from the DataFrame df. This creates a new DataFrame with only these selected columns.

.mean() calculates the mean for each column in the selected DataFrame.
By default, it calculates the mean along the column axis, 
resulting in a Series containing the mean values for each column.

The second .mean() then averages those per-column means.
Because every column has the same number of students, the result is the class average.

So, by applying the .mean().mean() operations,
we obtain the average of the means of the columns 'Maths Marks', 'Science Marks', 
and 'English Marks', which represents the class average."""

In [39]:

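# Step 3: Calculate each student's percentage and the class average
# (the row-wise mean equals the percentage only if each subject is marked out of 100)
df['Percentage'] = df[['Maths Marks', 'Science Marks', 'English Marks']].mean(axis=1)
class_average = df[['Maths Marks', 'Science Marks', 'English Marks']].mean().mean()
class_average  # not displayed: a notebook cell shows only its last expression
df["Percentage"]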

0    84.333333
1    88.333333
2    86.000000
3    88.333333
4    86.666667
Name: Percentage, dtype: float64

In [38]:
# Step 4: Save the DataFrame as a CSV file
df.to_csv('student_data.csv', index=False)

# Step 5: Save the DataFrame as an Excel file
df.to_excel('student_data.xlsx', index=False)


# iloc & loc

In [11]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Michael', 'Sophia', 'William'],
    'Age': [25, 30, 28, 32, 27],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']
}



In [41]:
df = pd.DataFrame(data)
df


      Name  Age      City
0     John   25  New York
1     Emma   30    London
2  Michael   28     Paris
3   Sophia   32     Tokyo
4  William   27    Sydney


In [42]:
# Using .iloc[] for selecting rows and columns by integer positions
print("Using iloc:")
row_1 = df.iloc[1]  # Select the row at integer position 1 (the second row)
print(row_1)


Using iloc:
Name      Emma
Age         30
City    London
Name: 1, dtype: object


In [28]:
# Select the rows at positions 1, 2, and 3 (the end position, 4, is excluded)
rows_2_to_4 = df.iloc[1:4]
print(rows_2_to_4)



      Name  Age    City
1     Emma   30  London
2  Michael   28   Paris
3   Sophia   32   Tokyo


In [29]:
col_2 = df.iloc[:, 1]  # Select the second column
print(col_2)

0    25
1    30
2    28
3    32
4    27
Name: Age, dtype: int64


In [23]:

print("\nUsing loc:")
# Using .loc[] for selecting rows and columns by labels
row_2 = df.loc[1]  # Select the row with index label 1
print(row_2)



Using loc:
Name      Emma
Age         30
City    London
Name: 1, dtype: object


In [24]:
rows_2_to_4 = df.loc[1:3]  # Select rows from index label 1 to 3 (inclusive)
print(rows_2_to_4)


      Name  Age    City
1     Emma   30  London
2  Michael   28   Paris
3   Sophia   32   Tokyo


In [26]:
col_city = df.loc[:, 'City']  # Select the 'City' column
print(col_city)



0    New York
1      London
2       Paris
3       Tokyo
4      Sydney
Name: City, dtype: object
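
With the default 0, 1, 2, ... index used above, `.loc[]` and `.iloc[]` often return the same rows, which can hide the difference between them. The sketch below uses a hypothetical DataFrame with a custom index (10, 20, 30) to make the distinction visible.

```python
import pandas as pd

# Hypothetical data with index labels that differ from the integer positions.
people = pd.DataFrame(
    {"Name": ["John", "Emma", "Michael"], "Age": [25, 30, 28]},
    index=[10, 20, 30],
)

by_label = people.loc[20]          # the row whose index label is 20
by_position = people.iloc[1]       # the row at integer position 1 (same row here)

label_slice = people.loc[10:20]    # label slice: both endpoints included
position_slice = people.iloc[0:1]  # position slice: end position excluded

print(label_slice)
print(position_slice)
```

Here `people.loc[1]` would raise a `KeyError`, because no row carries the label 1.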


In [37]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Michael', 'Sophia', 'William'],
    'Age': [25, 30, 28, 32, 27],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']
}

df = pd.DataFrame(data)
df

      Name  Age      City
0     John   25  New York
1     Emma   30    London
2  Michael   28     Paris
3   Sophia   32     Tokyo
4  William   27    Sydney


In [39]:
# Using .iloc[] for selecting specific rows and columns by integer positions
subset_1 = df.iloc[[0, 2, 4], [1, 2]]  # Select rows 0, 2, 4 and columns 1, 2

# Using .loc[] for selecting specific rows and columns by labels
subset_2 = df.loc[[1, 3], ['Name', 'City']]  # Select rows with index label 1, 3 and columns 'Name', 'City'

print("Subset using iloc:")
print(subset_1)

print("\nSubset using loc:")
print(subset_2)


Subset using iloc:
   Age      City
0   25  New York
2   28     Paris
4   27    Sydney

Subset using loc:
     Name    City
1    Emma  London
3  Sophia   Tokyo


In [40]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Michael', 'Sophia', 'William'],
    'Age': [25, 30, 28, 32, 27],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']
}

df = pd.DataFrame(data)

# Select rows where Age is greater than 27 using .loc[]
subset = df.loc[df['Age'] > 27]

print("Subset of rows where Age is greater than 27:")
print(subset)


Subset of rows where Age is greater than 27:
      Name  Age    City
1     Emma   30  London
2  Michael   28   Paris
3   Sophia   32   Tokyo


In [41]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Michael', 'Sophia', 'William'],
    'Age': [25, 30, 28, 32, 27],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']
}

df = pd.DataFrame(data)

# Select a subset of rows and columns using slicing with .iloc[]
subset = df.iloc[1:4, 0:2]  # Select rows 1 to 3 and columns 0 to 1

print("Subset using iloc with slicing:")
print(subset)


Subset using iloc with slicing:
      Name  Age
1     Emma   30
2  Michael   28
3   Sophia   32


In [51]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Michael', 'Sophia', 'William', "Asif"],
    'Age': [25, 30, 28, 32, 27,39],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney',"Lahore"],
    'Salary': [50000, 60000, 55000, 65000, 52000, 30000],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male',"Male"],
    'Rating': [4.5, 3.8, 4.2, 3.9, 4.7,3.7]
}

df = pd.DataFrame(data)
df


      Name  Age      City  Salary  Gender  Rating
0     John   25  New York   50000    Male     4.5
1     Emma   30    London   60000  Female     3.8
2  Michael   28     Paris   55000    Male     4.2
3   Sophia   32     Tokyo   65000  Female     3.9
4  William   27    Sydney   52000    Male     4.7
5     Asif   39    Lahore   30000    Male     3.7


In [54]:
# Display the first few rows using head()
print("First few rows:")
df.head(2)



First few rows:


   Name  Age      City  Salary  Gender  Rating
0  John   25  New York   50000    Male     4.5
1  Emma   30    London   60000  Female     3.8


In [55]:
# Display the last few rows using tail()
print("\nLast few rows:")
print(df.tail())




Last few rows:
      Name  Age    City  Salary  Gender  Rating
1     Emma   30  London   60000  Female     3.8
2  Michael   28   Paris   55000    Male     4.2
3   Sophia   32   Tokyo   65000  Female     3.9
4  William   27  Sydney   52000    Male     4.7
5     Asif   39  Lahore   30000    Male     3.7



In [56]:
# Get the number of unique values in each column using nunique()
print("\nNumber of unique values in each column:")
print(df.nunique())



Number of unique values in each column:
Name      6
Age       6
City      6
Salary    6
Gender    2
Rating    6
dtype: int64


In [57]:

# Get the value counts of a specific column using value_counts()
print("\nValue counts of 'Gender' column:")
print(df['Gender'].value_counts())







Value counts of 'Gender' column:
Male      4
Female    2
Name: Gender, dtype: int64


In [58]:

# Get the length of the DataFrame using len()
print("\nLength of the DataFrame:")
print(len(df))



Length of the DataFrame:
6


In [60]:
# Apply a custom function to a column using apply()
def add_bonus(salary):
    return salary + 1000

df['Bonus'] = df['Salary'].apply(add_bonus)
print("\nDataFrame with added 'Bonus' column:")
print(df)




DataFrame with added 'Bonus' column:
      Name  Age      City  Salary  Gender  Rating  Bonus
0     John   25  New York   50000    Male     4.5  51000
1     Emma   30    London   60000  Female     3.8  61000
2  Michael   28     Paris   55000    Male     4.2  56000
3   Sophia   32     Tokyo   65000  Female     3.9  66000
4  William   27    Sydney   52000    Male     4.7  53000
5     Asif   39    Lahore   30000    Male     3.7  31000


In [62]:
# Get column names using columns
print("\nColumn names:")
print(df.columns)



Column names:
Index(['Name', 'Age', 'City', 'Salary', 'Gender', 'Rating', 'Bonus'], dtype='object')


In [64]:

# Get the row index (row labels) using index
print("\nIndex names:")
print(df.index)



Index names:
RangeIndex(start=0, stop=6, step=1)


In [65]:

# Sort values based on a specific column using sort_values()
sorted_df = df.sort_values('Salary', ascending=False)
print("\nDataFrame sorted by 'Salary':")
print(sorted_df)




DataFrame sorted by 'Salary':
      Name  Age      City  Salary  Gender  Rating  Bonus
3   Sophia   32     Tokyo   65000  Female     3.9  66000
1     Emma   30    London   60000  Female     3.8  61000
2  Michael   28     Paris   55000    Male     4.2  56000
4  William   27    Sydney   52000    Male     4.7  53000
0     John   25  New York   50000    Male     4.5  51000
5     Asif   39    Lahore   30000    Male     3.7  31000


In [66]:

# Check for null values using isnull()
print("\nNull value check:")
print(df.isnull())



Null value check:
    Name    Age   City  Salary  Gender  Rating  Bonus
0  False  False  False   False   False   False  False
1  False  False  False   False   False   False  False
2  False  False  False   False   False   False  False
3  False  False  False   False   False   False  False
4  False  False  False   False   False   False  False
5  False  False  False   False   False   False  False


In [67]:

# Replace null values with zero using fillna() (this sample has no nulls, so nothing changes)
df_filled = df.fillna(0)
print("\nDataFrame with null values replaced with zero:")
print(df_filled)



DataFrame with null values replaced with zero:
      Name  Age      City  Salary  Gender  Rating  Bonus
0     John   25  New York   50000    Male     4.5  51000
1     Emma   30    London   60000  Female     3.8  61000
2  Michael   28     Paris   55000    Male     4.2  56000
3   Sophia   32     Tokyo   65000  Female     3.9  66000
4  William   27    Sydney   52000    Male     4.7  53000
5     Asif   39    Lahore   30000    Male     3.7  31000


In [69]:

# Calculate mean of a column using mean()
mean_salary = df['Salary'].mean()
print("\nMean Salary:", mean_salary)



Mean Salary: 52000.0


In [71]:

# Calculate median of a column using median()
median_salary = df['Salary'].median()
print("Median Salary:", median_salary)


Median Salary: 53500.0


In [73]:

# Calculate mode of a column using mode()
mode_gender = df['Gender'].mode()
print("Mode of 'Gender' column:")
print(mode_gender)


Mode of 'Gender' column:
0    Male
Name: Gender, dtype: object


In [75]:

# Pad values in a column using str.pad()
df['Padded_Name'] = df['Name'].str.pad(width=10, side='right', fillchar='*')
print("\nDataFrame with padded 'Name' column:")
print(df)



DataFrame with padded 'Name' column:
      Name  Age      City  Salary  Gender  Rating  Bonus Padded_Name
0     John   25  New York   50000    Male     4.5  51000  John******
1     Emma   30    London   60000  Female     3.8  61000  Emma******
2  Michael   28     Paris   55000    Male     4.2  56000  Michael***
3   Sophia   32     Tokyo   65000  Female     3.9  66000  Sophia****
4  William   27    Sydney   52000    Male     4.7  53000  William***
5     Asif   39    Lahore   30000    Male     3.7  31000  Asif******


In [77]:

# Copy rows and columns to a new DataFrame using copy()
df_copy = df.copy()
print("\nCopied DataFrame:")
print(df_copy)



Copied DataFrame:
      Name  Age      City  Salary  Gender  Rating  Bonus Padded_Name
0     John   25  New York   50000    Male     4.5  51000  John******
1     Emma   30    London   60000  Female     3.8  61000  Emma******
2  Michael   28     Paris   55000    Male     4.2  56000  Michael***
3   Sophia   32     Tokyo   65000  Female     3.9  66000  Sophia****
4  William   27    Sydney   52000    Male     4.7  53000  William***
5     Asif   39    Lahore   30000    Male     3.7  31000  Asif******


In [80]:

# Drop rows with null values using dropna() (no rows are dropped here because the sample has no nulls)
df_dropped_rows = df.dropna(axis=0)
print("\nDataFrame with dropped rows:")
print(df_dropped_rows)





DataFrame with dropped rows:
      Name  Age      City  Salary  Gender  Rating  Bonus Padded_Name
0     John   25  New York   50000    Male     4.5  51000  John******
1     Emma   30    London   60000  Female     3.8  61000  Emma******
2  Michael   28     Paris   55000    Male     4.2  56000  Michael***
3   Sophia   32     Tokyo   65000  Female     3.9  66000  Sophia****
4  William   27    Sydney   52000    Male     4.7  53000  William***
5     Asif   39    Lahore   30000    Male     3.7  31000  Asif******


In [81]:
# Drop columns with null values using dropna()
df_dropped_columns = df.dropna(axis=1)
print("\nDataFrame with dropped columns:")
print(df_dropped_columns)


DataFrame with dropped columns:
      Name  Age      City  Salary  Gender  Rating  Bonus Padded_Name
0     John   25  New York   50000    Male     4.5  51000  John******
1     Emma   30    London   60000  Female     3.8  61000  Emma******
2  Michael   28     Paris   55000    Male     4.2  56000  Michael***
3   Sophia   32     Tokyo   65000  Female     3.9  66000  Sophia****
4  William   27    Sydney   52000    Male     4.7  53000  William***
5     Asif   39    Lahore   30000    Male     3.7  31000  Asif******
