# 1. Introduction to Pandas

## What is Pandas?

Pandas is an open-source data manipulation and analysis library for Python. It provides high-level data structures like **Series** and **DataFrame** that make working with structured data (like tables and time series) simple and intuitive.

Pandas is especially useful for:
- Handling missing data
- Data cleaning and transformation
- Reshaping and pivoting datasets
- Time-series analysis
- Merging and joining datasets
- Input/output to CSV, Excel, SQL, etc.

## Why Pandas Was Built on Top of NumPy

While **NumPy** is excellent for numerical computations and efficiently handling arrays, it lacks the high-level functionality required for working with **tabular data** (like spreadsheets or SQL tables). 

### Key Differences:
- **Labeled Indexing**: Unlike NumPy, Pandas allows for **labeled indexing**, making it easier to access data by row and column labels rather than just numerical indices.
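- **Handling Missing Data**: Pandas has built-in methods for detecting, dropping, and filling **missing data** (`NaN`). NumPy can store `NaN` in floating-point arrays, but it offers only a handful of NaN-aware functions (such as `numpy.nanmean`) and no general tools for cleaning it.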
- **Tabular Data Operations**: Pandas provides a rich set of **data manipulation tools** like grouping, filtering, merging, and reshaping, which are cumbersome to implement using NumPy alone.
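
The first two differences can be seen in a small sketch. The sample values and the variable names (`values`, `ages`) are made up for illustration and are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration
values = np.array([28.0, np.nan, 35.0])

# NumPy: access is positional, and NaN propagates through plain reductions
second_value = values[1]
numpy_mean = values.mean()           # NaN, because one element is NaN
numpy_nanmean = np.nanmean(values)   # the NaN-aware variant must be requested explicitly

# Pandas: the same values with row labels; reductions skip NaN by default
ages = pd.Series(values, index=["John", "Anna", "Peter"])
peter_age = ages["Peter"]            # access by label instead of position
pandas_mean = ages.mean()

print(numpy_mean, numpy_nanmean, pandas_mean)
```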

**NumPy** is primarily focused on numerical computing and provides extensive functionality for:
- **Linear Algebra**: Operations like matrix multiplication, matrix inversion, and solving linear systems (`numpy.linalg`).
- **Calculus**: NumPy does not provide symbolic calculus, but it offers numerical approximations, such as `numpy.gradient()` for derivatives and `numpy.trapezoid()` (named `numpy.trapz()` before NumPy 2.0) for integrals.
- **Statistics**: NumPy provides basic statistical functions like mean, median, variance, standard deviation, etc.

**Pandas** is built on top of NumPy and primarily focuses on data manipulation and analysis rather than numerical computing. It does provide basic **statistical operations** (mean, sum, count, etc.) but relies on NumPy for more advanced mathematical operations.

In short, Pandas builds on the efficiency of NumPy while adding user-friendly, high-level functionality for manipulating structured data.




## Installing Pandas
To install Pandas, you can use pip:

```bash
pip install pandas
```

Or if you're using Anaconda:

```bash
conda install pandas
```

In [1]:
!pip install pandas

Collecting pandas
  Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)
Downloading pytz-2024.2-py2.py3-none-any.whl (508 kB)
Downloading tzdata-2024.2-py2.py3-none-any.whl (346 kB)


## Importing Pandas

Once installed, you can import the library in your Jupyter notebook or Python script using:
```python
# Importing pandas
import pandas as pd
```

The alias `pd` is a widely used convention in the data science community and keeps calls to Pandas functions short.


# 2. Data Structures

## Series: 1D Data Structure

A Pandas **Series** is a one-dimensional labeled array that can hold any data type (integers, strings, floats, etc.). It’s similar to a column in a DataFrame.

### Creating a Series
You can create a Series by passing a list, NumPy array, or dictionary.



In [2]:

import pandas as pd

# Creating a Series from a list
s = pd.Series([1, 2, 3, 4, 5])
print(s)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [3]:
# Creating a Series with custom index
s_custom_index = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s_custom_index)

a    10
b    20
c    30
dtype: int64
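
The other two inputs mentioned above, a dictionary and a NumPy array, work the same way. This is a minimal sketch; the names `s_dict` and `s_array` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
s_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a NumPy array: a default integer index (0, 1, 2, ...) is assigned
s_array = pd.Series(np.array([1.5, 2.5, 3.5]))

print(s_dict)
print(s_array)
```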


## DataFrame: 2D Data Structure

A Pandas **DataFrame** is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Think of it like a spreadsheet or SQL table.

### Creating a DataFrame
You can create a DataFrame from a variety of inputs, such as dictionaries, lists, or NumPy arrays.


In [40]:
# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32]}
df = pd.DataFrame(data)
print(df)
print()

# Creating a DataFrame from a list of lists
data_list = [['John', 28], ['Anna', 24], ['Peter', 35], ['Linda', 32]]
df_list = pd.DataFrame(data_list, columns=['Name', 'Age'])
print(df_list)


    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32

    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32
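
A DataFrame can also be built from a two-dimensional NumPy array. The sketch below uses made-up numbers and hypothetical column names (`x`, `y`, `z`).

```python
import numpy as np
import pandas as pd

# Each row of the array becomes a row of the DataFrame
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# Column labels are supplied separately; rows get a default integer index
df_array = pd.DataFrame(arr, columns=["x", "y", "z"])
print(df_array)
```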


### Viewing the DataFrame
You can use various methods to view the data in a DataFrame.


In [6]:
# Viewing the first few rows
print(df.head())
print()

# Viewing the last few rows
print(df.tail())



    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32

    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32
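
Both methods return five rows by default, which is why the four-row example above prints in full. Passing a row count limits the result, as in this small sketch (the name `df_people` is used here only to keep the example self-contained).

```python
import pandas as pd

df_people = pd.DataFrame({"Name": ["John", "Anna", "Peter", "Linda"],
                          "Age": [28, 24, 35, 32]})

# Ask for an explicit number of rows instead of the default of five
first_two = df_people.head(2)
last_one = df_people.tail(1)

print(first_two)
print(last_one)
```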


# 3. Viewing and Inspecting Data

Once you have created a DataFrame, it’s essential to understand how to inspect it to get a sense of its structure, content, and data types.



## Checking the DataFrame Structure

- Use `info()` to get a summary of the DataFrame, including the number of rows, columns, non-null entries, and data types.



In [11]:

# Get a summary of the DataFrame
# info() prints the summary itself and returns None, so no print() is needed
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    4 non-null      object
 1   Age     4 non-null      int64
dtypes: int64(1), object(1)
memory usage: 196.0+ bytes




## Descriptive Statistics

- Use `describe()` to generate summary statistics for numeric columns, such as count, mean, standard deviation, min, max, and percentiles.



In [12]:

# Get summary statistics for numeric columns
print(df.describe())



             Age
count   4.000000
mean   29.750000
std     4.787136
min    24.000000
25%    27.000000
50%    30.000000
75%    32.750000
max    35.000000
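
By default `describe()` skips non-numeric columns such as `Name`. A possible way to summarize every column is the `include="all"` argument, sketched below on a copy of the same data (the name `df_people` is an assumption for this example).

```python
import pandas as pd

df_people = pd.DataFrame({"Name": ["John", "Anna", "Peter", "Linda"],
                          "Age": [28, 24, 35, 32]})

# include="all" adds count/unique/top/freq for non-numeric columns;
# statistics that do not apply to a column are reported as NaN
summary = df_people.describe(include="all")
print(summary)
```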



## Shape and Dimensions

- Use `shape` to get the dimensions of the DataFrame (rows, columns).
- Use `len()` to find the number of rows.

In [13]:

# Get the shape of the DataFrame
print(df.shape)

# Get the number of rows
print(len(df))


(4, 2)
4


## Data Types of Each Column

- Use `dtypes` to check the data types of all the columns in the DataFrame.


In [14]:
# Check data types of columns
print(df.dtypes)

Name    object
Age      int64
dtype: object


# 4. Selecting and Indexing Data

Selecting and indexing data from a DataFrame is a key task in data analysis. Pandas provides several ways to do this efficiently.


In [21]:

# Selecting Columns

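# You can select a column from a DataFrame by passing the column name in brackets.
# Dot notation also works, but only when the name is a valid Python identifier
# and does not clash with an existing DataFrame attribute or method.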

# Selecting a single column
df['Age']

# Alternatively
df.Age



0    28
1    24
2    35
3    32
Name: Age, dtype: int64
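
Dot notation is convenient but not always safe. The sketch below uses a hypothetical DataFrame whose column names were picked to show two cases where only brackets select the column.

```python
import pandas as pd

# Hypothetical column names chosen to illustrate the pitfalls
df_counts = pd.DataFrame({"count": [3, 5],
                          "first name": ["John", "Anna"]})

# Brackets always select the column
count_column = df_counts["count"]
first_names = df_counts["first name"]   # a name with a space cannot use dot notation

# Dot notation resolves to the DataFrame.count method, not the column
count_attribute = df_counts.count

print(type(count_column).__name__)
print(callable(count_attribute))
```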

In [23]:

# Selecting Multiple Columns
# To select multiple columns, you can pass a list of column names.

# Selecting multiple columns
df[['Age', 'Name']]



   Age   Name
0   28   John
1   24   Anna
2   35  Peter
3   32  Linda


In [25]:
# Selecting Rows by Label (`loc`)
# The `loc[]` indexer selects rows (or subsets) by index label.

# Selecting rows by index label
df.loc[0]




Name    John
Age       28
Name: 0, dtype: object

In [26]:
# Selecting a range of rows by label (the end label is included)
df.loc[0:5]

    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32


In [27]:

# Selecting Rows by Position (`iloc`)
# The `iloc[]` indexer selects rows by integer position.

# Selecting rows by position
df.iloc[0]


Name    John
Age       28
Name: 0, dtype: object

In [28]:

# Selecting a range of rows by position (the end position is excluded)
df.iloc[0:5]


    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32
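
With the default integer index, `loc` and `iloc` look interchangeable. The difference shows up once the index holds other labels. This sketch assumes a hypothetical string index (`a` to `d`) on the same data.

```python
import pandas as pd

# Same data as above, but with string labels as the index (an assumption for this example)
df_people = pd.DataFrame({"Name": ["John", "Anna", "Peter", "Linda"],
                          "Age": [28, 24, 35, 32]},
                         index=["a", "b", "c", "d"])

# loc slices by label and includes the end label: rows a, b, c
by_label = df_people.loc["a":"c"]

# iloc slices by position and excludes the end position: rows 0 and 1
by_position = df_people.iloc[0:2]

# Both indexers also accept a row and a column selector
anna_age = df_people.loc["b", "Age"]

print(by_label)
print(by_position)
```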


In [29]:

# Boolean Indexing
# You can filter rows based on conditions using boolean indexing.

# Selecting rows based on a condition
df[df['Age'] > 30]



    Name  Age
2  Peter   35
3  Linda   32
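
Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A small sketch on the same data, with illustrative thresholds:

```python
import pandas as pd

df_people = pd.DataFrame({"Name": ["John", "Anna", "Peter", "Linda"],
                          "Age": [28, 24, 35, 32]})

# Older than 25 AND not named Peter
over_25_not_peter = df_people[(df_people["Age"] > 25) & (df_people["Name"] != "Peter")]

# Younger than 25 OR older than 33
youngest_or_oldest = df_people[(df_people["Age"] < 25) | (df_people["Age"] > 33)]

# NOT older than 30
not_over_30 = df_people[~(df_people["Age"] > 30)]

print(over_25_not_peter)
```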


In [41]:
# Setting and Resetting Index
# You can set a column as the index or reset it back to default.

# Set a column as the index
df.set_index('Age', inplace=True)
print(df)


      Name
Age       
28    John
24    Anna
35   Peter
32   Linda


In [42]:
# Reset index
df.reset_index(inplace=True)
print(df)


   Age   Name
0   28   John
1   24   Anna
2   35  Peter
3   32  Linda
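
Both methods also work without `inplace=True`, in which case they return a new DataFrame and leave the original untouched. The sketch below shows that variant, plus `drop=True` for discarding the old index instead of restoring it as a column. The variable names are chosen here for illustration.

```python
import pandas as pd

df_people = pd.DataFrame({"Name": ["John", "Anna", "Peter", "Linda"],
                          "Age": [28, 24, 35, 32]})

# Returns a new DataFrame indexed by Name; df_people itself is unchanged
indexed = df_people.set_index("Name")
anna_age = indexed.loc["Anna", "Age"]

# reset_index() moves the index back into a column
restored = indexed.reset_index()

# drop=True discards the index instead of keeping it as a column
without_names = indexed.reset_index(drop=True)

print(indexed)
print(without_names)
```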


# 5. Handling Missing Data

Missing data is common in real-world datasets, and Pandas provides several tools to handle it efficiently.

## Detecting Missing Data

You can use `isna()` or `isnull()` to detect missing values, which will return a DataFrame of Boolean values indicating `True` for missing data.



In [43]:
# Create a DataFrame with missing values
import pandas as pd
import numpy as np

data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, np.nan, 35, 32],
        'Score': [85, 90, np.nan, 88]}
df = pd.DataFrame(data)
print(df)

# Detect missing values
df.isna()

# Count total missing values per column
df.isna().sum()


    Name   Age  Score
0   John  28.0   85.0
1   Anna   NaN   90.0
2  Peter  35.0    NaN
3  Linda  32.0   88.0


Name     0
Age      1
Score    1
dtype: int64
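
The Boolean result of `isna()` can also be used to locate the affected rows. A minimal sketch on the same data follows; `df_scores` is a name chosen for this example.

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({"Name": ["John", "Anna", "Peter", "Linda"],
                          "Age": [28, np.nan, 35, 32],
                          "Score": [85, 90, np.nan, 88]})

# Rows that contain at least one missing value
rows_with_missing = df_scores[df_scores.isna().any(axis=1)]

# Total number of missing cells in the whole DataFrame
total_missing = int(df_scores.isna().sum().sum())

print(rows_with_missing)
print(total_missing)
```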

## Dropping Missing Data

Use `dropna()` to remove rows or columns with missing values.



In [44]:
# Drop rows with missing values
df.dropna()

    Name   Age  Score
0   John  28.0   85.0
3  Linda  32.0   88.0


In [45]:
# Drop columns with missing values
df.dropna(axis=1)

    Name
0   John
1   Anna
2  Peter
3  Linda
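
`dropna()` accepts arguments that make it less aggressive. The sketch below shows three of them (`subset`, `thresh`, and `how`) on the same data. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({"Name": ["John", "Anna", "Peter", "Linda"],
                          "Age": [28, np.nan, 35, 32],
                          "Score": [85, 90, np.nan, 88]})

# Only consider the Age column when deciding which rows to drop
age_required = df_scores.dropna(subset=["Age"])

# Keep rows that have at least 3 non-missing values
mostly_complete = df_scores.dropna(thresh=3)

# Drop a row only if every value in it is missing (none here)
not_all_missing = df_scores.dropna(how="all")

print(age_required)
```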



## Filling Missing Data

Use `fillna()` to replace missing values with a specified value, or `ffill()`/`bfill()` to propagate the previous or next valid value.

In [46]:
# Fill missing values with a specific value
df.fillna(value=0)


    Name   Age  Score
0   John  28.0   85.0
1   Anna   0.0   90.0
2  Peter  35.0    0.0
3  Linda  32.0   88.0


In [47]:
# Forward fill (fill missing values with the previous row's value)
# fillna(method='ffill') is deprecated since pandas 2.1; use ffill() instead
df.ffill()



    Name   Age  Score
0   John  28.0   85.0
1   Anna  28.0   90.0
2  Peter  35.0   90.0
3  Linda  32.0   88.0


In [48]:

# Backward fill (fill missing values with the next row's value)
# fillna(method='bfill') is deprecated since pandas 2.1; use bfill() instead
df.bfill()



    Name   Age  Score
0   John  28.0   85.0
1   Anna  35.0   90.0
2  Peter  35.0   88.0
3  Linda  32.0   88.0
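
A single constant is rarely the best replacement for every column. One common alternative is to pass `fillna()` a dictionary that maps each column to its own fill value. The choice of mean for `Age` and median for `Score` below is only an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({"Name": ["John", "Anna", "Peter", "Linda"],
                          "Age": [28, np.nan, 35, 32],
                          "Score": [85, 90, np.nan, 88]})

# Per-column fill values: mean() and median() ignore NaN by default
fill_values = {"Age": df_scores["Age"].mean(),
               "Score": df_scores["Score"].median()}

filled = df_scores.fillna(fill_values)
print(filled)
```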



## Replacing Missing Data

Use `replace()` to replace specific values (including `NaN` values) in the DataFrame.

In [49]:

# Replace NaN values with a specific value
df.replace(to_replace=np.nan, value=0)

    Name   Age  Score
0   John  28.0   85.0
1   Anna   0.0   90.0
2  Peter  35.0    0.0
3  Linda  32.0   88.0
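
`replace()` is also useful in the opposite direction: turning placeholder values into `NaN` so that the methods above can handle them. The sketch below assumes a hypothetical dataset in which `-1` was used to mark missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where -1 stands for "unknown"
raw = pd.DataFrame({"Age": [28, -1, 35],
                    "Score": [85, 90, -1]})

# Convert the placeholder to NaN; the integer columns become float64
cleaned = raw.replace(-1, np.nan)

print(cleaned)
print(cleaned.isna().sum())
```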
